Introduction

With the rapid development of smart cities, safe and reliable intelligent transportation systems have become an urgent need. Intelligent transportation systems can provide real-time traffic information and signage to help drivers avoid accidents and dangerous situations. In addition, the development of autonomous driving technology will further reduce accidents caused by human driving in the future. As a branch of object detection, traffic sign detection is an indispensable part of autonomous driving technology in intelligent transportation systems1,2,3,4. It has great practical value for ensuring safe vehicle driving, alleviating traffic congestion, and building smart cities5,6,7. Despite significant progress in object detection techniques, traffic sign detection remains a challenging task for several reasons. First, traffic signs are often small and may appear blurry or dim, particularly in low-light conditions or adverse weather, making it difficult for detection models to extract sufficient features for accurate recognition. Second, traffic signs are frequently surrounded by cluttered backgrounds, such as trees, buildings, or other road elements, which can confuse detection models and lead to false positives. Third, achieving a balance between high detection accuracy and computational efficiency is a persistent challenge, especially for real-time applications in resource-constrained environments, such as embedded systems in vehicles.

Nowadays, traffic sign detection algorithms8 are mainly divided into one-stage and two-stage approaches, which work on different principles. A two-stage detector generally classifies candidate regions, whereas a one-stage detector uses a regression method that directly produces detection results for the input image. The R-CNN series9,10,11 is the classic representative of two-stage detectors and has achieved very good detection results. However, traffic sign recognition requires the network to have both high detection accuracy and fast detection speed12,13,14,15; the slow detection speed of this series of algorithms makes them unsuitable for detecting traffic signs. The one-stage algorithms represented by SSD16 and the YOLO series17,18,19,20,21,22,23 have faster detection speeds and lower model complexity, allowing real-time target detection with high detection speed and good accuracy24,25. Compared with the R-CNN series, the YOLO series detects objects more efficiently through single-stage detection. However, in some complex scenes or small target detection, the YOLO series may suffer a certain performance loss26,27,28.

Compared to high-precision detectors, YOLOv4-tiny may have slightly lower detection accuracy, but it performs well in terms of training efficiency and detection speed and is suitable for applications requiring real-time target detection, such as real-time video analysis, traffic monitoring, and face recognition29,30,31,32. It can perform object detection in images or videos in a relatively short period of time, providing immediate feedback. Thus, this paper conducts in-depth research on traffic sign detection using YOLOv4-tiny and identifies the following three problems: (1) In YOLOv4-tiny, the backbone network struggles to automatically prioritize important features and suppress irrelevant ones, leading to a gradual decline in the model’s discrimination ability when interference persists. (2) YOLOv4-tiny mainly uses single-branch ordinary convolution with a fixed kernel size, leading to a uniform receptive field. This restricts the extracted features, making complex detection tasks difficult due to the absence of multi-scale capabilities. (3) When using YOLOv4-tiny to identify traffic signs, the accuracy is hindered by the relatively small size of the signs, low resolution, unclear features, and other objective factors. This often results in missed detections and false positives, reducing the effectiveness of small target recognition.

To address the aforementioned issues, we introduce MASG-Net, an end-to-end lightweight detection approach grounded in multi-scale awareness and semantic guidance. First, we introduce an ultra-lightweight channel attention mechanism into MobileNetV3 to create a novel E-block structure. Based on this structure, we design E-mobilenet, a lightweight backbone network that significantly improves feature extraction while reducing the number of parameters, making it suitable for real-time applications. To address the limitations of small feature maps in capturing sufficient information for small targets, we propose the multi-scale dilated convolution spatial pyramid pooling (MDSPP) module. This module expands the receptive field of the feature map, enabling the network to capture global and local context information more effectively. Further, we introduce the semantic information guidance (SIG) module, which leverages deep semantic information to guide the shallow feature layer. This design enhances the distinction between traffic signs and their backgrounds, reducing the negative impact of cluttered environments and improving detection performance for small and blurry signs. The ablation study indicates that the integrated application of E-mobilenet, MDSPP and SIG outperforms their independent usage. Compared with many mainstream traffic sign detection algorithms, the main innovations of this paper are as follows:

  1. A new backbone feature extraction network, E-mobilenet, is designed by enhancing MobileNetV3’s lightweight cell structure with a channel attention mechanism. This backbone replaces YOLOv4-tiny’s backbone, improving feature extraction efficiency while maintaining a lightweight design.

  2. The proposed MDSPP module incorporates multi-scale dilated convolutions to provide rich multi-scale receptive field information. This design addresses the problem of information loss caused by large-scale pooling operations, enhancing the network’s ability to capture global context.

  3. The introduction of the SIG module enhances the detection of small traffic signs by leveraging deep semantic information to guide the shallow feature layer. This module improves the model’s resistance to cluttered backgrounds and preserves critical semantic information for small target detection.

The overall structure of this paper is as follows. We first introduce the research work related to this experiment, and then detail the innovations of this paper in the Methodology section. The “Experiments” section provides comparative experiments, as well as qualitative and quantitative analysis of the test results. Finally, the work of this paper is summarized.

Related work

Two-stage detectors

Two-stage detectors first generate target candidate boxes with a region proposal network (RPN), and then classify and regress these candidate boxes to obtain the final detection results.

In 2014, Girshick proposed R-CNN, which surpassed Yann LeCun’s contemporaneous end-to-end OverFeat33 in terms of performance. In 2015, SPP-Net34 added a spatial pyramid pooling structure35 between the convolutional layers and the fully connected layers, which not only preserved performance but also greatly improved detection speed. In 2015, the Fast R-CNN algorithm was proposed; it scales each feature matrix to a 7\(\times\)7 feature map through an ROI-Pooling36 layer and then flattens the feature map through a series of fully connected layers to obtain the prediction result. In addition, Kaiming He and Girshick of Microsoft Research proposed the Faster R-CNN algorithm, which introduced the RPN; it shares the feature information extracted by a convolutional neural network throughout the network, saving computing cost and solving the slow generation of positive and negative candidate boxes in Fast R-CNN37. The Mask R-CNN38 algorithm added a fully convolutional network (FCN)39 branch on top of bounding box recognition for semantic mask prediction.

The main difference among two-stage detectors lies in the specific structure and optimization of the RPN and the target classification and regression networks40,41. Two-stage detectors usually have high detection accuracy, but their detection speed is relatively slow42. Thus, they are suitable for scenarios that require high detection accuracy, such as medical image analysis and security checks.

One-stage detectors

A one-stage detector extracts high-level features of the image through a convolutional network and then fuses the feature maps to complete object detection and classification43. Currently, one-stage detectors mainly include the YOLO series, SSD, RefineDet44, etc.

The SSD algorithm trains the model with data augmentation and a weighted sum of localization and confidence losses; it is faster, but training is difficult, resulting in lower accuracy. Shifeng Zhang et al. proposed the RefineDet detection method, which uses two-step regression to improve detection accuracy and realizes end-to-end multi-task training. The YOLO series includes multiple one-stage detection algorithms. YOLOv117 is the first algorithm in the YOLO series, and YOLOv420 combined various performance-enhancing modules, making it one of the models with better detection performance in the YOLO series. YOLOX21 performs classification and regression in separate branches, which increases the complexity of the model. YOLOv622 introduced the SIoU45 box regression loss to improve training speed and regression accuracy. The YOLOv723 network adopts a feature pyramid network and an improved backbone network to achieve more accurate and faster target detection, but with higher requirements on device performance.

Fig. 1. Network architecture of the MASG-Net.

YOLO-NAS46 is the latest algorithm in the YOLO series; it employs neural architecture search to achieve a balance between accuracy and computational complexity. However, YOLO-NAS is still in the research stage and has not been widely applied and verified. The YOLO-tiny series comprises the lightweight versions of YOLO, featuring smaller model sizes and faster detection speeds, making them suitable for real-time detection on mobile terminals and other scenarios with limited computing resources47,48,49,50. The YOLO-tiny series includes three versions: YOLOv3-tiny51, YOLOv4-tiny52, and YOLOv7-tiny23. Although YOLOv3-tiny has a very fast inference speed, its detection accuracy is relatively low. YOLOv7-tiny is the latest member of the YOLO-tiny series and introduces several improvements over YOLOv7. These changes optimize its detection speed and model size, but may also cause a slight loss in detection accuracy; for example, when the intersection-over-union threshold is higher, the detection accuracy of the YOLOv7-tiny network drops. Therefore, among these three models, YOLOv4-tiny is the most mature lightweight model, offering both a small model size and high detection accuracy.

Attention mechanism

The attention mechanism simulates human visual attention. In a deep learning model, it automatically learns to assign different attention weights to different parts of the input, thereby improving the model’s ability to understand and represent the input53,54,55. SENet56 attends to channel information using adaptive weights; it introduces only relatively small fully connected layers, so the number of additional parameters is small. CBAM57 extends this idea to obtain useful information from both the spatial and channel dimensions. However, introducing the CBAM module increases network complexity, raising the computation and memory requirements during training and thus the training time. The coordinate attention (CA) mechanism58 embeds positional information along the spatial dimensions into channel attention. ECANet59 is a relatively efficient channel attention mechanism, which is suitable for models with high detection efficiency requirements. Even for scenes with larger feature maps, ECANet remains efficient because its local cross-channel interaction depends only on the number of channels.

Shuffle attention60 combines channel attention and spatial attention; it improves feature representation capability and network performance by grouping the input feature map, computing attention within each group, and applying it back to the features. Efficient local attention (ELA)61 is a lightweight attention mechanism that uses 1D convolution and group-normalized feature enhancement. The essence of scaled dot-product attention62 is to quantify the similarity between the query and the key through a dot product, assign attention weights through softmax, and compute a weighted sum of the value vectors according to these weights, forming a context-sensitive representation of each position in the input sequence.

In this article, we introduce the ECANet structure into the MobileNetV3 cell structure to form the new E-mobilenet backbone network. Unlike the SE block, our module uses a lightweight design that minimizes computational overhead, making it more suitable for real-time applications. Compared to the ECA module, which focuses on local channel interactions, our module incorporates a broader context to enhance feature extraction for small and blurry traffic signs.

Methodology

Overall structure

The network structure of the proposed MASG-Net is shown in Fig. 1, and its improvements mainly include three points. First, we propose a new backbone feature extraction network, E-mobilenet, which is based on MobileNetV363 and ECANet. Second, we propose a new multi-scale dilated convolution spatial pyramid pooling structure. Finally, we introduce a semantic information guidance (SIG) module to enhance tiny sign detection by leveraging deep semantic information to guide the shallow feature layer.

We find that the parameters of the backbone network CSPdarknet53_tiny account for the majority of the parameters of YOLOv4-tiny. Therefore, in order to reduce the model size of YOLOv4-tiny, it is necessary to reduce the number of parameters of its backbone network CSPdarknet53_tiny. MobileNetV3 is an ultra-lightweight CNN model designed for mobile devices and has a small model size. Thus, we first replace the CSPdarknet53_tiny backbone of YOLOv4-tiny with MobileNetV3.
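As a quick sanity check of this kind of parameter breakdown, the per-module parameter count of any PyTorch model can be listed as follows; this is a generic sketch rather than the authors' code, and the toy model is only illustrative.

```python
import torch.nn as nn

def parameters_per_module(model: nn.Module) -> dict:
    """Number of trainable parameters in each top-level child module."""
    return {name: sum(p.numel() for p in child.parameters() if p.requires_grad)
            for name, child in model.named_children()}

# Illustrative toy model; applying the same call to YOLOv4-tiny would show that
# the CSPdarknet53_tiny backbone dominates the total parameter count.
toy = nn.Sequential(nn.Conv2d(3, 32, 3), nn.BatchNorm2d(32), nn.Conv2d(32, 64, 3))
print(parameters_per_module(toy))  # {'0': 896, '1': 64, '2': 18496}
print("total:", sum(parameters_per_module(toy).values()))
```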

Table 1 E-mobilenet network structure.

Then, we integrate the ECANet attention mechanism into the MobileNetV3 model. ECANet overcomes the contradiction between performance and complexity: it learns effective channel attention in a more efficient way by employing local cross-channel interactions, which significantly reduces the complexity of the network model while maintaining performance. To improve the detection accuracy of the model for small targets, we add the MDSPP module after E-mobilenet. Because the deep feature output by the backbone network has a small spatial size and therefore contains limited information on small targets, the MDSPP effectively enlarges the receptive field to capture more global information, enabling the network to extract richer features. Taking a step further, we propose the SIG module, which enhances the semantic information of the shallow feature layer, improving the distinction between the target and the background and reducing the negative impact of complex backgrounds on detection performance. This design also retains important semantic information of small traffic sign targets. MASG-Net is therefore well suited for deployment on resource-limited vehicle terminal devices for traffic sign recognition, owing to its high detection accuracy and real-time performance.

Fig. 2. Improved cell structure E-block.

E-mobilenet structure

To extract features from input images more efficiently while keeping model complexity low, we design an ultra-lightweight backbone network structure, E-mobilenet, as shown in Table 1.

The third column represents the expansion ratio by which the number of channels is first increased and then reduced in the inverted residual structure of the E-block. The fourth column represents the number of channels in the feature layer output after the operation in the second column. The sixth column, NL, represents the type of non-linear activation function, where HS and RE denote the h-swish and ReLU6 activation functions, respectively. The h-swish function has no upper bound, has a lower bound, and is smooth and non-monotonic. The seventh column, parameter s, represents the stride used for each convolution or E-block structure. The definitions of the ReLU6 and h-swish activation functions are as follows:

$$\begin{aligned} ReLU6\left( x \right) =min\left( max\left( x,0 \right) ,6 \right) \end{aligned}$$
(1)
$$\begin{aligned} h-swish\left( x \right) = x\cdot \frac{ReLU6\left( x+ 3 \right) }{6} \end{aligned}$$
(2)
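For reference, the two activation functions in Eqs. (1) and (2) can be written directly in PyTorch; this is a minimal sketch of the definitions above (PyTorch also provides the equivalent built-ins nn.ReLU6 and nn.Hardswish).

```python
import torch

def relu6(x: torch.Tensor) -> torch.Tensor:
    # Eq. (1): ReLU6(x) = min(max(x, 0), 6)
    return torch.clamp(x, min=0.0, max=6.0)

def h_swish(x: torch.Tensor) -> torch.Tensor:
    # Eq. (2): h-swish(x) = x * ReLU6(x + 3) / 6
    return x * relu6(x + 3.0) / 6.0

x = torch.linspace(-6.0, 6.0, steps=7)
print(relu6(x))    # tensor([0., 0., 0., 0., 2., 4., 6.])
print(h_swish(x))  # smooth and non-monotonic near the origin, approaches x for large x
```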

The E-block, which is shown in Fig. 2, adopts an inverted residual structure with a linear bottleneck and includes three convolution layers: a \(1\times 1\) convolution to expand the dimension, a \(3\times 3\) convolution to extract features, and a \(1\times 1\) convolution to project back to a lower dimension. Moreover, ECANet is integrated into the E-block to improve its performance.

The attention mechanism in the original MobileNetV3 is implemented in the same way as SENet, which employs two fully connected layers to capture nonlinear cross-channel interactions. However, this mechanism has two defects: it cannot capture attention in the spatial dimension, and the two fully connected layers increase the number of network parameters. Therefore, we introduce ECANet into the MobileNetV3 cell structure to form a new ultra-lightweight cell structure, the E-block. ECANet removes the two fully connected layers and instead obtains the weight of each channel directly through a one-dimensional convolution, which makes the weight learning process simpler and more direct. The kernel size of this one-dimensional convolution is obtained by adaptive calculation and represents the coverage of local cross-channel interaction. Weight sharing means that every group of channels uses exactly the same convolution weights, which greatly reduces the number of parameters. Specifically, the number of parameters is reduced from the original SENet’s \(2C^{2}/r\) to k, where C is the number of channels, r is the dimensionality reduction hyperparameter, and k is the convolution kernel size. In addition, given the channel dimension C, k can be adaptively determined as:

$$\begin{aligned} k=\left| \frac{\log _{2}{\left( C \right) } }{\gamma } + \frac{b}{\gamma } \right| _{odd} \end{aligned}$$
(3)
where \(\left| \cdot \right| _{odd}\) denotes rounding to the nearest odd number, \(\gamma\) is set to 2, and b is set to 1.

Fig. 3. The structure diagram of the proposed MDSPP module.
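To make the E-block concrete, a minimal PyTorch-style sketch of the ECA operation used inside it is given below, with the kernel size chosen adaptively as in Eq. (3) (\(\gamma\) = 2, b = 1). It illustrates the mechanism only and is not the authors' released implementation; the example channel count is arbitrary.

```python
import math
import torch
import torch.nn as nn

class ECALayer(nn.Module):
    """Efficient channel attention: global average pooling followed by a 1D convolution
    over the channel dimension, with a sigmoid producing per-channel weights."""
    def __init__(self, channels: int, gamma: int = 2, b: int = 1):
        super().__init__()
        # Eq. (3): k = |log2(C)/gamma + b/gamma| rounded to the nearest odd number
        t = int(abs(math.log2(channels) / gamma + b / gamma))
        k = t if t % 2 == 1 else t + 1
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, _, _ = x.shape
        y = self.avg_pool(x).view(n, 1, c)             # treat channels as a 1D sequence
        y = self.sigmoid(self.conv(y)).view(n, c, 1, 1)
        return x * y                                    # re-weight each channel

# Inside the E-block this layer replaces the SE block of MobileNetV3, e.g.:
feat = torch.randn(1, 64, 26, 26)
print(ECALayer(channels=64)(feat).shape)                # torch.Size([1, 64, 26, 26])
```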

Multi-scale dilated convolution spatial pyramid pooling

The backbone network of YOLOv4-tiny primarily relies on single-branch ordinary convolution with a fixed kernel size, leading to a deterministic and uniform receptive field. This limitation results in extracted features with limited information, making it challenging to handle complex detection tasks due to the lack of multi-scale capabilities.

Based on the SPP structure, we propose a multi-scale dilated convolution spatial pyramid pooling (MDSPP) structure. The SPP structure is essentially multi-scale pooling, which extracts pooling information at multiple scales from the same feature layer. Since the receptive field of the input feature layer with respect to the original image is relatively fixed, simply fusing multi-scale pooling information does not enlarge the receptive field noticeably. In order to enrich the receptive field scales and improve the feature extraction capability of the network, dilated convolutions with different dilation rates are introduced in each branch of the SPP structure. This structure is designed to increase the receptive field of the feature map, helping the network capture context information at more scales. The specific structure is shown in Fig. 3.

Fig. 4. The structure diagram of the proposed SIG module.

MDSPP first divides the input feature layer into three main branches for dilated convolutions at three different scales, each with a 3 \(\times\) 3 convolution kernel but with dilation rates of 1, 3 and 5, respectively. By using dilated convolution, the MDSPP module avoids the need for multiple large kernels, which would increase the number of trainable parameters. This design aligns with the ultra-lightweight nature of MASG-Net, ensuring that the model remains compact and suitable for real-time applications in resource-constrained environments. The features extracted with different receptive fields are then passed through the feature pyramid pooling layer, where the pooling pyramid is likewise divided into three branches appended after the dilated convolutions, with maximum pooling window sizes of 5, 9 and 13, respectively. The outputs of these three branches are then concatenated to obtain the final output of the structure. Built on the pyramid pooling structure, MDSPP forms a feature extraction structure that greatly enlarges the receptive field. It can effectively enhance the feature extraction capability of the network without greatly increasing the complexity of the network model.
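To make the structure concrete, the sketch below follows the description above: three 3 \(\times\) 3 dilated convolution branches with dilation rates 1, 3 and 5, each followed by stride-1 max pooling with windows 5, 9 and 13, and a concatenation of the three branch outputs. The branch width and the final 1 \(\times\) 1 fusion convolution are assumptions made for illustration, not details taken from the paper.

```python
import torch
import torch.nn as nn

class MDSPP(nn.Module):
    """Multi-scale dilated convolution spatial pyramid pooling (sketch)."""
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        branch_channels = in_channels // 2  # assumed bottleneck width
        self.branches = nn.ModuleList()
        for dilation, pool in zip((1, 3, 5), (5, 9, 13)):
            self.branches.append(nn.Sequential(
                # 3x3 dilated conv; padding = dilation keeps the spatial size
                nn.Conv2d(in_channels, branch_channels, 3, padding=dilation,
                          dilation=dilation, bias=False),
                nn.BatchNorm2d(branch_channels),
                nn.LeakyReLU(0.1, inplace=True),
                # stride-1 max pooling with windows 5 / 9 / 13, size-preserving
                nn.MaxPool2d(kernel_size=pool, stride=1, padding=pool // 2),
            ))
        # assumed 1x1 convolution fusing the concatenated branch outputs
        self.fuse = nn.Conv2d(3 * branch_channels, out_channels, 1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fuse(torch.cat([branch(x) for branch in self.branches], dim=1))

# Example: a deep backbone feature map, e.g. (1, 256, 13, 13)
y = MDSPP(256, 256)(torch.randn(1, 256, 13, 13))
print(y.shape)  # torch.Size([1, 256, 13, 13])
```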

Semantic information guidance

When using YOLOv4-tiny to identify traffic signs, the accuracy is hindered by the relatively small size of the signs, low resolution, unclear features, and other objective factors. This often results in missed detections and false positives, reducing the effectiveness of small target recognition. Drawing from recent research on defect detection64, we propose a SIG module that utilizes deep feature layers to guide shallow feature layers. By refining the semantics of the shallow feature layer, the influence of complex backgrounds on detection performance is reduced, and the semantic details of small traffic sign targets are effectively preserved. Additionally, traffic signs are typically small targets, and details about small targets are richer in shallow features due to higher spatial resolution in the shallow layer. Infusing semantic information into these shallow features can enhance and highlight the information representation of these small targets. For instance, the distinct shape and color of a traffic sign can be accentuated by the crucial semantic details from high-level features, aiding the network in accurately identifying small targets during detection.

The detailed structure of the SIG is depicted in Fig. 4, and its workflow proceeds as follows. Initially, the deep output features undergo max pooling and average pooling, and the resulting feature is represented as

$$\begin{aligned} f^{'} = cat(\varphi _{MP}(f_{n}),\varphi _{AP}(f_{n})) \end{aligned}$$
(4)

where cat denotes the concatenation operation, and \(\varphi _{MP}\) and \(\varphi _{AP}\) refer to the max pooling and average pooling operations, respectively. Combining the features of the two branches encompasses more detailed global information. Next, a CBS module with a 1 \(\times\) 1 convolution adjusts the channel count, and a CBS module with a 3 \(\times\) 3 convolution enhances the local context. The residual edge is then added element-wise to the deep feature map:

$$\begin{aligned} f^{''}= \Phi (\xi \left\{ \beta \left\{ Conv_{3\times 3}\left\{ \xi \left\{ \beta [Conv_{1\times 1}(f^{'})]\right\} \right\} \right\} \right\} +f_{n}) \end{aligned}$$
(5)

where \(\beta\) is the batch normalization (BN), \(\xi\) represents the SiLU activation function, and \(\Phi\) is the element-wise sum operation. The features then pass through a CBS block with a 1 \(\times\) 1 convolution and a multi-spectral channel attention (MSCA)65 to obtain the deep feature map’s weight, which is activated by a modified ReLU function:

$$\begin{aligned} Y_{n}= \tau \left\{ GAP \left\{ \sigma _{1}\left\{ \beta [Conv_{1\times 1}(f^{''})]\right\} \right\} \right\} \end{aligned}$$
(6)

where \(\tau\) is the ReLU activation function, and MSCA denotes the multi-spectral channel attention operation. The MSCA mechanism dynamically adjusts the weights of different feature channels, enabling the network to focus on the most informative channels while suppressing irrelevant or redundant ones. This dynamic weighting process enhances the network’s ability to learn target-specific features, which is particularly important for small and complex objects like traffic signs.

Through the above procedures, the SIG module leverages deep semantic information from the backbone network to guide the shallow feature layer. This design enhances the distinction between traffic signs and complex backgrounds, improving the detection of small traffic signs and reducing false positives caused by cluttered environments. Unlike traditional feature fusion methods, the SIG module explicitly strengthens the semantic information in shallow layers, which is critical for detecting small and dim traffic signs.
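The sketch below shows one plausible reading of Eqs. (4)-(6): stride-1 max/average pooling of the deep feature, two CBS blocks with a residual connection back to the deep map, and a per-channel weight produced by a 1 \(\times\) 1 CBS block, channel attention and ReLU, which is then used to re-weight the shallow feature layer. The stride-1 pooling, the squeeze-and-excitation block standing in for the MSCA module, the channel mapping from deep to shallow, and the final element-wise re-weighting are all assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def cbs(in_ch: int, out_ch: int, k: int) -> nn.Sequential:
    """Conv-BN-SiLU block, as used throughout the SIG description."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.SiLU(inplace=True),
    )

class SIG(nn.Module):
    """Semantic information guidance (sketch under the assumptions stated above)."""
    def __init__(self, deep_ch: int, shallow_ch: int):
        super().__init__()
        self.cbs1 = cbs(2 * deep_ch, deep_ch, 1)   # Eq. (5): 1x1 CBS adjusts channels
        self.cbs2 = cbs(deep_ch, deep_ch, 3)       # Eq. (5): 3x3 CBS adds local context
        self.cbs3 = cbs(deep_ch, shallow_ch, 1)    # Eq. (6): 1x1 CBS before attention
        # SE-style channel attention standing in for the MSCA module
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(shallow_ch, shallow_ch // 4, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(shallow_ch // 4, shallow_ch, 1),
        )

    def forward(self, deep: torch.Tensor, shallow: torch.Tensor) -> torch.Tensor:
        # Eq. (4): concatenate locally max- and average-pooled deep features
        f1 = torch.cat([F.max_pool2d(deep, 3, stride=1, padding=1),
                        F.avg_pool2d(deep, 3, stride=1, padding=1)], dim=1)
        # Eq. (5): two CBS blocks plus a residual connection to the deep feature
        f2 = self.cbs2(self.cbs1(f1)) + deep
        # Eq. (6): per-channel weight from 1x1 CBS, channel attention and ReLU
        w = F.relu(self.attn(self.cbs3(f2)))
        # Guide the shallow layer: the (1, 1) spatial weight broadcasts over the shallow map
        return shallow * w

# Example: a deep map (1, 256, 13, 13) guiding a shallow map (1, 128, 26, 26)
out = SIG(256, 128)(torch.randn(1, 256, 13, 13), torch.randn(1, 128, 26, 26))
print(out.shape)  # torch.Size([1, 128, 26, 26])
```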

Experiments

Settings

In order to verify the effectiveness of the proposed E-mobilenet, MDSPP, and SIG modules, several comparative tests are conducted in this section. The experimental environment and the parameter settings are shown in Table 2.

Moreover, during training, the current weight file is saved at the end of each epoch, and the loss function is monitored throughout. The model is evaluated once the loss has stabilized, indicating that the model has converged. Finally, in order to reduce the randomness of the experimental results, the model weights of the 20 epochs after stabilization are averaged for validation.
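The averaging step described above can be carried out directly on the saved state dictionaries; the sketch below assumes the checkpoints were written with torch.save(model.state_dict(), ...) and the file paths are illustrative.

```python
import torch

def average_checkpoints(paths):
    """Average the parameters of several saved state_dicts (identical keys and shapes assumed).
    Integer buffers such as num_batches_tracked are also averaged here for simplicity."""
    avg = None
    for path in paths:
        state = torch.load(path, map_location="cpu")
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k in avg:
                avg[k] += state[k].float()
    return {k: v / len(paths) for k, v in avg.items()}

# Hypothetical weight files saved for the last 20 stable epochs
checkpoints = [f"weights/epoch_{e}.pth" for e in range(181, 201)]
# model.load_state_dict(average_checkpoints(checkpoints))  # then run validation
```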

Table 2 Experimental environment and parameter settings.

Dataset and evaluation metrics

Dataset

  1. CCTSDB dataset: The Chinese traffic sign database (CCTSDB)66 is produced by Zhang Jianming’s team at the Hunan Key Laboratory of Integrated Transportation Big Data Intelligent Processing, Changsha University of Science and Technology. To date, 15,734 images have been released, containing nearly 40,000 traffic sign targets. The labeled data are divided into three categories: indication signs, prohibition signs, and warning signs. In this paper, the CCTSDB dataset is split into CCTSDB_l and CCTSDB_s according to the size of the traffic signs in the image: 11,734 images with large traffic signs form the CCTSDB_l dataset, and the remaining 4000 images with small traffic signs constitute the CCTSDB_s dataset.

  2. GTSDB dataset: The German traffic sign detection benchmark (GTSDB)67 is a standard dataset for traffic sign detection, featuring 900 high-resolution images of 43 common German traffic sign types. It includes diverse scenes with varying weather, lighting, and challenges such as partial occlusion, making it well suited for testing the robustness of detection algorithms in real-world applications. Widely used in autonomous driving and intelligent transportation research, GTSDB is a key benchmark in traffic sign detection.

  3. TT100K dataset: The Tsinghua-Tencent 100K (TT100K)68 is a large-scale traffic sign detection and recognition benchmark with over 100,000 high-resolution images and 221 types of traffic signs commonly found on Chinese roads. Featuring significant class imbalance and challenging scenarios such as occlusion, blur, and lighting variations, it is widely used to assess detection and classification algorithms, making it a key resource in autonomous driving and intelligent transportation research.

We divide each dataset into training and validation sets at a ratio of 7:3; these images contain vehicles in every scene and traffic signs at various locations, which guarantees the authenticity and generality of the data. In addition, in order to verify the effectiveness of the improved ultra-lightweight, high-precision network structure in practical applications, we use mobile phones to capture images of real scenes inside and around the campus. These scenes include traffic sign images under different conditions, covering different angles, lighting conditions, and distances, so the actual application scenario is simulated more realistically, which helps in evaluating the performance of the improved network structure in a complex environment.
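For reproducibility, the 7:3 split described above can be generated with a fixed random seed; the directory layout and file extension below are assumptions, not the datasets' actual structure.

```python
import random
from pathlib import Path

random.seed(0)
images = sorted(Path("CCTSDB/images").glob("*.jpg"))  # hypothetical dataset location
random.shuffle(images)

split = int(0.7 * len(images))                        # 7:3 train/validation split
train_files, val_files = images[:split], images[split:]
print(len(train_files), len(val_files))
```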

Evaluation metrics

When evaluating a target detection model, both accuracy and speed are generally measured. The accuracy evaluation indices mainly include the following four:

  • Precision (Pr): represents the proportion of samples classified as positive that are truly positive.

    $$\begin{aligned} Pr = \frac{TP}{TP+FP}. \end{aligned}$$
    (7)
  • Recall (Re): represents the proportion of truly positive samples that are correctly classified as positive.

    $$\begin{aligned} Re = \frac{TP}{TP+FN}. \end{aligned}$$
    (8)
  • F1 score (F): considers precision and recall jointly; it is the harmonic mean of the two.

    $$\begin{aligned} F=\frac{2 \times Pr \times Re}{Pr+Re}, \end{aligned}$$
    (9)

    where TP, FP and FN are true positive examples, false positive examples and false negative examples, respectively.

  • mAP: AP represents the average precision of a single class, corresponding to the area under the precision-recall curve, and mAP is the mean of the AP values over all classes. The mAP lies in the range [0,1], and larger is better (see the code sketch after this list).

    $$\begin{aligned} & AP=\int _{0}^{1} P\left( R \right) dR, \end{aligned}$$
    (10)
    $$\begin{aligned} & mAP=\frac{1}{C} \sum _{i=1}^{C}AP_{i}, \end{aligned}$$
    (11)

    where P, R, P(R), C and \(AP_{i}\) represent precision, recall, the precision-recall curve, the total number of classes and the AP value of class i, respectively.

  • Frames per second (FPS): represents the number of images processed per second.
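As a compact reference for Eqs. (7)-(11), the snippet below computes precision, recall and F1 from detection counts, and a single-class AP from a ranked list of detections using the standard all-point interpolation of the precision-recall curve; it is a generic sketch rather than the exact evaluation code used in the experiments.

```python
import numpy as np

def precision_recall_f1(tp: int, fp: int, fn: int):
    pr = tp / (tp + fp) if tp + fp else 0.0            # Eq. (7)
    re = tp / (tp + fn) if tp + fn else 0.0            # Eq. (8)
    f1 = 2 * pr * re / (pr + re) if pr + re else 0.0   # Eq. (9)
    return pr, re, f1

def average_precision(scores, is_tp, num_gt):
    """Single-class AP: area under the precision-recall curve, Eq. (10)."""
    order = np.argsort(-np.asarray(scores))
    tp = np.asarray(is_tp, dtype=float)[order]
    cum_tp, cum_fp = np.cumsum(tp), np.cumsum(1.0 - tp)
    recall = np.concatenate(([0.0], cum_tp / num_gt, [1.0]))
    precision = np.concatenate(([1.0], cum_tp / (cum_tp + cum_fp), [0.0]))
    for i in range(len(precision) - 2, -1, -1):         # monotone precision envelope
        precision[i] = max(precision[i], precision[i + 1])
    idx = np.where(recall[1:] != recall[:-1])[0]
    return float(np.sum((recall[idx + 1] - recall[idx]) * precision[idx + 1]))

# Eq. (11): mAP is the mean of the per-class APs
aps = [average_precision([0.9, 0.8, 0.6], [1, 0, 1], num_gt=2)]  # toy single class
print(sum(aps) / len(aps))                                       # 0.8333...
```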

Fig. 5. Computational complexity analysis of MASG-Net and existing methods in terms of the number of parameters and mAP. (a) and (b) show the test results on CCTSDB_s and TT100K, respectively.

Results and analysis

Table 3 The performance comparisons of various models on the CCTSDB_s dataset.
Table 4 The performance comparisons of various models on the GTSDB dataset.
Table 5 The performance comparisons of various models on the TT100K dataset.

Quantitative comparison with state-of-the-arts

In order to comprehensively compare the performance of MASG-Net with other current mainstream networks, Table 3 reports their mAP and FPS on CCTSDB_s together with their model sizes. Compared with the large, complex networks SSD_512 and YOLOv4, MASG-Net still shows a certain gap in detection accuracy but is significantly ahead in detection speed and model size. The YOLOv4-tiny+AFPN+RFB network72 is built by adding adaptive feature pyramid network (AFPN) and receptive field block (RFB) modules to YOLOv4-tiny. Compared to YOLOv4-tiny+AFPN+RFB, MASG-Net reduces the number of model parameters while improving both detection accuracy and speed. Compared to the latest YOLOv7-tiny, the two networks achieve similar detection accuracy, but MASG-Net uses a lightweight backbone network and still leads in model size. In addition, lightweight models such as MASG-Net and YOLOv4-tiny have significantly faster detection speeds than complex models such as SSD and YOLOv4. Therefore, the proposed MASG-Net offers superior comprehensive performance for traffic sign detection applications.

Fig. 6. Visualization results of different algorithms on the CCTSDB dataset.

Fig. 7. Visualization results of different algorithms on self-shot real scenes.

Fig. 8. Visualization results of different algorithms on the TT100K dataset.

On GTSDB, we compare MASG-Net with several state-of-the-art methods published in recent years, including both lightweight and high-accuracy detection networks commonly used for traffic sign detection. The comparison models include YOLOv4-tiny, Ren et al.69, Tang et al.70, Zhang et al.71 and Yao et al.72. From the detection results in Table 4, it can be seen that MASG-Net achieves the second-best performance on multiple indicators, with an mAP of 90.8%. In the GTSDB dataset, the scale and angle of traffic signs in an image can change with shooting distance, camera viewpoint, or the installation position of the sign, which makes the detection of small targets at long distances or oblique angles particularly difficult. To address this challenge, MASG-Net relies on the MDSPP and SIG modules to capture both long-range global context and fine-grained local features, which strengthens the model’s understanding of the macroscopic structure and microscopic details of traffic signs and thereby improves detection accuracy. MASG-Net achieves 203.4 FPS, combining high detection accuracy with impressive inference speed.

To further assess the effectiveness of the proposed algorithm, it is compared with several target detection models, including SSD, YOLOv4-tiny, YOLOv7-tiny, and YOLOv8n73, on the TT100K dataset. Table 5 presents the results, where Params indicates the total number of parameters needed for model training, and mAP evaluates the overall detection accuracy across all categories, reflecting the model’s performance. The experimental results show that our algorithm achieves the second highest mAP among all compared models, reaching 68.6%, which is superior to most other models in detection accuracy. From the specific data in Table 5, it can be seen that the proposed algorithm performs best on the io and pl50 detection tasks, with detection accuracies of 83.8% and 68.4%, respectively. At the same time, MASG-Net achieves the second-best performance on the p11, pl40 and po detection tasks. For the other traffic sign categories, the proposed algorithm also maintains high accuracy and ranks second overall on the mAP indicator.

To better show the lightweight design and effectiveness of the proposed method, we draw a figure in which the x-axis indicates the number of parameters and the y-axis the performance of the different methods. In Fig. 5, we compare the performance of MASG-Net with preceding algorithms, including SSD, YOLOv4-tiny, YOLOv4, YOLOv7-tiny, and YOLOv4-tiny+AFPN+RFB. As shown in Fig. 5, MASG-Net achieves a superior balance between lightweight design and high accuracy, outperforming other models with fewer parameters. This result shows that the proposed algorithm can achieve superior detection performance while reducing the number of parameters, reflecting a good balance between computational efficiency and accuracy and demonstrating high practical application value.

Qualitative results

We visualize the detection results of different methods in specific scenarios on the CCTSDB dataset, as shown in Fig. 6. YOLOv4, YOLOv4-tiny, YOLOv7-tiny and MASG-Net are used to detect signs in the public dataset. It can be seen that the proposed MASG-Net achieves high detection accuracy. This is because MASG-Net accurately identifies traffic signs by capturing fine-grained features and long-range context, and its MDSPP and SIG modules minimize external interference and ambiguous detections. The visual comparisons clearly illustrate MASG-Net’s superior ability to detect small and difficult-to-recognize traffic signs while producing fewer false positives in cluttered backgrounds.

To better verify the generalizability of MASG-Net, we use mobile devices to photograph real scenes such as traffic signs and surrounding roads on campus, as shown in Fig. 7. YOLOv4, YOLOv4-tiny, YOLOv7-tiny and MASG-Net are used to detect signs in these random scenes. Observing the detection results, the improved model not only raises the confidence with which the prediction box is judged to belong to a certain class, but also localizes the prediction box more accurately than YOLOv4, YOLOv4-tiny and YOLOv7-tiny, with the center point of the prediction box essentially coinciding with the center point of the traffic sign. Overall, MASG-Net achieves a significant improvement in the detection of small targets and of dim, blurry traffic sign images captured in insufficient light. The practicability and robustness of MASG-Net are verified by testing on the traffic sign images taken around the campus with mobile phones, demonstrating that MASG-Net has strong generalization ability in real road scenes.

On the TT100K dataset, we have performed a visual comparative analysis of the baseline models YOLOv4-tiny, YOLOv7-tiny, and the proposed algorithm MASG-Net, as shown in Fig. 8. These results showcase detection outputs for various challenging scenarios, such as small, blurry, and occluded traffic signs, as well as signs in low-light environments. The comparison results show that the proposed algorithm outperforms the YOLOv4-tiny and YOLOv7-tiny models in detecting various types of traffic signs, including i5, io, p11, pl40, pl50 and pn. The prediction box generated by the proposed MASG-Net has a higher degree of match with the actual sign area, especially in the case of complex background, partial occlusion, aging, defacement and other reasons that lead to missing information, showing stronger robustness.

Ablation study

In order to evaluate MASG-Net more systematically and comprehensively, models trained on the CCTSDB dataset and its subsets CCTSDB_s and CCTSDB_l are used for the ablation experiments, and the final results are shown in Table 6.

Table 6 Ablation study results on the CCTSDB dataset.

From the results on CCTSDB_s in Table 6, it can be seen that the performance of MASG-Net has been significantly improved compared with the original network. The new backbone network E-mobilenet brings an obvious accuracy improvement, increasing the mAP from 91.4% to 92.4%, which verifies the feature extraction ability of E-mobilenet. This design resolves the performance-complexity trade-off by using local cross-channel interactions, significantly reducing network complexity while maintaining performance. By introducing multi-scale dilated convolution operations into the SPP structure, the new MDSPP module improves the mAP of the model from 92.4% to 94.1%, and the precision, recall and F1 score increase by 0.9%, 2.7% and 1.8%, respectively. This indicates that MDSPP can effectively enhance the network’s ability to extract features. Furthermore, when equipped with the proposed SIG module, the model gains 0.1 points in mAP and 0.7 points in F1 score, and the recall increases from 93.2% to 93.9%. This demonstrates the effectiveness of our SIG module in enhancing the detection performance for tiny signs.

Compared with the network before the improvement, the precision, recall, F1 score and mAP of MASG-Net are increased by 4.7%, 5.0%, 4.8% and 2.8%, respectively. It can be seen that MASG-Net effectively addresses the weak feature extraction ability of YOLOv4-tiny when detecting small traffic sign targets. Moreover, MASG-Net is suitable not only for detecting small traffic signs but also for detecting large ones. Therefore, compared to YOLOv4-tiny, the proposed MASG-Net improves the recognition of traffic signs of different sizes.

Potential limitations

Although the MDSPP module enlarges the receptive field and enhances the detection of small targets, performance may degrade when detecting extremely small traffic signs that occupy only a few pixels in the image, owing to the inherent limitations of feature extraction at such a small scale. In addition, high-speed motion can cause significant motion blur, which reduces the clarity of traffic signs and makes detection more challenging; while MASG-Net enlarges the receptive field and captures global contextual information, extreme motion blur may still lead to missed detections or false positives. Finally, the performance of MASG-Net may still be affected under extreme adverse weather conditions, such as heavy rain, fog, or snow, where the visibility of traffic signs is significantly reduced.

Conclusion

In this paper, an ultra-lightweight and high-precision network, MASG-Net, is proposed on the basis of the YOLOv4-tiny network for traffic sign detection applications. First, an ultra-lightweight feature extraction network, E-mobilenet, is designed to enhance the feature extraction capability of the network while effectively reducing the number of parameters. Second, based on SPP, the MDSPP module is proposed, which greatly enlarges the receptive field of the feature map and enables the network to obtain more global information. Finally, we propose a SIG module that utilizes deep feature layers to guide shallow feature layers; by refining the semantics of the shallow feature layer, the influence of complex backgrounds on detection performance is reduced. The combination of the E-mobilenet backbone, MDSPP and SIG modules significantly improves the detection of small, dim, and blurry traffic signs, especially in challenging environments such as low-light conditions. Compared with the network before improvement, the precision, recall, F1 score, and mAP of MASG-Net are increased by 4.7%, 5.0%, 4.8% and 2.8%, respectively, showing that MASG-Net effectively addresses the weak feature extraction ability of YOLOv4-tiny for small traffic sign targets. Compared to other models, it achieves better detection accuracy with smaller model complexity. In addition, the feasibility of MASG-Net for detecting traffic signs in real scenes is verified on road environment pictures.

However, there are still some shortcomings in this work. Vehicles traveling at high speed may degrade imaging quality, the pictures captured by the camera may be blurred or deformed and difficult to identify, and other vehicles, pedestrians or buildings may partially occlude the traffic signs, affecting the detection results. Subsequent detection algorithms need to handle these situations accurately. To further validate the real-time performance of MASG-Net on devices with limited computational resources (e.g., automotive ECUs or edge devices), we plan to deploy the model on platforms such as NVIDIA Jetson Nano, Raspberry Pi, or similar hardware.