Introduction

UAVs, commonly known as drones, are flying machines that operate without a human pilot onboard. They have been extensively utilized across various domains for many years. For instance, UAVs are employed in aerial photography and mapping1, 3D modeling of buildings combined with MR techniques2, as well as in assisting communication3 and target detection4. Owing to the rapid growth of UAV applications, they have attracted significant attention from both industry and academia5. In particular, UAVs are regarded as a potential solution to challenges in the design of wireless communication networks, thanks to their highly flexible deployment and dynamic mobility. However, the scarcity of spectrum resources has hindered substantial rate improvements for many emerging applications6. A potential remedy is AMC, which enables UAVs to adaptively select the most suitable modulation method based on current channel conditions and transmission distance, effectively enhancing data transmission rates7.

In this paper, the UAVs are constrained to consumer-grade specifications, with a focus on limited power consumption and size. A wireless communication network with a fog computing communication system architecture8 is formed from such UAVs, allowing them to process data locally without transmitting it to remote cloud servers. Owing to the dynamic mobility of UAVs, UAV communication networks have intermittent links and fluid topology9. AMC10 is a technique for automatically identifying the modulation scheme of a signal with little or no prior knowledge; it acts as an intermediate step between signal detection and demodulation. In UAV communication systems, AMC enables a UAV to adaptively select the most suitable modulation method based on current channel conditions and transmission distance, thereby effectively enhancing data transmission7. Furthermore, as a key technology11 for dynamic spectrum access (DSA), AMC plays a significant role in enhancing spectrum efficiency and expanding communication capacity12. In complex electromagnetic environments13, AMC further demonstrates its potential for monitoring unauthorized devices or signals, particularly in applications such as UAV identification14, interference recognition15, and electronic countermeasures16. In conclusion, AMC serves an important function in optimizing UAV communication performance and ensuring communication security.

After decades of development, AMC methods have diverged into two main categories: traditional recognition methods and DL-based recognition methods. Traditional AMC approaches can be broadly divided into two types: likelihood-based (LB)17 hypothesis testing and feature-based (FB)18,19 methods. LB methods are rooted in Bayesian theory and aim to achieve the optimal estimation of modulation schemes by minimizing the probability of misclassification. Although theoretically optimal, their practical application is limited by the lack of necessary prior knowledge in noncooperative communication scenarios. FB methods, on the other hand, involve extracting features and constructing a classifier. Common features include constellation features20, statistical parameter features21, and wavelet transform features22. These methods usually rely on expert experience and offer limited automation. With the increasing complexity of communication systems and the demand for rapid response in UAV systems, traditional AMC methods therefore no longer meet current application requirements. Against this backdrop, DL-based AMC technology has emerged.

DL possesses the ability to automatically extract complex features from training samples and build classifiers, making it highly suitable for AMC tasks. O’Shea et al.23 were the first to validate the effectiveness of DL in AMC and released the RML2016.10A dataset, which attracted a large number of researchers. The most popular DL architectures for AMC include CNNs24, long short-term memory (LSTM)25, and transformers26. In 2017, O’Shea et al.27 proposed the CLDNN network, which combines CNN and LSTM, to explore the relationship between modulation recognition and network depth. F. Zhang et al.28 introduced a robust structure called MCNet, based on residual connections and asymmetric convolution kernels, which achieved significant performance improvements. J. Xu et al.29 proposed a hybrid structure called MCLDNN, featuring 1D convolution, 2D convolution, and LSTM, which can extract and fuse spatiotemporal features from the in-phase (I) component, quadrature (Q) component, and combined IQ stream of the signal. Cui30 developed the IQCLNet network, which uses convolution kernels capable of extracting IQ-related features, achieving an average recognition accuracy of 81.7% when identifying 11 modulation types at SNR > 0 dB. The networks above designed complex structures around the phase, temporal, or fused features of the signal; however, they did not fully consider the model’s adaptability to edge devices.

F. Wang et al.31 proposed a lightweight radio transformer (MobileRaT) method based on deep learning, which reduces model weights through iterative training combined with information entropy-based weight pruning. A. Gon et al.32 introduced an AMC approach using hybrid data augmentation and a lightweight neural network. The network structure was designed with depth-wise separable convolutions to lower computational complexity. Q. Zheng et al.33 presented a lightweight deep neural network framework that integrates residual modules, LSTM modules, and attention mechanisms, along with the use of hybrid data augmentation techniques. These methods have contributed to the realization of lightweight AMC technology and edge computing. While lightweight models can reduce computational complexity and energy consumption, maintaining high accuracy remains a challenge.

In light of this, we focus on the design of lightweight, low-complexity models for resource-constrained scenarios while maintaining high accuracy. The main contributions are summarized as follows:

  • Development of a lightweight neural network model: for the AMC task, we have developed a novel ultra-lightweight neural network model. This model contains only 8,815 parameters yet demonstrates high performance, making it highly suitable for deployment in resource-constrained applications.

  • Design of a DSECA module: we have designed a dual-stream efficient channel attention (DSECA) module that can effectively capture the correlations between different channels, thereby enhancing the representational power of features. Moreover, the DSECA module is highly versatile and can be easily integrated into any existing CNN architecture.

  • Design of a DSFFM module: we have designed an innovative dual-stream feature fusion module (DSFFM) that enhances the model’s ability to capture features at different granularities by fusing output feature maps from different layers. This strategy contributes to improving model performance.

Theoretical model

Drone communication model

In the UAV communication system, the ground control station is responsible for real-time monitoring of the UAV’s status and sending control instructions. The UAV navigates autonomously according to a predetermined flight plan and conducts reconnaissance on the communication links in space in real-time. Equipped with an embedded microprocessor, the UAV can immediately demodulate and analyze the received signals. During the UAV communication process, the received signal can be expressed as

$$r(t)=\rho e^{j{\Delta _{fo}}(t)}\int_{0}^{{\tau _0}} s \left( {{\Delta _{co}}(t - \tau )} \right)d\tau +\tilde {n}(t)$$
(1)

where \(\rho\) denotes the channel amplitude gain following a Rayleigh distribution within the range (0,1]. \({\Delta _{fo}}(t)\) and \({\Delta _{co}}(t)\) are the carrier frequency offset and the sampling rate offset respectively. \({\tau _0}\) is the maximum delay spread. \(s(t)\) represents the noise-free complex transmitted signal. And \(\tilde {n}(t)\) represents the additive white Gaussian noise (AWGN) with zero mean and variance \({\sigma ^2}\).
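To make the channel model concrete, here is a minimal discrete-time sketch (our own illustration, not the authors' code) that applies the amplitude gain ρ, a carrier frequency offset, and complex AWGN from Eq. (1); the delay-spread integral is omitted for simplicity, and the name `rx_signal` and its parameters are hypothetical.

```python
import cmath
import math
import random

def rx_signal(s, rho=1.0, f_off=0.0, sigma=0.0, rng=None):
    """Simplified discrete-time analogue of Eq. (1): amplitude gain rho,
    carrier frequency offset f_off (cycles/sample), and complex AWGN with
    per-component standard deviation sigma."""
    rng = rng or random.Random(0)
    out = []
    for n, sn in enumerate(s):
        cfo = cmath.exp(2j * math.pi * f_off * n)      # frequency-offset rotation
        noise = complex(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma))
        out.append(rho * cfo * sn + noise)
    return out
```

With ρ = 1, no offset, and σ = 0, the sketch reduces to an ideal channel that returns the transmitted samples unchanged, which is a quick sanity check of the model.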

Signal representation and data augmentation

The signal received by the UAV is typically recorded in IQ mode34. In this mode, the continuous signal \(r(t)\) is internally sampled by the UAV and converted into a discrete time sequence \(r(n)\) for storage. The discrete signal \(r(n)\) consists of two data vectors, the in-phase component \({r_I}\) and the quadrature component \({r_Q}\). Therefore, the received signal \(r(n)\) can be rewritten as

$$r(n)={r_I}(n)+j{r_Q}(n),\quad n=0,1,\ldots,N-1$$
(2)

Phase shift33 is an effective means of data augmentation in radio signal processing. By rotating the radio signal, we can generate data samples with different phase angles without altering the signal amplitude, thereby increasing the diversity of the dataset. Specifically, by performing a phase rotation of the received signal \(({r_I},{r_Q})\) around the origin, we obtain the following augmented sample35 \((r_{I}^{\prime },r_{Q}^{\prime })\).

$$\left[ {\begin{array}{*{20}{l}} {r_{I}^{\prime }} \\ {r_{Q}^{\prime }} \end{array}} \right]=\left[ {\begin{array}{*{20}{r}} {\cos \theta }&{ - \sin \theta } \\ {\sin \theta }&{\cos \theta } \end{array}} \right]\left[ {\begin{array}{*{20}{l}} {{r_I}} \\ {{r_Q}} \end{array}} \right]$$
(3)

where \(\theta\) is the rotation angle. By rotating the received signal by 0, π/2, π, and 3π/2 respectively, the original sample data are augmented fourfold, yielding the augmented signal set \(X=(r,{r_{\pi /2}},{r_\pi },{r_{3\pi /2}})\). Figure 1 shows the constellation diagram of CPFSK signal samples, in which the distribution of sample points after the different phase rotations can be observed visually.
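The rotation of Eq. (3) and the fourfold augmentation can be sketched as follows (an illustrative implementation; the function names are our own):

```python
import math

def rotate_iq(r_i, r_q, theta):
    """Rotate an IQ sample sequence by angle theta, per Eq. (3)."""
    c, s = math.cos(theta), math.sin(theta)
    r_i2 = [c * i - s * q for i, q in zip(r_i, r_q)]
    r_q2 = [s * i + c * q for i, q in zip(r_i, r_q)]
    return r_i2, r_q2

def augment(r_i, r_q):
    """Fourfold augmentation: rotations by 0, pi/2, pi, and 3*pi/2."""
    return [rotate_iq(r_i, r_q, k * math.pi / 2) for k in range(4)]
```

Because the rotation matrix is orthogonal, the amplitude of every sample is preserved, matching the paper's claim that only the phase angle changes.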

Fig. 1

Constellation diagram of the GFSK signal with phase-shift data augmentation.

Problem formulation

The objective of this study is to adopt a DL-based approach to solve the AMC task. The aim is to construct a neural network model f that is both highly accurate and low in complexity, thereby enabling it to accurately identify the modulation scheme of the input sample signal x. The model f maps the input signal x to the modulation class output y through parameters w. The objective is to find a set of weights w such that the predicted modulation class \(\hat {y}\) is as consistent as possible with the actual modulation class y when given the input signal x, maximizing the probability of correct prediction. The process can be formulated as follows.

$$w^{*}=\operatorname*{argmax}_{w} P\left( {\hat {y}=y\mid {\mathbf{x}};w} \right),\quad \hat {y}=f({\mathbf{x}};w)$$
(4)

Our proposed AMC method

Model structure

We propose an ultra-lightweight network model, ULNN, for AMC, as shown in Fig. 2. The network consists of three key components: the lightweight convolution module (LCM), the dual-stream efficient channel attention module (DSECA), and the dual-stream feature fusion module (DSFFM). The DSECA module is embedded within the LCM module. In the following subsections, we introduce the three modules in detail.

Fig. 2

The overall architecture of the ULNN network.

Lightweight convolution module

The sample signal x is of size 128 × 2, with the real component \({{\mathbf{x}}_0}\) in the first column and the imaginary component \({{\mathbf{x}}_1}\) in the second column. Before the signal enters the LCM module, a complex convolution operation is performed on the sample to exploit the correlation between the I and Q channels of the IQ signal36. The complex convolution operation37 is shown in Eq. (5), where w and k represent the weight and size of the complex convolution kernel, respectively. Complex numbers inherently include phase information, and complex convolution can process magnitude and phase simultaneously. Therefore, despite being more complicated than real convolution, complex convolution is a more appropriate choice for data samples augmented by phase rotation. Note that the complex convolution layer is followed by a complex batch normalization (BN) layer and a complex ReLU activation function. The output feature map after complex convolution is given in Eq. (6).

$${w_k}*{\mathbf{x}}=\left[ {\begin{array}{*{20}{l}} {\operatorname{conv} \left( {{{\mathbf{x}}_0},{w_{k,0}}} \right) - \operatorname{conv} \left( {{{\mathbf{x}}_1},{w_{k,1}}} \right)} \\ {\operatorname{conv} \left( {{{\mathbf{x}}_0},{w_{k,1}}} \right)+\operatorname{conv} \left( {{{\mathbf{x}}_1},{w_{k,0}}} \right)} \end{array}} \right]$$
(5)
$${M_{CV}}={f_{\operatorname{Re} LU}}[{f_{BN}}(w*{\mathbf{x}})]$$
(6)
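Equation (5) can be verified with a small sketch built from four real 'valid' 1-D convolutions; this is an illustration with our own function names, not the authors' implementation. A single-tap kernel should match ordinary complex multiplication:

```python
def conv1d(x, w):
    """'Valid' 1-D correlation of sequence x with kernel w."""
    k = len(w)
    return [sum(x[i + j] * w[j] for j in range(k))
            for i in range(len(x) - k + 1)]

def complex_conv(x0, x1, w0, w1):
    """Eq. (5): convolve a complex signal (x0 + j*x1) with a complex
    kernel (w0 + j*w1) using four real convolutions."""
    real = [a - b for a, b in zip(conv1d(x0, w0), conv1d(x1, w1))]
    imag = [a + b for a, b in zip(conv1d(x0, w1), conv1d(x1, w0))]
    return real, imag
```

For example, convolving the two-sample signal 1+2j, 3+4j with the single tap 2+1j reproduces (1+2j)(2+1j) = 5j and (3+4j)(2+1j) = 2+11j.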

The primary role of the LCM module is to extract deep features and reduce feature dimensions. As shown in Fig. 3, the LCM module mainly consists of split convolution, channel shuffle, and an attention mechanism. Split convolution is an efficient convolution operation for extracting feature maps and comprises two parts. The first is depth-wise convolution (DW): convolution is performed on each input channel separately rather than mixing all channels together, which significantly reduces computational cost and model complexity. The second is pointwise convolution (PW): after DW, 1 × 1 convolution kernels combine channel information, enhancing the network’s expressive ability while maintaining the depth of the feature map. After split convolution, different groups of feature maps cannot easily exchange information, which may reduce the learning efficiency of the network. We therefore introduce a channel shuffle operation to increase information flow between the output feature maps. Since channel shuffle achieves this simply by rearranging channels, it avoids additional convolution or fully connected layers, keeping the number of parameters and the computational complexity low. After obtaining the shuffled feature map F, we apply an attention mechanism to further realize local cross-channel interaction.
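As a sketch of the ideas above, the following illustrates channel shuffle as a reshape-transpose-flatten over the channel list, together with the parameter savings of depth-wise plus pointwise (split) convolution over a standard convolution. The function names and counting conventions (bias terms ignored) are our own assumptions, not the authors' code:

```python
def channel_shuffle(channels, groups):
    """Rearrange a flat list of per-channel feature maps so information
    mixes across groups: reshape to (groups, C // groups), transpose,
    and flatten."""
    per_group = len(channels) // groups
    grouped = [channels[g * per_group:(g + 1) * per_group]
               for g in range(groups)]
    return [grouped[g][i] for i in range(per_group) for g in range(groups)]

def standard_conv_params(c_in, c_out, k):
    """Weights of a standard 1-D convolution (biases ignored)."""
    return c_in * c_out * k

def separable_conv_params(c_in, c_out, k):
    """Depth-wise (c_in * k) plus point-wise 1x1 (c_in * c_out) weights."""
    return c_in * k + c_in * c_out
```

For 32 input channels, 64 output channels, and kernel size 3, the split convolution uses roughly a third of the weights of the standard one, which is the kind of saving that keeps the LCM module lightweight.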

Fig. 3

The architecture of the lightweight convolution module.

Dual-stream efficient channel attention module

The DSECA module is an improvement based on the ECA38 module. As shown in Fig. 4, the ECA module extracts aggregated features \({M_1}\) from the feature map F by using Global Average Pooling (GAP) and a one-dimensional convolution (Conv1D). Building on the ECA module, the DSECA adds a parallel path of Global Max Pooling (GMP) and Conv1D to extract salient features \({M_2}\) from the feature map F. Subsequently, \({M_1}\) and \({M_2}\) are vertically concatenated along the channel dimension to form the comprehensive channel weight M. Finally, the comprehensive channel weight M is multiplied with the original feature map F, achieving a reweighting of the features across different channels. This design allows the DSECA module to more effectively highlight useful features and suppress less important ones, enhancing the model’s ability to capture key information. The overall attention process can be summarized as

$$F \in {R^{C \times H \times W}},\quad M \in {R^{C \times 1 \times 1}}$$
(7)
$$k={\left| {\frac{{{{\log }_2}C+b}}{\gamma }} \right|_{{\text{odd}}}}$$
(8)
$${M_1}=\sigma ({w_k}*{g_1}(F))$$
(9)
$${M_2}=\sigma ({w_k}*{g_2}(F))$$
(10)
$${F^\prime }=[{M_1};{M_2}] \times F$$
(11)

In the formulas above, k represents the size of the convolution kernel, which is adaptively matched to the number of channels C through the constants b and γ. \({w_k}\) denotes the weight of the convolution kernel. \({g_1}\) and \({g_2}\) represent GAP and GMP respectively, and σ is the sigmoid function. In summary, DSECA can be easily integrated into any CNN architecture, contributing to the construction of more efficient deep learning models with improved generalization capabilities.
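The adaptive kernel size of Eq. (8) picks the nearest odd value, following the ECA formulation; a small sketch (the helper name is our own):

```python
import math

def eca_kernel_size(C, gamma=2, b=1):
    """Adaptive Conv1D kernel size from Eq. (8): nearest odd value of
    |(log2(C) + b) / gamma|."""
    t = int(abs((math.log2(C) + b) / gamma))
    return t if t % 2 else t + 1
```

With the commonly used constants γ = 2 and b = 1, a 64-channel feature map gets a kernel of size 3 and a 256-channel one a kernel of size 5, so the attention's receptive field grows slowly with channel count.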

Fig. 4

The architecture of the dual-stream ECA module.

Dual-stream feature fusion module

Recall that after the complex convolution at the network input, the sample x cascades through six LCM modules. The DSFFM module is applied to the last three LCM modules with the aim of fusing the features of these deeper layers. As depicted in Fig. 5, the input to the DSFFM module is the feature map \(F_{3}^{\prime }\), the output of the third LCM module. Similar to the strategy of the DSECA module, the outputs of the fourth, fifth, and sixth LCM modules are processed through parallel GAP and GMP branches, yielding the feature maps \({F_{Ave}}\) and \({F_{Max}}\). Within each branch, an ‘Add’ operation is used for feature fusion, and Conv1D is employed for dimensionality reduction. Finally, \({F_{Ave}}\) and \({F_{Max}}\) are vertically concatenated along the channel dimension to obtain the final output feature map \({F_{End}}\). The overall process can be summarized as

$${F_{Ave}}=\sigma (w*\sum\limits_{{l=4}}^{6} {{g_1}} \left( {F_{l}^{\prime }} \right))$$
(12)
$${F_{Max}}=\sigma (w*\sum\limits_{{l=4}}^{6} {{g_2}} \left( {F_{l}^{\prime }} \right))$$
(13)
$${F_{End}}=({F_{Ave}};{F_{Max}})$$
(14)

Here w represents the weights of the Conv1D kernels, and \(F_{l}^{\prime }\) is the output feature map of the l-th LCM module. The activation function used here is ReLU, denoted by σ. After the DSFFM module, the output feature map \({F_{End}}\) is fed into a fully connected layer. A SoftMax layer maps the output of the fully connected layer to probabilities for each modulation category, and the network is trained with a cross-entropy loss; the label with the highest probability is taken as the predicted modulation type of the input signal. The DSFFM module allows the network to integrate feature maps from different layers, thereby capturing features at varying levels of granularity. This strategy enhances the model’s performance because it captures both local detailed information and more global contextual information.
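A simplified sketch of the DSFFM fusion in Eqs. (12)-(14), treating each feature map as a list of channels and omitting the Conv1D reduction for brevity (function names are our own, not the authors' code):

```python
def gap(fmap):
    """Global average pooling: one scalar per channel."""
    return [sum(ch) / len(ch) for ch in fmap]

def gmp(fmap):
    """Global max pooling: one scalar per channel."""
    return [max(ch) for ch in fmap]

def relu(v):
    return [max(0.0, x) for x in v]

def dsffm(f4, f5, f6):
    """Per-branch element-wise addition of pooled features from the last
    three LCM outputs, activation, then concatenation of the average and
    max branches along the channel dimension (Eqs. 12-14)."""
    f_ave = relu([a + b + c for a, b, c in zip(gap(f4), gap(f5), gap(f6))])
    f_max = relu([a + b + c for a, b, c in zip(gmp(f4), gmp(f5), gmp(f6))])
    return f_ave + f_max  # list concatenation = channel-wise concat
```

The output therefore has twice as many channels as each branch, one half summarizing average behaviour and the other half salient peaks.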

Fig. 5

The architecture of the dual-stream feature fusion module.

Experimental methodology

To gain a deeper understanding of the characteristics of the ULNN network, we designed a series of experiments to address three research questions: (1) Ablation experiments: we evaluated the specific impact of data augmentation, the DSECA and DSFFM modules, and the dual-stream structure on model performance. (2) Model performance analysis: we compared the ULNN network with five benchmark models on the RML2016.10A dataset. (3) Model complexity analysis: we compared the ULNN network with the same five benchmark models in terms of parameter count, weight size, inference time, and FLOPs. Together, these experiments comprehensively assess the effectiveness and efficiency of the ULNN network for the AMC task.

Datasets

All experiments were conducted on the public dataset RML2016.10A, which is generated with Python scripts using the GNU Radio processing library. The source signals consist of ASCII text and audio, with channel impairments such as sampling frequency offset, Rayleigh fading, and time delay. The specific details of the dataset are shown in Table 1.

Table 1 Details of the dataset RML2016.10A.

Baseline models

To evaluate network performance, we compared our model with five existing benchmarks: (1) CLDNN27: a classic structure from speech recognition that has been successfully transferred to electromagnetic signal recognition; it effectively extracts temporal features of signals by combining CNN and LSTM. (2) MCNet28: MCNet effectively captures temporal features and prevents overfitting by arranging asymmetric kernels and residual modules in parallel. (3) MCLDNN29: a space-time multi-stream architecture that integrates the temporal feature information of the I, Q, and combined IQ streams. (4) IQCLNet30: also based on the CNN and LSTM architecture, but unlike CLDNN, IQCLNet places more emphasis on extracting the correlation features between the I and Q streams. (5) FastMLDNN39: a novel lightweight single-stream neural network composed of group convolution layers and transformer encoding layers. These networks have all been validated on the RML2016.10A dataset and therefore serve as reasonable benchmarks for assessing the proposed ULNN model.

Evaluation metrics

In our experimental evaluation, we primarily conducted classification of modulated signals, and hence used accuracy as the metric to assess the model’s classification capability. The formula for accuracy is given below, where PAcc represents the classification accuracy, Ncorrect denotes the number of correctly identified samples, and Ntest denotes the number of test samples.

$$P_{{\text{Acc}}}=\frac{{N_{{\text{correct}}}}}{{N_{{\text{test}}}}} \times 100\%$$
(15)
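Equation (15) amounts to the following one-liner (illustrative, with our own function name):

```python
def accuracy(y_true, y_pred):
    """Classification accuracy P_Acc in percent, per Eq. (15)."""
    n_correct = sum(t == p for t, p in zip(y_true, y_pred))
    return n_correct / len(y_true) * 100.0
```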

Experiment results and analysis

In this section, we address the research questions posed in Sect. 4 and present the experimental results. We first state the training environment and parameter settings. All experiments were conducted on a GTX1080Ti GPU using Python 3.6, Tensorflow-gpu 1.14, and Keras 2.2.4. The numbers of training, validation, and testing samples were 308,000 (after DA), 33,000, and 110,000, respectively. For optimization we chose the Adam algorithm, with a dynamic learning-rate schedule: the learning rate started at 0.001 and was reduced by 80% if validation accuracy did not improve for 10 consecutive epochs. Additionally, we set the batch size to 128 and the number of epochs to 150.
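The learning-rate schedule described above (start at 0.001, multiply by 0.2 after 10 stagnant validation epochs) behaves like Keras's `ReduceLROnPlateau` callback and can be sketched as follows; this is our own minimal implementation, not the authors' training code:

```python
class PlateauLR:
    """Reduce-on-plateau schedule: cut the learning rate by 80%
    (multiply by 0.2) when validation accuracy has not improved for
    `patience` consecutive epochs."""
    def __init__(self, lr=1e-3, factor=0.2, patience=10):
        self.lr, self.factor, self.patience = lr, factor, patience
        self.best = float("-inf")
        self.wait = 0

    def step(self, val_acc):
        if val_acc > self.best:          # improvement: reset the counter
            self.best, self.wait = val_acc, 0
        else:                            # stagnation: count and maybe decay
            self.wait += 1
            if self.wait >= self.patience:
                self.lr *= self.factor
                self.wait = 0
        return self.lr
```

In Keras, the equivalent configuration would be `ReduceLROnPlateau(factor=0.2, patience=10)` monitoring validation accuracy.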

Ablation studies

To verify the impact of DA on model performance, we conducted comparative experiments on all models using both the original and the augmented dataset. The results are shown in Fig. 6. With the augmented dataset, ULNN, IQCLNet, and CLDNN achieved performance improvements of 7.01%, 4.62%, and 4.17% respectively, with the proposed ULNN model showing the most significant gain. This indicates that ULNN’s capacity to capture features exceeds what the original dataset can express. In contrast, the improvements of MCNet, MCLDNN, and FastMLDNN were relatively limited, at 1.25%, 0.72%, and 0.71% respectively, suggesting that the generalization capabilities of these three networks on the original dataset were already close to optimal, so DA had a smaller effect. Overall, rotational phase augmentation enhances model performance for two main reasons: first, DA provides richer feature expressions, helping the model capture more subtle features; second, the data samples themselves contain phase offsets37, and rotational phase augmentation helps to correct them, thereby improving the model’s ability to process the data.

Fig. 6

The classification performance of different networks with DA and without DA.

In the ablation study, we first validated the effectiveness of the dual-branch structure in both the DSECA and DSFFM modules. The two modules employ the same dual-branch strategy: the GAP branch extracts average features from the feature maps and the GMP branch extracts salient features, which are then concatenated into fused features. To verify the efficacy of the dual-branch structure, we conducted experiments with single-branch and dual-branch structures in each module and calculated the average recognition accuracy over the 11 classes of modulated signals. Note that while one module was being ablated, the other module retained the dual-branch structure. As shown in Table 2, the dual-branch structure achieved performance improvements over the single-branch structure in both the DSECA and DSFFM modules. This is because the dual-branch structure extracts features from different perspectives, providing richer information and enhancing the model’s expressive power.

Table 2 Ablation experiments on dual-branch structure.

Next, we evaluated the impact of the DSECA and DSFFM modules on network performance. The experiment involved the following network variants: (1) Base Model (BM): a complex convolution layer followed by six cascaded LCM modules, with a SoftMax layer at the final stage for classification. Compared with the complete model, the base model omits the DSECA and DSFFM modules; all other structural features and parameters remain identical. (2) BM + DSECA: the base model with the DSECA module added. (3) BM + DSFFM: the base model with the DSFFM module added. (4) BM + Both: the complete model including both the DSECA and DSFFM modules.

In Fig. 7, we compared the average recognition accuracy of these four network variants. The results show that the average recognition accuracy of the BM, BM + DSECA, BM + DSFFM, and BM + Both networks were 60.84%, 61.94%, 61.27%, and 62.83% respectively. The experimental results indicate that each module in the ULNN has a positive gain effect on network performance, with the greatest gain achieved when both modules are present, reaching an increase of 1.99%.

Fig. 7

Ablation experiments on different modules of the ULNN network.

Comparison of model performance

As shown in Fig. 8, we tested all networks on the RML2016.10A dataset and ranked their average recognition accuracy in descending order: FastMLDNN: 63.01%, ULNN: 62.83%, MCLDNN: 62.22%, IQCLNet: 61.17%, CLDNN: 60.18%, and MCNet: 58.43%. The mean accuracy across the six networks is 61.30%. Our proposed ULNN network thus exceeds the average by 1.53% and outperforms MCNet, the lowest-accuracy network, by 4.4%, though a 0.18% gap remains to the best-performing FastMLDNN. From Fig. 8, it is evident that FastMLDNN significantly outperforms the other networks at low SNR, from −20 dB to −8 dB. This superior performance can be attributed to its use of a novel central distance expansion loss function combined with the cross-entropy loss during training.

Fig. 8

The classification performance of different networks on RML2016.10 A.

Additionally, in Fig. 9 we present the validation loss for all models over 150 epochs. MCLDNN reached its minimum loss at around 20 epochs, but the loss then showed a gradually increasing trend in subsequent training rounds. This indicates that MCLDNN suffers from overfitting, which also explains why its performance improvement in the DA ablation experiments was almost negligible. The other networks did not exhibit overfitting, with their losses converging steadily as training progressed.

Fig. 9

The validation loss of different networks in 150 epochs.

To better analyze network performance, we divided the SNR range of the dataset into two intervals: low SNR ([−20 dB, −2 dB]) and high SNR ([0 dB, 18 dB]). Table 3 presents the average recognition accuracy of all networks within these two ranges. In the low SNR range, FastMLDNN achieved the highest recognition accuracy at 34.78%, with the ULNN network second at 33.77%; the remaining networks did not exceed 33%. As mentioned earlier, FastMLDNN’s improved loss function gives it superior anti-interference performance at low SNR.

In the high SNR range, there was a significant difference in the performance of the networks. MCLDNN achieved the highest recognition accuracy at 91.97%. Our network, ULNN, followed closely behind, also showing excellent performance at 91.89%. In contrast, MCNet had an accuracy of only 84.63%. Combining this with the results from the previous ablation experiments regarding DA, it can be inferred that the MCNet network may have underfitting issues. Overall, whether under low SNR or high SNR conditions, our ULNN network demonstrated superior performance, giving it a greater advantage in complex electromagnetic environments.

Table 3 The average classification performance of different networks on different SNR.

Complexity analysis

For the AMC task in resource-constrained scenarios, we conducted a detailed comparison of the models’ parameter counts, weight sizes, single-instance inference times, and floating point operations (FLOPs). As shown in Table 4, the number of parameters in our proposed ULNN is the lowest among all networks, at 8,825 in total. The required storage space is only 34.48 KB, a mere 2.17% of that of the largest network, MCLDNN, while ULNN achieves higher recognition accuracy than MCLDNN. Although FastMLDNN reached the highest average recognition accuracy, its weight size, inference time, and FLOPs are significantly larger than those of ULNN.

However, in terms of single-instance inference time, our network is not the fastest; MCNet performs better in this respect. We can also observe that inference time does not strictly correlate with the number of parameters, because inference time depends not only on parameter count but also on computational volume, memory access volume, the hardware platform, and other factors. Therefore, even though our model has the fewest parameters, the operational characteristics of its depth-wise separable convolutions mean that inference time may be constrained by memory bandwidth and data input/output (IO) limitations. Moreover, the MCLDNN, IQCLNet, and CLDNN networks all contain LSTM structures, which must process each element of the sequence in turn (with computation growing with sequence length) and involve multiple gating structures and cell states requiring multistep computation; hence, the inference times of these three networks are generally longer.

In terms of FLOPs, the computational load of ULNN is only 0.0007, at least two orders of magnitude lower than that of the other networks. Consequently, ULNN requires only a fraction of the computational resources typically needed by the other networks during both training and inference. This not only reduces hardware overhead but also has the potential to decrease energy consumption. In summary, ULNN achieves a good balance between accuracy and efficiency, addressing the central trade-off in lightweight AMC. Moreover, ULNN has low requirements for storage space and computing power. These advantages make it more competitive in practical applications, especially in resource-constrained environments.

It should be noted that, owing to the limited onboard storage and computational resources of drones, a UAV modulation recognition scheme based on a fog computing communication architecture may encounter challenges such as energy constraints, communication delays, network stability, and security issues when UAVs act as fog nodes.

Table 4 The complexity of different networks.

Conclusion

This paper proposes a lightweight and efficient ULNN for AMC tasks. We first verified the validity of the DA, DSFFM, and DSECA modules through ablation experiments. In comparison experiments with other benchmark networks, the proposed ULNN network showed excellent performance at both low and high SNR. In addition, our network achieves high recognition accuracy while remaining lightweight. The experimental results show that ULNN is an efficient neural network with strong anti-jamming ability, suitable for resource-constrained environments. Finally, in the DA ablation experiments, we observed that data augmentation affected networks of different complexity differently; how to balance model complexity and sample size is therefore a direction worth studying, as it can help us design more accurate and efficient models.