Abstract
Deep Neural Networks (DNNs) have been shown to be vulnerable to adversarial examples, significantly hindering the development of deep learning technologies in high-security domains. A key challenge is that current defense methods often lack universality, as they are effective only against certain types of adversarial attacks. This study addresses this challenge by focusing on analyzing adversarial examples through changes in model attention, and classifying attack algorithms into attention-shifting and attention-attenuation categories. Our main novelty lies in proposing two defense modules: the Feature Pyramid-based Attention Space-guided (FPAS) module to counter attention-shifting attacks, and the Attention-based Non-Local (ANL) module to mitigate attention-attenuation attacks. These modules enhance the model’s defense capability with minimal intrusion into the original model. By integrating FPAS and ANL into the Wide-ResNet model within a boosting framework, we demonstrate their synergistic defense capability. Even when adversarial examples are embedded with patches, our models showed significant improvements over the baseline, enhancing the average defense rate by 5.47% and 7.74%, respectively. Extensive experiments confirm that this universal defense strategy offers comprehensive protection against adversarial attacks at a lower implementation cost compared to current mainstream defense methods, and is also adaptable for integration with existing defense strategies to further enhance adversarial robustness.
Introduction
Deep learning technologies have catalyzed transformative advances across various domains, including image classification1, object detection2, semantic segmentation3, and natural language processing4, demonstrating remarkable efficacy in addressing complex challenges. However, adversarial examples expose critical vulnerabilities in deep neural networks (DNNs), undermining their robustness, defined as the model’s ability to maintain high performance under adversarial or noisy inputs, and reliability, reflecting the model’s consistent behavior across diverse conditions. This fragility was first identified by Szegedy et al.5 in 2013, who noted DNNs’ susceptibility to small but intentional input perturbations that disrupt their input-output relationships. The implications of this weakness are particularly serious for high-stakes applications, such as autonomous driving and medical diagnostics, where model accuracy and reliability are paramount, and adversarial interference could result in severe consequences, raising ethical and safety concerns.
The widespread threat posed by adversarial examples requires an in-depth investigation into their impact on DNNs6,7. Although numerous defenses have been proposed, adversarial training8,9,10,11,12,13 remains a primary approach for enhancing model robustness by incorporating adversarial examples into the training set. Building on this, Lei et al.14 proposed Generalized Adversarial Training (GAT) to extend robustness from simple \(l_{p}\)-ball perturbations to complex semantic changes, including variations in hue, saturation, brightness, contrast, and rotation. Yang et al.15 introduced Data-Adaptive Adversarial Training (DAAT), which adapts perturbation sizes through a calibration network, improving robustness and accuracy. Jin et al.16 furthered adversarial training by adding random noise to network weights, flattening the loss landscape to enhance both robustness and clean accuracy. Despite their effectiveness against specific attacks, these methods still face challenges in generalizing to novel or unforeseen adversarial methods. Input preprocessing techniques, such as image compression and filtering17,18,19,20,21, provide additional defense options but may incur latency or reduce input data quality, potentially affecting model performance on legitimate samples. The GridMask22 method, for instance, generates a binary mask that disrupts adversarial perturbations by masking parts of the input image, thus reducing attack efficacy. Other defense methods, such as defensive distillation23,24, enhanced loss functions25,26,27, and model structure optimization, have demonstrated some effectiveness against previously unseen adversarial attacks. Adversarial Logit Pairing (ALP)28, within adversarial training, aligns clean and adversarial example logits, adding a regularization term that encourages similar embeddings for both versions of the same sample. This alignment helps the model better represent data structures internally. Similarly, the TRADES29 method partitions the adversarial loss into the classification error of clean samples and a trade-off term representing the KL divergence between clean and adversarial logits. Unlike ALP, TRADES optimizes the classification of clean samples and the robustness of adversarial logits without directly targeting adversarial accuracy, effectively balancing robustness and clean accuracy. Additionally, the Attention-based Adversarial Defense (AAD)30 method addresses adversarial attacks by aligning visual attention between adversarial and clean samples, ensuring focus remains on target objects and reducing feature divergence. AAD also selectively incorporates moderately challenging adversarial examples based on observed attention shifts to further strengthen model robustness.
A key challenge facing current defense methodologies is their limited generalizability, primarily due to an incomplete understanding of how adversarial attacks affect model behavior. While some defenses are effective against specific attack types, they often fail against novel or unforeseen strategies. This limitation arises because most defenses are designed to counter specific perturbation patterns without fully addressing the broader effects these attacks have on model performance. Consequently, their effectiveness declines when confronted with evolving adversarial techniques that exploit previously unrecognized vulnerabilities. This underscores the urgent need for defense strategies that are not only robust but also adaptable to address the diverse and rapidly advancing spectrum of adversarial threats.
Wu et al., for example, observed changes in class activation maps when they applied the Fast Gradient Sign Method (FGSM) to attack ResNet50 on ImageNet, finding that adversarial attacks cause the model’s attention to shift to incorrect target regions. This finding motivated us to investigate adversarial attack impact from an attention perspective. Previous studies have shown that attention-based defenses or attacks can significantly influence model performance. Deng et al.31 explored a spatially transformed attack method based on attention, highlighting the importance of focusing on meaningful areas to create effective adversarial examples. Chen et al.32 investigated how attention mechanisms can compress adversarial perturbations, suggesting that attention-based defenses can resist adversarial examples by concentrating on essential image regions. Zhu et al.33 enhanced adversarial example transferability using an attention mechanism to disrupt distinct features, showcasing the effectiveness of attention-guided strategies in adversarial contexts. Wang et al.34 used attention mechanisms to generate adversarial patches with strong generalization, underscoring attention’s role in creating more effective attacks. Chen et al.35 introduced an attack method targeting DNNs’ attention to increase example transferability, emphasizing attention’s vulnerability to adversarial manipulations. He et al.36 proposed AEAED, integrating multi-head self-attention in autoencoders, demonstrating its utility in capturing subtle adversarial perturbations across multiple scales, and outperforming baseline detection models in adversarial example detection. Yu et al.37 introduced TGA-ZSR, employing text-guided attention to improve zero-shot robustness in vision-language models, effectively aligning adversarial and clean attention distributions while maintaining generalization to clean samples. Feng et al.38 proposed a Dual Attention-Guided Method to create cross-task adversarial examples by capturing overlapping discriminative regions, while Li et al. introduced a post-hoc soft attention mechanism to expand adversarial examples by altering separate semantic units. Liu et al.39 developed the Content Feature Optimization Attack (CFOA) targeting content features in white-box models, illustrating the utility of attention-guided feature perturbations in enhancing adversarial example transferability.
Fig. 1 Visualization of image attention under different attack algorithms. Columns 1 and 4 show clean examples, columns 2 and 5 show attention maps for clean examples, and columns 3 and 6 show attention maps for adversarial examples. The labels below the images indicate model predictions: black denotes correct predictions and red denotes incorrect predictions.
Notably, as depicted in the third column of Fig. 1, we discern that adversarial attacks, irrespective of their type, induce a progressive diversion of the model’s attention from the accurately classified entities towards non-pertinent objects or background areas, corroborating the findings of Wu et al. Furthermore, our research uncovers a novel phenomenon, as demonstrated in the sixth column of the figure, where certain adversarial examples manage to breach the defenses while maintaining the model’s attention on the target object, a stark contrast to the behavior observed with clean samples.
To overcome the limitations in the generalizability of current defense methods, this research aims to develop a more comprehensive and universally applicable defense strategy. Specifically, we seek to address this gap by investigating the role of model attention in the context of adversarial attacks. Attention mechanisms40,41,42,43,44,45,46, which are crucial in directing a model’s computational focus, present a promising avenue for creating countermeasures against adversarial manipulation. By carefully analyzing the shifts and attenuations in model attention induced by adversarial examples, we aim to uncover the strategies employed by these attacks, thus providing a foundation for targeted defense mechanisms. Through an extensive examination of the attention views across a wide range of adversarial examples, we categorize attack algorithms into two primary classes based on their effects: attacks that induce attention shifts and those that cause attention attenuation.
Compared to other attention-based defense methods, the approach proposed in this paper performs a more detailed categorization during the attack analysis phase, rather than treating all attacks as a single attention bias issue. This refined categorization facilitates the development of more targeted defense strategies, addressing the attention distribution issues induced by different types of adversarial attacks.
In response to these categorizations, we propose two tailored defense modules: the Feature Pyramid-based Attention Space-guided (FPAS) module, specifically designed to counteract attention-shifting attacks by spatially retracting the adversarial attention shifts, and the Attention-based Non-Local (ANL) module, crafted to enhance the model’s focus on critical features, thus mitigating attention-attenuation attacks. By integrating FPAS and ANL into the Wide-ResNet model within a boosting framework, we demonstrate their synergistic defense capability. The empirical validation of these modules highlights their effectiveness in reinforcing the resilience of DNNs against a broad spectrum of adversarial attacks. Moreover, our proposed defense modules can be integrated with existing defense strategies as components to further improve adversarial robustness.
In summary, our key contributions are as follows:
- We propose a novel classification method of adversarial attacks by analyzing the changes in model attention, distinguishing between attacks that induce attention shifts and those that lead to attention attenuation. This classification method provides a foundational understanding for developing targeted defense mechanisms.
- We propose the Feature Pyramid-based Attention Space-guided (FPAS) module to specifically counter attention-shifting attacks, effectively re-directing the model’s focus to the relevant features of the input.
- We present the Attention-based Non-Local (ANL) module to address attention-attenuation attacks, enhancing the model’s emphasis on critical features and thereby boosting its robustness.
The remainder of this paper is organized as follows. Section 2 categorizes attacks into two types based on the changes in model attention caused by various attack algorithms. Section 3 endeavors to delve into the mechanisms underlying these two attack phenomena from a model-based perspective and proposes corresponding defense strategies. Section 4 presents a thorough evaluation of our defense approach, affirming its efficacy. Section 5 concludes the paper, summarizing our findings and underscoring their significance for advancing the field of adversarial machine learning. Finally, Section 6 provides a comprehensive discussion of the study, highlighting both limitations and failure cases of the proposed defense strategies.
Classification of attack effects based on attention view
In this section, we categorize attacks into two types based on the changes in model attention caused by various attack algorithms: Attention-shifting Attacks and Attention-attenuation Attacks. Understanding these distinct attack types is crucial for developing effective defense mechanisms and enhancing the robustness of machine learning models.
The attention matrix for clean samples is denoted as \(Att_{clean}\left( x_{i} \right) \in {[} 0,1 {]}\) and for adversarial examples as \(Att_{t}\left( x_{i} \right) \in {[} 0,1 {]}\), where t represents the attack algorithm. A softmax normalization is applied to the difference between the attention matrices of the adversarial and clean samples. The variance of the resulting matrix is then computed, serving as a measure of the degree of attention shift brought about by the adversarial attack. The computation procedure is illustrated in Eq. (1):
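One plausible form of Eq. (1), reconstructed from the description above (the exact formulation in the original may differ), is:

\[ \sigma_{t}\left( x_{i} \right) = \mathrm{Var}\Big( \mathrm{softmax}\big( Att_{t}\left( x_{i} \right) - Att_{clean}\left( x_{i} \right) \big) \Big) \quad (1) \]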
We collected the attack results of the FGSM, I-FGSM, PGD, MI-FGSM, DI2-FGSM, TI-FGSM, Deepfool, C&W, and Square algorithms on the Wide-ResNet model for the ImageNet dataset. Detailed data can be found in Table 1.
We use the average variance \({\bar{\sigma }}_{t}\) of the attention change matrix as the threshold. If \(\sigma _{t} >{\bar{\sigma }} _{t}\), it indicates that the adversarial perturbation on this sample has dispersed the model’s attention. Conversely, if \(\sigma _{t} \le {\bar{\sigma }} _{t}\) , it indicates that although the adversarial perturbation on this sample can break the Wide-ResNet model, the attention location has not undergone significant dispersion. From Table 1, it can be observed that among all the adversarial examples that break the Wide-ResNet model, a significant proportion of them have attention change variances exceeding the average variance value. This indicates that the model’s attention has shifted considerably on these samples. We define this kind of attention change as attention-shifting.
There is another subset of samples where the attention change variance has not undergone significant alteration, indicating that the attention location on these samples has not changed significantly. This phenomenon is particularly evident in query-based attack algorithms, where the proportion of such samples in the Deepfool, C&W, and Square algorithms exceeds 50%. We next investigate the numerical changes in attention values in the key regions of samples where \(\sigma _{t} \le {\bar{\sigma }} _{t}\). We employ masking techniques to extract the important attention regions using thresholds \(\alpha\) of 0.5, 0.6, 0.7, and 0.8, and then examine the degree to which adversarial attacks have attenuated attention in these critical areas. The computation procedure is illustrated in Eqs. (2) and (3):
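A plausible reconstruction of Eqs. (2) and (3), assuming a binary mask over regions whose clean attention exceeds the threshold \(\alpha\) and a relative attenuation measure over that mask (the original formulation may differ):

\[ M_{\alpha}\left( x_{i} \right) = \mathbb{1}\left[ Att_{clean}\left( x_{i} \right) \ge \alpha \right] \quad (2) \]

\[ Deg_{t}\left( x_{i} \right) = \frac{\sum M_{\alpha}\left( x_{i} \right) \odot \left( Att_{clean}\left( x_{i} \right) - Att_{t}\left( x_{i} \right) \right)}{\sum M_{\alpha}\left( x_{i} \right) \odot Att_{clean}\left( x_{i} \right)} \quad (3) \]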
The performance of each attack algorithm in terms of attention change on the ImageNet dataset is depicted in Table 2. While the focus areas of the model in these adversarial examples do not show positional shifts when compared to clean samples, the table indicates that adversarial perturbations lead to a numerical attenuation in the model’s attention to significant regions within the samples. On the ImageNet dataset, the average attenuation surpasses 13%. We define this type of attention change as attention-attenuation.
Based on the above analysis, we categorize the current mainstream adversarial attack algorithms into two types from the perspective of model attention change:
1. Attention-shifting Attacks: These attacks shift the model’s attention from the correct object region to areas unrelated to the model’s decision. The focal regions of the model experience a positional shift.
2. Attention-attenuation Attacks: Attacks of this type do not induce a shift in the model’s attention position. However, although the model’s focus remains on the correct object, the degree of attention diminishes due to the influence of adversarial perturbations.
Defense strategy
In this section, we endeavor to explain the mechanisms underlying these two attack phenomena from a model-based perspective in order to devise effective defense strategies. For each type of attack, we present our respective defense strategies and design corresponding defense modules. By understanding the fundamental principles that govern these attacks, we aim to mitigate their impact and enhance the robustness of our deep learning models.
Analysis based on model perspective
We conduct t-SNE visualization on the high-dimensional features of clean samples from CIFAR-10 on the Wide-ResNet model, as illustrated in Fig. 2.
We observe that in the high-dimensional space, there is a significant separation between different classes. Additionally, besides the clustered points corresponding to the 10 classes, there are scattered points that are not close to any specific class cluster but are instead distributed relatively randomly in the plane. Subsequently, we apply t-SNE visualization to the adversarial examples generated by nine different attack algorithms. Based on the two types of adversarial attack effects summarized above, we separate all sample points in the t-SNE space into two sets, where \(x_{k}^{i,j} \in R_{t-SNE}\), \(\left( i,j \right)\) represents the coordinates of a point, and k denotes either attention attenuation or attention shift. We create a distance matrix E for each set by calculating the distances \(d\left( m,n \right)\) between any two locations under the labels of the ten categories, as indicated by Eqs. (4) and (5):
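A plausible reconstruction of Eqs. (4) and (5), assuming Euclidean distances between point coordinates in the t-SNE plane (the original formulation may differ):

\[ d\left( m,n \right) = \sqrt{\left( i_{m} - i_{n} \right)^{2} + \left( j_{m} - j_{n} \right)^{2}} \quad (4) \]

\[ E = \left[ d\left( m,n \right) \right]_{m,n} \quad (5) \]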
Equations (6) and (7) demonstrate how to compute the entropy \(H\left( x\right)\) using the distance matrix E as a probability distribution,
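A plausible reconstruction of Eqs. (6) and (7), assuming the entries of E are normalized into a probability distribution before the Shannon entropy is taken (the original formulation may differ):

\[ p\left( m,n \right) = \frac{d\left( m,n \right)}{\sum_{m',n'} d\left( m',n' \right)} \quad (6) \]

\[ H\left( x \right) = -\sum_{m,n} p\left( m,n \right) \log p\left( m,n \right) \quad (7) \]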
The average entropy values of the sample points are calculated separately for the two sets of sample points segregated by class labels, resulting in bar charts of entropy values for the attention-shifting and attention-attenuation sample point sets generated by different attack methods, as shown in Fig. 3.
The entropy values of the attention-shifting sample set are consistently lower than those of the attention-attenuation sample set across almost all attack methods. The bar chart depicting entropy in Fig. 3 demonstrates that, within the high-dimensional feature space of Wide-ResNet, adversarial attacks result in a more concentrated distribution of clean sample points with attention shift. In contrast, clean sample points affected by attention attenuation are comparatively dispersed in space. This observation validates our hypothesis: samples affected by attention shifts during adversarial attacks exhibit reduced intra-class distances and a denser distribution within the model’s high-dimensional feature space. The attack causes the model’s attention to shift from the correct class cluster to other class clusters. In contrast, samples impacted by attention attenuation tend to display greater intra-class distances, with sample points dispersed across various class clusters. This scattering effect, induced by the attacks, weakens the model’s attention in the high-dimensional space, causing it to diffuse to class clusters proximate to the sample point, and consequently, leading to misclassifications.
Building on the aforementioned findings, we have developed defense strategies specifically designed to counteract attention-shifting and attention-attenuation attacks, detailed as follows:
Feature Pyramid-based Attention Space-guided Defense Module (FPAS): This module is designed to counteract attention-shifting attacks, which fundamentally alter the model’s focal points within the high-dimensional feature space. Incorporating defensive mechanisms at the early stages of the model’s feedforward network is essential to mitigate such attacks, employing strategies such as disturbance removal and perturbation-distribution disruption to refocus attention on the appropriate areas of the image. To achieve this, we introduce the Feature Pyramid-based Attention Space-guided Defense Module. This module disrupts the distribution of adversarial perturbations using a feature pyramid while simultaneously extracting higher-dimensional image features. Additionally, it incorporates a channel attention mechanism to guide the model towards more relevant image features.
Attention-based Non-Local Defense Module (ANL): The fundamental cause of attention-attenuation attacks is the widespread dispersion of the dataset within the model’s high-dimensional feature space, leading to excessively narrow decision boundaries among various classes. A defense strategy against such attacks should therefore focus on exploiting internal information within image features to improve the model’s aggregation of high-dimensional characteristics for each class. In this context, we propose a defense module called Attention-based Non-Local (ANL), leveraging a self-attention mechanism to uncover dependencies within image features and strengthen the model’s focus on crucial regions of useful features. This module aims to bolster the model’s resilience against attention-attenuation attacks by enriching the amalgamation of high-dimensional features, thus augmenting the model’s focus on key feature regions.
Feature pyramid-based attention space-guided defense module (FPAS)
The feature pyramid structure was initially developed to enable neural network models to handle computer vision tasks on multi-scale images. By constructing pyramidal layers at different levels, feature pyramids can adapt to image features of varying scales, thereby maintaining a certain degree of scale invariance. This multi-scale feature fusion strategy has been widely applied in several fields, including video summarization47, UAV image object detection48,49, remote sensing image object detection50,51 and video saliency prediction52,53, demonstrating its effectiveness in improving model performance. For example, the AMFM model proposed by Zhang et al.54 optimizes the context capture and fusion process in video summarization through a multi-granularity fusion strategy. Similarly, CFANet enhances the efficiency and accuracy of UAV image object detection by aggregating cross-layer features, while the FFAGRNet model addresses scale variation issues through full-scale feature aggregation and grouped feature reconstruction. These studies illustrate that multi-scale feature fusion not only enhances the model’s perception of objects at different scales but also improves its robustness to complex backgrounds and random noise. Given the strengths of the feature pyramid, this study incorporates this structure into the design of the FPAS module.
The GridMask defense algorithm has proven that misclassifications caused by adversarial perturbations result from the collective impact of perturbations on each pixel of the image. Therefore, introducing dilated blocks in convolution can disrupt the distribution of adversarial perturbations, achieving a defensive effect. FPAS constructs a feature pyramid structure using dilated convolutions with different dilation rates to capture multi-scale features. Dilated convolution increases the receptive field by enlarging the spacing between sampling points in the convolutional kernel, avoiding the information loss caused by traditional feature pyramid structures that rely on upsampling and downsampling operations. Adversarial perturbations are small-scale local disturbances that induce changes at specific scales. The feature pyramid structure integrates feature layers from different scales, enabling the model to rely on multi-level information rather than a single scale for decision-making. This effectively disrupts the distribution of adversarial perturbations.
The overall structure of the FPAS module is illustrated in Fig. 4, consisting of a pyramid module and a channel attention module. The specific steps of the pyramid module are as follows:
1. Perform dilated convolution operations with three sets of 3\(\times\)3 kernels, each with dilation rates of 1, 3, and 5, resulting in three feature maps, denoted as \(f_{1}\), \(f_{2}\), and \(f_{3}\). Ensure that \(f_{1}\), \(f_{2}\), and \(f_{3}\) have the same dimensions by applying the proper strides and padding.
2. To maintain the continuity of image features while disrupting the distribution of adversarial perturbations, employ 1\(\times\)1 convolutions to smooth and fuse adjacent features, obtaining feature maps \(p_{1}\), \(p_{2}\), and \(p_{3}\).
3. Concatenate \(p_{1}\), \(p_{2}\), and \(p_{3}\) along the channel dimension and feed the resulting tensor into the channel attention module.
The dilation rates (1, 3, and 5) in the FPAS module are chosen to extract features at multiple scales, thereby enhancing the model’s robustness against adversarial perturbations. Lower dilation rates (e.g., 1) capture fine-grained local features, while higher dilation rates (e.g., 3 and 5) expand the receptive field to capture more global information. This multi-scale feature extraction provides a comprehensive representation of the spatial information in the image’s true features, counteracting subtle perturbations in adversarial examples. By using varied dilation rates, the FPAS module disrupts the distribution of adversarial perturbations while preserving image content, making attacks less effective across multi-scale features.
Due to the possibility of redundant and less informative features resulting from the fusion and concatenation operations in the pyramid module of FPAS, the channel attention module plays a crucial role in not only allocating channel weights for the feature maps but also eliminating these redundant features that contribute little to the model’s decision-making. The goal is to guide the model to focus more on the feature maps that are more useful for decision-making and have lower susceptibility to adversarial perturbations. The channel structure of SENet inspires the design of the channel attention module. The feature maps output from the feature pyramid layer are concatenated along the channel dimension and undergo global average pooling, compressing each channel into a single feature value. Subsequently, these values pass through a fully connected layer, allowing the model to learn channel-wise feature dependencies. Finally, the scaled channel-wise attention weights obtained from the fully connected layer are applied to the original input feature maps through channel-wise multiplication.
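To make the module structure concrete, the following PyTorch sketch illustrates one way the pyramid and channel attention stages described above could be assembled. The channel counts, the SE reduction ratio, and the pairwise fusion of adjacent branches are illustrative assumptions rather than the paper's released implementation.

```python
import torch
import torch.nn as nn

class FPAS(nn.Module):
    """Sketch of the Feature Pyramid-based Attention Space-guided module.

    Assumed choices (not from the paper's code): output channel width,
    SE reduction ratio, and fusing adjacent pyramid branches pairwise.
    """
    def __init__(self, in_ch=3, out_ch=64, reduction=16):
        super().__init__()
        # Three 3x3 dilated convolutions with dilation rates 1, 3, 5.
        # padding == dilation keeps the spatial size identical across branches.
        self.branch1 = nn.Conv2d(in_ch, out_ch, 3, padding=1, dilation=1)
        self.branch2 = nn.Conv2d(in_ch, out_ch, 3, padding=3, dilation=3)
        self.branch3 = nn.Conv2d(in_ch, out_ch, 3, padding=5, dilation=5)
        # 1x1 convolutions that smooth and fuse adjacent pyramid levels.
        self.fuse12 = nn.Conv2d(2 * out_ch, out_ch, 1)
        self.fuse23 = nn.Conv2d(2 * out_ch, out_ch, 1)
        self.fuse3 = nn.Conv2d(out_ch, out_ch, 1)
        # SENet-style channel attention over the concatenated pyramid output.
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(3 * out_ch, 3 * out_ch // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(3 * out_ch // reduction, 3 * out_ch),
            nn.Sigmoid(),
        )

    def forward(self, x):
        f1, f2, f3 = self.branch1(x), self.branch2(x), self.branch3(x)
        p1 = self.fuse12(torch.cat([f1, f2], dim=1))
        p2 = self.fuse23(torch.cat([f2, f3], dim=1))
        p3 = self.fuse3(f3)
        feats = torch.cat([p1, p2, p3], dim=1)        # (B, 3*out_ch, H, W)
        w = self.fc(self.pool(feats).flatten(1))      # learned channel weights
        return feats * w.unsqueeze(-1).unsqueeze(-1)  # re-weight channels

# Example: out = FPAS()(torch.randn(2, 3, 224, 224))  -> shape (2, 192, 224, 224)
```

In this sketch, the module replaces an early convolutional layer, so its output feeds directly into the remaining Wide-ResNet stages after an appropriate channel adjustment.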
Attention-based non-local defense module (ANL)
The root cause of attention attenuation in attacks is the model’s failure to perfectly differentiate between feature information from different classes in the high-dimensional feature space. We hypothesize that one possible reason for the small inter-class distances in high-dimensional features is the use of small receptive field sizes during model learning. These small fields only consider local region information, lacking a holistic understanding of the features of salient objects in the image. The Non-Local algorithm in the self-attention mechanism can introduce global information, enabling the model to capture global features in hidden layers.
In the convolutional layers of the network model, the input image feature x is processed through convolution and split into three branches, denoted as query(x), key(x), and value(x). The query(x) and key(x) branches compute the importance matrix of feature information. In this study, we perform an inner product operation on the query(x) and key(x) branches, obtaining the internal feature importance score f(x) for the input feature x. The calculation is shown in Eq. (8):
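One plausible form of Eq. (8), treating query(x) and key(x) as flattened spatial feature matrices whose inner product yields the pairwise importance scores (the original notation may differ):

\[ f\left( x \right) = query\left( x \right)^{\top} \, key\left( x \right) \quad (8) \]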
Using the softmax function to normalize f(x) yields the self-attention matrix describing the internal feature correlations within the image. Finally, after performing an inner product operation between the self-attention matrix and the value(x) branch and adding it to the residual connection branch, we obtain the output feature y, which has enhanced internal feature correlations through the Non-Local module. The calculation is shown in Eq. (9):
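Correspondingly, a plausible form of Eq. (9), with the softmax-normalized attention applied to value(x) and added back through the residual branch:

\[ y = \mathrm{softmax}\left( f\left( x \right) \right) \, value\left( x \right) + x \quad (9) \]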
To enhance the model’s ability to extract image features from the self-attention matrix under the interference of adversarial perturbations and to strengthen the model’s focus on key regions, we incorporate adversarial training methods during model training. Throughout the training process, corresponding clean samples and adversarial examples are input into the model, utilizing the Non-Local module to explore the correlation of internal features within each sample. The overall structure diagram of the ANL module is illustrated in Fig. 5.
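A minimal PyTorch sketch of a Non-Local block of this kind is given below; the 1\(\times\)1 projections, the channel-reduction factor, and returning the self-attention matrix (so that it can later be compared between clean and adversarial inputs) are assumptions made for illustration rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ANLBlock(nn.Module):
    """Sketch of the Attention-based Non-Local block."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        inner = channels // reduction
        self.query = nn.Conv2d(channels, inner, 1)
        self.key = nn.Conv2d(channels, inner, 1)
        self.value = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)  # (B, HW, C')
        k = self.key(x).flatten(2)                    # (B, C', HW)
        v = self.value(x).flatten(2).transpose(1, 2)  # (B, HW, C)
        att = F.softmax(torch.bmm(q, k), dim=-1)      # importance scores, then softmax
        y = torch.bmm(att, v).transpose(1, 2).reshape(b, c, h, w)
        return x + y, att                             # residual output and attention matrix
```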
To guide the model to enhance attention on regions in the image that are positively correlated with the model’s decision, we draw inspiration from ALP and introduce an attention loss into the regularization term of the model training loss function: \(\beta L_{a}(x_{clean},x_{adv})\), where \(x_{clean}\) represents the image features of clean samples, \(x_{adv}\) represents the image features of adversarial examples, and \(L_{a}(\cdot)\) is the similarity function of the self-attention matrix, representing the distance between the attention maps generated by clean and adversarial examples in the Non-Local module. In our experiments, we treated \(\beta\) as a hyperparameter and tested multiple values (0.001, 0.005, 0.01, and 0.05); setting \(\beta\) to 0.01 effectively reduced overfitting while enhancing the model’s overall performance.
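Concretely, the resulting training objective can be sketched as follows, assuming the network returns both logits and the ANL self-attention matrix, and using a mean-squared distance as one possible choice for \(L_{a}\) (the exact similarity function is not specified here):

```python
import torch.nn.functional as F

def training_loss(model, x_clean, x_adv, labels, beta=0.01):
    # model is assumed to return (logits, attention) with the ANL attention matrix.
    logits_clean, att_clean = model(x_clean)
    logits_adv, att_adv = model(x_adv)
    # Classification terms on the paired clean and adversarial samples.
    ce = F.cross_entropy(logits_clean, labels) + F.cross_entropy(logits_adv, labels)
    # Attention-consistency regularizer: beta * L_a(x_clean, x_adv).
    l_a = F.mse_loss(att_adv, att_clean)
    return ce + beta * l_a
```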
Experimental results and analysis
Experiment settings
Datasets: To demonstrate the generality of the proposed defense module, all defense-related experiments in this study were validated on two publicly available datasets: ImageNet and CIFAR-10. Due to limitations in GPU computational resources and for improved efficiency, this study followed the approach of CMC55, randomly sampling 100 classes from the original 1000 ImageNet classes to create the ImageNet-100 small dataset for experimentation. The ImageNet original training set was split into a training set and validation set in an 8:2 ratio, with the original ImageNet validation set used as the test set in this study. Additionally, the image sizes in the ImageNet dataset were standardized to 224\(\times\)224 during preprocessing in the experiments. The CIFAR-10 dataset consists of 10 classes, and the images are of size 32\(\times\)32, with standard training, validation, and test datasets provided by the official source. Detailed information about the experimental datasets is presented in Table 3.
Hyperparameters: The initial learning rate was set to 0.01, decaying by a factor of 0.1 every 2 epochs, with momentum of 0.9 and weight decay of 1e-4. The dropout rate was 0.3. Adversarial training started from epoch 0, with a max norm of 1 for perturbations and 10 steps for attacks. The SGD optimizer was used with these settings.
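For reference, these hyperparameters correspond to a standard PyTorch setup along the following lines (the model here is a placeholder; the experiments use the Wide-ResNet baselines defined below):

```python
import torch

model = torch.nn.Linear(8, 2)  # placeholder; the experiments train Wide-ResNet baselines
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-4)
# Decay the learning rate by a factor of 0.1 every 2 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.1)
```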
Baseline Models: In this study, the baseline models on both datasets are built upon the Wide-ResNet with parameter adjustments based on the dataset scale. The specific structural parameters for the two baseline models are outlined in Tables 4 and 5.
White-box defense performance
The Wide-ResNet models defined in Tables 4 and 5 are designated as baseline models. The new models obtained by replacing the Conv1 layer in the baseline with the FPAS module are denoted as FPAS models. Models incorporating the ANL module before the global average pooling layer in the baseline are referred to as ANL models. Adversarial training, known for its simplicity and effectiveness, is a fundamental configuration for defense models. The DDN algorithm is integrated into the training process of baseline, FPAS, and ANL models to generate adversarial examples. Fine-tuning involves pairing adversarial and clean samples and training the models for 20 epochs, resulting in defense models denoted as baseline_at, FPAS_at, and ANL_at.
To evaluate the white-box attack resilience of FPAS and ANL, we conduct adversarial attack experiments on ImageNet-100 and CIFAR-10 datasets using five white-box attack algorithms against the baseline_at, FPAS_at, and ANL_at models. The white-box experiments involve 10 iterations for each algorithm, and the defensive capabilities of the models are compared. The experimental results are illustrated in Figs. 6 and 7.
As illustrated in Fig. 6, on the CIFAR-10 dataset the defense capability of the ANL_at model excels during both the rapid escalation and convergence phases of attack strength. For instance, at the 10th attack iteration under I-FGSM, the baseline_at model achieves a recognition accuracy of 30.5%, while the ANL_at model achieves 35.8%. On the ImageNet-100 dataset, except for the PGD experiment group, the ANL_at model outperforms the baseline_at model in the initial stage of rapid attack strength escalation. As the intensity of the attacks stabilizes, the disparity in defensive performance between the models diminishes. Overall, across five different adversarial example sets from both datasets, models with the ANL module exhibit superior defense capabilities compared to the original models in white-box defense.
As illustrated in Fig. 7, the recognition accuracy of the FPAS_at model is consistently higher than that of the baseline_at model during both the rapid growth and convergence phases of attack strength. Although the PGD experiment group on the ImageNet-100 dataset shows a slight exception, where the defense capability of the FPAS_at model is inferior to the baseline_at model in the early rounds of attack iterations, as the attack strength stabilizes, the FPAS_at model’s defense capability becomes superior. In summary, considering five sets of adversarial examples from both datasets, models with the FPAS module effectively enhance resistance to various white-box adversarial attack algorithms.
Black-box defense performance
We pre-trained AlexNet and Vgg19 on the ImageNet-100 and CIFAR-10 datasets using clean samples as the training data, serving as alternative models for attack targets. Subsequently, we applied nine attack algorithms to attack the AlexNet and Vgg19 models, generating nine sets of adversarial examples. Finally, these nine sets of adversarial examples were fed into the baseline, baseline_at, FPAS_at, and ANL_at models, and the recognition accuracies of each model are presented in Tables 6 and 7.
From Tables 6 and 7, it can be observed that the FPAS_at model on the ImageNet-100 dataset improved the recognition accuracy of I-FGSM adversarial examples from 46.3% to 57.7%, and against the Square attack algorithm, the FPAS_at model achieved an accuracy of 68.6%. On the CIFAR-10 dataset, the defense performance of the FPAS_at model was superior to the baseline and baseline_at models. Compared to the baseline model, although the FPAS_at model experienced a 5.5% decrease in accuracy on clean samples, it achieved an improvement of over 20% in accuracy on most adversarial examples.
Tables 6 and 7 reveal that on the ImageNet-100 dataset, PGD attacks lowered the recognition accuracy of the baseline model to 49.7%, while the ANL_at model increased the accuracy by nearly 10% compared to the baseline, reaching 59.4%. Overall, the ANL_at model demonstrated a significantly enhanced ability to resist each attack algorithm compared to the baseline model. When compared to the baseline_at model, which incorporated adversarial training alone, the ANL_at model exhibited higher defense capabilities. Due to the enhancement of the model’s attention on critical areas by the ANL module, even the recognition accuracy on clean samples increased from 72.2% to 74.3%. On the CIFAR-10 dataset, the ANL_at model further improved defense capabilities, achieving around a 5% performance boost against several transfer-based alternative attack algorithms.
As shown in Table 1, adversarial attacks such as I-FGSM, PGD, MI-FGSM, \(\hbox {DI}^2\)-FGSM, and TI-FGSM tend to generate more samples that cause shifts in the model’s attention. In contrast, attacks like FGSM, Deepfool, and C&W produce more samples that lead to attention attenuation. To address these effects, we propose the FPAS module to counteract attention shifts and the ANL module to mitigate attention decay. Consequently, the FPAS_at model demonstrates stronger defenses against I-FGSM, PGD, MI-FGSM, \(\hbox {DI}^2\)-FGSM, and TI-FGSM, whereas the ANL_at model is more effective against FGSM, Deepfool, and C&W attacks. These results confirm the effectiveness of our defense modules.
Additionally, statistical significance testing (p-value) was conducted to assess whether the success rates of the FPAS_at and ANL_at models showed significant improvement over the baseline_at on different adversarial examples. The results indicate that the p-values for FPAS_at and ANL_at on the adversarial example dataset derived from ImageNet-100 are 0.0004 and 1.06e-06, respectively, both considerably below the 0.05 threshold. The p-values for FPAS_at and ANL_at on the adversarial example dataset derived from CIFAR-10 are 0.002 and 0.0005, respectively, both considerably below the 0.05 threshold. These findings confirm that the FPAS_at and ANL_at models achieve statistically significant improvements in success rate over the baseline_at model, demonstrating superior effectiveness in handling adversarial examples.
Attention space guidance effect
We again use the Grad-CAM algorithm56 to visualize the attention of the FPAS model, comparing it with the attention of the baseline model. This is done to assess the defense effect of FPAS against attention-shifting attacks from the perspective of changes in attention spatial positions. Clean samples and nine sets of adversarial examples are separately fed into the baseline and FPAS models. The generated attention matrices by Grad-CAM are denoted as \(Att^{baseline}_{clean}\), \(Att^{FPAS}_{clean}\), \(Att^{baseline}_{adv}\) and \(Att^{FPAS}_{adv}\). We focus on the attention given by the models to the core regions of the images. Therefore, using a threshold of 0.5, we binarize the attention matrices:
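A plausible form of Eq. (10), applying the 0.5 threshold element-wise to each attention matrix (the original formulation may differ):

\[ Att'\left( x_{i} \right) = \mathbb{1}\left[ Att\left( x_{i} \right) \ge 0.5 \right] \quad (10) \]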
Next, we calculate the Intersection over Union (IoU) between the attention of clean samples and adversarial examples separately for the baseline and FPAS models. A smaller IoU indicates a more severe shift in the model’s core attention. The calculation is performed as shown in Eq. (11).
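Eq. (11) can plausibly be reconstructed as the ratio of overlapping to combined binarized attention regions for each model:

\[ IoU^{model}\left( x_{i} \right) = \frac{\left| Att'^{\,model}_{clean}\left( x_{i} \right) \cap Att'^{\,model}_{adv}\left( x_{i} \right) \right|}{\left| Att'^{\,model}_{clean}\left( x_{i} \right) \cup Att'^{\,model}_{adv}\left( x_{i} \right) \right|} \quad (11) \]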
Among all samples \(x_{i}\) exhibiting attention shift on the ImageNet-100 and CIFAR-10 datasets, we counted the number of samples for which \(IoU^{FPAS}(x_{i})>IoU^{baseline}(x_{i})\). Finally, we calculated the success rate \(Num\_Rate\), representing the proportion of samples for which FPAS successfully pulls attention back, and the average core attention area \(IoU_{avg}\) for each sample across all sample groups. The experimental results are presented in Tables 8 and 9.
On the ImageNet-100 dataset, the FPAS module successfully guided attention in 64.2% of adversarial examples, with the average core attention area per sample elevated to 78.8%. Similarly, on the CIFAR-10 dataset, FPAS successfully guided attention in 59.9% of samples, and the average core attention area \(IoU_{avg}\) per sample was increased by 2.4% with the FPAS module.
Observing Fig. 8, the changes in attention on adversarial examples before and after defense with the FPAS module are evident. Adversarial attacks shifted attention away from the correct object region in samples on the baseline model. However, incorporating the FPAS module successfully guided the shift of attention back to the correct areas.
Attention enhancement effect
After obtaining IoU according to Formula (11), to more accurately assess the effect of attention enhancement while excluding the influence of attention shifting, we calculated the average core attention matrix \(Att'(x_{i}),\) denoted as \(Avg^{baseline}_{clean}\), \(Avg^{ANL}_{clean}\), \(Avg^{baseline}_{adv}\) and \(Avg^{ANL}_{adv}\), for both clean and adversarial examples in the baseline and ANL models under IoU thresholds of 60%, 70%, and 80%. Following this, we defined the change in core attention values for an image \(x_{i}\) before and after the attack in the model as \(Weak(x_{i})\), calculated as follows:
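One plausible closed form for \(Weak(x_{i})\), with the sign convention chosen so that larger values indicate weaker attenuation of core attention after the attack (consistent with the comparison used below), is:

\[ Weak_{model}\left( x_{i} \right) = Avg^{model}_{adv}\left( x_{i} \right) - Avg^{model}_{clean}\left( x_{i} \right) \]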
For each sample, we independently determined the attention change values on the CIFAR-10 and ImageNet-100 datasets. The average attention change value for each pixel in the core region per image, before and after using the ANL module, is denoted as \(Defense\_Value\).
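A plausible form of \(Defense\_Value\), averaging the per-sample difference in attention change between the ANL and baseline models (the exact formulation may differ), is:

\[ Defense\_Value = \frac{1}{img\_count} \sum_{i=1}^{img\_count} \left( Weak_{ANL}\left( x_{i} \right) - Weak_{baseline}\left( x_{i} \right) \right) \]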
Where \(img\_count\) represents the total number of samples in the dataset, we calculate the number of samples in the dataset for which \(Weak_{ANL}(x_{i})>Weak_{baseline}(x_{i})\) and denote it as \(Att\_Count\). The proportion of samples with enhanced attention, represented by the Attention Enhancement Ratio \(Num\_Rate\), can be expressed by formula (15):
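From the definitions above, formula (15) is simply the fraction of samples whose core attention is enhanced by ANL:

\[ Num\_Rate = \frac{Att\_Count}{img\_count} \quad (15) \]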
The experimental results, as shown in Tables 10 and 11, reveal that after employing the ANL module for defense, over 60% of the samples witness an enhancement in the attention values of their key regions, both in the ImageNet-100 and CIFAR-10 datasets. Taking the case of \(IoU > 60\%\), on the ImageNet-100 dataset, an average of 65.2% of the samples experience an enhancement in the attention values of their key regions, with an average attention increase of 0.019 per pixel across all samples. In the CIFAR-10 dataset, when \(IoU > 80\%\), the proportion of samples with attention enhancement peaks at 70.2%, with an average attention increase of 0.024 per pixel in the key regions.
The three sets of adversarial examples are incorrectly classified by the baseline model, as Fig. 9 demonstrates. However, after integrating the ANL defense module, the key regions’ attention for all three sets of adversarial examples is enhanced. Taking the example of the corgi sample, the baseline model assigns an attention value of 2246.0 to the corgi region, misclassifying it as a toy terrier. In contrast, the ANL model elevates the attention to 3904.8, correctly classifying the corgi sample after enhancing the model’s attention to the key region. Overall, the proposed Attention-based Non-Local defense module significantly improves the model’s focus on crucial features in the image, effectively countering attention decay-type attacks.
Performance comparison of integrated defense strategies
Utilizing the boosting framework for ensemble learning, we integrated the FPAS and ANL modules on the Wide-ResNet baseline network, constructing the defense model boosting(FPAS, ANL). Subsequently, we selected prominent algorithms in the field, namely DDN_at, GridMask, ALP, TRADES and AAD, to compare their integrated defense solutions against our boosting(FPAS, ANL) approach. The experimental results are presented in Tables 12 and 13.
The results from the experiments in the tables indicate that, among all the ensemble solutions, the boosting(FPAS, ANL) approach exhibits the best performance. Even on clean samples, our boosting(FPAS, ANL) model achieved improvements of 2.3% and 5.4% on the ImageNet-100 and CIFAR-10 datasets, respectively, compared to the baseline model. This research also demonstrates that not all defense algorithms can be successfully combined through boosting in order to create a more resilient defense model. For instance, on the ImageNet-100 dataset, the integrated models of GridMask and ALP, under Deepfool attack, achieve a recognition accuracy of 62.8%, whereas individually, their accuracies reach 70.4% and 70.1%, respectively. The integrated models experienced a decline in defense efficacy. We attribute this phenomenon to the fact that GridMask and ALP, in their original designs, target similar types of adversarial attack algorithms. This similarity leads to a situation where their defense strategies, when integrated, cannot effectively counter a broader range of attack algorithms.
In contrast, the superior performance of our boosting ensemble defense, combining the FPAS and ANL strategies, can be attributed to the fact that these two defense strategies were not initially biased toward specific attack algorithms. Instead, they were designed with a focus on adversarial attack effects, proposing defense strategies tailored to two distinct types of attention changes. Therefore, when integrating these two defense strategies, they complement each other, each having its own emphasis and compatibility, resulting in superior collaborative defense performance.
Statistical significance testing confirmed that the boosting(FPAS, ANL) approach achieved significant improvements over the baseline model, with p-values of 7.99e-11 for ImageNet-100 and 0.001 for CIFAR-10, both below the 0.05 threshold. These results highlight the effectiveness of combining complementary strategies like FPAS and ANL, which target distinct adversarial effects, leading to superior collaborative defense performance.
Effectiveness of integrating our approach with existing methods
The experimental setup mirrors that of the black-box defense experiments. We combined several existing defense strategies, including defensive distillation (AdaAD, AdaAD_IAD), ALP, GridMask, and TRADES, with our proposed method (boosting(FPAS, ANL)) to evaluate their combined effectiveness.
The experimental results, presented in Table 14, indicate that integrating our method with existing defense strategies significantly enhances the average defense success rate across nine black-box attack scenarios. Specifically, Ours+ALP achieves a 2.2% improvement over ALP, Ours+GridMask demonstrates a 2.7% enhancement over GridMask, and Ours+TRADES achieves a notable 6% increase over TRADES. The most effective combinations are Ours+AdaAD and Ours+AdaAD_IAD, which improve the average defense success rate by 7.1% and 1.8%, respectively, compared to our baseline method (boosting(FPAS, ANL)).
These findings highlight the flexibility and efficacy of our proposed approach. By integrating with various existing defense strategies, our method consistently enhances overall defense performance, providing a comprehensive solution for defending against adversarial attacks.
Evaluation of robustness against adversarial patches
Adversarial patches are meticulously optimized contiguous pixel blocks within an input image, designed to cause a machine learning model to misclassify the image. When integrated into input images, these adversarial patches generate out-of-distribution samples. This occurs because the injected patch introduces a spurious correlation with the target label, which often shifts the input sample away from the manifold of natural images. Adversarial patches can be physically realized as stickers and placed on real-world objects. Evaluating the robustness of machine learning models against these attacks is of paramount importance, given their potential to critically impact real-world applications with significant physical consequences.
The experimental setup mirrors that of the black-box defense experiments. We employed the adversarial patch generation method proposed by Pintor et al.57 to embed the generated patches into the ImageNet-100 dataset, and used the resulting samples to assess the robustness of the models against adversarial patch attacks. Comparing Tables 6 and 15, we can see that embedding adversarial patches in the adversarial examples significantly reduces baseline accuracy, while combining the defenses with adversarial training increases the defense success rate. Our FPAS_at and ANL_at models improved the average defense rate by 5.47% and 7.74%, respectively, compared with baseline_at when facing samples embedded with adversarial patches.
Cost of the defense module
On a GeForce RTX 2080 Ti machine with 16GB of RAM, we conducted full-load training of baseline, FPAS, ANL, and boosting (FPAS, ANL) models, incorporating adversarial training strategies on both ImageNet-100 and CIFAR-10 datasets. As shown in Table 16, compared to the baseline model, the ANL_at model increased the training time per iteration by 2m49s and 35s on the ImageNet-100 and CIFAR-10 datasets, respectively. The FPAS_at model increased the training time per iteration by 24s on both datasets, while the boosting(FPAS, ANL) model increased the training time per iteration by 3m18s and 35s on the two datasets, respectively. Overall, the training time per epoch for our models has increased slightly, but the magnitude of this increase is minimal. Such training time costs are deemed acceptable for constructing model defense solutions.
Conclusion
Our study significantly advances the understanding of adversarial attacks on Deep Neural Networks (DNNs) by dissecting the mechanisms through which adversarial examples manipulate model attention, leading to misclassifications. We identified two primary categories of adversarial impacts on DNNs: those inducing shifts in the model’s attention and those resulting in its attenuation. To counteract these adversarial tactics, we introduced two innovative defense modules: the Feature Pyramid-based Attention Space-guided (FPAS) module and the Attention-based Non-Local (ANL) module. The FPAS module reduces the shifting attention in adversarial examples, improving the model’s defense performance. The ANL module, on the other hand, enhances the model’s focus on critical features, efficiently constructing a robust defense model with low implementation cost and minimal intrusion into the original model. Through extensive experiments, including white-box and black-box defenses on the ImageNet-100 and CIFAR-10 datasets, we validated the superior performance of our proposed defense modules.
Our boosting(FPAS, ANL) model demonstrates strong robustness against nine types of black-box adversarial attacks, achieving an average accuracy of 65.3% on the ImageNet-100 dataset and 66.6% on the CIFAR-10 dataset. This represents improvements of 5.8% and 4.1%, respectively, compared to the baseline_at model. Our FPAS_at and ANL_at models demonstrated significant improvements over the baseline, enhancing the average defense rate by 5.47% and 7.74%, respectively, when facing samples embedded with adversarial patches. Moreover, by employing an ensemble framework to construct a collaborative defense model, we compared it with other mainstream defense algorithms in the industry. The results showed that our universal defense strategy can provide more robust and comprehensive defense capabilities at a lower implementation cost compared to current mainstream defense methods. This work not only elucidates the role of attention mechanisms in the vulnerability of DNNs to adversarial examples but also sets a new benchmark for the development of robust, interpretable defense strategies that safeguard the integrity of deep learning models in high-stakes applications.
Future work can explore the integration of these defense mechanisms with other model architectures, such as transformers, and assess their applicability to a broader range of adversarial attacks. This may include testing against different types and complexities of adversarial attacks, as well as evaluating their performance in various application domains like autonomous driving and medical diagnostics. Due to the design of FPAS and ANL in multi-scale feature fusion and enhancement of key feature attention, they possess the potential for cross-domain applications. These modules can be integrated into other models and applied to fields such as object detection, semantic segmentation, natural language processing, and multimodal tasks. Additionally, research can further optimize these defense modules to reduce their impact on model performance while maintaining or improving their adversarial robustness. These directions will not only drive advancements in the field of adversarial defense but also provide stronger protection for deep learning models in practical applications.
Discussion
Limitations
Despite the strong defense capabilities demonstrated by the proposed algorithms, several limitations must be acknowledged. While the introduced modules add minimal computational overhead, they still result in a modest increase in training time compared to the baseline model. This trade-off between robustness and efficiency makes the modules less suitable for real-time processing or rapid deployment, particularly in resource-constrained systems. Therefore, an ideal adversarial defense algorithm should strike a balance between robustness and efficiency, minimizing implementation costs while maintaining strong defensive performance.
Moreover, although the strategies show robustness against a wide range of adversarial perturbations, vulnerabilities may emerge when confronted with highly adaptive or dynamic adversarial attacks. These include adversaries employing model-based feature disruption or dynamic attack patterns specifically designed to exploit the attention mechanisms leveraged by the modules. Such attacks may undermine the model’s robustness by dynamically shifting focus or camouflaging perturbations in critical regions. Future experiments will focus on validating the defense strategies against these adaptive adversarial methods, providing valuable insights into the robustness and generalizability of the proposed modules.
Additionally, the defense schemes require retraining the original model, which introduces implementation overhead. While preprocessing techniques could potentially bypass retraining, they often result in limited defense effectiveness and remain vulnerable to sophisticated adversarial examples. This retraining requirement poses a challenge for the practical deployment of the proposed methods in certain scenarios. Future work should aim to address these limitations by exploring adaptive mechanisms to counter dynamic adversarial attacks, extending the methodology to non-image domains, and developing universal attention-guided modules compatible with pre-trained models. These modules would integrate robust defense capabilities without the need for retraining, thereby reducing implementation overhead. Furthermore, prioritizing lightweight implementations will enhance the universality and practicality of the proposed defense strategies while ensuring strong performance across a wide range of applications.
Future directions and scalability
The scalability of the proposed modules to more complex architectures, such as transformers, represents a key direction for future research. Transformer-based models, such as Vision Transformer (ViT) and Swin Transformer, exhibit inherent robustness due to their large parameter capacity and unique architectural properties. This robustness makes it inherently difficult to design effective adversarial attacks against these models, as their extensive training and hierarchical feature processing tend to mitigate conventional perturbations. However, the core reliance on self-attention mechanisms also introduces specific vulnerabilities that adversaries can exploit. Existing attack strategies targeting Transformers include Attention Hijacking Attacks, which manipulate attention weights to disrupt the focus on critical features, and Patch-based Adversarial Attacks, which exploit the patch-based processing structure of vision models. These attacks, although challenging to design, can significantly compromise the model’s performance by targeting its most critical computational components.
Our proposed FPAS and ANL modules provide initial steps toward addressing these challenges. FPAS strengthens multi-scale feature extraction, mitigating the effects of localized perturbations, while ANL reinforces global feature interactions, reducing susceptibility to attention hijacking. Nevertheless, designing effective defenses for Transformer-specific vulnerabilities remains an open problem. Future work will explore novel defense strategies tailored to the unique characteristics of Transformer architectures, focusing on the interplay between large parameter spaces, hierarchical processing, and adversarial resilience. These efforts aim to deepen our understanding of Transformer robustness while advancing the development of adaptive and scalable defense solutions.
Cross-domain applications and future validation
The proposed modules exhibit potential adaptability beyond their initial application in adversarial defense for image classification. For example, in medical imaging, they could potentially enhance resilience to adversarial noise in diagnostic tasks, such as analyzing radiographic or MRI scans, where perturbations might significantly impact clinical decisions. Similarly, in autonomous driving, the modules may improve the robustness of real-time object detection systems against adversarial patches or occlusions. In remote sensing, their applicability could be explored in UAV-based object detection under diverse environmental conditions and adversarial scenarios.
Planned experiments will aim to systematically evaluate the modules on datasets such as CheXpert for medical imaging, KITTI for autonomous driving, and DOTA for remote sensing. These studies will focus on assessing robustness across diverse input data distributions, with the objective of validating the modules’ potential generalizability in practical applications.
Failure cases
Understanding failure cases is essential for refining adversarial defense strategies. We define set A as the samples the baseline model classifies correctly on clean inputs, and set B as the samples it misclassifies on adversarial inputs. The intersection of A and B, denoted set C, contains the samples on which the adversarial attack compromises the baseline model, that is, correct clean predictions that the attack flips. Our integrated model, combining FPAS and ANL, misclassifies a subset of adversarial examples, denoted set D. The intersection of C and D, designated set F, comprises the failure cases in which the proposed defense is also compromised.
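For concreteness, these sets can be computed directly from model predictions. The listing below is a minimal sketch in PyTorch; the tensor names (clean_logits_base, adv_logits_base, adv_logits_ours, labels) are hypothetical placeholders for the baseline and FPAS+ANL outputs on the same test batch rather than part of our implementation.

import torch

def failure_sets(clean_logits_base, adv_logits_base, adv_logits_ours, labels):
    """Boolean masks for the sets A, B, C, D and F described above."""
    pred_clean_base = clean_logits_base.argmax(dim=1)  # baseline predictions on clean inputs
    pred_adv_base = adv_logits_base.argmax(dim=1)      # baseline predictions on adversarial inputs
    pred_adv_ours = adv_logits_ours.argmax(dim=1)      # FPAS+ANL predictions on adversarial inputs

    A = pred_clean_base == labels   # baseline correct on the clean input
    B = pred_adv_base != labels     # baseline wrong on the adversarial input
    C = A & B                       # the attack flips a correct baseline prediction
    D = pred_adv_ours != labels     # integrated model wrong on the adversarial input
    F = C & D                       # the proposed defense is also compromised
    return {"A": A, "B": B, "C": C, "D": D, "F": F}

Given sets = failure_sets(...), the expression sets["F"].float().mean() reports the fraction of test samples on which the integrated defense still fails, which is the quantity analyzed in the remainder of this subsection.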
Through comprehensive analysis using attention visualization and statistical evaluation, several critical challenges were identified. Subtle adversarial perturbations often evade detection, exploiting the model’s inherent limitations in attention sensitivity. Additionally, significant spatial overlaps between the attention regions of clean and adversarial examples undermine the effectiveness of attention-guided defenses. The FPAS and ANL modules also exhibit vulnerabilities to dynamic or localized perturbations, which can overwhelm their response mechanisms. Furthermore, attacks targeting specific feature scales or employing class-agnostic strategies further degrade the robustness of these modules.
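One way to quantify the spatial overlap noted above is to binarize the attention maps of a clean sample and its adversarial counterpart (for example, Grad-CAM++ heatmaps) and measure their intersection-over-union. The sketch below assumes the heatmaps have already been computed and normalized to [0, 1]; the function name and the quantile threshold are illustrative choices, not the exact protocol used in our evaluation.

import numpy as np

def attention_overlap(att_clean: np.ndarray, att_adv: np.ndarray, q: float = 0.8) -> float:
    """IoU of the high-attention regions of two 2-D heatmaps normalized to [0, 1]."""
    mask_clean = att_clean >= np.quantile(att_clean, q)  # top-attention region of the clean sample
    mask_adv = att_adv >= np.quantile(att_adv, q)        # top-attention region of the adversarial sample
    union = np.logical_or(mask_clean, mask_adv).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(mask_clean, mask_adv).sum() / union)

A high overlap for misclassified adversarial examples is consistent with the observation that attention-guided defenses lose discriminative power when the attack barely moves the attended region.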
To address these failure cases, future work should focus on enhancing dynamic adaptation, multi-scale feature refinement, and localized perturbation handling. Adaptive attention mechanisms can be developed to better respond to dynamic or localized perturbations, with temporal and spatial coherence checks to mitigate such attacks. The FPAS module can be refined to improve its sensitivity to subtle perturbations and its ability to detect adversarial patterns across multiple scales, potentially through hierarchical feature weighting or amplification. Additionally, enhancing the ANL module’s robustness against localized attacks with techniques like saliency filtering or adversarial region masking is necessary. Analyzing the failure patterns within set F will provide valuable insights to guide the development of specialized countermeasures, such as tailored adversarial training datasets and defense strategies for specific attack classes. These iterative refinements will help improve the defense models, ensuring their reliability and effectiveness against increasingly sophisticated adversarial threats.
Data availability
The data that support the findings of this study are available from the corresponding author upon reasonable request.
References
Touvron, H. et al. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning 10347–10357 (PMLR, 2021).
Wang, C.-Y., Bochkovskiy, A. & Liao, H.-Y. M. Scaled-yolov4: scaling cross stage partial network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 13029–13038 (2021).
Zheng, S. et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 6881–6890 (2021).
Otter, D. W., Medina, J. R. & Kalita, J. K. A survey of the usages of deep learning for natural language processing. IEEE Trans. Neural Netw. Learn. Syst. 32, 604–624 (2020).
Szegedy, C. et al. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199 (2013).
Liu, Z. et al. Hygloadattack: Hard-label black-box textual adversarial attacks via hybrid optimization. Neural Netw. 2024, 106461 (2024).
Zhou, W. et al. Hidim: A novel framework of network intrusion detection for hierarchical dependency and class imbalance. Comput. Secur. 148, 104155 (2025).
Gowal, S. et al. Achieving robustness in the wild via adversarial mixing with disentangled representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 1211–1220 (2020).
Farnia, F., Zhang, J. M. & Tse, D. Generalizable adversarial training via spectral normalization. arXiv preprint arXiv:1811.07457 (2018).
Rony, J. et al. Decoupling direction and norm for efficient gradient-based l2 adversarial attacks and defenses. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 4322–4330 (2019).
Wang, J. & Zhang, H. Bilateral adversarial training: towards fast training of more robust models against adversarial attacks. In Proceedings of the IEEE/CVF International Conference on Computer Vision 6629–6638 (2019).
Chan, A., Tay, Y., Ong, Y. S. & Fu, J. Jacobian adversarially regularized networks for robustness. arXiv preprint arXiv:1912.10185 (2019).
He, L. et al. Boosting adversarial robustness via self-paced adversarial training. Neural Netw. 167, 706–714 (2023).
Hsiung, L., Tsai, Y.-Y., Chen, P.-Y. & Ho, T.-Y. Towards compositional adversarial robustness: generalizing adversarial training to composite semantic perturbations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 24658–24667 (2023).
Yang, S. & Xu, C. One size does not fit all: data-adaptive adversarial training. In European Conference on Computer Vision 70–85 (Springer, 2022).
Jin, G., Yi, X., Wu, D., Mu, R. & Huang, X. Randomized adversarial training via taylor expansion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 16447–16457 (2023).
Das, N. et al. Shield: Fast, practical defense and vaccination for deep learning using jpeg compression. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining 196–204 (2018).
Sun, B., Tsai, N.-h., Liu, F., Yu, R. & Su, H. Adversarial defense by stratified convolutional sparse coding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 11447–11456 (2019).
Mustafa, A., Khan, S. H., Hayat, M., Shen, J. & Shao, L. Image super-resolution as a defense against adversarial attacks. IEEE Trans. Image Process. 29, 1711–1724 (2019).
Samangouei, P., Kabkab, M. & Chellappa, R. Defense-gan: Protecting classifiers against adversarial attacks using generative models. arXiv preprint arXiv:1805.06605 (2018).
Zhang, Y., Zhang, T., Wang, S. & Yu, P. An efficient perceptual video compression scheme based on deep learning-assisted video saliency and just noticeable distortion. Eng. Appl. Artif. Intell. 141, 109806 (2025).
Chen, P., Liu, S., Zhao, H., Wang, X. & Jia, J. Gridmask data augmentation. arXiv preprint arXiv:2001.04086 (2020).
Papernot, N., McDaniel, P., Wu, X., Jha, S. & Swami, A. Distillation as a defense to adversarial perturbations against deep neural networks. In 2016 IEEE Symposium on Security and Privacy (SP) 582–597 (IEEE, 2016).
Huang, B. et al. Boosting accuracy and robustness of student models via adaptive adversarial distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 24668–24677 (2023).
Pang, T. et al. Rethinking softmax cross-entropy loss for adversarial robustness. arXiv preprint arXiv:1905.10626 (2019).
Gu, S. & Rigazio, L. Towards deep neural network architectures robust to adversarial examples. arXiv preprint arXiv:1412.5068 (2014).
Goodfellow, I. et al. Generative adversarial networks. Commun. ACM 63, 139–144 (2020).
Kannan, H., Kurakin, A. & Goodfellow, I. Adversarial logit pairing. arXiv preprint arXiv:1803.06373 (2018).
Zhang, H. et al. Theoretically principled trade-off between robustness and accuracy. In International Conference on Machine Learning 7472–7482 (PMLR, 2019).
Wu, S. et al. Attention, please! adversarial defense via attention rectification and preservation. arXiv preprint arXiv:1811.09831 (2018).
Deng, T. & Zeng, Z. Generate adversarial examples by spatially perturbing on the meaningful area. Pattern Recogn. Lett. 125, 632–638 (2019).
Chen, X. et al. Feature distillation in deep attention network against adversarial examples. IEEE Trans. Neural Netw. Learn. Syst. 34, 3691–3705 (2021).
Zhu, J. et al. Attention-guided transformation-invariant attack for black-box adversarial examples. Int. J. Intell. Syst. 37, 3142–3165 (2022).
Wang, J., Liu, A., Bai, X. & Liu, X. Universal adversarial patch attack for automatic checkout using perceptual and attentional bias. IEEE Trans. Image Process. 31, 598–611 (2021).
Chen, S., He, Z., Sun, C., Yang, J. & Huang, X. Universal adversarial attack on attention and the resulting dataset damagenet. IEEE Trans. Pattern Anal. Mach. Intell. 44, 2188–2197 (2020).
He, M., Cui, M., Liang, Y. & Liu, H. Aeaed: Attention-enhanced autoencoder for adversarial example detection with multi-scale feature learning. J. Intell. Knowl. Eng. 2, 99 (2024).
Yu, L., Zhang, H. & Xu, C. Text-guided attention is all you need for zero-shot robustness in vision-language models. arXiv preprint arXiv:2410.21802 (2024).
Feng, W., Xu, N., Zhang, T., Zhang, Y. & Wu, F. Enhancing cross-task transferability of adversarial examples via spatial and channel attention. IEEE Trans. Multimedia (2024).
Li, Q. et al. Attention-sa: Exploiting model-approximated data semantics for adversarial attack. IEEE Trans. Inf. Forens. Secur. (2024).
Hu, J., Shen, L. & Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 7132–7141 (2018).
Wang, Q. et al. Eca-net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 11534–11542 (2020).
Woo, S., Park, J., Lee, J.-Y. & Kweon, I. S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV) 3–19 (2018).
Wang, X., Girshick, R., Gupta, A. & He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 7794–7803 (2018).
Chen, S.-B. et al. Remote sensing scene classification via multi-branch local attention network. IEEE Trans. Image Process. 31, 99–109 (2021).
Misra, D., Nalamada, T., Arasanipalai, A. U. & Hou, Q. Rotate to attend: Convolutional triplet attention module. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision 3139–3148 (2021).
Li, H. et al. A defense method based on attention mechanism against traffic sign adversarial samples. Inf. Fusion 76, 55–65 (2021).
Zhang, Y., Liu, Y., Kang, W. & Tao, R. Vss-net: Visual semantic self-mining network for video summarization. IEEE Trans. Circ. Syst. Video Technol. 34, 2775–2788 (2024).
Zhang, Y., Wu, C., Guo, W., Zhang, T. & Li, W. Cfanet: Efficient detection of uav image based on cross-layer feature aggregation. IEEE Trans. Geosci. Remote Sens. 61, 1–11 (2023).
Zhang, Y., Wang, S., Zhang, Y. & Yu, P. Asymmetric light-aware progressive decoding network for rgb-thermal salient object detection. J. Electron. Imaging 34, 013005 (2025).
Zhang, Y., Liu, T., Yu, P., Wang, S. & Tao, R. Sfsanet: Multiscale object detection in remote sensing image based on semantic fusion and scale adaptability. IEEE Trans. Geosci. Remote Sens. 62, 586 (2024).
Zhang, Y., Zhen, J., Liu, T., Yang, Y. & Cheng, Y. Adaptive differentiation siamese fusion network for remote sensing change detection. IEEE Geosci. Remote Sens. Lett. (2024).
Zhang, Y., Wu, C., Zhang, T. & Zheng, Y. Full-scale feature aggregation and grouping feature reconstruction based uav image target detection. IEEE Trans. Geosci. Remote Sens. (2024).
Zhang, Y., Zhang, T., Wu, C. & Tao, R. Multi-scale spatiotemporal feature fusion network for video saliency prediction. IEEE Trans. Multimedia 26, 4183–4193 (2024).
Zhang, Y., Liu, Y. & Wu, C. Attention-guided multi-granularity fusion model for video summarization. Expert Syst. Appl. 249, 123568 (2024).
Tian, Y., Krishnan, D. & Isola, P. Contrastive multiview coding. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI 16 776–794 (Springer, 2020).
Chattopadhay, A., Sarkar, A., Howlader, P. & Balasubramanian, V. N. Grad-cam++: generalized gradient-based visual explanations for deep convolutional networks. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV) 839–847 (IEEE, 2018).
Pintor, M. et al. Imagenet-patch: A dataset for benchmarking machine learning robustness against adversarial patches. Pattern Recogn. 134, 109064 (2023).
Acknowledgements
This work was supported in part by the National Key Research and Development Program of China under Grant 2023YFC3010302, the National Natural Science Foundation of China under Grant 82101079, and the Key R&D Program of Jiangsu Province under Grant BE2023836 (Corresponding author: Yining Hu). The first two authors contributed equally to this work.
Author information
Contributions
Jiawei Zhao, Siqi Gu, and Zihan Qin drafted the manuscript. Lizhe Xie designed the methodology and experiments, while Jiawei Zhao and Siqi Gu carried out the experiments. Zihan Qin also prepared the figures and tables. Zheng Wang and Yuning Zhang contributed to the manuscript review and editing. Yining Hu and Lizhe Xie secured funding, with Yining Hu further providing validation and supervision. All authors reviewed and approved the final manuscript.
Corresponding author
Correspondence to Yining Hu.
Ethics declarations
Competing Interests
The authors have no competing interests to declare that are relevant to the content of this article. The authors have no relevant financial or non-financial interests to disclose. All authors certify that they have no affiliations with or involvement in any organization or entity with any financial interest or non-financial interest in the subject matter or materials discussed in this manuscript.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Zhao, J., Xie, L., Gu, S. et al. Universal attention guided adversarial defense using feature pyramid and non-local mechanisms. Sci Rep 15, 5237 (2025). https://doi.org/10.1038/s41598-025-89267-8