Abstract
Deep Neural Networks (DNNs) have been shown to be vulnerable to adversarial examples, significantly hindering the development of deep learning technologies in high-security domains. A key challenge is that current defense methods often lack universality, as they are effective only against certain types of adversarial attacks. This study addresses this challenge by focusing on analyzing adversarial examples through changes in model attention, and classifying attack algorithms into attention-shifting and attention-attenuation categories. Our main novelty lies in proposing two defense modules: the Feature Pyramid-based Attention Space-guided (FPAS) module to counter attention-shifting attacks, and the Attention-based Non-Local (ANL) module to mitigate attention-attenuation attacks. These modules enhance the model’s defense capability with minimal intrusion into the original model. By integrating FPAS and ANL into the Wide-ResNet model within a boosting framework, we demonstrate their synergistic defense capability. Even when adversarial examples are embedded with patches, our models showed significant improvements over the baseline, enhancing the average defense rate by 5.47% and 7.74%, respectively. Extensive experiments confirm that this universal defense strategy offers comprehensive protection against adversarial attacks at a lower implementation cost compared to current mainstream defense methods, and is also adaptable for integration with existing defense strategies to further enhance adversarial robustness.
Introduction
Deep learning technologies have catalyzed transformative advances across various domains, including image classification1, object detection2, semantic segmentation3, and natural language processing4, demonstrating remarkable efficacy in addressing complex challenges. However, adversarial examples expose critical vulnerabilities in deep neural networks (DNNs), undermining their robustness, defined as the model’s ability to maintain high performance under adversarial or noisy inputs, and reliability, reflecting the model’s consistent behavior across diverse conditions. This fragility was first identified by Szegedy et al.5 in 2013, who noted DNNs’ susceptibility to small but intentional input perturbations that disrupt their input-output relationships. The implications of this weakness are particularly serious for high-stakes applications, such as autonomous driving and medical diagnostics, where model accuracy and reliability are paramount, and adversarial interference could result in severe consequences, raising ethical and safety concerns.
The widespread threat posed by adversarial examples requires an in-depth investigation into their impact on DNNs6,7. Although numerous defenses have been proposed, adversarial training8,9,10,11,12,13 remains a primary approach for enhancing model robustness by incorporating adversarial examples into the training set. Building on this, Lei et al.14 proposed Generalized Adversarial Training (GAT) to extend robustness from simple \(l_{p}\)-ball perturbations to complex semantic changes, including variations in hue, saturation, brightness, contrast, and rotation. Yang et al.15 introduced Data-Adaptive Adversarial Training (DAAT), which adapts perturbation sizes through a calibration network, improving robustness and accuracy. Jin et al.16 furthered adversarial training by adding random noise to network weights, flattening the loss landscape to enhance both robustness and clean accuracy. Despite their effectiveness against specific attacks, these methods still face challenges in generalizing to novel or unforeseen adversarial methods. Input preprocessing techniques, such as image compression and filtering17,18,19,20,21, provide additional defense options but may incur latency or reduce input data quality, potentially affecting model performance on legitimate samples. The GridMask22 method, for instance, generates a binary mask that disrupts adversarial perturbations by masking parts of the input image, thus reducing attack efficacy. Other defense methods, such as defensive distillation23,24, enhanced loss functions25,26,27, and model structure optimization, have demonstrated some effectiveness against previously unseen adversarial attacks. Adversarial Logit Pairing (ALP)28, within adversarial training, aligns clean and adversarial example logits, adding a regularization term that encourages similar embeddings for both versions of the same sample. This alignment helps the model better represent data structures internally. Similarly, the TRADES29 method partitions the adversarial loss into the classification error of clean samples and a trade-off term representing the KL divergence between clean and adversarial logits. Unlike ALP, TRADES optimizes the classification of clean samples and the robustness of adversarial logits without directly targeting adversarial accuracy, effectively balancing robustness and clean accuracy. Additionally, the Attention-based Adversarial Defense (AAD)30 method addresses adversarial attacks by aligning visual attention between adversarial and clean samples, ensuring focus remains on target objects and reducing feature divergence. AAD also selectively incorporates moderately challenging adversarial examples based on observed attention shifts to further strengthen model robustness.
A key challenge facing current defense methodologies is their limited generalizability, primarily due to an incomplete understanding of how adversarial attacks affect model behavior. While some defenses are effective against specific attack types, they often fail against novel or unforeseen strategies. This limitation arises because most defenses are designed to counter specific perturbation patterns without fully addressing the broader effects these attacks have on model performance. Consequently, their effectiveness declines when confronted with evolving adversarial techniques that exploit previously unrecognized vulnerabilities. This underscores the urgent need for defense strategies that are not only robust but also adaptable to address the diverse and rapidly advancing spectrum of adversarial threats.
Wu et al., for example, observed changes in class activation maps when they applied the Fast Gradient Sign Method (FGSM) to attack ResNet50 on ImageNet, finding that adversarial attacks cause the model’s attention to shift to incorrect target regions. This finding motivated us to investigate adversarial attack impact from an attention perspective. Previous studies have shown that attention-based defenses or attacks can significantly influence model performance. Deng et al.31 explored a spatially transformed attack method based on attention, highlighting the importance of focusing on meaningful areas to create effective adversarial examples. Chen et al.32 investigated how attention mechanisms can compress adversarial perturbations, suggesting that attention-based defenses can resist adversarial examples by concentrating on essential image regions. Zhu et al.33 enhanced adversarial example transferability using an attention mechanism to disrupt distinct features, showcasing the effectiveness of attention-guided strategies in adversarial contexts. Wang et al.34 used attention mechanisms to generate adversarial patches with strong generalization, underscoring attention’s role in creating more effective attacks. Chen et al.35 introduced an attack method targeting DNNs’ attention to increase example transferability, emphasizing attention’s vulnerability to adversarial manipulations. He et al.36 proposed AEAED, integrating multi-head self-attention in autoencoders, demonstrating its utility in capturing subtle adversarial perturbations across multiple scales, and outperforming baseline detection models in adversarial example detection. Yu et al.37 introduced TGA-ZSR, employing text-guided attention to improve zero-shot robustness in vision-language models, effectively aligning adversarial and clean attention distributions while maintaining generalization to clean samples. Feng et al.38 proposed a Dual Attention-Guided Method to create cross-task adversarial examples by capturing overlapping discriminative regions, while Li et al. introduced a post-hoc soft attention mechanism to expand adversarial examples by altering separate semantic units. Liu et al.39 developed the Content Feature Optimization Attack (CFOA) targeting content features in white-box models, illustrating the utility of attention-guided feature perturbations in enhancing adversarial example transferability.
Fig. 1 Visualization of image attention under different attack algorithms. Columns 1 and 4 show clean examples, columns 2 and 5 show attention maps for clean examples, and columns 3 and 6 show attention maps for adversarial examples. The labels below the images indicate model predictions: black denotes correct predictions and red denotes incorrect predictions.
Notably, as depicted in the third column of Fig. 1, we discern that adversarial attacks, irrespective of their type, induce a progressive diversion of the model’s attention from the accurately classified entities towards non-pertinent objects or background areas, corroborating the findings of Wu et al. Furthermore, our research uncovers a novel phenomenon, as demonstrated in the sixth column of the figure, where certain adversarial examples manage to breach the defenses while maintaining the model’s attention on the target object, a stark contrast to the behavior observed with clean samples.
To overcome the limitations in the generalizability of current defense methods, this research aims to develop a more comprehensive and universally applicable defense strategy. Specifically, we seek to address this gap by investigating the role of model attention in the context of adversarial attacks. Attention mechanisms40,41,42,43,44,45,46, which are crucial in directing a model’s computational focus, present a promising avenue for creating countermeasures against adversarial manipulation. By carefully analyzing the shifts and attenuations in model attention induced by adversarial examples, we aim to uncover the strategies employed by these attacks, thus providing a foundation for targeted defense mechanisms. Through an extensive examination of the attention views across a wide range of adversarial examples, we categorize attack algorithms into two primary classes based on their effects: attacks that induce attention shifts and those that cause attention attenuation.
Compared to other attention-based defense methods, the approach proposed in this paper performs a more detailed categorization during the attack analysis phase, rather than treating all attacks as a single attention bias issue. This refined categorization facilitates the development of more targeted defense strategies, addressing the attention distribution issues induced by different types of adversarial attacks.
In response to these categorizations, we propose two tailored defense modules: the Feature Pyramid-based Attention Space-guided (FPAS) module, specifically designed to counteract attention-shifting attacks by spatially retracting the adversarial attention shifts, and the Attention-based Non-Local (ANL) module, crafted to enhance the model’s focus on critical features, thus mitigating attention-attenuation attacks. By integrating FPAS and ANL into the Wide-ResNet model within a boosting framework, we demonstrate their synergistic defense capability. The empirical validation of these modules highlights their effectiveness in reinforcing the resilience of DNNs against a broad spectrum of adversarial attacks. Moreover, our proposed defense modules can be integrated with existing defense strategies as components to further improve adversarial robustness.
In summary, our key contributions are as follows:
- We propose a novel classification method of adversarial attacks by analyzing the changes in model attention, distinguishing between attacks that induce attention shifts and those that lead to attention attenuation. This classification method provides a foundational understanding for developing targeted defense mechanisms.
- We propose the Feature Pyramid-based Attention Space-guided (FPAS) module to specifically counter attention-shifting attacks, effectively re-directing the model’s focus to the relevant features of the input.
- We present the Attention-based Non-Local (ANL) module to address attention-attenuation attacks, enhancing the model’s emphasis on critical features and thereby boosting its robustness.
The remainder of this paper is organized as follows. Section 2 categorizes attacks into two types based on the changes in model attention caused by various attack algorithms. Section 3 endeavors to delve into the mechanisms underlying these two attack phenomena from a model-based perspective and proposes corresponding defense strategies. Section 4 presents a thorough evaluation of our defense approach, affirming its efficacy. Section 5 concludes the paper, summarizing our findings and underscoring their significance for advancing the field of adversarial machine learning. Finally, Section 6 provides a comprehensive discussion of the study, highlighting both limitations and failure cases of the proposed defense strategies.
Classification of attack effects based on attention view
In this section, we categorize attacks into two types based on the changes in model attention caused by various attack algorithms: Attention-shifting Attacks and Attention-attenuation Attacks. Understanding these distinct attack types is crucial for developing effective defense mechanisms and enhancing the robustness of machine learning models.
The attention matrix for clean samples is denoted as \(Att_{clean}\left( x_{i} \right) \in {[} 0,1 {]}\) and for adversarial examples as \(Att_{t}\left( x_{i} \right) \in {[} 0,1 {]}\), where t represents the attack algorithm. A softmax normalization is applied to the difference between the attention matrices of the adversarial and clean samples. The variance of the resulting matrix is then computed, serving as a measure of the degree of attention shift brought about by the adversarial attack. The computation procedure is illustrated in Eq. (1):
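One plausible form of Eq. (1), reconstructed from the description above (the exact formulation in the original may differ), is:

\[ \sigma_{t}\left( x_{i} \right) = \mathrm{Var}\Big( \mathrm{softmax}\big( Att_{t}\left( x_{i} \right) - Att_{clean}\left( x_{i} \right) \big) \Big) \quad (1) \]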
We collected the attack results of the FGSM, I-FGSM, PGD, MI-FGSM, DI2-FGSM, TI-FGSM, Deepfool, C&W, and Square algorithms on the Wide-ResNet model for the ImageNet dataset. Detailed data can be found in Table 1.
We use the average variance \({\bar{\sigma }}_{t}\) of the attention change matrix as the threshold. If \(\sigma _{t} >{\bar{\sigma }} _{t}\), it indicates that the adversarial perturbation on this sample has dispersed the model’s attention. Conversely, if \(\sigma _{t} \le {\bar{\sigma }} _{t}\) , it indicates that although the adversarial perturbation on this sample can break the Wide-ResNet model, the attention location has not undergone significant dispersion. From Table 1, it can be observed that among all the adversarial examples that break the Wide-ResNet model, a significant proportion of them have attention change variances exceeding the average variance value. This indicates that the model’s attention has shifted considerably on these samples. We define this kind of attention change as attention-shifting.
There is another subset of samples where the attention change variance has not undergone significant alteration, indicating that the attention location on these samples has not changed significantly. This phenomenon is particularly evident in query-based attack algorithms, where the proportion of such samples in the Deepfool, C&W, and Square algorithms exceeds 50%. We next investigate the numerical changes in attention values in the key regions of samples where \(\sigma _{t} \le {\bar{\sigma }} _{t}\). We employ masking techniques to extract the important attention regions using thresholds \(\alpha\) of 0.5, 0.6, 0.7, and 0.8, and then examine the degree to which adversarial attacks have attenuated attention in these critical areas. The computation procedure is illustrated in Eqs. (2) and (3):
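A plausible reconstruction of Eqs. (2) and (3), assuming a binary mask over regions whose clean attention exceeds the threshold \(\alpha\) and a relative attenuation measure over that mask (the original formulation may differ):

\[ M_{\alpha}\left( x_{i} \right) = \mathbb{1}\left[ Att_{clean}\left( x_{i} \right) \ge \alpha \right] \quad (2) \]

\[ Deg_{t}\left( x_{i} \right) = \frac{\sum M_{\alpha}\left( x_{i} \right) \odot \left( Att_{clean}\left( x_{i} \right) - Att_{t}\left( x_{i} \right) \right)}{\sum M_{\alpha}\left( x_{i} \right) \odot Att_{clean}\left( x_{i} \right)} \quad (3) \]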
The performance of each attack algorithm in terms of attention change on the ImageNet dataset is depicted in Table 2. While the focus areas of the model in these adversarial examples do not show positional shifts when compared to clean samples, the table indicates that adversarial perturbations lead to a numerical attenuation in the model’s attention to significant regions within the samples. On the ImageNet dataset, the average attenuation surpasses 13%. We define this type of attention change as attention-attenuation.
Based on the above analysis, we categorize the current mainstream adversarial attack algorithms into two types from the perspective of model attention change:
1. Attention-shifting Attacks: These attacks shift the model’s attention from the correct object region to areas unrelated to the model’s decision. The focal regions of the model experience a positional shift.
2. Attention-attenuation Attacks: Attacks of this type do not induce a shift in the model’s attention position. However, although the model’s focus remains on the correct object, the degree of attention diminishes due to the influence of adversarial perturbations.
Defense strategy
In this section, we endeavor to explain the mechanisms underlying these two attack phenomena from a model-based perspective in order to devise effective defense strategies. For each type of attack, we present our respective defense strategies and design corresponding defense modules. By understanding the fundamental principles that govern these attacks, we aim to mitigate their impact and enhance the robustness of our deep learning models.
Analysis based on model perspective
We conduct t-SNE visualization on the high-dimensional features of clean samples from CIFAR-10 on the Wide-ResNet model, as illustrated in Fig. 2.
We observe that in the high-dimensional space, there is a significant separation between different classes. Additionally, besides the clustered points corresponding to the 10 classes, there are scattered points that are not close to any specific class cluster but are instead distributed relatively randomly in the plane. Subsequently, we apply t-SNE visualization to the adversarial examples generated by nine different attack algorithms. Based on the two types of adversarial attack effects summarized above, we separate all sample points in the t-SNE space into two sets, where \(x_{k}^{i,j} \in R_{t-SNE}\), \(\left( i,j \right)\) represents the coordinates of a point, and k denotes either attention attenuation or attention shift. We create a distance matrix E for each set by calculating the distances \(d\left( m,n \right)\) between any two locations under the labels of the ten categories, as indicated by Eqs. (4) and (5):
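A plausible reconstruction of Eqs. (4) and (5), assuming Euclidean distances between point coordinates in the t-SNE plane (the original formulation may differ):

\[ d\left( m,n \right) = \sqrt{\left( i_{m} - i_{n} \right)^{2} + \left( j_{m} - j_{n} \right)^{2}} \quad (4) \]

\[ E = \left[ d\left( m,n \right) \right]_{m,n} \quad (5) \]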
Equations (6) and (7) demonstrate how to compute the entropy \(H\left( x\right)\) using the distance matrix E as a probability distribution,
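A plausible reconstruction of Eqs. (6) and (7), assuming the entries of E are normalized into a probability distribution before the Shannon entropy is taken (the original formulation may differ):

\[ p\left( m,n \right) = \frac{d\left( m,n \right)}{\sum_{m',n'} d\left( m',n' \right)} \quad (6) \]

\[ H\left( x \right) = -\sum_{m,n} p\left( m,n \right) \log p\left( m,n \right) \quad (7) \]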
The average entropy values of the sample points are calculated separately for the two sets of sample points segregated by class labels, resulting in bar charts of entropy values for the attention-shifting and attention-attenuation sample point sets generated by different attack methods, as shown in Fig. 3.
The entropy values of the attention-shifting sample set are consistently lower than those of the attention-attenuation sample set across almost all attack methods. The bar chart depicting entropy in Fig. 3 demonstrates that, within the high-dimensional feature space of Wide-ResNet, adversarial attacks result in a more concentrated distribution of clean sample points with attention shift. In contrast, clean sample points affected by attention attenuation are comparatively dispersed in space. This observation validates our hypothesis: samples affected by attention shifts during adversarial attacks exhibit reduced intra-class distances and a denser distribution within the model’s high-dimensional feature space. The attack causes the model’s attention to shift from the correct class cluster to other class clusters. In contrast, samples impacted by attention attenuation tend to display greater intra-class distances, with sample points dispersed across various class clusters. This scattering effect, induced by the attacks, weakens the model’s attention in the high-dimensional space, causing it to diffuse to class clusters proximate to the sample point, and consequently, leading to misclassifications.
Building on the aforementioned findings, we have developed defense strategies specifically designed to counteract attention-shifting and attention-attenuation attacks, detailed as follows:
Feature Pyramid-based Attention Space-guided Defense Module (FPAS): This module is designed to counteract attention-shifting attacks, which fundamentally alter the model’s focal points within the high-dimensional feature space. Incorporating defensive mechanisms at the early stages of the model’s feedforward network is essential to mitigate such attacks, employing strategies such as disturbance removal and perturbation-distribution disruption to refocus attention on the appropriate areas of the image. To achieve this, we introduce the Feature Pyramid-based Attention Space-guided Defense Module. This module disrupts the distribution of adversarial perturbations using a feature pyramid while simultaneously extracting higher-dimensional image features. Additionally, it incorporates a channel attention mechanism to guide the model towards more relevant image features.
Attention-based Non-Local Defense Module (ANL): The fundamental cause of attention-attenuation attacks is the widespread dispersion of the dataset within the model’s high-dimensional feature space, leading to excessively narrow decision boundaries among various classes. A defense strategy against such attacks should therefore focus on exploiting internal information within image features to improve the model’s aggregation of high-dimensional characteristics for each class. In this context, we propose a defense module called Attention-based Non-Local (ANL), leveraging a self-attention mechanism to uncover dependencies within image features and strengthen the model’s focus on crucial regions of useful features. This module aims to bolster the model’s resilience against attention-attenuation attacks by enriching the amalgamation of high-dimensional features, thus augmenting the model’s focus on key feature regions.
Feature pyramid-based attention space-guided defense module (FPAS)
The feature pyramid structure was initially developed to enable neural network models to handle computer vision tasks on multi-scale images. By constructing pyramidal layers at different levels, feature pyramids can adapt to image features of varying scales, thereby maintaining a certain degree of scale invariance. This multi-scale feature fusion strategy has been widely applied in several fields, including video summarization47, UAV image object detection48,49, remote sensing image object detection50,51 and video saliency prediction52,53, demonstrating its effectiveness in improving model performance. For example, the AMFM model proposed by Zhang et al.54 optimizes the context capture and fusion process in video summarization through a multi-granularity fusion strategy. Similarly, CFANet enhances the efficiency and accuracy of UAV image object detection by aggregating cross-layer features, while the FFAGRNet model addresses scale variation issues through full-scale feature aggregation and grouped feature reconstruction. These studies illustrate that multi-scale feature fusion not only enhances the model’s perception of objects at different scales but also improves its robustness to complex backgrounds and random noise. Given the strengths of the feature pyramid, this study incorporates this structure into the design of the FPAS module.
The GridMask defense algorithm has proven that misclassifications caused by adversarial perturbations result from the collective impact of perturbations on each pixel of the image. Therefore, introducing dilated blocks in convolution can disrupt the distribution of adversarial perturbations, achieving a defensive effect. FPAS constructs a feature pyramid structure using dilated convolutions with different dilation rates to capture multi-scale features. Dilated convolution increases the receptive field by enlarging the spacing between sampling points in the convolutional kernel, avoiding the information loss caused by traditional feature pyramid structures that rely on upsampling and downsampling operations. Adversarial perturbations are small-scale local disturbances that induce changes at specific scales. The feature pyramid structure integrates feature layers from different scales, enabling the model to rely on multi-level information rather than a single scale for decision-making. This effectively disrupts the distribution of adversarial perturbations.
The overall structure of the FPAS module is illustrated in Fig. 4, consisting of a pyramid module and a channel attention module. The specific steps of the pyramid module are as follows:
1. Perform dilated convolution operations with three sets of 3\(\times\)3 kernels, each with dilation rates of 1, 3, and 5, resulting in three feature maps, denoted as \(f_{1}\), \(f_{2}\), and \(f_{3}\). Ensure that \(f_{1}\), \(f_{2}\), and \(f_{3}\) have the same dimensions by applying the proper strides and padding.
2. To maintain the continuity of image features while disrupting the distribution of adversarial perturbations, employ 1\(\times\)1 convolutions to smooth and fuse adjacent features, obtaining feature maps \(p_{1}\), \(p_{2}\), and \(p_{3}\).
3. Concatenate \(p_{1}\), \(p_{2}\), and \(p_{3}\) along the channel dimension and feed the resulting tensor into the channel attention module.
The dilation rates (1, 3, and 5) in the FPAS module are chosen to extract features at multiple scales, thereby enhancing the model’s robustness against adversarial perturbations. Lower dilation rates (e.g., 1) capture fine-grained local features, while higher dilation rates (e.g., 3 and 5) expand the receptive field to capture more global information. This multi-scale feature extraction provides a comprehensive representation of the spatial information in the image’s true features, counteracting subtle perturbations in adversarial examples. By using varied dilation rates, the FPAS module disrupts the distribution of adversarial perturbations while preserving image content, making attacks less effective across multi-scale features.
Due to the possibility of redundant and less informative features resulting from the fusion and concatenation operations in the pyramid module of FPAS, the channel attention module plays a crucial role in not only allocating channel weights for the feature maps but also eliminating these redundant features that contribute little to the model’s decision-making. The goal is to guide the model to focus more on the feature maps that are more useful for decision-making and have lower susceptibility to adversarial perturbations. The channel structure of SENet inspires the design of the channel attention module. The feature maps output from the feature pyramid layer are concatenated along the channel dimension and undergo global average pooling, compressing each channel into a single feature value. Subsequently, these values pass through a fully connected layer, allowing the model to learn channel-wise feature dependencies. Finally, the scaled channel-wise attention weights obtained from the fully connected layer are applied to the original input feature maps through channel-wise multiplication.
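To make the module structure concrete, the following PyTorch sketch illustrates one way the pyramid and channel attention stages described above could be assembled. The channel counts, the SE reduction ratio, and the pairwise fusion of adjacent branches are illustrative assumptions rather than the paper's released implementation.

```python
import torch
import torch.nn as nn

class FPAS(nn.Module):
    """Sketch of the Feature Pyramid-based Attention Space-guided module.

    Assumed choices (not from the paper's code): output channel width,
    SE reduction ratio, and fusing adjacent pyramid branches pairwise.
    """
    def __init__(self, in_ch=3, out_ch=64, reduction=16):
        super().__init__()
        # Three 3x3 dilated convolutions with dilation rates 1, 3, 5.
        # padding == dilation keeps the spatial size identical across branches.
        self.branch1 = nn.Conv2d(in_ch, out_ch, 3, padding=1, dilation=1)
        self.branch2 = nn.Conv2d(in_ch, out_ch, 3, padding=3, dilation=3)
        self.branch3 = nn.Conv2d(in_ch, out_ch, 3, padding=5, dilation=5)
        # 1x1 convolutions that smooth and fuse adjacent pyramid levels.
        self.fuse12 = nn.Conv2d(2 * out_ch, out_ch, 1)
        self.fuse23 = nn.Conv2d(2 * out_ch, out_ch, 1)
        self.fuse3 = nn.Conv2d(out_ch, out_ch, 1)
        # SENet-style channel attention over the concatenated pyramid output.
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(3 * out_ch, 3 * out_ch // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(3 * out_ch // reduction, 3 * out_ch),
            nn.Sigmoid(),
        )

    def forward(self, x):
        f1, f2, f3 = self.branch1(x), self.branch2(x), self.branch3(x)
        p1 = self.fuse12(torch.cat([f1, f2], dim=1))
        p2 = self.fuse23(torch.cat([f2, f3], dim=1))
        p3 = self.fuse3(f3)
        feats = torch.cat([p1, p2, p3], dim=1)        # (B, 3*out_ch, H, W)
        w = self.fc(self.pool(feats).flatten(1))      # learned channel weights
        return feats * w.unsqueeze(-1).unsqueeze(-1)  # re-weight channels

# Example: out = FPAS()(torch.randn(2, 3, 224, 224))  -> shape (2, 192, 224, 224)
```

In this sketch, the module replaces an early convolutional layer, so its output feeds directly into the remaining Wide-ResNet stages after an appropriate channel adjustment.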
Attention-based non-local defense module (ANL)
The root cause of attention attenuation in attacks is the model’s failure to perfectly differentiate between feature information from different classes in the high-dimensional feature space. We hypothesize that one possible reason for the small inter-class distances in high-dimensional features is the use of small receptive field sizes during model learning. These small fields only consider local region information, lacking a holistic understanding of the features of salient objects in the image. The Non-Local algorithm in the self-attention mechanism can introduce global information, enabling the model to capture global features in hidden layers.
In the convolutional layers of the network model, the input image feature x is processed through convolution and split into three branches, denoted as query(x), key(x), and value(x). The query(x) and key(x) branches compute the importance matrix of feature information. In this study, we perform an inner product operation on the query(x) and key(x) branches, obtaining the internal feature importance score f(x) for the input feature x. The calculation is shown in Eq. (8):
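One plausible form of Eq. (8), treating query(x) and key(x) as flattened spatial feature matrices whose inner product yields the pairwise importance scores (the original notation may differ):

\[ f\left( x \right) = query\left( x \right)^{\top} \, key\left( x \right) \quad (8) \]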
Using the softmax function to normalize f(x) yields the self-attention matrix describing the internal feature correlations within the image. Finally, after performing an inner product operation between the self-attention matrix and the value(x) branch and adding it to the residual connection branch, we obtain the output feature y, which has enhanced internal feature correlations through the Non-Local module. The calculation is shown in Eq. (9):
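Correspondingly, a plausible form of Eq. (9), with the softmax-normalized attention applied to value(x) and added back through the residual branch:

\[ y = \mathrm{softmax}\left( f\left( x \right) \right) \, value\left( x \right) + x \quad (9) \]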
To enhance the model’s ability to extract image features from the self-attention matrix under the interference of adversarial perturbations and to strengthen the model’s focus on key regions, we incorporate adversarial training methods during model training. Throughout the training process, corresponding clean samples and adversarial examples are input into the model, utilizing the Non-Local module to explore the correlation of internal features within each sample. The overall structure diagram of the ANL module is illustrated in Fig. 5.
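A minimal PyTorch sketch of a Non-Local block of this kind is given below; the 1\(\times\)1 projections, the channel-reduction factor, and returning the self-attention matrix (so that it can later be compared between clean and adversarial inputs) are assumptions made for illustration rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ANLBlock(nn.Module):
    """Sketch of the Attention-based Non-Local block."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        inner = channels // reduction
        self.query = nn.Conv2d(channels, inner, 1)
        self.key = nn.Conv2d(channels, inner, 1)
        self.value = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)  # (B, HW, C')
        k = self.key(x).flatten(2)                    # (B, C', HW)
        v = self.value(x).flatten(2).transpose(1, 2)  # (B, HW, C)
        att = F.softmax(torch.bmm(q, k), dim=-1)      # importance scores, then softmax
        y = torch.bmm(att, v).transpose(1, 2).reshape(b, c, h, w)
        return x + y, att                             # residual output and attention matrix
```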
To guide the model to enhance attention on regions in the image that are positively correlated with the model’s decision, we draw inspiration from ALP and introduce an attention loss into the regularization term of the model training loss function: \(\beta L_{a}(x_{clean},x_{adv})\), where \(x_{clean}\) represents the image features of clean samples, \(x_{adv}\) represents the image features of adversarial examples, and \(L_{a}(\cdot)\) is the similarity function of the self-attention matrix, representing the distance between the attention maps generated by clean and adversarial examples in the Non-Local module. In our experiments, we treated \(\beta\) as a hyperparameter and tested multiple values (0.001, 0.005, 0.01, and 0.05); setting \(\beta\) to 0.01 effectively reduced overfitting while enhancing the model’s overall performance.
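Concretely, the resulting training objective can be sketched as follows, assuming the network returns both logits and the ANL self-attention matrix, and using a mean-squared distance as one possible choice for \(L_{a}\) (the exact similarity function is not specified here):

```python
import torch.nn.functional as F

def training_loss(model, x_clean, x_adv, labels, beta=0.01):
    # model is assumed to return (logits, attention) with the ANL attention matrix.
    logits_clean, att_clean = model(x_clean)
    logits_adv, att_adv = model(x_adv)
    # Classification terms on the paired clean and adversarial samples.
    ce = F.cross_entropy(logits_clean, labels) + F.cross_entropy(logits_adv, labels)
    # Attention-consistency regularizer: beta * L_a(x_clean, x_adv).
    l_a = F.mse_loss(att_adv, att_clean)
    return ce + beta * l_a
```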
Experimental results and analysis
Experiment settings
Datasets: To demonstrate the generality of the proposed defense module, all defense-related experiments in this study were validated on two publicly available datasets: ImageNet and CIFAR-10. Due to limitations in GPU computational resources and for improved efficiency, this study followed the approach of CMC55, randomly sampling 100 classes from the original 1000 ImageNet classes to create the ImageNet-100 small dataset for experimentation. The ImageNet original training set was split into a training set and validation set in an 8:2 ratio, with the original ImageNet validation set used as the test set in this study. Additionally, the image sizes in the ImageNet dataset were standardized to 224\(\times\)224 during preprocessing in the experiments. The CIFAR-10 dataset consists of 10 classes, and the images are of size 32\(\times\)32, with standard training, validation, and test datasets provided by the official source. Detailed information about the experimental datasets is presented in Table 3.
Hyperparameters: The initial learning rate was set to 0.01, decaying by a factor of 0.1 every 2 epochs, with momentum of 0.9 and weight decay of 1e-4. The dropout rate was 0.3. Adversarial training started from epoch 0, with a max norm of 1 for perturbations and 10 steps for attacks. The SGD optimizer was used with these settings.
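For reference, these hyperparameters correspond to a standard PyTorch setup along the following lines (the model here is a placeholder; the experiments use the Wide-ResNet baselines defined below):

```python
import torch

model = torch.nn.Linear(8, 2)  # placeholder; the experiments train Wide-ResNet baselines
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-4)
# Decay the learning rate by a factor of 0.1 every 2 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.1)
```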
Baseline Models: In this study, the baseline models on both datasets are built upon the Wide-ResNet with parameter adjustments based on the dataset scale. The specific structural parameters for the two baseline models are outlined in Tables 4 and 5.
White-box defense performance
The Wide-ResNet models defined in Tables 4 and 5 are designated as baseline models. The new models obtained by replacing the Conv1 layer in the baseline with the FPAS module are denoted as FPAS models. Models incorporating the ANL module before the global average pooling layer in the baseline are referred to as ANL models. Adversarial training, known for its simplicity and effectiveness, is a fundamental configuration for defense models. The DDN algorithm is integrated into the training process of baseline, FPAS, and ANL models to generate adversarial examples. Fine-tuning involves pairing adversarial and clean samples and training the models for 20 epochs, resulting in defense models denoted as baseline_at, FPAS_at, and ANL_at.
To evaluate the white-box attack resilience of FPAS and ANL, we conduct adversarial attack experiments on ImageNet-100 and CIFAR-10 datasets using five white-box attack algorithms against the baseline_at, FPAS_at, and ANL_at models. The white-box experiments involve 10 iterations for each algorithm, and the defensive capabilities of the models are compared. The experimental results are illustrated in Figs. 6 and 7.
As illustrated in Fig. 6, on the CIFAR-10 dataset the defense capability of the ANL_at model excels during both the rapid escalation and convergence phases of attack strength. For instance, at the 10th attack iteration under I-FGSM, the baseline_at model achieves a recognition accuracy of 30.5%, while the ANL_at model achieves 35.8%. On the ImageNet-100 dataset, except for the PGD experiment group, the ANL_at model outperforms the baseline_at model in the initial stage of rapid attack strength escalation. As the intensity of the attacks stabilizes, the disparity in defensive performance between the models diminishes. Overall, across five different adversarial example sets from both datasets, models with the ANL module exhibit superior defense capabilities compared to the original models in white-box defense.
As illustrated in Fig. 7, the recognition accuracy of the FPAS_at model is consistently higher than that of the baseline_at model during both the rapid growth and convergence phases of attack strength. Although the PGD experiment group on the ImageNet-100 dataset shows a slight exception, where the defense capability of the FPAS_at model is inferior to the baseline_at model in the early rounds of attack iterations, as the attack strength stabilizes, the FPAS_at model’s defense capability becomes superior. In summary, considering five sets of adversarial examples from both datasets, models with the FPAS module effectively enhance resistance to various white-box adversarial attack algorithms.
Black-box defense performance
We pre-trained AlexNet and Vgg19 on the ImageNet-100 and CIFAR-10 datasets using clean samples as the training data, serving as alternative models for attack targets. Subsequently, we applied nine attack algorithms to attack the AlexNet and Vgg19 models, generating nine sets of adversarial examples. Finally, these nine sets of adversarial examples were fed into the baseline, baseline_at, FPAS_at, and ANL_at models, and the recognition accuracies of each model are presented in Tables 6 and 7.
From Tables 6 and 7, it can be observed that the FPAS_at model on the ImageNet-100 dataset improved the recognition accuracy of I-FGSM adversarial examples from 46.3% to 57.7%, and against the Square attack algorithm, the FPAS_at model achieved an accuracy of 68.6%. On the CIFAR-10 dataset, the defense performance of the FPAS_at model was superior to the baseline and baseline_at models. Compared to the baseline model, although the FPAS_at model experienced a 5.5% decrease in accuracy on clean samples, it achieved an improvement of over 20% in accuracy on most adversarial examples.
Tables 6 and 7 reveal that on the ImageNet-100 dataset, PGD attacks lowered the recognition accuracy of the baseline model to 49.7%, while the ANL_at model increased the accuracy by nearly 10% compared to the baseline, reaching 59.4%. Overall, the ANL_at model demonstrated a significantly enhanced ability to resist each attack algorithm compared to the baseline model. When compared to the baseline_at model, which incorporated adversarial training alone, the ANL_at model exhibited higher defense capabilities. Due to the enhancement of the model’s attention on critical areas by the ANL module, even the recognition accuracy on clean samples increased from 72.2% to 74.3%. On the CIFAR-10 dataset, the ANL_at model further improved defense capabilities, achieving around a 5% performance boost against several transfer-based alternative attack algorithms.
As shown in Table 1, adversarial attacks such as I-FGSM, PGD, MI-FGSM, \(\hbox {DI}^2\)-FGSM, and TI-FGSM tend to generate more samples that cause shifts in the model’s attention. In contrast, attacks like FGSM, Deepfool, and C&W produce more samples that lead to attention attenuation. To address these effects, we propose the FPAS module to counteract attention shifts and the ANL module to mitigate attention decay. Consequently, the FPAS_at model demonstrates stronger defenses against I-FGSM, PGD, MI-FGSM, \(\hbox {DI}^2\)-FGSM, and TI-FGSM, whereas the ANL_at model is more effective against FGSM, Deepfool, and C&W attacks. These results confirm the effectiveness of our defense modules.
Additionally, statistical significance testing (p-value) was conducted to assess whether the success rates of the FPAS_at and ANL_at models showed significant improvement over the baseline_at on different adversarial examples. The results indicate that the p-values for FPAS_at and ANL_at on the adversarial example dataset derived from ImageNet-100 are 0.0004 and 1.06e-06, respectively, both considerably below the 0.05 threshold. The p-values for FPAS_at and ANL_at on the adversarial example dataset derived from CIFAR-10 are 0.002 and 0.0005, respectively, both considerably below the 0.05 threshold. These findings confirm that the FPAS_at and ANL_at models achieve statistically significant improvements in success rate over the baseline_at model, demonstrating superior effectiveness in handling adversarial examples.
Attention space guidance effect
We again use the Grad-CAM algorithm56 to visualize the attention of the FPAS model, comparing it with the attention of the baseline model. This is done to assess the defense effect of FPAS against attention-shifting attacks from the perspective of changes in attention spatial positions. Clean samples and nine sets of adversarial examples are separately fed into the baseline and FPAS models. The generated attention matrices by Grad-CAM are denoted as \(Att^{baseline}_{clean}\), \(Att^{FPAS}_{clean}\), \(Att^{baseline}_{adv}\) and \(Att^{FPAS}_{adv}\). We focus on the attention given by the models to the core regions of the images. Therefore, using a threshold of 0.5, we binarize the attention matrices:
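A plausible form of Eq. (10), applying the 0.5 threshold element-wise to each attention matrix (the original formulation may differ):

\[ Att'\left( x_{i} \right) = \mathbb{1}\left[ Att\left( x_{i} \right) \ge 0.5 \right] \quad (10) \]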
Next, we calculate the Intersection over Union (IoU) between the attention of clean samples and adversarial examples separately for the baseline and FPAS models. A smaller IoU indicates a more severe shift in the model’s core attention. The calculation is performed as shown in Eq. (11).
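Eq. (11) can plausibly be reconstructed as the ratio of overlapping to combined binarized attention regions for each model:

\[ IoU^{model}\left( x_{i} \right) = \frac{\left| Att'^{\,model}_{clean}\left( x_{i} \right) \cap Att'^{\,model}_{adv}\left( x_{i} \right) \right|}{\left| Att'^{\,model}_{clean}\left( x_{i} \right) \cup Att'^{\,model}_{adv}\left( x_{i} \right) \right|} \quad (11) \]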
Among all samples \(x_{i}\) exhibiting attention shift on the ImageNet-100 and CIFAR-10 datasets, we counted the number of samples for which \(IoU^{FPAS}(x_{i})>IoU^{baseline}(x_{i})\). Finally, we calculated the success rate \(Num\_Rate\), representing the proportion of samples for which FPAS successfully pulls attention back, and the average core attention area \(IoU_{avg}\) for each sample across all sample groups. The experimental results are presented in Tables 8 and 9.
On the ImageNet-100 dataset, the FPAS module successfully guided attention in 64.2% of adversarial examples, with the average core attention area per sample elevated to 78.8%. Similarly, on the CIFAR-10 dataset, FPAS successfully guided attention in 59.9% of samples, and the average core attention area \(IoU_{avg}\) per sample was increased by 2.4% with the FPAS module.
Observing Fig. 8, the changes in attention on adversarial examples before and after defense with the FPAS module are evident. Adversarial attacks shifted attention away from the correct object region in samples on the baseline model. However, incorporating the FPAS module successfully guided the shift of attention back to the correct areas.
Attention enhancement effect
After obtaining IoU according to Formula (11), to more accurately assess the effect of attention enhancement while excluding the influence of attention shifting, we calculated the average core attention matrix \(Att'(x_{i}),\) denoted as \(Avg^{baseline}_{clean}\), \(Avg^{ANL}_{clean}\), \(Avg^{baseline}_{adv}\) and \(Avg^{ANL}_{adv}\), for both clean and adversarial examples in the baseline and ANL models under IoU thresholds of 60%, 70%, and 80%. Following this, we defined the change in core attention values for an image \(x_{i}\) before and after the attack in the model as \(Weak(x_{i})\), calculated as follows:
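One plausible closed form for \(Weak(x_{i})\), with the sign convention chosen so that larger values indicate weaker attenuation of core attention after the attack (consistent with the comparison used below), is:

\[ Weak_{model}\left( x_{i} \right) = Avg^{model}_{adv}\left( x_{i} \right) - Avg^{model}_{clean}\left( x_{i} \right) \]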
For each sample, we independently determined the attention change values on the CIFAR-10 and ImageNet-100 datasets. The average attention change value for each pixel in the core region per image, before and after using the ANL module, is denoted as \(Defense\_Value\).
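A plausible form of \(Defense\_Value\), averaging the per-sample difference in attention change between the ANL and baseline models (the exact formulation may differ), is:

\[ Defense\_Value = \frac{1}{img\_count} \sum_{i=1}^{img\_count} \left( Weak_{ANL}\left( x_{i} \right) - Weak_{baseline}\left( x_{i} \right) \right) \]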
Where \(img\_count\) represents the total number of samples in the dataset, we calculate the number of samples in the dataset for which \(Weak_{ANL}(x_{i})>Weak_{baseline}(x_{i})\) and denote it as \(Att\_Count\). The proportion of samples with enhanced attention, represented by the Attention Enhancement Ratio \(Num\_Rate\), can be expressed by formula (15):
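From the definitions above, formula (15) is simply the fraction of samples whose core attention is enhanced by ANL:

\[ Num\_Rate = \frac{Att\_Count}{img\_count} \quad (15) \]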
The experimental results, as shown in Tables 10 and 11, reveal that after employing the ANL module for defense, over 60% of the samples witness an enhancement in the attention values of their key regions, both in the ImageNet-100 and CIFAR-10 datasets. Taking the case of \(IoU > 60\%\), on the ImageNet-100 dataset, an average of 65.2% of the samples experience an enhancement in the attention values of their key regions, with an average attention increase of 0.019 per pixel across all samples. In the CIFAR-10 dataset, when \(IoU > 80\%\), the proportion of samples with attention enhancement peaks at 70.2%, with an average attention increase of 0.024 per pixel in the key regions.
The three sets of adversarial examples are incorrectly classified by the baseline model, as Fig. 9 demonstrates. However, after integrating the ANL defense module, the key regions’ attention for all three sets of adversarial examples is enhanced. Taking the example of the corgi sample, the baseline model assigns an attention value of 2246.0 to the corgi region, misclassifying it as a toy terrier. In contrast, the ANL model elevates the attention to 3904.8, correctly classifying the corgi sample after enhancing the model’s attention to the key region. Overall, the proposed Attention-based Non-Local defense module significantly improves the model’s focus on crucial features in the image, effectively countering attention decay-type attacks.
Performance comparison of integrated defense strategies
Utilizing the boosting framework for ensemble learning, we integrated the FPAS and ANL modules on the Wide-ResNet baseline network, constructing the defense model boosting(FPAS, ANL). Subsequently, we selected prominent algorithms in the field, namely DDN_at, GridMask, ALP, TRADES and AAD, to compare their integrated defense solutions against our boosting(FPAS, ANL) approach. The experimental results are presented in Tables 12 and 13.
The results from the experiments in the tables indicate that, among all the ensemble solutions, the boosting(FPAS, ANL) approach exhibits the best performance. Even on clean samples, our boosting(FPAS, ANL) model achieved improvements of 2.3% and 5.4% on the ImageNet-100 and CIFAR-10 datasets, respectively, compared to the baseline model. This research also demonstrates that not all defense algorithms can be successfully combined through boosting in order to create a more resilient defense model. For instance, on the ImageNet-100 dataset, the integrated models of GridMask and ALP, under Deepfool attack, achieve a recognition accuracy of 62.8%, whereas individually, their accuracies reach 70.4% and 70.1%, respectively. The integrated models experienced a decline in defense efficacy. We attribute this phenomenon to the fact that GridMask and ALP, in their original designs, target similar types of adversarial attack algorithms. This similarity leads to a situation where their defense strategies, when integrated, cannot effectively counter a broader range of attack algorithms.
In contrast, the superior performance of our boosting ensemble defense, combining the FPAS and ANL strategies, can be attributed to the fact that these two defense strategies were not initially biased toward specific attack algorithms. Instead, they were designed with a focus on adversarial attack effects, proposing defense strategies tailored to two distinct types of attention changes. Therefore, when integrating these two defense strategies, they complement each other, each having its own emphasis and compatibility, resulting in superior collaborative defense performance.
Statistical significance testing confirmed that the boosting(FPAS, ANL) approach achieved significant improvements over the baseline model, with p-values of 7.99e-11 for ImageNet-100 and 0.001 for CIFAR-10, both below the 0.05 threshold. These results highlight the effectiveness of combining complementary strategies like FPAS and ANL, which target distinct adversarial effects, leading to superior collaborative defense performance.
Effectiveness of integrating our approach with existing methods
The experimental setup mirrors that of the black-box defense experiments. We combined several existing defense strategies, including defensive distillation (AdaAD, AdaAD_IAD), ALP, GridMask, and TRADES, with our proposed method (boosting(FPAS, ANL)) to evaluate their combined effectiveness.
The experimental results, presented in Table 14, indicate that integrating our method with existing defense strategies significantly enhances the average defense success rate across nine black-box attack scenarios. Specifically, Ours+ALP achieves a 2.2% improvement over ALP, Ours+GridMask demonstrates a 2.7% enhancement over GridMask, and Ours+TRADES achieves a notable 6% increase over TRADES. The most effective combinations are Ours+AdaAD and Ours+AdaAD_IAD, which improve the average defense success rate by 7.1% and 1.8%, respectively, compared to our baseline method (boosting(FPAS, ANL)).
These findings highlight the flexibility and efficacy of our proposed approach. By integrating with various existing defense strategies, our method consistently enhances overall defense performance, providing a comprehensive solution for defending against adversarial attacks.
Evaluation of robustness against adversarial patches
Adversarial patches are meticulously optimized contiguous pixel blocks within an input image, designed to cause a machine learning model to misclassify the image. When integrated into input images, these adversarial patches generate out-of-distribution samples. This occurs because the injected patch introduces a spurious correlation with the target label, which often shifts the input sample away from the manifold of natural images. Adversarial patches can be physically realized as stickers and placed on real-world objects. Evaluating the robustness of machine learning models against these attacks is of paramount importance, given their potential to critically impact real-world applications with significant physical consequences.
The experimental setup mirrors that of the black-box defense experiments. We employed the adversarial patch generation method proposed by Pintor et al.57 to embed the generated patches into the ImageNet-100 dataset, and used the resulting samples to assess the robustness of the models against adversarial patch attacks. Comparing Tables 6 and 15, we can see that embedding adversarial patches in the adversarial examples significantly reduces baseline accuracy, while combining the defenses with adversarial training increases the defense success rate. Our FPAS_at and ANL_at models improved the average defense rate by 5.47% and 7.74%, respectively, compared with baseline_at when facing samples embedded with adversarial patches.
Cost of the defense module
On a GeForce RTX 2080 Ti machine with 16GB of RAM, we conducted full-load training of baseline, FPAS, ANL, and boosting (FPAS, ANL) models, incorporating adversarial training strategies on both ImageNet-100 and CIFAR-10 datasets. As shown in Table 16, compared to the baseline model, the ANL_at model increased the training time per iteration by 2m49s and 35s on the ImageNet-100 and CIFAR-10 datasets, respectively. The FPAS_at model increased the training time per iteration by 24s on both datasets, while the boosting(FPAS, ANL) model increased the training time per iteration by 3m18s and 35s on the two datasets, respectively. Overall, the training time per epoch for our models has increased slightly, but the magnitude of this increase is minimal. Such training time costs are deemed acceptable for constructing model defense solutions.
Conclusion
Our study significantly advances the understanding of adversarial attacks on Deep Neural Networks (DNNs) by dissecting the mechanisms through which adversarial examples manipulate model attention, leading to misclassifications. We identified two primary categories of adversarial impacts on DNNs: those inducing shifts in the model’s attention and those resulting in its attenuation. To counteract these adversarial tactics, we introduced two innovative defense modules: the Feature Pyramid-based Attention Space-guided (FPAS) module and the Attention-based Non-Local (ANL) module. The FPAS module reduces the shifting attention in adversarial examples, improving the model’s defense performance. The ANL module, on the other hand, enhances the model’s focus on critical features, efficiently constructing a robust defense model with low implementation cost and minimal intrusion into the original model. Through extensive experiments, including white-box and black-box defenses on the ImageNet-100 and CIFAR-10 datasets, we validated the superior performance of our proposed defense modules.
Our boosting(FPAS, ANL) model demonstrates strong robustness against nine types of black-box adversarial attacks, achieving an average accuracy of 65.3% on the ImageNet-100 dataset and 66.6% on the CIFAR-10 dataset. This represents improvements of 5.8% and 4.1%, respectively, compared to the baseline_at model. Our FPAS_at and ANL_at models demonstrated significant improvements over the baseline, enhancing the average defense rate by 5.47% and 7.74%, respectively, when facing samples embedded with adversarial patches. Moreover, by employing an ensemble framework to construct a collaborative defense model, we compared it with other mainstream defense algorithms in the industry. The results showed that our universal defense strategy can provide more robust and comprehensive defense capabilities at a lower implementation cost compared to current mainstream defense methods. This work not only elucidates the role of attention mechanisms in the vulnerability of DNNs to adversarial examples but also sets a new benchmark for the development of robust, interpretable defense strategies that safeguard the integrity of deep learning models in high-stakes applications.
Future work can explore the integration of these defense mechanisms with other model architectures, such as transformers, and assess their applicability to a broader range of adversarial attacks. This may include testing against different types and complexities of adversarial attacks, as well as evaluating their performance in various application domains like autonomous driving and medical diagnostics. Due to the design of FPAS and ANL in multi-scale feature fusion and enhancement of key feature attention, they possess the potential for cross-domain applications. These modules can be integrated into other models and applied to fields such as object detection, semantic segmentation, natural language processing, and multimodal tasks. Additionally, research can further optimize these defense modules to reduce their impact on model performance while maintaining or improving their adversarial robustness. These directions will not only drive advancements in the field of adversarial defense but also provide stronger protection for deep learning models in practical applications.
Discussion
Limitations
Despite the strong defense capabilities demonstrated by the proposed algorithms, several limitations must be acknowledged. While the introduced modules add minimal computational overhead, they still result in a modest increase in training time compared to the baseline model. This trade-off between robustness and efficiency makes the modules less suitable for real-time processing or rapid deployment, particularly in resource-constrained systems. Therefore, an ideal adversarial defense algorithm should strike a balance between robustness and efficiency, minimizing implementation costs while maintaining strong defensive performance.
Moreover, although the strategies show robustness against a wide range of adversarial perturbations, vulnerabilities may emerge when confronted with highly adaptive or dynamic adversarial attacks. These include adversaries employing model-based feature disruption or dynamic attack patterns specifically designed to exploit the attention mechanisms leveraged by the modules. Such attacks may undermine the model’s robustness by dynamically shifting focus or camouflaging perturbations in critical regions. Future experiments will focus on validating the defense strategies against these adaptive adversarial methods, providing valuable insights into the robustness and generalizability of the proposed modules.
Additionally, the defense schemes require retraining the original model, which introduces implementation overhead. While preprocessing techniques could potentially bypass retraining, they often result in limited defense effectiveness and remain vulnerable to sophisticated adversarial examples. This retraining requirement poses a challenge for the practical deployment of the proposed methods in certain scenarios. Future work should aim to address these limitations by exploring adaptive mechanisms to counter dynamic adversarial attacks, extending the methodology to non-image domains, and developing universal attention-guided modules compatible with pre-trained models. These modules would integrate robust defense capabilities without the need for retraining, thereby reducing implementation overhead. Furthermore, prioritizing lightweight implementations will enhance the universality and practicality of the proposed defense strategies while ensuring strong performance across a wide range of applications.
Future directions and scalability
The scalability of the proposed modules to more complex architectures, such as transformers, represents a key direction for future research. Transformer-based models, such as Vision Transformer (ViT) and Swin Transformer, exhibit inherent robustness due to their large parameter capacity and unique architectural properties. This robustness makes it inherently difficult to design effective adversarial attacks against these models, as their extensive training and hierarchical feature processing tend to mitigate conventional perturbations. However, the core reliance on self-attention mechanisms also introduces specific vulnerabilities that adversaries can exploit. Existing attack strategies targeting Transformers include Attention Hijacking Attacks, which manipulate attention weights to disrupt the focus on critical features, and Patch-based Adversarial Attacks, which exploit the patch-based processing structure of vision models. These attacks, although challenging to design, can significantly compromise the model’s performance by targeting its most critical computational components.
Our proposed FPAS and ANL modules provide initial steps toward addressing these challenges. FPAS strengthens multi-scale feature extraction, mitigating the effects of localized perturbations, while ANL reinforces global feature interactions, reducing susceptibility to attention hijacking. Nevertheless, designing effective defenses for Transformer-specific vulnerabilities remains an open problem. Future work will explore novel defense strategies tailored to the unique characteristics of Transformer architectures, focusing on the interplay between large parameter spaces, hierarchical processing, and adversarial resilience. These efforts aim to deepen our understanding of Transformer robustness while advancing the development of adaptive and scalable defense solutions.
Cross-domain applications and future validation
The proposed modules exhibit potential adaptability beyond their initial application in adversarial defense for image classification. For example, in medical imaging, they could potentially enhance resilience to adversarial noise in diagnostic tasks, such as analyzing radiographic or MRI scans, where perturbations might significantly impact clinical decisions. Similarly, in autonomous driving, the modules may improve the robustness of real-time object detection systems against adversarial patches or occlusions. In remote sensing, their applicability could be explored in UAV-based object detection under diverse environmental conditions and adversarial scenarios.
Planned experiments will aim to systematically evaluate the modules on datasets such as CheXpert for medical imaging, KITTI for autonomous driving, and DOTA for remote sensing. These studies will focus on assessing robustness across diverse input data distributions, with the objective of validating the modules’ potential generalizability in practical applications.
Failure cases
Understanding failure cases is essential for refining adversarial defense strategies. We define set A as the samples the baseline model classifies correctly on clean inputs, and set B as the samples it misclassifies on adversarial inputs. The intersection of A and B, denoted set C, contains the samples on which the adversarial attack compromises the baseline model, that is, correct clean predictions that the attack flips. Our integrated model, combining FPAS and ANL, misclassifies a subset of adversarial examples, denoted set D. The intersection of C and D, designated set F, comprises the failure cases in which the proposed defense is also compromised.
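For concreteness, these sets can be computed directly from model predictions. The listing below is a minimal sketch in PyTorch; the tensor names (clean_logits_base, adv_logits_base, adv_logits_ours, labels) are hypothetical placeholders for the baseline and FPAS+ANL outputs on the same test batch rather than part of our implementation.

import torch

def failure_sets(clean_logits_base, adv_logits_base, adv_logits_ours, labels):
    """Boolean masks for the sets A, B, C, D and F described above."""
    pred_clean_base = clean_logits_base.argmax(dim=1)  # baseline predictions on clean inputs
    pred_adv_base = adv_logits_base.argmax(dim=1)      # baseline predictions on adversarial inputs
    pred_adv_ours = adv_logits_ours.argmax(dim=1)      # FPAS+ANL predictions on adversarial inputs

    A = pred_clean_base == labels   # baseline correct on the clean input
    B = pred_adv_base != labels     # baseline wrong on the adversarial input
    C = A & B                       # the attack flips a correct baseline prediction
    D = pred_adv_ours != labels     # integrated model wrong on the adversarial input
    F = C & D                       # the proposed defense is also compromised
    return {"A": A, "B": B, "C": C, "D": D, "F": F}

Given sets = failure_sets(...), the expression sets["F"].float().mean() reports the fraction of test samples on which the integrated defense still fails, which is the quantity analyzed in the remainder of this subsection.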
Through comprehensive analysis using attention visualization and statistical evaluation, several critical challenges were identified. Subtle adversarial perturbations often evade detection, exploiting the model’s inherent limitations in attention sensitivity. Additionally, significant spatial overlaps between the attention regions of clean and adversarial examples undermine the effectiveness of attention-guided defenses. The FPAS and ANL modules also exhibit vulnerabilities to dynamic or localized perturbations, which can overwhelm their response mechanisms. Furthermore, attacks targeting specific feature scales or employing class-agnostic strategies further degrade the robustness of these modules.
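One way to quantify the spatial overlap noted above is to binarize the attention maps of a clean sample and its adversarial counterpart (for example, Grad-CAM++ heatmaps) and measure their intersection-over-union. The sketch below assumes the heatmaps have already been computed and normalized to [0, 1]; the function name and the quantile threshold are illustrative choices, not the exact protocol used in our evaluation.

import numpy as np

def attention_overlap(att_clean: np.ndarray, att_adv: np.ndarray, q: float = 0.8) -> float:
    """IoU of the high-attention regions of two 2-D heatmaps normalized to [0, 1]."""
    mask_clean = att_clean >= np.quantile(att_clean, q)  # top-attention region of the clean sample
    mask_adv = att_adv >= np.quantile(att_adv, q)        # top-attention region of the adversarial sample
    union = np.logical_or(mask_clean, mask_adv).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(mask_clean, mask_adv).sum() / union)

A high overlap for misclassified adversarial examples is consistent with the observation that attention-guided defenses lose discriminative power when the attack barely moves the attended region.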
To address these failure cases, future work should focus on enhancing dynamic adaptation, multi-scale feature refinement, and localized perturbation handling. Adaptive attention mechanisms can be developed to better respond to dynamic or localized perturbations, with temporal and spatial coherence checks to mitigate such attacks. The FPAS module can be refined to improve its sensitivity to subtle perturbations and its ability to detect adversarial patterns across multiple scales, potentially through hierarchical feature weighting or amplification. Additionally, enhancing the ANL module’s robustness against localized attacks with techniques like saliency filtering or adversarial region masking is necessary. Analyzing the failure patterns within set F will provide valuable insights to guide the development of specialized countermeasures, such as tailored adversarial training datasets and defense strategies for specific attack classes. These iterative refinements will help improve the defense models, ensuring their reliability and effectiveness against increasingly sophisticated adversarial threats.
Data availability
The data that support the findings of this study are available from the corresponding author upon reasonable request.
References
Touvron, H. et al. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning 10347–10357 (PMLR, 2021).
Wang, C.-Y., Bochkovskiy, A. & Liao, H.-Y. M. Scaled-yolov4: scaling cross stage partial network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 13029–13038 (2021).
Zheng, S. et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 6881–6890 (2021).
Otter, D. W., Medina, J. R. & Kalita, J. K. A survey of the usages of deep learning for natural language processing. IEEE Trans. Neural Netw. Learn. Syst. 32, 604–624 (2020).
Szegedy, C. et al. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199 (2013).
Liu, Z. et al. Hygloadattack: Hard-label black-box textual adversarial attacks via hybrid optimization. Neural Netw. 2024, 106461 (2024).
Zhou, W. et al. Hidim: A novel framework of network intrusion detection for hierarchical dependency and class imbalance. Comput. Secur. 148, 104155 (2025).
Gowal, S. et al. Achieving robustness in the wild via adversarial mixing with disentangled representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 1211–1220 (2020).
Farnia, F., Zhang, J. M. & Tse, D. Generalizable adversarial training via spectral normalization. arXiv preprint arXiv:1811.07457 (2018).
Rony, J. et al. Decoupling direction and norm for efficient gradient-based l2 adversarial attacks and defenses. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 4322–4330 (2019).
Wang, J. & Zhang, H. Bilateral adversarial training: towards fast training of more robust models against adversarial attacks. In Proceedings of the IEEE/CVF International Conference on Computer Vision 6629–6638 (2019).
Chan, A., Tay, Y., Ong, Y. S. & Fu, J. Jacobian adversarially regularized networks for robustness. arXiv preprint arXiv:1912.10185 (2019).
He, L. et al. Boosting adversarial robustness via self-paced adversarial training. Neural Netw. 167, 706–714 (2023).
Hsiung, L., Tsai, Y.-Y., Chen, P.-Y. & Ho, T.-Y. Towards compositional adversarial robustness: generalizing adversarial training to composite semantic perturbations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 24658–24667 (2023).
Yang, S. & Xu, C. One size does not fit all: data-adaptive adversarial training. In European Conference on Computer Vision 70–85 (Springer, 2022).
Jin, G., Yi, X., Wu, D., Mu, R. & Huang, X. Randomized adversarial training via taylor expansion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 16447–16457 (2023).
Das, N. et al. Shield: Fast, practical defense and vaccination for deep learning using jpeg compression. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining 196–204 (2018).
Sun, B., Tsai, N.-h., Liu, F., Yu, R. & Su, H. Adversarial defense by stratified convolutional sparse coding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 11447–11456 (2019).
Mustafa, A., Khan, S. H., Hayat, M., Shen, J. & Shao, L. Image super-resolution as a defense against adversarial attacks. IEEE Trans. Image Process. 29, 1711–1724 (2019).
Samangouei, P., Kabkab, M. & Chellappa, R. Defense-gan: Protecting classifiers against adversarial attacks using generative models. arXiv preprint arXiv:1805.06605 (2018).
Zhang, Y., Zhang, T., Wang, S. & Yu, P. An efficient perceptual video compression scheme based on deep learning-assisted video saliency and just noticeable distortion. Eng. Appl. Artif. Intell. 141, 109806 (2025).
Chen, P., Liu, S., Zhao, H., Wang, X. & Jia, J. Gridmask data augmentation. arXiv preprint arXiv:2001.04086 (2020).
Papernot, N., McDaniel, P., Wu, X., Jha, S. & Swami, A. Distillation as a defense to adversarial perturbations against deep neural networks. In 2016 IEEE Symposium on Security and Privacy (SP) 582–597 (IEEE, 2016).
Huang, B. et al. Boosting accuracy and robustness of student models via adaptive adversarial distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 24668–24677 (2023).
Pang, T. et al. Rethinking softmax cross-entropy loss for adversarial robustness. arXiv preprint arXiv:1905.10626 (2019).
Gu, S. & Rigazio, L. Towards deep neural network architectures robust to adversarial examples. arXiv preprint arXiv:1412.5068 (2014).
Goodfellow, I. et al. Generative adversarial networks. Commun. ACM 63, 139–144 (2020).
Kannan, H., Kurakin, A. & Goodfellow, I. Adversarial logit pairing. arXiv preprint arXiv:1803.06373 (2018).
Zhang, H. et al. Theoretically principled trade-off between robustness and accuracy. In International Conference on Machine Learning 7472–7482 (PMLR, 2019).
Wu, S. et al. Attention, please! adversarial defense via attention rectification and preservation. arXiv preprint arXiv:1811.09831 (2018).
Deng, T. & Zeng, Z. Generate adversarial examples by spatially perturbing on the meaningful area. Pattern Recogn. Lett. 125, 632–638 (2019).
Chen, X. et al. Feature distillation in deep attention network against adversarial examples. IEEE Trans. Neural Netw. Learn. Syst. 34, 3691–3705 (2021).
Zhu, J. et al. Attention-guided transformation-invariant attack for black-box adversarial examples. Int. J. Intell. Syst. 37, 3142–3165 (2022).
Wang, J., Liu, A., Bai, X. & Liu, X. Universal adversarial patch attack for automatic checkout using perceptual and attentional bias. IEEE Trans. Image Process. 31, 598–611 (2021).
Chen, S., He, Z., Sun, C., Yang, J. & Huang, X. Universal adversarial attack on attention and the resulting dataset damagenet. IEEE Trans. Pattern Anal. Mach. Intell. 44, 2188–2197 (2020).
He, M., Cui, M., Liang, Y. & Liu, H. Aeaed: Attention-enhanced autoencoder for adversarial example detection with multi-scale feature learning. J. Intell. Knowl. Eng. 2, 99 (2024).
Yu, L., Zhang, H. & Xu, C. Text-guided attention is all you need for zero-shot robustness in vision-language models. arXiv preprint arXiv:2410.21802 (2024).
Feng, W., Xu, N., Zhang, T., Zhang, Y. & Wu, F. Enhancing cross-task transferability of adversarial examples via spatial and channel attention. IEEE Trans. Multimedia (2024).
Li, Q. et al. Attention-sa: Exploiting model-approximated data semantics for adversarial attack. IEEE Trans. Inf. Forens. Secur. (2024).
Hu, J., Shen, L. & Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 7132–7141 (2018).
Wang, Q. et al. Eca-net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 11534–11542 (2020).
Woo, S., Park, J., Lee, J.-Y. & Kweon, I. S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV) 3–19 (2018).
Wang, X., Girshick, R., Gupta, A. & He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 7794–7803 (2018).
Chen, S.-B. et al. Remote sensing scene classification via multi-branch local attention network. IEEE Trans. Image Process. 31, 99–109 (2021).
Misra, D., Nalamada, T., Arasanipalai, A. U. & Hou, Q. Rotate to attend: Convolutional triplet attention module. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision 3139–3148 (2021).
Li, H. et al. A defense method based on attention mechanism against traffic sign adversarial samples. Inf. Fusion 76, 55–65 (2021).
Zhang, Y., Liu, Y., Kang, W. & Tao, R. Vss-net: Visual semantic self-mining network for video summarization. IEEE Trans. Circ. Syst. Video Technol. 34, 2775–2788 (2024).
Zhang, Y., Wu, C., Guo, W., Zhang, T. & Li, W. Cfanet: Efficient detection of uav image based on cross-layer feature aggregation. IEEE Trans. Geosci. Remote Sens. 61, 1–11 (2023).
Zhang, Y., Wang, S., Zhang, Y. & Yu, P. Asymmetric light-aware progressive decoding network for rgb-thermal salient object detection. J. Electron. Imaging 34, 013005 (2025).
Zhang, Y., Liu, T., Yu, P., Wang, S. & Tao, R. Sfsanet: Multiscale object detection in remote sensing image based on semantic fusion and scale adaptability. IEEE Trans. Geosci. Remote Sens. 62, 586 (2024).
Zhang, Y., Zhen, J., Liu, T., Yang, Y. & Cheng, Y. Adaptive differentiation siamese fusion network for remote sensing change detection. IEEE Geosci. Remote Sens. Lett. (2024).
Zhang, Y., Wu, C., Zhang, T. & Zheng, Y. Full-scale feature aggregation and grouping feature reconstruction based uav image target detection. IEEE Trans. Geosci. Remote Sens. (2024).
Zhang, Y., Zhang, T., Wu, C. & Tao, R. Multi-scale spatiotemporal feature fusion network for video saliency prediction. IEEE Trans. Multimedia 26, 4183–4193 (2024).
Zhang, Y., Liu, Y. & Wu, C. Attention-guided multi-granularity fusion model for video summarization. Expert Syst. Appl. 249, 123568 (2024).
Tian, Y., Krishnan, D. & Isola, P. Contrastive multiview coding. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI 16 776–794 (Springer, 2020).
Chattopadhay, A., Sarkar, A., Howlader, P. & Balasubramanian, V. N. Grad-cam++: generalized gradient-based visual explanations for deep convolutional networks. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV) 839–847 (IEEE, 2018).
Pintor, M. et al. Imagenet-patch: A dataset for benchmarking machine learning robustness against adversarial patches. Pattern Recogn. 134, 109064 (2023).
Acknowledgements
This work was supported in part by the National Key Research and Development Program of China under Grant 2023YFC3010302, the National Natural Science Foundation of China under Grant 82101079, and the Key R&D Program of Jiangsu Province under Grant BE2023836 (Corresponding author: Yining Hu). The first two authors contributed equally to this work.
Author information
Contributions
Jiawei Zhao, Siqi Gu, and Zihan Qin drafted the manuscript. Lizhe Xie designed the methodology and experiments, while Jiawei Zhao and Siqi Gu carried out the experiments. Zihan Qin also prepared the figures and tables. Zheng Wang and Yuning Zhang contributed to the manuscript review and editing. Yining Hu and Lizhe Xie secured funding, with Yining Hu further providing validation and supervision. All authors reviewed and approved the final manuscript.
Corresponding author
Correspondence to Yining Hu.
Ethics declarations
Competing Interests
The authors have no competing interests to declare that are relevant to the content of this article. The authors have no relevant financial or non-financial interests to disclose. All authors certify that they have no affiliations with or involvement in any organization or entity with any financial interest or non-financial interest in the subject matter or materials discussed in this manuscript.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Zhao, J., Xie, L., Gu, S. et al. Universal attention guided adversarial defense using feature pyramid and non-local mechanisms. Sci Rep 15, 5237 (2025). https://doi.org/10.1038/s41598-025-89267-8