Abstract
While attention mechanisms significantly enhance feature representation in Convolutional Neural Networks (CNNs), existing approaches often suffer from limited receptive fields, insufficient directional modeling, and static fusion strategies that treat channel and spatial domains in isolation. To address these challenges, we propose the Dynamic Multi-Scale Channel-Spatial Attention (DMSCA) mechanism. This plug-and-play module synergistically integrates six cohesive components to achieve deep feature coupling. Specifically, DMSCA introduces Temperature-controlled Channel Attention (TCA) to dynamically regulate the sharpness of attention distributions via a learnable temperature parameter, and a Direction-aware Multi-scale Spatial Context Encoder (MSCE) that captures granular details across varying kernel sizes while preserving precise positional cues through orthogonal interaction. Crucially, unlike fixed-structure methods such as CBAM, our Dynamic Feature Fusion (DFF) employs a learnable gating mechanism to adaptively weight and fuse channel-spatial information based on pixel-wise input content. Extensive experiments on CIFAR-10/100 and ImageNet demonstrate that DMSCA consistently outperforms state-of-the-art attention mechanisms. Notably, it achieves a 1.52% Top-1 accuracy gain on ImageNet with a ResNet-50 backbone. Detailed analyses confirm that DMSCA offers superior robustness to image degradation and stronger generalization, at a modest computational cost (an 11.3% increase in parameters and a 2.4% increase in FLOPs).
Introduction
Convolutional Neural Networks (CNNs)1,2 have revolutionized tasks including image classification3,4, object detection5,6,7, and semantic segmentation8,9,10 by learning hierarchical features autonomously. Nevertheless, conventional CNNs assign equal importance to all channels and spatial locations, hindering their focus on critical information and thus impacting feature quality and model performance.
Inspired by human vision, attention mechanisms11,12,13 have been incorporated into CNNs to enable selective focus, dynamically weighting features to boost performance, robustness, and interpretability with minimal complexity overhead. Notable advancements include SE-Net14 for channel attention via global pooling and fully connected layers, CBAM15 for sequential channel-spatial attention, ECA-Net16 for efficient cross-channel interactions, and Coordinate Attention (CA)17 for position-sensitive long-range dependencies. Beyond high-level vision tasks, recent scholarship has significantly expanded the utility of attention mechanisms into image restoration and hybrid architectures. For instance, Cui et al. investigated the potential of pooling techniques for universal image restoration18 and explored channel interactions to enhance feature recovery19. Furthermore, novel architectures such as dual-domain strip attention20 and Modumer21, which modulate Transformers for restoration tasks, demonstrate the evolving complexity and versatility of attention-based feature representation.
Despite this progress, several challenges persist: insufficient multi-scale context capture; simplistic channel-spatial fusion lacking dynamic collaboration; limited adaptability of attention weights; inadequate directional modeling; and the expressive limitations of fixed activation functions.
Addressing these, we propose the Dynamic Multi-scale Channel-Spatial Attention (DMSCA) mechanism. Our contributions are:
1. A synergistic multi-component framework encompassing global and multi-scale encoding, temperature-controlled channel attention, and direction-aware spatial attention.
2. A learnable Dynamic Feature Fusion module for non-linear, input-adaptive channel-spatial combination, surpassing linear or fixed fusions.
3. Rigorous validation on CIFAR-10/100 and ImageNet, showing DMSCA’s superior performance-efficiency balance over baselines.
4. In-depth ablation and visualization studies confirming component efficacy and enhanced interpretability.
The paper is organized as follows: Sect. 2 reviews related work; Sect. 3 details DMSCA; Sect. 4 outlines experimental setup; Sect. 5 presents results and analyses; Sect. 6 concludes.
Related work
Channel attention mechanisms
Channel attention mechanisms calibrate features by modeling inter-channel dependencies and assigning varying weights to emphasize important channels. SE-Net14 pioneered this approach using global pooling and fully connected layers, though dimensionality reduction can lead to information loss. ECA-Net16 enhances efficiency through local cross-channel interactions without reduction, while FCA22 incorporates frequency-domain analysis for richer representations. Despite these advances, many methods overlook spatial context in guiding channel weights and lack dynamic adjustment. DMSCA addresses this via its Global Context Encoder (GCE) and Temperature-controlled Channel Attention (TCA), introducing learnable temperature τ for adaptive scaling and deep spatial coupling.
Spatial attention mechanism
Spatial attention mechanisms emphasize key regions in feature maps, focusing on “where” informative content lies. Spatial Transformer Networks (STN)23 achieve this through affine transformations, albeit with high computational cost. CBAM’s spatial module15 generates attention maps via pooling and convolution but often loses channel details and directional sensitivity. Coordinate Attention (CA)17 improves by encoding long-range dependencies with positional awareness. However, existing approaches frequently neglect channel influences, directional interactions, and multi-scale contexts. DMSCA’s Multi-scale Spatial Context Encoder (MSCE) and Directional Information Interaction (DII) integrate these elements, enabling robust multi-scale and direction-aware spatial modeling.
Hybrid attention mechanism
Hybrid attention mechanisms synergize channel and spatial attention for comprehensive feature enhancement. CBAM15 applies them sequentially, while BAM24 uses parallel combination with summation. Triplet Attention25 incorporates cross-dimensional interactions, and SimAM26 infers neuron importance parameter-freely via an energy function. Nonetheless, fusion strategies in these methods are often fixed and linear, restricting adaptability. DMSCA introduces a Dynamic Feature Fusion (DFF) module that learns spatially adaptive weights and incorporates directional interactions for flexible, data-dependent integration.
Multi-scale feature processing and attention
Multi-scale feature processing captures information across varying granularities to boost model robustness. Inception networks27,28 employ multi-branch structures for this purpose, and Feature Pyramid Networks (FPN)29 fuse hierarchical semantics. Attention-integrated multi-scale methods, like Pyramid Attention Modules (PAM)30, have gained traction. Yet, many apply multi-scale independently or with simplistic combinations, lacking dynamic cross-scale interactions. DMSCA embeds multi-scale design intrinsically, introducing and fusing scales early with adaptive weighting to exploit inter-scale dynamics deeply and efficiently.
Adaptive activation function
Activation functions inject non-linearity into networks, with adaptive variants dynamically adjusting based on inputs for superior expressiveness. Unlike fixed functions such as ReLU31, Swish32 and DY-ReLU33 adapt their shapes to feature complexities. These innovations inspire DMSCA’s Adaptive Activation (AA) module, which dynamically modulates output mappings to optimize non-linear representations within the attention framework.
Advanced attention mechanisms and vision transformers
Recent advancements include sophisticated attention designs and Vision Transformers (ViT)34, which process images as patch sequences. Models like Visual Attention Network (VAN) use large-kernel convolutions for long-range modeling in CNNs, while EfficientFormerV2 excels in lightweight hybrids. Moreover, the applicability of attention mechanisms has been successfully extended to image restoration domains, employing innovative techniques such as dual-domain strip attention20 and modulating Transformers21 to address fine-grained pixel reconstruction challenges. Although powerful, these approaches often incur high computational costs or are primarily tailored for Transformer backbones. In contrast, DMSCA offers a lightweight, plug-and-play solution optimized for CNNs, enhancing features without structural overhauls. Overall, while prior works advance individual aspects, DMSCA holistically integrates dynamic multi-scale, channel-spatial, and adaptive elements for superior synergy.
Methodology
This section details the architecture, core components, and mathematical foundations of the Dynamic Multi-scale Channel-Spatial Attention (DMSCA) mechanism. Designed to enhance the feature representation capabilities of Convolutional Neural Networks (CNNs), DMSCA dynamically integrates multi-scale contexts, directional perceptions, and adaptive activations. We provide precise mathematical derivations, a parameter complexity analysis, and implementation details to ensure reproducibility.
Overall architecture of DMSCA
As illustrated in Fig. 1b, the Dynamic Multi-Scale Channel-Spatial Attention (DMSCA) module is designed as a versatile, plug-and-play unit compatible with standard CNN architectures (e.g., ResNet, MobileNet).
Figure 1(a) specifically demonstrates its deployment within the ResNet-50 backbone. At the macroscopic level (top panel), DMSCA is embedded within the residual stages, effectively regulating feature flow throughout the network hierarchy. At the microscopic level (bottom panel), we integrate DMSCA into the Bottleneck Block. Crucially, the module is placed after the final \(\:1\:\times\:1\) expansion convolution. This placement ensures that DMSCA operates on the expanded, high-dimensional feature space, allowing it to capture the richest possible channel interdependencies and spatial contexts before the feature map is compressed or added to the identity shortcut.
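To make this placement concrete, the following is a minimal PyTorch sketch of a ResNet-50 bottleneck with an attention module inserted after the final 1×1 expansion convolution and before the identity addition. The class name `BottleneckWithAttention` and the `attn` argument are illustrative conventions; `nn.Identity()` stands in as a placeholder for DMSCA, whose internals are detailed in the following sections.

```python
import torch
import torch.nn as nn

class BottleneckWithAttention(nn.Module):
    """ResNet-50 bottleneck with an attention module placed after the
    final 1x1 expansion convolution, before the residual addition.
    `attn` is a placeholder for DMSCA: any module that maps a
    (B, C, H, W) tensor to the same shape can be plugged in."""
    expansion = 4

    def __init__(self, in_ch, mid_ch, attn=None, stride=1):
        super().__init__()
        out_ch = mid_ch * self.expansion
        self.conv1 = nn.Conv2d(in_ch, mid_ch, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(mid_ch)
        self.conv2 = nn.Conv2d(mid_ch, mid_ch, 3, stride, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(mid_ch)
        self.conv3 = nn.Conv2d(mid_ch, out_ch, 1, bias=False)
        self.bn3 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        self.attn = attn if attn is not None else nn.Identity()
        self.downsample = None
        if stride != 1 or in_ch != out_ch:
            self.downsample = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride, bias=False),
                nn.BatchNorm2d(out_ch))

    def forward(self, x):
        identity = x if self.downsample is None else self.downsample(x)
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        out = self.attn(out)  # attention acts on the expanded feature space
        return self.relu(out + identity)
```

Because the attention module sees the expanded (4×) channel space, it operates on the richest representation available inside the block, which is precisely the rationale stated above.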
The module accepts an input feature map \(\:\varvec{X}\in\:{\mathbb{R}}^{\varvec{C}\times\:\varvec{H}\times\:\varvec{W}}\). Unlike conventional attention mechanisms that often treat channel and spatial dimensions in isolation, DMSCA is guided by two core design principles:
1. Contextual Robustness: Ensuring features are refined using both global dependencies and multi-scale local details.
2. Adaptive Selection: Allowing the network to dynamically weigh the importance of channel versus spatial information based on the specific input content.
The internal architecture, detailed in Fig. 1(b), orchestrates six cohesive components: the Global Context Encoder (GCE), Temperature-controlled Channel Attention (TCA), Multi-scale Spatial Context Encoder (MSCE), Directional Information Interaction (DII), Dynamic Feature Fusion (DFF), and Adaptive Activation (AA). These collaborate in the three phases described below.
(a) Architecture of DMSCA-Integrated ResNet50 (Block Level). (b) Overview of DMSCA.
Temperature-controlled channel recalibration
Standard global pooling often risks losing feature distinctiveness. To address this, the Global Context Encoder (GCE) aggregates global context via parallel average and max pooling. This is followed by the Temperature-controlled Channel Attention (TCA) mechanism. A critical theoretical innovation here is the introduction of a temperature parameter \(\:\varvec{\tau\:}\). By adjusting \(\:\varvec{\tau\:}\), the module can control the sharpness of the Sigmoid distribution, thereby preventing the gradients from vanishing in the deep layers and allowing for a softer, more informative attention map.
Mathematically, let \(\:{\varvec{F}}_{\varvec{g}\varvec{l}\varvec{o}\varvec{b}\varvec{a}\varvec{l}}\) denote the context features extracted by the MLP in GCE. The channel attention vector \(\:{\varvec{A}}_{\varvec{c}}\) and the recalibrated feature \(\:{\varvec{X}}_{\varvec{c}}\) are formulated as:

$$A_{c}=\sigma\left(F_{global}/\tau\right),\qquad X_{c}=A_{c}\odot X,$$

where \(\:\varvec{\sigma\:}\) denotes the Sigmoid function and \(\:\odot\:\) denotes element-wise multiplication.
Synergistic spatial and directional encoding
To overcome the limited receptive field of standard convolutions, we introduce a dual-pathway approach for spatial refinement.
First, the Multi-scale Spatial Context Encoder (MSCE) employs multi-kernel convolutions (\(\:{\varvec{k}}_{1},{\varvec{k}}_{2},{\varvec{k}}_{3}\)) to capture local context at varying scales (\(\:{\varvec{F}}_{\varvec{m}\varvec{u}\varvec{l}\varvec{t}\varvec{i}}\)). Simultaneously, the Directional Information Interaction (DII) specifically addresses the lack of positional information in global pooling. By decomposing the feature map into horizontal and vertical encodings (\(\:{\varvec{F}}_{\varvec{d}\varvec{i}\varvec{r}}\)), DII preserves precise directional cues.
These two distinct information flows are coupled to generate a spatially enhanced feature map \(\:{\varvec{X}}_{\varvec{s}}\). To ensure robustness against varying input resolutions (e.g., when \(\:\varvec{H}\) or \(\:\varvec{W}\) are small), adaptive boundary checks are implicitly integrated into the pooling operations. The fusion is defined as:

$$X_{s}=A_{dir}\odot\sigma\left(F_{multi}\right)\odot X_{c},$$

where \(\:\varvec{\sigma\:}\) represents the Sigmoid function and \(\:{\varvec{A}}_{\varvec{d}\varvec{i}\varvec{r}}\) is the directional attention map generated by DII.
Dynamic aggregation and activation
Static summation of features is often suboptimal as different inputs may require varying emphasis on channel or spatial domains. The Dynamic Feature Fusion (DFF) module solves this by learning adaptive weights (\(\:{\varvec{w}}_{0},{\varvec{w}}_{1}\)) to non-linearly fuse the intermediate features. Finally, Adaptive Activation (AA) creates a content-aware activation function, further refining the output \(\:{\varvec{X}}_{\varvec{f}\varvec{i}\varvec{n}\varvec{a}\varvec{l}}\) for subsequent layers. The complete aggregation process is expressed as:

$$X_{out}=w_{0}\odot X_{c}+w_{1}\odot X_{s},\qquad X_{final}=\text{AA}\left(X_{out}\right),$$

where \(\:{\varvec{w}}_{0}\) and \(\:{\varvec{w}}_{1}\) are the pixel-wise fusion weights produced by DFF.
Global context encoder
The Global Context Encoder (GCE) constitutes the foundational layer of the DMSCA module, tasked with transforming spatial information into a compact channel-wise descriptor. For an input feature map \(\:\varvec{X}\in\:{\varvec{R}}^{\varvec{C}\times\:\varvec{H}\times\:\varvec{W}}\), the goal is to extract a comprehensive global representation that guides the subsequent attention recalibration.
Theoretical motivation and design
Standard channel attention mechanisms, such as SE-Net14, predominantly rely on Global Average Pooling (GAP). While GAP effectively captures the global statistical distribution (background information), it inherently suppresses high-frequency signals, potentially diluting the discriminative details of salient objects. Conversely, Global Max Pooling (GMP) excels at preserving the most prominent features (texture and edges) but may overlook the broader contextual information34.
To address these limitations, the GCE employs a dual-pathway aggregation strategy. By synthesizing both average and maximum pooling statistics, the encoder ensures a richer feature representation. Furthermore, regarding the fusion strategy, we opt for element-wise addition rather than concatenation. Theoretical analysis suggests that addition acts as a parameter-efficient superposition of signals, preserving the distinct properties of both descriptors without increasing the dimensionality of the subsequent Multi-Layer Perceptron (MLP). This choice aligns with our goal of designing a lightweight architecture.
Mathematical formulation
The encoding process is formally defined as follows. First, the spatial dimensions are compressed to generate two distinct channel descriptors, \(\:{\varvec{X}}_{\varvec{a}\varvec{v}\varvec{g}}\) and \(\:{\varvec{X}}_{\varvec{m}\varvec{a}\varvec{x}}\):

$$X_{avg}=\text{GAP}\left(X\right),\qquad X_{max}=\text{GMP}\left(X\right),$$

where GAP and GMP denote Global Average Pooling and Global Max Pooling over the spatial dimensions.
These descriptors are then fused via element-wise summation to form a unified global descriptor \(\:{\varvec{X}}_{\varvec{f}\varvec{u}\varvec{s}\varvec{e}\varvec{d}}\in\:{\varvec{R}}^{\varvec{C}\times\:1\times\:1}\). This fused vector is subsequently propagated through a shared MLP (comprising dimensionality reduction and restoration layers) to model channel interdependencies. The final global context \(\:{\varvec{F}}_{\varvec{g}\varvec{l}\varvec{o}\varvec{b}\varvec{a}\varvec{l}}\) is obtained as:

$$F_{global}=\text{MLP}\left(X_{avg}+X_{max}\right).$$
This refined descriptor \(\:{\varvec{F}}_{\varvec{g}\varvec{l}\varvec{o}\varvec{b}\varvec{a}\varvec{l}}\) serves as the input for the Temperature-controlled Channel Attention (TCA) detailed in the subsequent section. The structure of the GCE is shown in Fig. 2.
Schematic illustration of the Global Context Encoder (GCE). The module processes the input feature \(\:\varvec{X}\) through parallel Global Average Pooling (GAP) and Global Max Pooling (GMP) streams. The resulting descriptors, \(\:{\varvec{X}}_{\varvec{a}\varvec{v}\varvec{g}}\) and \(\:{\varvec{X}}_{\varvec{m}\varvec{a}\varvec{x}}\), are fused via element-wise addition to preserve complementary spatial information. A shared MLP then projects this fused representation into the final global context descriptor \(\:{\varvec{F}}_{\varvec{g}\varvec{l}\varvec{o}\varvec{b}\varvec{a}\varvec{l}}\).
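The GCE pipeline described above (parallel GAP/GMP, additive fusion, shared MLP) can be sketched in PyTorch as follows. The reduction ratio `r = 16` and the use of 1×1 convolutions to realize the shared MLP are assumptions for illustration, not values fixed by this section.

```python
import torch
import torch.nn as nn

class GlobalContextEncoder(nn.Module):
    """GCE sketch: parallel global average/max pooling, element-wise
    addition, then a shared bottleneck MLP (reduce by r, restore).
    The ratio r = 16 and the 1x1-conv MLP are illustrative choices."""
    def __init__(self, channels, r=16):
        super().__init__()
        hidden = max(channels // r, 1)
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False),   # reduction layer
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, 1, bias=False))   # restoration layer

    def forward(self, x):                               # x: (B, C, H, W)
        x_avg = x.mean(dim=(2, 3), keepdim=True)        # GAP -> (B, C, 1, 1)
        x_max = x.amax(dim=(2, 3), keepdim=True)        # GMP -> (B, C, 1, 1)
        return self.mlp(x_avg + x_max)                  # F_global: (B, C, 1, 1)
```

Note that the additive fusion keeps the MLP input at `C` channels; concatenation would double it, which is the parameter-efficiency argument made above.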
Channel attention with temperature control
Following the extraction of the global context descriptor \(\:{\varvec{F}}_{\varvec{g}\varvec{l}\varvec{o}\varvec{b}\varvec{a}\varvec{l}}\) by the GCE, the Temperature-controlled Channel Attention (TCA) module is tasked with generating calibrated importance weights for each channel. This process allows the network to selectively emphasize informative feature maps while suppressing irrelevant background noise.
Theoretical motivation and design
In conventional channel attention mechanisms (e.g., SE-Net14), the channel weights are typically generated via a standard Sigmoid function. However, the standard Sigmoid function lacks flexibility; its slope is fixed, which may lead to suboptimal gradient flow during training. Specifically, when the input values are large (saturation region), gradients vanish, hindering effective backpropagation.
To mitigate this, we introduce a learnable temperature parameter \(\:\varvec{\tau\:}\) into the activation mechanism, inspired by temperature scaling in knowledge distillation35 and model calibration. Theoretically, \(\:\varvec{\tau\:}\) acts as a regularization term for the activation distribution:
1. Low Temperature (\(\:\varvec{\tau\:}<\:1\)): Sharpens the distribution, forcing the model to make binary, decisive choices (selecting specific channels strictly).
2. High Temperature (\(\:\varvec{\tau\:}>1\)): Softens the distribution, allowing for a smoother gradient flow and encouraging the model to utilize a broader set of channels during early training phases.
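The effect of the temperature can be checked numerically with a scalar Sigmoid. This is a plain illustration of the scaling behavior, not part of the module itself:

```python
import math

def temp_sigmoid(x, tau):
    """Temperature-scaled Sigmoid: sigma(x / tau)."""
    return 1.0 / (1.0 + math.exp(-x / tau))

# For a fixed logit x = 2.0, a low temperature pushes the output toward
# the extremes (decisive selection), while a high temperature pulls it
# toward 0.5 (soft selection with better-behaved gradients).
for tau in (0.5, 1.0, 2.0):
    print(f"tau={tau}: {temp_sigmoid(2.0, tau):.3f}")
# prints: tau=0.5: 0.982, tau=1.0: 0.881, tau=2.0: 0.731
```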
Furthermore, to maintain computational efficiency while modeling channel interdependencies, we employ a bottleneck structure within the MLP, reducing the channel dimensionality by a reduction ratio \(\:\varvec{r}\) before restoring it.
Mathematical formulation
Let \(\:{\varvec{F}}_{\varvec{g}\varvec{l}\varvec{o}\varvec{b}\varvec{a}\varvec{l}}\in\:{\varvec{R}}^{\varvec{C}\times\:1\times\:1}\) be the context descriptor obtained from the GCE (prior to activation). The TCA module first normalizes this descriptor using the temperature parameter \(\:\varvec{\tau\:}\). The final channel attention weights \(\:{\varvec{A}}_{\varvec{c}}\in\:{\varvec{R}}^{\varvec{C}\times\:1\times\:1}\) are generated via the temperature-scaled Sigmoid function:

$$A_{c}=\sigma\left(F_{global}/\tau\right),$$

where \(\:\varvec{\sigma\:}\left(\cdot\:\right)\) denotes the Sigmoid function. Subsequently, the input feature map \(\:\varvec{X}\) is recalibrated via channel-wise multiplication to produce the refined feature \(\:{\varvec{X}}_{\varvec{c}}\):

$$X_{c}=A_{c}\odot X.$$
Here, \(\:\odot\:\) represents element-wise multiplication, broadcasting the \(\:1\:\times\:1\) weights across the spatial dimensions \(\:\varvec{H}\:\times\:\varvec{W}\). This mechanism ensures that the subsequent spatial encoding layers receive features with optimized channel distributions. The structure of the TCA is shown in Fig. 3.
Diagram of the Temperature-controlled Channel Attention (TCA). The global descriptor \(\:{\varvec{F}}_{\varvec{g}\varvec{l}\varvec{o}\varvec{b}\varvec{a}\varvec{l}}\) (output of the MLP) is scaled by a temperature factor \(\:\varvec{\tau\:}\) to adjust the distribution sharpness. The weights \(\:{\varvec{A}}_{\varvec{c}}\) are obtained via the Sigmoid function and applied to the input \(\:\varvec{X}\) through element-wise multiplication to generate the channel-refined feature \(\:{\varvec{X}}_{\varvec{c}}\).
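A minimal PyTorch sketch of the TCA step follows. Treating \(\tau\) as a single learnable scalar initialized to 1, and clamping it to stay positive, are assumptions added here for numerical safety; the text above only requires \(\tau\) to be learnable.

```python
import torch
import torch.nn as nn

class TemperatureChannelAttention(nn.Module):
    """TCA sketch: temperature-scaled Sigmoid over the GCE descriptor.
    tau is modeled as one learnable scalar, initialized to 1 (an
    assumption); it is clamped to remain positive during training."""
    def __init__(self):
        super().__init__()
        self.tau = nn.Parameter(torch.ones(1))

    def forward(self, x, f_global):
        # A_c = sigmoid(F_global / tau), then broadcast over H x W
        tau = self.tau.clamp(min=1e-3)
        a_c = torch.sigmoid(f_global / tau)
        return a_c * x                      # X_c = A_c (element-wise) X
```

Because `a_c` has shape `(B, C, 1, 1)`, the multiplication broadcasts the per-channel weights across the full `H × W` plane, exactly as described above.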
Multi-scale spatial context encoder
Following the refinement of channel interdependencies in the TCA module, the feature map \(\:{\varvec{X}}_{\varvec{c}}\in\:{\varvec{R}}^{\varvec{C}\times\:\varvec{H}\times\:\varvec{W}}\) contains enhanced channel-wise representations. However, convolution operations with fixed kernel sizes inherently struggle to capture objects of varying scales simultaneously. A small kernel focuses on local texture details, while a larger kernel captures broader semantic shapes. To overcome this limitation and enrich the feature representation with diverse spatial granularities, we design the Multi-scale Spatial Context Encoder (MSCE).
Design principles and spatial descriptor generation
The primary objective of the MSCE is to extract spatial context without the computational burden of processing the full channel depth. Motivated by the observation that channel-wise pooling is an efficient method for generating spatial attention descriptors15, we first decouple spatial information from the channel dimension.
Instead of standard dimensionality reduction via \(\:1\times\:1\) convolution, we employ dual-channel pooling operations. We compute the average-pooled features to aggregate background statistics and max-pooled features to highlight discriminative regions along the channel axis. This results in a comprehensive spatial descriptor \(\:{\varvec{S}}_{\varvec{d}\varvec{e}\varvec{s}\varvec{c}}\in\:{\varvec{R}}^{2\times\:\varvec{H}\times\:\varvec{W}}\):

$$S_{desc}=\left[S_{avg};S_{max}\right],$$

where \(\:{\varvec{S}}_{\varvec{a}\varvec{v}\varvec{g}}\in\:{\varvec{R}}^{1\times\:\varvec{H}\times\:\varvec{W}}\) and \(\:{\varvec{S}}_{\varvec{m}\varvec{a}\varvec{x}}\in\:{\varvec{R}}^{1\times\:\varvec{H}\times\:\varvec{W}}\) represent the channel-wise statistics. This compression strategy significantly reduces parameter overhead while preserving essential spatial structures.
Multi-branch feature extraction
To facilitate multi-scale feature interaction, \(\:{\varvec{S}}_{\varvec{d}\varvec{e}\varvec{s}\varvec{c}}\) serves as the input to a parallel multi-branch architecture inspired by the Inception paradigm36. The module employs a set of \(\:\varvec{n}\) parallel convolutional layers with varying kernel sizes \(\:\varvec{k}\:\in\:\{3\:\times\:3,\:5\:\times\:5,\:7\:\times\:7\}\). Each branch \(\:\varvec{i}\) applies a specific kernel size \(\:{\varvec{k}}_{\varvec{i}}\) to capture spatial context at a distinct receptive field. The operation for the \(\:\varvec{i}\)-th branch is defined as:

$$L_{i}=\delta\left(\mathcal{B}\left(\text{Conv}_{k_{i}\times k_{i}}\left(S_{desc}\right)\right)\right),$$
where \(\:{\text{Conv}}_{{\varvec{k}}_{\varvec{i}}\times\:{\varvec{k}}_{\varvec{i}}}\) denotes a convolutional layer reducing the channel dimension from 2 to 1, \(\:\mathcal{B}\) represents Batch Normalization, and \(\:\varvec{\delta\:}\) denotes the \(\:\text{R}\text{e}\text{L}\text{U}\) activation function. This design allows the network to adaptively perceive visual patterns ranging from fine-grained edges to coarse-grained object contours.
Feature fusion
Unlike standard concatenation approaches that increase dimensionality, we aim to synthesize a unified spatial attention map. The outputs from all branches \(\:\{{\varvec{L}}_{1},\dots\:,{\varvec{L}}_{\varvec{n}}\}\) are fused via an averaging operation to produce the final multi-scale context encoding \(\:{\varvec{F}}_{\varvec{m}\varvec{u}\varvec{l}\varvec{t}\varvec{i}}\in\:{\varvec{R}}^{1\times\:\varvec{H}\times\:\varvec{W}}\). This ensures that the resulting spatial map represents a consensus of features across different scales:

$$F_{multi}=\frac{1}{n}\sum_{i=1}^{n}L_{i}.$$
The resulting \(\:{\varvec{F}}_{\varvec{m}\varvec{u}\varvec{l}\varvec{t}\varvec{i}}\) encapsulates robust spatial position information, which is subsequently integrated with the directional features discussed in the following section. The schematic structure of the MSCE is illustrated in Fig. 4.
Multi-scale spatial context encoder.
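The MSCE steps above (channel-wise avg/max pooling into a 2-channel descriptor, parallel 3/5/7 branches with BN and ReLU, average fusion) can be sketched as follows; the padding choice `k // 2`, which keeps spatial size fixed, is an implementation assumption.

```python
import torch
import torch.nn as nn

class MultiScaleSpatialContextEncoder(nn.Module):
    """MSCE sketch: channel-wise average/max pooling forms a 2-channel
    spatial descriptor; n parallel Conv(2->1, k x k) + BN + ReLU branches
    (k = 3, 5, 7) are fused by averaging into F_multi."""
    def __init__(self, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(2, 1, k, padding=k // 2, bias=False),
                nn.BatchNorm2d(1),
                nn.ReLU(inplace=True))
            for k in kernel_sizes])

    def forward(self, x_c):                        # x_c: (B, C, H, W)
        s_avg = x_c.mean(dim=1, keepdim=True)      # (B, 1, H, W)
        s_max = x_c.amax(dim=1, keepdim=True)      # (B, 1, H, W)
        s_desc = torch.cat([s_avg, s_max], dim=1)  # S_desc: (B, 2, H, W)
        outs = [b(s_desc) for b in self.branches]  # each L_i: (B, 1, H, W)
        return torch.stack(outs, dim=0).mean(dim=0)  # F_multi: (B, 1, H, W)
```

Because every branch consumes only the 2-channel descriptor, the cost is essentially independent of `C`, which is the parameter-efficiency point made above.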
Directional information interaction
While the Multi-scale Spatial Context Encoder (MSCE) captures local spatial granularities effectively, standard convolution and global pooling operations often result in a loss of precise positional information. To mitigate this, we introduce the Directional Information Interaction (DII) module. Building upon the principles of Coordinate Attention17, the DII module decomposes spatial attention into two orthogonal directions. However, unlike prior works that process these directions independently, our design introduces a novel interaction mechanism that couples horizontal and vertical features, allowing the network to perceive object structures more holistically.
Coordinate aggregation and embedding
To preserve long-range dependencies with precise positional information, we utilize adaptive average pooling kernels \(\:\left(\varvec{H},1\right)\) and \(\:\left(1,\varvec{W}\right)\) to encode each channel along the horizontal and vertical coordinates, respectively.
For the input feature map \(\:{\varvec{X}}_{\varvec{c}}\) (output from the TCA module), the aggregation for the \(\:\varvec{c}\)-th channel at height \(\:\varvec{h}\) and width \(\:\varvec{w}\) is formulated as:

$$z_{c}^{h}\left(h\right)=\frac{1}{W}\sum_{0\le w<W}x_{c}\left(h,w\right),\qquad z_{c}^{w}\left(w\right)=\frac{1}{H}\sum_{0\le h<H}x_{c}\left(h,w\right).$$
These operations generate two direction-aware feature maps: \(\:{\varvec{Z}}_{\varvec{h}}\in\:{\varvec{R}}^{\varvec{C}\times\:\varvec{H}\times\:1}\) and \(\:{\varvec{Z}}_{\varvec{w}}\in\:{\varvec{R}}^{\varvec{C}\times\:1\times\:\varvec{W}}.\) To reduce computational complexity and model channel interdependencies, these maps are projected to a lower dimension \(\:{\varvec{C}}_{\varvec{m}\varvec{i}\varvec{d}}\) via \(\:1\times\:1\) convolutions followed by Batch Normalization (BN):

$$F_{h}=\mathcal{B}\left(\text{Conv}_{1\times1}\left(Z_{h}\right)\right),\qquad F_{w}=\mathcal{B}\left(\text{Conv}_{1\times1}\left(Z_{w}\right)\right).$$
Orthogonal interaction and modulation
A key theoretical limitation in standard coordinate attention is the independence of the \(\:\varvec{x}\) and \(\:\varvec{y}\) axes during the encoding phase. Visual objects possess structural correlations across both dimensions. To address this, we propose a dense interaction strategy.
First, we broadcast (expand) the direction-specific features \(\:{\varvec{F}}_{\varvec{h}}\) and \(\:{\varvec{F}}_{\varvec{w}}\) to the full spatial resolution \(\:\varvec{H}\:\times\:\varvec{W}\), denoted as \(\:\stackrel{\sim}{{\varvec{F}}_{\varvec{h}}}\) and \(\:\stackrel{\sim}{{\varvec{F}}_{\varvec{w}}}\). These are then concatenated to form a unified spatial representation, which is processed by a non-linear interaction layer:

$$F_{inter}=\delta\left(\text{Conv}_{1\times1}\left(\left[\tilde{F}_{h};\tilde{F}_{w}\right]\right)\right),$$
where \(\:{\varvec{F}}_{\varvec{i}\varvec{n}\varvec{t}\varvec{e}\varvec{r}}\in\:{\varvec{R}}^{{\varvec{C}}_{\varvec{m}\varvec{i}\varvec{d}}\times\:\varvec{H}\times\:\varvec{W}}\) represents the coupled spatial context.
Crucially, rather than simply projecting this interaction feature directly, we employ it to modulate the original directional embeddings. This design enables the horizontal features to be refined by vertical context, and vice versa. The modulated features are then projected back to the original channel dimension \(\:\varvec{C}\):

$$A_{h}=\text{Conv}_{1\times1}\left(\tilde{F}_{h}\odot F_{inter}\right),\qquad A_{w}=\text{Conv}_{1\times1}\left(\tilde{F}_{w}\odot F_{inter}\right),$$
where \(\:\odot\:\) denotes element-wise multiplication.
Attention generation and fusion
The final directional attention map \(\:{\varvec{A}}_{\varvec{d}\varvec{i}\varvec{r}}\in\:{\varvec{R}}^{\varvec{C}\times\:\varvec{H}\times\:\varvec{W}}\) is synthesized by fusing the modulated orthogonal components via a Sigmoid activation function. This creates a spatially sensitive weight map that highlights regions of interest based on both coordinate positions:

$$A_{dir}=\sigma\left(A_{h}+A_{w}\right).$$
Finally, in alignment with the DMSCA architecture’s design, this directional attention \(\:{\varvec{A}}_{\varvec{d}\varvec{i}\varvec{r}}\) serves as a weighting factor for the multi-scale context \(\:{\varvec{F}}_{\varvec{m}\varvec{u}\varvec{l}\varvec{t}\varvec{i}}\) (derived in Sect. 3.4). The complete spatial attention mechanism combines these two streams to produce the spatially refined output \(\:{\varvec{X}}_{\varvec{s}}\):

$$X_{s}=A_{dir}\odot\sigma\left(F_{multi}\right)\odot X_{c}.$$
This hierarchical combination ensures that the feature map is enhanced by both the local multi-scale details from MSCE and the global positional structures from DII. The structure of the DII is shown in Fig. 5.
Schematic diagram of the Directional Information Interaction (DII) module. The input \(\:{\varvec{X}}_{\varvec{c}}\) is aggregated via horizontal and vertical pooling. The resulting descriptors are expanded and concatenated to undergo non-linear interaction. This coupled context then modulates the original directional features via element-wise multiplication. Finally, the features are fused to generate direction-aware attention map \(\:{\varvec{A}}_{\varvec{d}\varvec{i}\varvec{r}}\), which encodes precise positional dependencies.
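A PyTorch sketch of the DII pathway follows, under stated assumptions: the interaction layer is taken to be a 1×1 convolution with ReLU, the reduction ratio for \(C_{mid}\) is set to 8, and the two modulated directional maps are fused by addition before the Sigmoid; none of these hyperparameters are fixed by the text above.

```python
import torch
import torch.nn as nn

class DirectionalInformationInteraction(nn.Module):
    """DII sketch: horizontal/vertical pooling, 1x1 embedding with BN,
    broadcast + concat, a non-linear interaction layer, cross-modulation,
    and projection back to C channels. Layer sizes are assumptions."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        c_mid = max(channels // reduction, 8)
        self.embed_h = nn.Sequential(
            nn.Conv2d(channels, c_mid, 1, bias=False), nn.BatchNorm2d(c_mid))
        self.embed_w = nn.Sequential(
            nn.Conv2d(channels, c_mid, 1, bias=False), nn.BatchNorm2d(c_mid))
        self.interact = nn.Sequential(
            nn.Conv2d(2 * c_mid, c_mid, 1, bias=False), nn.ReLU(inplace=True))
        self.proj_h = nn.Conv2d(c_mid, channels, 1, bias=False)
        self.proj_w = nn.Conv2d(c_mid, channels, 1, bias=False)

    def forward(self, x_c):                          # x_c: (B, C, H, W)
        b, c, h, w = x_c.shape
        z_h = x_c.mean(dim=3, keepdim=True)          # (B, C, H, 1) pooling
        z_w = x_c.mean(dim=2, keepdim=True)          # (B, C, 1, W) pooling
        f_h = self.embed_h(z_h)                      # (B, C_mid, H, 1)
        f_w = self.embed_w(z_w)                      # (B, C_mid, 1, W)
        f_h_t = f_h.expand(-1, -1, h, w)             # broadcast to H x W
        f_w_t = f_w.expand(-1, -1, h, w)
        f_inter = self.interact(torch.cat([f_h_t, f_w_t], dim=1))
        # cross-modulation: each direction refined by the coupled context
        a_h = self.proj_h(f_h_t * f_inter)
        a_w = self.proj_w(f_w_t * f_inter)
        return torch.sigmoid(a_h + a_w)              # A_dir: (B, C, H, W)
```

The key departure from plain Coordinate Attention is `f_inter`, which lets each axis see the other before the attention map is formed.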
Dynamic feature fusion
The preceding modules independently refine the feature representations: the TCA module produces \(\:{\varvec{X}}_{\varvec{c}}\in\:{\varvec{R}}^{\varvec{C}\times\:\varvec{H}\times\:\varvec{W}}\), emphasizing “what” feature channels are important, while the coupled MSCE and DII modules yield \(\:{\varvec{X}}_{\varvec{s}}\in\:{\varvec{R}}^{\varvec{C}\times\:\varvec{H}\times\:\varvec{W}}\), highlighting “where” informative regions are located.
Standard fusion strategies, such as simple element-wise addition or static concatenation used in CBAM15 or BAM37, assume a fixed relationship between channel and spatial domains. However, this assumption is often suboptimal for complex visual scenes. For instance, in texture-rich background regions, channel distinctiveness (captured by \(\:{\varvec{X}}_{\varvec{c}}\)) may be more critical for classification, whereas in regions containing distinct object boundaries, spatial localization (captured by \(\:{\varvec{X}}_{\varvec{s}}\)) is paramount. To address this spatial variability, we propose the Dynamic Feature Fusion (DFF) module.
Theoretical motivation
The DFF is designed as a learnable gating mechanism that creates pixel-wise competition between channels and spatial attention. By generating a spatially adaptive weight map, the network can dynamically prioritize \(\:{\varvec{X}}_{\varvec{c}}\) or \(\:{\varvec{X}}_{\varvec{s}}\) at each pixel location \(\:\left(\varvec{i},\varvec{j}\right)\). This allows the model to selectively amplify the most relevant attention domain based on the specific local content of the input.
Mathematical formulation
As implemented in our architecture, the fusion process avoids heavy computational overhead by utilizing a lightweight \(\:1\times\:1\) convolutional gating layer. First, the channel-refined features \(\:{\varvec{X}}_{\varvec{c}}\) and spatially refined features \(\:{\varvec{X}}_{\varvec{s}}\) are concatenated along the channel dimension to preserve the complete information context from both domains:

$$F_{concat}=\left[X_{c};X_{s}\right]\in\mathbb{R}^{2C\times H\times W}.$$
To generate fusion weights, \(\:{\varvec{F}}_{\varvec{c}\varvec{o}\varvec{n}\varvec{c}\varvec{a}\varvec{t}}\) is projected into a 2-channel weight space using a \(\:1\times\:1\) convolution. A Softmax function is then applied along the channel dimension to ensure the weights sum to 1 at every spatial location, creating a normalized probabilistic distribution:

$$W_{fusion}=\text{Softmax}\left(\text{Conv}_{1\times1}\left(F_{concat}\right)\right)=\left[w_{0};w_{1}\right],$$
Here, \(\:{\varvec{W}}_{\varvec{f}\varvec{u}\varvec{s}\varvec{i}\varvec{o}\varvec{n}}\) consists of two weight maps, \(\:{\varvec{w}}_{0}\) and \(\:{\varvec{w}}_{1}\), corresponding to the importance of the channel and spatial branches, respectively. The final fused feature \(\:{\varvec{X}}_{\varvec{o}\varvec{u}\varvec{t}}\) is obtained via a weighted summation:

$$X_{out}=w_{0}\odot X_{c}+w_{1}\odot X_{s},$$
where \(\:\odot\:\) denotes element-wise multiplication. This operation ensures that \(\:{\varvec{X}}_{\varvec{o}\varvec{u}\varvec{t}}\) encapsulates a refined representation where the balance between channel and spatial information is dynamically tuned for every pixel.
Following this dynamic aggregation, the feature map \(\:{\varvec{X}}_{\varvec{o}\varvec{u}\varvec{t}}\) is passed to the final Adaptive Activation (AA) module (Sect. 3.7) for non-linear shaping. The schematic structure of the DFF is illustrated in Fig. 6.
Schematic diagram of the Dynamic Feature Fusion (DFF) module. The channel-enhanced (\(\:{\varvec{X}}_{\varvec{c}}\)) and spatially enhanced (\(\:{\varvec{X}}_{\varvec{s}}\)) features are concatenated and projected by a \(\:1\times\:1\) convolution. A spatial Softmax operation generates a normalized weight map (\(\:{\varvec{W}}_{\varvec{f}\varvec{u}\varvec{s}\varvec{i}\varvec{o}\varvec{n}}\)), which dynamically aggregates the inputs via pixel-wise weighted summation to produce \(\:{\varvec{X}}_{\varvec{o}\varvec{u}\varvec{t}}\).
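The gating described above can be sketched in PyTorch as a minimal illustration; the module and variable names here are our own shorthand, not the authors' implementation:

```python
import torch
import torch.nn as nn

class DynamicFeatureFusion(nn.Module):
    """Pixel-wise gated fusion of channel- and spatially-refined features
    (a sketch of the DFF described above)."""
    def __init__(self, channels: int):
        super().__init__()
        # Project the concatenated 2C-channel tensor to a 2-channel weight space.
        self.gate = nn.Conv2d(2 * channels, 2, kernel_size=1)

    def forward(self, x_c: torch.Tensor, x_s: torch.Tensor) -> torch.Tensor:
        f_concat = torch.cat([x_c, x_s], dim=1)        # (B, 2C, H, W)
        w = torch.softmax(self.gate(f_concat), dim=1)  # weights sum to 1 per pixel
        w_c, w_s = w[:, 0:1], w[:, 1:2]                # (B, 1, H, W) each
        # Per-pixel convex combination, broadcast over the channel dimension.
        return w_c * x_c + w_s * x_s
```

Because the two weights sum to 1 at every location, each output pixel is a convex combination of the two branches, which is what lets the network smoothly trade off channel versus spatial emphasis.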
Adaptive activation
Following the dynamic aggregation of channel and spatial streams in the DFF module, the feature map \(\:{\varvec{X}}_{\varvec{o}\varvec{u}\varvec{t}}\in\:{\varvec{R}}^{\varvec{C}\times\:\varvec{H}\times\:\varvec{W}}\) contains a balanced representation of the input. However, standard linear summation can sometimes introduce redundancies or background noise. To ensure that only the most salient features are propagated to the subsequent network layers, we introduce the Adaptive Activation (AA) module as the terminal refinement stage of the DMSCA.
Theoretical motivation
Standard activation functions like ReLU are static; they apply a fixed threshold (e.g., \(f(x)=\max(0,x)\)) regardless of the input’s contextual properties. Recent advances in deep learning, such as the Swish activation function38, demonstrate that data-dependent, smooth activation functions can significantly improve optimization.
Inspired by this self-gating principle, the AA module is designed to learn a non-linear, spatially aware activation map. Unlike standard activation functions that operate elementwise in isolation, the AA module utilizes a global view of the channel features to determine the activation intensity for each spatial location. This allows the network to effectively “turn off” background noise at the pixel level while enhancing foreground regions based on the consensus of all feature channels.
Mathematical formulation
The AA module operates on the fused feature \(\:{\varvec{X}}_{\varvec{o}\varvec{u}\varvec{t}}\). To compute the activation map, we project the feature map to a single-channel spatial mask using a \(\:1\times\:1\) convolution. This operation compresses the channel dimension, aggregating the information density at each spatial coordinate \(\:\left(\varvec{i},\varvec{j}\right)\):
\[
z = W_{1\times 1} * X_{out} + b, \qquad z \in \mathbb{R}^{1 \times H \times W}
\]
where \(\:\varvec{b}\) is a learnable bias term that allows the activation threshold to shift.
Crucially, as observed in our implementation, we omit Batch Normalization at this stage to preserve the absolute magnitude of the fused features, which is essential for accurate gating.
The spatial activation map \(\:\varvec{\upalpha\:}\in\:{\varvec{R}}^{1\times\:\varvec{H}\times\:\varvec{W}}\) is then generated via the Sigmoid function, mapping the values to the range \(\:\left(0,1\right)\):
\[
\alpha = \sigma\left(z\right) = \frac{1}{1 + e^{-z}}
\]
Finally, the output of the DMSCA module, \(\:{\varvec{X}}_{\varvec{f}\varvec{i}\varvec{n}\varvec{a}\varvec{l}}\in\:{\varvec{R}}^{\varvec{C}\times\:\varvec{H}\times\:\varvec{W}}\), is obtained by modulating the input feature \(\:{\varvec{X}}_{\varvec{o}\varvec{u}\varvec{t}}\) with the learned activation map \(\:\varvec{\upalpha\:}\) via broadcasting:
\[
X_{final} = \alpha \odot X_{out}
\]
This multiplicative gating mechanism allows the gradient to flow adaptively during backpropagation, refining the feature representation through a multi-stage process of contextual perception (GCE/MSCE/DII), dynamic fusion (DFF), and adaptive gating (AA). The schematic structure of the AA is illustrated in Fig. 7.
Schematic diagram of the Adaptive Activation (AA) module. The fused feature \(\:{\varvec{X}}_{\varvec{o}\varvec{u}\varvec{t}}\) is compressed via a \(\:1\times\:1\) convolution to a single-channel descriptor. A Sigmoid function generates a spatial gating map \(\:\varvec{\upalpha\:}\), which adaptively scales the features via element-wise multiplication to produce the final output \(\:{\varvec{X}}_{\varvec{f}\varvec{i}\varvec{n}\varvec{a}\varvec{l}}\).
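The spatial self-gating above can be sketched in a few lines of PyTorch; the class name is ours, and this is an illustrative sketch rather than the authors' code:

```python
import torch
import torch.nn as nn

class AdaptiveActivation(nn.Module):
    """Spatial self-gating over the fused feature map (sketch of the AA module)."""
    def __init__(self, channels: int):
        super().__init__()
        # A 1x1 conv compresses C channels to one spatial descriptor per pixel;
        # bias=True supplies the learnable shift term b. No BatchNorm is used,
        # so the absolute magnitude of the fused features is preserved.
        self.compress = nn.Conv2d(channels, 1, kernel_size=1, bias=True)

    def forward(self, x_out: torch.Tensor) -> torch.Tensor:
        alpha = torch.sigmoid(self.compress(x_out))  # (B, 1, H, W), values in (0, 1)
        return x_out * alpha                         # gating broadcast over channels
```

Since \(\alpha \in (0,1)\), the gate can only attenuate features, never amplify them, which matches its role of suppressing background responses.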
Experiment settings
Datasets
We conducted experiments on three publicly available datasets widely used in image classification tasks. These datasets have different scales and complexities, which can fully test the performance and generalization ability of DMSCA:
1. CIFAR-104: Contains 10 categories, with a total of 60,000 32 × 32-pixel color images; 50,000 of them are used for training and 10,000 for testing. This is a relatively small dataset and is often used for quickly verifying the effectiveness of new methods.
2. CIFAR-1004: Like CIFAR-10, but contains 100 categories, with the same number of images and size. Due to the larger number of categories, the classification difficulty is greater.
3. ImageNet (ILSVRC 2012)7: A large-scale dataset with 1,000 classes, ~1.28 million training images, and 50,000 validation images, benchmarking model generalization.
Standard preprocessing was applied: normalization, random cropping, and horizontal flipping for CIFAR; scaling to 256 × 256, random 224 × 224 cropping, and flipping for ImageNet training; center 224 × 224 cropping for validation.
Experimental environment and implementation details
Experiments ran on a server with an Intel Core i9-12900K CPU, 128 GB DDR4 RAM, a 4 TB NVMe SSD, and NVIDIA GeForce RTX 3090 GPUs (24 GB VRAM each). Software included PyTorch 2.0.0, CUDA 11.8, and Python 3.9.
DMSCA was integrated into ResNet architectures (ResNet-18, -34, and -50)11, placed after the main convolutions in each residual block and before the identity mapping, enhancing features while preserving residual learning.
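This placement can be illustrated with a simplified basic block, where `attention` stands in for DMSCA (or any plug-and-play attention module). The sketch omits stride and downsampling handling and is not the authors' implementation:

```python
import torch
import torch.nn as nn

class AttnBasicBlock(nn.Module):
    """ResNet basic block with an attention module inserted after the main
    convolutions and before the identity addition (illustrative sketch)."""
    def __init__(self, channels: int, attention: nn.Module):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)
        self.attention = attention

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = self.attention(out)  # refine features before the identity mapping
        return self.relu(out + x)  # residual learning is preserved
```

Because the attention module acts on the residual branch only, removing it (e.g., substituting `nn.Identity()`) recovers the plain ResNet block.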
Theoretical computational complexity analysis
To provide a rigorous assessment of DMSCA’s efficiency, we analyze the time complexity of each component. Let the input feature map dimension be \(H\times W\times C\). The complexity breakdown is as follows:
Global Context Encoder (GCE): This module involves global pooling (\(O(HWC)\)) and a shared MLP for channel interaction. The MLP reduces and restores dimensions with a reduction ratio \(r\), contributing \(O(2C^{2}/r)\). Thus, Complexity (GCE) \(\approx O(HWC + C^{2}/r)\).
Temperature-controlled Channel Attention (TCA): Since the MLP computation is accounted for in the GCE, the TCA module primarily performs temperature scaling, Sigmoid activation, and element-wise multiplication (broadcasting weights to the feature map). Complexity (TCA) \(\approx O(HWC)\).
Multi-scale Spatial Context Encoder (MSCE): Crucially, MSCE first compresses the channel dimension to 2 via channel-wise pooling (\(O(HWC)\)). The subsequent multi-scale convolutions operate only on these 2-channel descriptors, not the full input depth. Assuming \(K\) represents the total area of the multi-branch kernels, the convolution cost is \(O(HW \cdot K)\). Since \(C \gg 2\), the pooling term dominates. Complexity (MSCE) \(\approx O(HWC)\).
Directional Information Interaction (DII): This component includes coordinate pooling (\(O(HWC)\)) and \(1\times 1\) convolutions to model channel interactions in the horizontal and vertical paths. Complexity (DII) \(\approx O(HWC + C^{2})\).
Dynamic Feature Fusion (DFF) & Adaptive Activation (AA): These modules perform lightweight \(1\times 1\) convolutions for gating and fusion. Complexity (DFF + AA) \(\approx O(HWC)\).
Total Complexity: Summing these components, the overall time complexity of DMSCA is:
\[
O\left(HWC + \frac{C^{2}}{r} + C^{2}\right) \approx O\left(HWC + C^{2}\right)
\]
This analysis confirms that the complexity is dominated by linear terms related to the spatial resolution (\(HWC\)) and quadratic terms related to channel interaction (\(C^{2}\)). The space complexity is mainly determined by the intermediate feature maps and remains at \(O(HWC)\).
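As a sanity check on these terms, a back-of-envelope operation count can be computed directly. Constants are dropped, so the result is proportional rather than an exact FLOP count; the default kernel set {3, 5, 7} follows the hyperparameter study below:

```python
def dmsca_complexity(H: int, W: int, C: int, r: int = 16,
                     K: int = 3**2 + 5**2 + 7**2) -> int:
    """Proportional operation count for one DMSCA module, term by term."""
    gce = H * W * C + 2 * C * C // r   # global pooling + bottleneck MLP
    tca = H * W * C                     # scaling, sigmoid, broadcast multiply
    msce = H * W * C + H * W * K        # channel pooling + 2-channel convolutions
    dii = H * W * C + C * C             # coordinate pooling + 1x1 convolutions
    dff_aa = H * W * C                  # lightweight 1x1 gating and fusion
    return gce + tca + msce + dii + dff_aa

# Early ResNet stages (large H*W, small C) are dominated by the HWC terms;
# late stages (small H*W, large C) by the C^2 interaction terms.
early = dmsca_complexity(56, 56, 64)
late = dmsca_complexity(14, 14, 1024)
```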
The hyperparameter settings for training are shown in Table 1. For different datasets, we used different learning rates and batch sizes. The optimizer uniformly uses momentum-based stochastic gradient descent (SGD). Learning rate scheduling adopts the cosine annealing strategy13.
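A typical PyTorch realization of this optimizer and scheduler is sketched below. The learning rate, momentum, weight decay, and epoch count are placeholders, since the actual values are given in Table 1:

```python
import torch

model = torch.nn.Linear(10, 10)  # stands in for ResNet + DMSCA

# Momentum-based SGD; hyperparameter values here are placeholders.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)

# Cosine annealing decays the learning rate from lr toward eta_min (default 0).
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(100):
    # ... one training epoch (forward, loss, backward, optimizer.step()) ...
    scheduler.step()
```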
Hyperparameter sensitivity analysis design
To verify the sensitivity of DMSCA to key hyperparameters, we conducted a systematic sensitivity analysis experiment:
1. Reduction ratio (r): Tested {4, 8, 16, 32, 64} for performance-efficiency trade-offs.
2. Temperature coefficient (τ): Compared {0.5, 1.0, 1.5, 2.0, dynamic} for attention distribution impact.
3. Kernel combinations (K): Evaluated {3}, {3, 5}, {3, 5, 7}, {3, 5, 7, 9}.
4. Fusion weight initialization: Compared dynamic fusion strategies.
One parameter was varied at a time while the others were held fixed, measuring accuracy, parameters, and FLOPs.
Baseline methods and evaluation metrics
Baseline attention mechanisms
To verify the superiority of DMSCA, we selected the current mainstream and representative attention mechanisms as baselines for comparison:
1. ResNet (Baseline)11: The original ResNet architecture without any additional attention modules (ResNet-18, ResNet-34, ResNet-50).
2. SE-Net14: A classic channel attention mechanism that learns channel weights through global pooling and two fully connected layers.
3. CBAM15: A hybrid attention module that serially combines channel attention and spatial attention.
4. ECA-Net16: An efficient channel attention mechanism that avoids dimensionality reduction through one-dimensional convolution.
5. CA17: A spatial attention mechanism that captures directional information and long-range dependencies by decomposing two-dimensional spatial attention into two one-dimensional encoding processes.
6. SimAM39: A parameter-free attention mechanism designed based on neuroscience theory.
7. GAM40: A global attention mechanism that combines channel and spatial attention.
8. A²-Nets41: Dual attention networks that simultaneously model position and channel attention.
9. BAM37: Bottleneck attention module that adopts parallel channel and spatial attention branches.
All baseline attention modules are integrated into the ResNet architecture in the same way as DMSCA and are compared fairly using the same training strategy and hyperparameters.
Comprehensive evaluation framework
We evaluate DMSCA and the baseline methods from multiple dimensions:
1. Classification Performance: Top-1/Top-5 accuracy (ImageNet), Top-1 (CIFAR); mean ± std from multiple runs; per-class analysis on CIFAR-100.
2. Computational Efficiency: Parameters (M), FLOPs (G), memory (MB), inference time (ms), efficiency score (accuracy gain / parameter increase %).
3. Statistical Significance: 95% confidence intervals; t-tests/ANOVA with Bonferroni correction.
4. Training Dynamics: Loss/accuracy curves; convergence epochs (e.g., to 95% of final accuracy).
5. Robustness and Generalization: Cross-architecture (ResNet depths); cross-dataset; ablation with statistical validation.
This framework thoroughly assesses DMSCA’s efficacy.
Experimental results and analysis
This section presents a comprehensive evaluation of DMSCA across multiple metrics, comparing it with state-of-the-art attention mechanisms. We analyze classification performance, computational efficiency, component contributions through ablation studies, training dynamics, and feature visualization to demonstrate DMSCA’s effectiveness.
Main classification performance comparison
We compared DMSCA with various baseline attention mechanisms (SE-Net14, CBAM15, ECA-Net16, and CA17) and the original ResNet models without attention (ResNet-18, ResNet-34, ResNet-50)11 on three datasets: CIFAR-10, CIFAR-100, and ImageNet. All experiments were conducted under the same hyperparameter settings and training strategies to ensure fairness. To reduce the influence of randomness, all accuracy results are averages of five independent runs and are reported with standard deviations.
As demonstrated in Table 2, DMSCA consistently outperforms all competing methods across all datasets and network architectures:
1) On CIFAR-10, DMSCA improves Top-1 accuracy by 2.3%, 2.3%, and 2.2% when integrated with ResNet-18, ResNet-34, and ResNet-50, respectively.
2) On the more challenging CIFAR-100 dataset, DMSCA achieves substantial gains of 1.8%, 1.8%, and 1.7%, demonstrating its effectiveness for fine-grained classification tasks.
3) On the large-scale ImageNet dataset, ResNet-50 + DMSCA outperforms the baseline by 1.52% in Top-1 accuracy and 0.95% in Top-5 accuracy, surpassing all other attention mechanisms. Notably, it exceeds the recent CA mechanism by 0.37% on ResNet-50.
These consistent improvements across diverse datasets and architectures confirm that DMSCA’s dynamic multi-scale context-aware design effectively captures discriminative features, significantly enhancing CNN classification performance. Compared to existing attention mechanisms, DMSCA demonstrates substantial and consistent advantages in accuracy improvement.
Computational efficiency analysis
When evaluating the attention mechanism, in addition to focusing on its performance improvement, the computational cost is also a crucial consideration factor. Table 3 provides a detailed comparison of the additional parameters, floating-point operations, GPU memory usage, and inference time introduced by DMSCA and each baseline attention mechanism in the ResNet-50 architecture.
The efficiency analysis reveals:
1) Parameters and FLOPs: DMSCA introduces a moderate parameter increase (+11.34%) and computational overhead (+2.43%) compared to the baseline. While slightly higher than SE-Net and CBAM, this cost is justified by DMSCA’s superior accuracy gains. ECA-Net achieves a minimal parameter increase but more modest performance improvements, while CA maintains low parameter overhead with computational costs comparable to DMSCA’s.
2) Memory and Inference Time: DMSCA’s memory usage (395 MB) and inference latency (9.2 ms) remain competitive despite its multi-component architecture. Its inference time increase (+10.84%) is comparable to that of CBAM and CA, demonstrating efficient implementation of its complex attention mechanisms.
Overall, while DMSCA introduces moderate computational overhead, its significant performance improvements justify this trade-off, particularly for applications prioritizing accuracy. The efficiency-to-performance ratio remains favorable across all tested metrics.
Comparison with recent attention mechanisms
To conduct a more comprehensive evaluation of the performance of DMSCA, we also compared it with the advanced attention mechanisms proposed in recent years. The results are shown in Table 4.
Statistical significance analysis
To confirm the reliability of DMSCA’s performance improvements, we conducted paired t-tests with Cohen’s d effect sizes on ImageNet results using ResNet-50. Table 5 presents the results.
The results in Table 5 clearly demonstrate that the Top-1 accuracy achieved by DMSCA on ResNet-50 is significantly superior to all the compared baseline methods. All the p-values are far less than 0.001, indicating that the observed accuracy differences are extremely unlikely to be caused by random factors. Cohen’s d values are all greater than 1.5, showing a large effect size, which means that the performance improvement brought by DMSCA is not only statistically significant but also has practical application value. These statistical results further strengthen the conclusion that DMSCA is effective.
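The statistics reported in Table 5 can be reproduced from per-run accuracies with SciPy. The run values below are hypothetical placeholders for illustration, not the paper's measured results:

```python
import numpy as np
from scipy import stats

# Hypothetical per-run Top-1 accuracies over five independent runs (placeholders).
dmsca = np.array([77.9, 78.0, 77.8, 78.1, 77.9])
baseline = np.array([76.4, 76.5, 76.3, 76.5, 76.4])

# Paired t-test: the same five runs are compared method-to-method.
t_stat, p_value = stats.ttest_rel(dmsca, baseline)

# Cohen's d for paired data: mean of differences over their sample std.
diff = dmsca - baseline
cohens_d = diff.mean() / diff.std(ddof=1)

print(f"t = {t_stat:.2f}, p = {p_value:.2e}, d = {cohens_d:.2f}")
```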
Ablation studies
To elucidate the contributions of DMSCA’s components, we performed ablation experiments on CIFAR-100 using ResNet-18. We systematically added/removed key elements: Global Context Encoder (GCE), Temperature-Controlled Channel Attention (TCA), Multi-Scale Spatial Context Encoder (MSCE), Direction Information Interaction (DII), Dynamic Feature Fusion (DFF), and Adaptive Activation (AA).
Hyperparameter sensitivity analysis
We systematically analyzed the impact of the key hyperparameters of DMSCA on performance, and the results are shown in Table 6.
Analysis indicates that r = 16 balances performance and efficiency optimally. Dynamic \(\:\tau\:\) slightly outperforms fixed values. The {3,5,7} kernel combination is ideal, with diminishing returns for additional scales.
Results show that each component contributes positively: GCE (+0.49%), dynamic TCA (+0.17% over fixed), MSCE (+0.73% alone, +0.36% combined), DII (+0.27%), DFF (+0.13%), and AA (+0.05%). The core synergy among GCE, dynamic TCA, MSCE, and DII drives DMSCA’s performance.
Visualization analysis
To understand DMSCA’s impact on feature representation, we used Grad-CAM++29 to generate attention heatmaps for ResNet-50 on ImageNet samples.
DMSCA generates more focused, semantically meaningful maps, demonstrating superior feature localization. Visualization reveals SE-Net improves channel weight but lacks spatial precision. CBAM enhances localization but struggles with complex scenes. DMSCA precisely focuses on discriminative regions while suppressing noise, confirming its ability to guide models toward robust feature learning.
To further quantify the quality of the attention maps, we introduce several metrics defined in Table 9 and evaluate them on a subset of CIFAR-10.
The quantitative results in Table 9 are largely consistent with the qualitative observations in Table 8. DMSCA performed best across all three metrics: it achieved the highest focus ratio (0.87) and semantic consistency (0.84), along with the lowest noise suppression ratio (0.08). This strongly demonstrates that DMSCA can more accurately direct attention to the semantically relevant areas of the image while effectively ignoring background noise and irrelevant information. CA performed somewhat worse on all metrics, and SE-Net was comparatively weak. These data further support DMSCA’s superiority in enhancing feature localization and selection capabilities.
Training dynamics analysis
Beyond final performance, we examined DMSCA’s impact on training dynamics using ResNet-50 on ImageNet.
The training loss curve is shown in Fig. 8. The accuracy curve of the validation set (the highest accuracy rate) is shown in Fig. 9.
Training loss curves.
Validation top-1 accuracy curves.
DMSCA demonstrates faster loss decay and accuracy ascent from early epochs compared to baselines. It achieves higher accuracy milestones sooner, converges more rapidly, and maintains greater stability in later training with reduced fluctuations. This indicates DMSCA facilitates more efficient learning and yields more robust features.
Robustness and generalization analysis
An excellent attention mechanism should not only perform well on standard test sets but also maintain good performance under various perturbations and different data distributions; that is, it should exhibit good robustness and generalization ability.
We evaluated the performance of DMSCA under common image degradation conditions (such as Gaussian noise, motion blur, and JPEG compression). The experiments were conducted on the CIFAR-100 dataset, with the backbone network being ResNet-18. The results are shown in Table 10.
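The degradations evaluated above can be implemented with NumPy, SciPy, and Pillow. The sketch below follows the stated noise level (σ = 15); the motion-blur length and JPEG quality are our assumptions, since those settings are not specified in the text:

```python
import io
import numpy as np
from PIL import Image
from scipy.ndimage import convolve1d

def gaussian_noise(arr: np.ndarray, sigma: float = 15.0) -> np.ndarray:
    """Additive Gaussian noise, clipped back to the valid pixel range."""
    noisy = arr.astype(np.float32) + np.random.normal(0.0, sigma, arr.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def motion_blur(arr: np.ndarray, length: int = 5) -> np.ndarray:
    """Horizontal motion blur via a 1-D averaging kernel along the width axis."""
    kernel = np.ones(length, dtype=np.float32) / length
    return convolve1d(arr.astype(np.float32), kernel, axis=1).astype(np.uint8)

def jpeg_compress(arr: np.ndarray, quality: int = 30) -> np.ndarray:
    """Round-trip the image through an in-memory JPEG at the given quality."""
    buf = io.BytesIO()
    Image.fromarray(arr).save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return np.array(Image.open(buf).convert("RGB"))
```

Applying these to the clean CIFAR-100 test set before evaluation yields the degraded accuracy figures compared in Table 10.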
As shown in Table 10, under all tested image degradation conditions, the ResNet-18 integrated with DMSCA achieved the highest classification accuracy. For instance, under moderate Gaussian noise (σ = 15), the accuracy of DMSCA (68.25%) was 3.04% higher than that of the Baseline (65.21%) and 1.20% higher than that of the second-best method, CBAM (67.05%). Even under stronger noise (σ = 25) or other types of degradation, DMSCA maintained its lead. This indicates that the feature representations learned by DMSCA are more resistant to these common image perturbations; its dynamic adjustment and multi-scale perception capabilities help extract key features even when information is damaged.
Cross-dataset generalization
We further evaluated the generalization ability of DMSCA across different datasets. The experimental design was as follows: the model was trained on the source dataset, and then directly tested on the target dataset.
To verify the universality of DMSCA, we conducted preliminary experiments on object detection and semantic segmentation tasks:
Object Detection (COCO 2017): Under the Faster R-CNN framework, using ResNet-50 + DMSCA as the backbone network.
Baseline (ResNet-50): mAP = 37.4.
ResNet-50 + DMSCA: mAP = 38.9 (+1.5).
Semantic Segmentation (Cityscapes): Tested under the DeepLabV3+ framework.
Baseline (ResNet-50): mIoU = 78.2.
ResNet-50 + DMSCA: mIoU = 79.6 (+1.4).
These results indicate that DMSCA generalizes well across tasks and is not limited to image classification.
In addition to its outstanding performance on the three major datasets, CIFAR-10, CIFAR-100, and ImageNet, we also evaluated the generalization ability of DMSCA on small-scale, domain-specific image datasets, such as Oxford-IIIT Pet30 and Food-10134. On these datasets, DMSCA likewise delivered larger performance improvements than the baseline and other attention mechanisms, further proving its generalization ability and its capacity to adapt to different data distributions and task characteristics.
Based on the analysis of robustness and generalization ability, DMSCA not only performs exceptionally well under standard conditions, but also shows strong adaptability and stability when facing challenging scenarios and diverse data, which is crucial for practical applications.
Conclusion and future work
This paper introduced DMSCA, a novel attention mechanism that enhances feature representation in Convolutional Neural Networks (CNNs) through a multi-component, collaborative design. Our core contribution lies in the dynamic, data-dependent fusion of channel and spatial attention, which overcomes the limitations of static, predefined structures found in prior works like CBAM. Key innovations, including the Temperature-controlled Channel Attention (TCA) and the Direction-aware Multi-scale Spatial Context Encoder (MSCE), enable the model to adaptively modulate features based on input characteristics, leading to significant and consistent performance gains across multiple benchmarks, including ImageNet.
Comprehensive experiments demonstrated DMSCA’s superiority over existing attention mechanisms in classification accuracy, robustness against common image corruptions, and generalization to fine-grained datasets. While DMSCA introduces a modest computational overhead, the substantial performance benefits justify this trade-off, establishing it as a compelling solution for CNN-based feature refinement.
Despite these promising results, we acknowledge certain limitations inherent to our current design scope. The DMSCA mechanism is explicitly engineered to leverage the inductive biases of CNNs, specifically utilizing convolutional operations to capture local multi-scale contexts and spatial hierarchies efficiently. Consequently, its direct applicability to Transformer-based backbones (e.g., ViT, Swin Transformer), which process images as sequences of tokens without inherent spatial locality, remains unexplored. Future work will prioritize adapting the core principles of dynamic channel-spatial coupling to token-based architectures. Drawing inspiration from recent hybrid approaches like Modumer21, we aim to investigate how DMSCA can be integrated into self-attention frameworks. This evolution seeks to create a hybrid attention model that synergistically combines the efficiency of local feature refinement with the long-range dependency modeling capabilities of Transformers42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67.
Furthermore, the potential of DMSCA in other complex visual tasks, such as video analysis or 3D point cloud processing, remains an exciting avenue for future research. We believe that the design philosophy of DMSCA—emphasizing dynamic interaction and multi-scale context—offers a valuable direction for developing next-generation attention mechanisms.
Data availability
The datasets analysed during the current study are publicly available benchmark datasets. The CIFAR-10 and CIFAR-100 datasets are available from the University of Toronto’s website: https://www.cs.toronto.edu/~kriz/cifar.html. The ImageNet ILSVRC 2012 dataset is available via its official website: https://image-net.org/challenges/LSVRC/2012/.
References
Krizhevsky, A., Sutskever, I. & Hinton, G. E. ImageNet classification with deep convolutional neural networks, in Proc. Adv. Neural Inf. Process. Syst., 1097–1105. (2012).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition, in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 770–778. (2016).
Goodfellow, I. J., Shlens, J. & Szegedy, C. Explaining and harnessing adversarial examples, (2014). arXiv:1412.6572.
Hinton, G., Vinyals, O. & Dean, J. Distilling the knowledge in a neural network, (2015). arXiv:1503.02531.
Simonyan, K. & Zisserman, A. Very deep convolutional networks for large-scale image recognition, (2014). arXiv:1409.1556.
Huang, G., Liu, Z., Van Der Maaten, L. & Weinberger, K. Q. Densely connected convolutional networks, in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 4700–4708. (2017).
Madry, A., Makelov, A., Schmidt, L., Tsipras, D. & Vladu, A. Towards deep learning models resistant to adversarial attacks, (2017). arXiv:1706.06083.
Szegedy, C. et al. Going deeper with convolutions, in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 1–9. (2015).
Krizhevsky, A., Nair, V. & Hinton, G. Learning Multiple Layers of Features from Tiny Images (Univ. of Toronto, 2009).
Ren, S., He, K., Girshick, R. & Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks, in Proc. Adv. Neural Inf. Process. Syst., 91–99. (2015).
Deng, J. et al. ImageNet: A large-scale hierarchical image database, in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 248–255. (2009).
Paszke, A. et al. PyTorch: An imperative style, high-performance deep learning library, in Proc. Adv. Neural Inf. Process. Syst., 8026–8037. (2019).
Long, J., Shelhamer, E. & Darrell, T. Fully convolutional networks for semantic segmentation, in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 3431–3440. (2015).
Hu, J., Shen, L. & Sun, G. Squeeze-and-excitation networks, in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 7132–7141. (2018).
Woo, S., Park, J., Lee, J. Y. & Kweon, I. S. CBAM: Convolutional block attention module, in Proc. Eur. Conf. Comput. Vis. (ECCV), 3–19. (2018).
Wang, Q. et al. ECA-Net: Efficient channel attention for deep convolutional neural networks, in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 11534–11542. (2020).
Hou, Q., Zhou, D. & Feng, J. Coordinate attention for efficient mobile network design, in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 13713–13722. (2021).
Cui, Y., Ren, W. & Knoll, A. Exploring the potential of pooling techniques for universal image restoration. IEEE Trans. Image Process. 34, 3403–3416 (2025).
Cui, Y. & Knoll, A. Exploring the potential of channel interactions for image restoration. Knowl. -Based Syst. 282, 111156 (2023).
Cui, Y. & Knoll, A. Dual-domain strip attention for image restoration. Neural Netw. 171, 429–439 (2024).
Cui, Y., Liu, M., Ren, W. & Knoll, A. Modumer: modulating transformer for image restoration. IEEE Trans. Neural Netw. Learn. Syst., (2025).
Misra, D., Nalamada, T., Arasanipalai, A. U. & Hou, Q. Rotate to attend: Convolutional triplet attention module, in Proc. IEEE/CVF Winter Conf. Appl. Comput. Vis., 3139–3148. (2021).
Bello, I., Zoph, B., Le, Q., Vaswani, A. & Shlens, J. Attention augmented convolutional networks, in Proc. IEEE/CVF Int. Conf. Comput. Vis., 3286–3295. (2019).
Dosovitskiy, A. et al. An image is worth 16x16 words: Transformers for image recognition at scale, 2020, arXiv:2010.11929.
Kingma, D. P. & Ba, J. Adam: A method for stochastic optimization, arXiv:1412.6980. (2014).
Loshchilov, I. & Hutter, F. Decoupled Weight Decay Regularization (2017). arXiv:1711.05101.
Selvaraju, R. R. et al. Grad-CAM: Visual explanations from deep networks via gradient-based localization, in Proc. IEEE Int. Conf. Comput. Vis., 618–626. (2017).
Cohen, N., Sharir, G. & Shashua, A. On the expressive power of deep learning: A tensor analysis, (2016). arXiv:1606.05336.
Chattopadhyay, A., Sarkar, A., Howlader, P. & Balasubramanian, V. N. Grad-CAM++: Generalized gradient-based visual explanations for deep convolutional networks, in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), 839–847. (2018).
Parkhi, O. M., Vedaldi, A., Zisserman, A. & Jawahar, C. V. Cats and dogs, in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 3498–3505. (2012).
Goodfellow, I. et al. Generative adversarial nets, in Proc. Adv. Neural Inf. Process. Syst., 2672–2680. (2014).
Vaswani, A. et al. Attention is all you need, in Proc. Adv. Neural Inf. Process. Syst., 5998–6008. (2017).
Tolstikhin, I. O. et al. MLP-Mixer: An all-MLP architecture for vision, arXiv:2105.01601. (2021).
Bossard, L., Guillaumin, M. & Van Gool, L. Food-101–mining discriminative components with random forests, in Proc. Eur. Conf. Comput. Vis., Springer, 446–461. (2014).
He, K., Gkioxari, G., Dollár, P. & Girshick, R. Mask R-CNN, in Proc. IEEE Int. Conf. Comput. Vis., 2961–2969. (2017).
Szegedy, C. et al. Going deeper with convolutions, in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 1–9. (2015).
Yang, L., Zhang, R. Y., Li, L. & Xie, X. SimAM: A simple, parameter-free attention module for convolutional neural networks, in Proc. Int. Conf. Mach. Learn., 11863–11874. (2021).
Ramachandran, P., Zoph, B. & Le, Q. V. Searching for activation functions, arXiv preprint arXiv:1710.05941, (2017).
Chollet, F. Xception: Deep learning with depthwise separable convolutions, in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 1251–1258. (2017).
Ioffe, S. & Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift, in Proc. Int. Conf. Mach. Learn., 448–456. (2015).
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15 (1), 1929–1958 (2014).
Lin, T. Y., RoyChowdhury, A. & Maji, S. Bilinear CNN models for fine-grained visual recognition, in Proc. IEEE Int. Conf. Comput. Vis., 1449–1457. (2015).
Liu, Z. et al. Large-scale long-tailed recognition in an open world, in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2537–2546. (2019).
Zhang, H. et al. ResNeSt: Split-attention networks, arXiv:2004.08955. (2020).
Cui, Y., Jia, M., Lin, T. Y., Song, Y. & Belongie, S. Class-balanced loss based on effective number of samples, in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 9268–9277. (2019).
Zhang, H. et al. Theoretically principled trade-off between robustness and accuracy, in Proc. Int. Conf. Mach. Learn., 7472–7482. (2019).
Chen, Y. et al. Dynamic convolution: Attention over convolution kernels, in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 11030–11039. (2020).
Yang, B., Bender, G., Le, Q. V. & Ngiam, J. CondConv: Conditionally parameterized convolutions for efficient inference, in Proc. Adv. Neural Inf. Process. Syst., 1305–1316. (2019).
Redmon, J., Divvala, S., Girshick, R. & Farhadi, A. You only look once: Unified, real-time object detection, in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 779–788. (2016).
Chen, L. C., Papandreou, G., Kokkinos, I., Murphy, K. & Yuille, A. L. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs, IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 4, 834–848, Apr. (2018).
Karras, T., Laine, S. & Aila, T. A style-based generator architecture for generative adversarial networks, in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 4401–4410. (2019).
Liu, Z. et al. Swin transformer: Hierarchical vision transformer using shifted windows, in Proc. IEEE/CVF Int. Conf. Comput. Vis., 10012–10022. (2021).
Tan, M. & Le, Q. EfficientNet: Rethinking model scaling for convolutional neural networks, in Proc. Int. Conf. Mach. Learn., 6105–6114. (2019).
Howard, A. G. et al. MobileNets: Efficient convolutional neural networks for mobile vision applications, arXiv:1704.04861 (2017).
Sandler, M., Howard, A., Zhu, M., Zhmoginov, A. & Chen, L. C. MobileNetV2: Inverted residuals and linear bottlenecks, in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 4510–4520. (2018).
Zhang, X., Zhou, X., Lin, M. & Sun, J. ShuffleNet: An extremely efficient convolutional neural network for mobile devices, in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 6848–6856. (2018).
Ma, N., Zhang, X., Zheng, H. T. & Sun, J. ShuffleNet V2: Practical guidelines for efficient CNN architecture design, in Proc. Eur. Conf. Comput. Vis. (ECCV), 116–131. (2018).
Liu, Y., Shao, Z., Teng, Y. & Hoffmann, N. NAM: Normalization-based attention module, arXiv:2111.12419 (2021).
Chen, Y., Kalantidis, Y., Li, J., Yan, S. & Feng, J. A²-Nets: Double attention networks, in Proc. Adv. Neural Inf. Process. Syst., 352–361. (2018).
Park, J., Woo, S., Lee, J. Y. & Kweon, I. S. BAM: Bottleneck attention module, arXiv:1807.06514 (2018).
Liu, Y., Shao, Z. & Hoffmann, N. Global attention mechanism: Retain information to enhance channel-spatial interactions, arXiv:2112.05561 (2021).
Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J. & Zisserman, A. The Pascal visual object classes (VOC) challenge. Int. J. Comput. Vis. 88 (2), 303–338 (2010).
Lin, T. Y. et al. Microsoft COCO: Common objects in context, in Proc. Eur. Conf. Comput. Vis., Springer, 740–755. (2014).
Cordts, M. et al. The Cityscapes dataset for semantic urban scene understanding, in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 3213–3223. (2016).
Zhou, B. et al. Scene parsing through ADE20K dataset, in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 633–641. (2017).
Russakovsky, O. et al. ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115 (3), 211–252 (2015).
Selvaraju, R. R. et al. Grad-CAM: Visual explanations from deep networks via gradient-based localization, in Proc. IEEE Int. Conf. Comput. Vis., 618–626. (2017).
Author information
Authors and Affiliations
Contributions
L.Z. proposed the idea, conducted the experiments, designed the DMSCA mechanism, and wrote the original draft. S.J.N. provided guidance, verified the methodology, and revised the manuscript structure. Z.F.D. contributed to the formal analysis, particularly the mathematical derivations and complexity calculations. H.J.P. was responsible for project supervision, securing funding, and managing administrative affairs. All authors reviewed and approved the final version of the manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Zong, L., Nan, S.J., Die, Z.F. et al. DMSCA: dynamic multi-scale channel-spatial attention for enhanced feature representation in convolutional neural networks. Sci Rep 16, 8044 (2026). https://doi.org/10.1038/s41598-026-37546-3