Introduction

Convolutional Neural Networks (CNNs)1,2 have revolutionized tasks including image classification3,4, object detection5,6,7, and semantic segmentation8,9,10 by learning hierarchical features autonomously. Nevertheless, conventional CNNs assign equal importance to all channels and spatial locations, hindering their focus on critical information and thus impacting feature quality and model performance.

Inspired by human vision, attention mechanisms11,12,13 have been incorporated into CNNs to enable selective focus, dynamically weighting features to boost performance, robustness, and interpretability with minimal complexity overhead. Notable advancements include SE-Net14 for channel attention via global pooling and fully connected layers, CBAM15 for sequential channel-spatial attention, ECA-Net16 for efficient cross-channel interactions, and Coordinate Attention (CA)17 for position-sensitive long-range dependencies. Beyond high-level vision tasks, recent scholarship has significantly expanded the utility of attention mechanisms into image restoration and hybrid architectures. For instance, Cui et al. investigated the potential of pooling techniques for universal image restoration18 and explored channel interactions to enhance feature recovery19. Furthermore, novel architectures such as dual-domain strip attention20 and Modumer21, which modulate Transformers for restoration tasks, demonstrate the evolving complexity and versatility of attention-based feature representation.

Despite this progress, several challenges persist: insufficient multi-scale context capture, simplistic channel-spatial fusion lacking dynamic collaboration, limited adaptability of attention weights, inadequate directional modeling, and the expressive limitations of fixed activation functions.

Addressing these, we propose the Dynamic Multi-scale Channel-Spatial Attention (DMSCA) mechanism. Our contributions are:

  1. A synergistic multi-component framework encompassing global and multi-scale encoding, temperature-controlled channel attention, and direction-aware spatial attention.

  2. A learnable Dynamic Feature Fusion module for non-linear, input-adaptive channel-spatial combination, surpassing linear or fixed fusions.

  3. Rigorous validation on CIFAR-10/100 and ImageNet, showing DMSCA’s superior performance-efficiency balance over baselines.

  4. In-depth ablation and visualization studies confirming component efficacy and enhanced interpretability.

The paper is organized as follows: Sect. 2 reviews related work; Sect. 3 details DMSCA; Sect. 4 outlines experimental setup; Sect. 5 presents results and analyses; Sect. 6 concludes.

Related work

Channel attention mechanisms

Channel attention mechanisms calibrate features by modeling inter-channel dependencies and assigning varying weights to emphasize important channels. SE-Net14 pioneered this approach using global pooling and fully connected layers, though dimensionality reduction can lead to information loss. ECA-Net16 enhances efficiency through local cross-channel interactions without reduction, while FCA22 incorporates frequency-domain analysis for richer representations. Despite these advances, many methods overlook spatial context in guiding channel weights and lack dynamic adjustment. DMSCA addresses this via its Global Context Encoder (GCE) and Temperature-controlled Channel Attention (TCA), introducing learnable temperature τ for adaptive scaling and deep spatial coupling.

Spatial attention mechanism

Spatial attention mechanisms emphasize key regions in feature maps, focusing on “where” informative content lies. Spatial Transformer Networks (STN)23 achieve this through affine transformations, albeit with high computational cost. CBAM’s spatial module15 generates attention maps via pooling and convolution but often loses channel details and directional sensitivity. Coordinate Attention (CA)17 improves by encoding long-range dependencies with positional awareness. However, existing approaches frequently neglect channel influences, directional interactions, and multi-scale contexts. DMSCA’s Multi-scale Spatial Context Encoder (MSCE) and Directional Information Interaction (DII) integrate these elements, enabling robust multi-scale and direction-aware spatial modeling.

Hybrid attention mechanism

Hybrid attention mechanisms synergize channel and spatial attention for comprehensive feature enhancement. CBAM15 applies them sequentially, while BAM24 uses parallel combination with summation. Triplet Attention25 incorporates cross-dimensional interactions, and SimAM26 infers neuron importance parameter-freely via an energy function. Nonetheless, fusion strategies in these methods are often fixed and linear, restricting adaptability. DMSCA introduces a Dynamic Feature Fusion (DFF) module that learns spatially adaptive weights and incorporates directional interactions for flexible, data-dependent integration.

Multi-scale feature processing and attention

Multi-scale feature processing captures information across varying granularities to boost model robustness. Inception networks27,28 employ multi-branch structures for this purpose, and Feature Pyramid Networks (FPN)29 fuse hierarchical semantics. Attention-integrated multi-scale methods, like Pyramid Attention Modules (PAM)30, have gained traction. Yet, many apply multi-scale independently or with simplistic combinations, lacking dynamic cross-scale interactions. DMSCA embeds multi-scale design intrinsically, introducing and fusing scales early with adaptive weighting to exploit inter-scale dynamics deeply and efficiently.

Adaptive activation function

Activation functions inject non-linearity into networks, with adaptive variants dynamically adjusting based on inputs for superior expressiveness. Unlike fixed functions such as ReLU31, Swish32 and DY-ReLU33 adapt shapes to feature complexities. These innovations inspire DMSCA’s Adaptive Activation (AA) module, which dynamically modulates output mappings to optimize non-linear representations within the attention framework.

Advanced attention mechanisms and vision transformers

Recent advancements include sophisticated attention designs and Vision Transformers (ViT)34, which process images as patch sequences. Models like Visual Attention Network (VAN) use large-kernel convolutions for long-range modeling in CNNs, while EfficientFormerV2 excels in lightweight hybrids. Moreover, the applicability of attention mechanisms has been successfully extended to image restoration domains, employing innovative techniques such as dual-domain strip attention20 and modulating Transformers21 to address fine-grained pixel reconstruction challenges. Although powerful, these approaches often incur high computational costs or are primarily tailored for Transformer backbones. In contrast, DMSCA offers a lightweight, plug-and-play solution optimized for CNNs, enhancing features without structural overhauls. Overall, while prior works advance individual aspects, DMSCA holistically integrates dynamic multi-scale, channel-spatial, and adaptive elements for superior synergy.

Methodology

This section details the architecture, core components, and mathematical foundations of the Dynamic Multi-scale Channel-Spatial Attention (DMSCA) mechanism. Designed to enhance the feature representation capabilities of Convolutional Neural Networks (CNNs), DMSCA dynamically integrates multi-scale contexts, directional perceptions, and adaptive activations. We provide precise mathematical derivations, a parameter complexity analysis, and implementation details to ensure reproducibility.

Overall architecture of DMSCA

As illustrated in Fig. 1b, the Dynamic Multi-Scale Channel-Spatial Attention (DMSCA) module is designed as a versatile, plug-and-play unit compatible with standard CNN architectures (e.g., ResNet, MobileNet).

Figure 1(a) specifically demonstrates its deployment within the ResNet-50 backbone. At the macroscopic level (top panel), DMSCA is embedded within the residual stages, effectively regulating feature flow throughout the network hierarchy. At the microscopic level (bottom panel), we integrate DMSCA into the Bottleneck Block. Crucially, the module is placed after the final \(1\times 1\) expansion convolution. This placement ensures that DMSCA operates on the expanded, high-dimensional feature space, allowing it to capture the richest possible channel interdependencies and spatial contexts before the feature map is compressed or added to the identity shortcut.
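For concreteness, this placement can be sketched in PyTorch as a standard bottleneck block with an attention hook after the \(1\times 1\) expansion convolution. The channel sizes and the shortcut handling below are illustrative assumptions, and the `attention` argument stands in for DMSCA (here defaulting to an identity placeholder):

```python
import torch
import torch.nn as nn

class BottleneckWithAttention(nn.Module):
    """ResNet-style bottleneck; the attention module sits after the final
    1x1 expansion conv, before the residual addition (assumed placement)."""
    def __init__(self, in_ch, mid_ch, attention=None, expansion=4):
        super().__init__()
        out_ch = mid_ch * expansion
        self.conv1 = nn.Sequential(nn.Conv2d(in_ch, mid_ch, 1, bias=False),
                                   nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True))
        self.conv2 = nn.Sequential(nn.Conv2d(mid_ch, mid_ch, 3, padding=1, bias=False),
                                   nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True))
        self.conv3 = nn.Sequential(nn.Conv2d(mid_ch, out_ch, 1, bias=False),
                                   nn.BatchNorm2d(out_ch))
        # DMSCA (or any plug-and-play attention) operates on the expanded features
        self.attention = attention if attention is not None else nn.Identity()
        self.shortcut = (nn.Conv2d(in_ch, out_ch, 1, bias=False)
                         if in_ch != out_ch else nn.Identity())
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.conv3(self.conv2(self.conv1(x)))
        out = self.attention(out)  # refine the expanded, high-dimensional features
        return self.relu(out + self.shortcut(x))

block = BottleneckWithAttention(64, 64)  # 64 -> 256 channels
y = block(torch.randn(2, 64, 16, 16))
```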

The module accepts an input feature map \(X\in\mathbb{R}^{C\times H\times W}\). Unlike conventional attention mechanisms that often treat channel and spatial dimensions in isolation, DMSCA is guided by two core design principles:

  1. Contextual Robustness: ensuring features are refined using both global dependencies and multi-scale local details.

  2. Adaptive Selection: allowing the network to dynamically weigh the importance of channel versus spatial information based on the specific input content.

The internal architecture, detailed in Fig. 1(b), orchestrates six cohesive components: the Global Context Encoder (GCE), Temperature-controlled Channel Attention (TCA), Multi-scale Spatial Context Encoder (MSCE), Directional Information Interaction (DII), Dynamic Feature Fusion (DFF), and Adaptive Activation (AA). These components collaborate in three phases, described below.

Fig. 1

(a) Architecture of DMSCA-Integrated ResNet50 (Block Level). (b) Overview of DMSCA.

Temperature-controlled channel recalibration

Standard global pooling often risks losing feature distinctiveness. To address this, the Global Context Encoder (GCE) aggregates global context via parallel average and max pooling. This is followed by the Temperature-controlled Channel Attention (TCA) mechanism. A critical theoretical innovation here is the introduction of a temperature parameter \(\tau\). By adjusting \(\tau\), the module can control the sharpness of the Sigmoid distribution, thereby preventing gradients from vanishing in deep layers and allowing for a softer, more informative attention map.

Mathematically, let \(F_{\mathrm{global}}\) denote the context features extracted by the MLP in the GCE. The channel attention vector \(A_c\) and the recalibrated feature \(X_c\) are formulated as:

$$A_c=\sigma\left(\frac{F_{\mathrm{global}}}{\tau}\right),\qquad X_c=X\odot A_c$$

where \(\odot\) denotes element-wise multiplication.

Synergistic spatial and directional encoding

To overcome the limited receptive field of standard convolutions, we introduce a dual-pathway approach for spatial refinement.

First, the Multi-scale Spatial Context Encoder (MSCE) employs multi-kernel convolutions (\(k_1,k_2,k_3\)) to capture local context at varying scales (\(F_{\mathrm{multi}}\)). Simultaneously, the Directional Information Interaction (DII) specifically addresses the lack of positional information in global pooling. By decomposing the feature map into horizontal and vertical encodings (\(F_{\mathrm{dir}}\)), DII preserves precise directional cues.

These two distinct information flows are coupled to generate a spatially enhanced feature map \(X_s\). To ensure robustness against varying input resolutions (e.g., when \(H\) or \(W\) is small), adaptive boundary checks are implicitly integrated into the pooling operations. The fusion is defined as:

$$X_s=X_c\odot\sigma\left(F_{\mathrm{multi}}+F_{\mathrm{dir}}\right)$$

where \(\sigma\) represents the Sigmoid function.

Dynamic aggregation and activation

Static summation of features is often suboptimal, as different inputs may require varying emphasis on the channel or spatial domain. The Dynamic Feature Fusion (DFF) module solves this by learning adaptive weights (\(w_0,w_1\)) to non-linearly fuse the intermediate features. Finally, Adaptive Activation (AA) creates a content-aware activation function, further refining the output \(X_{\mathrm{final}}\) for subsequent layers. The complete aggregation process is expressed as:

$$X_{\mathrm{out}}=w_0\odot X_c+w_1\odot X_s$$
$$X_{\mathrm{final}}=X_{\mathrm{out}}\odot\sigma\left(\mathrm{BN}\left(\mathrm{Conv}\left(X_{\mathrm{out}}\right)\right)\right)$$
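The aggregation equations above can be sketched in PyTorch as follows. The text at this point does not specify how \(w_0\) and \(w_1\) are generated, so the \(1\times 1\) convolution with a softmax over the two weight maps (forcing \(w_0+w_1=1\) at each pixel) is an illustrative assumption:

```python
import torch
import torch.nn as nn

class DynamicAggregation(nn.Module):
    """Sketch of the aggregation phase: learned per-pixel weights w0, w1 fuse
    the channel- and spatially-refined features, then an adaptive gate
    sigma(BN(Conv(.))) modulates the fused map. Weight generation is assumed."""
    def __init__(self, channels):
        super().__init__()
        # 1x1 conv producing two weight maps, softmaxed so w0 + w1 = 1 (assumption)
        self.weight_gen = nn.Conv2d(2 * channels, 2, kernel_size=1)
        self.gate = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1, bias=False),
                                  nn.BatchNorm2d(channels))

    def forward(self, x_c, x_s):
        w = torch.softmax(self.weight_gen(torch.cat([x_c, x_s], dim=1)), dim=1)
        w0, w1 = w[:, 0:1], w[:, 1:2]
        x_out = w0 * x_c + w1 * x_s                      # X_out = w0 (.) Xc + w1 (.) Xs
        return x_out * torch.sigmoid(self.gate(x_out))   # X_final = X_out (.) sigma(BN(Conv(X_out)))

agg = DynamicAggregation(8)
x_final = agg(torch.randn(1, 8, 6, 6), torch.randn(1, 8, 6, 6))
```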

Global context encoder

The Global Context Encoder (GCE) constitutes the foundational layer of the DMSCA module, tasked with transforming spatial information into a compact channel-wise descriptor. For an input feature map \(X\in\mathbb{R}^{C\times H\times W}\), the goal is to extract a comprehensive global representation that guides the subsequent attention recalibration.

Theoretical motivation and design

Standard channel attention mechanisms, such as SE-Net14, predominantly rely on Global Average Pooling (GAP). While GAP effectively captures the global statistical distribution (background information), it inherently suppresses high-frequency signals, potentially diluting the discriminative details of salient objects. Conversely, Global Max Pooling (GMP) excels at preserving the most prominent features (texture and edges) but may overlook the broader contextual information34.

To address these limitations, the GCE employs a dual-pathway aggregation strategy. By synthesizing both average and maximum pooling statistics, the encoder ensures a richer feature representation. Furthermore, regarding the fusion strategy, we opt for element-wise addition rather than concatenation. Theoretical analysis suggests that addition acts as a parameter-efficient superposition of signals, preserving the distinct properties of both descriptors without increasing the dimensionality of the subsequent Multi-Layer Perceptron (MLP). This choice aligns with our goal of designing a lightweight architecture.

Mathematical formulation

The encoding process is formally defined as follows. First, the spatial dimensions are compressed to generate two distinct channel descriptors, \(X_{\mathrm{avg}}\) and \(X_{\mathrm{max}}\):

$$X_{\mathrm{avg}}=\mathrm{GAP}\left(X\right),\qquad X_{\mathrm{max}}=\mathrm{GMP}\left(X\right)$$

These descriptors are then fused via element-wise summation to form a unified global descriptor \(X_{\mathrm{fused}}\in\mathbb{R}^{C\times 1\times 1}\). This fused vector is subsequently propagated through a shared MLP (comprising dimensionality reduction and restoration layers) to model channel interdependencies. The final global context \(F_{\mathrm{global}}\) is obtained as:

$$X_{\mathrm{fused}}=X_{\mathrm{avg}}+X_{\mathrm{max}}$$
$$F_{\mathrm{global}}=\mathrm{MLP}\left(X_{\mathrm{fused}}\right)$$

This refined descriptor \(F_{\mathrm{global}}\) serves as the input for the Temperature-controlled Channel Attention (TCA) detailed in the subsequent section. The structure of the GCE is shown in Fig. 2.
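A minimal PyTorch sketch of the GCE as formulated above; realizing the bottleneck MLP with \(1\times 1\) convolutions and using a reduction ratio of 16 are assumed defaults:

```python
import torch
import torch.nn as nn

class GlobalContextEncoder(nn.Module):
    """GCE sketch: parallel GAP/GMP descriptors are summed, then a shared
    bottleneck MLP models channel interdependencies (reduction ratio assumed)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        hidden = max(channels // reduction, 1)
        self.mlp = nn.Sequential(nn.Conv2d(channels, hidden, 1),
                                 nn.ReLU(inplace=True),
                                 nn.Conv2d(hidden, channels, 1))

    def forward(self, x):
        x_avg = torch.mean(x, dim=(2, 3), keepdim=True)  # GAP -> C x 1 x 1
        x_max = torch.amax(x, dim=(2, 3), keepdim=True)  # GMP -> C x 1 x 1
        return self.mlp(x_avg + x_max)                   # F_global = MLP(X_avg + X_max)

gce = GlobalContextEncoder(32)
f_global = gce(torch.randn(2, 32, 14, 14))
```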

Fig. 2

Schematic illustration of the Global Context Encoder (GCE). The module processes the input feature \(X\) through parallel Global Average Pooling (GAP) and Global Max Pooling (GMP) streams. The resulting descriptors, \(X_{\mathrm{avg}}\) and \(X_{\mathrm{max}}\), are fused via element-wise addition to preserve complementary spatial information. A shared MLP then projects this fused representation into the final global context descriptor \(F_{\mathrm{global}}\).

Channel attention with temperature control

Following the extraction of the global context descriptor \(F_{\mathrm{global}}\) by the GCE, the Temperature-controlled Channel Attention (TCA) module is tasked with generating calibrated importance weights for each channel. This process allows the network to selectively emphasize informative feature maps while suppressing irrelevant background noise.

Theoretical motivation and design

In conventional channel attention mechanisms (e.g., SE-Net14), the channel weights are typically generated via a standard Sigmoid function. However, the standard Sigmoid lacks flexibility: its slope is fixed, which may lead to suboptimal gradient flow during training. Specifically, when the input values are large (the saturation region), gradients vanish, hindering effective backpropagation.

To mitigate this, we introduce a learnable temperature parameter \(\tau\) into the activation mechanism, inspired by temperature scaling in knowledge distillation35 and model calibration. Theoretically, \(\tau\) acts as a regularization term for the activation distribution:

  1. Low temperature (\(\tau<1\)): sharpens the distribution, forcing the model to make binary, decisive choices (strictly selecting specific channels).

  2. High temperature (\(\tau>1\)): softens the distribution, allowing smoother gradient flow and encouraging the model to utilize a broader set of channels during early training phases.
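The two regimes above can be verified numerically: dividing the logits by a small temperature pushes the Sigmoid outputs toward 0 and 1, while a large temperature compresses them toward 0.5 (the logit values here are illustrative):

```python
import torch

# Temperature-scaled sigmoid: tau < 1 sharpens, tau > 1 softens (illustrative values)
logits = torch.tensor([-2.0, -0.5, 0.5, 2.0])

sharp = torch.sigmoid(logits / 0.5)  # tau = 0.5: weights pushed toward 0/1
plain = torch.sigmoid(logits)        # tau = 1.0: standard sigmoid
soft  = torch.sigmoid(logits / 4.0)  # tau = 4.0: weights compressed toward 0.5

# The spread of the attention weights shrinks as temperature grows
assert (sharp.max() - sharp.min()) > (plain.max() - plain.min()) > (soft.max() - soft.min())
```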

Furthermore, to maintain computational efficiency while modeling channel interdependencies, we employ a bottleneck structure within the MLP, reducing the channel dimensionality by a reduction ratio \(r\) before restoring it.

Mathematical formulation

Let \(F_{\mathrm{global}}\in\mathbb{R}^{C\times 1\times 1}\) be the context descriptor obtained from the GCE (prior to activation). The TCA module first normalizes this descriptor using the temperature parameter \(\tau\). The final channel attention weights \(A_c\in\mathbb{R}^{C\times 1\times 1}\) are generated via the temperature-scaled Sigmoid function:

$$A_c=\sigma\left(\frac{F_{\mathrm{global}}}{\tau}\right)$$

where \(\sigma\left(\cdot\right)\) denotes the Sigmoid function. Subsequently, the input feature map \(X\) is recalibrated via channel-wise multiplication to produce the refined feature \(X_c\):

$$X_c=X\odot A_c$$

Here, \(\odot\) represents element-wise multiplication, broadcasting the \(1\times 1\) weights across the spatial dimensions \(H\times W\). This mechanism ensures that the subsequent spatial encoding layers receive features with optimized channel distributions. The structure of the TCA is shown in Fig. 3.
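A minimal sketch of the TCA recalibration, with \(\tau\) as a learnable scalar parameter; its initial value, and any constraint keeping it positive during training, are assumptions:

```python
import torch
import torch.nn as nn

class TemperatureChannelAttention(nn.Module):
    """TCA sketch: a learnable temperature tau scales the context descriptor
    before the Sigmoid; the C x 1 x 1 weights broadcast over H x W."""
    def __init__(self, init_tau=1.0):
        super().__init__()
        self.tau = nn.Parameter(torch.tensor(init_tau))  # learnable temperature

    def forward(self, x, f_global):
        a_c = torch.sigmoid(f_global / self.tau)  # A_c = sigma(F_global / tau)
        return x * a_c                            # X_c = X (.) A_c, broadcast over H x W

tca = TemperatureChannelAttention()
x = torch.randn(2, 16, 8, 8)
x_c = tca(x, torch.randn(2, 16, 1, 1))
```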

Fig. 3

Diagram of the Temperature-controlled Channel Attention (TCA). The global descriptor \(F_{\mathrm{global}}\) (output of the MLP) is scaled by a temperature factor \(\tau\) to adjust the distribution sharpness. The weights \(A_c\) are obtained via the Sigmoid function and applied to the input \(X\) through element-wise multiplication to generate the channel-refined feature \(X_c\).

Multi-scale spatial context encoder

Following the refinement of channel interdependencies in the TCA module, the feature map \(X_c\in\mathbb{R}^{C\times H\times W}\) contains enhanced channel-wise representations. However, convolution operations with fixed kernel sizes inherently struggle to capture objects of varying scales simultaneously. A small kernel focuses on local texture details, while a larger kernel captures broader semantic shapes. To overcome this limitation and enrich the feature representation with diverse spatial granularities, we design the Multi-scale Spatial Context Encoder (MSCE).

Design principles and spatial descriptor generation

The primary objective of the MSCE is to extract spatial context without the computational burden of processing the full channel depth. Motivated by the observation that channel-wise pooling is an efficient method for generating spatial attention descriptors15, we first decouple spatial information from the channel dimension.

Instead of standard dimensionality reduction via \(1\times 1\) convolution, we employ dual-channel pooling operations. We compute average-pooled features to aggregate background statistics and max-pooled features to highlight discriminative regions along the channel axis. This results in a comprehensive spatial descriptor \(S_{\mathrm{desc}}\in\mathbb{R}^{2\times H\times W}\):

$$S_{\mathrm{avg}}=\frac{1}{C}\sum_{i=1}^{C}X_c\left(i\right)$$
$$S_{\mathrm{max}}=\max_{i\in\{1,\dots,C\}}X_c\left(i\right)$$
$$S_{\mathrm{desc}}=\mathrm{Concat}\left(\left[S_{\mathrm{avg}},S_{\mathrm{max}}\right]\right)$$

where \(S_{\mathrm{avg}}\in\mathbb{R}^{1\times H\times W}\) and \(S_{\mathrm{max}}\in\mathbb{R}^{1\times H\times W}\) represent the channel-wise statistics. This compression strategy significantly reduces parameter overhead while preserving essential spatial structures.

Multi-branch feature extraction

To facilitate multi-scale feature interaction, \(S_{\mathrm{desc}}\) serves as the input to a parallel multi-branch architecture inspired by the Inception paradigm36. The module employs a set of \(n\) parallel convolutional layers with varying kernel sizes \(k\in\{3\times 3,\,5\times 5,\,7\times 7\}\). Each branch \(i\) applies a specific kernel size \(k_i\) to capture spatial context at a distinct receptive field. The operation for the \(i\)-th branch is defined as:

$$L_i=\delta\left(\mathcal{B}\left(\mathrm{Conv}_{k_i\times k_i}\left(S_{\mathrm{desc}}\right)\right)\right)$$

where \(\mathrm{Conv}_{k_i\times k_i}\) denotes a convolutional layer reducing the channel dimension from 2 to 1, \(\mathcal{B}\) represents Batch Normalization, and \(\delta\) denotes the ReLU activation function. This design allows the network to adaptively perceive visual patterns ranging from fine-grained edges to coarse-grained object contours.

Feature fusion

Unlike standard concatenation approaches that increase dimensionality, we aim to synthesize a unified spatial attention map. The outputs from all branches \(\{L_1,\dots,L_n\}\) are fused via an averaging operation to produce the final multi-scale context encoding \(F_{\mathrm{multi}}\in\mathbb{R}^{1\times H\times W}\). This ensures that the resulting spatial map represents a consensus of features across different scales:

$$F_{\mathrm{multi}}=\frac{1}{n}\sum_{i=1}^{n}L_i$$

The resulting \(F_{\mathrm{multi}}\) encapsulates robust spatial position information, which is subsequently integrated with the directional features discussed in the following section. The schematic structure of the MSCE is illustrated in Fig. 4.
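Putting the descriptor generation, multi-branch extraction, and averaging together, a PyTorch sketch of the MSCE might look as follows; the padding choices (assumed here) keep each branch's output at the input resolution:

```python
import torch
import torch.nn as nn

class MultiScaleSpatialContextEncoder(nn.Module):
    """MSCE sketch: channel-wise avg/max pooling forms a 2 x H x W descriptor,
    parallel k x k branches (k in {3, 5, 7}) each map it to 1 x H x W, and
    the branch outputs are averaged into F_multi."""
    def __init__(self, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Conv2d(2, 1, k, padding=k // 2, bias=False),
                          nn.BatchNorm2d(1), nn.ReLU(inplace=True))
            for k in kernel_sizes])

    def forward(self, x_c):
        s_avg = torch.mean(x_c, dim=1, keepdim=True)  # S_avg: 1 x H x W
        s_max = torch.amax(x_c, dim=1, keepdim=True)  # S_max: 1 x H x W
        s_desc = torch.cat([s_avg, s_max], dim=1)     # S_desc: 2 x H x W
        outs = [branch(s_desc) for branch in self.branches]
        return torch.stack(outs, dim=0).mean(dim=0)   # F_multi = (1/n) sum L_i

msce = MultiScaleSpatialContextEncoder()
f_multi = msce(torch.randn(2, 32, 16, 16))
```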

Fig. 4

Multi-scale spatial context encoder.

Directional information interaction

While the Multi-scale Spatial Context Encoder (MSCE) captures local spatial granularities effectively, standard convolution and global pooling operations often result in a loss of precise positional information. To mitigate this, we introduce the Directional Information Interaction (DII) module. Building upon the principles of Coordinate Attention17, the DII module decomposes spatial attention into two orthogonal directions. However, unlike prior works that process these directions independently, our design introduces a novel interaction mechanism that couples horizontal and vertical features, allowing the network to perceive object structures more holistically.

Coordinate aggregation and embedding

To preserve long-range dependencies with precise positional information, we utilize adaptive average pooling kernels \(\left(H,1\right)\) and \(\left(1,W\right)\) to encode each channel along the horizontal and vertical coordinates, respectively.

For the input feature map \(X_c\) (output from the TCA module), the aggregation for the \(c\)-th channel at height \(h\) and width \(w\) is formulated as:

$$z_h^c\left(h\right)=\frac{1}{W}\sum_{0\le i<W}X_c\left(c,h,i\right)$$
$$z_w^c\left(w\right)=\frac{1}{H}\sum_{0\le j<H}X_c\left(c,j,w\right)$$

These operations generate two direction-aware feature maps: \(Z_h\in\mathbb{R}^{C\times H\times 1}\) and \(Z_w\in\mathbb{R}^{C\times 1\times W}\). To reduce computational complexity and model channel interdependencies, these maps are projected to a lower dimension \(C_{\mathrm{mid}}\) via \(1\times 1\) convolutions followed by Batch Normalization (BN):

$$F_h=\mathcal{B}\left(\mathrm{Conv}_h\left(Z_h\right)\right)\in\mathbb{R}^{C_{\mathrm{mid}}\times H\times 1}$$
$$F_w=\mathcal{B}\left(\mathrm{Conv}_w\left(Z_w\right)\right)\in\mathbb{R}^{C_{\mathrm{mid}}\times 1\times W}$$
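This aggregation and embedding step can be sketched briefly; the concrete values of \(C_{\mathrm{mid}}\) and the tensor sizes below are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Coordinate aggregation sketch: pool each channel along one axis, then embed
# to a reduced dimension C_mid with a 1x1 conv + BN (sizes are assumptions).
C, C_mid, H, W = 16, 4, 8, 10
x_c = torch.randn(2, C, H, W)

z_h = torch.mean(x_c, dim=3, keepdim=True)  # (H,1)-pooled: C x H x 1
z_w = torch.mean(x_c, dim=2, keepdim=True)  # (1,W)-pooled: C x 1 x W

embed_h = nn.Sequential(nn.Conv2d(C, C_mid, 1, bias=False), nn.BatchNorm2d(C_mid))
embed_w = nn.Sequential(nn.Conv2d(C, C_mid, 1, bias=False), nn.BatchNorm2d(C_mid))
f_h, f_w = embed_h(z_h), embed_w(z_w)       # C_mid x H x 1 and C_mid x 1 x W
```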

Orthogonal interaction and modulation

A key theoretical limitation in standard coordinate attention is the independence of the \(x\) and \(y\) axes during the encoding phase. Visual objects possess structural correlations across both dimensions. To address this, we propose a dense interaction strategy.

First, we broadcast (expand) the direction-specific features \(F_h\) and \(F_w\) to the full spatial resolution \(H\times W\), denoted as \(\tilde{F}_h\) and \(\tilde{F}_w\). These are then concatenated to form a unified spatial representation, which is processed by a non-linear interaction layer:

$$F_{\mathrm{inter}}=\delta\left(\mathcal{B}\left(\mathrm{Conv}_{1\times 1}\left(\mathrm{Concat}\left[\tilde{F}_h,\tilde{F}_w\right]\right)\right)\right)$$

where \(F_{\mathrm{inter}}\in\mathbb{R}^{C_{\mathrm{mid}}\times H\times W}\) represents the coupled spatial context.

Crucially, rather than simply projecting this interaction feature directly, we employ it to modulate the original directional embeddings. This design enables the horizontal features to be refined by vertical context, and vice versa. The modulated features are then projected back to the original channel dimension \(C\):

$$M_h=\mathrm{Conv}_{\mathrm{out}}^{h}\left(F_{\mathrm{inter}}\odot\tilde{F}_h\right)$$
$$M_w=\mathrm{Conv}_{\mathrm{out}}^{w}\left(F_{\mathrm{inter}}\odot\tilde{F}_w\right)$$

where \(\odot\) denotes element-wise multiplication.

Attention generation and fusion

The final directional attention map \(A_{\mathrm{dir}}\in\mathbb{R}^{C\times H\times W}\) is synthesized by fusing the modulated orthogonal components via a Sigmoid activation function. This creates a spatially sensitive weight map that highlights regions of interest based on both coordinate positions:

$$A_{\mathrm{dir}}=\sigma\left(M_h+M_w\right)$$

Finally, in alignment with the DMSCA architecture’s design, this directional attention \(A_{\mathrm{dir}}\) serves as a weighting factor for the multi-scale context \(F_{\mathrm{multi}}\) (derived in Sect. 3.4). The complete spatial attention mechanism combines these two streams to produce the spatially refined output \(X_s\):

$$X_s=X_c\odot\left(\sigma\left(F_{\mathrm{multi}}\right)\odot A_{\mathrm{dir}}\right)$$

This hierarchical combination ensures that the feature map is enhanced by both the local multi-scale details from MSCE and the global positional structures from DII. The structure of the DII is shown in Fig. 5.
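Combining the aggregation, interaction, and modulation steps, a hedged PyTorch sketch of the DII might read as follows; the choice \(C_{\mathrm{mid}}=C/4\) and the use of average pooling alone for the directional descriptors are assumptions:

```python
import torch
import torch.nn as nn

class DirectionalInteraction(nn.Module):
    """DII sketch: expanded directional embeddings are concatenated, passed
    through a 1x1 interaction layer, used to modulate each direction, and the
    modulated maps are fused into A_dir (C_mid = C // 4 assumed)."""
    def __init__(self, channels, c_mid=None):
        super().__init__()
        c_mid = c_mid or max(channels // 4, 1)
        self.embed_h = nn.Sequential(nn.Conv2d(channels, c_mid, 1, bias=False),
                                     nn.BatchNorm2d(c_mid))
        self.embed_w = nn.Sequential(nn.Conv2d(channels, c_mid, 1, bias=False),
                                     nn.BatchNorm2d(c_mid))
        self.interact = nn.Sequential(nn.Conv2d(2 * c_mid, c_mid, 1, bias=False),
                                      nn.BatchNorm2d(c_mid), nn.ReLU(inplace=True))
        self.out_h = nn.Conv2d(c_mid, channels, 1)
        self.out_w = nn.Conv2d(c_mid, channels, 1)

    def forward(self, x_c):
        n, c, h, w = x_c.shape
        # Directional pooling + embedding, broadcast back to H x W
        f_h = self.embed_h(torch.mean(x_c, dim=3, keepdim=True)).expand(-1, -1, h, w)
        f_w = self.embed_w(torch.mean(x_c, dim=2, keepdim=True)).expand(-1, -1, h, w)
        f_inter = self.interact(torch.cat([f_h, f_w], dim=1))  # coupled context
        m_h = self.out_h(f_inter * f_h)                        # modulate, project back to C
        m_w = self.out_w(f_inter * f_w)
        return torch.sigmoid(m_h + m_w)                        # A_dir: C x H x W

dii = DirectionalInteraction(16)
a_dir = dii(torch.randn(2, 16, 8, 8))
```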

Fig. 5

Schematic diagram of the Directional Information Interaction (DII) module. The input \(X_c\) is aggregated via horizontal and vertical pooling. The resulting descriptors are expanded and concatenated to undergo non-linear interaction. This coupled context then modulates the original directional features via element-wise multiplication. Finally, the features are fused to generate the direction-aware attention map \(A_{\mathrm{dir}}\), which encodes precise positional dependencies.

Dynamic feature fusion

The preceding modules independently refine the feature representations: the TCA module produces \(\:{\varvec{X}}_{\varvec{c}}\in\:{\varvec{R}}^{\varvec{C}\times\:\varvec{H}\times\:\varvec{W}}\), emphasizing “what” feature channels are important, while the coupled MSCE and DII modules yield \(\:{\varvec{X}}_{\varvec{s}}\in\:{\varvec{R}}^{\varvec{C}\times\:\varvec{H}\times\:\varvec{W}}\), highlighting “where” informative regions are located.

Standard fusion strategies, such as simple element-wise addition or static concatenation used in CBAM15 or BAM37, assume a fixed relationship between channel and spatial domains. However, this assumption is often suboptimal for complex visual scenes. For instance, in texture-rich background regions, channel distinctiveness (captured by \(\:{\varvec{X}}_{\varvec{c}}\)) may be more critical for classification, whereas in regions containing distinct object boundaries, spatial localization (captured by \(\:{\varvec{X}}_{\varvec{s}}\)) is paramount. To address this spatial variability, we propose the Dynamic Feature Fusion (DFF) module.

Theoretical motivation

The DFF is designed as a learnable gating mechanism that creates pixel-wise competition between channels and spatial attention. By generating a spatially adaptive weight map, the network can dynamically prioritize \(\:{\varvec{X}}_{\varvec{c}}\) or \(\:{\varvec{X}}_{\varvec{s}}\) at each pixel location \(\:\left(\varvec{i},\varvec{j}\right)\). This allows the model to selectively amplify the most relevant attention domain based on the specific local content of the input.

Mathematical formulation

As implemented in our architecture, the fusion process avoids heavy computational overhead by utilizing a lightweight \(\:1\times\:1\) convolutional gating layer. First, the channel-refined features \(\:{\varvec{X}}_{\varvec{c}}\) and spatially refined features \(\:{\varvec{X}}_{\varvec{s}}\) are concatenated along the channel dimension to preserve the complete information context from both domains:

$$\:{\varvec{F}}_{\varvec{c}\varvec{o}\varvec{n}\varvec{c}\varvec{a}\varvec{t}}=\text{Concat}\left(\left[{\varvec{X}}_{\varvec{c}},{\varvec{X}}_{\varvec{s}}\right]\right)\in\:{\varvec{R}}^{2\varvec{C}\times\:\varvec{H}\times\:\varvec{W}}$$

To generate fusion weights, \(\:{\varvec{F}}_{\varvec{c}\varvec{o}\varvec{n}\varvec{c}\varvec{a}\varvec{t}}\) is projected into a 2-channel weight space using a \(\:1\times\:1\) convolution. A Softmax function is then applied along the channel dimension to ensure the weights sum to 1 at every spatial location, creating a normalized probabilistic distribution:

$$\:{\varvec{W}}_{\varvec{f}\varvec{u}\varvec{s}\varvec{i}\varvec{o}\varvec{n}}=\text{Softmax}\left({\text{Conv}}_{2\varvec{C}\to\:2}\left({\varvec{F}}_{\varvec{c}\varvec{o}\varvec{n}\varvec{c}\varvec{a}\varvec{t}}\right)\right)\in\:{\varvec{R}}^{2\times\:\varvec{H}\times\:\varvec{W}}$$

Here, \(\:{\varvec{W}}_{\varvec{f}\varvec{u}\varvec{s}\varvec{i}\varvec{o}\varvec{n}}\) consists of two weight maps, \(\:{\varvec{w}}_{\varvec{c}}\) and \(\:{\varvec{w}}_{\varvec{s}}\), corresponding to the importance of the channel and spatial branches, respectively. The final fused feature \(\:{\varvec{X}}_{\varvec{o}\varvec{u}\varvec{t}}\) is obtained via a weighted summation:

$$\:{\varvec{w}}_{\varvec{c}}={\varvec{W}}_{\varvec{f}\varvec{u}\varvec{s}\varvec{i}\varvec{o}\varvec{n}}\left[:,0,:,:\right]$$
$$\:{\varvec{w}}_{\varvec{s}}={\varvec{W}}_{\varvec{f}\varvec{u}\varvec{s}\varvec{i}\varvec{o}\varvec{n}}\left[:,1,:,:\right]$$
$$\:{\varvec{X}}_{\varvec{o}\varvec{u}\varvec{t}}=\left({\varvec{w}}_{\varvec{c}}\odot\:{\varvec{X}}_{\varvec{c}}\right)+\left({\varvec{w}}_{\varvec{s}}\odot\:{\varvec{X}}_{\varvec{s}}\right)$$

where \(\:\odot\:\) denotes element-wise multiplication. This operation ensures that \(\:{\varvec{X}}_{\varvec{o}\varvec{u}\varvec{t}}\) encapsulates a refined representation where the balance between channel and spatial information is dynamically tuned for every pixel.
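
A minimal PyTorch sketch of the DFF formulation above (module and variable names are ours, not the paper's implementation):

```python
import torch
import torch.nn as nn

class DynamicFeatureFusion(nn.Module):
    """Sketch of the DFF module: a 1x1 convolution projects the
    concatenated branches to two weight maps, which a per-pixel
    Softmax normalizes to sum to 1 at every spatial location."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Conv2d(2 * channels, 2, kernel_size=1)

    def forward(self, X_c, X_s):
        F_concat = torch.cat([X_c, X_s], dim=1)         # (B, 2C, H, W)
        W = torch.softmax(self.gate(F_concat), dim=1)   # (B, 2, H, W)
        w_c, w_s = W[:, 0:1], W[:, 1:2]                 # keep dim for broadcasting
        return w_c * X_c + w_s * X_s                    # pixel-wise weighted sum

dff = DynamicFeatureFusion(16)
out = dff(torch.randn(2, 16, 8, 8), torch.randn(2, 16, 8, 8))
```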

Following this dynamic aggregation, the feature map \(\:{\varvec{X}}_{\varvec{o}\varvec{u}\varvec{t}}\) is passed to the final Adaptive Activation (AA) module (Sect. 3.7) for non-linear shaping. The schematic structure of the DFF is illustrated in Fig. 6.

Fig. 6

Schematic diagram of the Dynamic Feature Fusion (DFF) module. The channel-enhanced (\(\:{\varvec{X}}_{\varvec{c}}\)) and spatially enhanced (\(\:{\varvec{X}}_{\varvec{s}}\)) features are concatenated and projected by a \(\:1\times\:1\) convolution. A spatial Softmax operation generates a normalized weight map (\(\:{\varvec{W}}_{\varvec{f}\varvec{u}\varvec{s}\varvec{i}\varvec{o}\varvec{n}}\)), which dynamically aggregates the inputs via pixel-wise weighted summation to produce \(\:{\varvec{X}}_{\varvec{o}\varvec{u}\varvec{t}}\).

Adaptive activation

Following the dynamic aggregation of channel and spatial streams in the DFF module, the feature map \(\:{\varvec{X}}_{\varvec{o}\varvec{u}\varvec{t}}\in\:{\varvec{R}}^{\varvec{C}\times\:\varvec{H}\times\:\varvec{W}}\) contains a balanced representation of the input. However, standard linear summation can sometimes introduce redundancies or background noise. To ensure that only the most salient features are propagated to the subsequent network layers, we introduce the Adaptive Activation (AA) module as the terminal refinement stage of the DMSCA.

Theoretical motivation

Standard activation functions like \(\:\text{R}\text{e}\text{L}\text{U}\) are static; they apply a fixed threshold (e.g., \(\:\varvec{f}\left(\varvec{x}\right)=\varvec{max}\left(0,\varvec{x}\right)\)) regardless of the input’s contextual properties. Recent advances in deep learning, such as the Swish activation function38, demonstrate that data-dependent, smooth activation functions can significantly improve optimization.

Inspired by this self-gating principle, the AA module is designed to learn a non-linear, spatially aware activation map. Unlike standard activation functions that operate elementwise in isolation, the AA module utilizes a global view of the channel features to determine the activation intensity for each spatial location. This allows the network to effectively “turn off” background noise at the pixel level while enhancing foreground regions based on the consensus of all feature channels.

Mathematical formulation

The AA module operates on the fused feature \(\:{\varvec{X}}_{\varvec{o}\varvec{u}\varvec{t}}\). To compute the activation map, we project the feature map to a single-channel spatial mask using a \(\:1\times\:1\) convolution. This operation compresses the channel dimension, aggregating the information density at each spatial coordinate \(\:\left(\varvec{i},\varvec{j}\right)\):

$$\:{\varvec{F}}_{\varvec{g}\varvec{a}\varvec{t}\varvec{e}}={\text{Conv}}_{\varvec{C}\to\:1}\left({\varvec{X}}_{\varvec{o}\varvec{u}\varvec{t}}\right)+\varvec{b}$$

where \(\:\varvec{b}\) is a learnable bias term that allows the gating threshold to shift.

Crucially, as observed in our implementation, we omit Batch Normalization at this stage to preserve the absolute magnitude of the fused features, which is essential for accurate gating.

The spatial activation map \(\:\varvec{\upalpha\:}\in\:{\varvec{R}}^{1\times\:\varvec{H}\times\:\varvec{W}}\) is then generated via the Sigmoid function, mapping the values to the range \(\:\left(0,1\right)\):

$$\:\varvec{\upalpha\:}=\varvec{\upsigma\:}\left({\varvec{F}}_{\varvec{g}\varvec{a}\varvec{t}\varvec{e}}\right)$$

Finally, the output of the DMSCA module, \(\:{\varvec{X}}_{\varvec{f}\varvec{i}\varvec{n}\varvec{a}\varvec{l}}\in\:{\varvec{R}}^{\varvec{C}\times\:\varvec{H}\times\:\varvec{W}}\), is obtained by modulating the input feature \(\:{\varvec{X}}_{\varvec{o}\varvec{u}\varvec{t}}\) with the learned activation map \(\:\varvec{\upalpha\:}\) via broadcasting:

$$\:{\varvec{X}}_{\varvec{f}\varvec{i}\varvec{n}\varvec{a}\varvec{l}}={\varvec{X}}_{\varvec{o}\varvec{u}\varvec{t}}\odot\:\varvec{\upalpha\:}.$$

This multiplicative gating mechanism allows the gradient to flow adaptively during backpropagation, refining the feature representation through a multi-stage process of contextual perception (GCE/MSCE/DII), dynamic fusion (DFF), and adaptive gating (AA). The schematic structure of the AA is illustrated in Fig. 7.
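
The AA gating can be sketched in a few lines of PyTorch (a simplified illustration, not the authors' code):

```python
import torch
import torch.nn as nn

class AdaptiveActivation(nn.Module):
    """Sketch of the AA module: a 1x1 convolution (with bias, and
    deliberately without Batch Normalization) compresses the channels
    to a single-channel gate, which a Sigmoid maps into (0, 1) before
    broadcast multiplication over the input."""
    def __init__(self, channels):
        super().__init__()
        self.proj = nn.Conv2d(channels, 1, kernel_size=1, bias=True)

    def forward(self, X_out):
        alpha = torch.sigmoid(self.proj(X_out))  # (B, 1, H, W) spatial gate
        return X_out * alpha                     # broadcast over channels

aa = AdaptiveActivation(16)
X_final = aa(torch.randn(2, 16, 8, 8))
```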

Fig. 7

Schematic diagram of the Adaptive Activation (AA) module. The fused feature \(\:{\varvec{X}}_{\varvec{o}\varvec{u}\varvec{t}}\) is compressed via a \(\:1\times\:1\) convolution to a single-channel descriptor. A Sigmoid function generates a spatial gating map \(\:\varvec{\upalpha\:}\), which adaptively scales the features via element-wise multiplication to produce the final output \(\:{\varvec{X}}_{\varvec{f}\varvec{i}\varvec{n}\varvec{a}\varvec{l}}\).

Experiment settings

Datasets

We conducted experiments on three publicly available datasets widely used in image classification. These datasets differ in scale and complexity, enabling a thorough test of DMSCA’s performance and generalization ability:

  1. 1.

    CIFAR-104: Contains 10 categories, with a total of 60,000 32 × 32-pixel color images. 50,000 of them are used for training and 10,000 for testing. This is a relatively small dataset and is often used for quickly verifying the effectiveness of new methods.

  2. 2.

    CIFAR-1004: Like CIFAR-10, but contains 100 categories, with the same number of images and size. Due to the larger number of categories, the classification difficulty is greater.

  3. 3.

    ImageNet (ILSVRC 2012)7: A large-scale dataset with 1,000 classes, ~ 1.28 million training images, and 50,000 validation images, benchmarking model generalization.

Standard preprocessing was applied: normalization, random cropping, and horizontal flipping for CIFAR; scaling to 256 × 256, random 224 × 224 cropping, and flipping for ImageNet training; center 224 × 224 cropping for validation.

Experimental environment and implementation details

Experiments ran on a server with an Intel Core i9-12900K CPU, 128 GB DDR4 RAM, a 4 TB NVMe SSD, and NVIDIA GeForce RTX 3090 GPUs (24 GB VRAM each). Software included PyTorch 2.0.0, CUDA 11.8, and Python 3.9.

DMSCA was integrated into ResNet architectures (ResNet-18, -34, -50)11, placed after the main convolution in each residual block, before the identity mapping, enhancing features while preserving residual learning.
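
The placement can be sketched with a simplified basic block; DMSCA itself is stood in for by an arbitrary nn.Module (here nn.Identity), since only the insertion point matters for this illustration:

```python
import torch
import torch.nn as nn

class BasicBlockWithAttention(nn.Module):
    """Simplified residual basic block showing where the attention
    module is applied: after the main convolutions, before the
    identity mapping is added back."""
    def __init__(self, channels, attention: nn.Module):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.attn = attention
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = self.attn(out)             # attention refines features here
        return self.relu(out + identity)  # residual learning preserved

block = BasicBlockWithAttention(16, nn.Identity())  # Identity stands in for DMSCA
y = block(torch.randn(2, 16, 8, 8))
```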

Theoretical computational complexity analysis

To provide a rigorous assessment of DMSCA’s efficiency, we analyze the time complexity of each component. Let the input feature map dimension be \(\:\varvec{H}\times\:\varvec{W}\times\:\varvec{C}\). The complexity breakdown is as follows:

Global Context Encoder (GCE): This module involves global pooling (\(\:\varvec{O}(\varvec{H}\varvec{W}\varvec{C}\))) and a shared MLP for channel interaction. The MLP reduces and restores dimensions with a reduction ratio \(\:\varvec{r}\), contributing \(\:\varvec{O}(2{\varvec{C}}^{2}/\varvec{r})\). Thus, Complexity (GCE) \(\:\approx\:\varvec{O}(\varvec{H}\varvec{W}\varvec{C}+{\varvec{C}}^{2}/\varvec{r})\).

Temperature-controlled Channel Attention (TCA): Since the MLP computation is accounted for in the GCE, the TCA module primarily performs temperature scaling, Sigmoid activation, and element-wise multiplication (broadcasting weights to the feature map). Complexity (TCA) \(\:\approx\:\varvec{O}\left(\varvec{H}\varvec{W}\varvec{C}\right)\).

Multi-scale Spatial Context Encoder (MSCE): Crucially, MSCE first compresses the channel dimension to 2 via channel-wise pooling (\(\:\varvec{O}\left(\varvec{H}\varvec{W}\varvec{C}\right)\)). The subsequent multi-scale convolutions operate only on these 2-channel descriptors, not the full input depth. Assuming \(\:\varvec{K}\) represents the total area of the multi-branch kernels, the convolution cost is \(\:\varvec{O}(\varvec{H}\varvec{W}\cdot\:\varvec{K})\). Since \(\:\varvec{C}\gg\:2\), the dominant term is the pooling. Complexity (MSCE) \(\:\approx\:\varvec{O}\left(\varvec{H}\varvec{W}\varvec{C}\right)\).

Directional Information Interaction (DII): This component includes coordinate pooling (\(\:\varvec{O}\left(\varvec{H}\varvec{W}\varvec{C}\right)\)) and \(\:1\times\:1\) convolutions to model channel interactions in the horizontal and vertical paths. Complexity (DII) \(\:\approx\:\varvec{O}(\varvec{H}\varvec{W}\varvec{C}+{\varvec{C}}^{2})\).

Dynamic Feature Fusion (DFF) & Adaptive Activation (AA): These modules perform lightweight \(\:1\times\:1\) convolutions for gating and fusion. Complexity (DFF + AA) \(\:\approx\:\varvec{O}\left(\varvec{H}\varvec{W}\varvec{C}\right)\).

Total Complexity: Summing these components, the overall time complexity of DMSCA is:

$$\:\varvec{O}\left(\varvec{D}\varvec{M}\varvec{S}\varvec{C}\varvec{A}\right)\approx\:\varvec{O}(\varvec{H}\varvec{W}\varvec{C}+\frac{{\varvec{C}}^{2}}{\varvec{r}}+{\varvec{C}}^{2})$$

This analysis confirms that the complexity is dominated by linear terms related to the spatial resolution (\(\:\varvec{H}\varvec{W}\varvec{C}\)) and quadratic terms related to channel interaction (\(\:{\varvec{C}}^{2}\)). The space complexity is mainly determined by the intermediate feature maps and remains at \(\:\varvec{O}\left(\varvec{H}\varvec{W}\varvec{C}\right)\).
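
The dominant terms can be checked numerically with simple arithmetic; the stage dimensions below are illustrative, not taken from the paper:

```python
# Rough operation-count sketch for the dominant complexity terms,
# for one example feature-map stage (values illustrative)
H, W, C, r = 14, 14, 1024, 16

spatial_term = H * W * C          # pooling / element-wise terms, O(HWC)
gce_mlp_term = 2 * C * C // r     # GCE bottleneck MLP, O(2C^2/r)
dii_term = C * C                  # DII 1x1 channel interactions, O(C^2)

total = spatial_term + gce_mlp_term + dii_term
# Overhead relative to a single pass over the feature map
ratio = total / (H * W * C)
```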

The hyperparameter settings for training are shown in Table 1. For different datasets, we used different learning rates and batch sizes. The optimizer uniformly uses momentum-based stochastic gradient descent (SGD). Learning rate scheduling adopts the cosine annealing strategy13.

Table 1 Main hyperparameter settings.

Hyperparameter sensitivity analysis design

To verify the sensitivity of DMSCA to key hyperparameters, we conducted a systematic sensitivity analysis experiment:

  1. 1.

    Reduction ratio (r): Tested {4, 8, 16, 32, 64} for performance-efficiency trade-offs.

  2. 2.

    Temperature coefficient (τ): Compared {0.5, 1.0, 1.5, 2.0, dynamic} for attention distribution impact.

  3. 3.

    Kernel combinations (K): Evaluated {3}, {3, 5}, {3, 5, 7}, {3, 5, 7, 9}.

  4. 4.

    Fusion weight initialization: Compared dynamic fusion strategies.

Each parameter was varied while the others were held fixed, and we measured accuracy, parameter count, and FLOPs.
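
The one-factor-at-a-time protocol above can be sketched as follows; the configuration generator is a hypothetical helper, with the grids taken from the list above:

```python
# Defaults reflect the paper's reported best settings (r = 16,
# dynamic tau, {3, 5, 7} kernels); the helper itself is illustrative.
defaults = {"r": 16, "tau": "dynamic", "kernels": (3, 5, 7)}
grid = {
    "r": [4, 8, 16, 32, 64],
    "tau": [0.5, 1.0, 1.5, 2.0, "dynamic"],
    "kernels": [(3,), (3, 5), (3, 5, 7), (3, 5, 7, 9)],
}

def configs_one_at_a_time(defaults, grid):
    """Yield configurations varying one hyperparameter at a time."""
    for name, values in grid.items():
        for v in values:
            cfg = dict(defaults)
            cfg[name] = v
            yield cfg

configs = list(configs_one_at_a_time(defaults, grid))  # 5 + 5 + 4 = 14 runs
```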

Baseline methods and evaluation metrics

Baseline attention mechanisms

To verify the superiority of DMSCA, we selected the current mainstream and representative attention mechanisms as baselines for comparison:

  1. 1.

    ResNet (Baseline)11: The original ResNet architecture without any additional attention modules (ResNet-18, ResNet-34, ResNet-50).

  2. 2.

    SE-Net14: A classic channel attention mechanism that learns channel weights through global pooling and two fully connected layers.

  3. 3.

    CBAM15: A hybrid attention module that serially combines channel attention and spatial attention.

  4. 4.

    ECA-Net16: An efficient channel attention mechanism that avoids dimensionality reduction operations through one-dimensional convolution.

  5. 5.

    CA17: A novel spatial attention mechanism that captures directional information and long-range dependencies by decomposing two-dimensional spatial attention into two one-dimensional encoding processes.

  6. 6.

    SimAM39: A parameter-free attention mechanism designed based on neuroscience theory.

  7. 7.

    GAM40: A global attention mechanism that combines channel and spatial attention.

  8. 8.

    A²-Nets41: Dual attention networks that simultaneously model position and channel attention.

  9. 9.

    BAM37: Bottleneck attention module that adopts parallel channel and spatial attention branches.

All baseline attention modules are integrated into the ResNet architecture in the same way as DMSCA and are compared fairly using the same training strategy and hyperparameters.

Comprehensive evaluation framework

We evaluate DMSCA and the baseline methods from multiple dimensions:

  1. 1.

    Classification Performance: Top-1/Top-5 accuracy (ImageNet), Top-1 (CIFAR); mean ± std from multiple runs; per-class analysis on CIFAR-100.

  2. 2.

    Computational Efficiency: Parameters (M), FLOPs (G), memory (MB), inference time (ms), efficiency score (accuracy gain/param increase %).

  3. 3.

    Statistical Significance: 95% confidence intervals; t-tests/ANOVA with Bonferroni correction.

  4. 4.

    Training Dynamics: Loss/accuracy curves; convergence epochs (e.g., to 95% final accuracy).

  5. 5.

    Robustness and Generalization: Cross-architecture (ResNet depths); cross-dataset; ablation with statistical validation.

This framework thoroughly assesses DMSCA’s efficacy.

Experimental results and analysis

This section presents a comprehensive evaluation of DMSCA across multiple metrics, comparing it with state-of-the-art attention mechanisms. We analyze classification performance, computational efficiency, component contributions through ablation studies, training dynamics, and feature visualization to demonstrate DMSCA’s effectiveness.

Main classification performance comparison

We compared DMSCA with various baseline attention mechanisms (SE-Net14, CBAM15, ECA-Net16, and CA17) and the original ResNet models without attention (ResNet-18, ResNet-34, ResNet-50)11 on three datasets: CIFAR-10, CIFAR-100, and ImageNet. All experiments were conducted under the same hyperparameter settings and training strategies to ensure fairness. To reduce the influence of randomness, all accuracy results are averages from five independent runs, reported with standard deviations.

Table 2 Top-1 accuracy (%) comparison of CIFAR-10, CIFAR-100, and ImageNet.

As demonstrated in Table 2, DMSCA consistently outperforms all competing methods across all datasets and network architectures:

  1. 1)

    On CIFAR-10, DMSCA improves Top-1 accuracy by 2.3%, 2.3%, and 2.2% when integrated with ResNet-18, ResNet-34, and ResNet-50, respectively.

  2. 2)

    On the more challenging CIFAR-100 dataset, DMSCA achieves substantial gains of 1.8%, 1.8%, and 1.7%, demonstrating its effectiveness for fine-grained classification tasks.

  3. 3)

    On the large-scale ImageNet dataset, ResNet-50 + DMSCA outperforms the baseline by 1.52% in Top-1 accuracy and 0.95% in Top-5 accuracy, surpassing all other attention mechanisms. Notably, it exceeds the recent CA mechanism by 0.37% on ResNet-50.

These consistent improvements across diverse datasets and architectures confirm that DMSCA’s dynamic multi-scale context-aware design effectively captures discriminative features, significantly enhancing CNN classification performance. Compared to existing attention mechanisms, DMSCA demonstrates substantial and consistent advantages in accuracy improvement.

Computational efficiency analysis

When evaluating the attention mechanism, in addition to focusing on its performance improvement, the computational cost is also a crucial consideration factor. Table 3 provides a detailed comparison of the additional parameters, floating-point operations, GPU memory usage, and inference time introduced by DMSCA and each baseline attention mechanism in the ResNet-50 architecture.

Table 3 Computational cost and efficiency comparison of different attention mechanisms on ResNet-50.

The efficiency analysis reveals:

  1. 1)

    Parameters and FLOPs: DMSCA introduces a moderate parameter increase (+ 11.34%) and computational overhead (+ 2.43%) compared to the baseline. While slightly higher than SE-Net and CBAM, this cost is justified by DMSCA’s superior accuracy gains. ECA-Net achieves a minimal parameter increase but more modest performance improvements, while CA maintains low parameter overhead with computational costs comparable to DMSCA’s.

  2. 2)

    Memory and Inference Time: DMSCA’s memory usage (395 MB) and inference latency (9.2 ms) remain competitive despite its multi-component architecture. Its inference time increase (+ 10.84%) is comparable to that of CBAM and CA, demonstrating efficient implementation of its complex attention mechanisms.

Overall, while DMSCA introduces moderate computational overhead, its significant performance improvements justify this trade-off, particularly for applications prioritizing accuracy. The efficiency-to-performance ratio remains favorable across all tested metrics.

Comparison with recent attention mechanisms

To conduct a more comprehensive evaluation of the performance of DMSCA, we also compared it with the advanced attention mechanisms proposed in recent years. The results are shown in Table 4.

Table 4 Comparison with recent attention mechanisms (ImageNet, ResNet-50).

This extended comparison confirms DMSCA’s superior performance. While SimAM introduces no additional parameters, its accuracy improvement is limited (+ 0.76%). GAM achieves the second-best accuracy improvement (+ 1.22%) with a moderate parameter increase. DMSCA delivers the highest absolute performance gain (+ 1.52%), demonstrating that its additional parameter cost translates into meaningful accuracy improvements.

Statistical significance analysis

To confirm the reliability of DMSCA’s performance improvements, we conducted paired t-tests with Cohen’s d effect sizes on ImageNet results using ResNet-50. Table 5 presents the results.

Table 5 Statistical significance test for top-1 accuracy on ImageNet with ResNet-50.

The results in Table 5 clearly demonstrate that the Top-1 accuracy achieved by DMSCA on ResNet-50 is significantly superior to all the compared baseline methods. All the p-values are far less than 0.001, indicating that the observed accuracy differences are extremely unlikely to be caused by random factors. Cohen’s d values are all greater than 1.5, showing a large effect size, which means that the performance improvement brought by DMSCA is not only statistically significant but also has practical application value. These statistical results further strengthen the conclusion that DMSCA is effective.
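
The paired t-test and effect-size computation can be reproduced with a short script; the per-run accuracies below are illustrative placeholders (not the paper’s measurements), and Cohen’s d is computed here on the paired differences (the d_z convention), which is one common choice for paired designs:

```python
import numpy as np
from scipy import stats

def paired_t_and_cohens_d(a, b):
    """Paired t-test plus Cohen's d on the paired differences.

    a, b: per-run Top-1 accuracies for two methods, matched by run.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    t, p = stats.ttest_rel(a, b)        # paired (dependent-samples) t-test
    diff = a - b
    d = diff.mean() / diff.std(ddof=1)  # effect size on paired differences
    return t, p, d

# Illustrative accuracies from five runs (placeholder values)
dmsca = [77.9, 78.0, 77.8, 78.1, 77.9]
baseline = [76.4, 76.5, 76.3, 76.5, 76.4]
t, p, d = paired_t_and_cohens_d(dmsca, baseline)
```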

Ablation studies

To elucidate the contributions of DMSCA’s components, we performed ablation experiments on CIFAR-100 using ResNet-18. We systematically added/removed key elements: Global Context Encoder (GCE), Temperature-Controlled Channel Attention (TCA), Multi-Scale Spatial Context Encoder (MSCE), Direction Information Interaction (DII), Dynamic Feature Fusion (DFF), and Adaptive Activation (AA).

Hyperparameter sensitivity analysis

We systematically analyzed the impact of DMSCA’s key hyperparameters on performance; the results are shown in Table 6.

Table 6 Sensitivity analysis of key hyperparameters of DMSCA (CIFAR-100, ResNet-18).

Analysis indicates that r = 16 balances performance and efficiency optimally. Dynamic \(\:\tau\:\) slightly outperforms fixed values. The {3,5,7} kernel combination is ideal, with diminishing returns for additional scales.

Table 7 Ablation study of DMSCA components on CIFAR-100 with ResNet-18.

Results show each component contributes positively: GCE (+ 0.49%), dynamic TCA (+ 0.17% over fixed), MSCE (+ 0.73% alone, + 0.36% combined), DII (+ 0.27%), DFF (+ 0.13%), AA (+ 0.05%). The core synergy between GCE, dynamic TCA, MSCE, and DII drives DMSCA’s performance.

Visualization analysis

To understand DMSCA’s impact on feature representation, we used Grad-CAM++29 to generate attention heatmaps for ResNet-50 on ImageNet samples.

Table 8 Attention maps from different mechanisms.

DMSCA generates more focused, semantically meaningful maps, demonstrating superior feature localization. Visualization reveals SE-Net improves channel weight but lacks spatial precision. CBAM enhances localization but struggles with complex scenes. DMSCA precisely focuses on discriminative regions while suppressing noise, confirming its ability to guide models toward robust feature learning.

To further quantify the quality of the attention maps, we introduce several metrics defined in Table 9 and evaluate them on a subset of CIFAR-10.

Table 9 Quantitative evaluation of attention maps on CIFAR-10 Samples.

The quantitative results in Table 9 are largely consistent with the qualitative observations in Table 8. DMSCA performed best across all three metrics: it achieved the highest focus ratio (0.87) and semantic consistency (0.84), together with the lowest noise suppression ratio (0.08). This strongly demonstrates that DMSCA can more accurately direct attention to the semantically relevant areas of the image while effectively ignoring background noise and irrelevant information. CA performed somewhat worse across all metrics, and SE-Net was comparatively weak. These data further support the superiority of DMSCA in enhancing feature localization and selection capabilities.

Training dynamics analysis

Beyond final performance, we examined DMSCA’s impact on training dynamics using ResNet-50 on ImageNet.

The training loss curve is shown in Fig. 8. The accuracy curve of the validation set (the highest accuracy rate) is shown in Fig. 9.

Fig. 8

Training loss curves.

Fig. 9

Validation top-1 accuracy curves.

DMSCA demonstrates faster loss decay and accuracy ascent from early epochs compared to baselines. It achieves higher accuracy milestones sooner, converges more rapidly, and maintains greater stability in later training with reduced fluctuations. This indicates DMSCA facilitates more efficient learning and yields more robust features.

Robustness and generalization analysis

An excellent attention mechanism should not only perform well on standard test sets but also maintain good performance when faced with various perturbations and different data distributions, that is, it should have good robustness and generalization ability.

We evaluated the performance of DMSCA under common image degradation conditions (such as Gaussian noise, motion blur, and JPEG compression). The experiments were conducted on the CIFAR-100 dataset, with the backbone network being ResNet-18. The results are shown in Table 10.

Table 10 Robustness performance of different attention mechanisms on CIFAR-100 (ResNet-18) for image degradation (top-1 accuracy %).

As Table 10 shows, under all tested image degradation conditions, the ResNet-18 integrated with DMSCA achieved the highest classification accuracy. For instance, under moderate Gaussian noise (σ = 15), DMSCA’s accuracy (68.25%) was 3.04% higher than the Baseline’s (65.21%) and 1.20% higher than the second-best CBAM’s (67.05%). Even under stronger noise (σ = 25) or other types of degradation, DMSCA maintained its lead. This indicates that the feature representations learned by DMSCA are more resistant to these common image perturbations; its dynamic adjustment and multi-scale perception capabilities help extract key features even when information is corrupted.

Cross-dataset generalization

We further evaluated the generalization ability of DMSCA across different datasets. The experimental design was as follows: the model was trained on the source dataset, and then directly tested on the target dataset.

Table 11 Evaluation of generalization ability across datasets.

To verify the universality of DMSCA, we conducted preliminary experiments on object detection and semantic segmentation tasks:

Object Detection (COCO 2017): Under the Faster R-CNN framework, using ResNet-50 + DMSCA as the backbone network.

Baseline (ResNet-50): mAP = 37.4.

ResNet-50 + DMSCA: mAP = 38.9 (+ 1.5).

Semantic Segmentation (Cityscapes): Tested under the DeepLabV3 + framework.

Baseline (ResNet-50): mIoU = 78.2.

ResNet-50 + DMSCA: mIoU = 79.6 (+ 1.4).

These results indicate that DMSCA generalizes well across tasks and is not limited to image classification.

In addition to its outstanding performance on the three major datasets, CIFAR-10, CIFAR-100, and ImageNet, we also evaluated the generalization ability of DMSCA on some small-scale, specific domain image datasets, such as Oxford-IIIT Pet30 and Food-10134. On these datasets, DMSCA also demonstrated better performance improvements compared to the baseline and other attention mechanisms, further proving its good generalization ability and its ability to adapt to different data distributions and task characteristics.

Based on the analysis of robustness and generalization ability, DMSCA not only performs exceptionally well under standard conditions, but also shows strong adaptability and stability when facing challenging scenarios and diverse data, which is crucial for practical applications.

Conclusion and future work

This paper introduced DMSCA, a novel attention mechanism that enhances feature representation in Convolutional Neural Networks (CNNs) through a multi-component, collaborative design. Our core contribution lies in the dynamic, data-dependent fusion of channels and spatial attention, which overcomes the limitations of static, predefined structures found in prior works like CBAM. Key innovations, including the Temperature-controlled Channel Attention (TCA) and the Direction-aware Multi-scale Spatial Context Encoder (MSCE), enable the model to adaptively modulate features based on input characteristics, leading to significant and consistent performance gains across multiple benchmarks, including ImageNet.

Comprehensive experiments demonstrated DMSCA’s superiority over existing attention mechanisms in terms of classification accuracy, robustness against adversarial attacks and corruptions, and generalization to fine-grained datasets. While DMSCA introduces a marginal computational overhead, the substantial performance benefits justify this trade-off, establishing a compelling solution for CNN-based feature refinement.

Despite these promising results, we acknowledge certain limitations inherent to our current design scope. The DMSCA mechanism is explicitly engineered to leverage the inductive biases of CNNs, specifically utilizing convolutional operations to capture local multi-scale contexts and spatial hierarchies efficiently. Consequently, its direct applicability to Transformer-based backbones (e.g., ViT, Swin Transformer), which process images as sequences of tokens without inherent spatial locality, remains unexplored. Future work will prioritize adapting the core principles of dynamic channel-spatial coupling to token-based architectures. Drawing inspiration from recent hybrid approaches like Modumer21, we aim to investigate how DMSCA can be integrated into self-attention frameworks. This evolution seeks to create a hybrid attention model that synergistically combines the efficiency of local feature refinement with the long-range dependency modeling capabilities of Transformers42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67.

Furthermore, the potential of DMSCA in other complex visual tasks, such as video analysis or 3D point cloud processing, remains an exciting avenue for future research. We believe that the design philosophy of DMSCA—emphasizing dynamic interaction and multi-scale context—offers a valuable direction for developing next-generation attention mechanisms.