Introduction

Remote sensing scene classification has attracted much attention in the field of remote sensing, with the goal of assigning specific semantic labels to remote sensing images. However, this task faces challenges primarily due to the lack of adequately annotated training data and the presence of significant intra-class variations and inter-class similarities1. With the increasing maturity of remote sensing technology, it has become possible to obtain a large amount of high-resolution remote sensing image information from the air. This information can be used for land use and land cover2, urban design3 and vegetation surveying and mapping4, which shows that remote sensing scene classification is widely used in the analysis of earth surface characteristics. Over the past few decades, a range of algorithms for remote sensing scene classification have been developed. Early methods relied heavily on handcrafted features and classical classification techniques, such as SVM5, Random Forest6, and boosting7. In general, these methods can be categorized into two classes: low-level feature based methods and mid-level feature based methods. Low-level feature methods include the Scale-Invariant Feature Transform (SIFT)8, the Gray-Level Co-occurrence Matrix (GLCM)9, Histograms of Oriented Gradients10, and Local Binary Patterns11. These methods perform well on simple objects but struggle to deal with complex and challenging scenes.

To address these challenges, several mid-level feature methods have emerged to enhance classification performance. Unlike low-level feature methods, mid-level feature methods encode local descriptors to represent global features. Prominent mid-level methods include the Bag of Visual Words12, Spatially Constrained Linear Coding13, Spatial Pyramid Matching14, and Improved Fisher Kernel15. Although mid-level feature methods have improved scene classification in remote sensing images, they still lack sufficient representational capacity to handle increasingly complex remote sensing images, heavily rely on image processing expertise, and exhibit relatively weaker performance on demanding scenes. In recent years, the rapid advancements in deep learning techniques, particularly Convolutional Neural Networks (CNNs), have led to their extensive utilization in various computer vision tasks. Compared to manually crafted features, deep features offer a wealth of semantic information, enabling a better capture of the true essence of images. Convolutional neural networks, due to their exceptional feature learning capabilities, have demonstrated remarkable performance across a range of computer vision applications, including remote sensing scene classification16,17.

Although CNNs have made progress in feature extraction, local features are easily ignored when extracting global features of remote sensing scenes18,19,20, resulting in poor performance. In contrast to typical natural images, where objects usually dominate the image space, remote sensing scene images frequently contain multiple objects dispersed across smaller and more scattered backgrounds, as depicted in Fig. 1. Furthermore, these images exhibit high intra-class variability and low inter-class variability. As illustrated in Fig. 1, the image on the left is a “commercial” scene, and the image on the right depicts a “dense residential” scene. Both scenes contain elements such as houses, streets, trees, cars, and other objects, with the main difference being the arrangement and distribution of these objects. Therefore, achieving accurate remote sensing scene classification requires extracting not only global information but also local information.

Figure 1

Remote sensing scene images contain a variety of objects.

In the field of attention-based CNNs, several representative models have been introduced to enhance feature learning and representation. Hu et al.21 proposed SENet, which employs a channel-level attention mechanism to adaptively adjust the weights of different channels in feature maps, enhancing the focus on essential features. Li et al.22 proposed SKNet, which incorporates deformable convolution and channel-level attention mechanisms to adjust the weights of feature maps at different spatial scales and channels, further enhancing feature focus. Wang et al.23 proposed ECANet, which introduces a channel-level attention mechanism combined with efficient calculation strategies to improve model performance. The core innovation of ECANet lies in the “Efficient Channel Attention” (ECA) mechanism, which uses a simple 1D convolution layer to adaptively adjust feature map weights in each channel, ensuring computational efficiency without requiring dimensionality reduction or deformable convolution layers. Li et al.24 proposed a semantic segmentation model, called PANet, which is based on the Feature Pyramid Network (FPN) and attention mechanism. The core idea of PANet is to improve the accuracy and robustness of instance segmentation tasks by introducing multi-scale information fusion and adaptive attention mechanisms. Woo et al.25 introduced CBAM, which incorporates both channel-level and spatial-level attention mechanisms. These mechanisms adaptively adjust the weights of feature maps at different channels and spatial positions, enhancing the model’s focus on crucial features. Zhang et al.26 proposed a convolutional neural network model, called ResNeSt, which is based on the ResNet architecture and introduces a “grouped attention mechanism” that enhances the network’s feature learning and representation ability, further improving its performance. However, these models cannot extract the interactive information between different layers well, which affects classification performance.

Within the context of remote sensing scene classification, features at different levels possess distinct information representation capabilities. To achieve a more comprehensive feature representation, it becomes imperative to harness the complementary information from multi-level convolutional features. However, this approach can also introduce redundant or mutually exclusive information, as noted in prior research27,28. To overcome these disadvantages, in this article we propose a method called the multi-scale dense residual correlation network (MDRCN) for remote sensing scene classification. The main contributions of this paper can be summarized in the following aspects:

  1. The multi-scale feature extraction (MSFE) module is proposed, which effectively extracts features at different levels independently.

  2. The dense residual connection feature fusion (DRCFF) method is introduced, leveraging residual connections to transfer features across different levels. This facilitates the retention of valuable information from the original features, while dense connections merge all the features and encourage their interaction.

  3. The correlation-attention module (CAM) is proposed, a bottom-up feature extraction technique that uses CNNs and spatial pyramid pooling to extract region-level feature representations at different scales. By computing similarities between regions and applying an attention mechanism, it effectively weights and aggregates these features.

  4. An end-to-end network called MDRCN is proposed to classify the scenes in remote sensing images. Experimental results show that the proposed MDRCN successfully incorporates correlation attention into the CNN, which effectively improves remote sensing scene classification performance.

The remainder of this paper is structured as follows. Section 2 presents the current status of the research. Section 3 describes the details of the proposed method. Section 4 analyzes the experimental results. Section 5 is the summary of this article and prospects for future research.

Related work

CNN models

Convolutional Neural Networks (CNNs) typically consist of convolutional, pooling and fully connected layers. The convolutional layer is used to extract complex and hierarchical features from the input data; the pooling layer helps to reduce the spatial dimension of the feature map while retaining the most important information; the fully connected layer flattens the feature map into a one-dimensional vector and connects all the neurons to the output.

CNN is a deep learning architecture specialized for the processing of images and image-related data. It derives its design principles from the biological visual system and excels at extracting intricate features. The operational paradigm of CNN involves the training of the network via forward and backward propagation, facilitating the continual adjustment of weight and bias parameters to enhance network performance. The way CNN operates enables it to autonomously learn and identify hierarchical features in images, making it highly proficient in a wide range of computer vision applications. Eminent CNN architectures such as AlexNet29, ResNet30 and DenseNet31 symbolize a substantial shift from manual feature engineering towards the utilization of deep CNNs. This transition has solidified CNNs as the predominant choice for the backbone network in remote sensing image scene classification, primarily due to their exceptional feature representation capabilities. Consequently, CNNs have been extensively employed across a spectrum of computer vision tasks, including remote sensing scene classification.

Attention mechanism

Attention mechanisms are algorithms specifically designed to efficiently and precisely highlight the target object while simultaneously suppressing irrelevant information. They closely emulate the way in which humans perceive and interpret images. In recent years, numerous attention mechanisms have been developed and applied across various domains, including but not limited to natural language processing, object detection, and image recognition. Importantly, these mechanisms have also brought fresh perspectives and innovative approaches to the field of remote sensing image analysis. The Attention Mechanism is a widely adopted technique in deep learning models to enhance their performance and accuracy.

In the domain of attention-based CNNs, several representative models have been introduced to enhance feature learning and representation. Hu et al.21 proposed SENet, which uses channel-level attention to adaptively adjust the weights of different feature map channels and enhance the focus on key features. Li et al.22 proposed SKNet, which combines deformable convolution and channel-level attention to refine feature map weights at various spatial scales and channels. Li et al.24 proposed PANet, a semantic segmentation model based on the feature pyramid network (FPN) and attention mechanism, which can improve the accuracy and robustness of instance segmentation tasks. Woo et al.25 proposed CBAM, which combines channel-level and spatial-level attention to adaptively adjust feature map weights and enhance the focus on key features. These mechanisms enhance the network’s feature learning and representation ability, further improving its performance.

Proposed method

The overall architecture of our proposed MDRCN is composed of four modules, as shown in Fig. 2.

Figure 2

Overall architecture of the MDRCN.

First, the algorithm extracts multiple convolutional features at different semantic levels based on the pre-trained ResNet50 model, including \(\{ C_2, C_3, C_4, C_5 \}\). Secondly, the MSFE module connects features at different levels to generate multi-scale semantic features \(\{ D_2, D_3, D_4, D_5\}\), and constructs a feature pyramid in a top-down manner. This enables comprehensive feature interaction and an overall improvement in accuracy. Then, the DRCFF module is utilized to transfer its own features to the next layer while retaining the original information, thereby acquiring global information and further enhancing accuracy. Finally, a relevant attention network is introduced to boost feature expression ability and further improve classification performance.
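To make the first step concrete, the sketch below shows one plausible way to obtain the multi-level feature maps \(\{ C_2, C_3, C_4, C_5 \}\) from a pre-trained ResNet50 with torchvision; the wrapper class and the choice of ImageNet weights are illustrative assumptions, since the text does not specify how the backbone is wrapped.

```python
import torch
import torchvision.models as models

# Sketch: extracting the multi-level feature maps {C2, C3, C4, C5} from a
# pre-trained ResNet50 backbone. C2-C5 correspond to the outputs of the four
# residual stages (layer1-layer4 in torchvision).
class ResNet50Backbone(torch.nn.Module):
    def __init__(self):
        super().__init__()
        resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
        self.stem = torch.nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool)
        self.layer1, self.layer2 = resnet.layer1, resnet.layer2
        self.layer3, self.layer4 = resnet.layer3, resnet.layer4

    def forward(self, x):
        x = self.stem(x)
        c2 = self.layer1(x)   # 256 channels, 56 x 56 for a 224 x 224 input
        c3 = self.layer2(c2)  # 512 channels, 28 x 28
        c4 = self.layer3(c3)  # 1024 channels, 14 x 14
        c5 = self.layer4(c4)  # 2048 channels, 7 x 7
        return c2, c3, c4, c5

c2, c3, c4, c5 = ResNet50Backbone()(torch.randn(1, 3, 224, 224))
```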

Multi-scale feature extraction

The framework adopts a hybrid approach that combines convolutional and non-convolutional techniques. As confirmed by previous studies32, dilated convolutions have become an effective tool for tasks involving dense predictions, and their strategic use helps capture contextual information effectively. Furthermore, we introduce skip connections in our architecture. These not only facilitate the flow of information between different layers, but also ensure that key features are not lost as the network deepens. The last layer of the architecture has two global average pooling (GAP) layers for generating image-level representation features. A comprehensive breakdown of the multi-scale feature extraction (MSFE) module is shown in Fig. 3. All in all, this architectural design is carefully crafted to handle contextual information across various levels while maintaining computational efficiency.

Figure 3

Illustration of the MSFE. It consists of \(S_2\), \(S_3\), \(S_4\), \(S_5\), where \(S_2\) or \(S_3\) means that these two branches are handled in the same way, and \(S_4\) or \(S_5\) means that these two branches are handled in the same way.

The higher-level feature processing segment is equipped with two residual blocks and culminates in a GAP layer. For a more detailed breakdown of the remaining blocks within this higher-level feature processing, please refer to Table 1. In this segment, a sequence of \(1 \times 1\), \(3 \times 3\), and another \(3 \times 3\) convolutional layers is arranged to efficiently acquire deep-level features. It is important to note that each layer is immediately followed by a batch normalization layer and a ReLU layer to facilitate non-linear transformations. The output from the internal path of each residual block is then merged through element-wise addition. In essence, the output of the higher-level feature processing incorporates a fusion of various convolutional layers and residual blocks, integrated after batch normalization and ReLU operations to harness efficient feature learning capabilities. Ultimately, the final output is processed by the GAP layer to yield the scale-level feature representation.

Table 1 Details of \(S_5\) and \(S_4\).

The output of a residual block can be expressed in terms of

$$\begin{aligned} Y= \Psi (F_5 (X,\omega _5)\oplus F(X,{\omega _i})) \end{aligned}$$
(1)

where

$$\begin{aligned} F(X, {\omega _i})= F_4 (F_3 (F_2 (X,\omega _2 ),\omega _3 ),\omega _4) \end{aligned}$$
(2)

where X and Y denote the input and output of the residual block, respectively. \(F_2\), \(F_3\), \(F_4\) and \(F_5\) denote the \(1 \times 1\), \(3 \times 3\), \(3 \times 3\) and \(1 \times 1\) convolution layers in MSFE, \(\omega _2\), \(\omega _3\), \(\omega _4\) and \(\omega _5\) are the corresponding parameters, and \(\Psi (\cdot )\) refers to the rectified linear unit (ReLU) activation function, as shown in (3).

$$\begin{aligned} \Psi (x)=max(x,0) \end{aligned}$$
(3)

We can now write the outputs of the two residual blocks of the first scale as

$$\begin{aligned} Y_1^1= & \Psi (F_5^1 (C_5,\omega _5^1)\oplus F_4^1 (F_3^1 (F_2^1 (C_5,\omega _2^1 ),\omega _3^1 ),\omega _4^1)) \end{aligned}$$
(4)
$$\begin{aligned} Y_1^2= & \Psi (F_5^2 (Y_1^1,\omega _5^2)\oplus F_4^2 (F_3^2 (F_2^2 (Y_1^1,\omega _2^2 ),\omega _3^2 ),\omega _4^2)) \end{aligned}$$
(5)

where \(Y_1^1\) and \(Y_1^2\) denote the outputs of the first and second residual blocks, respectively. Note that, in (4), the higher-level feature map \(C_5\) is taken as input, and in (5), the output of the first residual block \(Y_1^1\) is taken as input.

By introducing GAP after the two residual blocks, we enhance the correspondence between categories and feature maps, generating the deep feature

$$\begin{aligned} Scale_1=\delta (Y_1^2 )=\frac{1}{H \times W}\sum _{i=1}^{H} \sum _{j=1}^{W}Y_{(i,j)}^{l_1} \end{aligned}$$
(6)

where \(\delta\) denotes the GAP operation and \(Y^{l_1}\in \mathbb {R}^{H\times W}\) is a feature map with height H and width W for the \(l_1\)-th channel of the input \(Y_1^2\).
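As a concrete illustration of Eqs. (1)–(6), the following PyTorch sketch implements one residual block with the \(1\times 1\), \(3\times 3\), \(3\times 3\) inner path, a \(1\times 1\) shortcut (\(F_5\)), element-wise addition, a final ReLU, and GAP producing the scale feature. The channel widths are illustrative assumptions rather than values taken from Table 1.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of one MSFE residual block following Eqs. (1)-(3): an inner path of
# 1x1 -> 3x3 -> 3x3 convolutions (F2, F3, F4), each followed by BN and ReLU,
# a 1x1 shortcut (F5), element-wise addition and a final ReLU. Channel widths
# are illustrative assumptions; lower scales would pass a dilation rate > 1.
class MSFEResidualBlock(nn.Module):
    def __init__(self, in_ch, mid_ch, out_ch, dilation=1):
        super().__init__()
        def conv_bn_relu(ci, co, k, d):
            p = d * (k - 1) // 2  # padding that preserves the spatial size
            return nn.Sequential(nn.Conv2d(ci, co, k, padding=p, dilation=d),
                                 nn.BatchNorm2d(co), nn.ReLU(inplace=True))
        self.f2 = conv_bn_relu(in_ch, mid_ch, 1, 1)
        self.f3 = conv_bn_relu(mid_ch, mid_ch, 3, dilation)
        self.f4 = conv_bn_relu(mid_ch, out_ch, 3, dilation)
        self.f5 = nn.Sequential(nn.Conv2d(in_ch, out_ch, 1), nn.BatchNorm2d(out_ch))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.f5(x) + self.f4(self.f3(self.f2(x))))  # Eq. (1)

# Two stacked blocks followed by GAP produce the scale feature, as in Eqs. (4)-(6).
c5 = torch.randn(1, 2048, 7, 7)
block1, block2 = MSFEResidualBlock(2048, 512, 512), MSFEResidualBlock(512, 512, 512)
scale1 = F.adaptive_avg_pool2d(block2(block1(c5)), 1).flatten(1)  # shape (1, 512)
```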

The output characteristics of the second scale can be described as

$$\begin{aligned} Scale_2=\delta (Y_2^2 )=\frac{1}{H \times W}\sum _{i=1}^{H} \sum _{j=1}^{W}Y_{(i,j)}^{l_2} \end{aligned}$$
(7)

where \(\delta\) denotes the GAP operation and \(Y^{l_2}\in \mathbb {R}^{H\times W}\) is a feature map with height H and width W for the \(l_2\)-th channel of the input \(Y_2^2\).

Table 2 Details of \(S_3\) and \(S_2\).

To better fit the intrinsic characteristics of the input data, we replace the traditional convolutions within each residual block of the lower layers with dilated convolutions. Using dilated convolutions to expand the receptive field enhances the model’s ability to perceive contextual information within images, which helps it classify complex remote sensing scenes. Details related to this configuration are listed in Table 2.

Dilated convolution, also referred to as atrous convolution, is a technique introduced in the work by Yu et al.33. It enhances the receptive field of a convolutional neural network by strategically inserting gaps or zeros between the elements of the convolution kernel. This expansion of the receptive field is controlled by a hyperparameter known as the dilation rate (r), which determines the extent of the gap or spacing between the kernel elements. More precisely, the dilation rate specifies the number of zeros to be inserted between kernel elements, and as a result, the effective size of the kernel is increased from k to \(k + (k-1) \times (r-1)\). Importantly, dilated convolution achieves this expanded receptive field without introducing additional learnable parameters. When employing dilated convolution in a network with multiple layers containing nested convolution operations, the receptive field expands in an efficient and controlled manner. This enables the network to gather a broader range of local information, contributing to improved feature extraction and contextual understanding without the burden of a substantially increased parameter count.
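The effective kernel size formula \(k + (k-1)\times (r-1)\) can be checked with the short snippet below; the dilation argument of PyTorch's Conv2d is the standard mechanism, and the channel width and input size are arbitrary.

```python
import torch
import torch.nn as nn

# Effective kernel size of a dilated convolution: k + (k - 1) * (r - 1).
# A 3x3 kernel covers 3x3 at r=1, 5x5 at r=2 and 7x7 at r=3, while the
# number of learnable weights stays at 3 x 3.
def effective_kernel_size(k, r):
    return k + (k - 1) * (r - 1)

for r in (1, 2, 3):
    print(f"k=3, r={r} -> effective size {effective_kernel_size(3, r)}")

# In PyTorch, dilation is a constructor argument of Conv2d; padding = r keeps
# the spatial resolution unchanged for a 3x3 kernel with stride 1.
x = torch.randn(1, 64, 56, 56)
y = nn.Conv2d(64, 64, kernel_size=3, dilation=2, padding=2)(x)
print(y.shape)  # torch.Size([1, 64, 56, 56])
```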

The output of a residual block in the lower scale, based on dilated convolution, can be calculated using the following equation:

$$\begin{aligned} U_3^1= & \Phi (G_5^1 (C_3,\mu _5^1)\oplus G_4^1 (G_3^1 (G_2^1 (C_3,\mu _2^1 ),\mu _3^1 ),\mu _4^1)) \end{aligned}$$
(8)
$$\begin{aligned} U_3^2= & \Phi (G_5^2 (U_3^1,\mu _5^2)\oplus G_4^2 (G_3^2 (G_2^2 (U_3^1,\mu _2^2 ),\mu _3^2 ),\mu _4^2)) \end{aligned}$$
(9)

where \(U_3^1\) and \(U_3^2\) denote the outputs of the first and second residual blocks in the third scale, respectively. Note that, in (8), the lower-level feature map \(C_3\) is taken as input, and in (9), the output of the first residual block \(U_3^1\) is taken as input.

Then, GAP is applied to \(U_3^2\) to obtain the deep features of the third scale. Therefore, the output characteristics of the third scale can be described as

$$\begin{aligned} Scale_3=\delta (U_3^2 )=\frac{1}{H \times W}\sum _{i=1}^{H} \sum _{j=1}^{W}U_{(i,j)}^{l_3} \end{aligned}$$
(10)

where \(U^{l_3}\in \mathbb {R}^{H\times W}\) is a feature map with height H and width W for the \(l_3\)-th channel of the input \(U_3^2\).

A similar derivation applies to the fourth scale.

We can now write the outputs of the two residual blocks of the fourth scale as

$$\begin{aligned} U_4^1= & \Phi (G_5^1 (C_2,\mu _5^1)\oplus G_4^1 (G_3^1 (G_2^1 (C_2,\mu _2^1 ),\mu _3^1 ),\mu _4^1)) \end{aligned}$$
(11)
$$\begin{aligned} U_4^2= & \Phi (G_5^2 (U_4^1,\mu _5^2)\oplus G_4^2 (G_3^2 (G_2^2 (U_4^1,\mu _2^2 ),\mu _3^2 ),\mu _4^2)) \end{aligned}$$
(12)

The output characteristics of the fourth scale can be described as

$$\begin{aligned} Scale_4=\delta (U_4^2 )=\frac{1}{H \times W}\sum _{i=1}^{H} \sum _{j=1}^{W}U_{(i,j)}^{l_4} \end{aligned}$$
(13)

where \(U^{l_4}\in \mathbb {R}^{H\times W}\) is a feature map with height H and width W for the \(l_4\)-th channel of the input \(U_4^2\).

Dense residual connection feature fusion

The motivation for designing this module is that low-resolution feature maps contain a larger receptive field and can provide global context information, while high-resolution feature maps contain more details and local information. Fusion of these features helps the model to comprehensively consider global and local information when understanding the scene, improving the accuracy of recognition and segmentation. Features of different scales are complementary in capturing the diversity and changes of objects. Fusion of these features can make the model more stable when facing different scenes or different types of inputs. Multi-scale feature fusion can integrate information from different scales to obtain richer context information.

In the FPN19 architecture, high-level neurons respond to the entire image, while low-level neurons tend to activate in response to local patterns. Therefore, in the context of remote sensing scene analysis, in addition to the global image-level representations obtained from high-level convolutional layers, local object-level features derived from low-level ones are also very valuable. Inspired by this, the DRCFF module extends the traditional FPN network. It introduces multi-scale feature fusion and cross-layer connections, which maximize the utilization of deep features at different levels, as shown in Fig. 4. The DRCFF module effectively merges the global image-level representation with local object-level features.

Figure 4

Illustration of the DRCFF.

The details of the DRCFF, built on the FPN, are introduced as follows. Let \(S_i\in \mathbb {R}^{H_i\times W_i\times C_i}\ (i=2,3,4,5)\) be the extracted deep features. We feed \(S_5\) into a convolution layer to reduce the number of output channels to C. Suppose the convolution layer has C convolution kernels \(\omega _5^{k}\in \mathbb {R}^{1\times 1\times C_5},k=1,2,\ldots ,C\). \(S_5\) is convolved with each convolutional kernel \(\omega _5^{k}\) to generate \(L_5^k\in \mathbb {R}^{H_5\times W_5}\). Then, the \(L_5^{k}\) are stacked along the channel dimension to generate \(L_5\in \mathbb {R}^{H_5\times W_5\times C}\)

$$\begin{aligned} L_5^{k}= & \Psi (\omega _5^{k}*S_5) \end{aligned}$$
(14)
$$\begin{aligned} L_5= & [L_5^{1},L_5^{2},...,L_5^{C}] \end{aligned}$$
(15)

where \(\Psi (\cdot )\) refers to the rectified linear unit activation function, \(*\) represents the convolution, and \([\cdot ]\) denotes stacking along the channel dimension. For convenience, we simplify the formula of the convolution layer by

$$\begin{aligned} D_5=L_5=\Psi (G_{S_5}(S_5,\omega _5)) \end{aligned}$$
(16)

where \(G_{S_5}\) denotes these two-dimensional (2-D) convolutions and \(\omega _5\) represents the weight parameter of the convolution layer.

Similarly, for \(S_4\), \(S_3\), and \(S_2\), we can get

$$\begin{aligned} L_i=\Psi (G_{S_i } (S_i,\omega _i )),i=2,3,4 \end{aligned}$$
(17)

where \(\omega _i\) represent the weight parameters of the corresponding convolutional layers.

In the top-down pathway, the top features are hierarchically integrated into the bottom ones layer by layer using two-fold, four-fold and eight-fold upsampling operations and element-wise addition. This process is formulated as

$$\begin{aligned} D_4= & \Psi (L_4\oplus G_{2\times up} (D_5 )) \end{aligned}$$
(18)
$$\begin{aligned} D_3= & \Psi (L_3\oplus G_{2\times up}(D_4 ))\oplus \Psi (L_4\oplus G_{4\times up}(D_5 )) \end{aligned}$$
(19)
$$\begin{aligned} D_2= & \Psi (L_2\oplus G_{2\times up}(D_3 ))\oplus \Psi (L_3\oplus G_{4\times up}(D_4 ))\oplus \Psi (L_4\oplus G_{8\times up} (D_5)) \end{aligned}$$
(20)

where \(D_i\in \mathbb {R}^{H_i\times W_i\times C},\ i=2,3,4\), \(\oplus\) represents the element-wise addition operation, and \(G_{2\times up}\), \(G_{4\times up}\) and \(G_{8\times up}\) denote the two-fold, four-fold and eight-fold upsampling operations, respectively.

After generating the maps at each resolution, we perform a \(3\times 3\) convolution on each merged map to eliminate upsampling artifacts. The final outputs of the DRCFF module are denoted \(\{N_2, N_3, N_4, N_5\}\), which correspond to the \(\{S_2, S_3, S_4, S_5\}\) feature maps. The calculation process can be formulated as

$$\begin{aligned} N_5= & D_5 \end{aligned}$$
(21)
$$\begin{aligned} N_i= & \Psi (G_{conv3\times 3}(D_i,\rho _i )),i=2,3,4 \end{aligned}$$
(22)

where \(N_i\in \mathbb {R}^{H_i\times W_i\times C}\), \(G_{conv3\times 3}\) represents a 2-D convolution with a kernel size of \(3 \times 3\), and \(\rho _i\) are the weight parameters of the \(3\times 3\) convolutional layers.
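A minimal sketch of the DRCFF computation in Eqs. (14)–(22) is given below: \(1\times 1\) lateral convolutions, nearest-neighbour upsampling, element-wise addition with dense connections from all higher levels, and a \(3\times 3\) smoothing convolution. To keep spatial sizes compatible, the lateral term in each dense sum is taken at the target resolution; this reading of Eqs. (19)–(20), the upsampling mode and the channel widths are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of the DRCFF pathway (Eqs. (14)-(22)): 1x1 lateral convolutions
# reduce every scale to C channels, each level is densely fused with all
# higher levels (upsampled 2x/4x/8x and added element-wise), and a 3x3
# convolution smooths the merged maps.
class DRCFF(nn.Module):
    def __init__(self, in_channels=(256, 512, 1024, 2048), C=256):
        super().__init__()
        self.lateral = nn.ModuleList([nn.Conv2d(c, C, 1) for c in in_channels])
        self.smooth = nn.ModuleList([nn.Conv2d(C, C, 3, padding=1) for _ in range(3)])

    def forward(self, s2, s3, s4, s5):
        l2, l3, l4, l5 = (F.relu(lat(s)) for lat, s in zip(self.lateral, (s2, s3, s4, s5)))
        up = lambda x, f: F.interpolate(x, scale_factor=f, mode="nearest")
        d5 = l5                                                              # Eq. (16)
        d4 = F.relu(l4 + up(d5, 2))                                          # Eq. (18)
        d3 = F.relu(l3 + up(d4, 2)) + F.relu(l3 + up(d5, 4))                 # Eq. (19)
        d2 = F.relu(l2 + up(d3, 2)) + F.relu(l2 + up(d4, 4)) + F.relu(l2 + up(d5, 8))  # Eq. (20)
        n5 = d5                                                              # Eq. (21)
        n4, n3, n2 = (F.relu(s(d)) for s, d in zip(self.smooth, (d4, d3, d2)))  # Eq. (22)
        return n2, n3, n4, n5

s2, s3, s4, s5 = (torch.randn(1, c, r, r) for c, r in zip((256, 512, 1024, 2048), (56, 28, 14, 7)))
n2, n3, n4, n5 = DRCFF()(s2, s3, s4, s5)  # all outputs have C = 256 channels
```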

Correlation-attention module

The motivation for designing this module is that traditional convolutional neural networks (CNNs) have certain limitations when capturing long-distance dependencies. The attention mechanism can directly weight and aggregate the entire input when processing dependencies of any distance, thereby effectively capturing long-distance relationships. It also allows the model to give higher weights to important parts of the input; this dynamic weight adjustment enables the model to focus more effectively on key information and ignore irrelevant parts, improving task accuracy. The correlation-attention module is introduced mainly to enhance the model’s ability to capture long-distance dependencies, dynamically focus on key information, improve the model’s expressiveness and computational efficiency, and increase its robustness and interpretability.

This module, shown in Fig. 5, is based on the observation that low-level features contain contextual information related to high-level features, while high-level features encapsulate more abstract representations derived from low-level features. Therefore, we introduce the correlation-attention module (CAM). This module captures the complex relationships that exist between these features and then merges them into multi-scale feature sets. In this iterative process, starting from the low-level features \(N_2\), we follow two key steps: computing the correlative attention and generating the correlative-attention feature map \(A_{n+1}\) (where n ranges from 2 to 4). In detail, first, the contextual information between the low-level features \(N_n\) and the high-level features \(N_{n+1}\) is obtained by calculating the correlative attention. Second, this attention information is embedded into the higher-level features. GAP is used to collect global information for each channel within \(N_n\), and this collected information is then used to compute the correlative attention.

Figure 5

Illustration of the CAM.

$$\begin{aligned} Z(N_n)=G_{pool}(N_n )= \frac{1}{H_n \times W_n}\sum _{i=1}^{H_n} \sum _{j=1}^{W_n}N_n{(i,j)} \end{aligned}$$
(23)

where \(Z(N_n)\in \mathbb {R}^C\) represents the pooled features, \(G_{pool}\) refers to the global average pooling operation, and \(N_n\in \mathbb {R}^{H_n\times W_n\times C}\ (n=2,3,4)\) denotes the multiscale convolutional feature map, where \(H_n\), \(W_n\) and C denote the height, width and channel dimension of \(N_n\).

The correlation between adjacent channels is encoded by applying a one-dimensional convolution to \(Z(N_n)\) along the channel dimension, and the result is passed to the higher-level feature \(N_{n+1}\) as the encoded scale context dependency. The learned correlative attention is expressed as

$$\begin{aligned} \gamma (N_n)=\sigma (G_{conv}^{1}(Z(N_n))) \end{aligned}$$
(24)

where \(\gamma (N_n)\in \mathbb {R}^C\) represents the set of attention weights and \(\sigma\) is the sigmoid function. The one-dimensional convolution of \(Z(N_n)\) is formulated as

$$\begin{aligned} V(N_n)=G_{conv}^{1} (Z(N_n))=W_k\cdot Z(N_n) \end{aligned}$$
(25)

where \(V(N_n )\in \mathbb {R}^C, G_{conv}^{1}\) represents a one-dimensional convolution, and \(W_k\in \mathbb {R}^{C\times C}\) are the parameters of the filters in the one-dimensional convolution.

After the attention weights \(\gamma (N_n)\) are calculated, we perform the element-wise multiplication on \(\gamma (N_n)\) and \(N_{n+1}\) to compute the feature maps with correlative-attention.

$$\begin{aligned} A_{n+1}= {\left\{ \begin{array}{ll} N_{n+1}& \text{ if } n=1\\ \gamma (N_n)\otimes N_{n+1}& \text{ if } n=2,3,4 \end{array}\right. } \end{aligned}$$
(26)

where \(A_n\in \mathbb {R}^{H_n\times W_n\times C}\ (n=2,3,4,5)\) and \(\otimes\) represents the element-wise multiplication operation. According to (26), the correlative-attention features corresponding to \(N_2, N_3, N_4,\) and \(N_5\) can be derived.

Ultimately, the pooled correlative-attention feature maps are fused using concatenation (concat for short) to produce the final multi-level fused correlative-attention features F

$$\begin{aligned} F=Concat[G_{pool} (A_2),G_{pool}(A_3),G_{pool}(A_4),G_{pool}(A_5)] \end{aligned}$$
(27)

where \(F\in \mathbb {R}^{1\times 1\times 4C}\).
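The following sketch summarises the CAM computation in Eqs. (23)–(27): GAP over each level, a one-dimensional convolution across the channel dimension, a sigmoid producing the attention weights, element-wise multiplication with the next level, and concatenation of the pooled attended features. The 1-D kernel size (here 3) is an assumption not stated in the text.

```python
import torch
import torch.nn as nn

# Sketch of the correlation-attention module (Eqs. (23)-(27)): each level is
# globally average-pooled, a 1-D convolution across the channel dimension
# encodes inter-channel correlation, a sigmoid gives the attention weights,
# and these weights re-scale the next higher level. The pooled attended maps
# are concatenated into the fused feature F.
class CorrelationAttention(nn.Module):
    def __init__(self, k=3):
        super().__init__()
        self.conv1d = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.gap = nn.AdaptiveAvgPool2d(1)

    def attention(self, n):                                    # Eqs. (23)-(25)
        z = self.gap(n)                                        # B x C x 1 x 1
        v = self.conv1d(z.squeeze(-1).transpose(1, 2))         # B x 1 x C
        return torch.sigmoid(v.transpose(1, 2).unsqueeze(-1))  # B x C x 1 x 1

    def forward(self, n2, n3, n4, n5):
        a2 = n2                                                # Eq. (26), first case
        a3 = self.attention(n2) * n3
        a4 = self.attention(n3) * n4
        a5 = self.attention(n4) * n5
        pooled = [self.gap(a).flatten(1) for a in (a2, a3, a4, a5)]
        return torch.cat(pooled, dim=1)                        # Eq. (27): B x 4C

n2, n3, n4, n5 = (torch.randn(1, 256, r, r) for r in (56, 28, 14, 7))
fused = CorrelationAttention()(n2, n3, n4, n5)  # shape (1, 1024)
```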

Dataset description

The proposed MDRCN is evaluated on four publicly available remote sensing scene datasets: UC-Merced, WHU-RS19, AID and NWPU-RESISC45. The detailed information of these four datasets is given in Table 3.

Table 3 Basic information of the four datasets.
Figure 6

Scene examples from the UC Merced Land-Use dataset.

Figure 7

Scene examples from the AID dataset.

Figure 8

Scene examples from the WHU-RS19 dataset.

Figure 9

Scene examples from the NWPU-RESISC45 dataset.

  1. UC Merced Land-Use Dataset (UCM): The UCM dataset is a widely used benchmark for remote sensing scene classification. Figure 6 shows some examples of images from this dataset.

  2. AID: This dataset contains 10,000 aerial scene images. The dataset was collected from Google Earth by Wuhan University and has small differences between classes, high differences within classes and a large scale. Figure 7 shows example images for each class.

  3. WHU-RS19: This dataset contains 1005 images divided into 19 categories and was published by Wuhan University in 2012. Figure 8 shows examples of each class.

  4. NWPU-RESISC45: This dataset contains 31,500 images divided into 45 categories and was collected by Northwestern Polytechnical University in 2016. Figure 9 shows examples of each class.

Experimental details

  1. Train-to-test ratio: To ensure a fair comparison with state-of-the-art algorithms in our experiments, we adopt training-to-test ratios that are consistent with the comparative work on the different datasets.

  2. Model initialisation: In the MDRCN network, the parameters pre-trained on ImageNet are used as the initialisation parameters of the deep convolution layers, and the parameters of the other network layers are initialised at random. All offset parameters are initially set to 0.001.

  3. Training procedure: We use the PyTorch deep learning framework to perform experiments on an NVIDIA RTX 2080S GPU. All images are resized to 224 \(\times\) 224 pixels as input, the batch size is set to 16, and the Adam optimiser is used for parameter optimisation; a minimal configuration sketch is given after this list.

Comparison with state-of-the-art methods

In order to fully verify the progress of our proposed method, we compared it with some state-of-the-art methods, including AlexNet34, TEXNet27, GoogLeNet34, VGG-16-CapsNet35, VGG-VD-16-SAFF36, CSDS37, MSRes-SplitNet17, EFPN-DSE32, TDFE-DAA38, RANet39, EFPN-DSE-TDFF32, EAM40, T-CNN41, EMSCNet(ResNet-50)42, EMSCNet(ViT-B)42, SCCov(Alexnet)43, SCCov(VGG16)43, D-CNN with VGGNet-1644, MLDS45 and Two-Stream Fusion46. To ensure fairness in our comparisons, we repeated the experiments ten times and calculated both the average and standard deviation of the Overall Accuracy (OA).

Experimental results

  1. Table 4 shows the classification results of our proposed MDRCN method on the UCM dataset as well as the results of other state-of-the-art methods. Figure 10 depicts histograms of the performance of these different methods. Figure 11 shows the confusion matrix results under the same training conditions.

Table 4 Experimental results of the UCM dataset.
Figure 10

Comparison of the accuracies on the UCM dataset.

Figure 11

Confusion matrix of the UCM dataset.

At 50% and 80% training rates, except for EMSCNet(ViT-B) and EMSCNet(ResNet-50), our method (MDRCN) outperforms the other deep feature methods. At a 50% training rate, methods such as CSDS, MSRes-SplitNet, TDFE-DAA and RANet also show competitive performance.

In addition to evaluating the overall accuracy, we also perform a detailed analysis of our proposed method using a confusion matrix, showing the best results under different fixed training ratios. When the training ratio is 50%, the classification accuracy of 18 categories reaches 100%. In particular, the classification accuracy of “denseresidential”, “intersection” and “sparseresidential” exceeds 95%. However, there is still a 5% probability that a “denseresidential” image is incorrectly classified as “buildings”, and the probability of misclassifying an “intersection” image as “mediumresidential” is also 5%. At a training ratio of 80%, most scene categories achieve 100% classification accuracy and 20 categories reach at least 95%, with the only exception being the “mediumresidential” category. The confusion matrix shows that some “denseresidential” images are incorrectly classified as “sparseresidential”. The reason for these classification errors may be the high similarity between classes.

Table 5 Experimental results of the AID dataset.
  2. Table 5 presents the classification results obtained by the proposed MDRCN method on the AID dataset, alongside the outcomes of other advanced methods. For a more straightforward comparison, Fig. 12 illustrates histograms depicting the performance of these various methods.

Figure 12

Comparison of the accuracies on the AID dataset.

Figure 13

Confusion matrix of the AID dataset.

When the training rate is set at 50%, MDRCN achieves comparable results (95.66%) among all deep feature methods. Upon reducing the training ratio to 20%, our method demonstrates superior performance, while the overall accuracy of EAM declines to 93.13%. This suggests that MDRCN exhibits stronger discriminative capability than EAM.

In Fig. 13, owing to space limitations, we only display the confusion matrices for training ratios of 20% and 50%. Figure 13a showcases the confusion matrix for a training rate of 20%. It can be observed that most categories achieve satisfactory classification results of over 90%, with only five categories (“Centre”, “Church”, “Resort”, “School”, and “Plaza”) experiencing significant misclassification. Despite the substantial reduction in the number of training images, certain categories prone to misclassification can still be effectively classified, such as “medium-sized residence” (93%) and “sparse residence” (99%), or “bridge” (97%), “port” (96%), and “river” (97%).

As depicted in Fig. 13b, the classification accuracy of 26 out of 30 categories surpasses 95%. Categories with classification accuracy below 95% include “Centre” (91%), “Resort” (86%), “School” (93%), and “Plaza” (94%). In the AID dataset, the most significant confusion arises between “Resort” and “Park”, “School” and “Business”, or “Centre” and “Plaza”. This phenomenon may be attributed to the presence of similar ground objects or geometric structure distributions.

  3. Table 6 shows the classification results achieved by the proposed MDRCN method, as well as the results of other cutting-edge methods, all of which are evaluated on the WHU-RS19 dataset. To provide a more visual comparison, Fig. 14 depicts histograms illustrating the performance of these different methods.

Table 6 Experimental results of the WHU-RS19 dataset.
Figure 14

Comparison of the accuracies on the WHU-RS19 dataset.

Figure 15

Confusion matrix of the WHU-RS19 dataset.

In Fig. 15, the confusion matrices at training ratios of 60% and 40% are shown, respectively. As shown in Fig. 15a, among the 19 categories, except for the “port” category (0.95), the classification accuracy rates all reach 1.00. In the WHU-RS19 dataset, the greatest confusion occurs between “port” and “bridge”. The explanation for these results is that these classes have similar geometric distributions.

Figure 15b shows the confusion matrix when the training rate is 40%. Among the 19 categories, except for the “football field” (0.95) and “port” (0.91) categories, the classification accuracy rates all reach 1.00. As can be seen, 5% of the “football field” images are incorrectly classified as “industrial” and 9% of the “port” images are incorrectly classified as “airport”. This may be attributed to their similar landforms.

Table 7 Experimental results of the NWPU-RESISC45 dataset.
Figure 16

Comparison of the accuracies on the NWPU-RESISC45 dataset.

Figure 17

Confusion matrix of the NWPU-RESISC45 dataset.

  4. Table 7 presents the classification results obtained by the proposed MDRCN method on the NWPU-RESISC45 dataset, alongside the outcomes of other advanced methods. For a more straightforward comparison, Fig. 16 illustrates histograms depicting the performance of these various methods.

Figure 17 shows the confusion matrices when the training rate is 10% and 20%, respectively. As shown in Fig. 17a, among the 45 categories, only the “church” and “palace” categories have a classification accuracy below 80%; all other categories exceed 80%. In addition, 10% of the “church” images are incorrectly classified as “palace”, 6% of the “desert” images are incorrectly classified as “mountains”, and 6% of the “terrace” images are incorrectly classified as “rectangular-farmland”. Figure 17b shows the confusion matrix when the training rate is 20%. Among the 45 categories, only the “palace” category has a classification accuracy below 80%, while all other categories exceed 80%. Here, 9% of the “church” images are incorrectly classified as “palace”, and 7% of the “lake” images are incorrectly classified as “wetland”. It can be seen that in the NWPU-RESISC45 dataset, the most confused categories are “church” and “palace”, because the similarity between these categories is relatively high.

Discussion

In order to effectively evaluate our method, different ablation experiments are performed below using different connection possibilities.

Impact of data augmentation

During training, the input images are augmented by random horizontal mirroring and random rotation, producing a richer set of training images than the original data. To verify the effectiveness of data augmentation, we compare the methods with and without data augmentation. Table 8 shows the results of the comparison. In this table, \(MDRCN^{+}\) represents the method with data augmentation, while \(MDRCN^{-}\) represents the method without data augmentation. Experimental results show that with data augmentation, the overall accuracy increases by more than 0.5%.
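For reference, a minimal torchvision sketch of the augmentation pipeline described above is given below; the rotation range is an assumption, since the text only states that random rotation is used.

```python
from torchvision import transforms

# Sketch of the augmentation pipeline used for MDRCN+: random horizontal
# mirroring and random rotation on the training images. The rotation range
# (+/- 90 degrees) is an assumption.
train_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=90),
    transforms.ToTensor(),
])
```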

Table 8 Ablation analysis on data augmentation.

Effects of different modules

There are three modules in this framework: MSFE, DRCFF and CAM. Each control architecture omits only one module at a time. Figure 18a–c shows the architectures with MSFE, DRCFF and CAM omitted, respectively. Table 9 shows the overall accuracy at 80% and 50% training rates for the different architectures on the UCM dataset. Table 10 shows the overall accuracy of the different architectures on the NWPU-RESISC45 dataset at 10% and 20% training rates.

Effects of MSFE: The results of Scheme 1 are the worst, because the MSFE is omitted in this architecture, while the function of the MSFE is to initially strengthen the semantics of the feature maps at all levels. Compared to Scheme 1, Schemes 2, 3 and 4 include the MSFE. On the UCM dataset, their overall accuracy improves by 0.47%, 0.83%, and 1.35% at an 80% training rate, and by 0.79%, 0.98%, and 1.47% at a 50% training rate. On the NWPU-RESISC45 dataset, their overall accuracy improves by 0.97%, 1.27% and 1.47% at a 10% training rate, and by 0.75%, 1.12%, and 1.96% at a 20% training rate. These results show that the MSFE we proposed is indeed beneficial for remote sensing scene classification.

Figure 18

Concise illustration of various architectures. (a) Without MSFE. (b) Without DRCFF. (c) Without CAM.

Table 9 Overall accuracy (%) of different architectures on UCM dataset.
Table 10 Overall accuracy (%) of different architectures on NWPU-RESISC45 dataset.

Effects of DRCFF: In Scheme 2, the CAM is directly linked to the output of the MSFE without the DRCFF. In comparison, our method, Scheme 4, achieves better performance through the DRCFF, with overall accuracy improvements of 1.35% and 1.47% at 80% and 50% training rates, respectively. On the NWPU-RESISC45 dataset, the overall accuracy increases by 0.50% and 1.21% at 10% and 20% training rates, respectively. The effectiveness and superiority of our proposed DRCFF are strongly confirmed by these results.

Effects of CAM: In Scheme 3, the CAM is removed and replaced by a simple GAP layer for the subsequent scene classification. From Table 9 it can be seen that although Scheme 3 performs better than Schemes 1 and 2, there are still slight decreases in comparison to Scheme 4: at training rates of 80% and 50%, the overall accuracy decreases by 0.52% and 0.49%, respectively. As can be seen from Table 10, Scheme 4 has the best performance; compared with the first three schemes, its overall accuracy is higher at both the 10% and 20% training rates.

Conclusion

In this paper, we build an efficient deep learning framework called MDRCN. Considering the feature diversity at different levels, we design the MSFE to extract multi-scale feature maps. The DRCFF module is introduced to merge all the extracted features and allow each feature to interact with the others. A CAM module is also introduced to extract key supplementary semantic information, thereby further improving classification performance. Our proposed method achieves significant improvements over existing algorithms in terms of effectiveness and accuracy, ultimately achieving state-of-the-art results on three mature remote sensing scene classification benchmarks. In future work, we will pay more attention to designing effective plug-and-play modules, such as the proposed MSFE, and embedding them into different CNN architectures to further improve the network’s remote sensing scene classification capabilities.