Introduction

Remote sensing scene classification has attracted much attention in the field of remote sensing, with the goal of assigning specific semantic labels to remote sensing images. However, this task faces challenges primarily due to the lack of adequately annotated training data and the presence of significant intra-class variations and inter-class similarities1. With the increasing maturity of remote sensing technology, it has become possible to obtain a large amount of high-resolution remote sensing image information from the air. This information can be used for land use and land cover2, urban design3 and vegetation surveying and mapping4, which shows that remote sensing scene classification is widely used in the analysis of earth surface characteristics. Over the past few decades, a range of algorithms for remote sensing scene classification have been developed. Early methods relied heavily on handcrafted features and classical classification techniques, such as SVM5, Random Forest6, and boosting7. In general, these methods can be categorized into two classes: low-level feature based methods and mid-level feature based methods. Low-level feature methods include the Scale-Invariant Feature Transform (SIFT)8, the Gray-Level Co-occurrence Matrix (GLCM)9, Histograms of Oriented Gradients10, and Local Binary Patterns11. These methods perform well on simple objects but struggle to deal with complex and challenging scenes.

To address these challenges, several mid-level feature methods have emerged to enhance classification performance. Unlike low-level feature methods, mid-level feature methods encode local descriptors to represent global features. Prominent mid-level methods include the Bag of Visual Words12, Spatially Constrained Linear Coding13, Spatial Pyramid Matching14, and Improved Fisher Kernel15. Although mid-level feature methods have improved scene classification in remote sensing images, they still lack sufficient representational capacity to handle increasingly complex remote sensing images, heavily rely on image processing expertise, and exhibit relatively weaker performance on demanding scenes. In recent years, the rapid advancements in deep learning techniques, particularly Convolutional Neural Networks (CNNs), have led to their extensive utilization in various computer vision tasks. Compared to manually crafted features, deep features offer a wealth of semantic information, enabling a better capture of the true essence of images. Convolutional neural networks, due to their exceptional feature learning capabilities, have demonstrated remarkable performance across a range of computer vision applications, including remote sensing scene classification16,17.

Although CNNs have made progress in feature extraction, local features are easily ignored when extracting global features of remote sensing scenes18,19,20, resulting in poor performance. In contrast to typical natural images, where objects usually dominate the image space, remote sensing scene images frequently contain multiple objects dispersed across smaller and more scattered backgrounds, as depicted in Fig. 1. Furthermore, these images exhibit high intra-class variability and low inter-class variability. As illustrated in Fig. 1, the image on the left is a “commercial” scene, and the image on the right depicts a “dense residential” scene. Both scenes contain elements such as houses, streets, trees, cars, and other objects, with the main difference being the arrangement and distribution of these objects. Therefore, achieving accurate remote sensing scene classification requires extracting not only global information but also local information.

Figure 1

Remote sensing scene images contain a variety of objects.

In the field of attention-based CNNs, several representative models have been introduced to enhance feature learning and representation. Hu et al.21 proposed SENet, which employs a channel-level attention mechanism to adaptively adjust the weights of different channels in feature maps, enhancing the focus on essential features. Li et al.22 proposed SKNet, which incorporates deformable convolution and channel-level attention mechanisms to adjust the weights of feature maps at different spatial scales and channels, further enhancing feature focus. Wang et al.23 proposed ECANet, which introduces a channel-level attention mechanism combined with efficient calculation strategies to improve model performance. The core innovation of ECANet lies in the “Efficient Channel Attention” (ECA) mechanism, which uses a simple 1D convolution layer to adaptively adjust feature map weights in each channel, ensuring computational efficiency without requiring dimensionality reduction or deformable convolution layers. Li et al.24 proposed a semantic segmentation model, called PANet, which is based on the Feature Pyramid Network (FPN) and attention mechanism. The core idea of PANet is to improve the accuracy and robustness of instance segmentation tasks by introducing multi-scale information fusion and adaptive attention mechanisms. Woo et al.25 introduced CBAM, which incorporates both channel-level and spatial-level attention mechanisms. These mechanisms adaptively adjust the weights of feature maps at different channels and spatial positions, enhancing the model’s focus on crucial features. Zhang et al.26 proposed a convolutional neural network model, called ResNeSt, which is based on the ResNet architecture and introduces a “grouped attention mechanism” that enhances the network’s feature learning and representation ability, further improving its performance. However, these models cannot extract the interactive information between different layers well, which affects classification performance.

Within the context of remote sensing scene classification, features at different levels possess distinct information representation capabilities. To achieve a more comprehensive feature representation, it becomes imperative to harness the complementary information from multi-level convolutional features. However, this approach can also introduce redundant or mutually exclusive information, as noted in prior research27,28. To overcome these disadvantages, in this article we propose a method called the multi-scale dense residual correlation network (MDRCN) for remote sensing scene classification. The main contributions of this paper can be summarized in the following aspects:

  1. The multi-scale feature extraction (MSFE) module is proposed, which effectively extracts features at different levels independently.

  2. The dense residual connection feature fusion (DRCFF) method is introduced, leveraging residual connections to transfer features across different levels. This facilitates the retention of valuable information from the original features, while dense connections merge all the features and encourage their interaction.

  3. The correlation-attention module (CAM) is proposed, a bottom-up feature extraction technique that uses CNNs and spatial pyramid pooling to extract region-level feature representations at different scales. By computing similarities between regions and applying an attention mechanism, it effectively weights and aggregates these features.

  4. An end-to-end network called MDRCN is proposed to classify the scenes in remote sensing images. Experimental results show that the proposed MDRCN successfully incorporates correlation attention into the CNN, which effectively improves remote sensing scene classification performance.

The remainder of this paper is structured as follows. Section 2 presents the current status of the research. Section 3 describes the details of the proposed method. Section 4 analyzes the experimental results. Section 5 is the summary of this article and prospects for future research.

Related work

CNN models

Convolutional Neural Networks (CNNs) typically consist of convolutional, pooling and fully connected layers. The convolutional layer is used to extract complex and hierarchical features from the input data; the pooling layer helps to reduce the spatial dimension of the feature map while retaining the most important information; the fully connected layer flattens the feature map into a one-dimensional vector and connects all the neurons to the output.

CNN is a deep learning architecture specialized for the processing of images and image-related data. It derives its design principles from the biological visual system and excels at extracting intricate features. The operational paradigm of CNN involves the training of the network via forward and backward propagation, facilitating the continual adjustment of weight and bias parameters to enhance network performance. The way CNN operates enables it to autonomously learn and identify hierarchical features in images, making it highly proficient in a wide range of computer vision applications. Eminent CNN architectures such as AlexNet29, ResNet30 and DenseNet31 symbolize a substantial shift from manual feature engineering towards the utilization of deep CNNs. This transition has solidified CNNs as the predominant choice for the backbone network in remote sensing image scene classification, primarily due to their exceptional feature representation capabilities. Consequently, CNNs have been extensively employed across a spectrum of computer vision tasks, including remote sensing scene classification.

Attention mechanism

Attention mechanisms are algorithms specifically designed to efficiently and precisely highlight the target object while simultaneously suppressing irrelevant information. They closely emulate the way in which humans perceive and interpret images. In recent years, numerous attention mechanisms have been developed and applied across various domains, including but not limited to natural language processing, object detection, and image recognition. Importantly, these mechanisms have also brought fresh perspectives and innovative approaches to the field of remote sensing image analysis. The Attention Mechanism is a widely adopted technique in deep learning models to enhance their performance and accuracy.

In the domain of attention-based CNNs, several representative models have been introduced to enhance feature learning and representation. Hu et al.21 proposed SENet, which uses channel-level attention to adaptively adjust the weights of different feature map channels and enhance the focus on key features. Li et al.22 proposed SKNet, which combines deformable convolution and channel-level attention to refine feature map weights at various spatial scales and channels. Li et al.24 proposed PANet, a semantic segmentation model based on the feature pyramid network (FPN) and attention mechanism, which can improve the accuracy and robustness of instance segmentation tasks. Woo et al.25 proposed CBAM, which combines channel-level and spatial-level attention to adaptively adjust feature map weights and enhance the focus on key features. These mechanisms enhance the network’s feature learning and representation ability, further improving its performance.

Proposed method

The overall architecture of our proposed MDRCN is composed of four modules, as shown in Fig. 2.

Figure 2

Overall architecture of the MDRCN.

First, the algorithm extracts multiple convolutional features at different semantic levels based on the pre-trained ResNet50 model, including \(\{ C_2, C_3, C_4, C_5 \}\). Secondly, the MSFE module connects features at different levels to generate multi-scale semantic features \(\{ D_2, D_3, D_4, D_5\}\), and constructs a feature pyramid in a top-down manner. This enables comprehensive feature interaction and an overall improvement in accuracy. Then, the DRCFF module is utilized to transfer its own features to the next layer while retaining the original information, thereby acquiring global information and further enhancing accuracy. Finally, a relevant attention network is introduced to boost feature expression ability and further improve classification performance.
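To make the first step concrete, the sketch below shows one plausible way to obtain the multi-level feature maps \(\{ C_2, C_3, C_4, C_5 \}\) from a pre-trained ResNet50 with torchvision; the wrapper class and the choice of ImageNet weights are illustrative assumptions, since the text does not specify how the backbone is wrapped.

```python
import torch
import torchvision.models as models

# Sketch: extracting the multi-level feature maps {C2, C3, C4, C5} from a
# pre-trained ResNet50 backbone. C2-C5 correspond to the outputs of the four
# residual stages (layer1-layer4 in torchvision).
class ResNet50Backbone(torch.nn.Module):
    def __init__(self):
        super().__init__()
        resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
        self.stem = torch.nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool)
        self.layer1, self.layer2 = resnet.layer1, resnet.layer2
        self.layer3, self.layer4 = resnet.layer3, resnet.layer4

    def forward(self, x):
        x = self.stem(x)
        c2 = self.layer1(x)   # 256 channels, 56 x 56 for a 224 x 224 input
        c3 = self.layer2(c2)  # 512 channels, 28 x 28
        c4 = self.layer3(c3)  # 1024 channels, 14 x 14
        c5 = self.layer4(c4)  # 2048 channels, 7 x 7
        return c2, c3, c4, c5

c2, c3, c4, c5 = ResNet50Backbone()(torch.randn(1, 3, 224, 224))
```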

Multi-scale feature extraction

The framework adopts a hybrid approach that combines convolutional and non-convolutional techniques. As confirmed by previous studies32, dilated convolutions have become an effective tool for tasks involving dense predictions, and their strategic use helps capture contextual information effectively. Furthermore, we introduce skip connections in our architecture. These not only facilitate the flow of information between different layers, but also ensure that key features are not lost as the network deepens. The last layer of the architecture has two global average pooling (GAP) layers for generating image-level representation features. A comprehensive breakdown of the multi-scale feature extraction (MSFE) module is shown in Fig. 3. All in all, this architectural design is carefully crafted to handle contextual information across various levels while maintaining computational efficiency.

Figure 3

Illustration of the MSFE. It consists of \(S_2\), \(S_3\), \(S_4\), \(S_5\), where \(S_2\) or \(S_3\) means that these two branches are handled in the same way, and \(S_4\) or \(S_5\) means that these two branches are handled in the same way.

The higher-level feature processing segment is equipped with two residual blocks and culminates in a GAP layer. For a more detailed breakdown of the remaining blocks within this higher-level feature processing, please refer to Table 1. In this segment, a sequence of \(1 \times 1\), \(3 \times 3\), and another \(3 \times 3\) convolutional layers is arranged to efficiently acquire deep-level features. It is important to note that each layer is immediately followed by a batch normalization layer and a ReLU layer to facilitate non-linear transformations. The output from the internal path of each residual block is then merged through element-wise addition. In essence, the output of the higher-level feature processing incorporates a fusion of various convolutional layers and residual blocks, integrated after batch normalization and ReLU operations to harness efficient feature learning capabilities. Ultimately, the final output is processed by the GAP layer to yield the scale-level feature representation.

Table 1 Details of \(S_5\) and \(S_4\).

The output of a residual block can be expressed in terms of

$$\begin{aligned} Y= \Psi (F_5 (X,\omega _5)\oplus F(X,{\omega _i})) \end{aligned}$$
(1)

where

$$\begin{aligned} F(X, {\omega _i})= F_4 (F_3 (F_2 (X,\omega _2 ),\omega _3 ),\omega _4) \end{aligned}$$
(2)

where X and Y denote the input and output of the residual block, respectively. \(F_2\), \(F_3\), \(F_4\) and \(F_5\) denote the \(1 \times 1\), \(3 \times 3\), \(3 \times 3\) and \(1 \times 1\) convolution layers in MSFE, \(\omega _2\), \(\omega _3\), \(\omega _4\) and \(\omega _5\) are the corresponding parameters, and \(\Psi (\cdot )\) refers to the rectified linear unit (ReLU) activation function, as shown in (3).

$$\begin{aligned} \Psi (x)=max(x,0) \end{aligned}$$
(3)

We can now write the outputs of the two residual blocks of the first scale as

$$\begin{aligned} Y_1^1= & \Psi (F_5^1 (C_5,\omega _5^1)\oplus F_4^1 (F_3^1 (F_2^1 (C_5,\omega _2^1 ),\omega _3^1 ),\omega _4^1)) \end{aligned}$$
(4)
$$\begin{aligned} Y_1^2= & \Psi (F_5^2 (Y_1^1,\omega _5^2)\oplus F_4^2 (F_3^2 (F_2^2 (Y_1^1,\omega _2^2 ),\omega _3^2 ),\omega _4^2)) \end{aligned}$$
(5)

where \(Y_1^1\) and \(Y_1^2\) denote the outputs of the first and second residual blocks, respectively. Note that, in (4), the higher-level feature map \(C_5\) is taken as input, and in (5), the output of the first residual block \(Y_1^1\) is taken as input.

By introducing GAP after the two residual blocks, we enhance the correspondence between categories and feature maps, generating the deep feature

$$\begin{aligned} Scale_1=\delta (Y_1^2 )=\frac{1}{H \times W}\sum _{i=1}^{H} \sum _{j=1}^{W}Y_{(i,j)}^{l_1} \end{aligned}$$
(6)

where \(\delta\) denotes the GAP operation and \(Y^{l_1}\in \mathbb {R}^{H\times W}\) is a feature map with height H and width W for the \(l_1\)-th channel of the input \(Y_1^2\).
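As a concrete illustration of Eqs. (1)–(6), the following PyTorch sketch implements one residual block with the \(1\times 1\), \(3\times 3\), \(3\times 3\) inner path, a \(1\times 1\) shortcut (\(F_5\)), element-wise addition, a final ReLU, and GAP producing the scale feature. The channel widths are illustrative assumptions rather than values taken from Table 1.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of one MSFE residual block following Eqs. (1)-(3): an inner path of
# 1x1 -> 3x3 -> 3x3 convolutions (F2, F3, F4), each followed by BN and ReLU,
# a 1x1 shortcut (F5), element-wise addition and a final ReLU. Channel widths
# are illustrative assumptions; lower scales would pass a dilation rate > 1.
class MSFEResidualBlock(nn.Module):
    def __init__(self, in_ch, mid_ch, out_ch, dilation=1):
        super().__init__()
        def conv_bn_relu(ci, co, k, d):
            p = d * (k - 1) // 2  # padding that preserves the spatial size
            return nn.Sequential(nn.Conv2d(ci, co, k, padding=p, dilation=d),
                                 nn.BatchNorm2d(co), nn.ReLU(inplace=True))
        self.f2 = conv_bn_relu(in_ch, mid_ch, 1, 1)
        self.f3 = conv_bn_relu(mid_ch, mid_ch, 3, dilation)
        self.f4 = conv_bn_relu(mid_ch, out_ch, 3, dilation)
        self.f5 = nn.Sequential(nn.Conv2d(in_ch, out_ch, 1), nn.BatchNorm2d(out_ch))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.f5(x) + self.f4(self.f3(self.f2(x))))  # Eq. (1)

# Two stacked blocks followed by GAP produce the scale feature, as in Eqs. (4)-(6).
c5 = torch.randn(1, 2048, 7, 7)
block1, block2 = MSFEResidualBlock(2048, 512, 512), MSFEResidualBlock(512, 512, 512)
scale1 = F.adaptive_avg_pool2d(block2(block1(c5)), 1).flatten(1)  # shape (1, 512)
```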

The output characteristics of the second scale can be described as

$$\begin{aligned} Scale_2=\delta (Y_2^2 )=\frac{1}{H \times W}\sum _{i=1}^{H} \sum _{j=1}^{W}Y_{(i,j)}^{l_2} \end{aligned}$$
(7)

where \(\delta\) denotes the GAP operation and \(Y^{l_2}\in \mathbb {R}^{H\times W}\) is a feature map with height H and width W for the \(l_2\)-th channel of the input \(Y_2^2\).

Table 2 Details of \(S_3\) and \(S_2\).

To better fit the intrinsic characteristics of the input data, we replace the traditional convolutions within each residual block of the lower layers with dilated convolutions. Using dilated convolutions to expand the receptive field enhances the model’s ability to perceive contextual information within images, which helps it classify complex remote sensing scenes. Details related to this configuration are listed in Table 2.

Dilated convolution, also referred to as atrous convolution, is a technique introduced in the work by Yu et al.33. It enhances the receptive field of a convolutional neural network by strategically inserting gaps or zeros between the elements of the convolution kernel. This expansion of the receptive field is controlled by a hyperparameter known as the dilation rate (r), which determines the extent of the gap or spacing between the kernel elements. More precisely, the dilation rate specifies the number of zeros to be inserted between kernel elements, and as a result, the effective size of the kernel is increased from k to \(k + (k-1) \times (r-1)\). Importantly, dilated convolution achieves this expanded receptive field without introducing additional learnable parameters. When employing dilated convolution in a network with multiple layers containing nested convolution operations, the receptive field expands in an efficient and controlled manner. This enables the network to gather a broader range of local information, contributing to improved feature extraction and contextual understanding without the burden of a substantially increased parameter count.
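The effective kernel size formula \(k + (k-1)\times (r-1)\) can be checked with the short snippet below; the dilation argument of PyTorch's Conv2d is the standard mechanism, and the channel width and input size are arbitrary.

```python
import torch
import torch.nn as nn

# Effective kernel size of a dilated convolution: k + (k - 1) * (r - 1).
# A 3x3 kernel covers 3x3 at r=1, 5x5 at r=2 and 7x7 at r=3, while the
# number of learnable weights stays at 3 x 3.
def effective_kernel_size(k, r):
    return k + (k - 1) * (r - 1)

for r in (1, 2, 3):
    print(f"k=3, r={r} -> effective size {effective_kernel_size(3, r)}")

# In PyTorch, dilation is a constructor argument of Conv2d; padding = r keeps
# the spatial resolution unchanged for a 3x3 kernel with stride 1.
x = torch.randn(1, 64, 56, 56)
y = nn.Conv2d(64, 64, kernel_size=3, dilation=2, padding=2)(x)
print(y.shape)  # torch.Size([1, 64, 56, 56])
```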

The output of a residual block in the lower scale, based on dilated convolution, can be calculated using the following equation:

$$\begin{aligned} U_3^1= & \Phi (G_5^1 (C_3,\mu _5^1)\oplus G_4^1 (G_3^1 (G_2^1 (C_3,\mu _2^1 ),\mu _3^1 ),\mu _4^1)) \end{aligned}$$
(8)
$$\begin{aligned} U_3^2= & \Phi (G_5^2 (U_3^1,\mu _5^2)\oplus G_4^2 (G_3^2 (G_2^2 (U_3^1,\mu _2^2 ),\mu _3^2 ),\mu _4^2)) \end{aligned}$$
(9)

where \(U_3^1\) and \(U_3^2\) denote the outputs of the first and second residual blocks in the third scale, respectively. Note that, in (8), the lower-level feature map \(C_3\) is taken as input, and in (9), the output of the first residual block \(U_3^1\) is taken as input.

Then, GAP is applied to \(U_3^2\) to obtain the deep features of the third scale. Therefore, the output characteristics of the third scale can be described as

$$\begin{aligned} Scale_3=\delta (U_3^2 )=\frac{1}{H \times W}\sum _{i=1}^{H} \sum _{j=1}^{W}U_{(i,j)}^{l_3} \end{aligned}$$
(10)

where \(U^{l_3}\in \mathbb {R}^{H\times W}\) is a feature map with height H and width W for the \(l_3\)-th channel of the input \(U_3^2\).

A similar derivation applies to the fourth scale.

We can now write the outputs of the two residual blocks of the fourth scale as

$$\begin{aligned} U_4^1= & \Phi (G_5^1 (C_2,\mu _5^1)\oplus G_4^1 (G_3^1 (G_2^1 (C_2,\mu _2^1 ),\mu _3^1 ),\mu _4^1)) \end{aligned}$$
(11)
$$\begin{aligned} U_4^2= & \Phi (G_5^2 (U_4^1,\mu _5^2)\oplus G_4^2 (G_3^2 (G_2^2 (U_4^1,\mu _2^2 ),\mu _3^2 ),\mu _4^2)) \end{aligned}$$
(12)

The output characteristics of the fourth scale can be described as

$$\begin{aligned} Scale_4=\delta (U_4^2 )=\frac{1}{H \times W}\sum _{i=1}^{H} \sum _{j=1}^{W}U_{(i,j)}^{l_4} \end{aligned}$$
(13)

where \(U^{l_4}\in \mathbb {R}^{H\times W}\) is a feature map with height H and width W for the \(l_4\)-th channel of the input \(U_4^2\).

Dense residual connection feature fusion

The motivation for designing this module is that low-resolution feature maps contain a larger receptive field and can provide global context information, while high-resolution feature maps contain more details and local information. Fusion of these features helps the model to comprehensively consider global and local information when understanding the scene, improving the accuracy of recognition and segmentation. Features of different scales are complementary in capturing the diversity and changes of objects. Fusion of these features can make the model more stable when facing different scenes or different types of inputs. Multi-scale feature fusion can integrate information from different scales to obtain richer context information.

In the FPN19 architecture, high-level neurons respond to the entire image, while low-level neurons tend to activate in response to local patterns. Therefore, in the context of remote sensing scene analysis, in addition to the global image-level representations obtained from high-level convolutional layers, local object-level features derived from low-level ones are also very valuable. Inspired by this, the DRCFF module extends the traditional FPN network. It introduces multi-scale feature fusion and cross-layer connections, which maximize the utilization of deep features at different levels, as shown in Fig. 4. The DRCFF module effectively merges the global image-level representation with local object-level features.

Figure 4

Illustration of the DRCFF.

The details of the DRCFF, built on the FPN, are introduced as follows. Let \(S_i\in \mathbb {R}^{H_i\times W_i\times C_i}\ (i=2,3,4,5)\) be the extracted deep features. We feed \(S_5\) into a convolution layer to reduce the number of output channels to C. Suppose the convolution layer has C convolution kernels \(\omega _5^{k}\in \mathbb {R}^{1\times 1\times C_5},k=1,2,\ldots ,C\). \(S_5\) is convolved with each convolutional kernel \(\omega _5^{k}\) to generate \(L_5^k\in \mathbb {R}^{H_5\times W_5}\). Then, the \(L_5^{k}\) are stacked along the channel dimension to generate \(L_5\in \mathbb {R}^{H_5\times W_5\times C}\)

$$\begin{aligned} L_5^{k}= & \Psi (\omega _5^{k}*S_5) \end{aligned}$$
(14)
$$\begin{aligned} L_5= & [L_5^{1},L_5^{2},...,L_5^{C}] \end{aligned}$$
(15)

where \(\Psi (\cdot )\) refers to the rectified linear unit activation function, \(*\) represents the convolution, and \([\cdot ]\) denotes stacking along the channel dimension. For convenience, we simplify the formula of the convolution layer by

$$\begin{aligned} D_5=L_5=\Psi (G_{S_5}(S_5,\omega _5)) \end{aligned}$$
(16)

where \(G_{S_5}\) denotes these two-dimensional (2-D) convolutions and \(\omega _5\) represents the weight parameter of the convolution layer.

Similarly, for \(S_4\), \(S_3\), and \(S_2\), we can get

$$\begin{aligned} L_i=\Psi (G_{S_i } (S_i,\omega _i )),i=2,3,4 \end{aligned}$$
(17)

where \(\omega _i\) represent the weight parameters of the corresponding convolutional layers.

In the top-down pathway, the top features are hierarchically integrated into the bottom ones layer by layer using two-fold, four-fold and eight-fold upsampling operations and element-wise addition. This process is formulated as

$$\begin{aligned} D_4= & \Psi (L_4\oplus G_{2\times up} (D_5 )) \end{aligned}$$
(18)
$$\begin{aligned} D_3= & \Psi (L_3\oplus G_{2\times up}(D_4 ))\oplus \Psi (L_4\oplus G_{4\times up}(D_5 )) \end{aligned}$$
(19)
$$\begin{aligned} D_2= & \Psi (L_2\oplus G_{2\times up}(D_3 ))\oplus \Psi (L_3\oplus G_{4\times up}(D_4 ))\oplus \Psi (L_4\oplus G_{8\times up} (D_5)) \end{aligned}$$
(20)

where \(D_i\in \mathbb {R}^{H_i\times W_i\times C},\ i=2,3,4\), \(\oplus\) represents the element-wise addition operation, and \(G_{2\times up}\), \(G_{4\times up}\) and \(G_{8\times up}\) denote the two-fold, four-fold and eight-fold upsampling operations, respectively.

After generating the maps at each resolution, we perform a \(3\times 3\) convolution on each merged map to eliminate upsampling artifacts. The final outputs of the DRCFF module are denoted \(\{N_2, N_3, N_4, N_5\}\), which correspond to the \(\{S_2, S_3, S_4, S_5\}\) feature maps. The calculation process can be formulated as

$$\begin{aligned} N_5= & D_5 \end{aligned}$$
(21)
$$\begin{aligned} N_i= & \Psi (G_{conv3\times 3}(D_i,\rho _i )),i=2,3,4 \end{aligned}$$
(22)

where \(N_i\in \mathbb {R}^{H_i\times W_i\times C}\), \(G_{conv3\times 3}\) represents a 2-D convolution with a kernel size of \(3 \times 3\), and \(\rho _i\) are the weight parameters of the \(3\times 3\) convolutional layers.
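A minimal sketch of the DRCFF computation in Eqs. (14)–(22) is given below: \(1\times 1\) lateral convolutions, nearest-neighbour upsampling, element-wise addition with dense connections from all higher levels, and a \(3\times 3\) smoothing convolution. To keep spatial sizes compatible, the lateral term in each dense sum is taken at the target resolution; this reading of Eqs. (19)–(20), the upsampling mode and the channel widths are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of the DRCFF pathway (Eqs. (14)-(22)): 1x1 lateral convolutions
# reduce every scale to C channels, each level is densely fused with all
# higher levels (upsampled 2x/4x/8x and added element-wise), and a 3x3
# convolution smooths the merged maps.
class DRCFF(nn.Module):
    def __init__(self, in_channels=(256, 512, 1024, 2048), C=256):
        super().__init__()
        self.lateral = nn.ModuleList([nn.Conv2d(c, C, 1) for c in in_channels])
        self.smooth = nn.ModuleList([nn.Conv2d(C, C, 3, padding=1) for _ in range(3)])

    def forward(self, s2, s3, s4, s5):
        l2, l3, l4, l5 = (F.relu(lat(s)) for lat, s in zip(self.lateral, (s2, s3, s4, s5)))
        up = lambda x, f: F.interpolate(x, scale_factor=f, mode="nearest")
        d5 = l5                                                              # Eq. (16)
        d4 = F.relu(l4 + up(d5, 2))                                          # Eq. (18)
        d3 = F.relu(l3 + up(d4, 2)) + F.relu(l3 + up(d5, 4))                 # Eq. (19)
        d2 = F.relu(l2 + up(d3, 2)) + F.relu(l2 + up(d4, 4)) + F.relu(l2 + up(d5, 8))  # Eq. (20)
        n5 = d5                                                              # Eq. (21)
        n4, n3, n2 = (F.relu(s(d)) for s, d in zip(self.smooth, (d4, d3, d2)))  # Eq. (22)
        return n2, n3, n4, n5

s2, s3, s4, s5 = (torch.randn(1, c, r, r) for c, r in zip((256, 512, 1024, 2048), (56, 28, 14, 7)))
n2, n3, n4, n5 = DRCFF()(s2, s3, s4, s5)  # all outputs have C = 256 channels
```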

Correlation-attention module

The motivation for designing this module is that traditional convolutional neural networks (CNNs) have certain limitations when capturing long-distance dependencies. The attention mechanism can directly weight and aggregate the entire input when processing dependencies of any distance, thereby effectively capturing long-distance relationships. It also allows the model to give higher weights to important parts of the input; this dynamic weight adjustment enables the model to focus more effectively on key information and ignore irrelevant parts, improving task accuracy. The correlation-attention module is introduced mainly to enhance the model’s ability to capture long-distance dependencies, dynamically focus on key information, improve the model’s expressiveness and computational efficiency, and increase its robustness and interpretability.

This module, shown in Fig. 5, is based on the observation that low-level features contain contextual information related to high-level features, while high-level features encapsulate more abstract representations derived from low-level features. Therefore, we introduce the correlation-attention module (CAM). This module captures the complex relationships that exist between these features and then merges them into multi-scale feature sets. In this iterative process, starting from the low-level features \(N_2\), we follow two key steps: computing the correlative attention and generating the correlative-attention feature map \(A_{n+1}\) (where n ranges from 2 to 4). In detail, first, the contextual information between the low-level features \(N_n\) and the high-level features \(N_{n+1}\) is obtained by calculating the correlative attention. Second, this attention information is embedded into the higher-level features. GAP is used to collect global information for each channel within \(N_n\), and this collected information is then used to compute the correlative attention.

Figure 5

Illustration of the CAM.

$$\begin{aligned} Z(N_n)=G_{pool}(N_n )= \frac{1}{H_n \times W_n}\sum _{i=1}^{H_n} \sum _{j=1}^{W_n}N_n{(i,j)} \end{aligned}$$
(23)

where \(Z(N_n)\in \mathbb {R}^C\) represents the pooled features, \(G_{pool}\) refers to the global average pooling operation, and \(N_n\in \mathbb {R}^{H_n\times W_n\times C}\ (n=2,3,4)\) denotes the multiscale convolutional feature map, where \(H_n\), \(W_n\) and C denote the height, width and channel dimension of \(N_n\).

The correlation between adjacent channels is encoded by applying a one-dimensional convolution to \(Z(N_n)\) along the channel dimension, and the result is passed to the higher-level feature \(N_{n+1}\) as the encoded scale context dependency. The learned correlative attention is expressed as

$$\begin{aligned} \gamma (N_n)=\sigma (G_{conv}^{1}(Z(N_n))) \end{aligned}$$
(24)

where \(\gamma (N_n)\in \mathbb {R}^C\) represents the set of attention weights and \(\sigma\) is the sigmoid function. The one-dimensional convolution of \(Z(N_n)\) is formulated as

$$\begin{aligned} V(N_n)=G_{conv}^{1} (Z(N_n))=W_k\cdot Z(N_n) \end{aligned}$$
(25)

where \(V(N_n )\in \mathbb {R}^C, G_{conv}^{1}\) represents a one-dimensional convolution, and \(W_k\in \mathbb {R}^{C\times C}\) are the parameters of the filters in the one-dimensional convolution.

After the attention weights \(\gamma (N_n)\) are calculated, we perform the element-wise multiplication on \(\gamma (N_n)\) and \(N_{n+1}\) to compute the feature maps with correlative-attention.

$$\begin{aligned} A_{n+1}= {\left\{ \begin{array}{ll} N_{n+1}& \text{ if } n=1\\ \gamma (N_n)\otimes N_{n+1}& \text{ if } n=2,3,4 \end{array}\right. } \end{aligned}$$
(26)

where \(A_n\in \mathbb {R}^{H_n\times W_n\times C}\ (n=2,3,4,5)\) and \(\otimes\) represents the element-wise multiplication operation. According to (26), the correlative-attention features corresponding to \(N_2, N_3, N_4,\) and \(N_5\) can be derived.

Ultimately, the pooled correlative-attention feature maps are fused using concatenation (concat for short) to produce the final multi-level fused correlative-attention features F

$$\begin{aligned} F=Concat[G_{pool} (A_2),G_{pool}(A_3),G_{pool}(A_4),G_{pool}(A_5)] \end{aligned}$$
(27)

where \(F\in \mathbb {R}^{1\times 1\times 4C}\).
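The following sketch summarises the CAM computation in Eqs. (23)–(27): GAP over each level, a one-dimensional convolution across the channel dimension, a sigmoid producing the attention weights, element-wise multiplication with the next level, and concatenation of the pooled attended features. The 1-D kernel size (here 3) is an assumption not stated in the text.

```python
import torch
import torch.nn as nn

# Sketch of the correlation-attention module (Eqs. (23)-(27)): each level is
# globally average-pooled, a 1-D convolution across the channel dimension
# encodes inter-channel correlation, a sigmoid gives the attention weights,
# and these weights re-scale the next higher level. The pooled attended maps
# are concatenated into the fused feature F.
class CorrelationAttention(nn.Module):
    def __init__(self, k=3):
        super().__init__()
        self.conv1d = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.gap = nn.AdaptiveAvgPool2d(1)

    def attention(self, n):                                    # Eqs. (23)-(25)
        z = self.gap(n)                                        # B x C x 1 x 1
        v = self.conv1d(z.squeeze(-1).transpose(1, 2))         # B x 1 x C
        return torch.sigmoid(v.transpose(1, 2).unsqueeze(-1))  # B x C x 1 x 1

    def forward(self, n2, n3, n4, n5):
        a2 = n2                                                # Eq. (26), first case
        a3 = self.attention(n2) * n3
        a4 = self.attention(n3) * n4
        a5 = self.attention(n4) * n5
        pooled = [self.gap(a).flatten(1) for a in (a2, a3, a4, a5)]
        return torch.cat(pooled, dim=1)                        # Eq. (27): B x 4C

n2, n3, n4, n5 = (torch.randn(1, 256, r, r) for r in (56, 28, 14, 7))
fused = CorrelationAttention()(n2, n3, n4, n5)  # shape (1, 1024)
```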

Dataset description

The proposed MDRCN is evaluated on four publicly available remote sensing scene datasets: UC-Merced, WHU-RS19, AID and NWPU-RESISC45. The detailed information of these four datasets is given in Table 3.

Table 3 Basic information of the four datasets.
Figure 6

Scene examples from the UC Merced Land-Use dataset.

Figure 7

Scene examples from the AID dataset.

Figure 8

Scene examples from the WHU-RS19 dataset.

Figure 9

Scene examples from the NWPU-RESISC45 dataset.

  1. UC Merced Land-Use Dataset (UCM): The UCM dataset is a widely used benchmark for remote sensing scene classification. Figure 6 shows some examples of images from this dataset.

  2. AID: This dataset contains 10,000 aerial scene images. The dataset was collected from Google Earth by Wuhan University and has small differences between classes, high differences within classes and a large scale. Figure 7 shows example images for each class.

  3. WHU-RS19: This dataset contains 1005 images divided into 19 categories and was published by Wuhan University in 2012. Figure 8 shows examples of each class.

  4. NWPU-RESISC45: This dataset contains 31,500 images divided into 45 categories and was collected by Northwestern Polytechnical University in 2016. Figure 9 shows examples of each class.

Experimental details

  1. Train-to-test ratio: To ensure a fair comparison with state-of-the-art algorithms in our experiments, we adopt training-to-test ratios that are consistent with the comparative work on the different datasets.

  2. Model initialisation: In the MDRCN network, the parameters pre-trained on ImageNet are used as the initialisation parameters of the deep convolution layers, and the parameters of the other network layers are initialised at random. All offset parameters are initially set to 0.001.

  3. Training procedure: We use the PyTorch deep learning framework to perform experiments on an NVIDIA RTX 2080S GPU. All images are resized to 224 \(\times\) 224 pixels as input, the batch size is set to 16, and the Adam optimiser is used for parameter optimisation; a minimal configuration sketch is given after this list.

Comparison with state-of-the-art methods

In order to fully verify the progress of our proposed method, we compared it with some state-of-the-art methods, including AlexNet34, TEXNet27, GoogLeNet34, VGG-16-CapsNet35, VGG-VD-16-SAFF36, CSDS37, MSRes-SplitNet17, EFPN-DSE32, TDFE-DAA38, RANet39, EFPN-DSE-TDFF32, EAM40, T-CNN41, EMSCNet(ResNet-50)42, EMSCNet(ViT-B)42, SCCov(Alexnet)43, SCCov(VGG16)43, D-CNN with VGGNet-1644, MLDS45 and Two-Stream Fusion46. To ensure fairness in our comparisons, we repeated the experiments ten times and calculated both the average and standard deviation of the Overall Accuracy (OA).

Experimental results

  1. Table 4 shows the classification results of our proposed MDRCN method on the UCM dataset as well as the results of other state-of-the-art methods. Figure 10 depicts histograms of the performance of these different methods. Figure 11 shows the confusion matrix results under the same training conditions.

Table 4 Experimental results of the UCM dataset.
Figure 10

Comparison of the accuracies on the UCM dataset.

Figure 11

Confusion matrix of the UCM dataset.

At 50% and 80% training rates, except for EMSCNet(ViT-B) and EMSCNet(ResNet-50), our method (MDRCN) outperforms the other deep feature methods. At a 50% training rate, methods such as CSDS, MSRes-SplitNet, TDFE-DAA and RANet also show competitive performance.

In addition to evaluating the overall accuracy, we also perform a detailed analysis of our proposed method using a confusion matrix, showing the best results under different fixed training ratios. When the training ratio is 50%, the classification accuracy of 18 categories reaches 100%. In particular, the classification accuracy of “denseresidential”, “intersection” and “sparseresidential” exceeds 95%. However, there is still a 5% probability that a “denseresidential” image is incorrectly classified as “buildings”, and the probability of misclassifying an “intersection” image as “mediumresidential” is also 5%. At a training ratio of 80%, most scene categories achieve 100% classification accuracy and 20 categories reach at least 95%, with the only exception being the “mediumresidential” category. The confusion matrix shows that some “denseresidential” images are incorrectly classified as “sparseresidential”. The reason for these classification errors may be the high similarity between classes.

Table 5 Experimental results of the AID dataset.
  2. Table 5 presents the classification results obtained by the proposed MDRCN method on the AID dataset, alongside the outcomes of other advanced methods. For a more straightforward comparison, Fig. 12 illustrates histograms depicting the performance of these various methods.

Figure 12

Comparison of the accuracies on the AID dataset.

Figure 13

Confusion matrix of the AID dataset.

When the training rate is set at 50%, MDRCN achieves comparable results (95.66%) among all deep feature methods. Upon reducing the training ratio to 20%, our method demonstrates superior performance, while the overall accuracy of EAM declines to 93.13%. This suggests that MDRCN exhibits stronger discriminative capability than EAM.

In Fig. 13, owing to space limitations, we only display the confusion matrices for training ratios of 20% and 50%. Figure 13a showcases the confusion matrix for a training rate of 20%. It can be observed that most categories achieve satisfactory classification results of over 90%, with only five categories (“Centre”, “Church”, “Resort”, “School”, and “Plaza”) experiencing significant misclassification. Despite the substantial reduction in the number of training images, certain categories prone to misclassification can still be effectively classified, such as “medium-sized residence” (93%) and “sparse residence” (99%), or “bridge” (97%), “port” (96%), and “river” (97%).

As depicted in Fig. 13b, the classification accuracy of 26 out of 30 categories surpasses 95%. Categories with classification accuracy below 95% include “Centre” (91%), “Resort” (86%), “School” (93%), and “Plaza” (94%). In the AID dataset, the most significant confusion arises between “Resort” and “Park”, “School” and “Business”, or “Centre” and “Plaza”. This phenomenon may be attributed to the presence of similar ground objects or geometric structure distributions.

  3. Table 6 shows the classification results achieved by the proposed MDRCN method, as well as the results of other cutting-edge methods, all of which are evaluated on the WHU-RS19 dataset. To provide a more visual comparison, Fig. 14 depicts histograms illustrating the performance of these different methods.

Table 6 Experimental results of the WHU-RS19 dataset.
Figure 14

Comparison of the accuracies on the WHU-RS19 dataset.

Figure 15

Confusion matrix of the WHU-RS19 dataset.

In Fig. 15, the confusion matrices at training ratios of 60% and 40% are shown, respectively. As shown in Fig. 15a, among the 19 categories, except for the “port” category (0.95), the classification accuracy rates all reach 1.00. In the WHU-RS19 dataset, the greatest confusion occurs between “port” and “bridge”. The explanation for these results is that these classes have similar geometric distributions.

Figure 15b shows the confusion matrix when the training rate is 40%. Among the 19 categories, except for the “football field” (0.95) and “port” (0.91) categories, the classification accuracy rates all reach 1.00. As can be seen, 5% of the “football field” images are incorrectly classified as “industrial” and 9% of the “port” images are incorrectly classified as “airport”. This may be attributed to their similar landforms.

Table 7 Experimental results of the NWPU-RESISC45 dataset.
Figure 16

Comparison of the accuracies on the NWPU-RESISC45 dataset.

Figure 17

Confusion matrix of the NWPU-RESISC45 dataset.

  4. Table 7 presents the classification results obtained by the proposed MDRCN method on the NWPU-RESISC45 dataset, alongside the outcomes of other advanced methods. For a more straightforward comparison, Fig. 16 illustrates histograms depicting the performance of these various methods.

Figure 17 shows the confusion matrices when the training rate is 10% and 20%, respectively. As shown in Fig. 17a, among the 45 categories, only the “church” and “palace” categories have a classification accuracy below 80%; all other categories exceed 80%. In addition, 10% of the “church” images are incorrectly classified as “palace”, 6% of the “desert” images are incorrectly classified as “mountains”, and 6% of the “terrace” images are incorrectly classified as “rectangular-farmland”. Figure 17b shows the confusion matrix when the training rate is 20%. Among the 45 categories, only the “palace” category has a classification accuracy below 80%, while all other categories exceed 80%. Here, 9% of the “church” images are incorrectly classified as “palace”, and 7% of the “lake” images are incorrectly classified as “wetland”. It can be seen that in the NWPU-RESISC45 dataset, the most confused categories are “church” and “palace”, because the similarity between these categories is relatively high.

Discussion

In order to effectively evaluate our method, different ablation experiments are performed below using different connection possibilities.

Impact of data augmentation

During training, the input images are augmented by random horizontal mirroring and random rotation, producing a richer set of training images than the original data. To verify the effectiveness of data augmentation, we compare the methods with and without data augmentation. Table 8 shows the results of the comparison. In this table, \(MDRCN^{+}\) represents the method with data augmentation, while \(MDRCN^{-}\) represents the method without data augmentation. Experimental results show that with data augmentation, the overall accuracy increases by more than 0.5%.
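For reference, a minimal torchvision sketch of the augmentation pipeline described above is given below; the rotation range is an assumption, since the text only states that random rotation is used.

```python
from torchvision import transforms

# Sketch of the augmentation pipeline used for MDRCN+: random horizontal
# mirroring and random rotation on the training images. The rotation range
# (+/- 90 degrees) is an assumption.
train_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=90),
    transforms.ToTensor(),
])
```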

Table 8 Ablation analysis on data augmentation.

Effects of different modules

There are three modules in this framework: MSFE, DRCFF and CAM. Each control architecture omits only one module at a time. Figure 18a–c shows the architectures with MSFE, DRCFF and CAM omitted, respectively. Table 9 shows the overall accuracy at 80% and 50% training rates for the different architectures on the UCM dataset. Table 10 shows the overall accuracy of the different architectures on the NWPU-RESISC45 dataset at 10% and 20% training rates.

Effects of MSFE: The results of Scheme 1 are the worst, because the MSFE is omitted in this architecture, while the function of the MSFE is to initially strengthen the semantics of the feature maps at all levels. Compared to Scheme 1, Schemes 2, 3 and 4 include the MSFE. On the UCM dataset, their overall accuracy improves by 0.47%, 0.83%, and 1.35% at an 80% training rate, and by 0.79%, 0.98%, and 1.47% at a 50% training rate. On the NWPU-RESISC45 dataset, their overall accuracy improves by 0.97%, 1.27% and 1.47% at a 10% training rate, and by 0.75%, 1.12%, and 1.96% at a 20% training rate. These results show that the MSFE we proposed is indeed beneficial for remote sensing scene classification.

Figure 18

Concise illustration of various architectures. (a) Without MSFE. (b) Without DRCFF. (c) Without CAM.

Table 9 Overall accuracy (%) of different architectures on UCM dataset.
Table 10 Overall accuracy (%) of different architectures on NWPU-RESISC45 dataset.

Effects of DRCFF: In Scheme 2, the CAM is directly linked to the output of the MSFE without the DRCFF. In comparison, our method, Scheme 4, achieves better performance through the DRCFF, with overall accuracy improvements of 1.35% and 1.47% at 80% and 50% training rates, respectively. On the NWPU-RESISC45 dataset, the overall accuracy increases by 0.50% and 1.21% at 10% and 20% training rates, respectively. The effectiveness and superiority of our proposed DRCFF are strongly confirmed by these results.

Effects of CAM: In Scheme 3, the CAM is removed and replaced by a simple GAP layer for the subsequent scene classification. From Table 9 it can be seen that although Scheme 3 performs better than Schemes 1 and 2, there are still slight decreases in comparison to Scheme 4: at training rates of 80% and 50%, the overall accuracy decreases by 0.52% and 0.49%, respectively. As can be seen from Table 10, Scheme 4 has the best performance; compared with the first three schemes, its overall accuracy is higher at both the 10% and 20% training rates.

Conclusion

In this paper, we build an efficient deep learning framework called MDRCN. Considering the feature diversity at different levels, we design the MSFE to extract multi-scale feature maps. The DRCFF module is introduced to merge all the extracted features and allow each feature to interact with the others. A CAM module is also introduced to extract key supplementary semantic information, thereby further improving classification performance. Our proposed method achieves significant improvements over existing algorithms in terms of effectiveness and accuracy, ultimately achieving state-of-the-art results on three mature remote sensing scene classification benchmarks. In future work, we will pay more attention to designing effective plug-and-play modules, such as the proposed MSFE, and embedding them into different CNN architectures to further improve the network’s remote sensing scene classification capabilities.