Introduction

In recent years, the advancement of very high-resolution (VHR) satellite sensors has profoundly impacted the domain of multi-image processing. VHR satellite sensors enable the acquisition of Earth's surface imagery at exceptionally high resolutions, offering substantial contributions to domains such as earth resource understanding and management1,2, urban planning3, and environmental monitoring4. Notably, the utilization of VHR imagery has marked a significant breakthrough in land cover classification tasks5,6. While CNNs have showcased impressive capabilities in accurately categorizing land cover through intricate feature extraction from satellite and aerial images, their receptive fields are inherently confined. This constraint can impede the effective capture of extensive geographical features embedded in remote-sensing images. Moreover, traditional convolutional methodologies encounter challenges when handling high-resolution images, primarily because they struggle to accurately identify subtle spatial patterns and spectral variabilities. High-resolution imagery encompasses fine details and complex spatial structures that conventional convolutional approaches may overlook or inadequately capture, leading to diminished performance in processing such images.

The quantity of available VHR imagery is significantly smaller than that of natural images. Classic CNNs often enhance accuracy by increasing the number of convolutional layers, using smaller convolution kernels, and incorporating multiple residuals7,8. However, increasing network depth and complexity in this way can exacerbate overfitting, especially when limited data cause models to fit the training set too closely. To address the issues brought about by deepening the network with additional convolutional layers, researchers have adopted attention mechanisms for global modeling of VHR images to compensate for CNNs' limitation of capturing only local features during hierarchical learning. In the field of remote sensing, the ViT has become a focal point of research due to its immense potential in global information modeling, largely owing to the attention-based learning methods of the ViT series. Wang et al.9 compensate for this limitation of CNNs by using attention mechanisms to globally model very high-resolution imagery. However, the ViT structure relies on global self-attention mechanisms that fail to adequately extract and utilize local features in remote sensing imagery, leading to decreased performance in tasks that require handling local detail, such as geospatial classification in remote sensing images.

An increasing number of researchers are exploring combined CNN-Transformer mechanisms for VHR land cover classification10,11. Ding et al.12 proposed utilizing multiscale feature fusion and probabilistic decision fusion strategies to integrate local spatial features with global spectral features, addressing complex spatial-spectral associations and facilitating effective interaction between multimodal data. Song et al.13 designed a dual-backbone attention fusion module and a multilayer dense connectivity network to integrate both local and global contextual information. To overcome the limitations inherent in convolution operations, which restrict the network's ability to extract global contextual information, and the Transformer's deficiencies in capturing detailed local information, this paper proposes a local feature acquisition and global context understanding network. This approach couples the hierarchical feature representation of CNNs with the global dependency relationships of Transformers, leveraging the strengths of both paradigms.

The primary contributions of this research are as follows.

(1) Proposing a neural network, LFAGCU, for VHR remote sensing image classification, which synergistically learns the local and global semantic information embedded in images through two design paradigms.

(2) Constructing a local feature extractor that explicitly considers spatial relationships between fine-grained image pixels and the characteristics of geographic attributes by exploiting the inductive biases of CNNs.

(3) Introducing a global feature learning (GFL) module for image modeling, facilitating a proficient understanding of spatial relationships and inherent contextual clues in terrestrial entities.

(4) Conducting a series of experiments on widely used open-source datasets, namely RSSCN7, WHU-RS19, and UCMerced-LandUse, demonstrating that LFAGCU outperforms other advanced methods for remote sensing image classification.

Related works

Feature extractor based on CNN

Rezaee et al.14 integrated spatial features and spectral attributes using a pre-trained AlexNet, enhancing thematic land cover information. Jamali et al.15 highlighted the synergy between wavelet transformation and deep convolutional networks for effective feature extraction from imagery. Scott et al.16 merged CaffeNet, GoogLeNet, and ResNet50 architectures, focusing on category-specific information aggregation. Residual networks like ResNet have found extensive use in land cover analysis; however, in limited-dataset scenarios they may face overfitting issues17. Jamali et al.18 emphasized tailored network architectures for comprehensive land feature attribute capture in land cover tasks. Singh19 combined convolutional structures and contractive-expansive-contractive networks with residual connections for improved feature representation. Conventional CNN models might not suit land cover images acquired by VHR satellite sensors due to acquisition disparities, so scholars pursue multi-scale and multi-modal learning for remote sensing insights. Gbodjo et al.20 used knowledge distillation to handle multi-temporal and multi-scale remote sensing data, enriching feature representations. Li et al.21 fused optical imagery with other modalities, enhancing interpretability with a semantic consistency constraint algorithm. Ye et al.22 proposed controllable filters and a multi-scale strategy for improved remote sensing feature learning. Fan et al.23 applied "pyramid features of orientated self-similarity" for multi-modal remote sensing image matching, overcoming geometric distortions and intensity variations.

Current research highlights the pivotal role of CNNs in land cover classification. However, these methods do not adequately consider the representation of global features from a macro perspective, especially in tasks involving the capture and presentation of semantic information in VHR remote sensing imagery.

Feature extractor based on transformer

The recent progress in deep models, especially the ViT24, has sparked interest in leveraging attention mechanisms for land cover classification. Li et al.25 utilized a multi-head encoder and a knowledge-guided decoder to capture diverse land patterns. However, ViT faces challenges due to its high parameter count compared to traditional CNNs26,27. Lv et al.28 introduced a spatial channel-preserved ViT model, enhancing feature preservation and representing a significant advancement. Yao et al.29 demonstrated exploiting spatial and pattern-specific channel information by integrating ViT modules with separable convolutional architectures. Zhao et al.30 explored the multi-sample contrastive ViT, revealing nuanced patterns across distinct samples. Hou et al.31 enriched cross-domain learning using pseudo-label self-training and consistency regularization, highlighting ViT's adaptability in addressing diverse challenges. Tang et al.32 employed a self-attention mechanism to integrate features at different levels, deeply exploring subtle representations in remote sensing images.

Recent ViT-based methods, leveraging attention mechanisms, have made significant strides in image processing, particularly in handling high-resolution images. However, these methods often suffer from a high number of parameters, leading to increased model complexity. While ViT methods focus on capturing global semantic features of high-resolution images, their effectiveness in learning subtle local features is not as prominent as CNN models, primarily due to the lack of an inductive bias strategy. Addressing this challenge, we propose an approach that combines local induction and global feature integration to complement the deficiencies of ViT in learning local features.

Methodology

We introduce LFAGCU, a new network design specifically crafted for accurately classifying land cover in VHR remote sensing images. This architecture combines classical convolutions, pointwise convolutions, and focused depthwise convolutions (DConv), which inherently capture local sensitivities. Additionally, we incorporate a GFL module to understand comprehensive semantic attributes and spatial relationships. As illustrated in Fig. 1, LFAGCU takes inspiration from the ResNet paradigm, employing a multi-residual connection approach to preserve the innate features extracted within each feature extraction unit. These original features are merged with subsequent updated features to foster an enhanced grasp of both local and global attributes inherent in land objects. The overarching LFAGCU framework comprises \(n = 2\) sets of linearly combined convolutional groups, strategically devised to fully capture the contextual relationships and geometric intricacies underpinning the entirety of the land entities' spatial landscape. In tandem, \(m = 2\) sets of GFL are harnessed to holistically model the entire image, allowing the features at each position to comprehensively perceive the broader array of inter-land relationships. These modules dynamically adjust the feature weights at each position based on the spatial relationships between land objects, thereby adeptly encapsulating the global characteristics of land entities. Ultimately, the contextually enriched and spatially encoded features are fused via a global pooling layer, yielding the definitive category of the input image. This design aims to seamlessly integrate local features, global attributes, and spatial relationships within land cover classification tasks, enhancing classification accuracy and the capacity for feature expression in VHR remote sensing imagery.
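For concreteness, the following minimal PyTorch sketch illustrates this macro-architecture as described above: a stem convolution, \(n = 2\) residual local stages, \(m = 2\) residual global stages, global pooling, and a classifier head. The channel widths and the `conv_bn_silu` stand-in blocks are our illustrative assumptions, not the exact released implementation.

```python
import torch
import torch.nn as nn

def conv_bn_silu(ch):
    # stand-in for one feature extraction unit; the real local unit (CBA/DCBA)
    # and the real GFL block (multi-head self-attention) are detailed below
    return nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.SiLU())

class LFAGCUSketch(nn.Module):
    def __init__(self, in_ch=3, width=64, num_classes=19, n=2, m=2):
        super().__init__()
        self.stem = nn.Conv2d(in_ch, width, 3, padding=1)
        self.local_stages = nn.ModuleList(conv_bn_silu(width) for _ in range(n))
        self.global_stages = nn.ModuleList(conv_bn_silu(width) for _ in range(m))
        self.head = nn.Linear(width, num_classes)

    def forward(self, x):
        x = self.stem(x)
        for blk in self.local_stages:
            x = x + blk(x)          # residual: keep innate features, add updates
        for blk in self.global_stages:
            x = x + blk(x)          # residual connection around global modeling
        x = x.mean(dim=(2, 3))      # global pooling over H x W
        return self.head(x)

logits = LFAGCUSketch()(torch.randn(1, 3, 224, 224))  # -> shape (1, 19)
```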

Figure 1 Overview of the local perception and global context modeling network.

Local perception

The central focus of LFAGCU lies in local feature extraction from VHR remote sensing images, achieved through the integration of two pivotal modules: convolution-batch normalization-activation (CBA) and DConv-batch normalization-activation (DCBA). For a given input image, a preliminary preprocessing step converts it into a three-dimensional tensor of height H, width W, and C channels. Within the initial CBA layer, the output of the standard convolutional layer can be expressed as follows:

$$ Y(i,j,k) = \sum\limits_{a = 0}^{n - 1} {\sum\limits_{b = 0}^{n - 1} {\sum\limits_{c = 0}^{C - 1} {X_{(i + a,j + b,c)} \cdot \omega_{(a,b,c,k)} } } } , $$
(1)

where \(Y \in {\mathbb{R}}^{H \times W \times K}\) represents the resulting output tensor, and K is the number of convolutional kernels. The index k denotes a specific convolutional kernel, while i and j index the height and width positions of the convolved output. The variables a and b indicate the vertical and horizontal offsets of the convolutional kernel on the input tensor, and c is the channel index of the input tensor. \(\omega_{(a,b,c,k)}\) denotes the weight of the convolutional kernel at position (a, b), connecting input channel c to output channel k.

In the domain of VHR image processing, it is often imperative for models to possess sufficient capacity to intricately capture the nuances present in the images. However, this pursuit can potentially lead to the problem of overfitting, especially in scenarios where the available data is limited18. Moreover, VHR remote sensing images are subject to significant variability due to variations in geographical regions, temporal factors, and sensor characteristics33. To address these challenges, LFAGCU incorporates standard n × n convolutions paired with batch normalization layers (BN). This integration of BN aids in normalizing the mean and variance of individual feature channels, effectively aligning them with a standard normal distribution. This normalization process fosters enhanced learning of pertinent feature representations by rendering the network more amenable to capturing the intricacies embedded within VHR images. The mathematical formulation for the output of BN can be succinctly described as:

$$ Y^{BN} = \gamma \cdot \frac{X - \mu }{{\sqrt {\sigma^{2} + \varepsilon } }} + \beta , $$
(2)

where \(Y^{BN}\) is the normalized output feature map, \(\gamma\) and \(\beta\) are learnable parameters that respectively scale and shift the normalized features, \(\mu\) is the mean of X, \(\sigma^{2}\) is the variance of X, and \(\varepsilon\) is a small positive constant. Following feature map normalization, LFAGCU introduces nonlinearity. VHR remote sensing images often exhibit intricate terrain boundaries, textures, and distinctive features, necessitating the network's capability to capture the diverse nonlinear characteristics inherent in the data34. In Fig. 2, we present a visual comparison of the transformation curves of eleven nonlinear activation functions. Through experimental comparisons and subsequent discussion (elaborated in Sect. "Nonlinear transformation ablation experiment"), we observe that the sigmoid-weighted linear unit (SiLU) activation function excels within the LFAGCU framework. SiLU combines both linear and nonlinear attributes, endowing it with enhanced adaptability to the intricate feature variations evident in VHR remote sensing images. This selection is the product of comprehensive experimental validation; we contend that SiLU adeptly facilitates LFAGCU in capturing pivotal image features, thereby elevating land cover classification performance. The mathematical formulation of SiLU is as follows:

$$ Y^{s} = X \cdot \frac{1}{{1 + e^{ - X} }}. $$
(3)
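A minimal sketch of the resulting CBA unit, chaining Eqs. (1)-(3), might look as follows; the kernel size and channel widths here are illustrative assumptions:

```python
import torch
import torch.nn as nn

class CBA(nn.Module):
    """Convolution -> BatchNorm -> Activation unit (Eqs. (1)-(3))."""

    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=k, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)     # per-channel normalization (Eq. (2))
        self.act = nn.SiLU()                 # x * sigmoid(x) (Eq. (3))

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

y = CBA(3, 64)(torch.randn(2, 3, 64, 64))    # -> shape (2, 64, 64, 64)
```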
Figure 2 Nonlinear transformation function curves.

To enhance the learning of variations in the channel dimension, we apply pointwise convolution to the feature map produced by the standard convolution, as shown in Eq. (4). Convolutions are computed at each pixel location of the feature map, leaving the spatial dimensions unchanged. By combining information from different channels with the introduced nonlinearity, the tensor is projected into a higher-dimensional space, resulting in a more intricate feature representation.

$$ Y_{i,j,c}^{pc} = \sum\limits_{d = 1}^{D} {X_{i,j,d} \cdot \omega_{1,1,d,c} } , $$
(4)

where \(Y_{i,j,c}^{pc}\) represents the value of channel c at position (i, j) in the output feature map. \(\omega_{1,1,d,c}\) denotes the weight of the pointwise convolution, connecting input channel d to output channel c.

Finally, LFAGCU incorporates DConv (as shown in Eq. (5)), a technique that applies an individual convolution kernel to each channel of the input feature map. This approach maintains channel separability while capturing spatial dependencies within each channel. Given an input feature map X, DConv performs separate convolutions using n × n kernels for each input channel, resulting in distinct sets of output features.

$$ Y_{i,j,d}^{DC} = \sum\limits_{p = 1}^{n} {\sum\limits_{q = 1}^{n} {X_{i + (p - 1),j + (q - 1),d} } } \cdot \omega_{p,q,d} , $$
(5)

where \(Y_{i,j,d}^{DC}\) represents the value at position (i, j) in channel d of the output feature map. \(X_{i + (p - 1),j + (q - 1),d}\) stands for the value at position \((i + (p - 1),j + (q - 1))\) in channel d of the input feature map. \(\omega_{p,q,d}\) corresponds to the weight associated with channel d at position (p, q) of the convolutional kernel. Because an individual kernel is applied to each input channel, this yields a channel-wise separable convolution that decouples information between channels.
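In PyTorch terms, the pointwise convolution of Eq. (4) is a 1 × 1 convolution and the DConv of Eq. (5) is a grouped convolution with groups equal to the channel count; the channel sizes below are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Pointwise convolution (Eq. (4)): 1x1 kernels mix channels at each pixel,
# leaving spatial dimensions unchanged; here it projects 64 -> 128 channels.
pointwise = nn.Conv2d(64, 128, kernel_size=1)

# Depthwise convolution (Eq. (5)): groups=channels gives each channel its own
# n x n kernel, keeping channels separable while capturing spatial structure.
depthwise = nn.Conv2d(128, 128, kernel_size=3, padding=1, groups=128)

x = torch.randn(2, 64, 32, 32)
y = depthwise(pointwise(x))   # -> shape (2, 128, 32, 32)
```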

Global context modeling

By incorporating deeper convolutional layers and expanding the dimensions of the sliding window, the capability to extract features from VHR images can be enhanced, thereby bolstering the network's adaptability for tasks such as land cover classification. However, this approach substantially increases the model's trainable parameters, potentially compromising practical deployment due to increased computational demands and memory requirements. To address this challenge, LFAGCU introduces the global feature learning (GFL) module, which models long-range non-local dependencies by leveraging an effective receptive field spanning the dimensions H × W. Notably, while ViT has demonstrated remarkable effectiveness in diverse computer vision tasks35,36, it is limited in spatial inductive bias and is sensitive to fine-tuning, hindering its full potential for certain tasks37. Convolution operations take a weighted average of each pixel within the receptive field, so noise pixels can degrade the distinguishability of image target pixels; to overcome this, the GFL module utilizes a multi-head self-attention mechanism for comprehensive global context modeling. This enables a better capture of long-range non-local dependencies while preserving the spatial information of each patch within the feature map.

As illustrated in Fig. 3, the GFL unfolds the feature map after pointwise convolution into non-overlapping flattened patches, denoted as \(X_{pc} \in {\mathbb{R}}^{P \times N \times d}\), where \(P = wh\) is the number of elements within each patch (h and w being the patch height and width, respectively), \(N = HW/(wh)\) is the total number of patches, and d is the feature depth. This organization enables effective information aggregation across patches while preserving their spatial relationships, and permits efficient processing in subsequent analysis. In the multi-head self-attention mechanism, the flattened patch tensor \(X_{pc} \in {\mathbb{R}}^{P \times N \times d}\) is multiplied with the weight matrices \(W_{q}\), \(W_{k}\), and \(W_{v}\), yielding the linear transformations Q, K, and V, corresponding to query, key, and value representations, respectively. Within each attention head, the dot product between queries and keys is computed, followed by a scaling step to regulate the magnitude of attention scores. The resulting attention score matrix is denoted as \(Attention_{{head_{i} }}\).

$$ Attention_{{head_{i} }} = softmax\,\,\left( {\frac{{Q \times K^{T} }}{\sqrt d }} \right). $$
(6)
Figure 3 Overview of the global feature learning module.

The value matrix \(V_{i}\) is weighted by the attention score matrix \(Attention_{{head_{i} }}\), yielding the output matrix \(O_{{head_{i} }}\) specific to each attention head.

$$ O_{{head_{i} }} = Attention_{{head_{i} }} \cdot V_{i} . $$
(7)

In the multi-head self-attention layer of GFL, each attention head generates a self-attention matrix, which encapsulates the significance weights attributed to different positions within the remote sensing image. These weights are used to compute a weighted average of features within the flattened patch tensor, facilitating the assimilation of global contextual insights. Ultimately, by concatenating the individual head output matrices \(O_{{head_{i} }}\), a consolidated multi-head self-attention output matrix \(O \in {\mathbb{R}}^{{P \times N \times \left( {K \times d} \right)}}\) is obtained, empowering the model to encapsulate the interrelationships and interdependencies among diverse image positions. Notably, the multi-head self-attention strategy employed by LFAGCU is underpinned by a localized connectivity approach, whereby each patch position selectively focuses on its proximate counterparts. This localization contributes to a more refined assimilation of local features and spatial patterns, enabling the model to discern subtle disparities and intricate configurations in the image.
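A compact sketch of this multi-head self-attention computation (Eqs. (6) and (7)) is given below; the head count and dimensions are illustrative assumptions, and the patch folding/unfolding of Fig. 3 is omitted for brevity:

```python
import torch
import torch.nn as nn

class GFLAttention(nn.Module):
    """Sketch of the GFL multi-head self-attention (Eqs. (6)-(7))."""

    def __init__(self, d=64, heads=4):
        super().__init__()
        self.heads, self.dk = heads, d // heads
        self.wq = nn.Linear(d, d)   # W_q
        self.wk = nn.Linear(d, d)   # W_k
        self.wv = nn.Linear(d, d)   # W_v

    def forward(self, x):                       # x: (batch, tokens, d)
        b, t, d = x.shape
        # project and split into heads: (batch, heads, tokens, dk)
        q, k, v = (w(x).view(b, t, self.heads, self.dk).transpose(1, 2)
                   for w in (self.wq, self.wk, self.wv))
        # Eq. (6): scaled dot-product attention scores (d here is per-head)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.dk ** 0.5, dim=-1)
        # Eq. (7): weight the values by the attention scores
        out = attn @ v
        # concatenate the head outputs back to (batch, tokens, d)
        return out.transpose(1, 2).reshape(b, t, d)

tokens = torch.randn(2, 196, 64)               # e.g. flattened patch features
y = GFLAttention()(tokens)                     # -> shape (2, 196, 64)
```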

In summary, GFL adeptly harnesses the multi-head self-attention mechanism to ascertain both local and global feature representations while upholding the integrity of the patch and pixel sequence. The combination of conventional convolutions and transformer elements in the LFAGCU architecture facilitates both a localized and a comprehensive interpretation of VHR remote sensing images, while sharpening the model's ability to discern subtle intricacies and intricate relationships within the image.

Loss function

In land cover classification tasks, remote sensing images often exhibit significant variations across categories, a challenge exacerbated by imbalanced multi-class training samples. This matters because model training can be disproportionately influenced by the prevalence of samples from dominant categories, potentially compromising classification performance on underrepresented minority classes. In light of this, to provide more effective guidance for the optimization of LFAGCU, we denote the model's output as a predictive probability distribution \(P = (p_{1} ,p_{2} ,...,p_{C} )\), where C is the cardinality of the class set and \(p_{i}\) is the probability assigned by the model to the sample's membership in the i-th class. The ground truth labels are encoded as \(Y = (y_{1} ,y_{2} ,...,y_{C} )\), where \(y_{i}\) assumes binary values (0 or 1), signifying the true class membership of the sample. The loss function is formulated as follows:

$$ Loss = - \sum\limits_{i = 1}^{C} {y_{i} \log (p_{i} )} . $$
(8)

By minimizing the cross-entropy loss, the model undergoes iterative parameter adjustments during the training process to align its predicted probability distribution with the true label distribution as closely as possible. This endeavor facilitates enhanced model adaptation to diverse category samples, thereby improving its classification performance across various classes.
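In PyTorch, Eq. (8) corresponds to the standard multi-class cross-entropy; a minimal usage sketch with illustrative shapes follows (the library's loss applies softmax to raw logits internally):

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()      # multi-class cross-entropy (Eq. (8))

logits = torch.randn(4, 19)            # batch of 4, C = 19 classes (e.g. WHU-RS19)
labels = torch.tensor([0, 5, 18, 7])   # ground-truth class indices
loss = criterion(logits, labels)       # scalar loss to back-propagate
```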

Materials and data preparation

Network training

In the training process, we employed an NVIDIA RTX 3080 Ti GPU and trained for 80 epochs with a batch size of 64. During optimization, we employed the AdamW optimizer with a weight decay (L2 regularization) coefficient of 1e-2. Furthermore, an initial learning rate of 2e-4 was configured, and a cosine annealing learning rate scheduler dynamically adjusted the model's learning rate. We set a 2-layer local feature extraction block and a 2-layer GFL. We randomly selected 80% of annotated samples from each category to form the training set, allocating 10% for validation; the remaining 10% was reserved as the test set for evaluating model performance. Throughout training, we used the supervised loss function defined in Eq. (8) to guide optimization. With these configurations and training strategies, we aimed to fine-tune the LFAGCU model, enhancing its performance and generalization capabilities in land cover classification tasks.
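The optimizer and scheduler configuration described above can be sketched as follows; the stand-in model and the omitted data loaders are assumptions for illustration only:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 19))  # stand-in for LFAGCU

# AdamW with weight decay 1e-2 and initial LR 2e-4, as described in the text
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=1e-2)
# cosine annealing over the 80 training epochs
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=80)
criterion = nn.CrossEntropyLoss()

for epoch in range(80):                     # 80 epochs; batch size 64 in the loaders
    # for images, labels in train_loader:  # data loaders omitted in this sketch
    #     optimizer.zero_grad()
    #     criterion(model(images), labels).backward()
    #     optimizer.step()
    scheduler.step()                        # anneal the learning rate each epoch
```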

Study area

In this research, we employed three prominent remote sensing imagery datasets, namely the RSSCN7 dataset38, the WHU-RS19 dataset39, and the UC Merced Land Use dataset40, to substantiate the outcomes of our study. Data categories, sample sizes, and data dimensions for each of these datasets are documented in Table 1.

Table 1 Multi-class land cover datasets.

The RSSCN7 dataset has emerged as a valuable resource within the realm of remote sensing, incorporating diverse imagery collected from various geographical regions across China. This dataset has found applications in pivotal areas such as scene classification and object recognition41,42. It encompasses a wide spectrum of environments including urban, rural, and natural landscapes, thereby encapsulating a rich variety of land features. The dataset encompasses seven prevalent land cover categories as shown in Fig. 4a.

Figure 4 Illustration of experimental data samples. Map data: Google Earth. RSSCN7 in (a), WHU-RS19 in (b), and UCMerced-LandUse in (c).

The WHU-RS19 dataset is a prominent resource originating from the Institute of Remote Sensing at Wuhan University, specifically curated for advancing research in semantic segmentation and land cover classification39,43. It standardizes its evaluation on nineteen prevalent land cover categories, including roads, buildings, water bodies, forests, and croplands, each distinguished by a unique combination of spectral and textural attributes. This interplay of spectral information and textural nuances underscores the dataset's capacity for discerning and classifying diverse land cover classes. A visual schematic of the research categories within the WHU-RS19 dataset is presented in Fig. 4b.

The UC Merced Land Use dataset comprises high-resolution remote sensing images from various regions across the United States, catering to the needs of land use classification tasks44,45. The dataset encompasses a diverse range of land use types, including urban, industrial, and agricultural areas. The benchmark dataset encompasses twenty-one distinct land use categories, such as residential zones, airports, highways, and orchards, each characterized by unique color and texture attributes. An illustrative example of the research area within the UCMerced dataset is provided in Fig. 4c.

Results and discussions

Comparative experiment and analysis

To ensure a fair comparison, we quantitatively assessed the classification performance of these models using four widely recognized evaluation metrics: average precision (Pre), average recall (Rec), average accuracy (Acc), and average F1 score (F1). We selected classical models and state-of-the-art land cover classification models for comparative experiments, as shown in Tables 2, 3, and 4. The upper section displays models primarily based on CNN architectures, including neural networks that use full CNNs for feature extraction: VGGNet46, GoogleNet47, ResNet48, AlexNet14,49, MobileNet50, ShuffleNet-v251, DenseNet52, EfficientNet53, and the latest models re-evaluating inductive biases and aggregating feature extraction techniques, such as ConvNeXt54, Vgg-Vote55, DFAGCN56, and SNN-VGG-1542. The middle section includes models based on transformer architectures: MLLD57, HSL-MINet58, ViT-b-p1624, ViT-b-p3224, ViT-l-p1624, and T2T-ViT-1259. The lower section includes models with global-local perspective structures, such as TransResUNet60, BPECN61, SKAL-AlexNet62, SKAL-ResNet1862, SKAL-GoogleNet62, GCSANet63, EMTCAL32, and SF-MSFormer-ResNet1864.
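For reference, these four metrics can be computed with scikit-learn as sketched below; the macro averaging scheme and the label vectors are illustrative assumptions, not results from the paper:

```python
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

# illustrative ground-truth and predicted class labels
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 1, 2, 1, 1, 0]

pre = precision_score(y_true, y_pred, average="macro")  # average precision (Pre)
rec = recall_score(y_true, y_pred, average="macro")     # average recall (Rec)
acc = accuracy_score(y_true, y_pred)                    # accuracy (Acc)
f1 = f1_score(y_true, y_pred, average="macro")          # average F1 score (F1)
```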

Table 2 The results of different models on the RSSCN7 dataset.
Table 3 The results of different models on the WHU-RS19 dataset.
Table 4 The results of different models on the UCMerced-LandUse dataset.

In our investigation of the few-category, multi-sample dataset RSSCN7, we selected 2240 images as training samples. Comparative analysis (refer to Table 2) shows that under conditions of ample samples, the LFAGCU model exhibited outstanding performance, with Pre, Rec, Acc, and F1 scores of 0.9820, 0.9820, 0.9820, and 0.9818, respectively. The DenseNet161 model from the fully convolutional network series also demonstrated remarkable performance (Acc 0.9712, F1 0.9706), highlighting the effectiveness of deep convolutional network stacking in extracting potential semantic features from images when an adequate number of samples is available. Notably, compared to DenseNet161, LFAGCU achieved increases of 1.06% and 1.14% in the Acc and F1 metrics, respectively, demonstrating both the efficiency of LFAGCU in exploiting convolutional inductive biases and the strength of its GFL in considering spatial relations between fine-grained image pixels and geographical attribute features. Furthermore, compared to the recent DFAGCN and SNN-VGG-15 models, LFAGCU improved the Acc metric by 4.06% and 3.66%, respectively. In comparison to the ViT models based on wide-range feature extraction, LFAGCU exhibited at least a 2.52% improvement across all four major metrics. These results demonstrate LFAGCU's strong capability in understanding the spatial relationships and intrinsic contextual cues of terrestrial entities, and confirm the effectiveness of combining local and global semantic information through its two design paradigms. When compared with the best-performing local-global paradigm model, SKAL-ResNet1862, LFAGCU still holds a 2.16% higher accuracy, a result attributable to LFAGCU's more effective feature extraction and fusion mechanisms, which capture both the details (local features) and the overall structure (global features) of remote sensing images.

In our research involving the WHU-RS19 and UCMerced-LandUse datasets, each containing multiple categories with limited samples, approximately 42 and 80 images per category were selected for training, respectively (refer to Tables 3 and 4). A detailed analysis of Table 3 reveals that despite the limited number of training samples, state-of-the-art fully convolutional networks such as EfficientNet-b6-1k and ConvNeXt-b-1k achieved F1 scores of 99.34% and 99.01%, respectively. This highlights the significant impact of pre-trained parameters (where 1k and 22k denote pre-training on the ImageNet-1K and ImageNet-22K datasets, respectively) in enhancing the precision of deep convolutional networks on practical problems. However, overfitting persists in some networks, such as AlexNet, on these limited-sample datasets. Figure 10 illustrates the confusion matrices of several models that achieved identical accuracy values on the WHU-RS19 dataset; although these models are uniform in aggregate accuracy, the confusion matrices reveal nuanced variations in classification performance across categories. On the WHU-RS19 dataset, our LFAGCU model demonstrated outstanding performance, achieving an Acc of 98.96% and an F1 score of 99.34%, comprehensively surpassing traditional CNN networks and recent feature aggregation networks. Compared to local-global paradigm models such as TransResUNet, BPECN, SF-MSFormer-ResNet18, and LGFormer, LFAGCU still displays a distinct advantage. Although the latter three models report only their accuracy rather than comprehensive evaluation metrics, LFAGCU continues to demonstrate superior overall performance. Notably, even against LGFormer, whose accuracy reaches 99.20%, LFAGCU outperforms in terms of comprehensive capability. This enhanced performance can be attributed to LFAGCU's effective structural designs and strategies, which proficiently integrate local and global features and significantly boost the model's capability to comprehend and process complex data patterns. In summary, LFAGCU has demonstrated robust potential and exceptional performance in these challenging classification tasks.

Furthermore, an analysis of Table 4 reveals that on the UCMerced-LandUse dataset, the LFAGCU model achieved an accuracy of 99.53% and an F1 score of 99.62%. This underscores its exceptional ability to collaboratively learn the local and global information embedded within images, particularly evident in small-sample scenarios, unaffected by dataset sparsity or sample-size limitations. Compared to the ViT network, LFAGCU shows improvements of 3.29% in accuracy and 3.46% in F1 score. Even when compared to the best-performing local-global paradigm model referenced in the literature, SKAL-ResNet18, which had an accuracy of 99.52%, LFAGCU's performance is slightly superior. The balanced performance of LFAGCU across all evaluation metrics, including precision and recall, provides a more comprehensive understanding of its capabilities, highlighting its superiority in handling complex remote sensing image tasks.

Nonlinear transformation ablation experiment

This study analyzes the impact of various activation functions on land cover classification using the LFAGCU model across three datasets: RSSCN7, WHU-RS19, and UCMerced-LandUse. Eleven activation functions are evaluated: Sigmoid, RReLU, Softsign, ReLU, SELU, ELU, PReLU, ReLU6, LeakyReLU, Mish, and SiLU (as shown in Figs. 5, 6, and 7).

Figure 5 Impact of activation functions on land cover classification performance on RSSCN7.

Figure 6 Impact of activation functions on land cover classification performance on WHU-RS19.

Figure 7 Impact of activation functions on land cover classification performance on UCMerced-LandUse.

Across all metrics, SiLU performed best on the RSSCN7 dataset, capturing fine-grained nuances. ELU excelled in precision, recall, and accuracy on the WHU-RS19 dataset, with SiLU demonstrating adaptability. On the UCMerced-LandUse dataset, SiLU again performed outstandingly, followed by ReLU. Overall, SiLU consistently demonstrated high-performance characteristics across datasets, providing valuable optimization insights for accurate land cover classification using the LFAGCU model.

Hyperparameter ablation experiment and analysis

This study places particular focus on exploring the impact of different hyperparameters within the GFL module. These hyperparameters, grouped in the configuration sketch after the list below, encompass:

(1) Input and output channel dimensions entering and leaving the GFL module (Into/Out GFL-channels): these parameters influence the propagation and transformation of information within the GFL module, thereby affecting feature learning and model performance.

(2) Number of channels used for computing attention weights within the GFL module (Attention-channels): this parameter determines the dimensionality of each channel within the attention mechanism, thereby influencing the model's capacity to learn relationships among different channels.

(3) Dimensionality of linear layers within the GFL module (GFL-dimension): this parameter determines the dimensionality of the internal linear layers of the GFL module, consequently impacting the complexity and richness of feature transformation.
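As referenced above, these three hyperparameters can be grouped into a single configuration object; the sketch below is hypothetical, with illustrative values only (the actual settings appear in Table 5):

```python
from dataclasses import dataclass

@dataclass
class GFLConfig:
    """Hypothetical grouping of the three GFL hyperparameters explored in Table 5."""
    gfl_channels: tuple      # (1) input/output channel dims around the GFL module
    attn_channels: int       # (2) channels used to compute attention weights
    gfl_dim: int             # (3) width of the GFL module's internal linear layers

# Illustrative settings for the three variants; the actual values are in Table 5.
LFAGCU = GFLConfig(gfl_channels=(128, 128), attn_channels=128, gfl_dim=256)
LFAGCU_M = GFLConfig(gfl_channels=(96, 96), attn_channels=96, gfl_dim=192)
LFAGCU_S = GFLConfig(gfl_channels=(64, 64), attn_channels=64, gfl_dim=128)
```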

The study evaluated the effects of these hyperparameters by exploring them across three model configurations (LFAGCU, LFAGCU (M), and LFAGCU (S)), varying channel dimensions and transformation sizes, as presented in Table 5.

Table 5 LFAGCU model performance under various hyperparameter configurations.

The experimental results in Table 5 show that the "LFAGCU" model exhibits the highest complexity, with the greatest parameter count, suggesting a more potent capability to capture intricate land cover features. However, this increase in complexity comes at the expense of computational efficiency, as indicated by its notably elevated total multiply-adds. On the other hand, while the "LFAGCU (S)" model has the lowest complexity, it may still retain reasonable performance within the framework of local perception and global context modeling. We note that reducing the GFL hyperparameters correspondingly reduces model complexity and memory requirements: the "LFAGCU" model incurs the highest memory usage due to its larger parameter size, whereas the "LFAGCU (S)" model demonstrates markedly lower memory usage owing to its smaller parameter dimensions. When selecting a land cover classification model architecture, a trade-off between performance and complexity is crucial; we anticipate that the "LFAGCU" model could excel in tasks necessitating comprehensive extraction of remote sensing features, and this balance can be tailored to specific task requirements and available computational resources.

In the data analysis presented in Table 6, there is an overall increasing trend in model performance as the hyperparameters transition from 'Min' to 'Med' and then to 'LFAGCU', aligning with the initial expectations. This trend indicates that, in the context of VHR remote sensing image land cover classification tasks, employing larger model configurations generally leads to improved classification performance. Whether considering few-category multi-sample datasets (RSSCN7) or multi-category few-sample datasets (WHU-RS19 and UCMerced-LandUse), LFAGCU consistently exhibits the highest precision, recall, accuracy, and F1 score. This observation suggests that adopting larger model dimensions in terms of channels and GFL components contributes to a more reliable capture of intricate land cover patterns, thereby enhancing the performance of VHR remote sensing image classification.

Table 6 Performance of LFAGCU with different hyperparameters on different datasets.

This superior performance is further validated through two visualization techniques: confusion matrices and t-SNE analysis. Figure 8 shows that LFAGCU exhibits the largest and most numerous true-positive values. In Fig. 9, the t-SNE scatter plots show each sample's position in the reduced feature space; LFAGCU's plot exhibits the most distinct clustering within each class, affirming that larger channel and GFL dimensions effectively capture intricate land cover patterns and significantly enhance classification performance for high-resolution remote sensing images (Fig. 10).

Figure 8 Visualization of confusion matrices for the three datasets. RSSCN7 represented as (a), (b), and (c); WHU-RS19 as (d), (e), and (f); and UCMerced-LandUse as (g), (h), and (i). Panels (a), (d), and (g) correspond to the LFAGCU method; (b), (e), and (h) to the LFAGCU (M) method; and (c), (f), and (i) to the LFAGCU (S) method.

Figure 9 Visualization of t-SNE for the three datasets. RSSCN7 represented as (a), (b), and (c); WHU-RS19 as (d), (e), and (f); and UCMerced-LandUse as (g), (h), and (i). Panels (a), (d), and (g) correspond to the LFAGCU method; (b), (e), and (h) to the LFAGCU (M) method; and (c), (f), and (i) to the LFAGCU (S) method.

Figure 10 Confusion matrices comparing multiple models on the WHU-RS19 dataset.

Conclusion

To address the challenge of high-resolution remote sensing image land cover classification, this paper proposes an innovative dual-feature learning paradigm model named LFAGCU. Firstly, we designed a multi-channel and multi-dimensional local feature extractor, focusing on deep exploration of local details and semantic features across various spatial dimensions in remote sensing images; this step aims to comprehensively capture subtle land object characteristics and enhance classification accuracy. Secondly, we introduced the GFL module to abstract global features and contextual semantic information of remote sensing images, aiding the model in understanding the overall background and context and thereby enhancing its ability to recognize the entire image. Finally, we conducted extensive comparative and ablation experiments on three open-source datasets: RSSCN7, WHU-RS19, and UCMerced-LandUse. The results demonstrate LFAGCU's significant competitiveness in land cover classification, maintaining a leading position and exhibiting robust generalization capabilities. In future work, we intend to thoroughly explore the influence of local and global information captured at various receptive field sizes on the attribute information of land objects. Our goal is to refine and enhance current models, bolstering their capacity to efficiently process and analyze remote sensing imagery. This investigation is pivotal to improving model robustness and adaptability, ensuring consistent performance across a broad spectrum of domains and applications within remote sensing. By optimizing the interplay between local detail and global context, we aspire to push the boundaries of what these models can achieve in diverse settings, from urban landscapes to natural environments.