Abstract
Semantic segmentation of high-resolution remote sensing imagery is pivotal in decision-making and analysis in a wide array of sectors, including but not limited to water management, agriculture, military operations, and environmental protection. This technique offers detailed and precise feature information, facilitating accurate interpretation of the imagery. Despite its importance, existing methods often fall short as they lack a mechanism for spatial location feature screening. These methods tend to treat all extracted features on an equal footing, neglecting their spatial relevance. To overcome these shortcomings, we introduce a groundbreaking approach, the Spatially Adaptive Interaction Network (SAINet), designed for dynamic feature interaction in remote sensing semantic segmentation. SAINet integrates a spatial refinement module that leverages local context information to filter spatial locations and extract prominent regions. This enhancement allows the network to concentrate on pertinent areas, thereby improving the quality of feature representation. Furthermore, we present an innovative spatial interaction module that utilizes a spatial adaptive modulation mechanism. This mechanism dynamically selects and allocates spatial position weights, fostering effective interaction between local salient areas and global information, which in turn boosts the network’s segmentation performance. The adaptability of SAINet allows it to capture more informative features, leading to a significant improvement in segmentation accuracy. We have validated the effectiveness and capability of our proposed approach through experiments on the widely recognized public DeepGlobe, Vaihingen, and Potsdam datasets.
Introduction
With the increasing prevalence of high-resolution remote sensing satellites in global Earth observation missions, high-resolution remote sensing data has become abundant and is now the primary source for Earth observation. Semantic segmentation of high-resolution remote sensing images1,2,3,4,5 plays a vital role in understanding the distribution of ground object features, enabling refined urban management, environmental monitoring, natural resource assessment, crop analysis, precise surveying, and land cover mapping6. High-resolution remote sensing images possess distinct characteristics, including complex background information, dense targets, and rich ground object features. For the same area, they also have larger image sizes and more abundant pixel detail. This implies that high-resolution remote sensing images require a greater number of pixels to depict the same area, necessitating models that can handle high-resolution inputs (i.e., larger pixel sizes) in order to leverage the global information within these images. High-resolution remote sensing images provide finer spatial details about target features and texture information, but they also come with complex backgrounds. A single remote sensing image typically contains a multitude of features such as buildings, vegetation, farmland, and various terrain elements. Therefore, semantic segmentation of high-resolution remote sensing images entails not only extracting global information and local details in complex scenes, but also addressing the issue of handling excessively large input image sizes.
Semantic segmentation of high-resolution remote sensing images is a challenging task, with two primary solutions currently dominating the field: single-branch segmentation networks and dual-branch networks. As illustrated in Fig. 1a, single-branch segmentation networks7,8,9 utilize pure convolutional neural networks (CNNs) to bolster their feature extraction capabilities. These architectures typically rely on robust backbone networks to extract global features, which are subsequently fed into a segmentation head to perform remote sensing image segmentation. Zhu et al.10 employ a convolutional neural network (CNN) to extract local features and a hand-crafted filter to obtain global features. However, these methods often fall short in effectively capturing the intricate interplay between global and local features of target objects, particularly within the complex backgrounds of high-resolution remote sensing images. This limitation can lead to less accurate segmentation results, as the nuanced details of smaller objects or features may be overlooked. Khan and Basalamah6 combine DenseNet and UNet to extract multi-scale features while preserving low-level details through long-range connections. Their hybrid loss function addresses class imbalance, and their cyclical learning rate strategy enhances training convergence. However, the model lacks an adaptive feature interaction mechanism and spatial screening, limiting its ability to handle complex backgrounds and small targets. Moreover, due to the substantial size of high-resolution remote sensing images, these approaches often resort to downsampling the images prior to feature extraction. While this strategy reduces computational load, it comes with a significant drawback. Many objects of interest in remote sensing images are relatively small targets, consisting of only tens or even a few pixels. Consequently, the information content of these segmented objects is minimal. Downsampling can result in the loss of these small targets, negating the advantage of the rich feature information inherent in high-resolution images. Therefore, while single-branch segmentation networks have made strides in semantic segmentation, their limitations underscore the need for more sophisticated approaches that can effectively balance the extraction of global and local features and preserve the rich information content of high-resolution remote sensing images.
Comparison of SAINet with other networks for semantic segmentation of high-resolution remote sensing images: (a) Single-branch networks typically train on downsampled images, resulting in insufficient capability to extract local detailed information. (b) Dual-branch networks train with both global and local branches, fuse all the information from the different branches, and treat each feature equally. (c) Our proposed SAINet enables dynamic interaction of information across different features. The images used in this figure are sourced from the DeepGlobe dataset, and the dataset can be accessed at: https://www.kaggle.com/datasets/balraj98/deepglobe-land-cover-classification-dataset.
Addressing the previously mentioned issue of local feature underestimation, dual-branch architectures11,12,13,14 have emerged as a significant trend in the semantic segmentation community, making considerable strides in the field. As depicted in Fig. 1b, these architectures typically comprise two branches, each equipped with its own backbone network. One branch is dedicated to extracting global features, while the other focuses on local features. Khan and Basalamah3 integrate the pyramid pooling module with DenseNet to extract rich contextual features. Meanwhile, the local feature extraction module employs FCN and CNN to extract multi-scale local features and reduces background noise through specific strategies, thus enhancing the feature extraction capability. These extracted features are subsequently fused and input into a segmentation head, which performs the task of remote sensing image segmentation. This dual-branch network structure enhances segmentation performance by paying particular attention to local regions. However, the current methods for fusing global and local information from different branches are somewhat simplistic. They tend to treat all information equally, typically merging it through basic operations such as summation or concatenation. This approach, while straightforward, has its limitations. Remote sensing images often contain hidden salient regions, and the direct fusion method lacks a mechanism to selectively combine spatial and positional features. This lack of selectivity means that each local image is treated equally and merged with the segmentation of the entire image. This process not only overlooks the potential importance of certain local features but also consumes more computational resources. Therefore, there is a need for a more sophisticated approach that can dynamically and selectively fuse global and local information, taking into account the spatial relevance of different features.
As shown in Fig. 1, previous methods often directly aggregate features from dual branches, which tends to overlook hidden salient regions in remote sensing images, key areas crucial for accurate segmentation. This limitation results in potential information loss, as there is no mechanism to emphasize these critical regions. In SAINet, we address this gap by introducing the spatial refinement module (SRM). This module processes the entire image to produce a preliminary segmentation and isolates challenging local patches that are difficult to classify. These patches are then cropped and reintroduced into the backbone network for enhanced feature extraction. This targeted refinement ensures that latent salient regions are captured effectively, leading to significant improvements in segmentation accuracy. Previous approaches often rely on simple summation or concatenation for feature fusion, lacking selective mechanisms to prioritize relevant features. This undifferentiated merging of local and global features can dilute critical information and compromise segmentation results. To address this, SAINet incorporates the spatial interaction module (SIM), which enables adaptive interaction between locally salient regions and global information. By dynamically weighting spatial and positional features, SIM ensures that local details and global context are selectively emphasized. This selective fusion enhances the model’s ability to capture nuanced contextual information, significantly improving segmentation performance. To evaluate the effectiveness of the model, we conducted extensive experiments demonstrating that our proposed model outperforms state-of-the-art methods on the publicly available high-spatial-resolution image datasets DeepGlobe, Vaihingen, and Potsdam.
Our main contributions are summarized as follows:
- A novel spatially adaptive interaction network is proposed for the high-resolution remote sensing semantic segmentation task.
- Unlike previous methods, we present a new spatial refinement module that uses local context information to filter spatial positions and extract salient regions.
- We devise a spatial interaction module with an adaptive modulation mechanism that dynamically selects and allocates spatial position weights to focus on target areas.
Related works
Semantic segmentation
The Fully Convolutional Network (FCN)15 pioneered the use of end-to-end convolutional networks for semantic segmentation, laying the foundation for this field. FCN eliminated fully connected layers, making it adaptable to inputs of any size. It utilized simple bilinear interpolation for initialization and employed deconvolution layers to upsample the feature maps from the final convolutional layer, preserving spatial positional information for pixel-wise predictions. Since then, many deep learning-based methods have emerged continuously, promoting the development of this field16. The optional application of Conditional Random Fields (CRF)17 improved classification mapping and segmentation results. The encoder-decoder framework was subsequently introduced with influential models like UNet18 and SegNet19. UNet, structured as a U-shape, employed deconvolutional upsampling and feature map concatenation from corresponding scales in preceding layers. It achieved high speed and became a baseline network for medical image segmentation tasks. SegNet, although not the first encoder-decoder structure, successfully generalized the architecture, balancing memory (parameters) and accuracy. It utilized unpooling for feature map upsampling, reducing parameters and preserving high-frequency information. The concept of dilated convolutions20 introduced context modules that used dilated convolutions for multi-scale aggregation, achieving dense prediction results without increasing parameters. Deeplabv221 proposed the atrous spatial pyramid pooling (ASPP) technique, employing multiple parallel atrous convolution layers with different sampling rates for effective multi-scale processing. EMRN22 proposes a multi-resolution feature dimension-unifying module to handle features from images of varying resolutions. PSPNet23 enhances the ResNet24 structure using atrous convolutions and incorporates a pyramid pooling module to capture contextual information at different scales. Deeplabv325 improves ASPP by cascading multiple atrous convolution structures and introduces batch normalization after each parallel convolutional layer. In 2018, DeepLabv3+26 adopted DeepLabv3 as the encoder and added an effective decoder module. Li et al.27 further explored how to better utilize multi-scale information for semantic segmentation, improving and optimizing the ASPP technique on this basis; they also explore the use of an improved Xception backbone and depth-wise separable convolutions to enhance the model’s performance in semantic segmentation tasks.
Recently, attention mechanism methods28,29,30,31,32 have been introduced to semantic segmentation. Transformer and self-attention mechanisms33 were initially introduced in the field of natural language processing and later sparked widespread research interest in computer vision34,35,36,37,38. The pioneering work, Vision Transformer (ViT)39, utilizes multiple Transformer blocks to process non-overlapping image patches, establishing a convolution-free image classification model. PVT (Pyramid Vision Transformer)40 draws inspiration from the pyramid structure of convolutional neural networks. It gradually reduces feature resolution by segmenting the input image into blocks of different scales, and Transformer encoding blocks are then applied at each scale for feature extraction. This approach empowers PVT to handle features at various levels, resulting in remarkable performance across diverse image classification tasks. CFIL41 proposes a frequency-domain feature extraction module and feature interaction in the frequency domain to enhance salient features. MFC42 proposes a frequency-domain filtering module to achieve dense target feature enhancement. To avoid excessive attention computation, Swin Transformer43 employs window-based local attention to confine attention within local windows. To fully leverage the advantages of convolutional neural networks (CNNs) in local feature extraction and the capabilities of Transformers in global relationship modeling, several works44,45,46,47,48,49 combine the two methods, allowing the simultaneous capture of image details and overall context. This integration aims to enhance image analysis and understanding and has the potential to yield improved results across various computer vision tasks. In certain situations, different regions may possess varying degrees of importance. Deformable attention50 enables models to dynamically adjust their focus on different regions of an image based on object shapes and positions. This adaptive tuning allows the model to more accurately emphasize important object-related areas, thus enhancing its performance in image processing tasks. Recently, to reduce computational costs, researchers have introduced a coarse-to-fine-grained Vision Transformer (CF-ViT)51 to alleviate the computational burden while maintaining performance. Wu et al.52 proposed a cross-supervised learning paradigm, which provides a new idea for dealing with cloud detection in remote sensing images and also offers a useful reference for research in semantic segmentation. In addition, the SiMaLSTM-SNP method53 was proposed specifically for semantic relatedness analysis in natural language processing; in practical applications and tests it has demonstrated excellent performance, significantly surpassing many traditional baseline methods across numerous evaluation metrics and providing new ideas and powerful tools for research in this field. For the highly challenging problem of small target detection in remote sensing, a group target distribution detection method54 has been proposed that effectively overcomes the difficulties traditional detection methods face with complex remote sensing images, greatly improving detection accuracy and efficiency. Finally, work on light field semantic segmentation55 deeply analyzes the characteristics of light field images, effectively mining and utilizing the complex information they contain and significantly improving segmentation accuracy.
These methods capture patterns at different granularities by integrating high-level and low-level features and have achieved some success in the segmentation of normal-resolution images. However, when dealing with ultra-high-resolution images, cascading multi-scale features may significantly increase the feature dimensionality, and careful design is required to prevent this from negatively impacting computational efficiency and model performance. When processing high-resolution remote sensing images, these methods usually rely on downsampling operations to adapt the input to the model, which leads to the loss of small target information and makes it difficult to effectively capture the local detail features of the target objects. For example, against a complex background, small-sized targets such as buildings and vegetation may occupy only a few pixels after downsampling; the information they contain becomes extremely limited and is easily overlooked by the model, which degrades segmentation accuracy.
Dual-branch architecture
In visual tasks involving natural images56,57,58,59, remote sensing images11,13,60,61, and visible light images62,63,64,65,66, a substantial body of work has sought to address the subjectivity in balancing the trade-off between global context and local detail by learning to integrate multi-scale information. Specifically, these approaches learn representations from multiple parallel networks and then aggregate information across different scales before making the final predictions. Taking the remote sensing domain as an example, GLNet11 consists of a global branch and a local branch, tailored to address both globally downscaled images and locally cropped image blocks. GLNet achieves high-quality segmentation outcomes, effectively balancing precision and memory consumption. UHRSNet12 enhances and streamlines the local and global feature fusion approach of GLNet, enabling small blocks to gather information from surrounding ones, which effectively addresses the challenges posed by cropping and downsampling. HPGN67 proposes a novel pyramid graph network that is attached directly behind the backbone network to explore multi-scale spatial structural features. To fully harness the potential of the multi-branch architecture, MBNet13 introduces a Zoom module, which relinquishes the prior L2 normalization before feature concatenation and instead exploits learned attention to fuse distinct features, maximizing the advantages of multi-resolution processing. Existing dual-branch architectures have shown promising results by seamlessly integrating patches and downsampled images during training. However, these approaches have primarily focused on merging information from different branches without considering the importance of spatial location features. They treat all extracted features uniformly, resulting in direct interactions (such as concatenation or addition) without an adaptive mechanism.
Methods
We propose SAINet, a spatially adaptive interaction network for segmentation of high-resolution images, to better utilize both global and local information and to achieve effective interaction between local saliency regions and global information. We first give an overview of the network and then introduce its components, including the backbone network, the spatial refinement module (SRM), and the spatial interaction module (SIM).
Overview
The SAINet network, as shown in Fig. 2, consists of the following parts: (1) extraction of global information from downsampled images; (2) a simple yet effective spatial refinement module, which utilizes local contextual information for spatial position filtering to extract salient regions and their indices; (3) extraction of local detailed information; and (4) a spatial adaptive modulation mechanism that dynamically selects and allocates spatial position weights, achieving the interaction of local salient regions and global information. Both global and local feature extraction are performed using a ResNet50-based FPN (Feature Pyramid Network), with the parameters of the two backbone networks shared.
An overview of the proposed SAINet. SAINet first downsamples the input image and feeds it into the backbone network to extract global features. The spatial refinement module (SRM) extracts salient regions within the global features and obtains the salient regions index (SRI). The local features are extracted from the cropped patches. The spatial interaction module (SIM) performs the spatially adaptive interaction between the global features and the local features. The images used in this figure are sourced from the DeepGlobe dataset, and the dataset can be accessed at: https://www.kaggle.com/datasets/balraj98/deepglobe-land-cover-classification-dataset.
Specifically, the proposed approach first uses the backbone network to extract features from the input image X, generating the global feature map \(F_g\). It then uses the spatial refine module to obtain the saliency regions and their indices SRI. Subsequently, SAINet uses SRI to crop the original image, obtain the cropped image block \(X_c\), and apply the same backbone network to extract features from \(X_c\), resulting in the local feature map \(F_l\). To facilitate more effective interaction between the global feature \(F_g\) and the local feature \(F_l\), we introduce SIM, a novel attention mechanism that combines linear attention and softmax attention. Our network can be formulated as:
\[F_g = B_{RA}(R(X)),\]
where the matrix X represents the input image, \(R(\cdot )\) denotes the resize operation, and \(B_{RA}(\cdot )\) is the backbone network, a Feature Pyramid Network (FPN) with ResNet50. \(F_g\) is the global feature map containing the coarse extraction results.
\[SRI = SRM(F_g),\]
where SRI represents the saliency region index and \(SRM(\cdot )\) denotes the spatial refine module, with details given in the “Spatial refine module” section.
\[X_c = SRI(X), \qquad F_l = B_{RA}(X_c),\]
where \(SRI(\cdot )\) denotes the cropping operation with SRI, the matrix \(X_c\) represents the cropped image, and \(B_{RA}(\cdot )\) is the same backbone network as in formula 1. \(F_l\) denotes the local feature map with more detailed information.
\[F_i = SIM(F_g, F_l),\]
where \(F_i\) signifies the newly obtained features through interaction, and \(SIM(\cdot )\) stands for the spatial interaction module, with operational details in the “Spatial Interaction module” section.
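For clarity, the overall pipeline described by the formulations above can be summarized as the following PyTorch-style sketch. It is purely illustrative: backbone, srm, sim, and crop_with_sri are placeholders for the weight-shared ResNet50-FPN backbone, the spatial refinement module, the spatial interaction module, and the SRI-guided cropping step, and the sketch is not taken from the released implementation.

```python
import torch.nn.functional as F

def sainet_forward(x, backbone, srm, sim, crop_with_sri, down_size=(500, 500)):
    """Illustrative forward pass of SAINet; all module arguments are placeholders."""
    # Downsample the input image and extract the coarse global features F_g.
    x_down = F.interpolate(x, size=down_size, mode='bilinear', align_corners=False)
    f_g = backbone(x_down)

    # The spatial refinement module screens spatial positions and returns
    # the salient region index (SRI).
    sri = srm(f_g)

    # Crop the original full-resolution image with the SRI and re-extract
    # local features F_l with the same (weight-shared) backbone.
    x_c = crop_with_sri(x, sri)
    f_l = backbone(x_c)

    # The spatial interaction module adaptively fuses global and local features.
    f_i = sim(f_g, f_l)
    return f_i
```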
Spatial refine module
Not all the extracted global features from down-sampled images are inaccurate, meaning not every local area needs enhancement. Especially in high-resolution remote sensing images, some objects are often continuous and occupy a large proportion of the entire image. Therefore, we have introduced a spatial refinement module that utilizes local context information to screen spatial positions and extract salient regions.
As shown in Fig. 2, the global features \(F_g \in \mathbb {R}^{H \times W \times \text{Class}}\) are first passed through the spatial refinement module: a softmax operation is applied, and a \(1\times 1\) convolution is then employed to obtain the saliency level along the channel dimension, denoted as \(P_{i j} \in (0,1)\). Hence, the confidence level of the global image is noted as \(S_g \in \mathbb {R}^{H \times W \times 1}\), and Equation (2) presents the calculation formulas.
To further enhance the expressiveness of salient regions, a sliding window is used to calculate local region scores, resulting in the ScoreMap. The appropriate sliding window size s was set to \(3\times 3\), as determined by the ablation experiments reported later. The threshold for determining salient regions, denoted as T, is determined using the formula:
In this study, the values of t and \(\mu\) are set to 0.08 and 7, respectively; these values were determined experimentally. Areas of the ScoreMap with scores below the threshold T are considered salient regions (SR), along with their corresponding index (SRI), where
where s represents the sliding window size, the same as in Equation (5). Subsequently, the original image is cropped using the SRI to determine the local crop patches. The weight matrix of the salient region, \(W_S\), is learned by the spatial interaction module; for details see the “Spatial Interaction module” section.
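To make the screening procedure concrete, the following sketch shows one way the steps above could be implemented in PyTorch. It is a minimal illustration under our own assumptions: the 1×1 convolution layer is passed in as conv1x1, the ScoreMap is computed as an s×s sliding-window average of the confidence map, and the rule combining t and \(\mu\) is a placeholder standing in for the paper's exact formula for T.

```python
import torch
import torch.nn.functional as F

def spatial_refinement(f_g, conv1x1, s=3, t=0.08, mu=7):
    """Illustrative SRM sketch: screen spatial positions and return salient-region indices.

    f_g:     global features of shape (B, Class, H, W).
    conv1x1: a 1x1 convolution mapping the class scores to a 1-channel confidence map.
    s, t, mu follow the settings reported in the paper (3, 0.08 and 7).
    """
    p = torch.softmax(f_g, dim=1)           # per-class saliency levels P_ij in (0, 1)
    s_g = conv1x1(p)                        # confidence map S_g, shape (B, 1, H, W)

    # ScoreMap: local region scores from an s x s sliding window (size-preserving).
    score_map = F.avg_pool2d(s_g, kernel_size=s, stride=1, padding=s // 2)

    # Threshold T derived from t and mu; this combination is only a placeholder
    # for the paper's formula for T.
    threshold = t * mu

    # Positions whose score falls below T are treated as salient regions (SR);
    # their coordinates form the salient region index (SRI).
    sri = (score_map < threshold).nonzero(as_tuple=False)
    return sri
```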
Spatial interaction module
Traditional feature fusion merges the global and local features in their entirety, and the fusion method is merely a simple addition or concatenation. This approach is not only inefficient but also results in insufficient fusion. The large, and sometimes even global, receptive field of the Transformer model gives it superior representational abilities compared to its CNN counterparts, especially in terms of understanding global context. Therefore, this paper presents a Transformer-based fusion method. However, simply enlarging the receptive field can also lead to issues. On the one hand, using dense attention as in ViT can result in excessive memory and computational costs, and the features may be affected by irrelevant parts beyond the region of interest. On the other hand, the sparse attention adopted in PVT or Swin Transformer is data-agnostic and may limit the ability to model distant relationships. Therefore, this paper proposes a novel interactive attention mechanism in which the interactive positions are selected in a data-dependent manner. This flexible attention mechanism can focus on relevant areas and integrate more informative features.
As shown in Fig. 2, our spatial interaction module (SIM) is an innovative form of attention that integrates the powerful Softmax attention with the efficient linear attention, represented as a quadruple (Q, S, K, V). Given two inputs of N tokens, \(F_g,F_l \in \mathbb {R}^{N \times C}\), obtained from Equations (1) and (3) respectively, self-attention can be formulated as follows in each head:
where \(W_{Q / K / V} \in \mathbb {R}^{C \times d}\) denote the projection matrices, C and d are the channel dimensions of the module and of each head, and \(\operatorname {Sim}(\cdot )\) represents the similarity function. When \(\operatorname {Sim}(Q, K)=\exp \left( Q K^T / \sqrt{d}\right)\) is used in Equation (8), it becomes Softmax attention33, which can be abbreviated as:
where \(W_{Q / K / V} \in \mathbb {R}^{C \times d}\) denote the query, key, and value projection matrices and \(\sigma (\cdot )\) represents the Softmax function. When the similarity is measured as \(\operatorname {Sim}(Q, K)=\phi (Q) \phi (K)^T\) in Equation (8), it becomes linear attention68, which can be abbreviated as:
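For reference, the two variants are commonly abbreviated in the literature33,68 as
\[O_{\mathrm{Softmax}} = \sigma\!\left(\frac{QK^{T}}{\sqrt{d}}\right)V, \qquad O_{\mathrm{Linear}} = \phi(Q)\left(\phi(K)^{T}V\right).\]
The decisive practical difference is the order of multiplication: the linear form allows \(\phi(K)^{T}V\) to be computed first, reducing the complexity with respect to the token number N from \(O(N^{2})\) to \(O(N)\), at the cost of the reduced expressiveness discussed below.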
Softmax and linear attention suffer from either excessive computational complexity or insufficient model expressiveness. This paper therefore introduces a set of additional saliency region tokens \(S \in \mathbb {R}^{d \times C}\) into the traditional attention module. As shown in Fig. 2, our interaction module consists of two Softmax attention operations. We initially treat the tokens S as queries and perform attention calculations between S, K, and V to aggregate the local salient features \(O^{V_S}\):
The saliency region tokens S first act as intermediaries for the query tokens Q, aggregating information from K and V and then broadcasting it back to Q. The number of saliency region tokens is determined by the SRM and is far smaller than the number of query tokens. The saliency region attention is therefore notably more efficient than the widely used Softmax attention, while still preserving the global context modeling capability.
Subsequently, we use S as keys and \(O^{V_S}\) as values in a second attention calculation with the query matrix Q, letting the global information interact with the local salient information to obtain the final output \(O_S\):
Then our attention used in the interaction module can be written as:
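The two-step attention described above can be sketched in PyTorch roughly as follows. This is our own simplified, single-head illustration: the assignment of the global tokens to the queries and of the local tokens to the keys and values, as well as the linear projection layers w_q, w_k, and w_v, are assumptions made for the sake of the example and do not reproduce the released implementation.

```python
import torch

def spatial_interaction(f_g, f_l, s_tokens, w_q, w_k, w_v, d):
    """Illustrative single-head SIM sketch.

    f_g:      global tokens, shape (B, N, C)  -> assumed to provide the queries Q
    f_l:      local tokens,  shape (B, N, C)  -> assumed to provide the keys K and values V
    s_tokens: saliency region tokens S, shape (B, n_s, C), with n_s much smaller than N
    w_q/w_k/w_v: linear projections from C to the head dimension d.
    """
    q = w_q(f_g)                 # queries from the global features
    k = w_k(f_l)                 # keys from the local features
    v = w_v(f_l)                 # values from the local features
    s = w_q(s_tokens)            # saliency tokens projected into the same space

    # Step 1: S acts as queries over (K, V), aggregating local salient features O^{V_S}.
    attn_s = torch.softmax(s @ k.transpose(-2, -1) / d ** 0.5, dim=-1)   # (B, n_s, N)
    o_vs = attn_s @ v                                                     # (B, n_s, d)

    # Step 2: Q attends with S as keys and O^{V_S} as values, broadcasting the
    # aggregated salient information back to every query token.
    attn_q = torch.softmax(q @ s.transpose(-2, -1) / d ** 0.5, dim=-1)   # (B, N, n_s)
    return attn_q @ o_vs                                                  # (B, N, d)
```

Because both attention maps involve only the small number of saliency tokens, the cost scales linearly with the token number N rather than quadratically, which is what keeps the interaction affordable for high-resolution inputs.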
Loss function
During the first round of feature extraction, we train on downsampled images to capture global features, so the labels are also downsampled for loss computation. Cross-entropy loss employs an inter-class competition mechanism that is particularly effective at learning inter-class information: it prioritizes the predicted probability of the correct label while neglecting differences among the other, incorrect labels, and is therefore better suited for extracting overall image-level features. Hence, in the first training phase, we employ cross-entropy loss (CELoss) to measure the difference between the predictions of the backbone network and the ground truth.
where p is the prediction result, and y is the ground truth data.
During the second phase of training, feature extraction is performed on the image patches selected by the spatial refinement module (SRM). Consequently, the context does not extend beyond the small patch, and the surrounding information is significantly reduced. Given the diversity of categories, the network can more easily misclassify small categories as large ones in the absence of neighborhood dependency information. Focal Loss addresses class imbalance and differences in classification difficulty, so we use the Focal Loss69 with \(\tau = 6\) as the optimization target in the second training phase.
where \(p_t\) represents the probability that the second backbone network predicts for a certain category; \(\alpha _t\) is used to balance the numbers of positive and negative samples, with a smaller \(\alpha _t\) assigned to categories with more samples and a larger \(\alpha _t\) to those with fewer samples; and \(\tau\) adjusts the imbalance between hard-to-separate and easy-to-separate samples. In this paper, we take \(\tau =6\) to reduce the loss of easy-to-separate samples with a power function. In line with other validated networks on the DeepGlobe dataset70, we set \(\alpha\) to 0.25.
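As a concrete reference for the two-phase objective, the sketch below uses the standard cross-entropy and focal-loss formulations with the settings stated above (\(\tau = 6\), \(\alpha = 0.25\)); it is an illustrative reimplementation, not the released training code.

```python
import torch
import torch.nn.functional as F

def phase1_loss(logits, target):
    """Phase 1: cross-entropy between downsampled predictions and downsampled labels."""
    return F.cross_entropy(logits, target)

def phase2_focal_loss(logits, target, alpha=0.25, tau=6.0):
    """Phase 2: focal loss on the refined patches (tau plays the role often called gamma)."""
    ce = F.cross_entropy(logits, target, reduction='none')   # per-pixel -log p_t
    p_t = torch.exp(-ce)                                      # predicted probability of the true class
    return (alpha * (1.0 - p_t) ** tau * ce).mean()           # down-weight easy examples
```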
Experiment
As shown in Table 1, we conduct experiments on three datasets to verify the effectiveness of our proposed method: the DeepGlobe dataset, which primarily focuses on rural areas, and the Vaihingen and Potsdam datasets, which emphasize urban regions.
Datasets
The DeepGlobe Land Cover Classification dataset serves as a high-resolution benchmark, consisting of 1146 annotated sub-meter satellite images (\(2448\times 2448\) pixels) focused on rural regions. It provides pixel-wise masks for seven classes: urban, agriculture, rangeland, forest, water, barren, and unknown. The Vaihingen dataset, widely used in the 2D Semantic Labeling Contest, focuses on urban semantic segmentation. It comprises 33 high-resolution images (\(2494\times 2064\) pixels, 9 cm resolution) with three channels: red, green, and near-infrared (NIR), along with a digital surface model (DSM). We preprocessed the dataset by cropping images into \(512\times 512\) patches, resulting in 344 images for training and 398 images for validation across six classes: impervious surfaces, buildings, low vegetation, trees, cars, and background. The Potsdam dataset, also used for urban semantic segmentation, includes 38 high-resolution images (\(6000\times 6000\) pixels, 5 cm resolution) with three bands and DSM. Out of these, 24 images are used for training and validation, while 14 are reserved for testing. The dataset includes the same six classes as Vaihingen. After cropping, 3,456 patches are used for training and 2,016 for validation.
Implementation details
To validate the performance of the proposed method, we compared SAINet with widely used single-branch networks and GLNet-based segmentation networks on the DeepGlobe, Vaihingen, and Potsdam datasets. In all the experiments, none of the networks utilized pre-trained weights. To demonstrate that our fusion method is more effective than GLNet’s fusion approach, both the downsampled global image and the cropped local patches share the same size as GLNet, \(500\times 500\) pixels. Neighboring patches have a 50-pixel overlap to avoid boundary vanishing for all the convolutional layers.
The experiments were implemented under the PyTorch framework, with Python version 3.8, using an NVIDIA GeForce RTX 2080 Ti 11-GB GPU. For all comparison methods, the default parameters are used. For SAINet, we use the Adam optimizer \((\beta _1 = 0.9, \beta _2 = 0.999)\) and set the momentum to 0.9 and the weight decay to \(1 \times 10^{-4}\) for training the global branch and \(2 \times 10^{-5}\) for the local branch. The number of training epochs was set to 150, and a mini-batch size of 6 was used for all training. In order to meticulously track the performance of the model during training, a validation procedure is performed after every epoch to evaluate the model’s generalization capability on unseen data. The ultimate selection of the model for deployment or further analysis is based on the checkpoint from the epoch that achieved the peak validation accuracy. This approach ensures that the chosen model represents the best-performing iteration throughout the training process. We also used an early stopping mechanism, terminating training when the performance on the test dataset did not improve within 10 epochs.
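The optimizer and early-stopping settings above could be reproduced roughly as follows; the two branch modules and the train/validate callables are placeholders, and the snippet is a simplified illustration rather than the actual training script.

```python
import torch

def build_optimizers(global_branch, local_branch):
    """Adam optimizers with the per-branch settings reported above."""
    g_opt = torch.optim.Adam(global_branch.parameters(), betas=(0.9, 0.999), weight_decay=1e-4)
    l_opt = torch.optim.Adam(local_branch.parameters(), betas=(0.9, 0.999), weight_decay=2e-5)
    return g_opt, l_opt

def train_with_early_stopping(train_one_epoch, validate, max_epochs=150, patience=10):
    """Keep the best-validation score and stop after `patience` epochs without improvement."""
    best_score, stale = 0.0, 0
    for epoch in range(max_epochs):
        train_one_epoch(epoch)
        score = validate(epoch)              # e.g. validation mIoU for this epoch
        if score > best_score:
            best_score, stale = score, 0     # new best: reset the counter (checkpoint here)
        else:
            stale += 1
            if stale >= patience:
                break
    return best_score
```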
Results
To demonstrate the superiority of our proposed algorithm, we compared it with traditional semantic segmentation algorithms and the dual-branch GLNet network. Our network, as well as the single-branch networks, can only produce prediction results for patches. Therefore, we combine these patches to obtain the overall prediction result for the entire image. GLNet offers two methods to obtain predictions for the entire image, and we have chosen the combination method, which yields better results.
Results on DeepGlobe dataset
To verify the performance of our proposed SAINet, we compare the model with other CNN-based and GLNet-based semantic segmentation networks on the DeepGlobe dataset. GLNet reported only the mIoU, so we remain consistent with it here. The specific results are shown in Table 2. The results show that dual-branch GLNet-based methods outperform single-branch CNN-based methods on the DeepGlobe dataset. Here, GLNet extracts global features from downsampled images and local features from patches, and then combines the features from both branches to enhance segmentation accuracy. Our proposed SAINet aims to capture more informative features, leading to improved segmentation accuracy, and it shows a significant performance improvement compared to other networks. Compared to GLNet11 and MBNet13, the mIoU of our proposed SAINet is improved by 6.2% and 5.2%, respectively. The most significant improvement is observed in the “Barren” class, with a relative increase of 22.9% compared to GLNet. Compared to DeepLabv3+26, the top-performing single-branch network, the mIoU of our proposed SAINet is improved by 20.7%. Hence, our network achieves optimal results in all classes and in the mIoU metric, indicating that our model outperforms GLNet, UHRSNet12, and MBNet. While UHRSNet and MBNet show improvements relative to GLNet, they integrate all global and local information in a simple manner for fusion. In contrast, our proposed SAINet utilizes the SRM module to identify significant regions and then dynamically allocates weights through the SIM module, achieving a better fusion of global and local information.
Results on Vaihingen dataset
To verify that the proposed model is equally advantageous on other datasets, we conducted comparative experiments on the Vaihingen dataset. The results are shown in Table 3, where our proposed SAINet yields a mean F1 of 90.7%, an OA of 91.9%, and an mIoU of 82.5%, and is quantitatively superior to the other methods. The experimental results show that our proposed SAINet maintains stable performance on the Vaihingen dataset, which indicates that it has good practical generalization ability. Specifically, compared with the second-best method, our proposed SAINet improves mean F1 by 0.9% and OA by 1.3%, with a corresponding gain in mIoU. The “Building” class gains the highest classification accuracy of 96.5%. Especially in high-resolution remote sensing images, the Car, as a relatively small object, is difficult to identify in the Vaihingen dataset. Even so, our proposed SAINet reaches an F1 score of 89.4%, significantly exceeding the second-best method. This indicates that the SRM module in our proposed SAINet can effectively focus on small target areas and enable the network to adeptly integrate small target information with its global context. However, our proposed SAINet exhibits some shortcomings in terms of F1 scores for certain categories, such as Impervious Surface and Tree, which are respectively lower than BGFNet by 0.4% and 2.5%. This could be due to an excessive focus on small targets, leading to a decrease in attention to other categories, which affects the model’s ability to capture contextual semantic information, resulting in lower F1 scores for certain categories and impacting the overall stability of the model.
Results on Potsdam dataset
From Table 4, it can be seen that the SAINet proposed in this paper also outperforms previous segmentation methods on the Potsdam dataset. Our model surpasses existing contextual information fusion techniques, such as DeepLabV3+ and PSPNet: its mean F1, OA, and mIoU are 3.2%, 3.3%, and 14.5% higher than those of the mentioned aggregation methods. Meanwhile, compared with BGFNet, our proposed SAINet obtains increments of 0.4% and 0.3% in mean F1 and OA, respectively. Notably, our method scores 96.2% on F1 in the “Car” category, outperforming the other networks by more than 0.65% on the Potsdam dataset. It is evident that our proposed SAINet enhances attention to small objects through the SRM module, while the SIM module ensures that the entire network autonomously captures contextual information from multiple perspectives. Compared to the DeepGlobe dataset, our algorithm exhibits a slightly smaller improvement on the Vaihingen and Potsdam datasets. This is primarily because the Vaihingen and Potsdam datasets are already mature and well-established; unlike DeepGlobe, they primarily focus on urban areas, characterized by well-defined boundaries and intricate details.
Ablation studies and analysis
In the previous subsection, we have shown the superiority of our proposed SAINet by comparing it to state-of-the-art methods. In what follows, we comprehensively analyze the intrinsic factors that lead to our proposed SAINet’s superiority on the datasets, including: (1) the role of SRM and SIM; (2) the effectiveness of different window sizes in SRM; (3) comparison with different attention modules; (4) comparative experiments with multiple backbones; and (5) inference speed.
The role of SRM and SIM
It can be observed from Table 5 that the two modules significantly influence the performance of the algorithm, and their effects on the different datasets are consistent: using SIM alone is better than using SRM alone, while the best performance is achieved when both are used simultaneously. When SRM is used alone, the mIoU of our algorithm on the three datasets is 65.7%, 75.6%, and 70.3%, respectively, which is 9.9%, 3.1%, and 9.0% lower than GLNet. In this setting the algorithm adopts the simplest addition operation to fuse the global and local features, leading to less improvement than GLNet, which performs more complex interactions between all global and local features; however, there is still an improvement in mIoU compared to using a single-branch network. When SIM is used alone, the mIoU of our algorithm on the three datasets is 72.4%, 78.9%, and 79.6%, respectively, which is 0.8%, 0.2%, and 0.3% higher than GLNet. It interacts with all features, resulting in some improvement, although not a significant one, since not all features require interaction and excessive interaction may even be detrimental. Only when SRM and SIM are used together does the network achieve its best performance: the mIoU of our algorithm on the three datasets is 77.8%, 82.5%, and 88.3%, respectively, which is 6.2%, 3.8%, and 9.0% higher than GLNet. As can be observed from Fig. 3, the SRM-only variant exhibits “jiggling” artifacts and imprecise boundaries, which can be attributed to the loss of detail resulting from downsampling. In contrast, the SIM-only variant shows extensive misclassification across large areas. It is noteworthy that “agriculture” and “barren” land regions frequently exhibit visual similarities. Consequently, patch-based training, lacking spatial context and neighborhood dependency information, makes it difficult to accurately distinguish between the “agriculture” and “barren” categories based solely on local patches.
Comparison of the combined SRM and SIM modules in SAINet versus SRM or SIM alone. The images used in this figure are sourced from the DeepGlobe dataset, and the dataset can be accessed at: https://www.kaggle.com/datasets/balraj98/deepglobe-land-cover-classification-dataset.
The effectiveness of variant window size of SRM
To explore which window size of SRM gives the best performance boost, we set the window size to \(3\times 3,5\times 5,7\times 7\) respectively, and select mIoU as the evaluation metric. As shown in Fig. 4, when the window size is \(3 \times 3\), the segmentation results are the best. The mIoU on the DeepGlobe dataset reaches as high as 77.8%, significantly exceeding (\(5 \times 5\)) and (\(7 \times 7\)) by 1.1% and 20.6%, respectively. The mIoU on the Vaihingen dataset reaches as high as 82.5%, significantly exceeding (\(5 \times 5\)) and (\(7 \times 7\)) by 2.4% and 28.3%, respectively. The mIoU on the Potsdam dataset reaches as high as 88.3%, significantly exceeding (\(5 \times 5\)) and (\(7 \times 7\)) by 9.9% and 26.0%, respectively. With a window size of \(5 \times 5\), the segmentation results are slightly worse, and with a window size of \(7 \times 7\), the segmentation results are inferior. The window size of SRM is an important parameter when filtering spatial locations and extracting salient regions. According to the results, segmentation is best when the window size is \(3 \times 3\). This is because global features have already been extracted by the first backbone network; a \(3 \times 3\) window is therefore already quite large, and increasing the window size further would only introduce unnecessary interference or noise without improving the results. Hence, choosing a \(3 \times 3\) window size is wise in this scenario, as it provides the best segmentation results. As shown in Table 6, on the DeepGlobe dataset, we obtained the optimal parameter values of t = 0.08 and \(\mu\) = 7 through experiments with different combinations of t and \(\mu\). We can see that the closer t is to 0.08 and the closer \(\mu\) is to 7, the higher the evaluation metrics become.
Compared with different attention modules
To further demonstrate the capability of our proposed spatial interaction module in understanding spatial relationships in data, we replaced the spatial interaction module with other attention mechanisms and calculated the mIoU on the three datasets. From Table 7, it can be seen that replacing our spatial interaction module with the convolutional block attention module results in a better mIoU on all three datasets than replacing it with the channel attention module or the spatial attention module. The best results obtained with the channel attention module and the spatial attention module are close, with differences of 0.4%, 0.3%, and 0.1% on the DeepGlobe, Vaihingen, and Potsdam datasets, respectively, indicating that the two modules focus on important features to a similar extent. The convolutional block attention module, which concatenates the channel attention module and the spatial attention module, further enhances the capability to focus on local information. The results of our proposed spatial interaction module are respectively 5.5%, 3.4%, and 8.6% better than those of the convolutional block attention module on the three datasets. This is because our spatial interaction module is more compatible with the proposed spatial refine module and better achieves the interaction between global and local information.
Multiple backbone comparative experiments
As can be seen from Table 8, which details the performance and resource requirements of different backbone networks on the Potsdam dataset, ResNet50 strikes an effective balance between accuracy and parameter count. It is especially appropriate for environments where computational resources are constrained. Specifically, ResNet50 achieves an mIoU of 88.30% while maintaining a relatively modest memory cost of 1.87 GB, ensuring efficient utilization of hardware capabilities without significant compromise on performance.
Inference speed
We also compared the inference speed of the networks. For our proposed SAINet and GLNet, the prediction time is for the entire original image. However, for single-branch networks like DeepLabv3+, ICNet72, and PSPNet23, the prediction time is the sum of the prediction times for all the smaller patches obtained by cropping the original image. As shown in Fig. 5, our method and GLNet demonstrate superior inference speed compared to single-branch networks. Thanks to the design of the spatial refine module and the spatial interaction module, our network achieves faster inference for the entire original image while maintaining accuracy, outperforming GLNet by 105.24 seconds. While DeepLabv3+ achieves high accuracy, its inference efficiency is compromised. Although ICNet improves inference efficiency, it struggles to maintain high accuracy. PSPNet, on the other hand, increases accuracy but at the cost of significantly longer inference time. As presented in Tables 2, 3 and 4, although the memory consumption of SAINet is 0.47 GB higher than the baseline, it is lower than that of other algorithms. A significant improvement in evaluation metrics has been achieved within an acceptable increase in memory. This indicates that our algorithm has advantages in memory utilization efficiency and can obtain a higher performance return at a relatively small memory cost.
Visual results and analysis
To further evaluate the performance of our algorithm on different datasets and highlight its advantages, we selected five representative image samples from the DeepGlobe, Vaihingen, and Potsdam datasets and analyzed and compared the prediction results using visualizations.
Figure 6 illustrates the comparative performance of DeepLabv3+, PSPNet, Mask2Former, and our proposed method on the DeepGlobe dataset. As can be observed, our method not only achieves precise boundary detection for diverse land cover types but also effectively eliminates the adverse effects of noise and shadows. It notably prevents the occurrence of internal holes and irregular contours within segmented regions, resulting in the highest mIoU score. The segmentation outputs are characterized by their accurate boundaries and correct classifications, underscoring the superior comprehensive performance of our method over the three alternatives. These findings affirm the distinct advantages of our proposed approach on the DeepGlobe dataset, particularly in handling complex segmentation tasks.
Visualization results on the DeepGlobe dataset. The images used in this figure are sourced from the DeepGlobe dataset, and the dataset can be accessed at: https://www.kaggle.com/datasets/balraj98/deepglobe-land-cover-classification-dataset.
From the visualization results in Figs. 7 and 8, it can be seen that our algorithm shows a remarkable improvement on both the Vaihingen and Potsdam datasets. Compared to the prediction results of other algorithms, our algorithm performs better in predicting buildings, clutter, and low vegetation. As can be seen from Fig. 7, our method outperforms other methods in both small target extraction and detail preservation. In the first and second rows of Fig. 7, although PSPNet can also extract small targets relatively well, the edges of the extracted buildings are significantly less clear than those obtained with our algorithm. In the first row of Fig. 8, PSPNet misidentifies a tree as clutter. Moreover, in the fifth row of Fig. 8, both PSPNet and HRNet misclassify clutter as buildings, with HRNet showing the most severe misclassification. In addition, the small red clutter within the black dashed box is incorrectly classified as a car by DeepLabv3+. However, GLNet and our algorithm can accurately segment it, highlighting the advantage of the dual-branch network.
Visualization results on the Vaihingen dataset. The images used in this figure are sourced from the Vaihingen dataset, and the dataset can be accessed at: https://www.isprs.org/education/benchmarks/UrbanSemLab/Default.aspx.
Visualization results on the Potsdam dataset. The images used in this figure are sourced from the Potsdam dataset, and the dataset can be accessed at: https://www.isprs.org/education/benchmarks/UrbanSemLab/Default.aspx.
The global branch in the dual-branch network takes the entire original image as input, enabling correct segmentation of clutter. It can be seen that our algorithm overall performs better, especially in building segmentation, where it produces more complete results with more accurate edges. This indicates that our algorithm can accurately separate buildings from other areas and produce clearer outlines. Additionally, our algorithm exhibits greater accuracy in segmenting smaller objects. Our algorithm can effectively identify and segment smaller targets, leading to more precise segmentation results.
Discussion
Compared with traditional single-branch semantic segmentation networks (such as ICNet72, FCN15, PSPNet23, SegNet19, DeepLabv3+26, etc.), SAINet performs better on multiple indicators. Traditional single-branch networks usually lose the information of small targets due to downsampling operations when processing high-resolution remote sensing images, and they find it difficult to balance the extraction of global and local features. SAINet, through its dual-branch structure and innovative modules, effectively avoids these problems. For example, on the DeepGlobe dataset, the mIoU of DeepLabv3+ is 57.1%, while that of SAINet reaches 77.8%, a significant improvement that demonstrates SAINet’s stronger ability to capture detailed features in complex scenes. Compared with some dual-branch networks (such as GLNet11, UHRSNet12, MBNet13, etc.), SAINet also has advantages. Although networks like GLNet use dual branches to extract global and local features respectively, when fusing features they often simply add or concatenate them, without fully considering the importance of spatial position features. SAINet, on the other hand, screens salient regions through the SRM module and then dynamically allocates weights using the SIM module to achieve more accurate feature fusion. On the DeepGlobe dataset, the mIoU of SAINet is 6.2% higher than that of GLNet and 5.2% higher than that of MBNet. Similar performance advantages can also be seen on the Vaihingen and Potsdam datasets, indicating that the feature fusion method of SAINet is more effective and can better adapt to the complex features of high-resolution remote sensing images. On the Vaihingen dataset, the F1 scores of SAINet for the “Impervious Surface” and “Tree” categories are 92.5% and 89.3%, respectively. Although they are better than those of GLNet, they are lower than those of BGFNet79. This is because the edges of these two categories are irregular and their areas are large, and the salient-region screening in SAINet tends to focus more on small targets, resulting in insufficient feature extraction for these large-area, complex-edged categories. This phenomenon may not be fully explored in recent literature, and our research provides a new perspective and data support for understanding and solving category-specific performance differences in high-resolution remote sensing image segmentation.
Conclusions
Semantic segmentation of high-resolution remote sensing imagery is a critical component in decision-making and analysis across a multitude of sectors, including water management, agriculture, military operations, and environmental protection. However, current methodologies often fall short due to their lack of spatial location feature screening, leading to a less accurate interpretation of the imagery. To address these shortcomings, we have introduced a novel approach, the Spatially Adaptive Interaction Network (SAINet). SAINet is designed to dynamically interact with features in remote sensing semantic segmentation, focusing on spatial relevance. It incorporates a spatial refinement module that leverages local context information to filter spatial locations and extract prominent regions, thereby improving the quality of feature representation. Furthermore, our innovative spatial interaction module employs a spatial adaptive modulation mechanism to dynamically select and allocate spatial position weights. This mechanism fosters effective interaction between local salient areas and global information, significantly enhancing the network’s segmentation performance. The adaptability of SAINet allows it to capture more informative features, leading to a marked improvement in segmentation accuracy. Our experiments on the widely recognized public DeepGlobe, Vaihingen, and Potsdam datasets have validated the effectiveness and capability of our proposed approach. In the future, we aim to boost segmentation accuracy, especially for “Impervious Surface” and “Tree” in the Vaihingen dataset. This can be achieved by optimizing the SRM and SIM modules to better capture features of objects with different sizes and complex shapes; fine-tuning these modules can enhance the network’s ability to distinguish different land-cover types, which is crucial for accurate segmentation. We also plan to conduct comparative experiments on high-resolution images in other scenarios, such as industrial or coastal areas, to verify SAINet’s generalization ability in diverse environments, since different scenes have unique features and testing SAINet on them can offer a better understanding of its adaptability. Finally, since our network needs two-stage training and may reduce GPU utilization, we will improve the model algorithm to lower computational complexity and will explore the relationship between multi-scale feature aggregation and global context information aggregation to enhance the model’s performance. Reducing complexity improves model efficiency and makes it more applicable in resource-constrained settings.
Data availability
The DeepGlobe dataset used in this research was obtained from the Land Cover Classification Track in the DeepGlobe Challenge. The persistent web link to the dataset is https://www.kaggle.com/datasets/balraj98/deepglobe-land-cover-classification-dataset. The Potsdam and Vaihingen datasets used in this research were obtained from the ISPRS Test Project on Urban Classification, 3D Building Reconstruction and Semantic Labeling. The persistent web link to the datasets is https://www.isprs.org/education/benchmarks/UrbanSemLab/Default.aspx.
Code availability
The custom code and algorithms used to generate the results supporting the conclusions of this study are available in a public GitHub repository at https://github.com/hehuan163/AINet.git. There are no access restrictions; anyone can freely clone, download, and use the code. When using the code, please comply with the open-source license and statements attached to the project.
References
Shen, X., Weng, L., Xia, M. & Lin, H. Multi-scale feature aggregation network for semantic segmentation of land cover. Remote Sens. 14, 6156 (2022).
Chang, R., Hou, D., Chen, Z. & Chen, L. Automatic extraction of urban impervious surface based on sah-unet. Remote Sens. 15, 1042 (2023).
Khan, S. D. & Basalamah, S. Multi-scale and context-aware framework for flood segmentation in post-disaster high resolution aerial images. Remote Sens. 15, 2208 (2023).
Cardama, F. J., Heras, D. B. & Argüello, F. Consensus techniques for unsupervised binary change detection using multi-scale segmentation detectors for land cover vegetation images. Remote Sens. 15, 2889 (2023).
Li, Y., Yan, E., Jiang, J., Cao, D. & Mo, D. Investigating the identification and spatial distribution characteristics of camellia oleifera plantations using high-resolution imagery. Remote Sens. 15, 5218 (2023).
Khan, S. D., Alarabi, L. & Basalamah, S. Deep hybrid network for land cover semantic segmentation in high-spatial resolution satellite images. Information 12, 230 (2021).
Paisitkriangkrai, S., Sherrah, J., Janney, P. & van den Hengel, A. Effective semantic pixel labelling with convolutional networks and conditional random fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops 36–43 (2015).
Hu, J., Huang, Z., Shen, F., He, D. & Xian, Q. A bag of tricks for fine-grained roof extraction. In IGARSS 2023—2023 IEEE International Geoscience and Remote Sensing Symposium (IEEE, 2023).
Fu, G., Liu, C., Zhou, R., Sun, T. & Zhang, Q. Classification for high resolution remote sensing imagery using a fully convolutional network. Remote Sens. 9, 498 (2017).
Zhu, Q., Zhong, Y., Liu, Y., Zhang, L. & Li, D. A deep-local-global feature fusion framework for high spatial resolution imagery scene classification. Remote Sens. 10, 568 (2018).
Chen, W., Jiang, Z., Wang, Z., Cui, K. & Qian, X. Collaborative global-local networks for memory-efficient segmentation of ultra-high resolution images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 8924–8933 (2019).
Shan, L. et al. Uhrsnet: A semantic segmentation network specifically for ultra-high-resolution images. In 2020 25th International Conference on Pattern Recognition (ICPR) 1460–1466 (IEEE, 2021).
Shan, L. & Wang, W. Mbnet: A multi-resolution branch network for semantic segmentation of ultra-high resolution images. In ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2589–2593 (IEEE, 2022).
Du, X., He, S., Yang, H. & Wang, C. Multi-field context fusion network for semantic segmentation of high-spatial-resolution remote sensing images. Remote Sens. 14, 5830 (2022).
Long, J., Shelhamer, E. & Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 3431–3440 (2015).
Hao, S., Zhou, Y. & Guo, Y. A brief survey on semantic segmentation with deep learning. Neurocomputing 406, 302–321 (2020).
Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K. & Yuille, A. L. Semantic image segmentation with deep convolutional nets and fully connected crfs. arXiv preprint arXiv:1412.7062 (2014).
Ronneberger, O., Fischer, P. & Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5–9, 2015, Proceedings, Part III 18 234–241 (Springer, 2015).
Badrinarayanan, V., Kendall, A. & Cipolla, R. Segnet: A deep convolutional encoder–decoder architecture for image segmentation. arXiv preprint arXiv:1511.00561 (2015).
Yu, F. & Koltun, V. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122 (2015).
Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K. & Yuille, A. L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 40, 834–848 (2017).
Shen, F. et al. An efficient multiresolution network for vehicle reidentification. IEEE Internet Things J. 9, 9049–9059 (2021).
Zhao, H., Shi, J., Qi, X., Wang, X. & Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2881–2890 (2017).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016).
Chen, L.-C., Papandreou, G., Schroff, F. & Adam, H. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587 (2017).
Chen, L.-C., Zhu, Y., Papandreou, G., Schroff, F. & Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV) 801–818 (2018).
Li, L., Zhou, T., Wang, W., Li, J. & Yang, Y. Deep hierarchical semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 1246–1257 (2022).
Fu, J. et al. Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 3146–3154 (2019).
Shen, F., Wei, M. & Ren, J. Hsgnet: Object re-identification with hierarchical similarity graph network. arXiv preprint arXiv:2211.05486 (2022).
Liu, C. et al. Auto-deeplab: Hierarchical neural architecture search for semantic image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 82–92 (2019).
Shen, F. et al. Hsgm: A hierarchical similarity graph module for object re-identification. In 2022 IEEE International Conference on Multimedia and Expo (ICME) 1–6 (IEEE, 2022).
Li, M., Wei, M., He, X. & Shen, F. Enhancing part features via contrastive attention module for vehicle re-identification. In 2022 IEEE International Conference on Image Processing (ICIP) 1816–1820 (IEEE, 2022).
Vaswani, A. et al. Attention is all you need. Adv. Neural Inf. Process. Syst. 30, 66 (2017).
Xu, R., Shen, F., Wu, H., Zhu, J. & Zeng, H. Dual modal meta metric learning for attribute-image person re-identification. In 2021 IEEE International Conference on Networking, Sensing and Control (ICNSC) vol. 1 1–6 (IEEE, 2021).
Shen, F., Xie, Y., Zhu, J., Zhu, X. & Zeng, H. Git: Graph interactive transformer for vehicle re-identification. IEEE Trans. Image Process. 6, 66 (2023).
Xie, E. et al. Segformer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural. Inf. Process. Syst. 34, 12077–12090 (2021).
Fu, X., Shen, F., Du, X. & Li, Z. Bag of tricks for ‘vision meet alage’ object detection challenge. In 2022 6th International Conference on Universal Village (UV) 1–4 (IEEE, 2022).
Cheng, B., Misra, I., Schwing, A. G., Kirillov, A. & Girdhar, R. Masked-attention mask transformer for universal image segmentation. arXiv e-prints (2021).
Dosovitskiy, A. et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020).
Wang, W. et al. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision 568–578 (2021).
Weng, W., Ling, W., Lin, F., Ren, J. & Shen, F. A novel cross frequency-domain interaction learning for aerial oriented object detection. In Chinese Conference on Pattern Recognition and Computer Vision (PRCV) (Springer, 2023).
Qiao, C. et al. A novel multi-frequency coordinated module for sar ship detection. In 2022 IEEE 34th International Conference on Tools with Artificial Intelligence (ICTAI) 804–811 (IEEE, 2022).
Liu, Z. et al. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision 10012–10022 (2021).
Wang, Z. et al. Uformer: A general u-shaped transformer for image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 17683–17693 (2022).
Yuan, L. et al. Tokens-to-token vit: Training vision transformers from scratch on imagenet. In Proceedings of the IEEE/CVF International Conference on Computer Vision 558–567 (2021).
Shen, F., Du, X., Zhang, L. & Tang, J. Triplet contrastive learning for unsupervised vehicle re-identification. arXiv preprint arXiv:2301.09498 (2023).
Xiao, T. et al. Early convolutions help transformers see better. Adv. Neural. Inf. Process. Syst. 34, 30392–30400 (2021).
Chen, Y. et al. Mobile-former: Bridging mobilenet and transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 5270–5279 (2022).
Li, Y., Yao, T., Pan, Y. & Mei, T. Contextual transformer networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 45, 1489–1500 (2022).
Xia, Z., Pan, X., Song, S., Li, L. E. & Huang, G. Vision transformer with deformable attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 4794–4803 (2022).
Liu, Z. et al. A convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 11976–11986 (2022).
Wu, K., Xu, Z., Lyu, X. & Ren, P. Cross-supervised learning for cloud detection. GIScience Remote Sens. 60, 2147298 (2023).
Gu, X., Chen, X. & Du, L. Y. Simalstm-snp: Novel semantic relatedness learning model preserving both Siamese networks and membrane computing. J. Supercomput. 80, 3382–3411 (2024).
Yan, P. et al. Clustered remote sensing target distribution detection aided by density-based spatial analysis. Int. J. Appl. Earth Obs. Geoinf. 132, 104019 (2024).
Yang, D., Zhu, T., Wang, S., Wang, S. & Xiong, Z. Lfrsnet: A robust light field semantic segmentation network combining contextual and geometric features. Front. Environ. Sci. 10, 996513 (2022).
Jin, C., Tanno, R., Xu, M., Mertzanidou, T. & Alexander, D. C. Foveation for segmentation of ultra-high resolution images. arXiv preprint arXiv:2007.15124 (2020).
Shen, F., Shu, X., Du, X. & Tang, J. Pedestrian-specific bipartite-aware similarity learning for text-based person retrieval. In Proceedings of the 31th ACM International Conference on Multimedia (2023).
Shan, L., Li, X. & Wang, W. Decouple the high-frequency and low-frequency information of images for semantic segmentation. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 1805–1809 (IEEE, 2021).
Huynh, C., Tran, A. T., Luu, K. & Hoai, M. Progressive semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 16755–16764 (2021).
Hou, J., Guo, Z., Wu, Y., Diao, W. & Xu, T. Bsnet: Dynamic hybrid gradient convolution based boundary-sensitive network for remote sensing image segmentation. IEEE Trans. Geosci. Remote Sens. 60, 1–22 (2022).
Guo, S. et al. Isdnet: Integrating shallow and deep networks for efficient ultra-high resolution segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 4361–4370 (2022).
Liu, J. et al. A large-scale benchmark for vehicle logo recognition. In 2019 IEEE 4th International Conference on Image, Vision and Computing (ICIVC) 479–483 (IEEE, 2019).
Kamnitsas, K. et al. Efficient multi-scale 3d cnn with fully connected crf for accurate brain lesion segmentation. Med. Image Anal. 36, 61–78 (2017).
Shen, F. et al. A large benchmark for fabric image retrieval. In 2019 IEEE 4th International Conference on Image, Vision and Computing (ICIVC) 247–251 (IEEE, 2019).
Li, Y., Wu, J. & Wu, Q. Classification of breast cancer histology images using multi-size and discriminative patches based on deep learning. IEEE Access 7, 21400–21408 (2019).
Xie, Y., Shen, F., Zhu, J. & Zeng, H. Viewpoint robust knowledge distillation for accelerating vehicle re-identification. EURASIP J. Adv. Signal Process. 2021, 1–13 (2021).
Shen, F., Zhu, J., Zhu, X., Xie, Y. & Huang, J. Exploring spatial significance via hybrid pyramidal graph network for vehicle re-identification. IEEE Trans. Intell. Transp. Syst. 23, 8793–8804 (2021).
Katharopoulos, A., Vyas, A., Pappas, N. & Fleuret, F. Transformers are rnns: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning 5156–5165 (PMLR, 2020).
Lin, T.-Y., Goyal, P., Girshick, R., He, K. & Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision 2980–2988 (2017).
Shan, L. & Wang, W. Densenet-based land cover classification network with deep fusion. IEEE Geosci. Remote Sens. Lett. 19, 1–5 (2021).
Demir, I. et al. Deepglobe 2018: A challenge to parse the earth through satellite images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops 172–181 (2018).
Zhao, H., Qi, X., Shen, X., Shi, J. & Jia, J. Icnet for real-time semantic segmentation on high-resolution images. In Proceedings of the European Conference on Computer Vision (ECCV) 405–420 (2018).
Badrinarayanan, V., Kendall, A. & Cipolla, R. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 39, 2481–2495 (2017).
Wang, J. et al. Deep high-resolution representation learning for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 43, 3349–3364 (2020).
Sun, H., Pan, C., He, L. & Xu, Z. A full-scale feature extraction network for semantic segmentation of remote sensing images. In 2022 4th International Conference on Intelligent Control, Measurement and Signal Processing (ICMSP) 725–728 (IEEE, 2022).
Chen, C., Qian, Y., Liu, H. & Yang, G. Clanet: A cross-linear attention network for semantic segmentation of urban scenes remote sensing images. Int. J. Remote Sens. 44, 7321–7337 (2023).
Wu, H., Huang, P., Zhang, M. & Tang, W. Ctfnet: Cnn-transformer fusion network for remote sensing image semantic segmentation. IEEE Geosci. Remote Sens. Lett. 6, 66 (2023).
Zhang, Y., Cheng, J., Bai, H., Wang, Q. & Liang, X. Multilevel feature fusion and attention network for high-resolution remote sensing image semantic labeling. IEEE Geosci. Remote Sens. Lett. 19, 1–5 (2022).
Sun, X., Qian, Y., Cao, R., Tuerxun, P. & Hu, Z. Bgfnet: Semantic segmentation network based on boundary guidance. IEEE Geosci. Remote Sens. Lett. 6, 66 (2023).
Hu, J., Shen, L. & Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 7132–7141 (2018).
Jaderberg, M. et al. Spatial transformer networks. Adv. Neural Inf. Process. Syst. 28, 66 (2015).
Woo, S., Park, J., Lee, J.-Y. & Kweon, I. S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV) 3–19 (2018).
Acknowledgements
We are deeply grateful to the editors and reviewers of Scientific Reports for their hard work.
Funding
This work was supported by the National Natural Science Foundation of China (No. 42071343), the National Natural Science Foundation of China (No. 42071428), and the Research Fund of Liaoning Provincial Department of Education (No. LJKZ1070).
Author information
Contributions
This work was conducted in collaboration with all authors. Conceptualization, W.S., and H.H.; methodology, W.S., and H.H.; software, H.H.; validation, H.H., and J.D.; data curation, J.D.; writing-original draft preparation, H.H.; writing-review and editing, W.S., J.D.and H.H.; All authors have read and agreed to the published version of the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Song, W., He, H., Dai, J. et al. Spatially adaptive interaction network for semantic segmentation of high-resolution remote sensing images. Sci Rep 15, 15337 (2025). https://doi.org/10.1038/s41598-025-99428-4