Abstract
Semantic segmentation of remotely sensed images is crucial for urban planning and change detection, yet it faces issues such as sample imbalance and low data quality. This study compiles a GF-2 image dataset and refines the PSPNet model. The weights of different class samples were adjusted to prioritize minority classes, mitigating the impact of sample imbalance on classification, and data augmentation enhanced dataset quality. By replacing ResNet with the STFF network for better global feature extraction, adding attention modules, and using a combined loss function, the improved model shows excellent performance: it achieves an mAcc of 90.32%, an mIoU of 76.04%, and a Dice coefficient of 85.15%. Comparison with other models verifies its superiority, and tests on public datasets demonstrate strong generalization, offering valuable insights for remote sensing image processing.
Introduction
Over the last few years, with the rapid progress of remote sensing technology, the availability of high-resolution satellite images has greatly increased1. This has led to growing research interest in the semantic segmentation of remotely sensed images, whose goal is to classify and label the various types of land cover and land use in images2. It plays a crucial part in diverse applications such as urban planning3, land change detection4, water body monitoring5, and road extraction6. However, owing to the lack of high-quality information, recognizing feature types from images alone remains limited by problems such as obscure boundaries and overlapping regions, which makes the classification task increasingly difficult and affects the accuracy of the classification results7. In addition, the distribution of features across categories within an image may not be uniform, leading to a significant disparity in the number of samples per category and consequently reducing the generalization ability of the semantic segmentation model. Furthermore, remote sensing images acquired under cloud cover suffer from partial information loss8. Therefore, how to efficiently and accurately recognize feature types using computer vision methods has become a focus of research.
Against this research backdrop, numerous intractable issues are faced, such as sample imbalance and low data quality. Datla et al.9 innovatively proposed a scene attribute-focused modeling approach; by skillfully integrating the Gaussian Mixture Model (GMM) and factor analysis techniques, they successfully overcame these difficulties and accurately and efficiently extracted highly discriminative scene vectors from remote sensing images. He et al.10 proposed the MMSE framework to address the multi-class imbalance problem; through selective integration and multiple undersampling rates, the computational cost was reduced and various optimal solutions were provided for decision-makers to weigh among different classes. Regarding low data quality, Shorten and Khoshgoftaar11, through an analysis of various augmentation techniques, showed that better deep learning models can be constructed by enhancing the size and quality of the training dataset. These studies offer effective solutions to the problems existing in datasets.
Traditional land cover classification methods for remote sensing images, which rely on pixel-based classification techniques, have limited ability to accurately capture the spatial and contextual information of land cover categories. In contrast, semantic segmentation categorizes each pixel, which can determine the location and shape of features more accurately12. There are two main families of methods, namely traditional semantic segmentation algorithms and fully convolutional neural network algorithms. Traditional semantic segmentation includes the commonly used and classical methods based on thresholds13, edges14, and regions. However, the traditional methods suffer from low accuracy and are time-consuming.
With the development and progress of deep learning, the field of computer vision has achieved breakthrough progress, and Convolutional Neural Networks (CNNs) have become an important method for image processing. Semantic segmentation using CNNs can fully exploit the semantic information of images to accomplish the segmentation process15. Among these approaches, Liu et al.16 proposed a method that employs a spatial residual truncation module to acquire and fuse multi-scale contextual features based on the Fully Convolutional Network (FCN) segmentation approach, and this method can successfully extract buildings in high-resolution remote sensing images. Guo et al.17 put forward a graph theory-based remote sensing image segmentation method; it replaces the standard convolution of the FCN with dilated convolution and refines the boundaries of the segmentation result through a fully connected conditional random field, finally obtaining a more accurate result. In addition, the encoder-decoder architecture has been used by many scholars; it is composed of two parts, an encoder that extracts feature information from the original image and a decoder that reconstructs the feature information. Based on this structure, Hu et al.18 used the U-Net algorithm with a cross-entropy loss function and an improved attention mechanism to successfully identify water bodies threatened by eutrophication and green ponds. Datla et al.19 developed a multi-modal semantic segmentation method combining panchromatic remote sensing images and digital elevation models; by integrating U-Net and Transformers, this method can effectively identify and delineate airport runways. Swetha et al.20 proposed a novel network architecture, MS-VACSNet, which enhanced the segmentation accuracy for volcanic eruptions of different scales by introducing dilated convolutions into the U-Net architecture, and verified its superior performance compared with existing techniques. Since traditional convolution has a relatively small receptive field in the shallow layers of a neural network, a convolution method with a larger receptive field, dilated convolution, has been proposed21. Zhao et al.22 proposed the Pyramid Scene Parsing Network (PSPNet), which averages feature maps at different sizes through pooling to obtain more global semantics. Chen et al.23 presented the atrous-convolution-based DeepLab model, which boosts the precision of pixel localization and enlarges the receptive field by making use of dilated convolution and conditional random fields.
Although feature classification of high-resolution images using deep learning has been extensively studied, its application to the mountainous and hilly terrain that dominates the Hunan region of China remains limited. The primary motivation of this study is to propose an enhanced PSPNet algorithm aimed at achieving precise classification, addressing the inadequate recognition of intricate features by existing semantic segmentation algorithms in satellite remote sensing images as well as their low segmentation accuracy. The main contributions of this study are summarized as follows.
-
(1)
Creating datasets that include complex feature types provides an important resource to support research and development in related fields. In addition, the transferability of the improved model is validated using publicly available datasets.
-
(2)
An improved semantic segmentation algorithm for the PSPNet model is proposed using STFF. The mean accuracy (mAcc), mean intersection over union (mIoU), mDice coefficient, and detection speed of the STFF-PSPNet algorithm are compared with those of other algorithms (such as U-Net, DeepLabV3+, and Segformer).
-
(3)
The Combined Loss (CL) function is applied in the improved PSPNet network instead of the original loss function, which better fits the dataset created in this study. In addition, an attention mechanism, specifically CBAM (Convolutional Block Attention Module), was added to both the backbone and head layers to enhance feature extraction and improve detection accuracy.
Materials and methods
Research area
The study area is located in Yueyang City, Hunan Province, China, as shown in Fig. 1. It lies in the northeastern part of Hunan Province, with the city bordered by mountains on two sides, a hilly area in the southeast, the Dongting Lake plain in the northwest, and a transitional shallow hilly area around the lake in the center, giving a variety of feature types. The study mainly distinguishes between buildings, water bodies, cultivated land, woodland, and grassland. The selected test area is therefore representative, making the data diverse and rich.
(a) Left: The green frame denotes China’s geographical coordinates, and the yellow part represents Hunan Province. (b) Right: The yellow region indicates Yueyang City’s location in Hunan Province. (Maps created in ArcGIS 10.2, http://www.esri.com Boundaries made with free vector data provided by National Catalogue Service for Geographic Information, https://www.webmap.cn/commres.do?method=dataDownload).
Data
Data sources
-
(1)
Data acquisition of remote sensing images: in this research, high-resolution remote sensing images collected by Gaofen-2 are utilized, as shown in Fig. 2. These images were subjected to a series of preprocessing operations, including radiometric calibration, atmospheric correction, geometric correction, band fusion and enhancement, as well as image stitching and mosaicking, to improve the accuracy and clarity of the remote sensing images24. The preprocessed images contain three RGB channels and a positional information channel, but only the RGB channels are used for the purposes of this study.
-
(2)
Auxiliary data: this paper uses the publicly available dataset LandCover.ai25, with feature type information covering background, buildings, woodlands, water, and roads. The purpose is to compare the improved algorithm of this study with other algorithms on public datasets, analyze the results, and verify the generalization ability and transferability of the algorithm.
Remote sensing image of Yueyang (The satellite imagery data source: Gaofen-2, https://data.cresda.cn/#/home. Maps created in ArcGIS 10.2, http://www.esri.com).
Dataset
This study is dedicated to the classification of feature types in Yueyang City. Taking the preprocessed data as blocks of image data, the features were classified and labeled using ArcMap 10.2. For sample labeling, we generated vector files and annotated images whose size matched that of the original images. Because the images are large, we used a sliding window to crop each image and its corresponding labeled samples. We first cropped square tiles of 9216 pixels per side, some of which contained cloud cover or other defects; images with cloud cover, blank regions, or insufficient clarity were processed or deleted, and the accuracy of the remaining labels was verified26. The remaining images were then cropped to 512 × 512 size, as shown in Fig. 3.
(a–d) Sample datasets. The remote sensing images obtained from Gaofen-2 (Fig. 2) were preprocessed and cropped into 512 × 512 tiles by sliding-window cutting (Figures created in Visio 2021, https://visio.iruanhui.cn/).
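As a concrete illustration of this tiling step, the following is a minimal sketch of sliding-window cropping with OpenCV; the file layout, function name, and the choice of a non-overlapping stride are assumptions for illustration rather than the exact pipeline used in this study.

```python
import os
import cv2

def slide_crop(image_path, label_path, out_dir, tile=512, stride=512):
    """Crop a large image and its label map into tile x tile patches."""
    image = cv2.imread(image_path)                         # H x W x 3 RGB tile
    label = cv2.imread(label_path, cv2.IMREAD_GRAYSCALE)   # H x W class-index map
    h, w = image.shape[:2]
    os.makedirs(out_dir, exist_ok=True)
    idx = 0
    for y in range(0, h - tile + 1, stride):
        for x in range(0, w - tile + 1, stride):
            cv2.imwrite(os.path.join(out_dir, f"img_{idx}.png"),
                        image[y:y + tile, x:x + tile])
            cv2.imwrite(os.path.join(out_dir, f"lbl_{idx}.png"),
                        label[y:y + tile, x:x + tile])
            idx += 1
```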
Data augmentation
Data augmentation has a significant influence on deep learning-based semantic segmentation, increasing data diversity and regularizing the model. To extract feature information more effectively, ensure recognition accuracy, and prevent overfitting during training, various augmentation techniques were applied. Using data augmentation tools and OpenCV, the dataset was expanded through operations such as image rotation, image mirroring, random cropping, noise addition, and image stitching. The resulting augmented images, produced by the five augmentation operations explained in Table 1, are shown in Fig. 4.
Image augmentation results. (a) Original image. (b) Image rotation. (c) Image mirroring. (d) Image cropping. (e) Noise addition. (f) Image mosaic. The data augmentation methods are implemented with Python scripts in combination with OpenCV (Figures created in Visio 2021, https://visio.iruanhui.cn/).
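The following sketch illustrates how such augmentations can be written with OpenCV and NumPy; the crop size, noise level, and function names are illustrative assumptions, not the exact parameters used for the dataset.

```python
import cv2
import numpy as np

def rotate(img, lbl, k=1):
    """Rotate image and label together by k * 90 degrees."""
    return np.rot90(img, k), np.rot90(lbl, k)

def mirror(img, lbl):
    """Horizontal flip of image and label."""
    return cv2.flip(img, 1), cv2.flip(lbl, 1)

def random_crop(img, lbl, size=384):
    """Crop a random window and resize it back to the original shape."""
    h, w = img.shape[:2]
    y, x = np.random.randint(0, h - size), np.random.randint(0, w - size)
    img_c = cv2.resize(img[y:y + size, x:x + size], (w, h))
    lbl_c = cv2.resize(lbl[y:y + size, x:x + size], (w, h),
                       interpolation=cv2.INTER_NEAREST)  # keep class indices intact
    return img_c, lbl_c

def add_noise(img, lbl, sigma=10.0):
    """Add Gaussian noise to the image only; the label is unchanged."""
    noisy = img.astype(np.float32) + np.random.normal(0, sigma, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8), lbl
```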
After data augmentation, the dataset comprised 48,708 images (8,118 original images augmented sixfold). From this dataset, we used a Python random-assignment strategy to divide the processed images and their corresponding labels into training, validation, and test sets at a ratio of 6:2:2: 29,226 images were used as the training set, 9,741 images as the validation set, and 9,741 images as the test set.
Table 2 lists the composition of the resulting datasets.
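A minimal sketch of such a random 6:2:2 split is shown below; the fixed seed and function name are assumptions made so the illustration is reproducible.

```python
import random

def split_dataset(image_names, ratios=(0.6, 0.2, 0.2), seed=42):
    """Randomly assign samples to training, validation, and test sets."""
    names = list(image_names)
    random.Random(seed).shuffle(names)
    n_train = int(len(names) * ratios[0])
    n_val = int(len(names) * ratios[1])
    return (names[:n_train],                      # training set
            names[n_train:n_train + n_val],       # validation set
            names[n_train + n_val:])              # test set
```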
Improved semantic segmentation algorithm based on PSPNet remote sensing images
PSPNet
Zhao et al.22 proposed the Pyramid Scene Parsing Network (PSPNet) deep learning network. The traditional PSPNet algorithm uses a pooling pyramid to aggregate contextual information from different regions, which gives the algorithm a large global receptive field. The network structure of PSPNet is shown in Fig. 5. For the input image, a pretrained ResNet model with an extended network strategy, using dilated convolution, is employed to extract feature maps; the primary purpose of the dilated convolution is to increase the receptive field. The feature map obtained from the last convolution is then passed to the pyramid pooling module, which divides the input feature layer into 1\(\times\)1, 2\(\times\)2, 3\(\times\)3 and 6\(\times\)6 regions, reduces the feature dimensionality to 1/4 of the original with a 1\(\times\)1 convolutional layer, upsamples these pyramid features to the same size as the input features, and then merges them with the input features through a Concat operation, forming a feature layer that contains both local and global context information. Finally, this feature layer is passed through a convolution layer and SoftMax classification to obtain prediction results for each pixel in the image.
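A minimal PyTorch sketch of this pyramid pooling module is given below; the use of BatchNorm, ReLU, and bilinear upsampling follows common PSPNet implementations and should be read as an assumption rather than the exact configuration used here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPoolingModule(nn.Module):
    """Pool the input to 1x1, 2x2, 3x3 and 6x6 grids, reduce each branch to
    in_channels // 4 with a 1x1 conv, upsample to the input size and concatenate."""
    def __init__(self, in_channels, bins=(1, 2, 3, 6)):
        super().__init__()
        branch_ch = in_channels // 4
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.AdaptiveAvgPool2d(b),
                nn.Conv2d(in_channels, branch_ch, kernel_size=1, bias=False),
                nn.BatchNorm2d(branch_ch),
                nn.ReLU(inplace=True),
            ) for b in bins
        ])

    def forward(self, x):
        size = x.shape[2:]
        outs = [x]  # keep the original (local) features
        for branch in self.branches:
            outs.append(F.interpolate(branch(x), size=size,
                                      mode='bilinear', align_corners=False))
        return torch.cat(outs, dim=1)  # local + global context
```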
Improvement of the PSPNet network
The richness of feature categories and the irregular terrain of the Yueyang area create challenges for accurately recognizing feature types. To cope with these challenges and improve detection accuracy, the network based on PSPNet needs to be optimized.
The traditional PSPNet network uses the ResNet backbone27 to extract features; its deeper layers result in poorer computational efficiency and higher memory consumption, and a larger receptive field is only available in the deeper stages, which affects segmentation accuracy and speed. Therefore, in our study, the PSPNet backbone (ResNet) is replaced with the STFF module. STFF contains two structures: the Swin Transformer28, which uses the Transformer as its underlying architecture, and the Feature Fusion Module (FFM), which is similar to the feature pyramid network (FPN) structure29. The STFF-PSPNet network is presented in Fig. 6. The input image is transformed by the Swin Transformer into four feature maps of different sizes, namely 128\(\times\)128, 64\(\times\)64, 32\(\times\)32 and 16\(\times\)16. The deeper feature maps are then upsampled and convolved by the feature fusion module and fused with the upper-level feature maps to obtain a multi-scale feature representation. The upsampled feature maps in each layer pass through the CBAM attention mechanism, which helps the network better understand and perceive the global and local information of the input image, thereby improving its ability to recognize and localize targets. Finally, convolution operations are performed on the adjusted feature maps through convolutional or fully connected layers to generate feature maps with the same size as the input image.
By using STFF, the network can better adapt to targets of different scales and complexity, thus improving its generalization ability. This is of great significance for dealing with complex scenes and multi-scale targets, and is conducive to improving the practical application value of the model. To enhance feature extraction and attention to relevant targets, the attention mechanism is added to both the backbone and head layers. Incorporating the CBAM attention mechanism helps the model better learn the key features and the intrinsic structure of the input data, thus improving its representation and generalization ability.
STFF
In our study, the STFF consists of two parts, the Swin Transformer and the Feature Fusion Module. While the traditional ResNet model employs a deep residual connection structure, the Swin Transformer architecture takes full advantage of the Transformer's self-attention and multi-head attention mechanisms, and by introducing a local attention mechanism and a shifted-window strategy it allows the model to extract features and capture long-distance dependencies more efficiently when dealing with large images28. The Swin Transformer architecture is shown within the STFF module in Fig. 7 and is divided into four stages. Each stage consists of several Transformer blocks, each containing multi-head self-attention layers and fully connected layers dedicated to feature extraction and feature fusion.
First, the input \(H \times W \times 3\) RGB image is split by the Patch Partition module into N non-overlapping, equal-sized patches, giving a \(\frac{H}{4} \times \frac{W}{4} \times 48\) tensor (the number of channels changes from 3 to 48). The result is passed into Stage 1's Linear Embedding layer (a fully connected layer) and then into the Swin Transformer Block layer, which performs the self-attention mechanism, completing Stage 1. The result is then passed into Stage 2, Stage 3 and Stage 4, where the Patch Merging layer merges patches; the output feature map dimensions are \(\frac{H}{8} \times \frac{W}{8}\), \(\frac{H}{16} \times \frac{W}{16}\) and \(\frac{H}{32} \times \frac{W}{32}\), respectively. Each stage changes the dimensions of the tensor to form a hierarchical representation.
Building on the Transformer block, each Swin Transformer block contains a (shifted) window-based multi-head self-attention module (W-MSA/SW-MSA). As shown in Fig. 8, each Swin Transformer block also includes Multi-Layer Perceptron (MLP) layers for nonlinear transformation and combination of the patch features. We modified the architecture of the Swin Transformer by increasing the number of layers in Stage 3 from 6 to 18, which improves the expressive power of the model to capture more complex and detailed features and thus enhances performance on visual tasks, although attention must be paid to training time and the risk of overfitting.
The Swin Transformer architecture has four stages, whose output dimensions are \(\frac{H}{4} \times \frac{W}{4}\), \(\frac{H}{8} \times \frac{W}{8}\), \(\frac{H}{16} \times \frac{W}{16}\) and \(\frac{H}{32} \times \frac{W}{32}\), where H and W represent the height and width of the input image, respectively. As indicated in Fig. 9, to avoid the model over-adapting to a specific scale and losing universality, we use top-down pathways and lateral connections to merge feature maps at various resolution levels and combine the outputs and semantic information at different resolutions. During the fusion procedure, a \(1 \times 1\) convolution is first applied to adjust the channels of the previous level's output; we then merge the maps and repeat the \(1 \times 1\) convolution and up-sampling two more times to obtain the final output at the highest resolution. We adopt direct concatenation of each feature map along the channel direction, which retains all the information.
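The following PyTorch sketch captures this top-down fusion under stated assumptions: the stage channel widths (96, 192, 384, 768) are Swin-T-style defaults, and the reduced channel count of 256 is illustrative rather than taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureFusionModule(nn.Module):
    """Top-down fusion: the deeper map is passed through a 1x1 conv,
    upsampled to the shallower map's size and concatenated with it."""
    def __init__(self, in_channels=(96, 192, 384, 768), out_channels=256):
        super().__init__()
        self.reduce = nn.ModuleList()
        prev = in_channels[-1]
        for c in reversed(in_channels[:-1]):
            self.reduce.append(nn.Conv2d(prev, out_channels, kernel_size=1))
            prev = out_channels + c          # channels after concatenation
        self.out_conv = nn.Conv2d(prev, out_channels, kernel_size=1)

    def forward(self, feats):
        # feats: stage outputs at 1/4, 1/8, 1/16 and 1/32 resolution
        x = feats[-1]
        for conv, skip in zip(self.reduce, reversed(feats[:-1])):
            x = conv(x)                                              # 1x1 conv
            x = F.interpolate(x, size=skip.shape[2:],
                              mode='bilinear', align_corners=False)  # upsample
            x = torch.cat([x, skip], dim=1)                          # keep all information
        return self.out_conv(x)   # highest-resolution fused feature map
```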
Convolutional block attention module (CBAM)
The utilization of the attention mechanism in image processing aims at capturing the contextual details in the image so that the model can prioritize significant regions while disregarding unimportant information. Our algorithm utilizes the CBAM attention mechanism depicted in Fig. 10, which comprises two components, a channel attention module and a spatial attention module30. The channel attention module (CAM) facilitates the identification of the feature channels vital to a particular task by analyzing the relationships between channels and optimizing the distribution of feature maps, consequently improving model performance. The spatial attention module (SAM) pays more attention to the positional and contextual information of the whole image to better understand local regions, and is especially useful for accurately extracting edge features. CBAM connects these two modules in tandem, enabling the network to better capture the correlations between channels and spatial positions, which improves feature expression and model performance.
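A minimal PyTorch sketch of CBAM is given below; the reduction ratio of 16 and the 7 x 7 spatial kernel come from the original CBAM paper30 and are assumptions here, since this study does not state the values it used.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention: global average/max pooling followed by a shared MLP."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False))

    def forward(self, x):
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        return torch.sigmoid(avg + mx)

class SpatialAttention(nn.Module):
    """Spatial attention: channel-wise average/max maps fused by a 7x7 conv."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        avg = torch.mean(x, dim=1, keepdim=True)
        mx, _ = torch.max(x, dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class CBAM(nn.Module):
    """Channel attention followed by spatial attention, applied in tandem."""
    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        self.ca = ChannelAttention(channels, reduction)
        self.sa = SpatialAttention(kernel_size)

    def forward(self, x):
        x = x * self.ca(x)
        return x * self.sa(x)
```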
Loss function
Semantic segmentation of remote sensing images is a multi-class classification task that mainly uses the cross-entropy loss, which imposes a larger penalty on misclassified samples and thus helps the model learn the classification task better. However, the terrain captured in remote sensing images is complex, and the cross-entropy loss alone is not ideal for segmenting such images. In our method, the improved PSPNet loss therefore combines the cross-entropy loss function31 and the Dice Loss function32 to jointly optimize the feature parameters. The combined loss function (CL) is formed by summing the cross-entropy loss function (CE) and the Dice Loss function (DL) with different weights assigned to each.
The expression for the cross-entropy loss function is:
\[L_{CE} = -\sum_{i} p_{i} \log q_{i}\]
where p represents the distribution of genuine labels and q denotes the distribution of model outputs.
The expression for the Dice Loss function is:
\[L_{DL} = 1 - \frac{2\sum_{i} p_{i} q_{i} + \epsilon}{\sum_{i} p_{i} + \sum_{i} q_{i} + \epsilon}\]
where p denotes the binarized distribution of the genuine labels, q represents the binarized distribution of the model outputs, and \(\epsilon\) is a small positive number used to avoid the case where the denominator is zero.
The expression for the combined loss function (CL) is:
\[L_{CL} = \alpha L_{CE} + \beta L_{DL}\]
where \(\alpha\) and \(\beta\) are weighting coefficients used to regulate the contribution of the cross-entropy loss function and the Dice Loss function within the combined loss function, so as to better fit the model.
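A minimal PyTorch sketch of this combined loss is shown below; the weights \(\alpha = 0.6\) and \(\beta = 0.4\) are those reported later in the training settings, while the \(\epsilon\) value and the one-hot formulation of the multi-class Dice term are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CombinedLoss(nn.Module):
    """CL = alpha * cross-entropy loss + beta * Dice loss."""
    def __init__(self, num_classes, alpha=0.6, beta=0.4, eps=1e-6):
        super().__init__()
        self.num_classes = num_classes
        self.alpha, self.beta, self.eps = alpha, beta, eps
        self.ce = nn.CrossEntropyLoss()

    def forward(self, logits, target):
        # logits: (N, C, H, W); target: (N, H, W) with class indices
        ce_loss = self.ce(logits, target)

        probs = F.softmax(logits, dim=1)
        one_hot = F.one_hot(target, self.num_classes).permute(0, 3, 1, 2).float()
        inter = (probs * one_hot).sum(dim=(0, 2, 3))
        denom = probs.sum(dim=(0, 2, 3)) + one_hot.sum(dim=(0, 2, 3))
        dice_loss = 1 - ((2 * inter + self.eps) / (denom + self.eps)).mean()

        return self.alpha * ce_loss + self.beta * dice_loss
```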
Model training
Training platforms and parameter settings
In this research, the semantic segmentation model is implemented with the PyTorch deep learning framework. The experiments were run on the Ubuntu 20.04.2 operating system with the following main configuration: an Intel(R) Xeon(R) W-2245 CPU @ 3.90 GHz, 125 GB of RAM, and an NVIDIA GeForce RTX 3090 GPU (24 GB). Python 3.11.3 was used as the development language, and PyTorch 2.0.0 with CUDA 11.8 served as the framework for training and testing the model. The training configuration used a stochastic gradient descent (SGD) optimizer with an initial learning rate of \(1 \times 10^{-2}\), a momentum of 0.9, and a weight decay of 0.0005.
The batch size is set to 8 and the number of iterations to 160,000, which is equivalent to a training epoch count of 100, as shown in Equation 4.
The image input size is configured as 512\(\times\)512 pixels. The Poly learning rate strategy was selected, and the CL loss weights \(\alpha\) and \(\beta\) were set to 0.6 and 0.4, respectively. Throughout training, log records were saved every 100 steps, the model was evaluated every 2,000 steps, and the model with the highest mIoU was retained as the final weights. Table 3 summarizes the training parameters used in the experiment.
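The sketch below shows how such a schedule can be wired up in PyTorch; the placeholder model and the Poly power of 0.9 are assumptions (the power is a common default and is not stated in the paper).

```python
import torch

model = torch.nn.Conv2d(3, 6, kernel_size=1)          # placeholder for STFF-PSPNet
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2,
                            momentum=0.9, weight_decay=5e-4)
max_iters = 160_000

def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    """Poly schedule: lr = base_lr * (1 - iter / max_iter) ** power."""
    return base_lr * (1 - cur_iter / max_iter) ** power

for it in range(max_iters):
    for group in optimizer.param_groups:
        group['lr'] = poly_lr(1e-2, it, max_iters)
    # ... forward pass, CombinedLoss, loss.backward(), optimizer.step() ...
```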
Evaluation indicators of the model
In this study, to assess the performance of the improved semantic segmentation model, several validation metrics are used, namely overall accuracy (aAcc)33, intersection over union (IoU), mean intersection over union (mIoU)34, and the Dice similarity coefficient (Dice)35. These metrics all depend on the confusion matrix shown in Table 4, which compares the model's prediction for each pixel with the true label to derive the model's performance on the different categories. By analyzing the confusion matrix, performance metrics such as accuracy, recall, and the Dice coefficient can be derived for each category, which helps to optimize the model and improve the semantic segmentation.
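The following NumPy sketch shows one way to accumulate the confusion matrix and derive these metrics; the convention that rows correspond to ground truth and columns to predictions is an assumption about the layout of Table 4, and the code assumes every class appears at least once in the ground truth.

```python
import numpy as np

def confusion_matrix(pred, label, num_classes):
    """Accumulate a num_classes x num_classes confusion matrix
    (rows: ground truth, columns: predictions) from flattened index maps."""
    mask = (label >= 0) & (label < num_classes)
    idx = num_classes * label[mask].astype(int) + pred[mask].astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def metrics_from_cm(cm):
    """aAcc, mAcc, mIoU and mDice derived from the confusion matrix."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    acc = tp / cm.sum(axis=1)                   # per-class accuracy (recall)
    iou = tp / (tp + fp + fn)                   # per-class IoU
    dice = 2 * tp / (2 * tp + fp + fn)          # per-class Dice coefficient
    return {"aAcc": tp.sum() / cm.sum(), "mAcc": acc.mean(),
            "mIoU": iou.mean(), "mDice": dice.mean()}
```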
Results and analysis
Training results
In this study, we tested different backbone networks for the PSPNet algorithm, including the convolution-based ResNet50 and ResNet101, the lighter MobileNetV236, ResNeSt (a new variant of ResNet)37, and the Swin Transformer. These backbone networks all achieve good classification results on the dataset. The experimental outcomes are presented in Table 5.
According to the results, MobileNetV2 is more lightweight, with fewer parameters and faster execution than the traditional ResNet network, but at the price of lower accuracy. For ResNeSt, the ResNet variant, although execution becomes faster, the decline in accuracy is severe. The Swin Transformer as a backbone network achieves the best results on our dataset because it performs self-attention within shifted windows, which reduces computational complexity while both exploiting the Transformer's ability to extract global semantic information and retaining more local information within each window.
To better evaluate the performance of the improved PSPNet network for semantic segmentation of remote sensing images, this study compares it with six classical semantic segmentation models, including U-Net35, DeepLabV3+23, and Fast-SCNN38, while keeping the other training parameters consistent. The accuracy results of the improved PSPNet and the other classical semantic segmentation models on the remote sensing image feature classification task are shown in Table 6. The accuracy of our improved model is significantly better than that of the other models across the relevant metrics: the mean accuracy (mAcc) of the improved PSPNet is at least 2.79% higher than that of the other models, the mean intersection over union (mIoU) is at least 4.84% higher, and the mean Dice coefficient (mDice) is at least 2.58% higher. These data indicate that the improved PSPNet model performs excellently on the task. Fig. 11 shows the confusion matrix of STFF-PSPNet. STFF-PSPNet performs well on the building and water categories, but there is still room for improvement in distinguishing categories such as grassland, farmland, and forest land. In subsequent research, the model structure can be further optimized or the parameters adjusted for these easily confused categories to improve classification accuracy.
In addition, the experimental results produced by the different models on the Yueyang dataset are shown in Fig. 12. We selected three representative images containing feature classes such as water bodies, buildings, and woodlands. All the network models recognize the feature types reasonably accurately, but the classical models all suffer from misclassification to a greater or lesser extent, as well as omission problems. As can be clearly seen in Fig. 12(c), the segmentation results of the STFF-PSPNet algorithm proposed in this study are significantly better than those of the other models. For small targets, our method extracts them more accurately and completely and minimizes the loss of spatial details extracted from the remotely sensed images.
Ablation experiments
To evaluate the effectiveness of combining the CBAM attention module and the CL loss function, this study designed four experimental schemes on top of the original algorithm and compared them. The specific experimental findings are presented in Table 7.
Scheme 1 is the traditional PSPNet model with ResNet as the backbone network. Scheme 2 substitutes STFF for the backbone network of Scheme 1. Scheme 3 extends Scheme 2 by adding the CBAM module to the model. Scheme 4 only changes the loss function relative to Scheme 2. Finally, Scheme 5 extends Scheme 2 by integrating the CBAM module into the network model and also changing the loss function.
The outcomes in Table 7 show that the model's accuracy benefits from the combination of the CBAM module and the CL loss function. After replacing the backbone of the traditional PSPNet network, the evaluation metrics improve in all respects, while model complexity increases. The introduction of the CBAM module and the switch to the CL loss function lead to further improvements, with the mean accuracy increasing by 1.04% and 0.45% and the mean intersection over union by 1.49% and 0.79%, respectively. Notably, when both the CBAM module and the CL loss function are employed in Scheme 5, the mean accuracy of the model increases by 1.86% and the mean intersection over union improves by 2.99%, attaining the optimal detection performance. Consequently, the performance of the remote sensing feature classification task can be substantially enhanced by substituting the backbone network with the STFF network, adding the CBAM attention module, and combining the CL loss function into the network.
Transferability of the segmentation model
The validation data typically used for remote sensing image segmentation comes from the same study region as the training data, which consequently fails to reveal the robustness of the segmentation model.
Although many segmentation models achieve high accuracy on their validation data, their performance degrades when applied to different geographical regions, mainly owing to insufficient training samples or limited model generalization. To validate the transferability of the enhanced model, we performed experiments on the publicly accessible dataset LandCover.ai. The experiments additionally utilized GF-2 satellite images as the remote sensing data source, and the images were subjected to uniform preprocessing.
LandCover.ai Dataset (using the free dataset LandCover.ai Version 1 provided by Boguszewski et al., https://landcover.ai.linuxpolska.com/#dataset. Figures created in Visio 2021, https://visio.iruanhui.cn/).
We also made targeted adjustments to the model. At the output layer, as shown in Fig. 13, we redesigned the number of neurons and the activation function to correspond to the new four-category task. At the same time, to make full use of the existing feature extraction capabilities and accelerate training, we froze some of the lower feature extraction layers of the model; these layers have learned general image features in the five-category task and are expected to remain useful in the four-category task. We then set appropriate training parameters, such as a smaller learning rate to avoid excessive adjustments to the pretrained weights, and determined the batch size and number of training epochs according to the dataset size and computing resources. Finally, we retrained the adjusted model on the four-category dataset.
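A minimal sketch of this adaptation step is shown below; the attribute names (`backbone.stages`, `cls_head`) and the learning rate are hypothetical placeholders for illustration, since the paper does not expose the model's internal naming.

```python
import torch
import torch.nn as nn

def adapt_for_landcover(model, num_classes=4, lr=1e-3):
    """Freeze early backbone stages and replace the classification head
    so the model predicts the four LandCover.ai categories."""
    for name, param in model.named_parameters():
        # `backbone.stages.0/1` are hypothetical parameter-name prefixes
        if name.startswith(("backbone.stages.0", "backbone.stages.1")):
            param.requires_grad = False          # keep general low-level features

    in_ch = model.cls_head.in_channels           # `cls_head` is a hypothetical attribute
    model.cls_head = nn.Conv2d(in_ch, num_classes, kernel_size=1)

    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.SGD(trainable, lr=lr, momentum=0.9, weight_decay=5e-4)
    return model, optimizer
```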
The outcomes of these tests are presented in Table 8. The STFF-PSPNet has considerable accuracy in feature classification, indicating that it is quite transferable and has the potential for wider application and dissemination.
As shown in Table 9, comparing our transfer learning results with the data reported in the original paper, the overall mIoU reaches 89.32%, which is 3.72% higher than the result reported for LandCover.ai.
Discussion
Semantic segmentation of remote sensing images can be helpful in applications such as urban planning and water body change detection. However, detection accuracy is affected by intricate terrain and by the resolution of the remote sensing images. To accurately and efficiently recognize land use types, this paper proposes a semantic segmentation method based on an improved deep learning model. The use of STFF makes the network better adapted to targets of different scales and complexity and improves its generalization ability. Employing the CBAM attention mechanism in the STFF-PSPNet network and replacing the original loss function with the combined loss function CL enables the model to better capture the feature information of the input data. The comparison tests show, on the one hand, that the accuracy of the present model is optimal when the backbone module of the PSPNet network is modified while the other modules remain unchanged; on the other hand, compared with other mainstream semantic segmentation algorithms, the detection speed and accuracy of this model in remote sensing image feature classification are superior. The ablation experiments show that the present model outperforms the other configurations and obtains the optimal detection results.
Our improved semantic segmentation method shows good segmentation performance on both the self-constructed dataset and the LandCover.ai dataset, which verifies its transferability; however, the number of model parameters is enlarged by the inclusion of STFF, and further improvement is needed to increase detection speed. In future research, the improved PSPNet model can also be applied in practice to tasks such as land change detection and urban planning.
Conclusion
In this study, we propose an improved PSPNet semantic segmentation model to address the challenges of feature type extraction in remote sensing imagery. The model uses STFF as the backbone network, which consists of a Swin Transformer and a feature fusion module with an integrated CBAM module, thereby improving detection accuracy. In addition, we adopt a combined loss function more suitable for complex terrain classification, which replaces the cross-entropy loss commonly used for semantic segmentation and improves the robustness of the model. This method effectively addresses problems of existing classical semantic segmentation models, such as inaccurate feature edge extraction and misclassification. In a comparison with existing classical semantic segmentation models, we found that the model outperforms other algorithms in detection accuracy (mAcc) and Dice coefficient for remote sensing imagery feature classification, reaching 90.32% and 85.15%, respectively. Although the model has shown good performance in the various experiments, there are limitations: the inclusion of STFF increases the number of network parameters, which raises the training cost. Therefore, in future research we will seek a balance between training cost and accuracy to further improve on this task.
Data availability
The data that support the findings of this study are available from the corresponding author upon reasonable request.
References
Li, D., Wang, M. & Jiang, J. China’s high-resolution optical remote sensing satellites and their mapping applications. Geo Spat. Inf. Sci. 24, 85–94 (2021).
Yi, Z. et al. Scene-aware deep networks for semantic segmentation of images. IEEE Access 7, 69184–69193 (2019).
Fan, R. et al. Fine-scale urban informal settlements mapping by fusing remote sensing images and building data via a transformer-based multimodal fusion network. IEEE Trans. Geosci. Remote Sens. 60, 1–16 (2022).
Hu, X. & Zhuang, S. Large-scale spatial-temporal identification of urban vacant land and informal green spaces using semantic segmentation. Remote Sens. 16, 216 (2024).
Weng, L. et al. Water areas segmentation from remote sensing images using a separable residual segnet network. ISPRS Int. J. Geo Inf. 9, 256 (2020).
Dong, S. & Chen, Z. Block multi-dimensional attention for road segmentation in remote sensing imagery. IEEE Geosci. Remote Sens. Lett. 19, 1–5 (2021).
Sui, B., Cao, Y., Bai, X., Zhang, S. & Wu, R. Bibed-seg: Block-in-block edge detection network for guiding semantic segmentation task of high-resolution remote sensing images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 16, 1531–1549 (2023).
Li, H., Zhang, L. & Shen, H. A principal component based haze masking method for visible images. IEEE Geosci. Remote Sens. Lett. 11, 975–979 (2013).
Datla, R. et al. Learning scene-vectors for remote sensing image scene classification. Neurocomputing 587, 127679 (2024).
He, Y.-X., Liu, D.-X., Lyu, S.-H., Qian, C. & Zhou, Z.-H. Multi-class imbalance problem: A multi-objective solution. Inf. Sci. 680, 121156 (2024).
Shorten, C. & Khoshgoftaar, T. M. A survey on image data augmentation for deep learning. J. Big Data 6, 1–48 (2019).
Liu, X., Deng, Z. & Yang, Y. Recent progress in semantic image segmentation. Artif. Intell. Rev. 52, 1089–1106 (2019).
Liu, J., Geng, Y., Zhao, J., Zhang, K. & Li, W. Image semantic segmentation use multiple-threshold probabilistic R-CNN with feature fusion. Symmetry 13, 207 (2021).
Li, Y., Liu, Z., Yang, J. & Zhang, H. Wavelet transform feature enhancement for semantic segmentation of remote sensing images. Remote Sens. 15, 5644 (2023).
Yuan, X., Shi, J. & Gu, L. A review of deep learning methods for semantic segmentation of remote sensing imagery. Expert Syst. Appl. 169, 114417 (2021).
Liu, Y. et al. Automatic building extraction on high-resolution remote sensing imagery using deep convolutional encoder-decoder with spatial pyramid pooling. IEEE Access 7, 128774–128786 (2019).
Guo, R. et al. Pixel-wise classification method for high resolution remote sensing imagery using deep neural networks. ISPRS Int. J. Geo Inf. 7, 110 (2018).
Hu, Y. et al. Extraction of eutrophic and green ponds from segmentation of high-resolution imagery based on the EAF-Unet algorithm. Environ. Pollut. 343, 123207 (2024).
Datla, R., Chalavadi, V. & Mohan, C. K. A multimodal semantic segmentation for airport runway delineation in panchromatic remote sensing images. In Fourteenth International Conference on Machine Vision (ICMV 2021) Vol. 12084, 46–52 (SPIE, 2022).
Swetha, G., Datla, R., Vishnu, C. et al. MS-VACSNet: A network for multi-scale volcanic ash cloud segmentation in remote sensing images. In 2023 18th International Conference on Machine Vision and Applications (MVA) 1–6 (IEEE, 2023).
Yu, F., Koltun, V. & Funkhouser, T. Dilated residual networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 472–480 (2017).
Zhao, H., Shi, J., Qi, X., Wang, X. & Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2881–2890 (2017).
Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K. & Yuille, A. L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 40, 834–848 (2017).
Li, A. & Xia, G. The influence of geometric correction on the accuracy of the extraction of the remote sensing reflectance of water. Int. J. Remote Sens. 42, 2280–2291 (2021).
Boguszewski, A., Batorski, D., Ziemba-Jankowska, N., Dziedzic, T. & Zambrzycka, A. LandCover.ai: Dataset for automatic mapping of buildings, woodlands, water and roads from aerial imagery. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 1102–1110 (2021).
He, Z., Gong, C., Hu, Y. & Li, L. Remote sensing image dehazing based on an attention convolutional neural network. IEEE Access 10, 68731–68739 (2022).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 770–778 (2016).
Liu, Z. et al. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision 10012–10022 (2021).
Lin, T.-Y. et al. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2117–2125 (2017).
Woo, S., Park, J., Lee, J.-Y. & Kweon, I. S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV) 3–19 (2018).
Bahri, A., Majelan, S. G., Mohammadi, S., Noori, M. & Mohammadi, K. Remote sensing image classification via improved cross-entropy loss and transfer learning strategy based on deep convolutional neural networks. IEEE Geosci. Remote Sens. Lett. 17, 1087–1091 (2019).
Li, X. et al. Dice loss for data-imbalanced NLP tasks. arXiv preprint arXiv:1911.02855 (2019).
Badrinarayanan, V., Kendall, A. & Cipolla, R. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 39, 2481–2495 (2017).
Long, J., Shelhamer, E. & Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 3431–3440 (2015).
Ronneberger, O., Fischer, P. & Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5–9, 2015, Proceedings, Part III Vol. 18, 234–241 (Springer, 2015).
Li, X., Ye, H. & Qiu, S. Cloud contaminated multispectral remote sensing image enhancement algorithm based on mobilenet. Remote Sens. 14, 4815 (2022).
Zhang, H. et al. Resnest: Split-attention networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2736–2746 (2022).
Poudel, R. P., Liwicki, S. & Cipolla, R. Fast-scnn: Fast semantic segmentation network. arXiv preprint arXiv:1902.04502 (2019).
Acknowledgements
The Yueyang data in this article come from the Gaofen (high-resolution) business department of Hunan Aerospace Yuanwang Technology Co., Ltd. We thank all the authors for their contributions to this study.
Funding
Natural Science Foundation of Hunan Province (No. 2024JJ8037)
Author information
Authors and Affiliations
Contributions
H.L.: Author of the main text; involved in data collection and data processing, made tables and figures, methodology, conceived and designed the experiments, research methods, writing and review & editing of the paper. J.G.: Participated in data collection, data processing and other work, and collected information on the working method. Also involved in text writing. Y.L.: Collected information on the working method, data processing and other work. C.H.: Participated in data collection, data processing and other work, and collected information on the working method. L.Z.: Participated in text writing, analyzed the samples. Writing and review & editing. L.L.: Responsible for the revision and correction of grammar and spelling. All authors have read and agreed to the published version of the manuscript.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Li, H., Gao, J., Liu, Y. et al. Enhanced remote sensing image feature classification using STFF-PSPNet. Sci Rep 15, 27587 (2025). https://doi.org/10.1038/s41598-025-89094-x