Introduction

With the rapid development of sensor technology, the spatial detail, spectral resolution and imaging accuracy of remote sensing images have improved significantly, making them an important data source for obtaining surface information. They play a crucial role in fields such as urban planning, land cover mapping, and ecological and environmental monitoring1, 2, 3. As one of the core tasks in remote sensing scene understanding, semantic segmentation classifies each pixel precisely, providing key support for object recognition, detailed mapping and intelligent decision-making4, 5.

However, in high-resolution remote sensing images, object categories such as buildings, roads, water bodies and vegetation usually exhibit significant scale differences and ambiguous boundaries, so accurate semantic segmentation remains a major challenge6. Traditional image segmentation methods that rely on handcrafted features or shallow models are inadequate for this task. In this context, deep learning approaches have demonstrated outstanding performance, benefiting from their strong feature representation ability7. CNN-based models can effectively extract local spatial detail features and have achieved excellent performance on several publicly available remote sensing datasets. For example, Fully Convolutional Networks (FCNs)8, U-Net9, Seg-Net10 and the DeepLab series11, 12, 13 have become mainstream methods. However, due to the limited receptive field of convolution, these methods perform poorly in capturing global contextual correlations. Recently, Transformer architectures have shown notable effectiveness in capturing long-range dependencies through their attention mechanism and have been widely applied to remote sensing imagery14. Transformers have demonstrated the potential to outperform traditional CNNs in building extraction15, vegetation detection16, and Unmanned Aerial Vehicle-based aerial scene parsing17. However, their computational complexity grows quadratically with image resolution, and they struggle to capture local spatial details in high-resolution imagery.

Recent research indicates that integrating CNNs with Transformer models in image semantic segmentation can leverage the strengths of both approaches while mitigating their respective limitations, thereby enhancing overall model performance. ST-UNet18 embeds the Swin Transformer19 into the U-Net to fuse contextual and local features, though its edge segmentation remains limited. UNetFormer7 combines U-Net and Transformer modules to improve semantic representation in complex urban scenes, but its multi-scale fusion strategy still requires refinement. Zhang et al.20 proposed a representative CNN–Transformer hybrid model for high-resolution imagery, where the encoder employs a Swin Transformer and the decoder uses a CNN-based ASPP module, further enhanced with skip connections, squeeze-and-excitation attention, and an auxiliary boundary detection branch. More recently, FTransUNet21 proposed a hierarchical multimodal fusion strategy, employing CNN-based shallow fusion for local details and Transformer-based deep fusion for global semantics, thereby leveraging the strengths of both paradigms across modalities.

Despite the significant advances in remote sensing semantic segmentation achieved by existing CNN, Transformer, and hybrid approaches, high-resolution imagery remains challenging: insufficient multi-scale feature characterisation prevents models from capturing boundary features, and classification errors at the junctions of different semantic categories lead to blurred segmentation boundaries. Furthermore, recent research on remote sensing image segmentation has underscored the importance of boundary-aware modelling in dense prediction tasks22, 23. These considerations motivated us to develop a unified framework that synergistically enhances both multi-scale representation and boundary awareness. To address these challenges, we propose MSBANet, a specially designed CNN-Transformer hybrid framework. Unlike existing CNN-Transformer hybrid models that mainly focus on global-local feature fusion, our framework is a boundary-oriented hybrid architecture built upon three complementary components: multi-scale modeling, boundary consistency preservation, and uncertainty-based refinement. It is designed to enhance multi-scale representation and boundary accuracy in high-resolution remote sensing imagery. Its principal contributions are summarised as follows:

  • We propose MSBANet, a semantic segmentation architecture that combines multi-scale boundary modeling with uncertainty-aware edge enhancement.

  • We design a collaborative multi-scale mechanism consisting of MSTB and GLFM. MSTB enhances the model's ability to capture features with significant scale variation in category edge regions, while GLFM performs cross-level feature alignment, ensuring the consistency of boundary structures during the multi-scale reconstruction process.

  • We propose UBAM, which quantifies prediction uncertainty using information entropy and directs attention to ambiguous boundary regions. UBAM is embedded into the segmentation head to enhance the model's boundary refinement capability.

Related work

Semantic segmentation methods based on CNN

The objective of semantic segmentation is to assign semantic labels at the pixel level. Unlike traditional pixel-wise classification, CNN-based approaches leverage spatial and contextual features to achieve higher accuracy and stability. FCNs8 marked a milestone by establishing the foundation for CNN-based models. Subsequently, encoder-decoder architectures such as U-Net9, Seg-Net10, and PSPNet24 were proposed, which integrate high-level semantics with fine-grained spatial details through skip connections. Later, the DeepLab series11, 12, 13 introduced the Atrous Spatial Pyramid Pooling (ASPP) module, which further improves segmentation accuracy by capturing multi-scale features to integrate contextual information. Other CNN-based variants, including context-aware and adaptive feature selection modules25, have also been explored to improve local detail representation.

Although significant progress has been made, CNN-based approaches continue to encounter difficulties in complex urban remote sensing scenarios. Due to their restricted receptive field and insufficient ability to model global dependencies, they frequently misidentify small buildings, subtle textures, and occluded targets, thereby reducing segmentation accuracy.

Semantic segmentation methods based on transformer

Capturing global semantic information is essential for accurate semantic segmentation, and Transformer-based approaches have therefore attracted increasing attention. By incorporating self-attention modules, these approaches successfully capture multi-scale features and long-range dependencies, thereby enhancing segmentation accuracy in remote sensing imagery.

The Vision Transformer (ViT)26 splits an image into non-overlapping patches and leverages self-attention to model contextual relationships, demonstrating the potential of Transformers for visual tasks. Inspired by this, SETR27 employed a Transformer encoder for semantic segmentation, showing flexibility in combination with different decoders. Subsequently, the Swin Transformer19 introduced shifted windows and a hierarchical design, which greatly reduces computational complexity while improving multi-scale feature representation, making it more suitable for high-resolution dense prediction. Other variants, such as CSwin28 and TransNext29, further optimized the attention mechanism or robustness for specific vision tasks, indicating the rapid evolution of Transformer-based segmentation networks. Recently, lightweight Transformer architectures have also been explored for large-scale urban mapping, such as the global UOS segmentation framework30, demonstrating strong cross-region generalization through efficient multi-scale context modeling. In addition, specialized Transformer frameworks have emerged for complex and highly heterogeneous scenes, such as the DualFormer model31, which employs pyramidal Transformer encoding to better capture multi-scale semantic variations in irregular urban informal settlements.

Despite these advances, Transformer models still face critical limitations. The self-attention mechanism requires computing pairwise relationships across all tokens, leading to quadratic complexity with image resolution and substantial memory consumption. Moreover, Transformers focus primarily on global context but are less effective in capturing local features. These limitations have encouraged the design of hybrid architectures that leverage CNNs for spatial detail and Transformers for long-range dependencies, particularly in high-resolution remote sensing segmentation.

Semantic segmentation methods combining CNN and transformer

Traditional CNNs are highly effective at extracting local spatial details, but their capacity to model global context is limited. Although Transformer-based methods can capture long-range dependencies, they often ignore fine structures and boundary information and have high computational complexity. In semantic segmentation tasks, both global and local information are crucial for accurately understanding the semantic structure of images, so researchers have begun to combine CNN and Transformer models to fully leverage their respective advantages. Chen et al. proposed CTSeg32, a CNN and ViT collaborative segmentation framework following the encoder-decoder architecture for land-use/land-cover classification of high-resolution remote sensing images. By simultaneously leveraging the local features of the CNN and the global dependencies of the Transformer through multi-scale self-attention and bidirectional knowledge distillation, it improves the accuracy of remote sensing semantic segmentation. Fan et al.33 proposed a Multidimensional Information Fusion Network for high-resolution remote sensing image segmentation that integrates CNN and Transformer branches and introduces frequency information. Through multi-scale integration of local details, global semantics, and frequency-domain features, it strengthens the recognition of fine-grained boundaries and scale-varying targets, effectively alleviating complex interference factors such as inter-class ambiguity. The hybrid attention network proposed by Wang et al.34 adaptively fuses multi-scale local features and long-range context relations through channel-spatial attention, and enhances the recognition of structurally similar and boundary-ambiguous regions through a global cross-fusion module, further highlighting the advantages of the CNN-Transformer hybrid architecture in multi-scale modeling and complex boundary depiction.

In summary, although significant progress has been made in the field of remote sensing semantic segmentation, key challenges remain in effectively integrating the strengths of convolutional neural networks and Transformers, enhancing boundary recognition capabilities, and further improving segmentation accuracy.

Method

Network architecture

Fig. 1. The overall network architecture of MSBANet.

The overall architecture of MSBANet is illustrated in Fig. 1. We employ a CNN-based ResNet50 as the encoder, a backbone network that has been extensively validated in the field of remote sensing semantic segmentation. It comprises four residual stages, each performing downsampling and generating feature maps with resolutions of H/4 × W/4, H/8 × W/8, H/16 × W/16, and H/32 × W/32, respectively, featuring channel counts of 256, 512, 1024, and 2048.

To maintain computational efficiency and facilitate multi-scale feature fusion, we first apply a 1 × 1 convolution to compress all encoder outputs to 512 channels in the decoder, and this channel dimension is preserved throughout all decoding stages. Subsequently, the MSTB module within the decoder extracts multi-scale information, which is progressively upsampled via GLFM and integrated with the corresponding local spatial features from the encoder, ensuring boundary structure consistency throughout the multi-scale reconstruction process. Finally, a UBAM module is embedded within the segmentation head to generate more accurate segmentation maps at the same resolution as the original input.
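The following PyTorch sketch illustrates this data flow under stated assumptions: MSTB, GLFM and UBAM are stubbed with placeholder behaviour (their internals are outlined in the subsections below), and the wiring of the segmentation head is a simplification of the description above rather than the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50

# Stand-ins for the modules detailed in the following subsections.
class MSTB(nn.Module):                                   # multi-scale Transformer block (sketched later)
    def __init__(self, dim): super().__init__()
    def forward(self, x): return x

class GLFM(nn.Module):                                   # global-local fusion module (sketched later)
    def __init__(self, dim): super().__init__()
    def forward(self, deep, skip):
        deep = F.interpolate(deep, size=skip.shape[-2:], mode="bilinear", align_corners=False)
        return deep + skip

class UBAM(nn.Module):                                   # uncertainty boundary aware module (sketched later)
    def forward(self, feat, logits): return feat

class MSBANetSketch(nn.Module):
    """Hypothetical wiring of the MSBANet pipeline as described in the text."""
    def __init__(self, num_classes=6, dim=512):
        super().__init__()
        backbone = resnet50(weights=None)
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool)
        self.stages = nn.ModuleList([backbone.layer1, backbone.layer2,
                                     backbone.layer3, backbone.layer4])
        # 1 x 1 convolutions compress every encoder output to 512 channels
        self.reduce = nn.ModuleList([nn.Conv2d(c, dim, 1) for c in (256, 512, 1024, 2048)])
        self.mstb = MSTB(dim)
        self.fuse = nn.ModuleList([GLFM(dim) for _ in range(3)])
        self.classifier = nn.Conv2d(dim, num_classes, 1)
        self.ubam = UBAM()

    def forward(self, x):
        h, w = x.shape[-2:]
        feats, y = [], self.stem(x)
        for stage, reduce in zip(self.stages, self.reduce):
            y = stage(y)
            feats.append(reduce(y))                      # H/4, H/8, H/16, H/32 feature maps
        y = self.mstb(feats[-1])                         # multi-scale context on the deepest features
        for skip, glfm in zip(reversed(feats[:-1]), self.fuse):
            y = glfm(y, skip)                            # progressive upsampling and cross-level fusion
        coarse = self.classifier(y)
        y = self.ubam(y, coarse)                         # entropy-guided boundary refinement
        out = self.classifier(y)
        return F.interpolate(out, size=(h, w), mode="bilinear", align_corners=False)
```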

Multi-Scale transformer block

MSTB is designed to simultaneously enhance global contextual modeling and local detail refinement. As illustrated in Fig. 2, it contains two complementary submodules: MHSA and MConvGLU.

Fig. 2. Architecture of the multi-scale transformer block.

(1) MHSA.

To enhance the model's ability to extract global semantic and multi-scale structural information, we introduce a multi-scale multi-head self-attention module35. This module structurally integrates multi-scale convolutions, channel attention, and self-attention mechanisms to capture both global information and contextual details within regions exhibiting significant scale variations. Through three parallel depthwise separable dilated convolutional branches, it acquires multi-scale receptive fields ranging from local to global, enhancing the network's sensitivity to scale-varying objects. After the features from these three branches are aggregated and fused, multi-head self-attention is computed, which effectively reduces computational cost while preserving contextual diversity. Concurrently, a lightweight channel attention branch adaptively adjusts feature channel weights, further enhancing representational discriminative power. This design captures contextual information while maintaining computational efficiency.
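A simplified sketch of this idea is given below. The kernel sizes, dilation rates, the use of the aggregated multi-scale context as keys and values, and the squeeze-and-excitation-style channel branch are illustrative assumptions rather than the exact design of the referenced module35.

```python
import torch
import torch.nn as nn

class MultiScaleMHSA(nn.Module):
    """Sketch: depthwise dilated branches build multi-scale context, the fused context
    provides keys/values for self-attention, and a lightweight channel-attention branch
    re-weights the output. All layer choices here are assumptions for illustration."""
    def __init__(self, dim, heads=8, dilations=(1, 2, 3)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(dim, dim, 3, padding=d, dilation=d, groups=dim)  # depthwise, dilated
            for d in dilations
        ])
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ca = nn.Sequential(nn.AdaptiveAvgPool2d(1),               # channel attention branch
                                nn.Conv2d(dim, dim // 4, 1), nn.ReLU(inplace=True),
                                nn.Conv2d(dim // 4, dim, 1), nn.Sigmoid())

    def forward(self, x):
        b, c, h, w = x.shape
        ctx = sum(branch(x) for branch in self.branches)               # aggregated multi-scale context
        q = x.flatten(2).transpose(1, 2)                               # queries from input tokens
        kv = ctx.flatten(2).transpose(1, 2)                            # keys/values from fused context
        out, _ = self.attn(q, kv, kv)
        out = out.transpose(1, 2).reshape(b, c, h, w)
        return out * self.ca(x) + x                                    # channel re-weighting + residual
```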

(2) MConvGLU.

GLU was originally introduced as a channel mixer in NLP and was later extended to vision Transformers as ConvGLU29 by incorporating depthwise convolution into the gating branch. While ConvGLU generates token-specific gating signals based on local spatial features, offering finer control than global attention mechanisms such as SE, the standard design relies on a fixed 3 × 3 convolution. This restricts its ability to model the diverse boundary patterns and multi-scale object variations that characterize high-resolution remote sensing imagery. In particular, a single-scale convolution struggles to simultaneously capture geometric variations at different scales, and in shadowed, occluded, and low-contrast areas it is more likely to cause classification confusion at the boundaries.

To overcome these drawbacks, we propose MConvGLU, which replaces the single 3 × 3 convolution with parallel depthwise separable convolutions of sizes 3 × 3, 5 × 5, and 7 × 7. These multi-scale kernels collectively provide richer spatial priors: the 3 × 3 branch focuses on high-frequency edge details, the 5 × 5 branch captures intermediate structural cues, and the 7 × 7 branch models large-scale and blurred boundary regions. The fused multi-scale gating signals can adaptively adjust the feature weights according to the boundary width and texture complexity, thereby more effectively distinguishing real boundaries and enhancing the model’s ability to identify complex boundaries such as building outlines and vegetation edges.
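The sketch below illustrates this multi-scale gating scheme; the expansion ratio, activation function, and residual placement are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class MConvGLU(nn.Module):
    """Sketch of MConvGLU: a gated linear unit whose gating branch aggregates parallel
    depthwise convolutions (3x3, 5x5, 7x7) so the gate carries multi-scale spatial priors."""
    def __init__(self, dim, expansion=2):
        super().__init__()
        hidden = dim * expansion
        self.value = nn.Conv2d(dim, hidden, 1)                    # value branch
        self.gate = nn.Conv2d(dim, hidden, 1)                     # gate branch (pointwise)
        self.dw = nn.ModuleList([
            nn.Conv2d(hidden, hidden, k, padding=k // 2, groups=hidden)  # multi-scale depthwise
            for k in (3, 5, 7)
        ])
        self.act = nn.GELU()
        self.proj = nn.Conv2d(hidden, dim, 1)

    def forward(self, x):
        v = self.value(x)
        g = sum(dw(self.gate(x)) for dw in self.dw)               # fuse 3x3 / 5x5 / 7x7 gating signals
        return self.proj(v * self.act(g)) + x                     # gated fusion with residual
```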

By combining MHSA and MConvGLU, MSTB is capable of extracting multi-scale global semantic information and boundary structural information.

Global-Local fusion module

Shallow features extracted by the model convey rich spatial details but lack semantic information, whereas deep features encode strong semantics yet lose fine-grained resolution. Direct concatenation or linear fusion often results in redundancy or misalignment. To integrate deep semantic information with shallow spatial details during the decoding process, which progressively fuses multi-scale features, and thereby preserve boundary structure integrity, we introduce the GLFM36. The module architecture is illustrated in Fig. 3.

Fig. 3. Architecture of the global-local fusion module.

The module first upsamples the semantic features derived from the Transformer, aligning and concatenating them with the high-resolution detail features from the encoder along the channel dimension. Subsequently, a global channel modulation mechanism suppresses redundant channel responses, emphasising key structural information relevant to semantics. Building upon this, a spatially guided branch driven jointly by deep semantic and shallow detail features is introduced, which selectively enhances the fused features on a per-location basis, achieving more precise spatial localisation. Through joint adaptive modulation across the channel and spatial dimensions, GLFM simultaneously preserves semantic consistency and detail continuity during cross-layer feature fusion, ensuring boundary integrity and structural restoration quality during the multi-scale reconstruction phase.
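A minimal sketch of this fusion scheme is shown below, assuming a squeeze-and-excitation-style channel branch and a single-channel spatial gate; the exact layer configuration of GLFM36 may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GLFM(nn.Module):
    """Sketch of global-local fusion: channel modulation of the merged features plus a
    spatial gate driven jointly by the deep and shallow inputs (layer choices assumed)."""
    def __init__(self, dim):
        super().__init__()
        self.merge = nn.Conv2d(dim * 2, dim, 1)
        self.channel = nn.Sequential(nn.AdaptiveAvgPool2d(1),      # global channel modulation
                                     nn.Conv2d(dim, dim // 4, 1), nn.ReLU(inplace=True),
                                     nn.Conv2d(dim // 4, dim, 1), nn.Sigmoid())
        self.spatial = nn.Sequential(nn.Conv2d(dim * 2, 1, 3, padding=1), nn.Sigmoid())

    def forward(self, deep, shallow):
        deep = F.interpolate(deep, size=shallow.shape[-2:], mode="bilinear", align_corners=False)
        cat = torch.cat([deep, shallow], dim=1)
        fused = self.merge(cat)
        fused = fused * self.channel(fused)                        # suppress redundant channels
        return fused * self.spatial(cat) + fused                   # per-location enhancement + residual
```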

Uncertainty boundary aware module

In image semantic segmentation, pixels at the boundaries between different semantic categories are prone to category uncertainty, leading to classification errors that blur segmentation boundaries. To address this, we propose a lightweight uncertainty boundary-aware module, as shown in Fig. 4, embedding it within the final segmentation head to enhance the model’s boundary refinement capability. The module takes the predicted map as input to guide attention map generation, assigning higher attention values to regions of high uncertainty. During forward propagation, residual connections are employed to enhance focus on these areas, thereby improving boundary segmentation accuracy.

Fig. 4. Architecture of the uncertainty boundary aware module.

The core concept of UBAM is to direct attention towards regions of high prediction uncertainty, which are frequently situated at boundaries where classification categories are prone to confusion. Building upon the local contextual attention module25 originally designed for binary classification tasks, we introduce information entropy as a quantitative measure of uncertainty to guide attention generation, thereby extending its applicability to multi-class tasks. Specifically, the module generates a per-pixel attention map by analysing the entropy distribution of the predicted category probabilities. Given the input feature map and the predicted probability distribution, the module first normalises the probabilities and then calculates the entropy at each pixel location to quantify prediction uncertainty. The entropy range reflects a transition from high confidence to high uncertainty: high entropy typically occurs in ambiguous regions near object boundaries, while low entropy corresponds to high-confidence areas. The attention map is generated by normalising the entropy values, assigning higher weights to high-uncertainty regions and lower weights to confident areas. Element-wise multiplication between the input feature map and the attention map yields the weighted output. To preserve the original features and ensure training stability, the module employs a residual connection, combining the original feature map with the weighted feature map to produce the final output.
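The entropy-guided weighting can be sketched as follows; the normalisation by the maximum entropy log N and the exact residual form are assumptions consistent with the description above.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class UBAM(nn.Module):
    """Sketch of entropy-guided refinement: the attention map is the normalised per-pixel
    entropy of the predicted class probabilities, so ambiguous (boundary) regions receive
    larger weights, while a residual connection preserves the original features."""
    def forward(self, feat, logits):
        prob = F.softmax(logits, dim=1)                                 # normalise predictions
        entropy = -(prob * torch.log(prob + 1e-8)).sum(dim=1, keepdim=True)
        attn = entropy / math.log(logits.shape[1])                      # scale by max entropy -> [0, 1]
        return feat + feat * attn                                       # residual emphasis on uncertain pixels
```

In this sketch the module has no learnable parameters, so it adds negligible overhead to the segmentation head, which is consistent with the lightweight design described above.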

Through an entropy-based uncertainty modelling mechanism, the UBAM module can identify and optimise high-uncertainty boundary regions, thereby enhancing the model’s capability for boundary refinement.

Dataset and experiment setting

Dataset

1) The ISPRS Potsdam dataset consists of 38 orthophoto tiles with a ground sampling distance (GSD) of 5 cm, each tile covering 6000 × 6000 pixels. The dataset includes six semantic categories: impervious surfaces, buildings, low vegetation, trees, vehicles, and background. It provides four spectral bands (red, green, blue, and near-infrared) together with DSM data. In our experiments, only the RGB channels were adopted for training and evaluation. A total of 24 images were employed for training and 14 for testing. All images were pre-split into 1024 × 1024 patches for subsequent model training.

2) The ISPRS Vaihingen dataset provides 33 orthophoto image tiles with a ground sampling distance (GSD) of 9 cm. Each tile includes three channels (IR, R, and G) together with DSM data, and the semantic categories are impervious surfaces, buildings, low vegetation, trees, vehicles, and background. In our experiments, only the orthophotos were used. Sixteen tiles were chosen for training and seventeen for testing. All images were pre-split into 1024 × 1024 patches to construct the training samples.

The Vaihingen image dataset primarily contains small and scattered village buildings, while the Potsdam image dataset mainly consists of dense urban building complexes, and the two datasets complement each other.
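Both datasets are tiled into 1024 × 1024 patches before training. A minimal sketch of such non-overlapping tiling is shown below; the handling of incomplete border regions is an assumption, since no overlap or padding policy is specified in the text.

```python
import numpy as np

def tile_image(image, patch=1024):
    """Split an orthophoto array of shape (H, W, C) into non-overlapping patch x patch tiles,
    discarding any incomplete border region (an assumption for this sketch)."""
    h, w = image.shape[:2]
    patches = []
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            patches.append(image[y:y + patch, x:x + patch])
    return np.stack(patches)
```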

Implementation details

All experiments were conducted using PyTorch 2.0 with CUDA 11.8 on a single NVIDIA RTX 3090 GPU. The models were optimized with AdamW, initialized at a learning rate of 6 × 10⁻⁴, a weight decay of 0.01, and updated through a cosine annealing schedule. Training proceeded for 100 epochs with a batch size of 8. Since the original orthophotos had already been partitioned into 1024 × 1024 patches, these patches were directly employed as inputs. To enhance data diversity, several augmentation operations were applied during training, including random scaling (0.5, 0.75, 1.0, 1.25, 1.5), horizontal and vertical flipping, and random rotation. Random seeds were fixed across experiments to ensure reproducibility. For inference, multi-scale testing combined with flipping was adopted to improve prediction robustness.
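The optimiser and schedule settings reported above can be reproduced with a few lines of PyTorch; the seed value and helper names below are illustrative rather than taken from the original code.

```python
import random
import numpy as np
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

def set_seed(seed=42):
    """Fix random seeds for reproducibility (the seed value is an assumption)."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

def configure_training(model, epochs=100):
    """AdamW with an initial learning rate of 6e-4, weight decay 0.01,
    and cosine annealing over 100 epochs, as reported above."""
    optimizer = AdamW(model.parameters(), lr=6e-4, weight_decay=0.01)
    scheduler = CosineAnnealingLR(optimizer, T_max=epochs)
    return optimizer, scheduler
```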

Evaluation metrics

To evaluate segmentation performance, three commonly used metrics were adopted: overall accuracy (OA), mean intersection over union (mIoU), and mean F1 score. These metrics are widely employed in semantic segmentation research35, as they jointly reflect global accuracy, region-level consistency, and the balance between precision and recall. Based on the confusion matrix, the metrics are defined as follows:

$$\mathrm{OA}=\frac{\sum_{k=1}^{N}\mathrm{TP}_{k}}{\sum_{k=1}^{N}\left(\mathrm{TP}_{k}+\mathrm{FP}_{k}+\mathrm{TN}_{k}+\mathrm{FN}_{k}\right)}$$
(1)
$$\mathrm{mIoU}=\frac{1}{N}\sum_{k=1}^{N}\frac{\mathrm{TP}_{k}}{\mathrm{TP}_{k}+\mathrm{FP}_{k}+\mathrm{FN}_{k}}$$
(2)
$$\mathrm{precision}=\frac{1}{N}\sum_{k=1}^{N}\frac{\mathrm{TP}_{k}}{\mathrm{TP}_{k}+\mathrm{FP}_{k}}$$
(3)
$$\mathrm{recall}=\frac{1}{N}\sum_{k=1}^{N}\frac{\mathrm{TP}_{k}}{\mathrm{TP}_{k}+\mathrm{FN}_{k}}$$
(4)
$$\mathrm{F1}=\frac{2\times\mathrm{precision}\times\mathrm{recall}}{\mathrm{precision}+\mathrm{recall}}$$
(5)

where TPₖ, FPₖ, TNₖ, and FNₖ denote the true positives, false positives, true negatives, and false negatives for class k, respectively. All metrics range from 0 to 1, with higher values indicating better performance.
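For reference, these metrics can be computed directly from a confusion matrix as sketched below, following the standard confusion-matrix definitions underlying Eqs. (1)-(5); the epsilon guard against empty classes is an implementation assumption.

```python
import numpy as np

def metrics_from_confusion(cm):
    """OA, mIoU and mean F1 from an N x N confusion matrix
    (rows: ground truth, columns: prediction)."""
    cm = cm.astype(float)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    eps = 1e-8                                                  # guards against empty classes
    oa = tp.sum() / (cm.sum() + eps)                            # correctly classified pixels / all pixels
    miou = np.mean(tp / (tp + fp + fn + eps))                   # Eq. (2)
    precision = np.mean(tp / (tp + fp + eps))                   # Eq. (3)
    recall = np.mean(tp / (tp + fn + eps))                      # Eq. (4)
    f1 = 2 * precision * recall / (precision + recall + eps)    # Eq. (5)
    return oa, miou, f1
```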

To further quantify boundary segmentation quality, we additionally adopt the Boundary IoU (BIoU) metric37, which evaluates pixel-level accuracy within a narrow band around the ground-truth and predicted contours. Unlike region-based metrics such as mIoU and F1, BIoU is specifically designed to measure the correctness of object boundaries and has been widely used in recent semantic segmentation and remote sensing studies38, 39, 40. The metric is defined as follows:

$$\mathrm{BIoU}=\frac{\left|\left(G_{d}\cap G\right)\cap\left(P_{d}\cap P\right)\right|}{\left|\left(G_{d}\cap G\right)\cup\left(P_{d}\cap P\right)\right|}$$
(6)

where \(G\) and \(P\) denote the ground-truth and predicted binary masks. The sets \(G_{d}\) and \(P_{d}\) correspond to the pixels located within a boundary band of width \(d\) around the boundaries of \(G\) and \(P\), respectively. The parameter \(d\) controls the thickness of the boundary region used to evaluate the alignment of the predicted boundaries with the ground truth.
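A simple approximation of BIoU for binary masks is sketched below; here the inner boundary band of width d is obtained by subtracting a d-pixel erosion from each mask, which approximates G ∩ G_d and P ∩ P_d (the reference implementation37 extracts the band slightly differently).

```python
import numpy as np
from scipy.ndimage import binary_erosion

def boundary_iou(gt, pred, d=3):
    """Approximate Boundary IoU (Eq. 6) for boolean masks `gt` and `pred`."""
    structure = np.ones((3, 3), dtype=bool)
    gt_band = gt & ~binary_erosion(gt, structure, iterations=d)       # G ∩ G_d (inner band)
    pred_band = pred & ~binary_erosion(pred, structure, iterations=d) # P ∩ P_d (inner band)
    inter = np.logical_and(gt_band, pred_band).sum()
    union = np.logical_or(gt_band, pred_band).sum()
    return inter / union if union > 0 else 1.0
```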

Experiment results and analysis

Ablation

To systematically evaluate the independent contributions and synergistic effects of each module within the model, we conducted ablation experiments involving the gradual integration of UBAM, GLFM, and MSTB, whilst undertaking an in-depth analysis of the reasons behind the observed performance improvements. The detailed results are summarized in Table 1.

Table 1 Ablation study of each component of the MSBANet.

After incorporating UBAM into the baseline model, both Vaihingen and Potsdam datasets demonstrated modest improvements in mIoU, accompanied by concurrent gains in BIoU. UBAM’s ability to automatically localise high-uncertainty boundary regions within predictions using entropy metrics enables the model to proactively focus on category transitions and contour ambiguity during inference. This approach not only enhances regional classification accuracy but also markedly improves boundary localisation quality.

The introduction of GLFM yielded stable performance improvements across both datasets. However, on the BIoU metric, a slight decline occurs when GLFM is added alone. This stems from GLFM’s primary function being to align features from different scales across deep and shallow layers, rather than directly enhancing boundary sensitivity. When the encoder’s high-level semantic features have not yet acquired stronger multi-scale structural representation through MSTB, the fusion operation may introduce some interference to shallow-level local boundary details, leading to slight accuracy fluctuations in boundary regions.

The most pronounced single-module performance gains are observed in MSTB evaluations, where Vaihingen achieves a 1.20% mIoU improvement alongside a notable BIoU enhancement. MSTB combines global self-attention with a multi-scale convolutional gating structure, endowing the model with enhanced multi-scale semantic modelling capabilities and superior recognition of complex boundary morphologies. This enables it to outperform the baseline and other single modules on BIoU.

When MSTB and GLFM are employed jointly, synergistic effects emerge: MSTB furnishes richer multi-scale semantic and boundary details, while GLFM ensures these features maintain geometric consistency during cross-stage fusion. This concurrently enhances both mIoU and BIoU metrics. At this stage, GLFM no longer exhibits degradation in boundary quality; rather, the structurally clearer deep semantic features provided by MSTB render its fusion process more stable and effective.

Ultimately, integrating UBAM, GLFM, and MSTB yields the complete MSBANet architecture, which achieves mIoU gains of 2.09% and 1.22% on the two datasets respectively, alongside a substantial improvement in BIoU. The synergistic interaction among these components establishes a comprehensive model architecture, optimising overall performance.

To further evaluate the effectiveness of the proposed MConvGLU, we conducted additional ablation experiments by progressively replacing this module with its original counterparts (ConvGLU and GLU). As shown in Table 2, substituting MConvGLU with ConvGLU resulted in a slight performance decline across both datasets, whereas employing GLU—which entirely lacks local spatial modelling capabilities—led to more pronounced degradation. These findings confirm the efficacy of the improvements introduced by MConvGLU.

Table 2 Ablation study of MConvGLU.

Comparative experiments

To validate the effectiveness of the model, we conducted comparison experiments with representative remote sensing segmentation methods on the Potsdam and Vaihingen datasets. The comparative models include classical models (U-Net9, DeepLabv3+13, FCN8s8), recent CNN-Transformer hybrid models (UNetFormer7, CMTFNet35), a lightweight model (UrbanSSF41), and a multimodal model (FTransUNet21). All methods were trained and evaluated under the same hardware and software environment and experimental setup to ensure a fair comparison. The quantitative results are summarized in Table 3, while the corresponding qualitative examples are presented in Figs. 5 and 6.

Based on the experimental results on Potsdam and Vaihingen, our approach demonstrates greater stability in both overall accuracy and boundary quality. Whether evaluated using region metrics such as mIoU and mean F1, or the BIoU metric, which more accurately reflects boundary performance, our method achieves a distinct advantage. Visualisation results indicate that this advantage is particularly pronounced in boundary-sensitive regions such as building roofs, road edges, tree canopy outlines, and vehicles. In contrast, other models frequently exhibit issues such as boundary jitter, contour discontinuities, or small-object coalescence, whereas our predictions generally adhere more closely to the actual structure.

These distinctions stem from differing feature representation approaches. Traditional CNN models (e.g., UNet, FCN) suffer from the locality of convolution, leading to excessive smoothing of details in scenes with large scale variations or complex shapes, while DeepLabv3+ tends to produce boundary jaggedness during its upsampling stage. Whilst CNN-Transformer approaches (e.g., UNetFormer, CMTFNet) possess global modelling capabilities, their boundary information remains predominantly derived from shallow features; consequently, when shallow textures or local structures are insufficient, the resulting boundaries are less stable. FTransUNet, which fuses RGB and DSM modalities, theoretically captures richer information in vegetation and shadowed areas; however, multimodal fusion introduces additional feature alignment challenges, resulting in boundary stability that remains inferior to that of our model.

By contrast, our approach enhances the model’s inherent understanding of boundaries through structural refinement. MSTB addresses Transformers’ insufficient attention to local structures through a multi-scale convolutional gating mechanism; GLFM emphasises geometric consistency during fusion, reducing boundary shifts caused by cross-scale misalignment; UBAM refines high-uncertainty regions such as shadows, overlaps, and blurred boundaries, yielding clearer and more stable final boundaries. Results demonstrate that this multi-level structural reinforcement proves more robust than simply stacking additional modalities or deeper network architectures. It is better suited to remote sensing scenarios with scale variations, yielding finer boundary segmentation outcomes.

Table 3 Quantitative comparison results on the Potsdam and Vaihingen test sets.
Fig. 5. Visual comparisons on the ISPRS Potsdam dataset.

Fig. 6. Visual comparisons on the ISPRS Vaihingen dataset.

Conclusions

In this study, we propose a CNN-Transformer hybrid architecture termed MSBANet, which enhances high-resolution remote sensing image segmentation performance by strengthening multi-scale representation and boundary modelling capabilities. Through the synergistic design of MSTB, GLFM and UBAM, the model achieves clearer object contours, superior structural consistency and greater robustness in complex regions. Experiments on the Potsdam and Vaihingen datasets demonstrate that MSBANet consistently outperforms existing methods in both region-level accuracy and boundary quality.

Despite these advantages, MSBANet has limitations: the added multi-scale and fusion modules increase computational cost compared with lightweight networks, and the reliance on single-modal optical imagery may compromise robustness under extreme illumination or sparse-texture conditions. Future research will explore more efficient architectures and integrate complementary modalities to further enhance boundary stability and generalisation.