Introduction

With the rapid development of sensor technology, the spatial detail, spectral resolution and imaging accuracy of remote sensing images have improved significantly, making them an important data source for obtaining surface information. They play a crucial role in fields such as urban planning, land cover mapping, and ecological and environmental monitoring1, 2, 3. As one of the core tasks in remote sensing scene understanding, semantic segmentation classifies each pixel precisely, providing key support for object recognition, detailed mapping and intelligent decision-making4, 5.

However, in high-resolution remote sensing images, object categories such as buildings, roads, water bodies and vegetation usually exhibit significant scale differences and ambiguous boundaries, so accurate semantic segmentation remains a major challenge6. Traditional image segmentation methods that rely on handcrafted features or shallow models are inadequate for this task. In this context, deep learning approaches have demonstrated outstanding performance, benefiting from their strong feature representation ability7. CNN-based models can effectively extract local spatial detail features and have achieved excellent performance on several publicly available remote sensing datasets. For example, Fully Convolutional Networks (FCNs)8, U-Net9, Seg-Net10 and the DeepLab series11, 12, 13 have become mainstream methods. However, due to the limited receptive field of convolution, these methods perform poorly in capturing global contextual correlations. Recently, Transformer architectures have shown notable effectiveness in capturing long-range dependencies through their attention mechanism and have been widely applied to remote sensing imagery14. Transformers have demonstrated the potential to outperform traditional CNNs in building extraction15, vegetation detection16, and Unmanned Aerial Vehicle-based aerial scene parsing17. However, their computational complexity grows quadratically with image resolution, and they struggle to capture local spatial details in high-resolution imagery.

Recent research indicates that integrating CNNs with Transformer models in image semantic segmentation can leverage the strengths of both approaches while mitigating their respective limitations, thereby enhancing overall model performance. ST-UNet18 embeds the Swin Transformer19 into the U-Net to fuse contextual and local features, though its edge segmentation remains limited. UNetFormer7 combines U-Net and Transformer modules to improve semantic representation in complex urban scenes, but its multi-scale fusion strategy still requires refinement. Zhang et al.20 proposed a representative CNN–Transformer hybrid model for high-resolution imagery, where the encoder employs a Swin Transformer and the decoder uses a CNN-based ASPP module, further enhanced with skip connections, squeeze-and-excitation attention, and an auxiliary boundary detection branch. More recently, FTransUNet21 proposed a hierarchical multimodal fusion strategy, employing CNN-based shallow fusion for local details and Transformer-based deep fusion for global semantics, thereby leveraging the strengths of both paradigms across modalities.

Despite the significant advances in remote sensing semantic segmentation achieved by existing CNN, Transformer, and hybrid approaches, high-resolution imagery remains challenging: insufficient multi-scale feature characterisation prevents models from capturing boundary features, and classification errors at the junctions of different semantic categories lead to blurred segmentation boundaries. Furthermore, recent research on remote sensing image segmentation has underscored the importance of boundary-aware modelling in dense prediction tasks22, 23. These considerations motivated us to develop a unified framework that synergistically enhances both multi-scale representation and boundary awareness. To address these challenges, we propose MSBANet, a specially designed CNN-Transformer hybrid framework. Unlike existing CNN-Transformer hybrid models that mainly focus on global-local feature fusion, our framework is a boundary-oriented hybrid architecture built upon three complementary components: multi-scale modeling, boundary consistency preservation, and uncertainty-based refinement. It is designed to enhance multi-scale representation and boundary accuracy in high-resolution remote sensing imagery. Its principal contributions are summarised as follows:

  • We propose MSBANet, a semantic segmentation architecture that combines multi-scale boundary modeling with uncertainty-aware edge enhancement.

  • We design a collaborative multi-scale mechanism consisting of MSTB and GLFM. MSTB enhances the model's ability to capture features with significant scale variation in category edge regions, while GLFM performs cross-level feature alignment, ensuring the consistency of boundary structures during the multi-scale reconstruction process.

  • We propose UBAM, which quantifies prediction uncertainty using information entropy and directs attention to ambiguous boundary regions. UBAM is embedded into the segmentation head to enhance the model's boundary refinement capability.

Related work

Semantic segmentation methods based on CNN

The objective of semantic segmentation is to assign semantic labels at the pixel level. Unlike traditional pixel-wise classification, CNN-based approaches leverage spatial and contextual features to achieve higher accuracy and stability. FCNs8 marked a milestone by establishing the foundation for CNN-based models. Subsequently, encoder-decoder architectures such as U-Net9, Seg-Net10, and PSPNet24 were proposed, which integrate high-level semantics with fine-grained spatial details through skip connections. Later, the DeepLab series11, 12, 13 introduced the Atrous Spatial Pyramid Pooling (ASPP) module, which further improves segmentation accuracy by capturing multi-scale features to integrate contextual information. Other CNN-based variants, including context-aware and adaptive feature selection modules25, have also been explored to improve local detail representation.

Although significant progress has been made, CNN-based approaches continue to encounter difficulties in complex urban remote sensing scenarios. Due to their restricted receptive field and insufficient ability to model global dependencies, they frequently misidentify small buildings, subtle textures, and occluded targets, thereby reducing segmentation accuracy.

Semantic segmentation methods based on transformer

Capturing global semantic information is essential for accurate semantic segmentation, and Transformer-based approaches have therefore attracted increasing attention. By incorporating self-attention modules, these approaches successfully capture multi-scale features and long-range dependencies, thereby enhancing segmentation accuracy in remote sensing imagery.

The Vision Transformer (ViT)26 splits an image into non-overlapping patches and leverages self-attention to model contextual relationships, demonstrating the potential of Transformers for visual tasks. Inspired by this, SETR27 employed a Transformer encoder for semantic segmentation, showing flexibility in combination with different decoders. Subsequently, the Swin Transformer19 introduced shifted windows and a hierarchical design, which greatly reduces computational complexity while improving multi-scale feature representation, making it more suitable for high-resolution dense prediction. Other variants, such as CSwin28 and TransNext29, further optimized the attention mechanism or robustness for specific vision tasks, indicating the rapid evolution of Transformer-based segmentation networks. Recently, lightweight Transformer architectures have also been explored for large-scale urban mapping, such as the global UOS segmentation framework30, demonstrating strong cross-region generalization through efficient multi-scale context modeling. In addition, specialized Transformer frameworks have emerged for complex and highly heterogeneous scenes, such as the DualFormer model31, which employs pyramidal Transformer encoding to better capture multi-scale semantic variations in irregular urban informal settlements.

Despite these advances, Transformer models still face critical limitations. The self-attention mechanism requires computing pairwise relationships across all tokens, leading to quadratic complexity with image resolution and substantial memory consumption. Moreover, Transformers focus primarily on global context but are less effective in capturing local features. These limitations have encouraged the design of hybrid architectures that leverage CNNs for spatial detail and Transformers for long-range dependencies, particularly in high-resolution remote sensing segmentation.

Semantic segmentation methods combining CNN and transformer

Traditional CNNs are highly effective at extracting local spatial details, but their capacity to model global context is limited. Although Transformer-based methods can capture long-range dependencies, they often ignore fine structures and boundary information and have high computational complexity. In semantic segmentation tasks, both global and local information are crucial for accurately understanding the semantic structure of images, so researchers have begun to combine CNN and Transformer models to fully leverage their respective advantages. Chen et al. proposed CTSeg32, a CNN and ViT collaborative segmentation framework following the encoder-decoder architecture for land-use/land-cover classification of high-resolution remote sensing images. By simultaneously leveraging the local features of the CNN and the global dependencies of the Transformer through multi-scale self-attention and bidirectional knowledge distillation, it improves the accuracy of remote sensing semantic segmentation. Fan et al.33 proposed a Multidimensional Information Fusion Network for high-resolution remote sensing image segmentation that integrates CNN and Transformer branches and introduces frequency information. Through multi-scale integration of local details, global semantics, and frequency-domain features, it strengthens the recognition of fine-grained boundaries and scale-varying targets, effectively alleviating complex interference factors such as inter-class ambiguity. The hybrid attention network proposed by Wang et al.34 adaptively fuses multi-scale local features and long-range context relations through channel-spatial attention, and enhances the recognition of structurally similar and boundary-ambiguous regions through a global cross-fusion module, further highlighting the advantages of the CNN-Transformer hybrid architecture in multi-scale modeling and complex boundary depiction.

In summary, although significant progress has been made in the field of remote sensing semantic segmentation, key challenges remain in effectively integrating the strengths of convolutional neural networks and Transformers, enhancing boundary recognition capabilities, and further improving segmentation accuracy.

Method

Network architecture

Fig. 1. The overall network architecture of MSBANet.

The overall architecture of MSBANet is illustrated in Fig. 1. We employ a CNN-based ResNet50 as the encoder, a backbone network that has been extensively validated in the field of remote sensing semantic segmentation. It comprises four residual stages, each performing downsampling and generating feature maps with resolutions of H/4 × W/4, H/8 × W/8, H/16 × W/16, and H/32 × W/32, respectively, featuring channel counts of 256, 512, 1024, and 2048.

To maintain computational efficiency and facilitate multi-scale feature fusion, we first apply a 1 × 1 convolution to compress all encoder outputs to 512 channels in the decoder, and this channel dimension is preserved throughout all decoding stages. Subsequently, the MSTB module within the decoder extracts multi-scale information, which is progressively upsampled via GLFM and integrated with the corresponding local spatial features from the encoder, ensuring boundary structure consistency throughout the multi-scale reconstruction process. Finally, a UBAM module is embedded within the segmentation head to generate more accurate segmentation maps at the same resolution as the original input.
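The following PyTorch sketch illustrates this data flow under stated assumptions: MSTB, GLFM and UBAM are stubbed with placeholder behaviour (their internals are outlined in the subsections below), and the wiring of the segmentation head is a simplification of the description above rather than the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50

# Stand-ins for the modules detailed in the following subsections.
class MSTB(nn.Module):                                   # multi-scale Transformer block (sketched later)
    def __init__(self, dim): super().__init__()
    def forward(self, x): return x

class GLFM(nn.Module):                                   # global-local fusion module (sketched later)
    def __init__(self, dim): super().__init__()
    def forward(self, deep, skip):
        deep = F.interpolate(deep, size=skip.shape[-2:], mode="bilinear", align_corners=False)
        return deep + skip

class UBAM(nn.Module):                                   # uncertainty boundary aware module (sketched later)
    def forward(self, feat, logits): return feat

class MSBANetSketch(nn.Module):
    """Hypothetical wiring of the MSBANet pipeline as described in the text."""
    def __init__(self, num_classes=6, dim=512):
        super().__init__()
        backbone = resnet50(weights=None)
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool)
        self.stages = nn.ModuleList([backbone.layer1, backbone.layer2,
                                     backbone.layer3, backbone.layer4])
        # 1 x 1 convolutions compress every encoder output to 512 channels
        self.reduce = nn.ModuleList([nn.Conv2d(c, dim, 1) for c in (256, 512, 1024, 2048)])
        self.mstb = MSTB(dim)
        self.fuse = nn.ModuleList([GLFM(dim) for _ in range(3)])
        self.classifier = nn.Conv2d(dim, num_classes, 1)
        self.ubam = UBAM()

    def forward(self, x):
        h, w = x.shape[-2:]
        feats, y = [], self.stem(x)
        for stage, reduce in zip(self.stages, self.reduce):
            y = stage(y)
            feats.append(reduce(y))                      # H/4, H/8, H/16, H/32 feature maps
        y = self.mstb(feats[-1])                         # multi-scale context on the deepest features
        for skip, glfm in zip(reversed(feats[:-1]), self.fuse):
            y = glfm(y, skip)                            # progressive upsampling and cross-level fusion
        coarse = self.classifier(y)
        y = self.ubam(y, coarse)                         # entropy-guided boundary refinement
        out = self.classifier(y)
        return F.interpolate(out, size=(h, w), mode="bilinear", align_corners=False)
```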

Multi-Scale transformer block

MSTB is designed to simultaneously enhance global contextual modeling and local detail refinement. As illustrated in Fig. 2, it contains two complementary submodules: MHSA and MConvGLU.

Fig. 2. Architecture of the multi-scale transformer block.

(1) MHSA.

To enhance the model's ability to extract global semantic and multi-scale structural information, we introduce a multi-scale multi-head self-attention module35. This module structurally integrates multi-scale convolutions, channel attention, and self-attention mechanisms to capture both global information and contextual details within regions exhibiting significant scale variations. Through three parallel depthwise separable dilated convolutional branches, it acquires multi-scale receptive fields ranging from local to global, enhancing the network's sensitivity to scale-varying objects. After the features from these three branches are aggregated and fused, multi-head self-attention is computed, which effectively reduces computational cost while preserving contextual diversity. Concurrently, a lightweight channel attention branch adaptively adjusts feature channel weights, further enhancing representational discriminative power. This design captures contextual information while maintaining computational efficiency.
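A simplified sketch of this idea is given below. The kernel sizes, dilation rates, the use of the aggregated multi-scale context as keys and values, and the squeeze-and-excitation-style channel branch are illustrative assumptions rather than the exact design of the referenced module35.

```python
import torch
import torch.nn as nn

class MultiScaleMHSA(nn.Module):
    """Sketch: depthwise dilated branches build multi-scale context, the fused context
    provides keys/values for self-attention, and a lightweight channel-attention branch
    re-weights the output. All layer choices here are assumptions for illustration."""
    def __init__(self, dim, heads=8, dilations=(1, 2, 3)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(dim, dim, 3, padding=d, dilation=d, groups=dim)  # depthwise, dilated
            for d in dilations
        ])
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ca = nn.Sequential(nn.AdaptiveAvgPool2d(1),               # channel attention branch
                                nn.Conv2d(dim, dim // 4, 1), nn.ReLU(inplace=True),
                                nn.Conv2d(dim // 4, dim, 1), nn.Sigmoid())

    def forward(self, x):
        b, c, h, w = x.shape
        ctx = sum(branch(x) for branch in self.branches)               # aggregated multi-scale context
        q = x.flatten(2).transpose(1, 2)                               # queries from input tokens
        kv = ctx.flatten(2).transpose(1, 2)                            # keys/values from fused context
        out, _ = self.attn(q, kv, kv)
        out = out.transpose(1, 2).reshape(b, c, h, w)
        return out * self.ca(x) + x                                    # channel re-weighting + residual
```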

(2) MConvGLU.

GLU was originally introduced as a channel mixer in NLP and was later extended to vision Transformers as ConvGLU29 by incorporating depthwise convolution into the gating branch. While ConvGLU generates token-specific gating signals based on local spatial features, offering finer control than global attention mechanisms such as SE, the standard design relies on a fixed 3 × 3 convolution. This restricts its ability to model the diverse boundary patterns and multi-scale object variations that characterize high-resolution remote sensing imagery. In particular, a single-scale convolution struggles to simultaneously capture geometric variations at different scales, and in shadowed, occluded, and low-contrast areas it is more likely to cause classification confusion at the boundaries.

To overcome these drawbacks, we propose MConvGLU, which replaces the single 3 × 3 convolution with parallel depthwise separable convolutions of sizes 3 × 3, 5 × 5, and 7 × 7. These multi-scale kernels collectively provide richer spatial priors: the 3 × 3 branch focuses on high-frequency edge details, the 5 × 5 branch captures intermediate structural cues, and the 7 × 7 branch models large-scale and blurred boundary regions. The fused multi-scale gating signals can adaptively adjust the feature weights according to the boundary width and texture complexity, thereby more effectively distinguishing real boundaries and enhancing the model’s ability to identify complex boundaries such as building outlines and vegetation edges.
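The sketch below illustrates this multi-scale gating scheme; the expansion ratio, activation function, and residual placement are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class MConvGLU(nn.Module):
    """Sketch of MConvGLU: a gated linear unit whose gating branch aggregates parallel
    depthwise convolutions (3x3, 5x5, 7x7) so the gate carries multi-scale spatial priors."""
    def __init__(self, dim, expansion=2):
        super().__init__()
        hidden = dim * expansion
        self.value = nn.Conv2d(dim, hidden, 1)                    # value branch
        self.gate = nn.Conv2d(dim, hidden, 1)                     # gate branch (pointwise)
        self.dw = nn.ModuleList([
            nn.Conv2d(hidden, hidden, k, padding=k // 2, groups=hidden)  # multi-scale depthwise
            for k in (3, 5, 7)
        ])
        self.act = nn.GELU()
        self.proj = nn.Conv2d(hidden, dim, 1)

    def forward(self, x):
        v = self.value(x)
        g = sum(dw(self.gate(x)) for dw in self.dw)               # fuse 3x3 / 5x5 / 7x7 gating signals
        return self.proj(v * self.act(g)) + x                     # gated fusion with residual
```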

By combining MHSA and MConvGLU, MSTB is capable of extracting multi-scale global semantic information and boundary structural information.

Global-Local fusion module

Shallow features extracted by the model convey rich spatial details but lack semantic information, whereas deep features encode strong semantics yet lose fine-grained resolution. Direct concatenation or linear fusion often results in redundancy or misalignment. To integrate deep semantic information with shallow spatial details during the decoding process, which progressively fuses multi-scale features, and thereby preserve boundary structure integrity, we introduce the GLFM36. The module architecture is illustrated in Fig. 3.

Fig. 3. Architecture of the global-local fusion module.

The module first upsamples the semantic features derived from the Transformer, aligning and concatenating them with the high-resolution detail features from the encoder along the channel dimension. Subsequently, a global channel modulation mechanism suppresses redundant channel responses, emphasising key structural information relevant to semantics. Building upon this, a spatially guided branch driven jointly by deep semantic and shallow detail features is introduced, which selectively enhances the fused features on a per-location basis, achieving more precise spatial localisation. Through joint adaptive modulation across the channel and spatial dimensions, GLFM simultaneously preserves semantic consistency and detail continuity during cross-layer feature fusion, ensuring boundary integrity and structural restoration quality during the multi-scale reconstruction phase.
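A minimal sketch of this fusion scheme is shown below, assuming a squeeze-and-excitation-style channel branch and a single-channel spatial gate; the exact layer configuration of GLFM36 may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GLFM(nn.Module):
    """Sketch of global-local fusion: channel modulation of the merged features plus a
    spatial gate driven jointly by the deep and shallow inputs (layer choices assumed)."""
    def __init__(self, dim):
        super().__init__()
        self.merge = nn.Conv2d(dim * 2, dim, 1)
        self.channel = nn.Sequential(nn.AdaptiveAvgPool2d(1),      # global channel modulation
                                     nn.Conv2d(dim, dim // 4, 1), nn.ReLU(inplace=True),
                                     nn.Conv2d(dim // 4, dim, 1), nn.Sigmoid())
        self.spatial = nn.Sequential(nn.Conv2d(dim * 2, 1, 3, padding=1), nn.Sigmoid())

    def forward(self, deep, shallow):
        deep = F.interpolate(deep, size=shallow.shape[-2:], mode="bilinear", align_corners=False)
        cat = torch.cat([deep, shallow], dim=1)
        fused = self.merge(cat)
        fused = fused * self.channel(fused)                        # suppress redundant channels
        return fused * self.spatial(cat) + fused                   # per-location enhancement + residual
```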

Uncertainty boundary aware module

In image semantic segmentation, pixels at the boundaries between different semantic categories are prone to category uncertainty, leading to classification errors that blur segmentation boundaries. To address this, we propose a lightweight uncertainty boundary-aware module, as shown in Fig. 4, embedding it within the final segmentation head to enhance the model’s boundary refinement capability. The module takes the predicted map as input to guide attention map generation, assigning higher attention values to regions of high uncertainty. During forward propagation, residual connections are employed to enhance focus on these areas, thereby improving boundary segmentation accuracy.

Fig. 4. Architecture of the uncertainty boundary aware module.

The core concept of UBAM is to direct attention towards regions of high prediction uncertainty, which are frequently situated at boundaries where classification categories are prone to confusion. Building upon the local contextual attention module25 originally designed for binary classification tasks, we introduce information entropy as a quantitative measure of uncertainty to guide attention generation, thereby extending its applicability to multi-class tasks. Specifically, the module generates a per-pixel attention map by analysing the entropy distribution of the predicted category probabilities. Given the input feature map and the predicted probability distribution, the module first normalises the probabilities and then calculates the entropy at each pixel location to quantify prediction uncertainty. The entropy range reflects a transition from high confidence to high uncertainty: high entropy typically occurs in ambiguous regions near object boundaries, while low entropy corresponds to high-confidence areas. The attention map is generated by normalising the entropy values, assigning higher weights to high-uncertainty regions and lower weights to confident areas. Element-wise multiplication between the input feature map and the attention map yields the weighted output. To preserve the original features and ensure training stability, the module employs a residual connection, combining the original feature map with the weighted feature map to produce the final output.
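The entropy-guided weighting can be sketched as follows; the normalisation by the maximum entropy log N and the exact residual form are assumptions consistent with the description above.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class UBAM(nn.Module):
    """Sketch of entropy-guided refinement: the attention map is the normalised per-pixel
    entropy of the predicted class probabilities, so ambiguous (boundary) regions receive
    larger weights, while a residual connection preserves the original features."""
    def forward(self, feat, logits):
        prob = F.softmax(logits, dim=1)                                 # normalise predictions
        entropy = -(prob * torch.log(prob + 1e-8)).sum(dim=1, keepdim=True)
        attn = entropy / math.log(logits.shape[1])                      # scale by max entropy -> [0, 1]
        return feat + feat * attn                                       # residual emphasis on uncertain pixels
```

In this sketch the module has no learnable parameters, so it adds negligible overhead to the segmentation head, which is consistent with the lightweight design described above.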

Through an entropy-based uncertainty modelling mechanism, the UBAM module can identify and optimise high-uncertainty boundary regions, thereby enhancing the model’s capability for boundary refinement.

Dataset and experiment setting

Dataset

1) The ISPRS Potsdam dataset consists of 38 orthophoto tiles with a ground sampling distance (GSD) of 5 cm, each tile covering 6000 × 6000 pixels. The dataset includes six semantic categories: impervious surfaces, buildings, low vegetation, trees, vehicles, and background. It provides four spectral bands (red, green, blue, and near-infrared) together with DSM data. In our experiments, only the RGB channels were adopted for training and evaluation. A total of 24 images were employed for training and 14 for testing. All images were pre-split into 1024 × 1024 patches for subsequent model training.

2) The ISPRS Vaihingen dataset provides 33 orthophoto image tiles with a ground sampling distance (GSD) of 9 cm. Each tile includes three channels (IR, R, and G) together with DSM data, and the semantic categories are impervious surfaces, buildings, low vegetation, trees, vehicles, and background. In our experiments, only the orthophotos were used. Sixteen tiles were chosen for training and seventeen for testing. All images were pre-split into 1024 × 1024 patches to construct the training samples.

The Vaihingen image dataset primarily contains small and scattered village buildings, while the Potsdam image dataset mainly consists of dense urban building complexes, and the two datasets complement each other.
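Both datasets are tiled into 1024 × 1024 patches before training. A minimal sketch of such non-overlapping tiling is shown below; the handling of incomplete border regions is an assumption, since no overlap or padding policy is specified in the text.

```python
import numpy as np

def tile_image(image, patch=1024):
    """Split an orthophoto array of shape (H, W, C) into non-overlapping patch x patch tiles,
    discarding any incomplete border region (an assumption for this sketch)."""
    h, w = image.shape[:2]
    patches = []
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            patches.append(image[y:y + patch, x:x + patch])
    return np.stack(patches)
```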

Implementation details

All experiments were conducted using PyTorch 2.0 with CUDA 11.8 on a single NVIDIA RTX 3090 GPU. The models were optimized with AdamW, initialized at a learning rate of 6 × 10⁻⁴, a weight decay of 0.01, and updated through a cosine annealing schedule. Training proceeded for 100 epochs with a batch size of 8. Since the original orthophotos had already been partitioned into 1024 × 1024 patches, these patches were directly employed as inputs. To enhance data diversity, several augmentation operations were applied during training, including random scaling (0.5, 0.75, 1.0, 1.25, 1.5), horizontal and vertical flipping, and random rotation. Random seeds were fixed across experiments to ensure reproducibility. For inference, multi-scale testing combined with flipping was adopted to improve prediction robustness.
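The optimiser and schedule settings reported above can be reproduced with a few lines of PyTorch; the seed value and helper names below are illustrative rather than taken from the original code.

```python
import random
import numpy as np
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

def set_seed(seed=42):
    """Fix random seeds for reproducibility (the seed value is an assumption)."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

def configure_training(model, epochs=100):
    """AdamW with an initial learning rate of 6e-4, weight decay 0.01,
    and cosine annealing over 100 epochs, as reported above."""
    optimizer = AdamW(model.parameters(), lr=6e-4, weight_decay=0.01)
    scheduler = CosineAnnealingLR(optimizer, T_max=epochs)
    return optimizer, scheduler
```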

Evaluation metrics

To evaluate segmentation performance, three commonly used metrics were adopted: overall accuracy (OA), mean intersection over union (mIoU), and mean F1 score. These metrics are widely employed in semantic segmentation research35, as they jointly reflect global accuracy, region-level consistency, and the balance between precision and recall. Based on the confusion matrix, the metrics are defined as follows:

$$\mathrm{OA}=\frac{\sum_{k=1}^{N}\mathrm{TP}_{k}}{\sum_{k=1}^{N}\left(\mathrm{TP}_{k}+\mathrm{FP}_{k}+\mathrm{TN}_{k}+\mathrm{FN}_{k}\right)}$$
(1)
$$\mathrm{mIoU}=\frac{1}{N}\sum_{k=1}^{N}\frac{\mathrm{TP}_{k}}{\mathrm{TP}_{k}+\mathrm{FP}_{k}+\mathrm{FN}_{k}}$$
(2)
$$\mathrm{precision}=\frac{1}{N}\sum_{k=1}^{N}\frac{\mathrm{TP}_{k}}{\mathrm{TP}_{k}+\mathrm{FP}_{k}}$$
(3)
$$\mathrm{recall}=\frac{1}{N}\sum_{k=1}^{N}\frac{\mathrm{TP}_{k}}{\mathrm{TP}_{k}+\mathrm{FN}_{k}}$$
(4)
$$\mathrm{F1}=\frac{2\times\mathrm{precision}\times\mathrm{recall}}{\mathrm{precision}+\mathrm{recall}}$$
(5)

where TPₖ, FPₖ, TNₖ, and FNₖ denote the true positives, false positives, true negatives, and false negatives for class k, respectively. All metrics range from 0 to 1, with higher values indicating better performance.
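For reference, these metrics can be computed directly from a confusion matrix as sketched below, following the standard confusion-matrix definitions underlying Eqs. (1)-(5); the epsilon guard against empty classes is an implementation assumption.

```python
import numpy as np

def metrics_from_confusion(cm):
    """OA, mIoU and mean F1 from an N x N confusion matrix
    (rows: ground truth, columns: prediction)."""
    cm = cm.astype(float)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    eps = 1e-8                                                  # guards against empty classes
    oa = tp.sum() / (cm.sum() + eps)                            # correctly classified pixels / all pixels
    miou = np.mean(tp / (tp + fp + fn + eps))                   # Eq. (2)
    precision = np.mean(tp / (tp + fp + eps))                   # Eq. (3)
    recall = np.mean(tp / (tp + fn + eps))                      # Eq. (4)
    f1 = 2 * precision * recall / (precision + recall + eps)    # Eq. (5)
    return oa, miou, f1
```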

To further quantify boundary segmentation quality, we additionally adopt the Boundary IoU (BIoU) metric37, which evaluates pixel-level accuracy within a narrow band around the ground-truth and predicted contours. Unlike region-based metrics such as mIoU and F1, BIoU is specifically designed to measure the correctness of object boundaries and has been widely used in recent semantic segmentation and remote sensing studies38, 39, 40. The metric is defined as follows:

$$\mathrm{BIoU}=\frac{\left|\left(G_{d}\cap G\right)\cap\left(P_{d}\cap P\right)\right|}{\left|\left(G_{d}\cap G\right)\cup\left(P_{d}\cap P\right)\right|}$$
(6)

where \(G\) and \(P\) denote the ground-truth and predicted binary masks. The sets \(G_{d}\) and \(P_{d}\) correspond to the pixels located within a boundary band of width \(d\) around the boundaries of \(G\) and \(P\), respectively. The parameter \(d\) controls the thickness of the boundary region used to evaluate the alignment of the predicted boundaries with the ground truth.
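A simple approximation of BIoU for binary masks is sketched below; here the inner boundary band of width d is obtained by subtracting a d-pixel erosion from each mask, which approximates G ∩ G_d and P ∩ P_d (the reference implementation37 extracts the band slightly differently).

```python
import numpy as np
from scipy.ndimage import binary_erosion

def boundary_iou(gt, pred, d=3):
    """Approximate Boundary IoU (Eq. 6) for boolean masks `gt` and `pred`."""
    structure = np.ones((3, 3), dtype=bool)
    gt_band = gt & ~binary_erosion(gt, structure, iterations=d)       # G ∩ G_d (inner band)
    pred_band = pred & ~binary_erosion(pred, structure, iterations=d) # P ∩ P_d (inner band)
    inter = np.logical_and(gt_band, pred_band).sum()
    union = np.logical_or(gt_band, pred_band).sum()
    return inter / union if union > 0 else 1.0
```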

Experiment results and analysis

Ablation

To systematically evaluate the independent contributions and synergistic effects of each module within the model, we conducted ablation experiments involving the gradual integration of UBAM, GLFM, and MSTB, whilst undertaking an in-depth analysis of the reasons behind the observed performance improvements. The detailed results are summarized in Table 1.

Table 1 Ablation study of each component of the MSBANet.

After incorporating UBAM into the baseline model, both Vaihingen and Potsdam datasets demonstrated modest improvements in mIoU, accompanied by concurrent gains in BIoU. UBAM’s ability to automatically localise high-uncertainty boundary regions within predictions using entropy metrics enables the model to proactively focus on category transitions and contour ambiguity during inference. This approach not only enhances regional classification accuracy but also markedly improves boundary localisation quality.

The introduction of GLFM yielded stable performance improvements across both datasets. However, on the BIoU metric, a slight decline occurs when GLFM is added alone. This stems from GLFM’s primary function being to align features from different scales across deep and shallow layers, rather than directly enhancing boundary sensitivity. When the encoder’s high-level semantic features have not yet acquired stronger multi-scale structural representation through MSTB, the fusion operation may introduce some interference to shallow-level local boundary details, leading to slight accuracy fluctuations in boundary regions.

The most pronounced single-module performance gains are observed in MSTB evaluations, where Vaihingen achieves a 1.20% mIoU improvement alongside a notable BIoU enhancement. MSTB combines global self-attention with a multi-scale convolutional gating structure, endowing the model with enhanced multi-scale semantic modelling capabilities and superior recognition of complex boundary morphologies. This enables it to outperform the baseline and other single modules on BIoU.

When MSTB and GLFM are employed jointly, synergistic effects emerge: MSTB furnishes richer multi-scale semantic and boundary details, while GLFM ensures these features maintain geometric consistency during cross-stage fusion. This concurrently enhances both mIoU and BIoU metrics. At this stage, GLFM no longer exhibits degradation in boundary quality; rather, the structurally clearer deep semantic features provided by MSTB render its fusion process more stable and effective.

Ultimately, integrating UBAM, GLFM, and MSTB yields the complete MSBANet architecture, which achieves mIoU gains of 2.09% and 1.22% on the two datasets respectively, alongside a substantial improvement in BIoU. The synergistic interaction among these components establishes a comprehensive model architecture, optimising overall performance.

To further evaluate the effectiveness of the proposed MConvGLU, we conducted additional ablation experiments by progressively replacing this module with its original counterparts (ConvGLU and GLU). As shown in Table 2, substituting MConvGLU with ConvGLU resulted in a slight performance decline across both datasets, whereas employing GLU—which entirely lacks local spatial modelling capabilities—led to more pronounced degradation. These findings confirm the efficacy of the improvements introduced by MConvGLU.

Table 2 Ablation study of MConvGLU.

Comparative experiments

To validate the effectiveness of the model, we conducted comparison experiments with representative remote sensing segmentation methods on the Potsdam and Vaihingen datasets. The comparative models include classical models (U-Net9, DeepLabv3+13, FCN8s8), recent CNN-Transformer hybrid models (UNetFormer7, CMTFNet35), a lightweight model (UrbanSSF41), and a multimodal model (FTransUNet21). All methods were trained and evaluated under the same hardware and software environment and experimental setup to ensure a fair comparison. The quantitative results are summarized in Table 3, while the corresponding qualitative examples are presented in Figs. 5 and 6.

Based on the experimental results on Potsdam and Vaihingen, our approach demonstrates greater stability in both overall accuracy and boundary quality. Whether evaluated using region metrics such as mIoU and mean F1, or the BIoU metric, which more accurately reflects boundary performance, our method achieves a distinct advantage. Visualisation results indicate that this advantage is particularly pronounced in boundary-sensitive regions such as building roofs, road edges, tree canopy outlines, and vehicles. In contrast, other models frequently exhibit issues such as boundary jitter, contour discontinuities, or small-object coalescence, whereas our predictions generally adhere more closely to the actual structure.

These distinctions stem from differing feature representation approaches. Traditional CNN models (e.g., UNet, FCN) suffer from the locality of convolution, leading to excessive smoothing of details in scenes with large scale variations or complex shapes, while DeepLabv3+ tends to produce boundary jaggedness during its upsampling stage. Whilst CNN-Transformer approaches (e.g., UNetFormer, CMTFNet) possess global modelling capabilities, their boundary information remains predominantly derived from shallow features; consequently, when shallow textures or local structures are insufficient, the resulting boundaries are less stable. FTransUNet, which fuses RGB and DSM modalities, theoretically captures richer information in vegetation and shadowed areas; however, multimodal fusion introduces additional feature alignment challenges, resulting in boundary stability that remains inferior to that of our model.

By contrast, our approach enhances the model’s inherent understanding of boundaries through structural refinement. MSTB addresses Transformers’ insufficient attention to local structures through a multi-scale convolutional gating mechanism; GLFM emphasises geometric consistency during fusion, reducing boundary shifts caused by cross-scale misalignment; UBAM refines high-uncertainty regions such as shadows, overlaps, and blurred boundaries, yielding clearer and more stable final boundaries. Results demonstrate that this multi-level structural reinforcement proves more robust than simply stacking additional modalities or deeper network architectures. It is better suited to remote sensing scenarios with scale variations, yielding finer boundary segmentation outcomes.

Table 3 Quantitative comparison results on the Potsdam and Vaihingen test sets.
Fig. 5. Visual comparisons on the ISPRS Potsdam dataset.

Fig. 6. Visual comparisons on the ISPRS Vaihingen dataset.

Conclusions

In this study, we propose a CNN-Transformer hybrid architecture termed MSBANet, which enhances high-resolution remote sensing image segmentation performance by strengthening multi-scale representation and boundary modelling capabilities. Through the synergistic design of MSTB, GLFM and UBAM, the model achieves clearer object contours, superior structural consistency and greater robustness in complex regions. Experiments on the Potsdam and Vaihingen datasets demonstrate that MSBANet consistently outperforms existing methods in both region-level accuracy and boundary quality.

Despite these advantages, MSBANet has limitations: the added multi-scale and fusion modules increase computational cost compared with lightweight networks, and the reliance on single-modal optical imagery may compromise robustness under extreme illumination or sparse-texture conditions. Future research will explore more efficient architectures and integrate complementary modalities to further enhance boundary stability and generalisation.