Introduction

Semantic segmentation of high-resolution UAV remote sensing images plays a crucial role in environmental monitoring1, urban planning2, and disaster management3, attracting widespread attention. Traditional remote sensing image semantic segmentation methods typically rely on handcrafted features (e.g., texture, spectral indices, and shape descriptors manually designed by experts) combined with classical machine learning algorithms4. These methods generally necessitate the manual design of feature extractors. While such approaches can yield acceptable performance in simple or low-resolution scenarios, they face considerable challenges when applied to complex environments and high-resolution UAV imagery that is rich in detail2,3,4. With the rapid development of deep learning technology5,6,7,8, deep learning-based remote sensing image semantic segmentation methods have demonstrated superior performance due to their powerful feature extraction and modeling capabilities, gradually becoming a research hotspot in this field8,9,10,11,12.

Deep learning-based remote sensing image semantic segmentation models have shown outstanding performance across various tasks13. However, their effectiveness heavily depends on large amounts of accurately annotated training data14. Although the widespread adoption of UAV technology has made remote sensing image acquisition more accessible, annotating pixel-level training samples remains a time-consuming and labor-intensive task15, requiring expert annotators. On the other hand, reducing the number of training samples often leads to severe degradation in model performance3,10,11,12. As a result, research teams face significant challenges16.

To address the issue of insufficient labeled data, researchers have proposed methods that combine semi-supervised learning with transfer learning17,18. Semi-supervised learning leverages a small amount of labeled data along with a large volume of unlabeled data, enabling the model to learn underlying patterns from unlabeled samples. This reduces the dependence on labeled samples and provides an effective approach to mitigating the scarcity of annotated data19,20. Transfer learning facilitates knowledge transfer from large-scale image datasets to target tasks21, thereby reducing the demand for extensive training samples in the target domain. Existing transfer learning-based methods for remote sensing image semantic segmentation primarily rely on parameter transfer, in which pre-trained model parameters from large-scale natural image datasets such as ImageNet22 are directly transferred and fine-tuned on remote sensing datasets23,24,25. Although this approach enhances model performance, the significant differences in data distribution and feature representation between remote sensing images and natural images limit the effectiveness of parameter transfer25. To overcome this challenge, researchers have adopted knowledge distillation as an alternative to parameter transfer, aiming to fully utilize the knowledge learned from large-scale natural image datasets26,27.

Knowledge distillation is an extended approach to transfer learning. First, a teacher model is trained on a large-scale dataset. Then, its outputs are used to guide the training of a student model on a target dataset, which is often much smaller in scale. The student model learns from the teacher model by mimicking its soft labels (i.e., probability distribution vectors), thereby improving its own performance and generalization ability. Previous studies have shown that knowledge distillation often achieves better results than direct parameter transfer26,27. Currently, most semantic segmentation methods based on knowledge distillation employ models pre-trained on the ImageNet dataset as teacher models to guide the training of student models on small-scale remote sensing datasets27,28,29. For instance, Wang et al.27 used knowledge distillation to transfer knowledge from a teacher model pre-trained on ImageNet to student models trained on the Cityscapes15, CamVid30, and Pascal VOC31 semantic segmentation datasets. Their experimental results showed that, compared to parameter transfer, their method improved the mean Intersection over Union (mIoU) by 3.6%, 2.75%, and 2.27% on these three datasets, respectively. However, due to the significant differences between natural image datasets and remote sensing image datasets, when only a limited number of labeled samples are available in the target remote sensing dataset, directly using a teacher model trained on large-scale natural image datasets to guide the student model may not achieve optimal performance. A promising research question is whether introducing an intermediate-scale remote sensing dataset between the large-scale natural image dataset and the small-scale target remote sensing dataset—utilizing a multi-level knowledge distillation approach—could further enhance the performance of the student model.

This study proposed a few-shot remote sensing image semantic segmentation method that combines multi-level knowledge distillation with semi-supervised learning. Our main contributions are as follows:

  1. (1)

    We have optimized the encoder of the DeepLabV3 + model32 by replacing the High-Resolution Network (HRNet) as the backbone network and introducing the Polarized Self-Attention (PSA) module, which enhanced the model’s ability to restore fine details and improved pixel-level classification accuracy.

  2. (2)

    We proposed a few-shot semantic segmentation method that combined multi-stage knowledge distillation and semi-supervised learning, achieving high performance with only a few labeled samples.

Related work

To improve the accuracy of semantic segmentation of high-resolution UAV images with small samples, researchers have explored various approaches. Early studies mainly relied on transfer learning (TL), where models pre-trained on large-scale natural image datasets (e.g., ImageNet) were fine-tuned on small-scale remote sensing datasets33,34. This strategy has been shown to enhance the performance and convergence speed of the semantic segmentation model, but its effectiveness is limited due to the large domain gap between natural and remote sensing images. To address this limitation, knowledge distillation (KD) has been proposed as an advanced scheme of transfer learning. KD transfers soft labels or intermediate features from a teacher model to a student model, enabling the latter to better adapt to the target task35,36,37,38,39. For instance, Tuia et al.40 combined KD with unsupervised domain adaptation to improve cross-domain adaptability. Li et al.41 extended this by introducing cross-domain and cross-modal KD to align 2D–3D features, and Zhou et al.42 designed a graph-attention guided KD to capture land-cover relationships. More recently, Yuan et al.43 proposed feature augmentation KD, improving student learning, though at a higher computational cost.

Semi-supervised learning (SSL) has emerged as another key approach to alleviate the scarcity of labeled data. By combining a small number of labeled samples with abundant unlabeled data, SSL methods have been shown to significantly improve model performance44,45,46,47,48,49,50,51,52,53. A variety of techniques have been developed, including self-training48,52, consistency regularization45, and generative adversarial networks (GANs)44,53. For instance, Zhang et al.51, proposed a consistency-based network with multi-view augmentation and dynamic pseudo-labels; Yang et al.47 developed ST + + to mitigate noisy pseudo-labels through strong augmentation and selective retraining; Ke et al.54 proposed Structured Consistency Loss, enforcing consistency at both pixel and structural levels; and Chen et al.55 presented Noise-Robust Consistency Regularization, which mitigates pseudo-label noise through feature perturbation and robust loss. These studies collectively demonstrate that SSL can effectively improve accuracy under limited labeled data.

Recent trends have focused on integrating transfer learning, knowledge distillation, and semi-supervised learning into unified frameworks, leveraging their complementary advantages. Transfer learning initializes the model with large-scale datasets, SSL enhances model performance by leveraging a large amount of unlabeled data, and KD enables the model to stably adapt across different domains. For example, in 2023, Yuan et al.56 proposed a semi-supervised framework known as Mutual Knowledge Distillation. This framework utilizes dual mean teacher models and pseudo-labeling, along with data and feature augmentation, to enhance training diversity and improve segmentation performance in scenarios with limited labeled data. In the same year, Chen et al.57 proposed a semi-supervised knowledge distillation framework for large-scale urban object mapping in remote sensing, transferring knowledge from pre-trained teacher models to student models with SSL, thereby improving learning under limited labeled data and boosting segmentation accuracy for man-made objects. In 2024, Ma et al.58 introduced a semi-supervised road detection method based on knowledge distillation. By utilizing knowledge distillation to transfer feature knowledge from a teacher model to a student model and incorporating a semi-supervised learning strategy, this approach further optimizes segmentation performance in road detection tasks. In 2025, Song et al.59 introduced RS-MTDF, which employs multiple frozen vision foundation models (e.g., DINOv2, CLIP) as teachers, achieving state-of-the-art results on ISPRS Potsdam, LoveDA, and DeepGlobe. These hybrid approaches highlight the potential of combining TL, KD, and SSL for robust remote sensing segmentation.

Nevertheless, most methods still rely on ImageNet-pretrained teachers, whose data distribution differs significantly from remote sensing imagery, limiting their effectiveness in small-sample scenarios. To bridge this gap, we proposed a semantic segmentation method that combines multi-level knowledge distillation with semi-supervised learning, introducing a medium-scale dataset as an intermediate bridge between large-scale natural image datasets and the target domain, thereby improving adaptation and segmentation accuracy.

Dataset and method

Dataset

This study utilized the ImageNet natural image dataset22, Huawei Ascend Cup remote sensing dataset (HW)60, and the land cover UAV remote sensing dataset of Erhai (EH) Valley in Yunnan Province. ImageNet, a benchmark dataset for image classification, consists of 14,197,122 images with 21,841 Synset indices, including approximately 12 million training images, 50,000 validation images, and 100,000 test images. It serves as a fundamental dataset for pre-training deep learning models, which are later fine-tuned for various tasks, including remote sensing image segmentation. The HW dataset comprises 100,000 semantic segmentation images of 256 × 256 pixels, covering 17 land cover categories, primarily used for land-use classification. The dataset was divided into training, validation, and testing sets in an 8:1:1 ratio. The EH dataset, captured using UAVs, boasts a spatial resolution of 3 cm and features images with dimensions of 512 × 512 pixels, offering detailed geographic information about the Erhai watershed (25.67°N, 100.16°E). The dataset comprised six land cover classes: Buildings, Vegetation, Roads, Water Bodies, Farmland, and Others. It included 11,000 labeled images and 168,987 unlabeled images. The labeled images were randomly selected, cropped, and labeled from various large-sized images to ensure the representativeness and diversity of the training set. This study employed a fixed data partitioning strategy61, dividing the labeled images into a training set, a validation set, and a test set in a ratio of 8:1:1. A separate test set was reserved to ensure the reproducibility of the results and to balance the computational efficiency and reliability. The unlabeled images were utilized in the semi-supervised learning stage. Image examples and annotations for the three datasets are shown in Fig. 1.

Fig. 1
Fig. 1The alternative text for this image may have been generated using AI.
Full size image

Sample images and annotations of the three datasets.

High-resolution remote sensing image semantic segmentation method

Considering the significant differences in feature distributions between natural images and semantic segmentation images, we proposed a few-shot high-resolution remote sensing image semantic segmentation framework that integrates multi-stage knowledge distillation and semi-supervised learning (MKD + SSL) (Fig. 2). This framework introduced a moderate-scale remote sensing dataset as an intermediate domain between the ImageNet source domain and the target UAV remote sensing dataset, leveraging multi-stage knowledge distillation to model with limited target domain labeled samples. Furthermore, semi-supervised learning was employed to fully utilize the vast number of unlabeled samples in the target domain, thereby enhancing the model’s overall performance.

Fig. 2
Fig. 2The alternative text for this image may have been generated using AI.
Full size image

A few-shot remote sensing semantic segmentation framework combining multi-stage knowledge distillation and semi-supervised learning.

Backbone model selection and optimization

Currently, DeepLabV3 + is one of the mainstream models widely recognized for its superior performance in the field of semantic segmentation32. It typically employs ResNet as its backbone network. However, ResNet struggles with effectively preserving details and accurately defining boundaries, particularly when processing high-resolution remote sensing images that have complex boundaries and rich details. To overcome these shortcomings, this study replaced the DeepLabV3 + backbone with the High-Resolution Network (HRNet) and further introduced the Polarized Self-Attention (PSA) module after the backbone’s feature maps to enhance semantic segmentation performance. The applications of the HRNet enabled the DeepLabV3 + to maintain high-resolution features throughout the network, significantly improving multi-scale feature fusion capability, and making it especially suitable for high-resolution remote sensing image segmentation tasks with complex boundaries and fine details. The PSA module employed a polarized filtering strategy to maintain high internal resolution and utilizes SoftMax and Sigmoid functions to enhance the high dynamic range, refining and strengthening high-level semantic features.

Multi-stage knowledge distillation strategy

This study proposed a multi-stage knowledge distillation strategy to achieve efficient model adaptation from the source domain to the target domain through progressive transfer and knowledge transmission (Fig. 2). In the first stage of knowledge distillation, the pre-trained model (M) on ImageNet was fine-tuned using the HW data set by updating only the parameters of the fully connected layer. The fine-tuned model served as the initial teacher model (MT1). We then randomly initialized the parameters of the student model (MS_init1) and employed the initial teacher model MT1 to guide the student model to train on the HW dataset. Throughout the entire training process of the student model (MS_init1), the parameters of the teacher model remained fixed to fully preserve the knowledge acquired from the source domain. The student model (MS_init1) started with randomly initialized parameters and was progressively optimized during training. The optimization was guided by three types of loss functions. First, the supervised loss, based on cross-entropy, ensured that the student model learned accurate semantic segmentation from the labeled samples. Second, the knowledge distillation loss transferred knowledge from the teacher to the student by allowing the student to mimic the softened probability outputs of the teacher, which, compared to one-hot labels, revealed inter-class similarities and provided richer guidance. In this study, a temperature of three was used to generate the soft outputs, encouraging the student to inherit semantic knowledge from the teacher and enhance its generalization under limited labeled data. Third, the Elastic Weight Consolidation (EWC) loss imposed constraints on important parameters of the student model, thereby reducing the risk of forgetting critical source domain knowledge while adapting to new tasks. The final training objective was achieved by combining these three loss functions with corresponding weights. By incorporating EWC loss, the student model was optimized through the joint minimization of the supervised loss and knowledge distillation loss, yielding the well-trained student model (MS_trained1).

In the second stage, MS_trained1 was further fine-tuned on the EH dataset (updating only the full connection layer) to generate a new teacher model (MT2), and a student model (MS_init2) was randomly initialized. Then the new teacher model (MT2) was used to guide the training of the student model (MS_init2) on the EH dataset. This process also employed the supervised loss, knowledge distillation loss, and EWC loss described above, allowing MS_trained2 to adapt to the EH dataset while inheriting semantic knowledge from MT2 and maintaining stability of key parameters. Through this multi-stage distillation process, the model achieved efficient cross-domain adaptation from the ImageNet dataset to the HW dataset and then to the EH dataset, ultimately providing a high-quality initial model for subsequent semi-supervised learning.

Semi-supervised learning strategy

Annotating high-resolution remote sensing images is often costly and time-consuming, whereas unlabeled data is more abundant and easier to obtain in practical applications. To fully utilize these unlabeled samples and further enhance the accuracy of the segmentation model, this study combined semi-supervised learning (SSL) techniques with the training process (Fig. 2). At the beginning of training, the multi-level knowledge distillation (MKD) model served as the initial teacher model (MT3), and a new student model (MS_init3) was initialized with the parameters of the MT3 model. Given an unlabeled sample xu, the teacher model MT3 generated a pseudo-label \(\:{\widehat{\varvec{y}}}_{\varvec{u}}\). To ensure pseudo-label reliability, an entropy-based filtering mechanism was applied, retaining only low-entropy samples62(xuconf, \(\:{\widehat{\varvec{y}}}_{\varvec{u}}\)conf), which were then treated as pseudo-labeled data and combined with labeled data (xl,yl) for training the student model. During training, the total loss function consisted of two key components: (1) Supervised loss Lsup, which applied to labeled data (xl, yl) and measured the difference between the student model’s predictions and the ground-truth labels; and (2) Consistency loss Lcons, which applied to pseudo-labeled data (xuconf, \(\:{\widehat{\varvec{y}}}_{\varvec{u}}\)conf), encouraging the student model to produce predictions that were consistent with the high-confidence outputs of the teacher model. By jointly optimizing these losses, the student model received strong supervision from labeled data while learning features from unlabeled data, enhancing its adaptation to the target domain. Additionally, the teacher model’s parameters were dynamically updated through an Exponential Moving Average (EMA) mechanism63 rather than being fixed. If the student model’s and teacher model’s parameters at step tth were respectively denoted as Wt(S) and Wt(T), the EMA update formula was: \(\:{\varvec{W}}_{\varvec{t}}^{\left(\varvec{T}\right)}=\varvec{\upalpha\:}\cdot\:{\varvec{W}}_{\varvec{t}-1}^{\left(\varvec{T}\right)}+\left(1-\varvec{\upalpha\:}\right){\varvec{W}}_{\varvec{t}}^{\left(\varvec{S}\right)}\:\)where α was typically set close to 1 (e.g., 0.999) to ensure smooth updates, allowing the teacher model to gradually absorb new information from the student model. This dynamic update mechanism enabled the teacher model to continuously generate higher-quality pseudo-labels, progressively improving training effectiveness. Through an iterative cycle of pseudo-label generation, supervised learning, and consistency learning, the student model (MS_trained3) drove improvements in the teacher model, which in turn provided increasingly refined pseudo-labels. This mutual enhancement process ensured that the teacher and student models collaborated effectively.

Model evaluation and parameter settings

Model evaluation metrics

This study primarily used GFLOPs (Giga Floating Point Operations) and IoU (Intersection over Union) as evaluation metrics to comprehensively assess the model’s performance and efficiency in remote sensing image semantic segmentation tasks. GFLOPs measured the computational complexity and resource consumption of the model, where a lower value indicated reduced computational requirements during inference. IoU was one of the most commonly used evaluation metrics in semantic segmentation, assessing the segmentation accuracy for each category by calculating the overlap between predicted and ground truth regions. The mean IoU (mIoU) averaged the IoU values across all categories, providing a more holistic measure of the model’s performance in segmentation tasks. Additionally, FPS (Frames Per Second) was used to quantify the model’s inference speed in practical applications, indicating the number of images the model can process per second under a given hardware setup.

Model training and hyperparameters

In this study, deep learning experiments were conducted using an NVIDIA GeForce RTX 4090 GPU. The software environment was configured with PyTorch 1.7.0 as the deep learning framework, accelerated by CUDA 12.0 for efficient computation. The hyperparameter settings for the model were as follows: the input image size was set to 512 × 512 pixels with 3 channels, the batch size was 8, and the total number of training epochs was 100. The choice of 100 training epochs was intended to ensure sufficient model convergence while avoiding overfitting, especially when handling large datasets and complex models, which aligns with common practices in relevant studies33,43,48. During training, the initial learning rate was set to 0.001, and Stochastic Gradient Descent (SGD) was used as the optimizer. The loss function employed was Online Hard Example Mining (OHEM)64, which improves segmentation performance by prioritizing difficult samples. Additionally, the momentum value was set to 0.9, the confidence threshold was 0.95, and weight decay was 0.001 to enhance model stability and prevent overfitting. In the semi-supervised learning phase, data augmentation techniques such as random flipping and brightness adjustment were applied during pseudo-label generation by the teacher model to improve model robustness. These configurations and parameter settings provided the necessary technical support and optimization conditions for achieving high-accuracy semantic segmentation in high-resolution remote sensing images.

Results

Experimental results of model improvement

To evaluate the effectiveness of the proposed model improvements, we conducted a comparative experiment on the EH remote sensing image dataset (Table 1). In these experiments, the DeepLabV3 + model utilized various networks as the backbone, and we employed traditional knowledge distillation (single-stage) instead of multi-stage knowledge distillation. All experiments utilized a pre-trained model on ImageNet to guide the training of the student model on the EH dataset.

Table 1 Performance of DeepLabV3 + with different backbone networks on the EH dataset.

To ensure consistency and reliability of the method, all models were trained using the entire labeled training set of the EH dataset, which consists of 8,800 images. For a fair comparison, all models used identical parameter configurations, data augmentation strategies, and evaluation datasets.

The experiment results indicated that both HRNetV2-W48 and its improvement HRNetV2-W48_PSA achieved superior segmentation performance while maintaining relatively low computational complexity (1,160.1 GFLOPs and 1,262.1 GFLOPs, respectively). Their mIoU scores reached 81.31% and 81.93%, respectively. Compared to the traditional ResNet-101 backbone, HRNetV2-W48_PSA not only exhibited lower computational complexity (1,262.1 GFLOPs) but also significantly improved segmentation performance, with mIoU increasing from 79.04% to 81.93%.

The experimental results indicated that the HRNetV2-W48_PSA model, which integrated the PSA module, achieved a 0.6% increase in mIoU compared to the HRNetV2-W48 model. Additionally, the computational cost rose by over 100 GFLOPs. However, the tests indicated that the actual inference speed of these two models was nearly equivalent (HRNetV2-W48_PSA achieves 42 FPS, while HRNetV2-W48 reaches 44 FPS). Therefore, we recommend choosing HRNetV2-W48_PSA, as it maintains efficient inference while enhancing feature representation and segmentation accuracy, thus achieving the best balance between performance and computational cost.

Experimental results of semi-supervised learning and multi-stage knowledge distillation

To validate the effectiveness of the framework that combines multi-stage knowledge distillation and semi-supervised learning in small-sample target tasks, this study conducted experiments using 10% of the labeled samples (880 images) from the EH labeled training set and all unlabeled images (168,987 images). To ensure the consistency and reliability of the method, the experiments maintained the same parameter configurations, data augmentation strategies, and evaluation dataset. We compared the performance differences with and without semi-supervised learning under different knowledge distillation strategies (Table 2). The experimental results indicated that the introduction of an intermediate domain consistently enhanced performance across all categories.

Table 2 Comparison of multi-stage Knowledge Distillation Strategies with and Without Semi-Supervised Learning.

When semi-supervised learning was not employed, the multi-stage knowledge distillation (MKD) model (ImageNet_HW/No) improved the mIoU by 5.58% over the single-stage knowledge distillation (SKD) model from ImageNet (ImageNet/No). This enhancement led to IoU increases of 8.4% for Road, 9.98% for Farmland, 5.48% for Building, and 5.04% for Grassland. Similarly, in comparison with the SKD method used in HW (HW/No), the mIoU of the MKD method rose by 3.58%, and the IoU for each category saw improvements ranging from 1.01% to 5.05%, demonstrating the efficacy of intermediate region distillation. When combined with semi-supervised learning (SSL), the mIoU of the MKD + SSL model (ImageNet_HW/Yes) was still 3.06% higher than that of the SKD model of ImageNet (ImageNet/Yes). This improvement resulted in IoU gains of 5.21% for Road, 3.74% for Farmland, 2.17% for Building, and 3.19% for Grassland. Compared with the SKD model based on the HW dataset (HW/Yes), the MKD + SSL model (ImageNet_HW/Yes) also showed similar results. The mIoU of the MKD + SSL method has increased by 2.42%, while the IoU of each category has increased by 0.16% to 4.15%.

The experimental results also indicated that semi-supervised learning significantly enhanced the performance of both single-stage and multi-stage KD models. Specifically, the mIoU of the MKD + SSL model has increased by 3.01% compared to that of the MKD model. The IoU values for all categories have also improved, with the improvement range ranging from 1.10% to 4.96%.

To provide an intuitive visualization of the segmentation performance, we randomly selected several unseen samples from the EH test set for qualitative analysis. The visualization results in the semi-supervised learning setting demonstrated that the ImageNet_HW17 strategy achieved the best performance (Fig. 3).

Fig. 3
Fig. 3The alternative text for this image may have been generated using AI.
Full size image

Visualization of segmentation results on EH dataset samples.

Effect of unlabeled data quantity

Upon utilizing various proportions of unlabeled data from the EH dataset, we examined the effect of different quantities of unlabeled samples on the performance of the MKD + SSL framework. This dataset contained a total of 168,987 unlabeled images. We conducted semi-supervised learning using 10% and 50% of the unlabeled images randomly selected, as well as all the unlabeled images (Table 3). The experimental results indicated that, in comparison to the baseline without SSL (74.04% mIoU), including only 10% of the unlabeled data can increase the model’s mIoU by 1.16%. When the amount of unlabeled data was increased to 50%, the mIoU further improved to 76.40% (+ 2.36%), and utilizing the complete set achieved 77.05% (+ 3.01%). The results indicated that the mIoU of the proposed MKD + SSL framework was positively correlated with the total amount of unlabeled data used in semi-supervised learning. However, as the number of unlabeled samples participating in the semi-supervised training increases, the improvement in the model’s mIoU tends to plateau after reaching a certain threshold. Therefore, when computing resources are abundant, it is recommended to utilize as much unlabeled data as possible to enhance the performance of the model. Conversely, when computing resources are limited, it is advised to use approximately half of the available unlabeled data for semi-supervised learning.

Table 3 Impact of unlabeled data proportion on segmentation accuracy (mIoU).

Discussion

Comparative analysis of experimental results

To further validate the effectiveness of the proposed semantic segmentation method, we conducted experiments on the Cityscapes dataset, which contains 2,975 training, 500 validation, and 1,525 test images65. The training set was used to construct labeled and unlabeled subsets: specifically, 1/16, 1/8, 1/4, and 1/2 of the training images were randomly selected as labeled data, while the remaining images were treated as unlabeled data. The resulting mIoU was then compared with that of existing approaches (Table 4).

Table 4 Comparison with the latest SOTA methods on the Cityscapes Dataset.

In the comparative experiments, we employed our proposed MKD + SSL method with HRNetV2-W48_PSA as the backbone network, a pre-trained model on ImageNet as the teacher model, GTAV as the intermediate domain, and Cityscapes as the target domain. All methods evaluated in Table 4 were semantic segmentation approaches, ensuring consistency with our task. The Cityscapes dataset is a widely used benchmark in this field, and conducting experiments on this dataset provides a unified and fair evaluation for comparison. The results demonstrated that the proposed MKD + SSL framework consistently outperforms existing methods when using the same number of labeled samples. Moreover, as the size of the labeled dataset in the target domain decreases, the performance improvement of the proposed method becomes more significant. Compared to existing methods, our model achieved mIoU improvements of 1.5% and 1.4% when using 1/16 and 1/8 labeled samples, respectively. When the labeled sample size increased to 1/4 and 1/2, the mIoU gain decreased, but was still superior to existing methods. These results indicated that our MKD + SSL framework offers notable advantages in small-sample scenarios, making it particularly well-suited for few-shot semantic segmentation tasks.

Impact of intermediate domain selection

We observed that the selection of the intermediate domain in multi-stage knowledge distillation has a significant impact on model performance. A well-chosen intermediate domain effectively bridges the gap between the source and target domains, facilitating more efficient knowledge transfer. However, if the intermediate domain is inappropriately selected, it may lead to negative transfer, causing a decline in model performance on the target domain. In the aforementioned comparative experiments, we attempted to use the HW dataset as the intermediate domain but encountered negative transfer effects. We hypothesized that this occurred due to significant differences in scene distribution between the intermediate domain (HW) and the target domain (Cityscapes).

To evaluate the impact of intermediate domain selection on the model performance, we conducted a t-SNE analysis68 to visualize the feature distribution across different datasets (Fig. 4). The results showed that the HW dataset lay between ImageNet and EH in the feature representation space (Fig. 4a), with partial overlap with both domains. Specifically, HW included categories such as water, transportation, buildings, farmland, grassland, forest, and bare soil, many of which (e.g., water, buildings, farmland, grassland) overlap with the categories in the EH dataset (Background, Road, Farmland, Building, Water, Grassland). This semantic consistency explained why HW serves as an effective intermediate domain to bridge ImageNet and EH, thereby improving transfer performance. In contrast, when we used GATV as the intermediate domain, the results were even worse than the single-stage KD method. We attributed this to the fact that GATV mainly contained urban street scenes, which differed substantially from the EH dataset, a finding that was consistent with our t-SNE visualization. For the transfer from ImageNet to Cityscapes, however, GATV proved to be a more suitable intermediate domain. GATV features were closer to Cityscapes both visually and quantitatively, which facilitates smoother domain adaptation (Fig. 4b).

Fig. 4
Fig. 4The alternative text for this image may have been generated using AI.
Full size image

t-SNE visualization of intermediate domain selection.

We further conducted experiments comparing GATV and HW as intermediate domains for Cityscapes, and the results showed that GATV yields better performance. This can be attributed to the fact that both GATV and Cityscapes were urban scene datasets with many overlapping categories, making their feature distributions more consistent. These observations were also supported by our t-SNE visualization. Therefore, when selecting an intermediate domain dataset, we recommend choosing one with a similar feature distribution or overlapping categories with the target domain dataset to ensure a smooth domain transformation.

Ablation experiments

To validate the contributions of each component, we conducted a comprehensive ablation analysis evaluating (i) multi-stage KD without SSL, (ii) SSL without multi-stage KD, and (iii) the impact of removing EWC and PSA. The results, summarized in Table 5, showed that removing SSL decreases mIoU by 3.01% (77.05% → 74.04%), removing multi-stage KD while keeping SSL yields 73.99% (vs. 77.05% for the full model), and removing PSA slightly reduces performance (74.04%→73.66%). These observations confirm that the components of the MKD + SSL framework are complementary, and EWC and PSA provide additional robustness to the model.

Table 5 The effect of multi-stage KD, SSL, EWC, and PSA on the model’s mIoU.

Model generalization analysis

To further assess the generalization capability of the proposed method, additional experiments were conducted on the Cityscapes dataset. This dataset differs significantly from the EH UAV remote sensing imagery in terms of resolution, imaging conditions, and target category distribution. Nevertheless, the results demonstrated that the proposed method still outperforms several representative semi-supervised approaches in urban scene segmentation (see Table 4). This indicates that the proposed MKD + SSL framework can adapt to semantic segmentation tasks in diverse scenarios, exhibiting strong cross-dataset generalization potential. In future work, we will conduct further experiments on other publicly available remote sensing datasets, such as UAVid and LoveDA, to further validate the robustness and practical applicability of the method in broader application contexts.

Broader impacts and ethical considerations

The proposed MKD + SSL framework has the potential to enhance the efficiency and accuracy of high-resolution remote sensing image semantic segmentation in various domains, such as urban planning, agricultural monitoring, ecological conservation, and disaster management. By reducing reliance on large-scale labeled datasets, this approach may facilitate the broader adoption of remote sensing applications. However, it is important to note that the method may also pose potential risks in practical use. For example, in urban surveillance or land-use regulation scenarios, remote sensing technology may involve issues of privacy and data security; in military or sensitive areas, its application without appropriate oversight could raise ethical concerns. Therefore, in future work, we will not only continue to improve the robustness and generalizability of the model but also place greater emphasis on addressing its ethical boundaries in real-world applications.

Limitations and future work

Although the proposed MKD + SSL framework demonstrates strong performance in few-shot remote sensing semantic segmentation, several limitations should be acknowledged. First, the effectiveness of multi-stage knowledge distillation largely depends on the selection of an appropriate intermediate domain. Although we have mentioned above that the selection of the intermediate domain should take into account the similarity with the feature distribution of the target domain dataset, further exploration is needed in future research on how to quantify this similarity to guide the selection of the intermediate domain. Second, the proposed framework introduces additional computational overhead due to multi-stage training and semi-supervised learning, which may limit its applicability in resource-constrained environments. Although the inference efficiency remains acceptable, the overall training cost is higher than that of single-stage approaches. Finally, the current method has been validated on UAV and urban scene datasets. Its performance on other remote sensing scenarios, such as multispectral or hyperspectral imagery, has not yet been explored. Future work will focus on reducing training complexity, improving intermediate domain selection strategies, and extending the framework to broader remote sensing applications.

Conclusion

This study addressed the high annotation cost associated with training semantic segmentation models for high-resolution remote sensing images by proposing a few-shot semantic segmentation method that combined multi-stage knowledge distillation and semi-supervised learning. By introducing an intermediate-domain dataset between the large-scale natural image dataset (ImageNet) and the small-scale UAV remote sensing image dataset, the proposed method gradually reduces the distribution gap between the source and target domains, effectively enhancing the model’s transferability and adaptation in high-resolution remote sensing imagery. Meanwhile, the use of pseudo-labeling and consistency learning effectively leveraged large amounts of unlabeled data, further improving the segmentation accuracy of the model. The results validated the effectiveness of introducing an intermediate domain for multi-stage distillation, while also demonstrating the potential of pseudo-labeling and consistency learning in enhancing model performance. The proposed method was particularly valuable in real-world remote sensing scenarios where annotated data was extremely scarce and can be widely applied in fields such as environmental monitoring, urban planning, agricultural assessment, and disaster management. In summary, this study expands the scope of deep learning research in few-shot remote sensing semantic segmentation and provides a feasible technical framework for future large-scale, high-resolution remote sensing analysis and applications.