Introduction

Landslides are among the most frequent natural disasters globally. They are highly destructive to their surrounding environments and occur predominantly in mountainous, hilly, and plateau regions1,2,3,4. Worldwide, landslide disasters cause substantial casualties and property losses each year5. Against the backdrop of ongoing global climate change and intensified human engineering activities, the steep terrain, complex geological conditions, and harsh environment of the upper Yellow River region in China further exacerbate the frequency and destructive impact of landslides in this area6. Consequently, there is an urgent need to develop rapid and intelligent landslide detection methods for the upper Yellow River region to improve the effectiveness of disaster prevention and mitigation strategies.

Traditional landslide detection methods primarily depend on field surveys, which are constrained by limited spatial coverage, high time consumption, labor intensity, and low efficiency, often failing to satisfy the demands of emergency response departments7,8. In recent years, the rapid advancement of remote sensing technologies and computer science has promoted the development of novel approaches for landslide detection9. Current methodologies can be broadly categorized into four types: visual interpretation, change detection, machine learning, and deep learning methods10,11.

The visual interpretation method relies on analyzing the texture and geometric features of landslides in remote sensing imagery to delineate landslide boundaries12. For instance, Lv et al. proposed an automatic adaptive region-growing algorithm based on very high-resolution (VHR) remote sensing images. Through experiments conducted at two distinct landslide sites on Lantau Island, Hong Kong, China, using VHR remote sensing imagery, the effectiveness and advantages of the proposed method were demonstrated13. Additionally, Lv et al. introduced a landslide inventory mapping method based on adaptive regional spatial–spectral similarity for automatically identifying landslide areas from bi-temporal remote sensing images. This approach begins by adaptively extracting spectrally homogeneous regions around each pixel to leverage spatial contextual information. Subsequently, a shape description algorithm is proposed to construct shape descriptor vectors for quantifying spatial structural differences between regions. Finally, by integrating shape vectors with regional brightness information, a spatial–spectral similarity measure is constructed to generate a change intensity map, which is then binarized using the Otsu threshold method to produce a landslide distribution map. Experimental results indicate that, compared with eight existing methods across four real-world landslide datasets, the proposed method achieves superior performance in both qualitative visualization and quantitative metrics (such as overall accuracy), demonstrating its suitability for practical landslide detection tasks14. Although change detection methods based on bi-temporal remote sensing imagery can capture landslide regions by analyzing pixel-level changes, their performance relies heavily on high-precision remote sensing data; for lower-resolution images, the results may exhibit substantial errors.

The change detection method compares multi-temporal remote sensing images of landslide-prone areas to extract information on surface morphological changes, thereby assessing landslide occurrence and evolution15,16. While this method enables long-term time-series monitoring and captures dynamic processes of landslides, it is prone to geometric distortions in areas with significant topographic relief, often causing inaccuracies in characterizing surface changes in complex mountainous terrain.

Machine learning methods utilize powerful feature extraction and pattern recognition capabilities to learn nonlinear landslide characteristics from training data, thereby enhancing the detection of landslides with subtle changes17,18. Commonly used machine learning algorithms in landslide identification include Decision Tree, Random Forest, Support Vector Machines, and Gradient Boosted Regression Tree19,20,21. By comprehensively analyzing topographic, morphological, spectral, and textural features of landslides in remote sensing imagery, machine learning algorithms markedly enhance identification capabilities and greatly improve efficiency compared to manual visual interpretation22,23. Nonetheless, these models require the automatic extraction of numerous salient features from training data and still encounter limitations when dealing with highly complex and heterogeneous characteristics, thus constraining their performance in landslide detection tasks24.

Deep learning, a subfield of machine learning, has achieved remarkable progress alongside the rapid rise of artificial intelligence25. Compared with traditional machine learning, deep learning models offer the advantage of deeper network architectures and eliminate the need for manual feature engineering, enabling them to process large-scale datasets and making them particularly suitable for wide-area landslide detection tasks26. Moreover, deep learning facilitates end-to-end extraction of semantic features from low-level to high-level representations, allowing it to automatically distill useful information from complex data. At present, deep learning methods for landslide detection can be grouped into two main approaches: object detection and semantic segmentation27,28.

Object detection primarily locates landslide areas by assigning class labels and generating bounding boxes, with the Faster R-CNN and YOLO series being the most widely adopted algorithms29,30. Numerous researchers have achieved notable results in landslide detection using these methods. For example, Yang et al. proposed a lightweight attention-guided YOLO with a horizontal scaling layer for landslide detection. Their approach replaces the original YOLO backbone with MobileNetv3 and employs a lightweight pyramidal feature reuse and fusion attention mechanism to improve detection performance31. Ding et al. introduced Deformable Adaptive Focusing YOLO (DAF-YOLO), which incorporates an Enhanced Deformable Convolutional Network (EDCN) to improve recognition of anomalous landslides, a lightweight sliding-window attention mechanism to enhance background discrimination, and an adaptive zooming loss framework to reduce missed detections and false alarms32. Chandra et al. applied YOLO series algorithms to satellite and UAV imagery, concluding that YOLOv7 achieved an F1-score of 0.995, outperforming other YOLO variants33. Jin et al. proposed an Efficient Residual Channel-wise Soft Threshold Attention (ERCA) mechanism, which employs adaptive soft thresholding via deep learning to suppress background noise, thereby enhancing the feature learning capacity of Faster R-CNN. Incorporating ERCA into the Faster R-CNN backbone improved feature extraction and boosted landslide detection performance34. Although these studies demonstrate the potential of object detection algorithms for landslide identification, a fundamental limitation is their inability to perform pixel-level annotation, thereby preventing precise delineation of landslide boundaries.

Semantic segmentation technology, which performs pixel-wise classification of images, is widely used in AI applications such as autonomous driving and facial recognition. In landslide detection, semantic segmentation models can accurately delineate boundaries and separate landslides from background areas35. To improve accuracy, researchers have mainly focused on dataset enhancement and model refinement. Wang et al. employed RGB and LAB color correction for preprocessing to improve image quality, constructed a dataset integrating normal and anomalous images with features such as trees, roads, buildings, rivers, riverbanks, farmland, and landslide anomalies, and trained their model using the GANomaly framework36. Liu et al. proposed a complex background enhancement method based on multi-scale samples to improve data quality and conducted a comparative analysis using the Mask R-CNN model. Their results indicated that training with background-enhanced samples yielded higher precision across all metrics37. Ren et al. introduced ResM-FusionNet, a detection method that uses ResNet-50 as the backbone for feature extraction and integrates a multilayer perceptron as the decoder. Comparative experiments with SegFormer, DeepLabv3, and UNet demonstrated that ResM-FusionNet surpassed these models across all evaluation metrics26. Li proposed a lightweight landslide detection method based on DeepLabv3 + and a dual-attention mechanism. This approach replaces the Xception backbone with MobileNetv2 and incorporates dual attention with a lightweight convolutional attention module to accelerate training and improve feature extraction. The improved model achieved higher accuracy than the original DeepLabv3 + and outperformed classical models such as UNet, PSPNet, HRNet, and Swin Transformer in landslide identification38. Lin et al. proposed CBAM-U-net, a U-net with progressive convolutional attention modules, by integrating spatial and channel attention into each down-sampling stage. 
Comparative analysis against U-net, FCN, and DeepLabv3 + showed that CBAM-U-net achieved superior performance across all precision metrics39. Ghorbanzadeh et al. explored the feasibility of integrating deep learning models with object-based image analysis for landslide detection, concluding that the overall accuracy within an integrated framework surpasses that achieved by either approach individually40. Furthermore, Ghorbanzadeh et al. proposed a self-supervised learning method for landslide detection. Comparative experiments revealed that their method, utilizing only 10% of labeled data, outperformed competing models trained with 100% of the labeled data41. Kariminejad et al. employed unmanned aerial vehicle (UAV) technology to acquire high-precision maps of the semi-arid Golestan Province in Iran. They trained UAV-derived datasets using DeepLab-v3+, Link-Net, MA-Net, PSP-Net, ResU-Net, and SQ-Net architectures, achieving high-accuracy detection of landslides and sinkholes in challenging environments42. Shahabi et al. developed an unsupervised learning model using a convolutional autoencoder (CAE) to address the issue of limited training data. Experiments demonstrated that their model achieved optimal accuracy in landslide detection tasks by clustering CAE-extracted deep features alongside slope and NDVI data43. Piralilou et al. combined object-based image analysis with multilayer perceptron neural networks, logistic regression, and random forests for landslide detection. They found that the overall accuracy was optimized when multi-scale results from each machine learning method were merged using Dempster–Shafer theory44. Kumar et al. proposed a landslide detection model based on an ensemble deep learning classifier. The method begins by preprocessing GIS images with Gabor filters, followed by the extraction of multiple features including texture, vegetation indices, and brightness. 
Subsequently, a novel hybrid optimization algorithm—combining Teamwork Optimization Algorithm (TOA) and Poverty and Rich Optimization (PRO)—is employed to select the optimal feature subset from the fused features. These optimal features are then fed into an ensemble classifier composed of Recurrent Neural Network (RNN), Bidirectional Long Short-Term Memory (Bi-LSTM), and Bidirectional Gated Recurrent Unit (Bi-GRU) for training and detection. The final result is obtained by averaging the outputs of the three models. Experimental results indicated that the ensemble model achieved a training accuracy of 87%, outperforming other comparative models. However, the authors noted that the method faces challenges related to time-consuming training data preparation and computational complexity45. Zhang et al. demonstrated that fusing DEM data with optical imagery enhanced model robustness and improved prediction reliability46. Furthermore, feature fusion through a dual-branch network achieved slightly higher accuracy than early fusion of RGB and DEM channels. Moreover, Saha et al. systematically elaborated on the extensive applications of deep learning-based multi-sensor earth observation technologies across numerous domains. By integrating data from diverse sensors (e.g., optical, SAR, LiDAR, hyperspectral) and combining them with deep learning models, it becomes possible to understand and monitor Earth’s dynamics more accurately and comprehensively47. Although the aforementioned studies have made notable progress in landslide detection and offered novel approaches for automated landslide identification, certain limitations remain.

To overcome the limitations of existing methods, this study introduces a novel landslide detection approach based on multimodal data fusion and an improved DeepLabv3 + model. Without altering the core network architecture of the original model, the contributions are as follows: First, a Multimodal Fusion module based on a dual-branch network is designed to integrate visual and topographic features, enhancing the feature extraction capacity. Additionally, the original decoder is improved through dual convolution for stronger fusion capacity and hierarchical regularization for greater robustness. This strengthens feature representation in boundary-sensitive regions, provides a high-quality feature foundation for edge refinement, and markedly improves boundary segmentation accuracy in complex scenarios while maintaining computational efficiency. Second, the original ResNet backbone is replaced with a ConvNeXt network, which employs larger convolutional kernels to substantially expand the receptive field and strengthen feature extraction. Finally, a Small Target Head with an attention module combining spatial and channel attention mechanisms is incorporated. This suppresses background noise and increases model sensitivity to small target regions through simultaneous spatial and channel attention.

Research area and datasets

Research area

This study focuses on the upper Yellow River region, located at the border of Qinghai and Gansu Provinces (35°07′–36°42′ N, 99°59′–103°18′ E), as illustrated in Fig. 1. Situated on the southeastern margin of the Tibetan Plateau, the region is characterized by a typical alpine valley landform with an elevation range from 1,564 m to 4,951 m48. The pronounced topographic relief and steep slopes substantially influence slope stability. The area contains complex geological structures, active fault systems, and fragmented rock masses. These conditions, combined with sparse vegetation cover and extensive land desertification, intensify the instability of surface geomaterials49. Additionally, frequent human engineering activities in recent years have further disturbed natural slopes, increasing the likelihood of landslide hazards50,51.

Located within the intense tectonic zone of the Qinghai–Tibet Plateau, the complex geological structure exhibits high instability under the coupled effects of strong seismic activity and frequent freeze–thaw cycles. Furthermore, the harsh high-altitude, hypoxic environment and the vast, inaccessible terrain pose significant challenges for the deployment and maintenance of conventional monitoring equipment, while remote sensing techniques are often limited by vegetation and snow cover. Compounded by a scarcity of fundamental geological data and extreme logistical difficulties, field investigations and risk assessments are severely hampered. Consequently, this has led to the formation of a systemic problem—unique to the fragile plateau environment—that is far more complex, from genetic mechanisms to monitoring and early-warning, than those encountered in other regions worldwide. Given the limitations of field investigations, the development of efficient and accurate computer vision algorithms for automatic landslide identification and risk assessment is of significant scientific value and practical importance for strengthening regional geological disaster prevention and control.

Fig. 1

Overview of research area. Map created using ArcGIS 10.8 (https://www.esri.com).

Datasets

This study utilized 2-m resolution Gaofen-6 (GF-6) satellite imagery and SRTM DEM data released by NASA to construct a landslide dataset for the study area through manual visual interpretation. Due to constraints in GF-6 acquisition timing, only single-temporal imagery from January 2025 was used, thus lacking seasonal variability in landslide information. Furthermore, limitations inherent to single-view imagery produced shadow occlusion on certain north-facing slopes, leading to the omission of a small number of landslides. Ultimately, 257 landslides were identified and mapped; their spatial distribution is shown in Fig. 1.

The remote sensing characteristics of representative landslides in the study area are depicted in Fig. 2. Landslides differ markedly from their surrounding environment in morphology, tone, texture, and vegetation cover. In vegetated areas, landslide features are particularly distinct (Fig. 2a-e). In sparsely vegetated regions, identification relies on subtle geomorphological features such as slip scars, tonal contrasts on slopes, and the presence of “twin gullies with a common source,” as shown in Fig. 2f-h. Additionally, small landslides in the study area often appear as targets smaller than 50 × 50 pixels in the 2 m GF-6 imagery, yet they typically exhibit a clear tonal contrast against the background. For identifying such small landslides, 3D topographic information from Google Earth was used as auxiliary verification, as demonstrated in Fig. 2i-j.

During data processing, the 30 m resolution SRTM DEM data were first upsampled to 2 m to match the GF-6 imagery. Images containing landslide areas were then cropped into 512 × 512 pixel samples, and landslide boundaries were annotated using the LabelMe tool. To overcome the limited number of landslide samples, data augmentation techniques—including rotation, flipping, and color enhancement—were applied to expand the dataset, ultimately yielding 2,672 landslide samples. Finally, the augmented dataset was randomly divided into training, validation, and test subsets in an 8:1:1 ratio. This resulted in 2,130 samples allocated to the training set, with the validation and test sets each containing 271 landslide samples.
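The random partition described above can be sketched as follows. This is a minimal illustration rather than the authors' actual procedure: the function name `split_indices`, the fixed seed, and the use of integer sample indices are assumptions made for the example, while the subset sizes are the ones reported in the text.

```python
import random

def split_indices(n_total, n_val, n_test, seed=0):
    """Shuffle sample indices and split them into train/val/test lists."""
    if n_val + n_test > n_total:
        raise ValueError("validation and test sizes exceed the dataset size")
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    val = indices[:n_val]
    test = indices[n_val:n_val + n_test]
    train = indices[n_val + n_test:]
    return train, val, test

# Counts reported in the text: 2,672 samples -> 2,130 / 271 / 271.
train_idx, val_idx, test_idx = split_indices(2672, n_val=271, n_test=271)
print(len(train_idx), len(val_idx), len(test_idx))  # 2130 271 271
```

Passing explicit subset sizes, rather than ratios, avoids any ambiguity about how fractional sample counts are rounded in an approximately 8:1:1 split.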

Fig. 2

Some landslides in the study area. a-e illustrate landslides with sparse vegetation cover; f-h show double-gullied homologous landslides; i-j depict small-scale landslides. Map created using ArcGIS 10.8 (https://www.esri.com).

Methods

DeepLabv3+ model

DeepLabv3+, introduced by Google in 2018, is a classical semantic segmentation model52. The architecture adopts an encoder-decoder structure, in which the encoder is strengthened by an Atrous Spatial Pyramid Pooling (ASPP) module for semantic enrichment, while the decoder is employed for pixel-wise prediction53.

In the DeepLabv3 + framework, the encoder typically integrates a backbone network, such as ResNet or Xception, with the ASPP module54. The backbone functions as a strong feature extractor, hierarchically capturing multi-scale semantic features from the input55. The ASPP module enlarges the receptive field by applying convolutions with different dilation rates, thereby acquiring multi-scale contextual information56. Moreover, multiple convolutional layers within the encoder progressively reduce the spatial resolution of feature maps through convolution and pooling while simultaneously increasing channel depth. This compresses image features and facilitates high-level semantic extraction57.
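How dilation enlarges the receptive field can be made concrete with a small calculation. A k × k kernel with dilation rate r covers k + (k − 1)(r − 1) pixels along each axis while keeping the same number of weights. The rates 6, 12, and 18 used below are the values commonly paired with ASPP and are an assumption here, since the text does not state which rates were used.

```python
def effective_kernel_size(kernel_size, dilation):
    """Spatial extent covered by one dilated (atrous) convolution kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)

# Assumed dilation rates (not given in the text); rate 1 is an ordinary convolution.
rates = (1, 6, 12, 18)
extents = {rate: effective_kernel_size(3, rate) for rate in rates}
for rate, extent in extents.items():
    print(f"3x3 kernel, dilation {rate:2d} -> covers {extent} x {extent} pixels")
```

Under these assumed rates, the parallel ASPP branches view the same feature map at extents of 3, 13, 25, and 37 pixels, which is the sense in which the module gathers multi-scale context.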

The decoder produces precise segmentation by fusing hierarchical features from the encoder58. First, a 1 × 1 convolution is used to adjust the channel depth of deep encoder features to match shallow features. The deep features are then upsampled via bilinear interpolation to align spatial resolution with shallow features, followed by channel-wise concatenation for integration. The fused features are refined through 3 × 3 convolutions to enhance feature representation and discriminative ability. Finally, features are upsampled to the input image resolution, and a 1 × 1 convolutional classifier performs per-pixel classification. This design, combining hierarchical feature fusion and progressive upsampling, leverages semantic guidance from deep features and spatial details from shallow features. It reduces spatial information loss caused by pooling and strides, thereby significantly improving segmentation accuracy, particularly along object boundaries59,60,61,62. The flowchart of the DeepLabv3 + model is shown in Fig. 3.

Fig. 3

The original technical flowchart of the DeepLabv3+ model.

Improved DeepLabv3+ model

Multimodal fusion

Conventional approaches typically use only RGB imagery as input, or combine RGB with DEM through simple band stacking into a four-channel input. Such strategies often fail to fully capture critical topographic information embedded in the DEM. To better exploit multi-source characteristics, this study designs a dual-branch network to process RGB and DEM data separately. A spatial-channel attention fusion module is then employed to integrate spectral and topographic features effectively, with the fused features serving as input to the model.

The Convolutional Block Attention Module (CBAM), as a lightweight attention mechanism, is not only easily integrable into mainstream convolutional neural networks but also incurs nearly negligible computational overhead. It innovatively combines the outputs of both max pooling and average pooling operations to generate attention weights across two critical dimensions: channel and space. This enables the network to dynamically concentrate on more salient features. The detailed architecture of the CBAM module is illustrated in Fig. 4.

Fig. 4

The structure of CBAM.

Within the CBAM framework, the input feature map \(F\) is first modulated by the channel attention mechanism, producing an intermediate feature representation \(F_{1}\). This output is subsequently processed by the spatial attention module to obtain the further refined features \(F_{2}\), forming a sequential cascade. The overall process can be formally expressed by Eq. (1) as follows63:

$$\left\{\begin{array}{l}F_{1}=M_{c}\left(F\right)\otimes F\\ F_{2}=M_{s}\left(F_{1}\right)\otimes F_{1}\end{array}\right.$$
(1)

In Eq. (1), \(M_{c}\left(F\right)\) denotes the output weight matrix from the channel attention mechanism applied to the input feature map \(F\); \(M_{s}\left(F_{1}\right)\) represents the output weight matrix generated by the spatial attention mechanism from the intermediate feature map \(F_{1}\); and the operator \(\otimes\) indicates element-wise multiplication, a weighting operation, between the corresponding feature maps.
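The cascade in Eq. (1) can be illustrated with a small NumPy sketch. It is not the implementation used in this study. It follows the standard CBAM design, in which a shared two-layer perceptron produces the channel weights and a single convolution over channel-wise average and max maps produces the spatial weights. The reduction ratio of 4, the 7 × 7 kernel, the random weights, and all function names are assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(F, W1, W2):
    """M_c(F): shared MLP applied to avg- and max-pooled channel descriptors."""
    def mlp(v):
        return W2 @ np.maximum(W1 @ v, 0.0)   # hidden layer with ReLU
    avg = F.mean(axis=(1, 2))                 # shape (C,)
    mx = F.max(axis=(1, 2))                   # shape (C,)
    return sigmoid(mlp(avg) + mlp(mx))[:, None, None]   # shape (C, 1, 1)

def spatial_attention(F1, kernel):
    """M_s(F1): k x k convolution over channel-wise avg and max maps."""
    pooled = np.stack([F1.mean(axis=0), F1.max(axis=0)])    # shape (2, H, W)
    k = kernel.shape[-1]
    pad = k // 2
    padded = np.pad(pooled, ((0, 0), (pad, pad), (pad, pad)))  # zero padding
    H, W = F1.shape[1:]
    out = np.zeros((H, W))
    for i in range(k):            # accumulate one kernel offset at a time
        for j in range(k):
            window = padded[:, i:i + H, j:j + W]
            out += (kernel[:, i, j, None, None] * window).sum(axis=0)
    return sigmoid(out)[None]     # shape (1, H, W)

def cbam(F, W1, W2, kernel):
    F1 = channel_attention(F, W1, W2) * F      # F1 = M_c(F) (x) F
    F2 = spatial_attention(F1, kernel) * F1    # F2 = M_s(F1) (x) F1
    return F2

# Toy feature map and randomly initialised weights (illustrative only).
rng = np.random.default_rng(0)
C, H, W, r = 8, 16, 16, 4
F = rng.standard_normal((C, H, W))
W1 = rng.standard_normal((C // r, C)) * 0.1
W2 = rng.standard_normal((C, C // r)) * 0.1
kernel = rng.standard_normal((2, 7, 7)) * 0.1
F2 = cbam(F, W1, W2, kernel)
print(F2.shape)
```

Because both attention maps pass through a sigmoid, every weight lies between 0 and 1, so the module rescales features without changing the shape of the feature map.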

ConvNeXt network

Traditional DeepLabv3 + architectures usually employ ResNet or Xception as the backbone. In this study, the ConvNeXt network, characterized by a large-kernel design, is adopted as the backbone. Inspired by the Vision Transformer (ViT), ConvNeXt introduces Transformer design concepts into convolutional networks64. Its refinements include: (1) a stem layer using a 4 × 4 convolution with a large stride for efficient downsampling; (2) inverted bottleneck blocks with 7 × 7 depthwise separable convolutions, expanding the receptive field and strengthening feature extraction; (3) adoption of the Gaussian Error Linear Unit (GELU) activation, which provides smoothness, non-zero negative gradients, and input-dependent probabilistic gating, offering superior representational capacity over ReLU; and (4) the use of LayerNorm in place of BatchNorm, with one LayerNorm layer retained per residual block65. The overall ConvNeXt architecture is shown in Fig. 5.

Fig. 5

ConvNeXt architecture diagram.

Small-object attention mechanism

Some landslides in the study area are small targets (smaller than 50 × 50 pixels). To improve detection accuracy for such objects, a small‑object attention mechanism is introduced, equipped with a spatial-channel attention mechanism applied to low-level features of the backbone. This mechanism suppresses background noise and enhances feature representation of small objects, providing high-quality edge cues for boundary refinement.

In summary, compared to the conventional DeepLabv3+, the improved model introduces optimizations in three aspects: input data, backbone feature extraction, and boundary refinement. The overall workflow is presented in Fig. 6.

Fig. 6

Technical flowchart of the improved DeepLabv3+ model.

Model evaluation

To quantitatively assess the performance of the improved DeepLabv3 + model, five common metrics are used: Precision, Accuracy, Recall, F1-Score, and Intersection over Union (IoU).

Precision refers to the proportion of correctly identified landslide samples among all samples predicted as landslides66:

$$\mathrm{Precision}=\frac{TP}{TP+FP}$$
(2)

Accuracy is defined as the proportion of correctly predicted samples among all samples67:

$$\mathrm{Accuracy}=\frac{TP+TN}{TP+TN+FP+FN}$$
(3)

Recall measures the proportion of correctly predicted landslide samples among all actual landslide samples68:

$$\mathrm{Recall}=\frac{TP}{TP+FN}$$
(4)

The F1-Score, the harmonic mean of Precision and Recall, evaluates both precision and completeness69:

$$\mathrm{F1\text{-}Score}=\frac{2\times\mathrm{Precision}\times\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}$$
(5)

Intersection over Union (IoU) is calculated as the ratio of the intersection area between predicted and ground truth regions to the area of their union70:

$$\mathrm{IoU}=\frac{TP}{TP+FP+FN}$$
(6)

In Eqs. (2)–(6), TP (True Positive) denotes cases where the model correctly predicts a sample as a landslide; TN (True Negative) refers to samples that are correctly predicted as non-landslide; FP (False Positive) indicates cases where the model incorrectly predicts a sample as a landslide; and FN (False Negative) represents samples that are incorrectly predicted as non-landslide.
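Eqs. (2)–(6) translate directly into a few lines of Python. The sketch below assumes the four counts have already been accumulated over the pixels being evaluated. The function name and the example counts are hypothetical and are not taken from the experiments.

```python
def segmentation_metrics(tp, tn, fp, fn):
    """Compute the five metrics of Eqs. (2)-(6) from confusion-matrix counts."""
    def ratio(num, den):
        return num / den if den else 0.0   # avoid division by zero

    precision = ratio(tp, tp + fp)
    accuracy = ratio(tp + tn, tp + tn + fp + fn)
    recall = ratio(tp, tp + fn)
    f1 = ratio(2 * precision * recall, precision + recall)
    iou = ratio(tp, tp + fp + fn)
    return {
        "precision": precision,
        "accuracy": accuracy,
        "recall": recall,
        "f1": f1,
        "iou": iou,
    }

# Hypothetical counts chosen for easy checking by hand.
metrics = segmentation_metrics(tp=80, tn=900, fp=10, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts, Accuracy is 980/1000 and IoU is 80/100, while Precision, Recall, and F1-Score all equal 80/90. The gap between Accuracy and IoU shows why IoU is the stricter measure when background pixels dominate the image.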

Results

Experimental environment

The experimental configuration adopted in this study was as follows. The hardware platform comprised an Intel i9-13900HX CPU and an NVIDIA GeForce RTX 4060 GPU with 8 GB of video memory. The software environment was built on the Python programming language, with the PyTorch framework used to construct and train the deep neural network model. During training, the Adam optimizer was employed with a weight decay coefficient of 0.001. The initial learning rate was set to 0.0001 and decayed using a cosine annealing strategy. Owing to the memory limitations of the GPU, the batch size was set to 4 after testing, the largest value that ran stably on the available hardware. The maximum number of epochs was set to 500, leaving ample margin for model convergence. To mitigate overfitting and conserve computational resources, an early stopping strategy was implemented: training was terminated if the improvement in validation performance remained below a threshold of 0.0001 for 30 consecutive epochs. This combination of patience value and threshold allows timely detection of convergence while preventing premature termination due to short-term fluctuations. Furthermore, a warm-up phase of 50 epochs was established, during which the early stopping mechanism was disabled, ensuring that the model could develop preliminary learning capacity without premature intervention.
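The learning-rate schedule and stopping rule described above can be sketched in plain Python. This is an illustration under stated assumptions, not the training script used in the study: the cosine period is assumed to equal the 500-epoch maximum, the minimum learning rate is assumed to be zero, epochs are counted from zero, the monitored score is assumed to be higher-is-better, and the names `cosine_lr` and `EarlyStopper` are hypothetical.

```python
import math

def cosine_lr(epoch, max_epochs=500, lr_init=1e-4, lr_min=0.0):
    """Cosine-annealed learning rate (period and lr_min are assumptions)."""
    cosine = math.cos(math.pi * epoch / max_epochs)
    return lr_min + 0.5 * (lr_init - lr_min) * (1.0 + cosine)

class EarlyStopper:
    """Stop when the validation score stalls for `patience` epochs after warm-up."""

    def __init__(self, patience=30, min_delta=1e-4, warmup=50):
        self.patience = patience
        self.min_delta = min_delta
        self.warmup = warmup
        self.best = -math.inf
        self.stale = 0

    def step(self, epoch, score):
        """Record one epoch's validation score; return True if training should stop."""
        improved = score > self.best + self.min_delta
        if improved:
            self.best = score
        if epoch < self.warmup or improved:
            self.stale = 0       # counter stays at zero during warm-up
        else:
            self.stale += 1
        return self.stale >= self.patience

# Simulated validation curve: improves until epoch 59, then plateaus.
stopper = EarlyStopper()
stopped_at = None
for epoch in range(500):
    score = min(epoch, 59) * 0.01
    if stopper.step(epoch, score):
        stopped_at = epoch
        break
print(stopped_at, cosine_lr(0), cosine_lr(250))
```

In this simulated run the score last improves at epoch 59, so the 30-epoch patience is exhausted at epoch 89 and training stops well before the 500-epoch ceiling.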

The detailed configurations of the comparative models are described below. Specifically, the UNet model was configured with a four-layer deep encoder and optimized using the Adam optimizer with an initial learning rate of 0.001. The DeepLabv3 + model utilized a ResNet101 backbone as its feature extraction network, also employed the Adam optimizer, and was set with an initial learning rate of 0.001. The Swin Transformer model was trained using the AdamW optimizer, with an initial learning rate of 0.0001 and a weight decay coefficient of 0.01. The SegFormer model, incorporating a MiT-B5 encoder structure, adopted the AdamW optimizer, an initial learning rate of 0.0001, and a weight decay of 0.001. The HRNet model utilized Stochastic Gradient Descent (SGD) as its optimization method, with an initial learning rate of 0.01. The Fast-SCNN model employed the Adam optimizer with its initial learning rate set to 0.001.

Comparative experiments

To evaluate the effectiveness of the proposed FCA-DeepLab model for landslide identification, comparative experiments were conducted against several widely used semantic segmentation models: UNet, Swin Transformer, SegFormer, HRNet, Fast-SCNN, and the original DeepLabv3+. As presented in Table 1, all seven models achieved landslide detection accuracy above 0.8, indicating competent recognition capability. Across the five evaluation metrics, the proposed model outperformed the others. Specifically, FCA-DeepLab achieved an Accuracy of 0.912 and an IoU of 81.8%, improvements of 0.057 and 5.3 percentage points, respectively, over the original DeepLabv3+, and surpassed the other models by at least 0.037 and 2.9 percentage points. Moreover, the model attained Precision, Recall, and F1-Score values of 0.865, 0.870, and 0.867, respectively, gains of 0.055, 0.055, and 0.054 over the baseline DeepLabv3+.

The superior performance of the proposed model is primarily attributed to the multimodal fusion mechanism introduced in the FCA-DeepLab architecture. By employing a dual-branch network to process DEM and RGB data in parallel and integrating a spatial-channel attention fusion module, the model effectively combines topographic and spectral features to derive more discriminative representations. These enhanced features are further processed by the ConvNeXt backbone, which applies 7 × 7 convolutional kernels to improve feature extraction. In addition, the incorporated small-target detection head leverages a spatial-channel attention mechanism to retain spatial details from low-level features—such as edges, textures, and the precise contours of small objects—thereby addressing the common problem of overlooked small targets.

Table 1 Comparison of the five accuracy metrics across the seven models.

Figure 7 illustrates the detection results of the different models on the test set. While all models exhibit varying degrees of false positives and false negatives, each demonstrates the ability to delineate landslide boundaries when spectral contrast with the background is strong. However, UNet, Swin Transformer, SegFormer, HRNet, Fast-SCNN, and the original DeepLabv3 + showed relatively weaker extraction performance, with more pronounced misclassification and omission errors. In contrast, FCA-DeepLab produced fewer such errors, particularly excelling in small-target detection where it achieved clearer boundary delineation. This demonstrates that the specialized small-target detection head significantly improves sensitivity to subtle features, enhancing segmentation performance for small-scale landslides.

In summary, the proposed FCA-DeepLab model surpasses the other six models in both quantitative metrics and qualitative segmentation results. These findings confirm that the introduced modifications effectively enhance boundary processing capability, enabling more accurate delineation and prediction of landslide boundaries, which is vital for efficient and reliable landslide monitoring.

Fig. 7

Comparison of the results of the seven models. (a)-(g) represent the UNet, DeepLabv3+, Swin Transformer, SegFormer, HRNet, Fast-SCNN, and FCA-DeepLab models, respectively.

Ablation experiments

To systematically assess the contribution of each innovative component in the proposed FCA-DeepLab model, a comprehensive ablation study was conducted. This study aimed to isolate and evaluate the impact of individual modules on overall performance, thereby clarifying their functional significance. The experiments focused on three aspects: first, excluding the multimodal fusion mechanism to examine its role in integrating spectral and topographic information; second, replacing the ConvNeXt backbone with the original ResNet101 to compare their relative effectiveness; and third, removing the small‑object attention mechanism to assess its influence on the segmentation accuracy of small landslide boundaries.

Quantitative results, including Intersection over Union (IoU) and F1-Score, are summarized in Table 2. For more intuitive interpretation, visualizations of prediction outcomes on the test set are also provided, enabling direct comparison of model performance across different configurations, as shown in Fig. 8.
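
For reference, the two metrics can be computed from pixel-wise counts of true positives (TP), false positives (FP), and false negatives (FN) for the landslide class, using the standard definitions IoU = TP / (TP + FP + FN) and F1 = 2TP / (2TP + FP + FN). The following is a minimal sketch under stated assumptions, not the evaluation code used in this study: the function names, the binary mask layout (1 for landslide, 0 for background), and the convention of returning 0.0 for an empty denominator are all illustrative choices.

```python
from typing import Sequence, Tuple


def confusion_counts(
    pred: Sequence[Sequence[int]], truth: Sequence[Sequence[int]]
) -> Tuple[int, int, int]:
    """Return (TP, FP, FN) for the landslide class (label 1) of two binary masks."""
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p == 1 and t == 1:
                tp += 1  # landslide pixel correctly predicted
            elif p == 1 and t == 0:
                fp += 1  # background pixel predicted as landslide
            elif p == 0 and t == 1:
                fn += 1  # landslide pixel missed
    return tp, fp, fn


def iou(tp: int, fp: int, fn: int) -> float:
    """Intersection over Union: TP / (TP + FP + FN)."""
    denom = tp + fp + fn
    # Assumed convention: report 0.0 when neither mask contains a landslide.
    return tp / denom if denom else 0.0


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1-Score: harmonic mean of precision and recall, 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy 3 x 3 masks (hypothetical values, for illustration only).
pred_mask = [[1, 1, 0], [0, 1, 0], [0, 0, 0]]
true_mask = [[1, 1, 0], [0, 0, 1], [0, 0, 0]]

tp, fp, fn = confusion_counts(pred_mask, true_mask)
print(tp, fp, fn, iou(tp, fp, fn), round(f1_score(tp, fp, fn), 3))
```

Because both metrics ignore true negatives, they are not inflated by the large background area that is typical of landslide imagery.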

The quantitative metrics in Table 2 show that all evaluation indicators decline when the multimodal fusion mechanism is removed, leaving performance slightly below that of the complete model. This outcome preliminarily confirms the necessity of multimodal information fusion. More importantly, visual comparison of the prediction maps reveals that, without topographic features, the model tends to confuse landslides with non-landslide backgrounds that have similar spectral characteristics (e.g., exposed rock or certain vegetation covers). It misclassifies more of these background pixels as landslides, which raises the false positive rate. These observations indicate that the multimodal fusion mechanism exploits topographic information (e.g., elevation, slope) as a discriminative cue, reducing the model’s over-reliance on spectral features alone and enabling more accurate distinction between landslides and confusing backgrounds in complex terrain.

For the backbone network comparison, ResNet101 was employed as a baseline. While the ResNet101-based model achieved acceptable overall accuracy, it showed no advantage over the ConvNeXt-based model, and the recall rates of the two, which reflect the ability to capture landslides, were generally comparable. However, qualitative analysis reveals clear differences: the ConvNeXt backbone produced more refined and accurate boundaries, with clearer contours that adhere more closely to the landslide outlines. This suggests that the ConvNeXt network, through its larger kernel design and modernized architecture, enhances feature extraction and contextual representation, particularly in boundary determination. As a result, it delivers segmentation outcomes with superior boundary precision, even if the improvement is not fully reflected in overall pixel-level accuracy metrics.

The ablation experiment targeting the small‑object attention mechanism shows that its removal does not cause a dramatic decrease in overall accuracy metrics, similar to the trends observed in the other two ablation tests. Nevertheless, targeted case comparisons highlight its crucial role: the module significantly enhances sensitivity to small-scale landslides, improving the completeness and clarity of their boundary segmentation. For medium-to-large landslides, however, the addition of this module does not markedly alter boundary performance. This indicates that the small-target module functions as a specialized component, with its primary contribution being the optimization of segmentation detail for small-scale features. It compensates for the limitations of standard segmentation models in handling scale variation and improves the overall adaptability and fine-detail recognition capacity of the model across landslides of varying sizes.
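
One way to see why removing the module barely changes the overall metrics is that pixel-level scores are dominated by large landslides. The sketch below contrasts pixel-level recall with a per-landslide recall restricted to small objects. It is an illustration only: every number in it is hypothetical and is not a result from this study, and the 100-pixel size threshold, the 0.5 overlap criterion, and all names are assumptions introduced for the example.

```python
from typing import List, Tuple

# Each landslide is (true_area_px, correctly_predicted_px), with true_area_px > 0.
# All values are hypothetical and chosen only to illustrate the effect.
Landslide = Tuple[int, int]

SMALL_AREA_PX = 100  # assumed size threshold for a "small" landslide


def pixel_recall(landslides: List[Landslide]) -> float:
    """Recall over all landslide pixels, so large landslides dominate."""
    total = sum(area for area, _ in landslides)
    hit = sum(found for _, found in landslides)
    return hit / total if total else 0.0


def object_recall(landslides: List[Landslide], min_overlap: float = 0.5) -> float:
    """Share of landslides whose detected fraction reaches min_overlap."""
    if not landslides:
        return 0.0
    detected = sum(1 for area, found in landslides if found / area >= min_overlap)
    return detected / len(landslides)


def small_only(landslides: List[Landslide]) -> List[Landslide]:
    """Keep only landslides below the assumed small-object threshold."""
    return [ls for ls in landslides if ls[0] < SMALL_AREA_PX]


# Two large and two small landslides, with and without a small-target module.
with_head = [(5000, 4600), (3000, 2700), (40, 30), (25, 20)]
without_head = [(5000, 4600), (3000, 2700), (40, 8), (25, 0)]

print(round(pixel_recall(with_head), 3), round(pixel_recall(without_head), 3))
print(object_recall(small_only(with_head)), object_recall(small_only(without_head)))
```

In this toy case the pixel-level recall changes by less than one percentage point, while every small landslide is lost without the module, which mirrors the pattern described above.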

The combined results of the three ablation experiments demonstrate that each innovative component contributes to performance enhancement and that they operate synergistically to improve comprehensive accuracy and robustness across diverse conditions. In summary, the core contributions of this study—the multimodal fusion mechanism, the ConvNeXt backbone, and the small-target detection module—complement one another by strengthening feature fusion, feature extraction, and detail optimization, respectively, thereby enhancing both accuracy and reliability in landslide identification.

Fig. 8

Results of the ablation experiment. (a)-(d) show the landslide prediction results for the following configurations, respectively: without the multimodal fusion mechanism; with the ResNet101 backbone network; without the small-object attention mechanism; and the complete FCA-DeepLab.

Table 2 Accuracy metrics of ablation experiment.

Generalization experiment

To further assess the generalization ability of the improved model, it was tested on a publicly available landslide dataset of Bijie City, Guizhou Province, China, released by Wuhan University71. A key distinction between the datasets lies in landslide types: those in the upper Yellow River region are predominantly loess landslides, while those in Bijie City are largely vegetation-covered. Their spectral characteristics in optical imagery, and their contrast with the background, differ considerably between the two regions. Thus, testing on the Bijie City dataset effectively validates the generalization performance of the improved model.

Table 3 presents the accuracy results of the improved model trained on the Bijie dataset. The extraction accuracy for Bijie landslides is comparable to that obtained in the primary study area, indicating strong generalization and promising potential for wider application.

Table 3 Generalization experiment accuracy metric.

Figure 9 illustrates the predictive performance of the model on selected Bijie City samples, highlighting results across three representative land cover types. In Fig. 9 (a), where the landslide is surrounded by man-made features such as roads and buildings, the model accurately distinguishes the landslide body from artificial structures, with no misclassification of urban features. This demonstrates strong resistance to interference in complex built environments. Figure 9 (b) displays detection results for a series of small landslides. Such targets are often difficult to identify due to their scale and inconspicuous features. The model nevertheless maintains high sensitivity, delineating their boundaries with relative completeness. Figure 9 (c) shows a vegetated area. Despite spectral convergence between landslides and vegetation, the model correctly captures the landslide distribution without major omissions, highlighting robustness under vegetation interference.

However, certain limitations remain in precise boundary reconstruction. Some delineated edges show serrated irregularities or localized blurring, suggesting that the model’s capacity for fine-grained edge restoration still requires improvement. Future research will focus on optimizing the boundary regression strategy to further enhance the geometric fidelity of segmentation results.

Fig. 9

Some extraction results of the Bijie landslide. (a) corresponds to landslides adjacent to buildings, roads, and bare rock edges; (b) corresponds to small-scale landslides; (c) corresponds to vegetation-covered landslides. Map created using Python 3.9 (https://www.python.org).

Discussions

This study addresses major challenges in landslide detection under complex environments, including the omission of small targets and limitations in feature extraction, by proposing a method that fuses topographic data and optical imagery with an improved DeepLabv3+ model. Through systematic integration of three key modules (multimodal feature fusion, backbone optimization, and small-target detection), the approach achieves significant performance gains. A comprehensive discussion of the findings is presented below.

Innovation and effectiveness of the method

The core innovations of this research are reflected in three aspects. First, the dual-branch multimodal feature fusion module effectively bridges the modality gap between optical imagery and topographic characteristics. By applying deep feature fusion instead of simple input-level stacking, the model learns stronger correlations between visual and terrain features. This substantially enhances discrimination between spectrally similar surfaces. In the loess landslide zones of the upper Yellow River region, where landslides often occur among exposed rock and soil, visual features alone are insufficient. Experiments confirm that the multimodal module significantly improves accuracy in confusion-prone bare rock areas.

Second, replacing the original ResNet backbone with the ConvNeXt network provides a critical enhancement. Its use of 7 × 7 convolutional kernels and redesigned architecture expands the receptive field and captures broader contextual information. This is particularly important for landslide detection, where spatially extensive features require context-rich representation.
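
The effect of kernel size on the receptive field can be checked with simple arithmetic. For a stack of convolutional layers, each layer with kernel size k enlarges the receptive field by (k - 1) times the cumulative stride of the preceding layers. The sketch below applies this standard recurrence to plain stacks of stride-1 layers. It is an idealized illustration, not the actual layer configuration of ResNet101 or ConvNeXt; the function name is hypothetical and dilation is assumed to be 1.

```python
from typing import Iterable, Tuple


def receptive_field(layers: Iterable[Tuple[int, int]]) -> int:
    """Theoretical receptive field along one axis, in input pixels.

    Each layer is given as (kernel_size, stride); dilation is assumed to be 1.
    """
    rf = 1    # a single input pixel sees only itself
    jump = 1  # spacing, in input pixels, between neighbouring outputs
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # growth contributed by this layer
        jump *= stride             # later layers step over more input pixels
    return rf


# Three stacked stride-1 layers with 3 x 3 versus 7 x 7 kernels.
three_3x3 = receptive_field([(3, 1)] * 3)
three_7x7 = receptive_field([(7, 1)] * 3)
print(three_3x3, three_7x7)
```

Under these assumptions, three 3 × 3 layers cover 7 input pixels per axis, whereas three 7 × 7 layers cover 19, so each large-kernel layer gathers context over a wider neighbourhood at the same depth.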

Finally, the added small-target detection head with a spatial-channel attention mechanism enables the model to adaptively focus on small-scale regions. This considerably improves detection of small landslides and enhances precision in boundary segmentation.

Compared with existing research

The findings are consistent with existing research trends while offering measurable improvements. Existing studies generally recognize the importance of integrating optical imagery and topographic data for enhancing landslide identification accuracy, but conventional methods predominantly rely on input-level fusion (such as simply combining DEM with RGB imagery), which often fails to adequately capture deep inter-modal correlations. For instance, Lu et al.10 proposed a dual-encoder U-Net model that incorporates a hierarchical design to integrate optical and DEM-derived deep features, successfully detecting specific landslides. However, the reported landslide detection accuracy of this method was only 0.784. Similarly, Li et al.72 introduced a DemDet network, which employs an attention-based neural network to integrate DEM, hillshade, and optical imagery for identifying forested landslides in the Jiuzhaigou earthquake-affected area. Their results indicated a detection accuracy of only 0.800 in specific regions. By contrast, the dual-branch fusion mechanism in this study achieves feature-level integration. The spatial-channel attention modules strengthen interaction between spectral and topographic features, effectively reducing false positives caused by spectrally similar surfaces like bare soil or roads. This underscores the advantage of deep fusion over basic stacking. The experimental results demonstrate that the proposed method achieved a landslide detection accuracy exceeding 0.830 across multiple datasets, including both our proprietary dataset and the external Bijie City landslide validation set, thereby confirming its effectiveness and reliability.

In addition, improvements targeting small-scale detection address a known weakness in existing semantic segmentation models. Methods such as DeepLabv3+ often prioritize large-scale feature extraction and neglect detailed representation of small targets, leading to omissions. While some studies introduced attention mechanisms, they are usually applied globally without a specialized focus. The dedicated small-target detection head in this study autonomously suppresses background noise and enhances feature responses in small regions. Ablation results confirm that this module markedly improves recall for small landslides, highlighting the value of modular designs tailored to specific challenges over general global adjustments.

Limitations and prospects

This study has limitations mainly linked to image acquisition and computational complexity. Regarding imagery, only single-temporal data from January 2025 were available, limiting representation of seasonal variations. In the upper Yellow River region, summer vegetation significantly alters landslide signatures compared to winter scenes. Furthermore, although experimental results demonstrate the improved model’s generalizability, it still exhibits insufficient temporal generalization. Consequently, subsequent work should prioritize acquiring images from other seasons (such as summer and autumn) to enable the model to fully learn from landslides with diverse visual features, thereby enhancing its adaptability to seasonal variations and different surface cover conditions.

Regarding the model, ConvNeXt with large kernels improves feature extraction but increases training time by about 75% compared with the original version. Model performance is also partly dependent on image resolution. While 2 m GF-6 imagery was used here, statistical analysis indicates that small landslides account for about 35% of events in the study area, and their reliable detection may require even finer-resolution data. Therefore, to address challenges such as extended training times, future work should explore the adoption of mixed-precision training and multi-GPU data parallelism techniques. These approaches would fully leverage the computational capabilities of modern hardware, significantly reducing the training cycle. Furthermore, efforts should be directed toward employing higher-resolution optical remote sensing imagery to investigate its impact on model performance and to further validate and extend the findings of this study.

Conclusions

This study proposes and validates an enhanced landslide detection method (FCA-DeepLab) based on an improved DeepLabv3+ architecture. The method achieves refined delineation of landslide boundaries and reliable identification of small-scale landslides in high-resolution remote sensing imagery. The principal conclusions are as follows:

(1) Multimodal Feature Fusion Module: A dual-branch network was used to process DEM and optical imagery in parallel. By leveraging spatial-channel attention for deep feature interaction, the module significantly enhances discrimination between spectrally similar surface features.

(2) ConvNeXt Backbone Network: Replacing the conventional ResNet with ConvNeXt, which employs large 7 × 7 kernels, expands the receptive field and strengthens contextual awareness, thereby improving the characterization of large-scale landslides.

(3) Small-Object Attention Mechanism: A specialized component with spatial-channel attention preserves low-level details, markedly improving recall for small-scale landslides and addressing omission problems common in conventional segmentation models.

The framework developed in this study provides a reliable tool for rapid and accurate landslide hazard monitoring while offering transferable insights for other remote sensing segmentation tasks. Future work should focus on integrating multi-temporal and multi-resolution datasets to capture seasonal variations and on optimizing efficiency through lightweight architectures, thereby enhancing practical deployment.