Introduction

Geological hazards, particularly landslides, are significant natural processes that pose a major threat to human lives and infrastructure globally1,2. Geological hazards cause extensive economic losses and casualties each year3, often triggered by a complex interplay of environmental factors (such as topography, geology, climate) and triggering factors (e.g., seismic activity, heavy rainfall)4. To mitigate these risks, landslide susceptibility mapping (LSM) has become an indispensable tool for spatially predicting landslide occurrence and guiding effective disaster management and land use planning5.

Traditionally, LSM relied on heuristic and statistical methods6. In recent years, the field has been transformed by the emergence of machine learning (ML) models. These models are increasingly favored for their ability to capture complex, non-linear relationships within high-dimensional data7, leading to more accurate and reliable predictions. Among the various ML algorithms, the random forest (RF) model has gained widespread recognition8,9. It is an ensemble learning algorithm that combines multiple decision trees, offering stability, high accuracy, and computational efficiency10. Furthermore, a key advantage of the RF model is its ability to quantify the importance of different environmental factors11, providing critical insights into the underlying mechanisms of landslide development. Other algorithms have also been applied: support vector machines (SVM) are known for their effectiveness in high-dimensional spaces12, and XGBoost13 and deep learning models14 have been explored to further enhance prediction performance.

Despite the proliferation of ML-based LSM studies, a significant gap remains in applying these advanced techniques to specific, high-stakes linear infrastructure, such as high-speed railways15. Notably, infrastructure projects, particularly railway construction, are critically influenced by geological hazards16. The construction and use of roads can damage surface vegetation, leading to slope instability that threatens the safety of the infrastructure17. Slope failures can not only disrupt transportation and damage property but also cause casualties and adversely affect socio-economic stability and regional development. Therefore, it is particularly important to evaluate the risk of geological hazards along major transportation routes that are under construction or in operation.

The Hefei-Fuzhou High-speed Railway is a critical transportation artery connecting the central and southeastern regions of China. This railway traverses a diverse and complex geomorphological landscape, including plains, hills and mountains, making it highly vulnerable to geological hazards18. In recent years, several catastrophic geological hazards have been reported in nearby regions. For example, an embankment collapse disrupted the Meida Highway and killed 48 travelers in May 202419. A disastrous avalanche in Lishui City claimed 38 lives in November 201520. A major landslide in Chengtian village, Zhejiang destroyed a section of a provincial highway and blocked the local traffic for 17 days21. These catastrophic hazards in nearby regions with similar climatic and topographic characteristics highlight the necessity of mapping regional landslide susceptibility for this crucial linear infrastructure22. Zhou et al.23 and Huang et al.24 mapped landslide susceptibility for a few counties in Jiangxi Province, but no work has been done for Anhui Province, the northern part of our study area. Zhou et al.25 mapped landslide susceptibility for Fujian Province. While these regional-scale LSM studies have been conducted in the provinces along this corridor, they often lack the fine-scale resolution required for infrastructure-specific risk management. Specifically, previous studies have not provided a comprehensive and tailored landslide susceptibility assessment for this crucial railway corridor, nor have they systematically addressed the distinct landslide-driving mechanisms in its geologically different northern and southern sections.

To address these critical knowledge gaps, this study provides a comprehensive landslide susceptibility assessment for the Hefei-Fuzhou High-Speed Railway corridor using the Random Forest model. We aim to offer robust scientific support for disaster prevention and mitigation along this major transportation artery. The central objectives of this study are threefold:

  • To develop a robust landslide susceptibility model using the Random Forest algorithm and generate a high-resolution susceptibility map for the railway corridor.

  • To evaluate the importance of various landslide influencing factors to identify the dominant drivers of slope instability in the study area.

  • To conduct a comparative analysis of the northern and southern sub-regions to reveal segment-specific controlling mechanisms and provide a foundation for tailored risk mitigation strategies.

This study will fill a key gap in geological hazard mapping for one of China’s most important high-speed railway corridors, offering a valuable methodological framework for similar infrastructure projects in developing regions.

Study area

In this study, the Huangshan-Fuzhou section of Hefei-Fuzhou High-Speed Railway and its vicinity is selected as the study area. This crucial transportation artery, located in southeastern China, traverses a complex geological and geomorphological landscape, passing through five cities in Anhui Province, one city in Jiangxi Province, and two cities in Fujian Province before reaching Fuzhou City (Fig. 1).

The study area exhibits significant topographic variability, with elevation ranging from approximately −8 m to over 2000 m. The terrain is dominated by low-to-medium mountains and hills, characterized by steep slopes and deeply incised valleys, which are conducive to slope instability. The region spans multiple tectonic units, including the Yangtze Plate and South China Plate, and exhibits well-developed fault structures and relatively active neotectonic movements. The main exposed strata consist predominantly of Presinian mica-quartz schist and the third and fourth intrusive granites from the Yanshanian period, with developed folds and relatively fractured rock mass, resulting in a complex geological hazard background26.

The region is characterized by a subtropical monsoon climate, with a distinct long rainy season extending from April to September with an average annual rainfall of 1283–2308 mm. The area is also frequently affected by typhoons in the summer and autumn, which can bring short-duration, high-intensity rainfall events. Such extreme precipitation acts as a key triggering factor, leading to soil saturation, increased pore water pressure, and a subsequent decrease in slope stability, thereby significantly increasing landslide risk27.

The study area has a history of significant landslide activity. As of 2020, more than 9,000 landslides of varying scales had been recorded within Jiangxi Province alone28. There are many landslides in the northeast and southwest regions of the corridor, and the distribution is concentrated (the red dots in Fig. 1 represent large-scale landslide traces). These large-scale landslide traces are closely associated with areas of high relief and intense human activity, highlighting the need for a systematic and comprehensive susceptibility assessment.

Fig. 1

Location of the study area (The basemap is a hillshade map produced from a DEM. The legend and other map elements were generated using QGIS 3.10 (ref. 29)).

Methodology

Data

Landslide and non-landslide samples

A comprehensive landslide inventory for the study area was compiled through a multi-step process. Using high-resolution satellite imagery from Google Earth, we employed visual interpretation to identify and delineate the boundaries of landslide traces along the Hefei-Fuzhou High-Speed Railway corridor30,31. This was systematically verified and supplemented by field investigations and a review of existing literature, ensuring the precise spatial positioning and accuracy of the inventory. This process yielded a dataset of 2,299 landslide locations, which were used as our positive samples (the landslide distribution is shown in Fig. 1).

To create a balanced dataset, an equal number of non-landslide points (negative samples) were generated. We first identified the non-landslide areas by taking the spatial complement of the landslide polygons within the study region. Subsequently, 2,299 random points were generated within these non-landslide areas using GIS software. To ensure spatial representativeness and avoid sampling bias, a minimum distance of 1 km was maintained between any two sample points. The final dataset, consisting of 4,598 landslide and non-landslide points, was carefully screened for invalid values and served as the training data for the machine learning model.
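
The negative-sampling step described above can be sketched as simple rejection sampling. This is only an illustration under stated assumptions: coordinates are projected and in metres, the study region is approximated by a rectangular bounding box rather than the real corridor polygon, and the function name `sample_non_landslide_points` is hypothetical. The paper states only that GIS software was used, so this is not the authors' tool.

```python
import numpy as np

def sample_non_landslide_points(landslide_xy, bounds, n_points, min_dist=1000.0,
                                seed=42, max_tries=200_000):
    """Draw random points at least `min_dist` metres from every other sample point."""
    rng = np.random.default_rng(seed)
    xmin, ymin, xmax, ymax = bounds
    existing = np.asarray(landslide_xy, dtype=float).reshape(-1, 2)
    accepted = []
    for _ in range(max_tries):
        if len(accepted) == n_points:
            break
        candidate = np.array([rng.uniform(xmin, xmax), rng.uniform(ymin, ymax)])
        # Compare the candidate with landslide points and already accepted points
        others = np.vstack([existing, np.array(accepted).reshape(-1, 2)])
        if others.size == 0 or np.min(np.hypot(*(others - candidate).T)) >= min_dist:
            accepted.append(candidate)
    if len(accepted) < n_points:
        raise RuntimeError("Could not place all points; relax min_dist or enlarge bounds.")
    return np.array(accepted)

# Toy example (synthetic coordinates, NOT the study inventory)
landslides = np.array([[5_000.0, 5_000.0], [20_000.0, 12_000.0]])
negatives = sample_non_landslide_points(landslides, (0, 0, 50_000, 50_000), n_points=20)
```

In a real workflow the candidate would additionally be tested against the landslide polygons, so that points falling inside a mapped landslide are rejected.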

Environmental factors

The occurrence of landslides is a complex phenomenon influenced by the interaction of multiple environmental factors. A scientifically grounded selection of these factors is crucial for developing an accurate landslide susceptibility model. Based on a comprehensive review of existing literature32,33 and the specific characteristics of the study area, we selected 13 key factors, which were categorized into four groups: topographic, geological, hydrological, and land cover factors (Table 1; Fig. 2).

Topographic factors

Topography plays a crucial role in controlling landslide occurrence by influencing slope stability, water flow and weathering processes34. Six topographic factors were derived from a high-resolution Digital Elevation Model (DEM): Slope, Aspect, Curvature, Topographic Position Index (TPI), Topographic Relief, and Terrain Wetness Index (TWI). Slope angle is widely recognized as the most critical factor influencing landslide susceptibility. Steeper slopes are more susceptible to failure as they experience higher shear stress, reducing overall stability. Aspect affects the duration and intensity of solar radiation and rainfall received by a slope. This in turn influences soil moisture, weathering processes, and vegetation distribution, leading to variations in slope stability35. Curvature affects the convergence or divergence of surface runoff and subsurface water flow, which affects soil moisture and pore water pressure, thereby impacting slope stability36. TPI reflects the relative position of a point on a slope compared to its surroundings, which helps to identify different landforms (e.g., valleys, ridges, and slopes) with distinct susceptibility characteristics. Relief measures the complexity of the terrain by quantifying local variation in elevation. TWI represents the spatial distribution of soil moisture, which is a critical hydrological factor37.
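
To make the DEM-derived factors concrete, the following is a minimal sketch of how Slope, TPI, Relief and TWI can be computed from a gridded DEM. The 3x3 window, the square cell size, and all function names are assumptions made for illustration; the paper does not report the window sizes or GIS tools used. TWI is written in its common form ln(a / tan β), where the specific catchment area `a` must come from a separate flow-accumulation step that is not shown here.

```python
import numpy as np

def slope_degrees(dem, cell_size):
    """Slope angle in degrees from finite-difference elevation gradients."""
    dz_dy, dz_dx = np.gradient(dem.astype(float), cell_size)
    return np.degrees(np.arctan(np.hypot(dz_dx, dz_dy)))

def window_stack(dem):
    """Stack the 3x3 neighbourhood of every cell (edges padded by replication)."""
    padded = np.pad(dem.astype(float), 1, mode="edge")
    rows, cols = dem.shape
    return np.stack([padded[i:i + rows, j:j + cols] for i in range(3) for j in range(3)])

def tpi(dem):
    """Topographic Position Index: cell elevation minus the mean of its 8 neighbours."""
    neighbour_mean = (window_stack(dem).sum(axis=0) - dem) / 8.0
    return dem - neighbour_mean

def relief(dem):
    """Local relief: elevation range within the 3x3 window."""
    stack = window_stack(dem)
    return stack.max(axis=0) - stack.min(axis=0)

def twi(specific_catchment_area, slope_deg, eps=1e-6):
    """TWI = ln(a / tan(beta)); eps avoids division by zero on flat cells."""
    tan_beta = np.maximum(np.tan(np.radians(slope_deg)), eps)
    return np.log(specific_catchment_area / tan_beta)
```

On a planar surface that rises one unit per cell, `slope_degrees` returns 45° everywhere and `tpi` is zero for interior cells, which is a quick sanity check before applying the functions to real data.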

Geological factors

Among the geological factors, we selected geology and distance to faults as evaluation indicators. The geological map of the study area provides a basis for assessing susceptibility based on the lithological characteristics of a region. Faults are zones of structural weakness and fracturing. The closer a location is to a fault, the higher the likelihood of rock mass fracturing and weathering, making it more susceptible to landslide38.

Hydrological and land cover factors

Hydrological conditions and human activities are also key factors that affect slope stability. River erosion at the toe of a slope can reduce its lateral support, leading to instability. Therefore, proximity to rivers is often a significant factor in landslide susceptibility. Precipitation is a primary triggering factor for landslides. Prolonged and intense rainfall increases soil saturation, raises groundwater levels, and elevates pore water pressure, reducing the shear strength of the slope material. Land cover reflects both natural conditions and human activities; the latter can disturb slope equilibrium and significantly increase landslide risk. Fractional vegetation cover (FVC) reflects the density of vegetation in a region, directly influencing slope stability by affecting the efficiency of rainfall infiltration and the intensity of surface runoff. Higher FVC generally correlates with greater slope stability.

All these factors were processed using GIS software, and their data sources are detailed in Table 1. The spatial distribution of these factors across the study area is shown in Fig. 2.

Table 1 Environmental factors and data sources.
Fig. 2

Influence factors in the study area: (a) River (b) DEM (c) Relief (d) Slope (e) Curvature (f) TWI (g) Rain (h) FVC (i) TPI (j) Fault (k) Landcover (l) Geology (m) Aspect (Note: This map was generated using QGIS 3.10 (ref. 29) with a base map from Tianditu (ref. 44)).

Methods

The overall workflow of this study is depicted in the flow chart (Fig. 3). First, a landslide inventory was created by visual interpretation of high-resolution satellite images, followed by field investigation. The landslide samples and the randomly generated non-landslide points constitute the model training samples, which were then integrated with the environmental factor data. The RF algorithm was used to train the susceptibility model, which was subsequently applied to the entire study area to predict landslide susceptibility values. Finally, a landslide susceptibility map was generated and classified into five risk levels (Very Low, Low, Moderate, High and Very High) using the Jenks Natural Breaks classification method.

Fig. 3

The overall workflow of this work.

Building landslide susceptibility model

As an ensemble learning algorithm, RF utilizes decision trees as base classifiers. It repeatedly performs bootstrap sampling from the original dataset to generate multiple subsets, and constructs a decision tree for each subset. These individual trees are then combined into an ensemble. During prediction, each tree votes on the classification outcome, and the final result is determined by majority voting across all trees in the forest.

For the model training, we configured the hyperparameters as follows: the forest was composed of 100 decision trees, each with a maximum depth of 8. The minimum number of samples required to split an internal node was set to 10, while the minimum for a leaf node was 5. At each split, the number of features considered was the square root of the total feature count. The model was trained using the 13 environmental factors described previously. To prepare the data, the landslide and non-landslide samples were divided into training and validation sets at an 80/20 ratio, with a fixed random seed (random_state = 42) to ensure reproducibility.
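
The configuration above maps directly onto a scikit-learn `RandomForestClassifier`. The sketch below assumes scikit-learn is the library in use (the paper does not name it, although `random_state = 42` follows that library's naming) and uses a synthetic dataset of the same shape as a stand-in for the real samples. Whether the split was stratified is not stated, so a plain random split is used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for 4,598 samples x 13 factors (NOT the study data)
X, y = make_classification(n_samples=4598, n_features=13, n_informative=6,
                           random_state=42)

# 80/20 split with the fixed seed reported in the text
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

rf = RandomForestClassifier(
    n_estimators=100,       # number of trees
    max_depth=8,            # maximum tree depth
    min_samples_split=10,   # minimum samples to split an internal node
    min_samples_leaf=5,     # minimum samples in a leaf
    max_features="sqrt",    # features considered at each split
    random_state=42,        # assumed seed for the forest itself
)
rf.fit(X_train, y_train)

# Predicted landslide probability, used as the susceptibility value
susceptibility = rf.predict_proba(X_val)[:, 1]
```

Note that scikit-learn's forest averages the class probabilities of the individual trees rather than counting hard votes, so `predict_proba` yields a continuous susceptibility value between 0 and 1.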

The choice of 100 trees for our RF model was based on a trade-off between computational efficiency and predictive performance. Although increasing the number of trees (e.g., to 500 or 1,000) could theoretically enhance model stability, preliminary tests indicated that the model’s accuracy and stability plateau at around 100 trees, while using more trees significantly increases computation time without a meaningful improvement in predictive power.
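
A plateau test of this kind could look like the sketch below. It is an assumption about how such a check might be run, not a record of the authors' preliminary tests: the data are synthetic, the candidate tree counts are illustrative, and validation AUC is assumed as the comparison metric.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the study data)
X, y = make_classification(n_samples=2000, n_features=13, n_informative=6,
                           random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

rf = RandomForestClassifier(max_depth=8, min_samples_split=10, min_samples_leaf=5,
                            max_features="sqrt", warm_start=True, random_state=42)

auc_by_trees = {}
for n_trees in (10, 50, 100, 200, 500):
    rf.set_params(n_estimators=n_trees)  # warm_start adds trees instead of refitting
    rf.fit(X_train, y_train)
    auc_by_trees[n_trees] = roc_auc_score(y_val, rf.predict_proba(X_val)[:, 1])

for n_trees, auc_value in auc_by_trees.items():
    print(f"{n_trees:>4} trees: validation AUC = {auc_value:.3f}")
```

If the printed AUC values stop changing beyond a given tree count, adding further trees mainly adds computation time.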

Evaluating model performance

Model accuracy

The Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC) are used to comprehensively evaluate the classification performance of the model45. The ROC curve is plotted from the True Positive Rate and False Positive Rate of the model at different thresholds. The AUC, which ranges from 0 to 1, is calculated from the ROC curve. Generally, an AUC value greater than 0.7 indicates good discriminatory ability and an accurate model46.
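
For readers who want to see how the curve and its area are obtained, here is a small NumPy sketch. The function names are hypothetical and the labels and scores are toy values. It sweeps the decision threshold over the predicted scores, records the True Positive Rate and False Positive Rate at each step, and integrates with the trapezoidal rule.

```python
import numpy as np

def roc_curve_points(y_true, y_score):
    """Return FPR and TPR arrays obtained by sweeping the threshold over all scores."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    order = np.argsort(-y_score, kind="stable")
    y_sorted, s_sorted = y_true[order], y_score[order]
    # Keep only the last index of each run of tied scores
    last_of_run = np.r_[np.nonzero(np.diff(s_sorted))[0], len(s_sorted) - 1]
    tp = np.cumsum(y_sorted == 1)[last_of_run]
    fp = np.cumsum(y_sorted == 0)[last_of_run]
    tpr = np.r_[0.0, tp / tp[-1]]  # requires at least one positive sample
    fpr = np.r_[0.0, fp / fp[-1]]  # requires at least one negative sample
    return fpr, tpr

def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

# Toy labels (1 = landslide) and predicted probabilities
fpr, tpr = roc_curve_points([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc(fpr, tpr))  # 0.75
```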

Model reliability

A calibration curve (reliability diagram) is frequently used to assess model uncertainty47. The Brier score (the mean squared error between predicted probabilities and observed outcomes), the Expected Calibration Error (ECE) and the Maximum Calibration Error (MCE) are three important indicators in model reliability evaluation48. They measure the gap between the predicted probability and the observed frequency. The lower the value of these indicators, the better the model’s reliability.
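
The three indicators can be computed as in the following sketch. The use of ten equal-width probability bins is an assumption (the paper does not report the binning scheme), and the function names are illustrative. ECE is the sample-weighted mean of the per-bin gap between observed landslide frequency and mean predicted probability; MCE is the largest such gap.

```python
import numpy as np

def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    return float(np.mean((y_prob - y_true) ** 2))

def calibration_errors(y_true, y_prob, n_bins=10):
    """Return (ECE, MCE) using equal-width probability bins."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges only, so a probability of exactly 1.0 lands in the last bin
    bin_ids = np.clip(np.digitize(y_prob, edges[1:-1]), 0, n_bins - 1)
    ece, mce = 0.0, 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue  # empty bins contribute nothing
        gap = abs(y_true[mask].mean() - y_prob[mask].mean())
        ece += mask.mean() * gap   # weight by the share of samples in the bin
        mce = max(mce, gap)
    return float(ece), float(mce)

# Toy example
y_true = [0, 0, 1, 1]
y_prob = [0.15, 0.25, 0.75, 0.85]
print(brier_score(y_true, y_prob), calibration_errors(y_true, y_prob))
```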

Quantifying the importance of environmental factors

To quantify the importance of the environmental factors, this study employed a dual-approach analysis. First, we used the Gini Impurity metric, an intrinsic feature importance method of the RF model, to conduct a global assessment49. This method calculates a feature’s overall importance by measuring the average reduction in node impurity it provides when used for splitting, aggregated across all trees in the forest. This allowed us to identify the dominant factors influencing landslide susceptibility across the entire study area. Second, to compare the controlling factors within the distinct sub-regions, we utilized Permutation Importance50. This method assesses a feature’s significance by measuring the decrease in model performance when its values are randomly shuffled. A substantial drop in prediction accuracy indicates that the feature is critical for the specific sub-region. By training the same RF model on data from different sub-regions, we were able to conduct a fine-grained analysis of the varying factor importances. This dual approach allowed us to analyze factor importance at both a macro- and micro-level, providing a more robust and nuanced understanding of the landslide-driving mechanisms.
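
The two importance measures can be obtained as sketched below, assuming a scikit-learn workflow and synthetic data. The number of permutation repeats, the default accuracy scoring, and the use of a held-out set for the permutation step are assumptions; the paper does not report these settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with 13 factors (NOT the study data)
X, y = make_classification(n_samples=1500, n_features=13, n_informative=5,
                           random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

rf = RandomForestClassifier(n_estimators=100, max_depth=8, min_samples_split=10,
                            min_samples_leaf=5, max_features="sqrt", random_state=42)
rf.fit(X_train, y_train)

# Global view: mean decrease in Gini impurity, normalised to sum to 1
gini_importance = rf.feature_importances_

# Sub-region view: drop in score when one factor's values are shuffled
perm = permutation_importance(rf, X_val, y_val, n_repeats=10, random_state=42)
perm_importance = perm.importances_mean

# Factor indices ordered from most to least important
ranking = np.argsort(perm_importance)[::-1]
```

For the sub-regional comparison, the same steps would be repeated on the samples falling inside each sub-region. Because the Gini values are normalised while the permutation values are raw score drops, magnitudes from the two methods are not directly comparable.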

Results

Inter-factor correlation analysis

Fig. 4

Plot of correlation coefficients between factors.

To assess the potential for multicollinearity among the continuous environmental factors (DEM, Slope, Relief, TWI, Curvature, River, Rain, and FVC), Pearson’s Correlation Coefficient (PCC) was used in this study. The correlation matrix is shown in Fig. 4. Previous works show that PCC > 0.7 can be used as a threshold to indicate a high degree of covariance between variables51. As shown in Fig. 4, the highest PCC in our study area is 0.49 and the mean absolute PCC is 0.17, both of which are far lower than 0.7. Therefore, all 13 influencing factors described above were retained to establish the landslide susceptibility model. The categorical factors (Distance to Fault, Aspect, Geology, Landcover, and TPI) were not included in this analysis.
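
A screening of this kind can be reproduced with a few lines of NumPy. The sketch below uses random synthetic values in place of the sampled factor values, so its numbers do not correspond to Fig. 4; only the factor names and the 0.7 threshold come from the text.

```python
import numpy as np

factor_names = ["DEM", "Slope", "Relief", "TWI", "Curvature", "River", "Rain", "FVC"]

# Synthetic stand-in: rows are sample points, columns are factors (NOT the study data)
rng = np.random.default_rng(42)
samples = rng.normal(size=(500, len(factor_names)))

# Pearson correlation matrix between columns
pcc = np.corrcoef(samples, rowvar=False)

# Summaries over the upper triangle (each factor pair counted once)
upper = np.triu_indices(len(factor_names), k=1)
off_diag = np.abs(pcc[upper])
max_abs_pcc, mean_abs_pcc = off_diag.max(), off_diag.mean()

# Pairs exceeding the 0.7 threshold would be candidates for removal
collinear_pairs = [(factor_names[i], factor_names[j])
                   for i, j in zip(*upper) if abs(pcc[i, j]) > 0.7]
print(max_abs_pcc, mean_abs_pcc, collinear_pairs)
```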

Model performances and reliability

Fig. 5

ROC curves and AUC values.

The performance of our Random Forest model was evaluated using both accuracy and reliability metrics. The Receiver Operating Characteristic (ROC) curve and its corresponding Area Under the Curve (AUC) value were used to assess the model’s predictive accuracy. As shown in Fig. 5, the AUC values of the training dataset and the validation dataset were 0.93 and 0.85, respectively. While a slight performance difference exists between the two datasets, the model maintains a high discriminant efficiency, confirming its strong prediction ability and capacity to effectively distinguish between landslide-prone and non-prone areas.

Furthermore, we evaluated the model’s reliability using the Calibration curve to assess the alignment between predicted probabilities and actual outcomes. A perfect Calibration curve lies on the 45-degree line of the plot, indicating a perfect match between predicted and actual probabilities. As shown in Fig. 6, our model exhibits high reliability, as evidenced by low values for key metrics: a Brier score of 0.1589, an Expected Calibration Error (ECE) of 0.0572, and a Maximum Calibration Error (MCE) of 0.1178. These results collectively demonstrate the model’s robustness and its suitability for landslide susceptibility assessment.

Fig. 6

Calibration curve.

The landslide susceptibility map and factor importance

To identify the key environmental factors that significantly affect the landslide susceptibility model, we computed the importance of each evaluation factor (Fig. 7). Slope (0.31), Relief (0.18) and TWI (0.11) are the top three environmental factors contributing to landslide susceptibility across the entire study area.

Fig. 7

Assessment of the importance of each factor’s characteristics.

By applying the trained RF model to the entire study area, we generated a landslide susceptibility map of the study area. With the Jenks Natural Breaks Classification Method, the results were categorized into five categories: Very Low (< 0.13), Low [0.13–0.31), Medium [0.31–0.50), High [0.50–0.69), and Very High (> 0.69) susceptibility (Fig. 8). The Very High and High susceptibility categories account for a combined 31.87% of the study area (14.36% and 17.51%, respectively). These high-risk areas are mainly distributed in mountainous areas with high elevation, high annual precipitation, and steep slope, spatially correlating well with the distribution of the landslide inventory. In contrast, the Low and Very Low susceptibility zones, which comprise 50.97% of the study area, are typically found in relatively flatter terrain dominated by farmland, paddy land, and urban areas.
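
Given the break values reported above, assigning classes and computing area shares is straightforward. The sketch below takes the four breaks from the text as fixed inputs (it does not recompute the Jenks optimisation) and assumes equal-area raster cells with no-data cells already removed; the function names are illustrative.

```python
import numpy as np

CLASS_NAMES = ["Very Low", "Low", "Medium", "High", "Very High"]
BREAKS = [0.13, 0.31, 0.50, 0.69]  # class boundaries reported in the text

def classify(prob):
    """Map susceptibility values to class indices 0..4 (lower bound inclusive)."""
    return np.digitize(prob, BREAKS)

def area_share(prob):
    """Percentage of cells in each class, assuming equal-area cells."""
    classes = classify(np.asarray(prob, dtype=float).ravel())
    counts = np.bincount(classes, minlength=len(CLASS_NAMES))
    return dict(zip(CLASS_NAMES, 100.0 * counts / counts.sum()))

# Toy values (NOT the study raster)
print(classify(np.array([0.05, 0.13, 0.40, 0.50, 0.90])))  # [0 1 2 3 4]
print(area_share([0.05, 0.20, 0.40, 0.60, 0.90]))
```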

Fig. 8

Spatial distribution of landslide susceptibility evaluation results. The northern and southern rectangular boxes in the figure represent the two typical sub-regions selected for this study. (Note: This map was generated using QGIS 3.10 (ref. 29) with a base map from Tianditu (ref. 44)).

Comparison of the South and North sub-regions

Our results show that areas of high and very high susceptibility are mainly located at the northern and southern ends of the study area (rectangular boxes in Fig. 8). The Permutation Importance results (Fig. 9) confirm that the dominant factors differ between these two areas. In the northern sub-region, the top three factors are Slope (0.091), Relief (0.040) and Curvature (0.033). This highlights that topography remains the primary driver of landslide susceptibility in this section. However, in the southern sub-region, while Slope (0.135) remains the most important factor, Rain (0.026) ranks as the third most significant factor after Curvature (0.031). The increased importance of precipitation in the south is likely due to its proximity to the coast, which leads to higher annual rainfall and more frequent typhoon events. This comparative analysis demonstrates the segment-specific controls on slope instability and the need for tailored mitigation strategies.

Fig. 9

Assessment of Permutation Importance in the sub-regions: (a) the northern sub-region; (b) the southern sub-region.

Discussion

This study provides a comprehensive landslide susceptibility assessment along the Hefei-Fuzhou High-Speed Railway, a critical infrastructure corridor in southeastern China, using an RF model. The discussion interprets our findings in three key areas: the regional variability of landslide-controlling factors, the robustness of our model, and a critical analysis of its limitations for future research.

Regional variability of landslide susceptibility

A primary contribution of this study is the identification of distinct regional drivers of landslide susceptibility along the railway corridor, which for the first time provides a targeted hazard assessment for this key transportation artery. While topographic factors are universally important, their specific nature and interaction with hydroclimatic conditions vary significantly between the northern and southern sections. This finding underscores the necessity of considering regional environmental characteristics when assessing risks along linear infrastructure that traverses diverse geographic zones.

In the mountainous and hilly terrain of the northern section, the most influential factors are Slope, Relief, and Curvature, which collectively point to a topography-dominated landslide regime. The high susceptibility in this region is primarily driven by steep slopes that generate greater shear stress, and high relief which is indicative of vigorous geomorphic processes and active uplift. The strong correlation with curvature, especially in areas of high relief, suggests that local terrain morphology is key to stress redistribution and the concentration of both surface and subsurface water flow, which ultimately controls slope stability. This aligns with previous studies in similar mountainous terrains where landslides are controlled by stress and structural configurations.

In contrast, the southern section, which lies in Fujian Province, is governed by Slope, Curvature, and Rain. The heightened importance of precipitation in this sub-region highlights a hydroclimatic-driven landslide mechanism, consistent with studies identifying intense, short-duration rainfall as the primary trigger for catastrophic landslides in Fujian52. This difference is rooted in the region’s distinct geological and climatic setting. The southern sub-region is characterized by a warm, humid climate with abundant rainfall and frequent typhoons, which significantly exacerbates slope instability. Intense rainfall events facilitate water infiltration into fractured bedrock and deep weathering layers, leading to elevated pore water pressure that drastically reduces the shear strength of the slope material, particularly along weak lithological layers53.

In addition, the significant differences between the northern and southern sub-regions are further explained by their unique lithology and land use patterns. The northern sub-region is underlain by Proterozoic metamorphic rocks (such as schist and gneiss) and Mesozoic Yanshanian granites, which weather into strong, coarse-grained soils. In such environments, topographic factors become the primary control on slope stability. Conversely, the southern sub-region is dominated by red sedimentary beds, which weather into clay-rich, expansive soils. This lithology, combined with heavy rainfall, creates ideal conditions for the development of high pore water pressure above weak and slippery layers, leading to rapid destabilization. Land use practices amplify these disparities. The northern sub-region benefits from well-preserved forests, which provide root reinforcement, while the southern sub-region, with its dense population, experiences intensive agricultural use and deforestation, which disrupts soil structure and increases infiltration capacity, thereby exacerbating landslide susceptibility.

Model reliability and uncertainty analysis

The robustness of our findings is supported by a rigorous data and modeling methodology. The RF model’s high predictive accuracy (AUC = 0.85) and reliability (low Brier score, ECE, and MCE) confirm its suitability for this analysis. Crucially, we extended our reliability assessment by analyzing the model’s predictive standard deviation (SD), which provides a quantitative measure of uncertainty54,55.

As shown in Fig. 10, the spatial distribution of SD reveals that the model’s predictions are highly confident across most of the study area, with over 90% of the region showing an SD of less than 0.3. This low uncertainty indicates a high level of consensus among the individual decision trees within the forest, suggesting that the selected environmental factors effectively and unambiguously distinguish between stable and unstable slope conditions56. It also indirectly reflects the high quality and representativeness of our training inventory57, which sufficiently captures the diverse environmental conditions along the railway corridor.

However, areas with higher uncertainty (SD > 0.3) are primarily located within transition zones between different susceptibility classes (e.g., between “low” and “high” susceptibility) and in regions with complex geological conditions. Within these transition zones, the values of environmental factors fall near the decision boundaries of many trees, leading to lower inter-tree consensus and, consequently, higher SD. This finding is critical for practical application. For hazard management, the SD map acts as a vital prioritization tool58. Areas classified as “high susceptibility” with “low uncertainty” should be assigned the highest priority for immediate mitigation measures. In contrast, areas with both “high susceptibility” and “high uncertainty” require further field investigation or expert evaluation before committing significant resources59. This approach transforms a static susceptibility map into a more dynamic and risk-informed decision-support system, ensuring the efficient allocation of limited resources.
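
One way to obtain such an SD map and the resulting priority classes is sketched below. It assumes a scikit-learn forest, takes SD as the standard deviation of the per-tree landslide probabilities (the paper does not detail how SD was computed), and uses synthetic data. The 0.50 susceptibility cut-off corresponds to the lower bound of the "High" class and 0.3 is the SD threshold mentioned above; the priority labels are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (NOT the study data)
X, y = make_classification(n_samples=1500, n_features=13, n_informative=5,
                           random_state=42)
rf = RandomForestClassifier(n_estimators=100, max_depth=8, min_samples_split=10,
                            min_samples_leaf=5, max_features="sqrt",
                            random_state=42).fit(X, y)

# Per-tree landslide probability for every sample: shape (n_trees, n_samples)
per_tree = np.stack([tree.predict_proba(X)[:, 1] for tree in rf.estimators_])
mean_prob = per_tree.mean(axis=0)  # susceptibility (matches the forest's own output)
sd = per_tree.std(axis=0)          # disagreement between trees

high_susceptibility = mean_prob >= 0.50
high_uncertainty = sd > 0.3

# Combine the two maps into management priorities
priority = np.where(high_susceptibility & ~high_uncertainty, "mitigate first",
           np.where(high_susceptibility & high_uncertainty, "field check", "routine"))
```

Applied to the full raster of factor values instead of `X`, the same arrays would yield the susceptibility map, the SD map, and a combined priority map.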

Fig. 10

Spatial distribution of Standard Deviation (SD) (Note: This map was generated using QGIS 3.10 (ref. 29) with a base map from Tianditu (ref. 44)).

Limitations and future research

While this study provides valuable insights, we acknowledge several limitations that offer avenues for future research. First, while the 13 factors used in our model represent a comprehensive suite of common landslide triggers, region-specific phenomena may have been overlooked. For instance, future models could incorporate variables like wind speed and typhoon tracks to better capture the localized dynamics in the southern sub-region60. Second, this study focuses on static susceptibility mapping. A more advanced approach would be to incorporate active slope deformation data, such as from InSAR, to enable a transition from static susceptibility maps to dynamic hazard warnings. This would provide more effective support for risk management and emergency response along the Hefei-Fuzhou High-Speed Railway.

Conclusion

This study successfully developed and validated a Random Forest model to map landslide susceptibility along the Hefei-Fuzhou High-Speed Railway, a critical transportation artery in southeastern China. The model demonstrated high predictive accuracy, providing a robust quantitative basis for landslide risk management.

Our findings reveal two major implications for hazard assessment of linear infrastructure. First, while topographic factors (Slope, Relief, and TWI) are the dominant drivers of landslide susceptibility across the entire corridor, their influence varies significantly between sub-regions. We found a clear distinction between a topography-dominated landslide regime in the northern section and a hydroclimatic-driven regime in the southern section, where rainfall emerges as a critical factor. This pronounced regional variability underscores the necessity of moving beyond a uniform approach and adopting tailored, location-specific strategies for disaster prevention.

Second, this research provides not only a crucial susceptibility map to guide hazard mitigation but also a vital methodological framework. By demonstrating how machine learning can be used to identify and explain complex regional differences in landslide drivers, our work offers a valuable scientific basis for assessing geological hazards for similar large-scale projects in diverse environments. The study’s systematic analysis of model uncertainty, using the standard deviation of RF predictions, provides a practical tool for prioritizing mitigation efforts and allocating resources effectively.

Despite its contributions, the study has certain limitations. The model’s accuracy is affected by uncertainties in multi-source data and data gaps in some coastal areas. Future research could enhance predictive performance and regional applicability by integrating more precise remote sensing data and incorporating dynamic factors such as slope deformation monitoring to enable a shift from static susceptibility analysis to a real-time early warning system for slope instability.