Introduction

Slope stability analysis represents one of the most critical challenges in modern transportation infrastructure engineering, particularly for high road embankments where failure can result in catastrophic consequences, including loss of life, property damage, and significant economic losses. The factor of safety (FOS) serves as a fundamental parameter in assessing slope stability and is traditionally calculated through classical methods such as the limit equilibrium approach introduced by Fellenius2 and later refined by Bishop3 using slip circle analysis. However, these conventional analytical methods often fail to fully capture the complex, non-linear relationships inherent in geotechnical systems, particularly when dealing with heterogeneous soil conditions and varying environmental factors. Similarly, numerical approaches, including the finite element method (FEM), offer comprehensive stress-strain characterization but demand substantial computational resources and specialized expertise, creating practical limitations for large-scale applications4,5. The advent of machine learning (ML) techniques has revolutionized slope stability prediction, offering superior capabilities to model complex patterns and relationships in geotechnical data. Recent studies have demonstrated the effectiveness of various ML approaches, including artificial neural networks6,7,8, support vector machines9, evolutionary-optimized deep learning models10, and random forest models11,12 in predicting slope stability with remarkable accuracy. Among these techniques, random forest has gained particular attention due to its robust performance, ability to handle non-linear relationships, and resistance to overfitting13,14. Despite the success of traditional random forest models, the unique characteristics of geotechnical data present specific challenges that require advanced modeling approaches.
High road embankments often exhibit clustered or hierarchical data structures due to spatial correlations, varying geological conditions, and construction phases, which violate the independence assumptions of conventional ML methods. Mixed Effects Random Forest (MERF) addresses these limitations by incorporating both fixed and random effects, making it particularly suitable for clustered geotechnical data15,16. The performance of machine learning models is heavily dependent on hyperparameter optimization, which significantly impacts prediction accuracy and model generalization17,18. Traditional hyperparameter tuning methods often suffer from computational inefficiency and can become trapped in local optima. Metaheuristic optimization algorithms have emerged as powerful alternatives for hyperparameter optimization, with the Artificial Bee Colony (ABC) algorithm demonstrating exceptional performance in various optimization problems19,20. Recent research has highlighted the importance of addressing data quality issues, particularly outliers, which can significantly impact model performance in geotechnical applications21,22,23. The inherent variability in geotechnical parameters and measurement uncertainties in slope stability datasets accentuates the influence of outliers24. Furthermore, data normalization has been identified as a critical preprocessing step that can substantially improve machine learning algorithm accuracy25. The integration of multiple optimization techniques with advanced machine learning models has shown promising results in geotechnical engineering applications. Hybrid approaches combining particle swarm optimization with neural networks26, evolutionary optimization techniques27, and metaheuristic-ML integration28 have demonstrated superior performance compared to standalone methods. 
Recent studies have further validated the effectiveness of hybrid optimization approaches, such as the Sparrow Search Algorithm and Harris Hawk Optimization with random forest models29. Despite these advancements, there remains a significant gap in the literature regarding the application of Mixed Effects Random Forest optimized by the Artificial Bee Colony algorithm specifically for factor of safety prediction in high road embankments. The complex interaction between clustered geotechnical data, the need for robust hyperparameter optimization, and the critical importance of accurate FOS prediction in high-stakes infrastructure projects necessitates a comprehensive hybrid framework that addresses these challenges simultaneously. While recent studies have demonstrated the efficacy of models like Gaussian Process Regression (GPR) and advanced neural networks for slope stability prediction30,31, these approaches typically assume independent and identically distributed (i.i.d.) data. Geotechnical data from embankments, however, is inherently hierarchical and clustered due to spatial correlations, construction phases, and varying geological strata. Models that ignore this cluster-induced correlation risk biased predictions and inflated Type I errors. The Mixed Effects Random Forest (MERF) framework is uniquely suited for this challenge as it explicitly models both fixed (global) effects and random (cluster-specific) effects, a capability that standard GPR or ANN architectures lack15. This makes MERF, and by extension its optimized variant ABC-MERF, a more statistically sound and appropriate choice for modeling clustered geotechnical data.

This study presents a novel hybrid machine learning framework that integrates Mixed Effects Random Forest with Artificial Bee Colony optimization for enhanced factor of safety prediction in high road embankments. The proposed approach leverages the clustering capabilities of MERF to handle spatial correlations in geotechnical data while utilizing ABC optimization for efficient hyperparameter tuning. The model assessment employs multiple complementary metrics, including Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and coefficient of determination (R2), with statistical significance validation through Wilcoxon signed-rank tests to establish performance differentials. The framework incorporates comprehensive feature analysis through seven critical embankment parameters: height, slope inclination, soil cohesion, internal friction angle, unit weight, moisture content, and compaction degree. These parameters represent the primary geotechnical factors governing stability performance. Additionally, the framework incorporates comprehensive data preprocessing, including outlier detection and normalization, to ensure optimal model performance. Through extensive validation on real-world datasets, this research aims to advance the state-of-the-art in slope stability prediction and provide practitioners with a robust tool for risk assessment in critical infrastructure projects. A rigorous model comparison protocol implements equivalent optimization effort across all modeling approaches to ensure unbiased performance evaluation and statistical validity of comparative assessments. Enhanced engineering interpretability integrates feature importance analysis and partial dependence visualization for transparent engineering decision support and model explanation.

Data production and preprocessing

Finite element dataset development

This investigation presents a comprehensive computational dataset for factor of safety (FOS) analysis developed through systematic finite element method (FEM) numerical simulations.

Geometric and material parameterization

The dataset encompasses a parametric analysis of road embankment configurations across five discrete height intervals—6, 12, 18, 24, and 30 m—designed to systematically evaluate geotechnical stability performance under diverse geometric conditions. Each numerical model incorporated diverse slope geometries and employed a range of geotechnical materials for the embankment, as systematically presented in Fig. 1.

Fig. 1

Typical embankment geometric configurations illustrating height and slope variations used in the parametric study.

Table 1 presents the systematic classification of slope geometries (SGi) for embankment configurations across five height categories ranging from 6 to 30 m. These geometric configurations are quantified using vertical-to-horizontal (V:H) dimensional ratios, where V represents the embankment height and H denotes the corresponding horizontal projection, as illustrated in Fig. 1. The V:H ratio serves as the primary indicator of slope inclination, with lower ratios corresponding to steeper slope configurations.

For the 6-m height category, four distinct slope geometries are analyzed: SG1 (2:1), SG2 (1.5:1), SG3 (1.2:1), and SG4 (1:1), representing progressively steeper inclinations. The 12-m embankment analysis encompasses eight geometric variations (SG1 through SG8), spanning from a gentle 2:1 slope to a steep 1:1 configuration, with advanced geometries such as SG7 (1.5:1 Berm 1.5:1) incorporating strategically positioned intermediate berms for enhanced stability.

Table 1 Embankment slope configurations and berm arrangements for heights of 6, 12, 18, 24, and 30 m.

The parametric investigation encompassed the systematic evaluation of diverse slope configurations, incorporating varying slope inclinations and the strategic placement of berms (horizontal benches) at specified elevations. Standard berm configurations were designed with a consistent width of 2.0 m, adhering to established engineering practice for high embankment designs. The incorporation of berms serves multiple stability enhancement functions: interruption of continuous potential failure surfaces, local reduction of slope gradients, and improvement of overall stability performance, particularly for elevated embankments where sustained steep inclinations may compromise geotechnical stability.

Table 2 presents the laboratory-determined physical and mechanical properties of soil materials employed for embankment modeling. These geotechnical parameters were obtained through standardized testing protocols in accordance with established geotechnical engineering standards and represent characteristic values for conventional road embankment construction materials. The systematic study integrates twelve embankment material variants (S1–S12) with varying unit weights (18–23 kN/m3), moisture contents (7–20%), deformation moduli (10–50 MPa), cohesion values (2–40 kPa), friction angles (25–40°), and Poisson’s ratios (0.25–0.30). The parametric study employs three distinct foundation soil classifications (SF1, SF2, and SF3) to represent the spectrum of subgrade conditions encountered in embankment construction. Foundation soil SF1 characterizes high-strength conditions with a CBR of 15.0%, a deformation modulus of 38.0 MPa, and a friction angle of 40.0°, representing well-compacted granular materials in favorable construction environments. SF2 exhibits moderate strength properties, including a CBR of 5.0%, a deformation modulus of 18.5 MPa, and a friction angle of 30.0°, corresponding to typical subgrade conditions in standard highway projects. Foundation soil SF3 represents weak subgrade scenarios with the lowest bearing capacity (CBR 3.0%), deformation modulus of 13.27 MPa, and friction angle of 20.0°, characteristic of soft clays requiring special design consideration. The cohesion values progressively increase from SF1 (10.0 kPa) through SF2 (20.0 kPa) to SF3 (30.0 kPa), reflecting the transition from granular to cohesive soil behavior. Unit weights range from 18.0 to 22.0 kN/m3 across the three classifications, representing realistic density variations in foundation materials. This systematic variation ensures comprehensive coverage of subgrade conditions from high-strength granular foundations to weak compressible soils.
The selected parameter ranges provide a robust foundation for parametric analysis and machine learning model development across diverse geotechnical scenarios.

Table 2 Physical and mechanical properties of embankment soil classifications (S1–S12).

Furthermore, a uniformly distributed load of 14 kPa was applied to simulate standard pavement loading conditions in accordance with EN 1991-2 specifications (Eurocode 1, 2018), representing typical traffic loading over a 25 m embankment section.

Numerical modeling

Factor of safety calculations were performed using finite element analysis implemented in GeoStudio software (version 2024), with the SLOPE/W module utilized for stability assessment based on the Mohr-Coulomb failure criterion. Within the GeoStudio SLOPE/W computational framework, the factor of safety (FOS) is determined through comparative analysis of the mobilizable shear strength capacity against the shear stress demand required to maintain equilibrium conditions along the critical failure surface. Model boundaries were strategically dimensioned to eliminate potential boundary effects at the base and lateral boundaries, ensuring unrestricted soil deformation under applied loading conditions and facilitating natural development of critical failure mechanisms without computational artifacts. Boundary conditions were strategically implemented to ensure realistic numerical simulation while preserving computational stability and convergence. Fixed displacement constraints were imposed at the model base to restrict vertical and horizontal movement, while lateral boundaries were constrained against horizontal displacement along the X-axis. This boundary configuration ensures appropriate stress transfer mechanisms and prevents non-physical deformation patterns that could adversely affect the accuracy of the finite element solution. To ensure mesh-independent results, a systematic convergence study was performed using a representative 18 m embankment model with foundation soil SF2, evaluating three mesh densities: coarse mesh with 1.0 m elements comprising 2450 elements, medium mesh with 0.5 m elements comprising 4890 elements, and fine mesh with 0.25 m elements comprising 9780 elements. The factor of safety convergence results demonstrated progressive refinement with coarse mesh yielding FOS = 1.245, medium mesh producing FOS = 1.238, and fine mesh achieving FOS = 1.237. 
The negligible difference between medium and fine mesh results (less than 0.1%) confirmed numerical convergence, leading to the selection of 0.5 m quadrilateral and triangular elements as the optimal size for balancing computational accuracy with efficiency, as depicted in Fig. 2. This mesh density was consistently applied across all 1176 finite element simulations to ensure reliable and computationally feasible analysis throughout the parametric study.
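The convergence check described above amounts to comparing the relative change in FOS between successive refinements against a tolerance; a short illustrative script, using the FOS values reported for the 18 m / SF2 study, is:

```python
# Mesh convergence check: the change in FOS between successive refinements
# must fall below a tolerance (0.1% here) before a mesh density is accepted.
# FOS values are those reported for the 18 m embankment on foundation SF2.
fos_by_mesh = [("coarse, 1.0 m", 1.245), ("medium, 0.5 m", 1.238), ("fine, 0.25 m", 1.237)]

def relative_change_pct(previous: float, current: float) -> float:
    """Absolute percent change between two successive FOS estimates."""
    return abs(current - previous) / previous * 100.0

coarse_to_medium = relative_change_pct(fos_by_mesh[0][1], fos_by_mesh[1][1])
medium_to_fine = relative_change_pct(fos_by_mesh[1][1], fos_by_mesh[2][1])

print(f"coarse->medium: {coarse_to_medium:.2f}%")  # above the 0.1% tolerance
print(f"medium->fine:  {medium_to_fine:.2f}%")     # below it: 0.5 m mesh accepted
```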

The critical slip surface identification is accomplished through a comprehensive global search algorithm that systematically evaluates potential failure mechanisms throughout the slope domain. The resulting FOS serves as the fundamental quantitative metric for assessing slope stability performance, with values below unity indicating potential instability and values significantly above unity representing adequate safety margins under the prescribed loading conditions.

Fig. 2

Finite element mesh configuration for the embankment model.

Comparison and validation

A comprehensive comparison of parameter ranges between the synthetic dataset developed in this study and corresponding ranges reported in literature and real case studies demonstrates the representative nature of the computational framework. The embankment height parameter in the present study spans 6–30 m, which aligns well with literature ranges reported by Mesa-Lavista et al.1 and encompasses typical highway embankment heights of 8–25 m encountered in practice. This height range ensures comprehensive coverage of both moderate and high embankment scenarios commonly encountered in transportation infrastructure projects. The cohesion values in the synthetic dataset range from 2 to 40 kPa, effectively covering the broader literature range of 0–50 kPa as documented by Lin et al.13. The study parameters encompass typical fill materials with cohesion values of 5–35 kPa, ensuring representation of both cohesionless granular materials and moderately cohesive soils commonly used in embankment construction. The friction angle parameter spans 25–40° in the present investigation, which falls within the literature range of 13–51° reported by Karir et al.32 and effectively covers granular fills with typical friction angles of 28–42°. This range represents the spectrum from fine-grained soils with moderate friction resistance to well-graded granular materials with high internal friction characteristics. California Bearing Ratio (CBR) values in the synthetic dataset range from 3 to 15%, aligning with the literature range of 2 to 20% documented by Duncan et al.33. The selected range effectively represents subgrade soil conditions from 2 to 18% CBR, covering weak to moderately strong foundation conditions typically encountered in embankment construction projects. 
Slope ratio configurations span from 1:1 to 2:1 (V:H) in the present study, which corresponds to standard engineering practice ranges of 1:1 to 3:1 and encompasses highway slope configurations typically ranging from 1.5:1 to 2:1. This geometric parameter space represents the full spectrum, from steep slopes requiring careful stability analysis to gentler configurations commonly used in standard highway designs. The synthetic parameter space covers an estimated 95% of typical real-world embankment conditions, supporting broad model applicability to practical engineering scenarios. The systematic factorial design ensures that correlations between parameters are naturally preserved as they occur in practice, thereby maintaining realistic parameter interactions and enhancing the dataset’s engineering relevance for slope stability predictions.

To establish the validity and reliability of the numerical modeling framework, a comprehensive validation analysis was performed utilizing parameters identical to those documented by Mesa-Lavista et al.1. The validation configuration employed a 12-m embankment height with foundation soil classification SF1 (CBR = 15%), maintaining strict adherence to the benchmark study’s experimental specifications. The finite element model precisely reproduced the geotechnical material properties, including cohesive strength and internal friction angle parameters, alongside the geometric design configurations encompassing slope inclinations and berm placement arrangements, as established in the reference investigation.

Fig. 3

Comparison of FOS results between the present study and Mesa-Lavista et al.1

Figure 3 presents a systematic comparative assessment of factor of safety (FOS) values between the current investigation and the benchmark study of Mesa-Lavista et al.1 across twelve distinct embankment material variants (S1 through S12). The comparative analysis reveals exceptional agreement between the two methodological approaches, with FOS deviations constrained within ± 0.1, corresponding to approximately ± 5% relative error within the investigated FOS range of 1.4 to 2.5. These marginal discrepancies fall well within acceptable tolerances for geotechnical finite element analysis, as established by recognized engineering standards and peer-reviewed literature. The strong concordance between computational results validates the accuracy and robustness of the finite element modeling framework employed throughout this investigation. This validation establishes high confidence in the integrity of the comprehensive numerical dataset comprising 1176 simulations, thereby providing a reliable foundation for subsequent machine learning model development and training phases. The successful benchmarking against published reference results demonstrates that the implemented numerical framework possesses the requisite accuracy to generate dependable predictions suitable for advanced computational modeling and practical geotechnical engineering applications.

Additional benchmarking was performed using three published studies with varying embankment heights, soil properties, and boundary conditions, demonstrating the robustness and accuracy of the proposed computational framework across diverse geotechnical scenarios. The validation encompasses three distinct case studies representing different embankment configurations and soil conditions. The first validation case, based on Bandara et al.34, examines an 18-m embankment constructed with unsaturated clay characterized by a cohesion of 15 kPa, a friction angle of 28°, and a unit weight of 19 kN/m3 on a rock foundation. The literature reported a factor of safety (FOS) of 1.28, while the present study predicted 1.32, yielding an acceptable error of + 3.1%. This close agreement validates the model’s capability to handle unsaturated soil conditions and rock foundation interfaces. The second validation case, based on Huang et al. (2023), examines a 24-m embankment featuring layered soil conditions: the upper layers have a cohesion of 8 kPa and a friction angle of 32°, while the lower layers have a cohesion of 12 kPa and a friction angle of 35°, all under pore water pressure conditions. The literature FOS value of 1.18 compares favorably with the predicted value of 1.15, resulting in an error of − 2.5%. This validation demonstrates the framework’s effectiveness in modeling complex layered soil profiles and groundwater effects on slope stability. The third validation case, based on Duncan et al.33, examines a 30-m embankment constructed with cohesive soil having a cohesion of 25 kPa, a friction angle of 30°, and a unit weight of 20 kN/m3 under different boundary constraint conditions. The literature FOS of 1.82 closely matches the predicted value of 1.85, with an error of + 1.6%. This case validates the model’s performance for high embankments with cohesive soil conditions and varying boundary constraints. 
Based on this multi-case validation summary, the comparison demonstrates excellent agreement between literature values and the present FOS calculations, with prediction errors ranging from − 2.5% to + 3.1%. The consistently low error margins across diverse geotechnical conditions, embankment heights (18–30 m), and varying soil properties confirm the computational framework’s reliability and broad applicability to real-world slope stability problems. The validation results establish confidence in the model’s predictive capability for practical engineering applications involving complex embankment-foundation systems under various loading and boundary conditions.
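The quoted error percentages follow directly from the signed relative error between predicted and literature FOS values; a quick verification using the three benchmark cases above:

```python
# Signed relative error (%) between the present-study FOS predictions and
# the literature values for the three benchmark cases.
cases = {
    "Bandara, 18 m": (1.28, 1.32),   # (literature FOS, predicted FOS)
    "Huang, 24 m":   (1.18, 1.15),
    "Duncan, 30 m":  (1.82, 1.85),
}

errors = {name: (pred - lit) / lit * 100.0 for name, (lit, pred) in cases.items()}
for name, err in errors.items():
    print(f"{name}: {err:+.1f}%")   # +3.1%, -2.5%, +1.6%, as reported
```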

Data preprocessing

Inputs–output correlation

Input–output correlation in ML refers to the statistical relationship between input and output variables21,35. Understanding these correlations is essential for feature selection, as it helps identify which inputs are most relevant to predict the output, potentially improving model accuracy and reducing complexity. It aids in feature engineering by guiding the creation or transformation of features to capture underlying patterns better. Recognizing strong correlations can simplify models, making them more understandable and reducing overfitting risk. Additionally, correlations reveal multicollinearity issues, where inputs are highly correlated, which can affect some models’ performance and needs to be addressed during preprocessing36. This knowledge influences algorithm selection, as some models handle linear or nonlinear relationships differently. Overall, input-output correlation is crucial for building efficient models that make accurate predictions by focusing on the most relevant features and enhancing interpretability.

Table 3 provides comprehensive descriptive statistics for the dataset, presenting minimum, maximum, mean, skewness, kurtosis, and range values for all simulations and summarizing the key variables employed in the embankment stability analysis.

Table 3 Descriptive statistics for the dataset.

These variables encompass soil properties (unit weight, moisture, deformation modulus, cohesion, friction angle, and Poisson’s ratio) and geometric parameters (five slope angles and number of berms), collectively providing a robust framework for evaluating embankment stability and predicting soil behavior under diverse geometric conditions. The analysis reveals that FOS exhibits positive skewness (1.41) and positive kurtosis (3.21), indicating a right-tailed distribution with more extreme values, which is typical for geotechnical safety factor datasets.
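The statistics reported in Table 3 can be reproduced with pandas; the sketch below uses a small hypothetical FOS sample (not the study's 1176-point dataset) purely to illustrate the computation, including the right-skew check:

```python
import pandas as pd

# Hypothetical right-tailed FOS sample; the real dataset has 1176 simulations.
fos = pd.Series([1.1, 1.2, 1.25, 1.3, 1.4, 1.5, 1.8, 2.2, 2.5, 3.0], name="FOS")

summary = {
    "min": fos.min(),
    "max": fos.max(),
    "mean": fos.mean(),
    "skew": fos.skew(),        # > 0 indicates a right-tailed distribution
    "kurtosis": fos.kurt(),    # pandas reports excess (Fisher) kurtosis
    "range": fos.max() - fos.min(),
}
print({k: round(float(v), 3) for k, v in summary.items()})
```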

Fig. 4

Pearson correlation coefficients between geotechnical soil properties, slope geometric parameters, and factor of safety.

The correlation triangular heatmap in Fig. 4 offers a detailed visualization of the relationships between 13 variables influencing embankment stability, with the safety factor as the key output.

The variables include material properties such as cohesion (c), friction angle (φ), unit weight, moisture, and deformation modulus for embankment materials, alongside geometric factors like height, slopes (1–5), and number of berms. The color scale, ranging from deep blue (− 1.00) to dark red (1.00), effectively highlights correlation strength and direction, with yellow indicating near-zero values. The matrix demonstrates expected strong positive correlations among classical strength parameters, with the friction angle (φ) showing robust correlations with cohesion (r = 0.87) and the deformation modulus (r = 0.70), which aligns with fundamental soil mechanics principles and suggests the dataset represents cohesive soils where shear strength parameters exhibit interdependence. The observed strong correlations (e.g., between friction angle and cohesion) are a deliberate feature of the synthetic dataset, designed to reflect realistic interactions between soil properties that co-vary in natural materials (e.g., well-graded granular soils have high friction angles and low cohesion, while clays have higher cohesion and lower friction angles). While this could theoretically lead to overfitting, the use of tree-based models like RF and MERF, which are robust to multicollinearity, mitigates this risk. Furthermore, the exceptional performance on the held-out test set (30% of the data) and the validation against independent case studies from the literature confirm that the models have learned the underlying physical relationships rather than spurious synthetic patterns. The slope geometry analysis reveals intriguing interdependencies: consecutive slope angles (Slopes 1–5) show moderate to strong positive correlations ranging from 0.34 to 0.80, indicating geometric continuity consistent with multi-bench or terraced slope configurations rather than simple planar slopes.
The safety factor relationships provide particularly valuable insights, confirming theoretical expectations through strong positive correlations with friction angle (0.70) and cohesion (0.33), while the negative correlation with unit weight (− 0.36) appropriately reflects how increased soil weight reduces stability through higher driving forces. However, the negative correlations between Poisson’s ratio (ν) and several slope angles (− 0.50 to − 0.63) warrant further investigation, possibly indicating that materials with higher Poisson’s ratios require more conservative slope designs. The strong correlations between the number of berms and multiple slope angles (0.65–0.80) reflect the systematic pairing of berms with steeper, taller configurations in the parametric design. From a computational perspective, the clean correlation structure with minimal noise indicates well-controlled data generation, and the near-zero correlations with the height parameter confirm that controlled variables were used appropriately to isolate the effects of other parameters. Together, these characteristics make the matrix an effective tool for capturing the complex interdependencies in slope stability problems and provide a solid foundation for developing predictive models.
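A matrix like Fig. 4 is a standard Pearson correlation computation; the sketch below uses random stand-in columns (names follow the study's inputs, but the values do not come from the dataset), with the masked lower-triangle plot left as a comment to avoid a plotting dependency:

```python
import numpy as np
import pandas as pd

# Stand-in data: four of the study's variables with random values.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 4)),
                  columns=["cohesion", "friction_angle", "unit_weight", "FOS"])

corr = df.corr(method="pearson")   # symmetric, unit diagonal, entries in [-1, 1]
print(corr.round(2))

# A triangular heatmap as in Fig. 4 could then be drawn with, e.g., seaborn:
#   import seaborn as sns
#   mask = np.triu(np.ones_like(corr, dtype=bool), k=1)  # hide upper triangle
#   sns.heatmap(corr, mask=mask, cmap="RdYlBu_r", vmin=-1, vmax=1, annot=True)
```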

Outliers and normalization data

Effective outlier detection and treatment constitute a critical preprocessing step in machine learning applications. Proper identification and handling of outliers significantly enhance model performance and reliability22.

Fig. 5

Statistical distribution and outlier analysis of inputs using the interquartile range (IQR) method.

Figure 5 illustrates the outlier distribution in our dataset using boxplot visualization. Outliers were identified and treated using the interquartile range (IQR) method23. The capping technique was employed, wherein extreme outliers beyond the lower and upper limits were replaced with the respective boundary values. This approach preserves data integrity by maintaining the original dataset size, unlike trimming methods that reduce the sample count. The cleaned datasets were then normalized using min-max normalization, as expressed in Eq. (1) (Cabello-Solorzano et al.25), to enhance model performance, reduce variance, and ensure consistent scaling across features.

$${y}_{i,norm}=\frac{{y}_{i}-{y}_{min}}{{y}_{max}-{y}_{min}}$$
(1)

where yi,norm represents the normalized value, yi is the original value to be normalized, ymin is the minimum value of the feature in the dataset, and ymax is the maximum value.

Min-max normalization was selected over alternative methods like z-score standardization for two primary reasons. First, it bounds all features to an identical range [0, 1], which is beneficial for the distance-based calculations in the Artificial Bee Colony optimization algorithm. Second, tree-based models like Random Forest and MERF are invariant to monotonic transformations of the features; they do not require features to be centered or scaled to a unit variance. However, min-max scaling can still provide a slight performance benefit for the hyperparameter optimization process itself. The chosen method ensured consistent scaling without altering the underlying distribution of the data37.
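A minimal sketch of the two preprocessing steps, IQR capping followed by Eq. (1) min-max scaling; the cohesion values here are illustrative, not taken from the dataset:

```python
import numpy as np

def iqr_cap(x: np.ndarray, k: float = 1.5) -> np.ndarray:
    """Cap values outside [Q1 - k*IQR, Q3 + k*IQR] at the boundary (no rows dropped)."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return np.clip(x, q1 - k * iqr, q3 + k * iqr)

def min_max_normalize(y: np.ndarray) -> np.ndarray:
    """Eq. (1): scale a feature to [0, 1] using its minimum and maximum."""
    return (y - y.min()) / (y.max() - y.min())

cohesion = np.array([2.0, 5.0, 10.0, 15.0, 20.0, 40.0])  # illustrative kPa values
capped = iqr_cap(cohesion)            # the extreme 40.0 is pulled to the upper limit
processed = min_max_normalize(capped)
print(capped[-1], processed.min(), processed.max())
```

Capping via `np.clip` preserves the dataset size, matching the trimming-free approach described above. In practice, the scaling bounds would be computed on the training split only and reused on the test split to avoid data leakage.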

Baseline models

Random forest (RF)

Random Forest (RF) is a robust ensemble learning algorithm that constructs multiple decision trees using random subsets of both training samples and features11. Originally introduced by Breiman in 2001, RF effectively enhances prediction accuracy while mitigating overfitting through ensemble aggregation. The algorithm employs two key techniques: bagging (bootstrap aggregating), which creates multiple bootstrap samples from the original dataset, and the random subspace method, which selects random feature subsets for constructing each individual tree. Each decision tree is built using specific splitting criteria—Gini impurity for classification tasks or mean squared error for regression problems.

The final RF prediction is obtained by aggregating predictions from all constituent trees: majority voting for classification and averaging for regression tasks, as expressed in Eq. (2)38:

$${\bar{f}}_{rf}^{B}=\frac{1}{B}\sum_{b=1}^{B}T(x,{O}_{b})$$
(2)

where \(\bar{f}_{rf}^{B}\) represents the averaged output of the ensemble, B denotes the total number of trees, and T(x, Ob) is the output of the b-th tree constructed with the random parameter vector Ob.
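Eq. (2) can be verified directly with scikit-learn, whose `RandomForestRegressor` exposes its constituent trees; the features below are synthetic stand-ins, not the embankment dataset:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for a 7-feature regression problem.
rng = np.random.default_rng(42)
X = rng.uniform(size=(300, 7))
y = 1.0 + 0.7 * X[:, 0] - 0.4 * X[:, 2] + rng.normal(0.0, 0.02, size=300)

rf = RandomForestRegressor(n_estimators=200, random_state=42).fit(X, y)

# Eq. (2): the ensemble output is the average over the B individual tree outputs.
x_new = X[:1]
tree_outputs = np.array([tree.predict(x_new)[0] for tree in rf.estimators_])
print(np.isclose(tree_outputs.mean(), rf.predict(x_new)[0]))  # True
```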

This ensemble approach provides several advantages: enhanced prediction accuracy through variance reduction, improved generalization capability, and robust feature importance estimation. RF demonstrates exceptional performance across diverse applications due to its ability to handle large datasets efficiently and maintain resistance to overfitting.

However, RF exhibits certain limitations: susceptibility to overfitting with highly noisy datasets, challenges in handling severely imbalanced class distributions, increased computational overhead for enormous datasets requiring numerous trees, and difficulty capturing complex non-linear feature interactions. Additionally, RF may struggle with extrapolation beyond the training data range and can be sensitive to correlated features.

Consequently, Mixed Effects Random Forest (MERF) was developed to address these limitations, particularly for hierarchical and grouped data structures, as discussed in the following subsection.

Mixed effects random forest (MERF)

Mixed Effects Random Forest (MERF) is an advanced machine learning methodology that combines the strengths of mixed-effects models with random forest algorithms. This hybrid approach is particularly effective for analyzing data with complex correlation structures, including hierarchical, longitudinal, or grouped data, where traditional RF may exhibit limitations due to its inability to account for within-cluster dependencies. MERF was originally developed by Hajjem et al.15 to tackle the challenge of correlated observations within clusters while preserving the predictive power of ensemble methods. The mathematical framework of MERF is expressed through the following system of equations:

$$\:{Y}_{i}=f\left({X}_{i}\right)+{Z}_{i}{b}_{i}+{\varepsilon }_{i}$$
(3)
$$\:{b}_{i}\sim N\left(0,\:D\right)$$
(4)
$$\:{\epsilon\:}_{i}\sim N(0,\:{R}_{i})$$
(5)
$$\:{V}_{i}=Var\left({Y}_{i}\right)={Z}_{i}D{Z}_{i}^{T}+{R}_{i}$$
(6)
$$\:{R}_{i}={\sigma\:}^{2}{I}_{{n}_{i}}$$
(7)

where i = 1, …, m represents clusters, each containing ni observations (j = 1, …, ni), and Yi denotes the regression output vector (ni × 1) for cluster i. Xi represents the design matrix of input parameters (ni × p) for cluster i, f(Xi) stands for the fixed effects estimated by the RF component, and Zi denotes the design matrix (ni × q) for random effects, typically comprising feature subsets from Xi. bi represents the random effects vector (q × 1) specific to cluster i, where Zibi captures linear cluster-specific deviations, and εi represents the measurement errors (ni × 1) for cluster i. D is the covariance matrix of the random effects, Ri is the covariance matrix of the measurement errors, and Vi is the resulting covariance matrix of Yi.

The model operates under two fundamental assumptions:

  • Independence assumption: Random effects bi and measurement errors εi are mutually independent.

  • Correlation structure: Repeated measurements of Yi are correlated only through between-cluster variation, resulting in independent within-cluster measurement errors εi and a diagonal structure for Ri as specified in Eq. (7)

To estimate both the fixed effects and random effects parameters in Eqs. (3)–(7), MERF maximizes a generalized log-likelihood (GLL) objective function through an iterative Expectation-Maximization (EM) algorithm. This optimization procedure alternates between estimating the random forest fixed effects and updating the random effects parameters until convergence is achieved (Table 4).
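The EM alternation can be sketched for the simplest random-intercept case (Zi a column of ones, so each bi is a scalar cluster offset). This is an illustrative toy on synthetic data, not the study's implementation; the BLUP-style shrinkage update for bi below assumes Ri = σ²I as in Eq. (7).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy random-intercept data: m clusters, n_i observations each
rng = np.random.default_rng(1)
m, n_i = 20, 30
cluster = np.repeat(np.arange(m), n_i)
X = rng.uniform(size=(m * n_i, 3))
b_true = rng.normal(0.0, 0.5, size=m)            # b_i ~ N(0, D), Eq. (4)
y = np.sin(3 * X[:, 0]) + X[:, 1] + b_true[cluster] + 0.1 * rng.normal(size=m * n_i)

b = np.zeros(m)                                   # current random-effect estimates
sigma2, D = 1.0, 1.0                              # residual / random-effect variances
for _ in range(10):
    # Fixed-effects step: fit f(X) on y with the current random effects removed
    rf = RandomForestRegressor(n_estimators=50, max_depth=5, random_state=0)
    rf.fit(X, y - b[cluster])
    resid = y - rf.predict(X)
    # Random-effects step: BLUP-style shrinkage update per cluster (R_i = sigma2 * I)
    for i in range(m):
        r_i = resid[cluster == i]
        b[i] = D * r_i.sum() / (len(r_i) * D + sigma2)
    sigma2 = float(np.mean((resid - b[cluster]) ** 2))
    D = float(np.mean(b ** 2))
```

Because the cluster label is not a feature of the RF, the cluster-specific offsets can only be absorbed by the random-effects step, which is the essential division of labor in MERF.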

The MERF algorithm offers several advantages over traditional RF: improved handling of clustered data structures, enhanced prediction accuracy for hierarchical datasets, explicit modeling of within-cluster correlation, and maintenance of interpretability through the mixed-effects framework. These characteristics make MERF particularly suitable for applications involving repeated measurements, multi-level data structures, or scenarios where accounting for cluster-specific effects is crucial for accurate prediction.

Table 4 MERF algorithm with EM16.
$$\:GLL\left(f,\:\left.u\right|y\right)=\sum\:_{i=1}^{m}\left[{\left({y}_{i}-f\left({X}_{i}\right)-{Z}_{i}{u}_{i}\right)}^{T}{R}_{i}^{-1}\left({y}_{i}-f\left({X}_{i}\right)-{Z}_{i}{u}_{i}\right)+{u}_{i}^{T}{D}^{-1}{u}_{i}+log\left|D\right|+log\left|{R}_{i}\right|\right]$$
(8)
$$\:{\varepsilon\:}_{i}={Y}_{i}-f\left({X}_{i}\right)-{Z}_{i}{b}_{i}$$
(9)

Considering the generalized log-likelihood value (GLLk) after the kth cycle, the algorithm converges when:

$$\:\frac{\left|{GLL}_{k}-{GLL}_{k-1}\right|}{{GLL}_{k-1}}<\delta\:$$
(10)

The convergence threshold (δ) was set to 1 × 10−6, with a maximum of 100 iterations. For the dataset used in this study, the EM algorithm typically converged within 40 iterations, demonstrating stable and efficient parameter estimation.
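Eq. (10)'s stopping rule reduces to a one-line relative-change check (illustrative sketch):

```python
def has_converged(gll_k, gll_prev, delta=1e-6):
    """Relative-change stopping rule of Eq. (10)."""
    return abs(gll_k - gll_prev) / abs(gll_prev) < delta
```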

However, MERF exhibits several inherent limitations that can impact its practical implementation and performance:

  1. Computational complexity: MERF requires significant computational resources due to its iterative estimation procedure for random effects parameters. The expectation-maximization algorithm employed for parameter estimation can be time-intensive, particularly for large datasets with numerous clusters or high-dimensional feature spaces.

  2. Model interpretability: While MERF provides useful information on both fixed and random effects, interpretability remains challenging due to the black-box nature of the random forest component. The complex interaction between the ensemble tree structure and the mixed-effects framework can obscure the understanding of individual feature contributions and cluster-specific variations.

  3. Hyperparameter sensitivity: MERF performance depends heavily on the appropriate selection and tuning of hyperparameters for both the random forest component (e.g., number of trees, tree depth, splitting criteria) and the mixed-effects structure (e.g., random effects specification, covariance matrix parameterization). Suboptimal hyperparameter choices can significantly degrade model performance and prediction accuracy.

  4. Convergence challenges: The optimization process for fitting MERF models can encounter convergence difficulties, especially when dealing with complex hierarchical structures, insufficient sample sizes within clusters, or poorly conditioned covariance matrices. These convergence issues may result in unstable parameter estimates or failure to reach optimal solutions.

  5. Scalability limitations: As the number of clusters or cluster sizes increases, MERF faces scalability challenges in terms of both computational efficiency and memory requirements, potentially limiting its applicability to very large-scale datasets.

Consequently, the Artificial Bee Colony (ABC) optimization algorithm was integrated with MERF to develop ABC-MERF, specifically designed to address these limitations through intelligent hyperparameter optimization and enhanced convergence properties, as detailed in the following subsection.

Artificial bee colony-mixed effects random forest (ABC-MERF)

ABC-MERF is a hybrid model that combines the ABC algorithm with MERF. This innovative approach merges the strengths of both techniques to enhance predictive performance and model interpretability, especially in complex datasets with hierarchical structures. The ABC algorithm is elaborated in the following subsection, while MERF was explained in the previous section.

Artificial bee colony (ABC)

The Artificial Bee Colony (ABC) algorithm is a metaheuristic optimization technique that mimics the intelligent foraging behavior exhibited by honey bee swarms. Originally proposed by Karaboga in 2005 and subsequently refined with Akay19, the ABC algorithm has gained considerable recognition in the optimization community due to its algorithmic simplicity, computational efficiency, and robust performance across diverse optimization landscapes.

The ABC algorithm comprises two fundamental components: food sources, which represent candidate solutions in the search space, and three distinct bee populations that collectively orchestrate the optimization process. The algorithmic framework operates through four sequential phases:

  a. Initialization.

The initialization phase establishes the foundational parameters of the algorithm, including the number of food sources (SN), the maximum number of cycles for food source abandonment (limit), and the total number of iterations. Food sources are randomly distributed throughout the search space according to Eq. (11):

$$\:{x}_{i,\:j}={x}_{min,\:j}+rand\left(0,\:1\right)\times\:\left({x}_{max,\:j}-{x}_{min,\:j}\right)$$
(11)

where xi = (xi,1, xi,2, …, xi, D) represents the i-th food source position, and D denotes the dimensionality of the optimization problem20. Here, xi, j corresponds to the j-th component of the i-th food source, with i = 1, 2,…,SN and j = 1, 2,…,D. The parameters xmax,j and xmin,j define the upper and lower bounds for the j-th dimension, respectively. The population is equally divided between employed and onlooker bees, each numbering SN.
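Eq. (11) is a single vectorized line in NumPy; the bounds below are hypothetical placeholders standing in for hyperparameter ranges:

```python
import numpy as np

rng = np.random.default_rng(0)
SN, D = 30, 4                                # food sources, problem dimensionality
x_min = np.array([10.0, 2.0, 0.01, 1.0])     # hypothetical lower bounds per dimension
x_max = np.array([500.0, 30.0, 1.0, 20.0])   # hypothetical upper bounds per dimension

# Eq. (11): each component drawn uniformly within its [x_min, x_max] interval
food = x_min + rng.uniform(size=(SN, D)) * (x_max - x_min)
```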

  b. Employed bee phase.

During the employed bee phase, each employed bee is associated with a specific food source, representing a potential solution within the optimization landscape. These bees conduct local exploration around their assigned food sources using the neighborhood search Eq. (12):

$$\:{v}_{i,\:j}={x}_{i,\:j}+{\phi\:}_{i,\:j}\times\:\left({x}_{i,\:j}-{x}_{k,\:j}\right)$$
(12)

In this context, vi represents the newly generated candidate solution for the food source xi, while j is randomly selected from the range (1, 2, …, D). The index k is randomly chosen from (1, 2, …, SN) with the constraint that k ≠ i, and φi, j is a uniformly distributed random number within the interval [− 1, 1].

Solution quality is preserved through a greedy selection mechanism described by Eq. (13):

$$\:{x}_{i}^{new}=\left\{\begin{array}{c}{v}_{i}\:if\:f\left({v}_{i}\right)<f\left({x}_{i}\right)\\\:{x}_{i}\:if\:otherwise\end{array}\right.$$
(13)

where f(vi) and f(xi) represent the objective function values for the new and current solutions, respectively.
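Eqs. (12) and (13) together form one employed-bee sweep. A minimal sketch on a stand-in sphere objective (not the fitness function used in the study):

```python
import numpy as np

rng = np.random.default_rng(0)

def sphere(x):
    """Stand-in objective f(x); lower is better."""
    return float(np.sum(x ** 2))

SN, D = 10, 3
food = rng.uniform(-5.0, 5.0, size=(SN, D))          # current food sources
fvals = np.array([sphere(x) for x in food])
f0 = fvals.copy()                                    # objective values before the sweep

for i in range(SN):
    j = rng.integers(D)                              # random dimension to perturb
    k = rng.choice([s for s in range(SN) if s != i])  # random neighbour, k != i
    phi = rng.uniform(-1.0, 1.0)
    v = food[i].copy()
    v[j] = food[i, j] + phi * (food[i, j] - food[k, j])   # Eq. (12)
    if sphere(v) < fvals[i]:                         # Eq. (13): greedy selection
        food[i], fvals[i] = v, sphere(v)
```

Because selection is greedy, no food source's objective value can worsen during the sweep.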

  c. Onlooker bee phase.

Onlooker bees play a pivotal role in intensifying the search process by concentrating on promising food sources identified during the employed bee phase. Their probabilistic selection mechanism maintains an optimal balance between exploitation of high-quality solutions and exploration of uncharted regions. The fitness value for each food source location is calculated using Eq. (14):

$$\:f\:{it}_{i}=\left\{\begin{array}{c}\frac{1}{1+{f}_{i}}\:if\:{f}_{i}\ge\:0\\\:1+\left|{f}_{i}\right|\:otherwise\end{array}\right.$$
(14)

where f iti represents the fitness value and fi is the objective function value for the i-th food source. The selection probability for each candidate solution is subsequently computed using Eq. (15):

$$\:{p}_{i}=\frac{f{it}_{i}}{{\sum\:}_{m=1}^{SN}f{it}_{m}}$$
(15)

Food sources with higher fitness values possess greater selection probabilities. Selected food sources undergo the neighborhood search procedure described in Eq. (12), followed by greedy selection as specified in Eq. (13). Upon completion of this process, the trial counter triali is reset to zero if the new solution vi surpasses the previous food source; otherwise, triali is incremented by one.
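Eqs. (14) and (15) can be sketched as follows; note how the piecewise fitness maps lower (including negative) objective values to higher selection probabilities:

```python
import numpy as np

def fitness(f):
    """Eq. (14): maps objective values to fitness (lower objective -> higher fitness)."""
    f = np.asarray(f, dtype=float)
    return np.where(f >= 0, 1.0 / (1.0 + f), 1.0 + np.abs(f))

f_obj = np.array([0.0, 1.0, 3.0, -2.0])      # hypothetical objective values
fit = fitness(f_obj)                          # -> [1.0, 0.5, 0.25, 3.0]
p = fit / fit.sum()                           # Eq. (15): roulette-wheel probabilities
```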

  d. Scout bee phase.

The scout bee phase implements a diversification mechanism to prevent premature convergence and maintain population diversity. Each candidate solution maintains an associated counter, denoted as triali (where i = 1, 2, …, SN), which tracks the number of consecutive iterations without improvement. When this counter exceeds a predefined threshold (limit parameter), the corresponding food source is considered exhausted and is abandoned by the employed bee. A new food source is subsequently generated using the initialization equation (Eq. 11), and the associated trial counter is reset to zero. This abandonment and regeneration process ensures continuous exploration of the search space while maintaining algorithmic diversity. The complete algorithmic procedure for the ABC optimization technique is presented in Table 5.

Table 5 Artificial bee colony algorithm20.
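Putting the four phases together, a compact, self-contained ABC sketch (illustrative only; the parameter values and test objective are arbitrary, not those used in the study) might look like:

```python
import numpy as np

def abc_minimize(f, lo, hi, SN=20, limit=30, iters=100, seed=0):
    """Compact sketch of the ABC procedure of Table 5 (illustrative, not the study's code)."""
    rng = np.random.default_rng(seed)
    D = len(lo)
    food = lo + rng.uniform(size=(SN, D)) * (hi - lo)     # initialization, Eq. (11)
    fvals = np.array([f(x) for x in food])
    trial = np.zeros(SN, dtype=int)

    def neighbour(i):
        j = rng.integers(D)
        k = rng.choice([s for s in range(SN) if s != i])
        v = food[i].copy()
        v[j] = np.clip(food[i, j] + rng.uniform(-1, 1) * (food[i, j] - food[k, j]),
                       lo[j], hi[j])                      # Eq. (12), kept in bounds
        return v

    def greedy(i, v):                                     # Eq. (13) + trial counter
        fv = f(v)
        if fv < fvals[i]:
            food[i], fvals[i], trial[i] = v, fv, 0
        else:
            trial[i] += 1

    for _ in range(iters):
        for i in range(SN):                               # employed bee phase
            greedy(i, neighbour(i))
        fit = np.where(fvals >= 0, 1.0 / (1.0 + fvals), 1.0 + np.abs(fvals))  # Eq. (14)
        p = fit / fit.sum()                               # Eq. (15)
        for _ in range(SN):                               # onlooker bee phase
            i = int(rng.choice(SN, p=p))
            greedy(i, neighbour(i))
        for i in range(SN):                               # scout bee phase
            if trial[i] > limit:
                food[i] = lo + rng.uniform(size=D) * (hi - lo)
                fvals[i] = f(food[i])
                trial[i] = 0

    best = int(np.argmin(fvals))
    return food[best], float(fvals[best])

# Minimize a shifted sphere; the optimum is at (1, 1)
x_best, f_best = abc_minimize(lambda x: float(np.sum((x - 1.0) ** 2)),
                              lo=np.full(2, -5.0), hi=np.full(2, 5.0))
```

In the ABC-MERF setting, `f` would instead train a MERF with the candidate hyperparameters and return a cross-validated error.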

The implementation of the ABC-MERF framework for safety factor prediction follows a systematic approach comprising seven distinct phases, as illustrated in the simplified flowchart presented in Figure 6. The methodology integrates the Artificial Bee Colony optimization algorithm with mixed-effects random forest modeling to achieve optimal hyperparameter configuration and enhanced predictive performance.

Fig. 6
figure 6

Flowchart illustrating the integration of FEM simulations with machine learning for safety factor prediction in slope stability analysis.

Data preparation and preprocessing: The dataset underwent comprehensive evaluation to verify its compatibility with MERF implementation, specifically confirming the presence of well-defined hierarchical structures essential for mixed-effects modeling. Subsequently, rigorous data preprocessing procedures were executed to ensure data quality and analytical readiness. This preprocessing pipeline included systematic outlier detection and treatment, followed by data normalization through appropriate imputation techniques, as detailed in Sect. 1.3.2. These preparatory steps were crucial for maintaining data integrity and ensuring optimal model performance.

MERF model architecture definition: The MERF model structure was systematically configured by specifying critical architectural parameters, including the number of decision trees to be incorporated within the random forest ensemble, the number of predictor variables evaluated at each decision node split, and the fundamental components governing the mixed-effects modeling framework. This architectural specification establishes the foundational structure upon which the optimization process operates.

ABC algorithm initialization: The ABC optimization algorithm was initialized with carefully selected parameters to ensure effective exploration of the hyperparameter space. The algorithmic configuration comprised a population size of 30 food sources, with 15 employed bees and 25 onlooker bees participating in the optimization process. The abandonment limit was established at 90 cycles, while the maximum iteration count was set to 150. Each food source within the population represents a unique configuration of MERF hyperparameters, forming potential solutions within the optimization landscape. Additionally, the algorithm-specific parameters C and γ were initialized at 124.512 and 0.512, respectively, to guide the search process effectively.

The ABC Optimization Cycle consists of three coordinated phases that are executed iteratively during the optimization process.

Employed Bee phase: During this phase, each food source undergoes comprehensive evaluation through MERF model training using its corresponding hyperparameter configuration. The fitness assessment is conducted based on predetermined evaluation metrics, such as mean squared error (MSE) or classification accuracy, depending on the problem formulation. New candidate solutions are generated through controlled perturbations applied to existing food sources, introducing diversity while maintaining proximity to promising regions. Selection operates on a greedy basis, where food sources demonstrating superior fitness values are retained for subsequent iterations.

Onlooker Bee phase: Food source selection occurs probabilistically, with selection probabilities proportional to their respective fitness values. This mechanism ensures that high-performing configurations receive increased attention while maintaining exploration diversity. New food sources are systematically generated and subjected to comparative evaluation against existing solutions, facilitating continuous improvements in population quality.

Scout Bee phase: Exhausted food sources—those failing to demonstrate improvement over the predefined limit—are identified and replaced with randomly generated alternatives. This diversification mechanism prevents premature convergence and maintains population diversity throughout the optimization process.

Hyperparameter optimization process: The ABC algorithm systematically refines MERF hyperparameters through iterative cycles of evaluation, generation, and selection. This evolutionary process enables comprehensive exploration of the hyperparameter space, ultimately converging toward optimal configurations that maximize model performance. The optimization objective focuses on identifying hyperparameter combinations that yield superior predictive accuracy while maintaining model generalizability.

Final model construction and validation: The optimal hyperparameter configuration identified through the ABC optimization process serves as the foundation for constructing the final MERF model. The complete dataset is partitioned using a 70:30 split ratio for training and testing purposes, respectively. To ensure robust performance estimation and mitigate overfitting risks, a stratified 5-fold cross-validation procedure is implemented.

The cross-validation process involves partitioning the training dataset into five equally sized, mutually exclusive folds. During each validation iteration, four folds are aggregated to form the training subset, while the remaining fold serves as the validation set. The MERF model undergoes training on the designated training subset and subsequent evaluation on the corresponding validation set. This procedure is repeated across all five folds, ensuring that each data point participates in validation exactly once.

The aggregated performance metrics across all cross-validation folds provide a reliable and unbiased estimate of the model’s predictive capabilities, offering greater statistical confidence than single holdout validation approaches.
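The fold loop described above can be sketched with scikit-learn's `KFold` on synthetic data (a plain, non-stratified sketch standing in for the study's pipeline):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic stand-in for the training portion of the FOS dataset
rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 4))
y = X @ np.array([2.0, -1.0, 0.5, 1.5]) + 0.05 * rng.normal(size=300)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
rmses = []
for train_idx, val_idx in kf.split(X):
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X[train_idx], y[train_idx])          # four folds train
    pred = model.predict(X[val_idx])               # held-out fold validates
    rmses.append(mean_squared_error(y[val_idx], pred) ** 0.5)
cv_rmse = float(np.mean(rmses))                    # aggregated performance estimate
```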

Model evaluation and performance assessment: The optimized MERF model is deployed to generate predictions on the reserved test dataset, which remains completely independent of the model development process. Comprehensive performance evaluation is conducted using multiple complementary metrics, including Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and coefficient of determination (R2). These metrics collectively quantify the model’s predictive accuracy, residual distribution characteristics, and explained variance, providing a holistic assessment of model effectiveness. The multi-metric evaluation approach ensures thorough characterization of model performance across different aspects of predictive quality, enabling comprehensive comparison with alternative modeling approaches and validation of the proposed ABC-MERF framework’s efficacy.

Models’ performance indicators

Model performance evaluation constitutes a critical component in machine learning methodology, providing quantitative measures to assess the predictive accuracy, reliability, and generalization capability of developed models. These metrics serve as objective benchmarks that enable systematic comparison of different modeling approaches and facilitate the identification of optimal solutions for specific problem domains. In this study, three complementary performance indicators were employed: the mean absolute error (MAE) in Eq. (16), the root mean squared error (RMSE) in Eq. (18), which is the square root of the mean squared error (MSE) in Eq. (17), and the coefficient of determination (R2) in Eq. (19)12,17. Lower RMSE and MAE values indicate better performance; in general, a model with a higher R2 and lower RMSE and MAE is considered the better performer.

$$\text{Mean Absolute Error }\:MAE=\frac{1}{N}{\sum\:}_{i=1}^{N}\left|{P}_{i}-{A}_{i}\right|$$
(16)
$$\text{Mean Squared Error }\:MSE=\frac{1}{N}{\sum\:}_{i=1}^{N}{({P}_{i}-{A}_{i})}^{2}$$
(17)
$$\text{Root Mean Squared Error }\:RMSE=\sqrt{\frac{1}{N}{\sum\:}_{i=1}^{N}{({P}_{i}-{A}_{i})}^{2}}$$
(18)
$$\text{R-squared }\:{R}^{2}=1-\frac{{\sum\:}_{i=1}^{N}{({P}_{i}-{A}_{i})}^{2}}{{\sum\:}_{i=1}^{N}{({A}_{i}-\stackrel{-}{A})}^{2}}$$
(19)

where Ai denotes the actual values, Pi the predicted values, A̅ the mean of the actual values, and N the number of data points.
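Eqs. (16), (18), and (19) translate directly into NumPy (illustrative helper functions with hypothetical values, not the study's code):

```python
import numpy as np

def mae(A, P):
    """Eq. (16): mean absolute error."""
    return float(np.mean(np.abs(P - A)))

def rmse(A, P):
    """Eq. (18): root mean squared error (square root of the MSE in Eq. (17))."""
    return float(np.sqrt(np.mean((P - A) ** 2)))

def r2(A, P):
    """Eq. (19): coefficient of determination."""
    ss_res = np.sum((P - A) ** 2)
    ss_tot = np.sum((A - A.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

A = np.array([1.0, 1.2, 1.5, 0.9])      # hypothetical actual FOS values
P = np.array([1.05, 1.15, 1.45, 0.95])  # hypothetical predictions
```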

All model development, training, and evaluation procedures were implemented using Python 3.12.7, leveraging its comprehensive ecosystem of machine learning libraries and numerical computation tools. This implementation environment ensures reproducibility and facilitates integration with established machine learning workflows and best practices.

Results and discussions

Hyperparameter tuning

Hyperparameter tuning is the process of optimizing the values of hyperparameters in an ML model to achieve the best performance. Unlike model parameters, which are learned from the data during training, hyperparameters are set before training begins. Three main methods are commonly used: grid search, random search, and Bayesian optimization18,39. In this paper, grid search was used for hyperparameter tuning because it exhaustively evaluates combinations of hyperparameter values drawn from a specified range. Grid search (GridSearchCV) is a popular and straightforward method for optimizing hyperparameters in machine learning40,41. The technique works by systematically testing all possible combinations within a predefined hyperparameter space to determine the optimal settings. However, grid search has notable drawbacks when dealing with continuous hyperparameters, where determining effective sampling strategies becomes difficult. Additionally, as the number of hyperparameters increases, grid search becomes computationally expensive because the number of required evaluations grows exponentially with each added dimension. Table 6 displays the tuned hyperparameters for this paper.
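A minimal `GridSearchCV` sketch on synthetic data (the grid below is hypothetical; the study's actual search ranges are summarized in Table 6):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 4))
y = X[:, 0] ** 2 + X[:, 1] + 0.05 * rng.normal(size=200)

# Hypothetical grid; every combination is evaluated with 3-fold CV
param_grid = {"n_estimators": [50, 100], "max_depth": [5, 10]}
search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid,
                      cv=3, scoring="neg_root_mean_squared_error")
search.fit(X, y)                         # 4 combinations x 3 folds = 12 fits
best = search.best_params_
```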

Table 6 Optimal hyperparameter configurations for each machine learning algorithm.

Table 6 shows the optimal hyperparameters and performance metrics for each machine learning model used to predict the factor of safety in slope stability analysis. For clarity, the table includes only the most influential hyperparameters identified through sensitivity analysis, excluding those with minimal impact on model performance. These optimized settings resulted in highly accurate predictive models.

Models’ performance analysis

Figure 7 presents a comparative evaluation of Random Forest (RF), Mixed Effects Random Forest (MERF), and Artificial Bee Colony-Mixed Effects Random Forest (ABC-MERF) models for predicting the factor of safety (FOS) in slope stability analysis of high road embankments. The RMSE results reveal a clear performance hierarchy, with RF showing the highest error (0.036), while both MERF (0.026) and ABC-MERF (0.025) demonstrate substantially improved accuracy. The 28% reduction in RMSE from RF to MERF indicates that the mixed effects approach effectively reduces prediction variance, while the marginal improvement from MERF to ABC-MERF suggests that hyperparameter optimization through ABC provides incremental but meaningful gains. The MAE analysis shows even more pronounced differentiation, with RF exhibiting the highest MAE (0.026), followed by MERF (0.017) and ABC-MERF (0.016). This 35% improvement from RF to MERF in MAE, compared to the 28% improvement in RMSE, indicates that the mixed effects approach particularly excels at reducing typical prediction errors rather than just managing outliers. All models demonstrate excellent explanatory power with R2 values exceeding 0.98, where RF achieves 0.981 and both MERF and ABC-MERF reach 0.991; the gain from 0.981 to 0.991 corresponds to recovering roughly half of the previously unexplained variance, which is significant in geotechnical applications where model reliability is paramount. From a geotechnical engineering perspective, these performance metrics are exceptionally promising, as MAE values below 0.02 indicate that typical prediction errors are less than 2% of the FOS scale, falling within acceptable engineering tolerances for preliminary slope stability assessments.

Fig. 7
figure 7

Performance metrics comparison for factor of safety prediction models (RF, MERF, and ABC-MERF).

The progressive improvements observed from RF to MERF to ABC-MERF can be attributed to two key methodological enhancements: the incorporation of mixed effects modeling to handle hierarchical geotechnical data structures and the implementation of Artificial Bee Colony optimization for enhanced hyperparameter tuning. While the numerical differences may appear marginal, these improvements are practically significant in geotechnical engineering contexts where FOS values typically range from 1.0 to 3.0, and even minor enhancements in predictive accuracy can substantially impact safety margin assessments and risk evaluation for critical infrastructure projects. The consistently high performance across all models suggests several positive factors: robust feature selection that captures the underlying physical mechanisms governing embankment stability, high-quality geotechnical datasets with comprehensive site characterization, and the appropriateness of tree-based ensemble methods for modeling the complex, non-linear relationships inherent in slope stability analysis.

From Fig. 8, the scatter plots compare predicted versus actual factor of safety values for the three machine learning models (RF, MERF, and ABC-MERF), providing visual confirmation of the quantitative performance metrics previously discussed. All three models demonstrate excellent predictive accuracy, with data points closely aligned along the ideal 1:1 line (perfect prediction line). The RF model (R2 = 0.981) shows slightly more scatter around the ideal line, particularly in the mid-range FOS values (1.0–1.4), indicating some prediction variability.

Fig. 8
figure 8

Scatter plots of predicted versus actual factor of safety values for RF, MERF, and ABC-MERF models.

Both MERF and ABC-MERF models (R2 = 0.991) exhibit tighter clustering around the perfect prediction line, with ABC-MERF showing marginally less scatter, especially in the higher FOS range (1.4–1.8). The consistent performance across the entire FOS range (approximately 0.6 to 1.8) demonstrates model robustness for various slope stability conditions, from marginally stable to highly stable embankments. Notably, all models maintain satisfactory predictive accuracy at critical FOS values near 1.0–1.5, which are particularly important for engineering decision-making in slope stability assessments.

To contextualize the performance of the proposed ABC-MERF model, a comparison with results from the literature demonstrates its competitive advantages. Huang et al. (2023) reported R2 values of 0.94–0.96 and RMSE values of 0.15–0.22 for various machine learning models including Support Vector Machines and Long Short-Term Memory networks applied to slope stability problems. Gaussian Process Regression as implemented by Zhu et al.30 achieved an R2 of 0.94 and RMSE of 0.08 using field data from a single site with 500 data points. Support Vector Machine with Particle Swarm Optimization (SVM-PSO) reported by Li et al. (2022) demonstrated improved performance with R2 of 0.96 and RMSE of 0.06 across 800 synthetic data points under simple geometric conditions. Deep Neural Network approaches by Huang et al. (2023) further advanced the field with R2 of 0.98 and RMSE of 0.04 using mixed data sources comprising 1000 data points. The ABC-MERF framework developed in this study achieved superior performance metrics with R2 = 0.991 and RMSE = 0.025 across a comprehensive finite element dataset of 1176 simulations. While direct comparison is limited due to differences in datasets, data sources, and problem complexity, the significantly higher R2 and substantially lower RMSE achieved by ABC-MERF underscore the potential advantage of integrating mixed effects modeling and bio-inspired optimization for handling the complexities of embankment stability analysis. This suggests that the ABC-MERF framework offers a competitive, high-accuracy alternative to existing machine learning methods in the geotechnical domain while maintaining interpretability and the ability to handle clustered data structures.

Figure 9 provides comprehensive diagnostic insights into the model performance characteristics and underlying assumptions through residual analysis and Q–Q plots, revealing distinct patterns in prediction accuracy and error distribution across the three machine learning approaches.

Fig. 9
figure 9

Diagnostic plots for model validation: (top row) residuals versus predicted values for RF, MERF, and ABC-MERF models, and (bottom row) Q–Q plots assessing the normality of residuals against theoretical quantiles for each modeling approach.

The residuals versus predicted value plots demonstrate that all three models achieve excellent centering around zero residuals, indicating the absence of systematic bias, though subtle differences in error patterns emerge upon closer examination. The Random Forest model exhibits residuals ranging from approximately − 0.15 to + 0.15 with relatively uniform scatter across the prediction range, though slight heteroscedasticity is observable, with marginally increased variance at higher predicted FOS values and a few notable outliers at the extremes of the prediction range. The MERF model shows a more uniform variance distribution across predicted values, although its wider residual range (− 0.3 to + 0.3) suggests that, while the model captures complex hierarchical relationships, it may introduce slightly higher variability in some predictions. The ABC-MERF model demonstrates the most favorable residual pattern, with errors tightly clustered between − 0.15 and + 0.15, the most homogeneous variance across the entire prediction spectrum, and fewer extreme outliers, confirming the beneficial effects of hyperparameter optimization on prediction stability.

The Q–Q plots offer critical details about the normality assumptions underlying model residuals, with all three models showing strong adherence to the theoretical normal distribution along the diagonal reference line. The Random Forest Q–Q plot reveals excellent normality in the central portion of the distribution with slight deviations at the extreme quantiles, suggesting occasional larger prediction errors that deviate from the expected Gaussian behavior. The MERF Q–Q plot demonstrates superior normality across a broader range of quantiles, with only minor deviations at the extreme tails, indicating that the mixed effects framework successfully captures the underlying error structure while maintaining distributional assumptions. The ABC-MERF Q–Q plot exhibits the closest adherence to the theoretical normal line across nearly the entire quantile range, with minimal deviations even at the extremes, confirming that the optimization process not only improves prediction accuracy but also enhances the statistical properties of the residual distribution.

To comprehensively validate the model comparisons, we supplemented the Wilcoxon signed-rank tests with additional statistical analyses. Paired-sample t-tests confirmed statistically significant differences between all model pairs: RF vs. MERF (t = 8.34, p < 0.001), MERF vs. ABC-MERF (t = 3.21, p = 0.003), and RF vs. ABC-MERF (t = 12.45, p < 0.001). The effect sizes, quantified by Cohen’s d, were substantial, showing a large improvement from RF to MERF (d = 0.72) and a smaller but meaningful gain from MERF to ABC-MERF (d = 0.31). This hierarchy of performance was further corroborated by bootstrap-derived 95% confidence intervals for RMSE: RF [0.034, 0.038], MERF [0.024, 0.028], and ABC-MERF [0.023, 0.027]. Therefore, while the absolute numerical improvements may appear modest, they are both statistically significant and practically relevant in geotechnical engineering, where the factor of safety (FOS) operates on a critical scale typically between 1.0 and 3.0. Beyond point predictions, the MERF framework allows for the estimation of prediction intervals by leveraging the variance of the random effects and the residual error. For the ABC-MERF model, the 95% prediction interval for FOS predictions was calculated to be approximately ± 0.08. This provides a quantitative measure of uncertainty, which is essential for risk-informed decision-making in engineering practice.
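A percentile-bootstrap CI for RMSE of the kind reported above can be sketched as follows (synthetic residuals with a hypothetical noise level, not the study's data or exact procedure):

```python
import numpy as np

def bootstrap_rmse_ci(actual, pred, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for RMSE over paired test-set errors."""
    rng = np.random.default_rng(seed)
    err2 = (np.asarray(pred) - np.asarray(actual)) ** 2
    n = len(err2)
    idx = rng.integers(0, n, size=(n_boot, n))        # resample errors with replacement
    stats = np.sqrt(err2[idx].mean(axis=1))           # RMSE of each resample
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])

# Synthetic test-set predictions with noise sd ~0.03, standing in for model residuals
rng = np.random.default_rng(1)
actual = rng.uniform(0.6, 1.8, size=300)
pred = actual + rng.normal(0.0, 0.03, size=300)
lo_ci, hi_ci = bootstrap_rmse_ci(actual, pred)
```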

These diagnostic analyses collectively validate the robustness of all three approaches while highlighting the progressive improvements achieved through mixed effects modeling and ABC optimization, with ABC-MERF demonstrating superior performance in both prediction accuracy and adherence to statistical assumptions critical for reliable uncertainty quantification in geotechnical applications.

Individual effect of key input variables on the predicted output

As shown in Fig. 10, the Partial Dependence Plots (PDPs) reveal the individual influence of key input variables on factor of safety predictions, providing critical insights into the physical relationships governing embankment stability as captured by the machine learning models.

Fig. 10

Partial dependence plots illustrating the individual effects of inputs on safety factor predictions: (a) friction angle, (b) embankment height, (c) cohesion, and (d) slope 1 angle.

The friction angle (Ø) demonstrates a pronounced positive correlation with FOS, increasing in an exponential-like manner from approximately 0.87 to 1.25 across its normalized range. This aligns with fundamental soil mechanics, where higher internal friction angles directly enhance shear strength and slope stability. Conversely, embankment height shows a strong inverse relationship with FOS, which declines from approximately 1.3 to 0.8 as the normalized height increases from −1 to 2, reflecting the well-established geotechnical principle that taller embankments generate greater driving forces and reduced stability margins due to increased overburden stresses and longer potential failure surfaces. Cohesion exhibits distinctive asymptotic behavior: FOS rises rapidly from 0.93 to approximately 1.06 as cohesion increases from low to moderate values and then plateaus, indicating that while cohesive strength strongly affects stability at lower values, its influence diminishes beyond a threshold as other failure mechanisms dominate. The slope angle (Slope 1) displays a clear linear negative relationship with FOS, decreasing from approximately 1.08 to 0.96 across the parameter range, confirming the expected inverse correlation between slope steepness and stability margins. Collectively, these PDPs demonstrate that the machine learning models have captured the underlying physics of slope stability, with friction angle and height as the most influential parameters, followed by cohesion and slope geometry. This provides confidence that the predictions rest on mechanistically sound relationships rather than spurious correlations and offers geotechnical engineers valuable insight into parameter sensitivities for embankment design optimization.

Figure 11 displays six 2D partial dependence plots that illustrate the interactive effects of key inputs on factor of safety predictions. These visualizations reveal how pairs of variables jointly influence slope stability, offering important information for engineering decision-making. The friction angle (Ø) vs. height plot shows that the friction angle becomes increasingly critical as embankment height increases, with a clear transition zone where higher friction angles are essential for stability in taller embankments. The friction angle (Ø) vs. cohesion (c) interaction demonstrates complementary behavior between cohesion and friction angle, where lower values of one parameter can be compensated by higher values of the other, illustrating the fundamental principle of soil shear strength.

Fig. 11

Two-dimensional partial dependence plots showing interactive effects between friction angle (Ø), cohesion (c), embankment height, and slope angle (Slope 1) on factor of safety predictions.

The friction angle (Ø) vs. Slope 1 relationship reveals that steeper slopes require significantly higher friction angles to maintain stability, with a pronounced gradient indicating high sensitivity to slope geometry. The height vs. cohesion (c) plot shows that cohesion becomes more critical for stability as embankment height increases; for high embankments in particular, cohesive strength is essential. The height vs. Slope 1 interaction illustrates the combined geometric effects, showing that the combination of increased height and steeper slopes compounds the stability challenge. Finally, the cohesion (c) vs. Slope 1 plot demonstrates how cohesion requirements increase dramatically with slope steepening, particularly for slopes exceeding 0.5 radians (approximately 29°). The contour patterns reveal non-linear interactions that simple univariate analysis would miss, enabling more informed geotechnical design decisions and risk assessment for high road embankments.

SHAP analysis

The SHAP (SHapley Additive exPlanations) analysis provides a comprehensive ranking of feature importance and reveals the directional impact of each input parameter on factor of safety predictions, offering crucial insights into the relative significance and behavioral patterns of variables in embankment stability assessment. The results in Fig. 12 demonstrate that friction angle (Ø) emerges as the most influential parameter, with SHAP values ranging from approximately − 0.15 to + 0.35, indicating its dominant role in slope stability predictions. Height and number of berms (N° Berm) follow as secondary influential factors, with substantial impact ranges suggesting their critical importance in embankment design. Slope 1 and cohesion (c) demonstrate moderate but significant influence, while geometric parameters like Slope 3, water content (w), Slope 2, elastic modulus (E), Poisson’s ratio (ν), unit weight (γ), and Slopes 4–5 show progressively decreasing importance.

Fig. 12

SHAP summary plot showing feature importance and directional effects on factor of safety predictions.

The color coding reveals important parameter-response relationships: higher friction angles (red points) consistently contribute positively to the factor of safety, while lower values (blue points) negatively impact stability. For height, the relationship appears inverse—greater heights (red) tend to reduce the factor of safety, aligning with engineering expectations. The number of berms shows a complex relationship where both high and low values can have varying effects, suggesting optimal berm configurations exist. Cohesion demonstrates the expected positive relationship, where higher cohesive strength enhances stability. This analysis confirms that soil strength parameters (friction angle and cohesion) are paramount for slope stability, while geometric factors (height and slope angles) and stabilization measures (berms) play significant secondary roles.

To provide critical insights into the relative significance of various design and material properties in embankment stability assessment, the SHAP feature importance analysis for the ABC-MERF model is illustrated in Fig. 13 and reveals a clear hierarchical ranking of geotechnical parameters based on their average impact magnitude on factor of safety predictions.

Fig. 13

Mean absolute SHAP values showing average feature importance for factor of safety prediction using the ABC-MERF model.

The results establish a clear dominance hierarchy, with friction angle (Ø) exhibiting the highest mean SHAP value (~ 0.14), confirming its paramount importance in slope stability analysis. Height follows as the second most influential parameter (~ 0.12), emphasizing the critical role of embankment geometry in stability assessment. Slope 1 ranks third (~ 0.06), representing approximately half the influence of height, while cohesion (c) and number of berms (N° Berm) show moderate importance levels (~ 0.04 each). Slope 3 demonstrates modest influence (~ 0.03), while the remaining parameters—water content (w), Slope 2, Poisson’s ratio (ν), elastic modulus (E), and unit weight (γ)—exhibit relatively minor impacts (< 0.02). This quantitative ranking validates fundamental geotechnical engineering principles, where soil strength parameters (friction angle, cohesion) and geometric factors (height, primary slope angle) dominate stability considerations. The dramatic difference between the top three parameters and others suggests that focused attention on friction angle, embankment height, and primary slope geometry will yield the greatest impact on predictive accuracy and design optimization. The minimal influence of elastic properties (E, ν) and unit weight indicates these parameters, while necessary for comprehensive analysis, have secondary effects on overall stability.

Conclusion

This study presents a novel hybrid machine learning framework for predicting the Factor of Safety (FOS) in high road embankments through the integration of Random Forest (RF), Mixed Effects Random Forest (MERF), and Artificial Bee Colony optimized Mixed Effects Random Forest (ABC-MERF) models. The comprehensive analysis yields several critical findings that advance both machine learning applications and geotechnical engineering practice.

The study demonstrates that advanced machine learning approaches can achieve exceptional predictive accuracy for geotechnical applications, with all three models attaining R² values exceeding 0.98 and RMSE values below 0.036. While the incremental improvements from RF (R² = 0.981) to MERF and ABC-MERF (R² = 0.991) appear numerically modest, these enhancements represent meaningful advances in a field where small accuracy gains can prevent catastrophic failures. The ABC optimization proved particularly effective in hyperparameter fine-tuning, resulting in superior outlier handling and improved residual distribution properties without overfitting.

The SHAP analysis revealed a clear parameter importance hierarchy that aligns perfectly with fundamental soil mechanics principles. Friction angle (0.14 mean absolute SHAP value) and embankment height (0.12) emerge as dominant factors, followed by slope geometry (0.06) and cohesion (0.04). This ranking validates that the models have captured genuine physical relationships rather than spurious correlations, with embankment properties significantly outweighing foundation characteristics in determining stability.

The Partial Dependence Plots confirm that the models successfully learned fundamental geotechnical relationships, including positive correlation between friction angle and FOS, inverse relationship between height and stability, and linear negative correlation with slope angles. These patterns mirror established slope stability theory, providing confidence in the models’ reliability for practical engineering applications.

Residual analysis demonstrates that ABC-MERF achieves the most favorable error characteristics with homogeneous variance, minimal systematic bias, and excellent adherence to normality assumptions. The superior performance at extreme quantiles is particularly crucial for reliability in safety-critical geotechnical applications.

The study establishes quantitative guidance for engineering priorities, with the dominance of friction angle and height suggesting that material selection and geometric optimization should be primary design considerations.

The successful integration of mixed effects modeling with ABC optimization demonstrates a pathway for enhancing traditional ensemble methods in geotechnical applications. This approach effectively handles hierarchical data structures common in geotechnical engineering while maintaining interpretability through SHAP analysis, bridging the gap between prediction accuracy and engineering understanding.

The consistently high performance across diverse FOS ranges (0.6–1.8) and absence of systematic prediction bias indicate that these models are ready for practical deployment in embankment design workflows. This framework demonstrates strong potential to enhance risk assessment and safety factor determination in geotechnical practice while maintaining the physical interpretability essential for engineering confidence and regulatory acceptance.

Limitations and future work

This study presents a robust framework based on synthetic data generated from validated finite element models. While this allows for controlled parametric analysis, a primary limitation is that the models are yet to be validated on a large-scale, instrumented field case study. Performance with real-world data, which includes greater uncertainty and measurement noise, must be assessed in future work. Furthermore, the current model considers static loading conditions only. It does not account for dynamic forces such as seismic loading or transient hydrological processes like rainfall infiltration, which are critical triggers of slope instability. A promising direction for future research is to extend the ABC-MERF framework to incorporate time-dependent and dynamic parameters, such as pore water pressure distributions under rainfall or peak ground acceleration for seismic events, to enhance its applicability to a wider range of geohazards.

Additionally, the current model treats soil parameters as homogeneous within layers. Real soil deposits exhibit spatial variability, which can significantly influence failure mechanisms. The MERF framework, with its inherent ability to handle grouped data, is conceptually well-positioned to be extended to this challenge. Future work could involve defining clusters based on spatial location (e.g., using random fields) and integrating spatially variable parameters into the model, thereby directly addressing the effect of geotechnical uncertainty on slope reliability.

Furthermore, while the ABC-MERF framework demonstrates superior performance on the test dataset, practitioners should exercise caution when applying the model to sites with characteristics substantially different from the training parameter ranges. The model’s predictions should be used as a screening tool and verified through site-specific analysis for critical infrastructure projects. The framework does not replace engineering judgment or detailed geotechnical investigation but rather serves as a complementary decision-support tool.