Introduction

The development of river networks is shaped by complex biogeomorphic processes and various contributing factors, with channel evolution driven primarily by abiotic processes such as morphological dynamics and hydrodynamic characteristics1,2. However, recent studies have increasingly emphasized the importance of biological processes, particularly vegetation-flow interactions3,4,5,6,7. Aquatic vegetation plays a key role in habitat protection, sediment transport, and other ecological processes7,8. Vegetation such as reeds, mangroves, and other wetland species has been widely employed in river ecosystem management as a nature-based solution (NBS) for ecological protection. Quantifying the impact of vegetation on flow resistance is therefore essential for designing river management and restoration strategies9, and a thorough understanding of vegetation-induced resistance is crucial for evaluating the role of aquatic vegetation in ecological engineering10.

Accurately quantifying vegetation-induced resistance remains a formidable challenge11, as it requires a deep understanding of how fluid dynamics interact with vegetation structures and shape hydrodynamic processes. Nepf3 explored how vegetation influences hydrodynamics, particularly how the shape, distribution, and density of vegetation affect fluid flow. Manners et al.12 developed a model to quantify the spatial distribution of vegetation roughness across different scales. However, most studies to date have focused primarily on determining the drag coefficient Cd, which is typically calibrated against experimental measurements in the laboratory or field13. Figure 1 illustrates cross-sectional flow patterns through emergent and submerged vegetation. The contribution of vegetation to overall flow resistance is commonly either neglected, lumped into the total river resistance, or estimated through roughness or friction coefficients14,15.

Fig. 1
figure 1

Cross-sections of flow through vegetation (a emergent vegetation, b submerged vegetation, c vegetation distribution).

The parameterization of Cd has been a widely discussed issue in recent research, with many researchers5,16,17,18,19,20 exploring various testing methods and empirical equations to calculate resistance coefficients for vegetated reaches. However, given the limitations of traditional methods under complex natural conditions, recent studies have turned to machine learning (ML) techniques to improve the accuracy and applicability of resistance coefficient predictions for vegetated systems, especially in water resources management and ecological engineering21,22,23,24. These prediction methods provide crucial technical support for the study of vegetation resistance, as they can learn robust nonlinear relationships between inputs and outputs without requiring prior knowledge of the underlying phenomena.

Recent studies have applied various ML methods to predict ecological and hydrodynamic variables. For instance, Tian et al.25 used Random Forest (RF), Support Vector Machine (SVM), and Back Propagation Neural Network (BPNN) to predict groundwater sulphate content from 501 samples. Similarly, Deng et al.26 combined Artificial Neural Networks (ANN) and SVM with hybrid algorithms to model algae growth and eutrophication over a thirty-year period. Parisouj et al.27 employed Support Vector Regression (SVR), ANN-BP, and Extreme Learning Machine (ELM) for flow prediction in four rivers. Additionally, Kumar et al.28 explored six ML methods for predicting flow velocity in vegetated alluvial channels29, later extending this work with both standalone and hybrid ML techniques. Focusing on channel resistance, Roushangar and Shahnazi30 evaluated the sensitivity and robustness of three ML methods, including SVM, for predicting Manning's roughness coefficient in channels with ripple bedforms. Similarly, Mir and Patel24 applied four ML techniques to predict Manning's roughness coefficient in alluvial channels with bedforms, demonstrating the effectiveness of these approaches in estimating flow resistance. Beyond ML, Liu et al.31 used genetic programming to predict the drag coefficients of non-submerged vegetation, a method also applied to estimate the drag coefficients of flexible vegetation in wave-driven flows11. Together, these studies provide valuable insights into the complexities of flow-vegetation interactions.

This study utilizes six standalone ML methods and three deep learning (DL) techniques, enhanced by three intelligent optimization algorithms. The performance of these models is compared against three widely used empirical equations to assess their predictive accuracy. The paper is structured as follows: Section "Methods" details the ML and DL approaches employed, along with the optimization algorithms and evaluation metrics. Section "Collected database" provides an overview of the data sources, preprocessing steps, and the dataset design. In Section "Results and discussion", the results are analyzed, focusing on the impact of parameter omission and the application of combined algorithms in flow resistance prediction. Finally, Section "Conclusion" presents the conclusions of the study.

Methods

Dimensional analysis and functional equation

In ML, a physical process is typically represented by constructing a mathematical model, formalized as \(y = f(x)\), where y denotes the dependent variable, x the independent variables, and f the mapping from input to output embodied by the ML model itself. More sophisticated models, such as neural networks, employ a series of layers coupled with nonlinear activation functions to capture more intricate, nonlinear relationships between inputs and outputs. The determination of the vegetation drag coefficient, Cd, can be approached with the same methodology. Drag force data can be obtained directly from pre-configured force measurement setups32,33,34, and, drawing on classical theories of flow around a cylinder, such analyses provide a comprehensive understanding of the forces involved.

$$F_{d} = \frac{1}{2}C_{d} md\rho U_{p}^{2}$$
(1)

where \(C_{d}\) represents the drag coefficient for vegetation, \(m\) denotes the quantity of vegetation per unit area, \(d\) is the diameter of the vegetation, \(\rho\) is the density of water, and \(U_{p}\) is the reference flow velocity, typically referring to the pore water velocity within the vegetative flow.

In a flume experiment using cylindrical dowels as a model, Nepf35 found that the drag coefficient of vegetation clusters is related to the Reynolds number, \(Re_{d}\), and the spacing between the vegetation, \(\Delta S\). Building on this foundation, Stone and Shen36 found that the drag coefficient Cd varies with vegetation Reynolds number, vegetation density, and vegetation diameter. Subsequent researchers analyzed the quantitative relationship between the drag coefficient, vegetation, and flow parameters, proposing an equation to calculate the drag coefficient37.

$$\frac{1}{2}C_{d} ah_{w} U_{p}^{2} = gh_{w} (1 - \lambda )\frac{\partial h}{{\partial x}}$$
(2)

where a represents the frontal area of vegetation per unit volume, \(h_{w}\) denotes the flow depth, \(\lambda\) is the vegetation density, and \(\partial h/\partial x\) represents the energy slope. The factors affecting flow resistance can thus be divided into two categories: the resistance of the flow itself, and the resistance introduced by the presence of vegetation, which comprises form drag and friction drag.

$$f_{1} (C_{d} ,d,U_{p} ,\rho ,h_{w} ,h_{v} ,K,g,\mu ,L_{x} ,L_{y} ,\lambda ) = 0$$
(3)

where \(h_{v}\) represents the height of the vegetation, K denotes the stiffness, \(\mu\) is the dynamic viscosity, Lx is the spacing between plants in the streamwise direction, and Ly is the spacing between plants in the cross-stream direction, across the channel width. Nondimensionalizing Eq. 3 in terms of these parameters yields,

$$f_{2} \left( C_{d} ,\frac{h_{v}}{h_{w}},\frac{\rho U_{p}^{2} h_{v}^{4}}{K},\frac{U_{p}^{2}}{gh_{v}},\frac{\rho U_{p} d}{\mu },\frac{d}{L_{x}},\frac{d}{L_{y}},\lambda \right) = 0$$
(4)

For rigid vegetation the stiffness is effectively infinite (\(K \approx \infty\)), so the third term of Eq. 4 vanishes,

$$\frac{{\rho U_{p}^{2} h_{v}^{4} }}{K} \approx 0$$
(5)

The blockage ratio \(\beta\), often used in fluid dynamics and environmental studies to describe the fraction of the channel obstructed by objects such as vegetation, is defined here as \(\beta = d/L_{y}\). Dropping the negligible stiffness term and grouping the remaining dimensionless quantities gives the functional form for Cd,

$$C_{d} \sim f(\lambda ,\alpha ,\beta ,F_{r} ,Re_{p} )$$
(6)

where \(\lambda\) is the volume fraction occupied by vegetation, \(\alpha\) is the degree of submergence (\(h_{v} /h_{w}\)), \(\beta\) is the blockage ratio, Fr is the Froude number, and Rep represents the vegetation Reynolds number.
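For concreteness, the sketch below (Python, with illustrative variable names and standard water properties) shows how the dimensionless groups of Eq. 6 can be assembled from raw flume measurements, taking \(\alpha = h_{v}/h_{w}\) as implied by the second term of Eq. 4.

```python
# Minimal sketch: forming the dimensionless groups of Eq. 6 from raw
# measurements (SI units). Variable names are illustrative only.
import math

def dimensionless_groups(d, h_v, h_w, L_y, U_p, lam,
                         rho=1000.0, mu=1.0e-3, g=9.81):
    """Return (lambda, alpha, beta, Fr, Re_p) for one flow case."""
    alpha = h_v / h_w                 # degree of submergence
    beta = d / L_y                    # blockage ratio
    Fr = U_p / math.sqrt(g * h_v)     # Froude number, from U_p^2 / (g h_v)
    Re_p = rho * U_p * d / mu         # vegetation (stem) Reynolds number
    return lam, alpha, beta, Fr, Re_p

# Example: 6 mm stems, 0.15 m tall, in 0.30 m of flow at U_p = 0.2 m/s
print(dimensionless_groups(d=0.006, h_v=0.15, h_w=0.30, L_y=0.05,
                           U_p=0.2, lam=0.01))
```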

By using ML and DL methods, input–output relationships are established, as shown in Eq. 6. These methods are based on real-world data, with the goal of building a model that captures underlying patterns rather than fitting the data exactly. The model’s framework and key methodologies are illustrated in Fig. 2. This model allows for predicting future responses to new inputs, going beyond replicating observed data to forecast unseen scenarios.

Fig. 2
figure 2

Research framework and methodology.

Machine learning algorithms

In this study, the ML and DL models are organized into three distinct groups: non-ensemble ML models (SVM, ELM, RR), ensemble ML models (RF, XGBoost, LSBoost), and DL models (BP, GRU, LSTM). MATLAB was employed for all modeling efforts, ensuring consistency across the nine models by using identical input and output parameters for training.

Non-integrated machine learning models

This research examines three non-ensemble ML models, SVM, ELM, and ridge regression (RR), for predicting the resistance coefficient of vegetated flow. As highlighted by Deng and Liu38, SVM is effective for both classification and regression tasks. Its strength lies in its optimization capabilities through various kernel functions, which allow it to excel at nonlinear problems; however, appropriate kernel selection and parameter tuning are crucial to prevent overfitting or underfitting. ELM, known for its rapid training, was used in a recent study to predict water levels in the Hei River39. The model suits large datasets, setting hidden-layer parameters randomly and determining output weights analytically. RR uses L2 regularization to manage highly correlated variables, preventing overfitting and making it well suited to data with multicollinearity; this yields reliable predictions, which is particularly useful in studies of complex hydrodynamic interactions with vegetation.

Integrated machine learning models

In modern ML, RF, introduced by Breiman40, is a prominent ensemble method that uses multiple decision trees to boost prediction accuracy. Each tree is trained with an independently sampled random vector (for example, a bootstrap sample of the data), enhancing diversity, and the trees' predictions are combined by majority vote or, for regression, by averaging. RF handles large, high-dimensional datasets well and is robust to outliers, although it can still overfit in very noisy scenarios. Boosting, another powerful ensemble strategy, improves the accuracy of weak learners through iterative training focused on previously poorly predicted instances. XGBoost, developed by Chen and Guestrin41 and used in various engineering applications42,43,44, is a gradient boosting model renowned for its speed and efficiency on large datasets. XGBoost manages overfitting well and delivers high predictive accuracy, although its performance is sensitive to parameter settings. LSBoost applies gradient boosting with a least-squares loss and is well suited to datasets with prominent linear or smooth trends, though its effectiveness can decline on highly complex nonlinear data.
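As a hedged illustration of the gradient-boosting setup described above, the following Python sketch configures an XGBoost regressor for Cd prediction on synthetic stand-in data; the study's own models were built in MATLAB, and the hyperparameter values here are placeholders rather than the tuned settings reported in Table 2.

```python
# Illustrative XGBoost regression for C_d prediction on stand-in data.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(42)
X = rng.random((800, 5))   # stand-in features: [lambda, alpha, beta, Fr, Re_p]
y = rng.random(800)        # stand-in target: measured C_d

model = xgb.XGBRegressor(
    n_estimators=500,      # number of boosted trees
    max_depth=4,           # tree depth controls model complexity
    learning_rate=0.05,    # shrinkage applied to each tree's contribution
    reg_lambda=1.0,        # L2 regularization guards against overfitting
)
model.fit(X, y)
cd_pred = model.predict(X)
```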

Deep learning algorithm

Backpropagation (BP) is an essential algorithm for training neural networks, widely used to update weights by computing the gradient of the loss function to minimize error. Often paired with gradient descent, BP efficiently trains complex multilayer feedforward networks and shows robust performance across applications. The Gated Recurrent Unit (GRU) simplifies the Long Short-Term Memory (LSTM) architecture by combining the forget and input gates into a single “update gate” and merging the cell state with the hidden state, improving learning efficiency for sequence prediction tasks45,46. LSTM networks use gating mechanisms, input, forget, and output gates, to capture long-term dependencies in sequence data, a known weakness of traditional RNNs, making them effective for applications that involve relationships over extended periods, such as language modeling and time series analysis.

Optimised learning algorithms

PSO algorithm

Particle Swarm Optimization (PSO) is a collective intelligence-based optimization technique introduced by Eberhart and Kennedy47. This algorithm simulates the social behavior of bird flocks, where each “particle” represents a candidate solution in the solution space. Particles adjust their positions and velocities by tracking the best positions found individually and by the group as a whole. The standard form of PSO is,

$$v_{i} = \omega \times v_{i} + c_{1} \times {\text{rand}}() \times \left( {\text{pbest}}_{i} - x_{i} \right) + c_{2} \times {\text{rand}}() \times \left( {\text{gbest}} - x_{i} \right)$$
(7)

where \(\omega\) is the inertia factor, \(v_{i}\) is the particle's velocity, rand() is a random number between 0 and 1, and c1 and c2 are learning factors; after each velocity update, the particle's position is advanced as \(x_{i} = x_{i} + v_{i}\). PSO is particularly well suited to continuous optimization problems and has been widely applied in engineering design, neural network training, and other scientific research fields.
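The following self-contained Python sketch implements Eq. 7 together with the position step on a toy objective; the swarm size and coefficient values are conventional defaults, not the settings used in this study.

```python
# Compact PSO: velocity update of Eq. 7 plus the position step x <- x + v,
# minimizing a toy sphere function.
import numpy as np

def pso(objective, dim, n_particles=30, iters=200,
        omega=0.7, c1=1.5, c2=1.5, bounds=(-5.0, 5.0)):
    rng = np.random.default_rng(0)
    lo, hi = bounds
    x = rng.uniform(lo, hi, (n_particles, dim))   # positions
    v = np.zeros_like(x)                          # velocities
    pbest = x.copy()
    pbest_val = np.apply_along_axis(objective, 1, x)
    gbest = pbest[pbest_val.argmin()].copy()

    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, dim))
        v = omega * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)  # Eq. 7
        x = np.clip(x + v, lo, hi)                # position update
        val = np.apply_along_axis(objective, 1, x)
        improved = val < pbest_val
        pbest[improved], pbest_val[improved] = x[improved], val[improved]
        gbest = pbest[pbest_val.argmin()].copy()
    return gbest, pbest_val.min()

best_x, best_f = pso(lambda z: np.sum(z**2), dim=3)   # toy objective
```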

WSO algorithm

White Shark Optimizer (WSO) is a novel optimization algorithm inspired by the hunting behavior of great white sharks48. The algorithm simulates the social behavior of sharks, a leader-follower structure, position updating, leaping behavior, and fitness evaluation. Through these mechanisms, WSO can effectively search for optimal or near-optimal solutions within the solution space.

$$w_{k + 1}^{i} = \frac{{w_{k}^{i} + w_{k + 1}^{i} }}{2 \times R}$$
(8)

where \(w_{k}^{i}\) and \(w_{k + 1}^{i}\) are the positions of the two best solutions retained, which are used to update the positions of the other sharks in the swarm, and R is a random number between 0 and 1.

RIME algorithm

The RIME algorithm49 simulates the growth of “soft-rime” and “hard-rime” ice to develop a soft-rime search strategy and a hard-rime puncturing mechanism, enhancing both exploration and exploitation in optimization. When tested against ten established algorithms and ten recently improved ones, RIME demonstrated superior performance. The key position-update step is,

$$R_{ij}^{{\text{new}}} = R_{best,j} + r_{1} \cdot \cos \theta \cdot \varepsilon \cdot \left( C \cdot \left( Ub_{ij} - Lb_{ij} \right) + Lb_{ij} \right),\quad r_{2} < E$$
(9)

where \(R_{ij}^{{\text{new}}}\) is the updated particle position, \(R_{best,j}\) is the \(j\)-th dimension of the best rime agent in the rime population, \(r_{1}\) and \(r_{2}\) are random numbers, \(\cos \theta\) sets the direction of particle movement, \(\varepsilon\) is the environmental factor, C is the degree of adhesion, and E is the attachment coefficient.

Empirical equations for \(C_{d}\) and model evaluation indicators

Researchers have studied the estimation of vegetation resistance coefficients from various perspectives. We have selected the most widely used calculation equations for this purpose. Cheng and Nguyen50 introduced an empirical best-fit function to characterize the relationship between \(C_{d}\) and \(Re_{v}\), derived from existing experimental data on \(C_{d}\).

$$C_{d} = \frac{50}{{Re_{v}^{0.43} }} + 0.7\left[ {1 - \exp \left( { - \frac{{Re_{v} }}{15000}} \right)} \right]$$
(10)

where \(Re_{v}\) is the vegetation Reynolds number, defined via Cheng and Nguyen's vegetation-related hydraulic radius as \(Re_{v} = \frac{\pi (1 - \lambda )}{4\lambda }Re_{p}\).

Sonnenwald et al.51 redefined the drag-law coefficients \(\alpha_{0}\) and \(\alpha_{1}\) in terms of \(\lambda\) and \(d\) through least-squares curve fitting of available experimental data, leading to a function for estimating \(C_{d}\).

$$C_{d} = \frac{12950d + 64}{{Re_{p} }} + 34d + 6.4\lambda + 1$$
(11)

Kothyari et al.52 developed a \(C_{d}\) prediction model through multi-parameter regression analysis based on their experimental data. The model is expressed as,

$$C_{d} = 1.53[1 + 0.45\ln (1 + 100\lambda )]Re_{p}^{ - 3/50}$$
(12)
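For reference, the three empirical formulas can be coded directly as follows (Python; \(Re_{v}\) follows the hydraulic-radius definition given after Eq. 10, and d is in metres as in Eq. 11).

```python
# The three empirical C_d formulas (Eqs. 10-12) coded directly.
import math

def cd_cheng_nguyen(re_p, lam):                      # Eq. 10
    re_v = math.pi * (1.0 - lam) / (4.0 * lam) * re_p
    return 50.0 / re_v**0.43 + 0.7 * (1.0 - math.exp(-re_v / 15000.0))

def cd_sonnenwald(re_p, lam, d):                     # Eq. 11, d in metres
    return (12950.0 * d + 64.0) / re_p + 34.0 * d + 6.4 * lam + 1.0

def cd_kothyari(re_p, lam):                          # Eq. 12
    return 1.53 * (1.0 + 0.45 * math.log(1.0 + 100.0 * lam)) * re_p**(-3.0 / 50.0)
```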

To assess and compare the performance of the ML and DL models developed in this study, five key metrics were employed: the mean absolute error (MAE), root mean square error (RMSE), index of agreement (IA), coefficient of determination (R2), and Kling-Gupta efficiency (KGE)53,54,55.

$${\text{MAE}} = \frac{1}{n}\sum\limits_{i = 1}^{n} {\left| {y_{{{\text{pred}},i}} - y_{{{\text{true}},i}} } \right|}$$
(13)
$${\text{RMSE}} = \sqrt {\frac{1}{n}\sum\limits_{i = 1}^{n} {(y_{{{\text{pred}},i}} - y_{{{\text{true}},i}} )^{2} } }$$
(14)
$${\text{IA}} = 1 - \frac{{\sum\nolimits_{i = 1}^{n} {(y_{{{\text{pred}},i}} - y_{{{\text{true}},i}} )^{2} } }}{{\sum\nolimits_{i = 1}^{n} {(|y_{{{\text{true}},i}} - \overline{y}_{{{\text{true}}}} | + |y_{{{\text{pred}},i}} - \overline{y}_{{{\text{pred}}}} |)^{2} } }}$$
(15)
$${\text{KGE}} = 1 - \sqrt {\left( {R - 1} \right)^{2} + \left( {\xi - 1} \right)^{2} + \left( {\gamma - 1} \right)^{2} }$$
(16)
$${\text{R}}^{2} = 1 - \frac{{\sum\nolimits_{i = 1}^{n} {(y_{{{\text{true}},i}} - y_{{{\text{pred}},i}} )^{2} } }}{{\sum\nolimits_{i = 1}^{n} {(y_{{{\text{true}},i}} - \overline{y}_{{{\text{true}}}} )^{2} } }}$$
(17)

where \(y_{{{\text{true}},i}}\) is the observed value, \(y_{{{\text{pred}},i}}\) is the predicted value, \(\overline{y}_{{{\text{true}}}}\) is the mean of the observed values, \(\overline{y}_{{{\text{pred}}}}\) is the mean of the predicted values, n is the total number of data points, R is the correlation coefficient between observed and predicted values, \(\xi\) is the ratio of the standard deviation of the predictions to that of the observations, and \(\gamma\) is the corresponding ratio of the means.
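The five metrics translate directly into code; the sketch below follows Eqs. 13 to 17 as printed, with the KGE ratios taken in the conventional predicted-over-observed direction.

```python
# NumPy translations of Eqs. 13-17.
import numpy as np

def metrics(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    err = y_pred - y_true
    mae = np.abs(err).mean()                                        # Eq. 13
    rmse = np.sqrt((err**2).mean())                                 # Eq. 14
    ia = 1.0 - (err**2).sum() / (
        (np.abs(y_true - y_true.mean())
         + np.abs(y_pred - y_pred.mean()))**2).sum()                # Eq. 15
    r = np.corrcoef(y_true, y_pred)[0, 1]
    xi = y_pred.std() / y_true.std()       # ratio of standard deviations
    gamma = y_pred.mean() / y_true.mean()  # ratio of means
    kge = 1.0 - np.sqrt((r - 1)**2 + (xi - 1)**2 + (gamma - 1)**2)  # Eq. 16
    r2 = 1.0 - ((y_true - y_pred)**2).sum() / (
        (y_true - y_true.mean())**2).sum()                          # Eq. 17
    return dict(MAE=mae, RMSE=rmse, IA=ia, KGE=kge, R2=r2)
```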

In addition, Tables 1 and 2 summarize the datasets collected for this study and the hyperparameter settings of the ML models. They cover data from 20 research teams and, for the key models, the optimized values of settings such as maximum tree depth and learning rate. These configurations serve as guidelines for setting hyperparameters in similar prediction tasks.

Table 1 Summary of experimental data.
Table 2 Summary of hyperparameter data.

Collected database

The preprocessing of raw data is a critical step in developing ML models, as it enhances the efficiency and accuracy of predictive operations. Preprocessing involves identifying and handling missing values, eliminating duplicates and outliers, and standardizing the dataset. In this study, outliers are detected by their deviation from the mean using the Z-score method: the larger the absolute Z-score, the further a point lies from the mean, making it an effective measure for flagging anomalous data. The score itself can be computed for any distribution, although interpreting a fixed threshold is most meaningful for approximately normal data. The red hollow circles in Fig. 3 highlight the detected outliers.

$$Z = \frac{x - \mu }{\sigma }$$
(18)

where x is a sample of the data set, \(\mu\) is the sample mean, and \(\sigma\) is the standard deviation.
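A minimal implementation of this screening step is shown below; the |Z| > 3 cut-off is a common convention and an assumption here, as the text does not state the threshold used.

```python
# Z-score outlier screening as in Eq. 18.
import numpy as np

def zscore_outliers(x, threshold=3.0):
    x = np.asarray(x, float)
    z = (x - x.mean()) / x.std()       # Eq. 18
    return np.abs(z) > threshold       # boolean mask of flagged samples

data = np.r_[np.full(10, 1.0), 8.5]    # ten typical values plus one outlier
print(zscore_outliers(data))           # flags only the final value
```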

Fig. 3
figure 3

Outlier data detection.

Detailed data distribution is shown in Table 1.

To better understand the distribution characteristics of each parameter and inform subsequent modeling and analysis, Fig. 4 presents the frequency distributions of six key parameters: the inputs \(\lambda\), \(\alpha\), \(\beta\), \(Re_{p}\), and \(F_{r}\), and the output \(C_{d}\). The red curves represent the fitted distributions and reveal the skewness of each parameter. Overall, all parameters exhibit some degree of skewness, each with a distinct pattern. Among the inputs, \(\lambda\) is clearly skewed, with most of the data concentrated in the lower range, whereas \(\alpha\) is spread relatively evenly across its entire range, showing no significant concentration.

Fig. 4
figure 4

Histograms of input and output parameters.

The data were randomly partitioned into three subsets: a training dataset (60%), a validation dataset (20%), and a test dataset (20%). The random division is illustrated in Fig. 5.
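A sketch of this 60/20/20 random partition (the seed is arbitrary, for reproducibility only):

```python
# Random 60/20/20 split of sample indices.
import numpy as np

def split_indices(n, seed=0):
    idx = np.random.default_rng(seed).permutation(n)
    n_train, n_val = int(0.6 * n), int(0.2 * n)
    return (idx[:n_train],                      # 60% training
            idx[n_train:n_train + n_val],       # 20% validation
            idx[n_train + n_val:])              # 20% test

train_i, val_i, test_i = split_indices(1000)
```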

Fig. 5
figure 5

Comparison of the impact of variables on Cd in the different datasets.

Results and discussion

Exploring inter-feature correlations and distributions

In this study, the relationships between the predictive model parameters were analyzed using Pearson's correlation coefficient to evaluate their interdependence. The coefficient ranges from -1 to 1, where 0 indicates no linear correlation and -1 or +1 represents a perfect negative or positive linear relationship, respectively. Five parameters were identified in Section "Dimensional analysis and functional equation", and Fig. 6 shows their Pearson correlation matrix. The analysis revealed no significant pairwise correlations except between λ and β: the blockage ratio β exhibits a strong negative correlation with the density λ. This is expected, as both quantities are tied to the vegetation arrangement pattern, implying a degree of collinearity between λ and β.
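Such a correlation matrix is straightforward to reproduce; the sketch below uses pandas on stand-in data purely for illustration, as the study's analysis was performed in MATLAB.

```python
# Pearson correlation matrix of the five input features, as in Fig. 6.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.random((500, 5)),
                  columns=["lambda", "alpha", "beta", "Fr", "Re_p"])
corr = df.corr(method="pearson")   # entries lie in [-1, 1]
print(corr.round(2))
```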

Fig. 6
figure 6

Variable distribution and Pearson’s correlation coefficient analysis.

The boxplots of the six normalized parameters in Fig. 7 reveal distinct distribution patterns. \(\lambda\) and \(C_{d}\) show relatively low variability, with standardized values concentrated within a narrow range. By contrast, \(\alpha\) and \(F_{r}\) exhibit substantial variability, with wider interquartile ranges and clear outliers, suggesting these parameters are more sensitive to external factors or experimental conditions. \(Re_{p}\) and \(\beta\) show moderate variability: some fluctuation, but without the extremes observed in \(\alpha\) and \(F_{r}\), implying they are more stable while still exhibiting some inconsistency across the data.

Fig. 7
figure 7

Boxplot of normalized parameters.

ML predictive performance evaluation

ML has been widely applied to predict environmental quality in numerous studies; however, its use in estimating the drag coefficient (Cd) remains underexplored. This study employs ML techniques to predict Cd under both submerged and non-submerged vegetation conditions, with performance evaluated using metrics such as MAE, RMSE, IA and KGE across nine ML models (Fig. 8). Among these, XGBoost achieves the lowest MAE of 0.0802. In contrast, models such as SVM, ELM, and RR exhibit greater variability and significant deviations from actual values. This is reflected in their lower R2 values in scatter plots, with SVM, for instance, achieving an R2 of 0.7083, indicating limited predictive power and accuracy. On the other hand, XGBoost and LSBoost display superior performance, with XGBoost achieving the highest R2 of 0.9552, indicating a robust linear relationship between predicted and actual values. While LSTM and GRU demonstrate commendable predictive capabilities, they still fall short compared to XGBoost. Based on these findings, XGBoost is identified as the most effective model for predicting Cd.

Fig. 8
figure 8

Comparison of MAE, RMSE, IA and KGE for various ML models.

XGBoost demonstrates exceptional performance in predicting the drag coefficient Cd, significantly outperforming traditional models such as SVM, ELM, and RR. These traditional models exhibit considerable variability and relatively lower predictive accuracy, with SVM achieving an R2 of 0.7083, as illustrated in Fig. 9. In contrast, XGBoost stands out among the ensemble methods, achieving the highest R2 of 0.9552 and thus a very strong linear relationship between predicted and observed values, as shown in Fig. 10. Although DL models such as LSTM and GRU (Fig. 11) perform satisfactorily, with R2 values exceeding 0.8, they still fall short of the precision and robustness of XGBoost. The superior predictive performance of XGBoost can be attributed to its advanced ensemble learning framework, which combines multiple weak learners to improve accuracy, alongside effective regularization that prevents overfitting. These features allow XGBoost to balance model complexity against prediction stability, making it particularly well suited to estimating flow resistance in vegetated channels, as evidenced by its consistent and highly accurate results across all evaluation metrics.

Fig. 9
figure 9

Predicted and measured based on non-ensemble ML models (a line-graph and scatter plot of predicted vs. measured based on SVM, b line-graph and scatter plot of predicted vs. measured based on ELM, c line-graph and scatter plot of predicted vs. measured based on RR).

Fig. 10
figure 10

Predicted and measured based on ensemble ML models (a line-graph and scatter plot of predicted vs. measured based on RF, b line-graph and scatter plot of predicted vs. measured based on XGBoost, c line-graph and scatter plot of predicted vs. measured based on LSBoost).

Fig. 11
figure 11

Predicted and measured based on DL models (a line-graph and scatter plot of predicted vs. measured based on BP, b line-graph and scatter plot of predicted vs. measured based on GRU, c line-graph and scatter plot of predicted vs. measured based on LSTM).

For comparison, we include results from the widely used empirical equations for Cd given in Eqs. 10 to 12. To maintain a consistent basis of comparison with the ML models, these equations were evaluated against the observed values, as depicted in Fig. 12. Notably, the R2 values for all three empirical equations were below 0.6, highlighting their limitations when applied to larger datasets; the line and scatter plots of predicted versus measured Cd show substantial deviations between predicted and actual values. In contrast, the ML-based Cd prediction method addresses these issues by learning complex patterns in the data, producing more accurate and robust predictions across a broader range of datasets. In conclusion, while empirical equations may suffice for limited datasets or simple cases, they struggle with more complex data; the ML-based approach fills this gap, offering improved predictive performance over a wider range of conditions.

Fig. 12
figure 12

Predicted and measured based on empirical equation (a line-graph and scatter plot of predicted vs. measured based on Eq. 10, b line-graph and scatter plot of predicted vs. measured based on Eq. 11, c line-graph and scatter plot of predicted vs. measured based on Eq. 12).

Model accuracy analysis based on optimised learning algorithms

To enhance predictive performance by integrating ML with statistical and optimization methods, this study first nominates XGBoost as the best-performing model on account of its robust generalizability and accuracy. Recognizing that model selection must adapt to distinct usage scenarios, the study then employs three optimization algorithms, PSO, WSO, and RIME, to refine the predictive accuracy of each evaluated ML model. These enhancements, aimed at aligning the models more closely with specific predictive requirements, are documented in Table 3, which illustrates the tailored adjustments and resulting improvements.

Table 3 Simulation accuracy based on optimised learning algorithms.

The PSO-XGBoost model demonstrated superior performance across all metrics, achieving the lowest MAE of 0.076, the lowest RMSE of 0.117, and the highest R2 of 0.959. These results highlight the model's ability to produce highly accurate predictions with minimal deviation from observed values. The superiority of PSO-XGBoost can be attributed to its advanced ensemble learning framework and robust regularization, which allow it to capture complex, nonlinear relationships within the data. In contrast, although models such as SVM, ELM, and RR improved after optimization, their relatively simple architectures limited their capacity to exploit the intricate patterns in the dataset, yielding only modest gains.
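One plausible way to couple PSO with XGBoost is sketched below, under the assumption that particles encode the learning rate and tree depth and that fitness is the validation RMSE; the authors' exact coupling and search space are not specified. It reuses the pso() helper sketched in the PSO section.

```python
# Hypothetical PSO-XGBoost coupling: PSO searches hyperparameters,
# fitness = validation RMSE of the resulting booster.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(7)
X, y = rng.random((600, 5)), rng.random(600)       # stand-in data
X_tr, y_tr, X_va, y_va = X[:480], y[:480], X[480:], y[480:]

def fitness(p):
    lr = float(np.clip(p[0], 0.01, 0.3))           # decode learning rate
    depth = int(np.clip(round(p[1]), 2, 8))        # decode tree depth
    m = xgb.XGBRegressor(n_estimators=200, learning_rate=lr,
                         max_depth=depth, verbosity=0)
    m.fit(X_tr, y_tr)
    return float(np.sqrt(((m.predict(X_va) - y_va) ** 2).mean()))

best_params, best_rmse = pso(fitness, dim=2, n_particles=10,
                             iters=20, bounds=(0.01, 8.0))
```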

While DL models, including LSTM and GRU, improved notably after optimization, they still fell short of XGBoost in predictive accuracy. The RIME-BP model, for instance, achieved an R2 of 0.948 and a KGE of 0.946, reflecting the potential of advanced optimization techniques to enhance both traditional and DL models. Nevertheless, despite these gains, the DL models could not match the precision and stability of XGBoost, particularly in MAE and RMSE. This analysis underscores the continued strength of ensemble methods like XGBoost in complex, real-world predictive tasks, particularly flow resistance estimation in vegetated channels.

Discussion

Model prediction based on missing parameters

To enhance the understanding of variable interdependencies in Cd prediction, Pearson correlation coefficients were calculated between Cd and various parameters, revealing a spectrum of linear associations. The parameter α exhibited the strongest positive correlation, marked by a coefficient of 0.81, denoting a significant linear relationship. Fr followed with a coefficient of 0.28, indicating a moderate positive correlation. Notably, Rep displayed a negative correlation of -0.20, underscoring an inverse relationship with Cd, while λ had a modest positive correlation of 0.17. The parameter β showed an almost negligible correlation of 0.03, implying minimal linear association with Cd. These results highlight the multifaceted factors affecting the drag coefficient and emphasize the necessity of considering diverse parameters for precise fluid dynamics modeling. Considering the significant correlations with parameters α (indicative of vegetation characteristics) and Fr (reflective of hydraulic conditions), it is imperative to assess the predictive capabilities of our chosen models when one or both of these influential parameters are absent. This examination will illuminate the models’ resilience in the face of incomplete data and delineate the extent to which model efficacy relies on these critical inputs. Detailed scenarios involving missing parameter values and the corresponding impacts on model performance are presented in Table 4.

Table 4 Parameter combination patterns and missing values.

From the data presented in Fig. 13, the correlation coefficients vary between 0.855 and 0.913, demonstrating that XGBoost maintains high predictive accuracy even with missing values. Although λ and β are key factors in the decline in accuracy, the model's predictive performance remains closely matched whether λ or β alone is missing. While the absence of Rep alone has little impact on performance, removing λ and β simultaneously yields the poorest predictive accuracy.

Fig. 13
figure 13

Predicted vs. measured based on different combinations.

Simulation algorithms based on different combinations

In this study, the integration of different ML models, including SVM-RF, BP-RF, and LSTM-XGBoost, is investigated to enhance predictive accuracy and performance. This fusion-based approach, as illustrated in Fig. 14, leverages the complementary strengths of individual models, enabling the extraction of intricate data patterns that single models might overlook. By combining the unique capabilities of each model, this strategy aims to improve robustness, minimize prediction errors, and better address the complexities inherent in the underlying data. Such methodological integration is well-established in model fusion theory, offering a promising pathway for enhanced predictive power in complex systems.

Fig. 14
figure 14

Predicted vs. measured based on combined ML.

The study revealed that combined models also possess strong predictive performance, particularly the LSTM-XGBoost hybrid, which achieved a prediction accuracy (R2) of 0.9548. This combination harnesses the LSTM's strength in capturing dependencies in sequential data alongside XGBoost's efficiency in handling complex nonlinear patterns. By integrating these approaches, features within the data are extracted and utilized more effectively, yielding a significant performance gain in prediction tasks. This synergistic approach underscores the potential of model fusion to leverage the distinct advantages of different algorithms across diverse data analysis challenges.
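Since the paper does not detail the exact fusion mechanism, the sketch below shows one plausible stacking scheme: each sample is treated as a length-1 sequence for the LSTM, whose prediction is then appended as an extra feature for XGBoost.

```python
# Hypothetical LSTM-XGBoost fusion (one plausible scheme, not the authors'
# documented pipeline): LSTM output feeds XGBoost as an extra feature.
import numpy as np
import xgboost as xgb
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Input, LSTM, Dense

rng = np.random.default_rng(0)
X = rng.random((500, 5))            # stand-in for [lambda, alpha, beta, Fr, Re_p]
y = rng.random(500)                 # stand-in for measured C_d

# Stage 1: LSTM base learner on length-1 "sequences".
lstm = Sequential([Input(shape=(1, X.shape[1])), LSTM(32), Dense(1)])
lstm.compile(optimizer="adam", loss="mse")
lstm.fit(X[:, None, :], y, epochs=50, batch_size=32, verbose=0)
lstm_pred = lstm.predict(X[:, None, :], verbose=0).ravel()

# Stage 2: XGBoost trained on the original features plus the LSTM output.
X_aug = np.column_stack([X, lstm_pred])
booster = xgb.XGBRegressor(n_estimators=300, max_depth=4, learning_rate=0.05)
booster.fit(X_aug, y)
cd_hybrid = booster.predict(X_aug)
```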

Conclusion

This study presents a novel integrated framework for predicting the drag coefficient Cd, evaluating nine ML and DL models further enhanced by the PSO, WSO, and RIME optimization algorithms. The results demonstrate that the XGBoost model is the most robust, achieving an R2 of 0.9552. Notably, traditional ML methods also exhibit significant adaptability when combined with optimization algorithms: even the least accurate models, such as SVM, show substantial improvements, exceeding R2 values of 0.8 after optimization. The integration of hybrid approaches, particularly the LSTM-XGBoost fusion, further highlights the potential of model ensemble techniques to improve predictive accuracy for complex environmental systems. Another key feature of the framework is its resilience to incomplete datasets: even with critical parameters missing (such as \(\beta\), \(\lambda\), and \(Re_{p}\)), XGBoost maintains outstanding predictive performance. This study provides an efficient and reliable tool for predicting vegetation-induced flow resistance, advancing the understanding of aquatic vegetation hydrodynamics and offering scientific support for ecological restoration and watershed management. Future research may focus on using DL methods to predict resistance and flow velocity over longer time scales; the hybrid approach also shows significant potential for broader applications and could be leveraged to enhance generalization across different environmental contexts.