Oil and natural gas occupy a significant position in China’s energy strategy and are of paramount importance to the country’s national development. Pipeline transportation is one of the most prevalent modes of transporting oil and gas and has evolved into the fifth largest transportation industry in China. However, during pipeline operation, the surrounding environment and the transported medium cause pipeline corrosion, which in turn leads to various accidents1. Statistical data confirm that corrosion is the principal cause of pipeline failure: as operation time increases, pipeline strength gradually declines until failure occurs. Consequently, the corrosion rate is commonly employed as an evaluation index for pipeline corrosion2,3. Many factors affect the corrosion rates of pipelines, such as the pipeline medium components, temperature, flow rate, pH, and dissolved oxygen and CO2 contents. These factors interact with one another, forming an intricate corrosion system4. Consequently, the development of a multifactor, high-dimensional model for accurately predicting the corrosion rate of subsea pipelines is a focal point of current and future research.

Advances in computer science have led researchers worldwide to conduct extensive studies on predicting pipeline corrosion rates through machine learning5,6,7,8,9,10,11,12,13,14,15. Jin et al. applied buffer operator theory to develop an enhanced DGM(1,1) model for forecasting pipeline corrosion rates over time, which significantly outperforms the conventional DGM(1,1) model in terms of predictive accuracy16. Biezma et al. proposed a fuzzy logic method to predict and analyze the corrosion rates of pipelines, considering six influencing factors; this approach improves both the accuracy and stability of the predictions17. Zhang et al. employed a distinctive BP neural network model for predicting pipeline corrosion rates, obtaining results that align more closely with measured values and effectively illustrating the correlation between various factors and the corrosion rate18. Nagoor et al. employed an ANN model to predict the service life of a crude oil pipeline, achieving a prediction accuracy of 99.97%19. Bo et al. predicted the corrosion rates of pipelines via the PSO-MGM(1,1) model, whose prediction accuracy was 16% higher than that of the MGM(1,1) model20. Xiao et al. used the WOA-BP algorithm to predict the corrosion rates of subsea pipelines, and the average absolute percentage error of their predictions was 3.689%, much lower than that of the comparison model1. Jia et al. used kernel principal component analysis to determine the corrosion rates of subsea pipelines and related factors and established a KPCA-SVM model21; this method reduces the interference of low-correlation data, improves the prediction accuracy and reduces the prediction difficulty. Nagoor et al. employed a Bayesian regularization-based neural network framework to predict dry gas pipeline lifespan with high accuracy, even when handling datasets containing missing parameters22. Xiao et al. predicted the corrosion rates of subsea pipelines via a combined PSO-TSO-BPNN model with an average absolute percentage error of 1.8441%4, representing a significant improvement in both the accuracy and stability of the model. The modelling methods proposed by these scholars each have unique advantages but are constrained by the limitations of their optimization algorithms and neural networks, which may prevent them from accurately predicting pipeline corrosion rates for multifactorial, high-dimensional problems.

This paper presents a hybrid model, KPCA-NGO-LSSVM, for predicting the corrosion rates of subsea pipelines, utilizing kernel principal component analysis (KPCA) and Northern Goshawk optimization (NGO) to increase the performance of the least squares support vector machine (LSSVM). KPCA is employed to reduce the dimensionality of the data and determine the principal factors influencing the corrosion rates of subsea pipelines, thus reducing the complexity of data processing and increasing modelling efficiency. The penalty parameter γ and the kernel parameter \({\sigma ^2}\) are optimized via the NGO algorithm to increase the precision of the prediction model and address the challenges of inconsistent predictions and insufficient generalization capability. Through experimental validation and a comparison of the error metrics, the KPCA-NGO-LSSVM model is shown to outperform existing methods. Specifically, the mean absolute percentage error (MAPE) is reduced to less than 2%, and the root mean square error (RMSE) is significantly lower than those of conventional models. The KPCA-NGO-LSSVM model provides reliable technical support for accurately predicting subsea pipeline corrosion rates. This model provides a scientific basis for optimizing corrosion protection strategies, guiding pipeline maintenance decisions, and ensuring flow safety. Furthermore, the model has significant potential for extending the service life of subsea pipelines and reducing operational and maintenance costs.

Principles of the NGO algorithm and LSSVM modelling

Principles of the NGO algorithm

The Northern Goshawk optimization (NGO) algorithm was introduced in 2022 by Mohammad Dehghani and colleagues. The algorithm replicates the Northern Goshawk’s behaviour during hunting, focusing on prey recognition, attack, pursuit, and evasion. The Northern Goshawk optimization algorithm divides the hunting process into two phases: prey identification and attack (exploration phase) and chasing and escape (exploitation phase)23.

Initialization

In the Northern Goshawk optimization algorithm, the population can be represented by the following matrix:

$$X={\left[ {\begin{array}{*{20}{c}} {{X_1}} \\ \vdots \\ {{X_i}} \\ \vdots \\ {{X_N}} \end{array}} \right]_{N \times m}}={\left[ {\begin{array}{*{20}{c}} {{x_{1,1}}}& \cdots &{{x_{1,m}}} \\ \vdots & \ddots & \vdots \\ {{x_{i,1}}}& \cdots &{{x_{i,m}}} \\ \vdots &{}& \vdots \\ {{x_{N,1}}}& \cdots &{{x_{N,m}}} \end{array}} \right]_{N \times m}}$$
(1)


X is the population matrix of Northern Goshawks; Xi denotes the position of the ith Northern Goshawk; \({x_{i,j}}\) indicates the position of the ith Northern Goshawk in the jth dimension; N is the population size; and m is the dimension of the optimization problem.

In the Northern Goshawk optimization algorithm, the objective function of the problem is utilized to compute the objective function value of each Northern Goshawk; the objective function value of the Northern Goshawk population can be represented as a vector of objective function values:

$$F={\left[ {\begin{array}{*{20}{c}} {{F_1}} \\ \vdots \\ {{F_i}} \\ \vdots \\ {{F_N}} \end{array}} \right]_{N \times 1}}={\left[ {\begin{array}{*{20}{c}} {F\left( {{X_1}} \right)} \\ \vdots \\ {F\left( {{X_i}} \right)} \\ \vdots \\ {F\left( {{X_N}} \right)} \end{array}} \right]_{N \times 1}}$$
(2)


where F is the objective function vector of the Northern Goshawk population and Fi is the objective function value of the ith Northern Goshawk.

Prey identification and attack (global search)

During the initial phase of hunting, the Northern Goshawk selects a prey item at random and attacks it quickly. Randomizing the selection of prey improves the NGO algorithm’s exploration capability, enabling a global search of the search space for the optimal region. The prey selection and attack behaviours in this phase are described by Eqs. (3)-(5):

$${P_i}={X_k},\quad i=1,2, \ldots ,N;\;k=1,2, \ldots ,i - 1,i+1, \ldots ,N$$
(3)
$$x_{i,j}^{new,P1}=\begin{cases} {x_{i,j}}+r\left( {{p_{i,j}} - I{x_{i,j}}} \right), & {F_{{P_i}}}<{F_i} \\ {x_{i,j}}+r\left( {{x_{i,j}} - {p_{i,j}}} \right), & {F_{{P_i}}} \geqslant {F_i} \end{cases}$$
(4)
$${X_i}=\begin{cases} X_i^{new,P1}, & F_i^{new,P1}<{F_i} \\ {X_i}, & F_i^{new,P1} \geqslant {F_i} \end{cases}$$
(5)


Where Pi denotes the position of the prey of the ith Northern Goshawk; \({F_{{P_i}}}\) is the objective function value at that prey position; k is a random integer in [1, N] with k ≠ i; \(X_i^{new,P1}\) represents the updated position of the ith Northern Goshawk; \(x_{i,j}^{new,P1}\) represents its updated position in the jth dimension; \(F_i^{new,P1}\) is the objective function value of the ith Northern Goshawk after the phase-1 update; r is a random number in the interval [0, 1]; and I is a randomly selected integer, either 1 or 2.

Chase and escape (local search)

After a Northern Goshawk attacks its prey, the prey attempts to escape, so in the concluding phase of the hunt the goshawk must sustain its pursuit. The goshawks’ high pursuit speed enables them to chase and ultimately capture prey in nearly any circumstance. Simulating this behaviour improves the algorithm’s capacity for local search within the search space. The pursuit is assumed to take place within a radius R of the attack position and is described by Eqs. (6)-(8):

$$x_{{i,j}}^{{new,P2}}={x_{i,j}}+R(2r - 1){x_{i,j}}$$
(6)
$$R=0.02\left( {1 - \frac{t}{T}} \right)$$
(7)
$${X_i}=\begin{cases} X_i^{new,P2}, & F_i^{new,P2}<{F_i} \\ {X_i}, & F_i^{new,P2} \geqslant {F_i} \end{cases}$$
(8)


Where t represents the current iteration number and T denotes the maximum number of iterations; \(X_i^{new,P2}\) represents the updated position of the ith Northern Goshawk during the second phase; \(x_{i,j}^{new,P2}\) represents its updated position in the jth dimension; and \(F_i^{new,P2}\) is the objective function value of the ith Northern Goshawk after the phase-2 update.
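To make the two phases concrete, a minimal Python sketch of the NGO loop described by Eqs. (1)-(8) follows. This is an illustrative implementation under assumed box constraints [lb, ub]; the function and parameter names are ours, not part of any established library.

```python
import numpy as np

def ngo_minimize(objective, lb, ub, n_pop=30, max_iter=100, seed=0):
    """Minimal sketch of Northern Goshawk optimization (Eqs. (1)-(8))."""
    rng = np.random.default_rng(seed)
    lb, ub = np.asarray(lb, float), np.asarray(ub, float)
    m = lb.size
    X = lb + rng.random((n_pop, m)) * (ub - lb)       # Eq. (1): initial population
    F = np.array([objective(x) for x in X])           # Eq. (2): objective values

    for t in range(1, max_iter + 1):
        R = 0.02 * (1 - t / max_iter)                 # Eq. (7): shrinking search radius
        for i in range(n_pop):
            # Phase 1: prey identification and attack (global search), Eqs. (3)-(5)
            k = rng.choice([j for j in range(n_pop) if j != i])
            P, FP = X[k], F[k]                        # randomly selected prey
            r, I = rng.random(m), rng.integers(1, 3)  # r in [0, 1], I in {1, 2}
            if FP < F[i]:
                x_new = X[i] + r * (P - I * X[i])
            else:
                x_new = X[i] + r * (X[i] - P)
            x_new = np.clip(x_new, lb, ub)
            f_new = objective(x_new)
            if f_new < F[i]:                          # greedy acceptance, Eq. (5)
                X[i], F[i] = x_new, f_new
            # Phase 2: chase and escape (local search), Eqs. (6)-(8)
            x_new = np.clip(X[i] + R * (2 * rng.random(m) - 1) * X[i], lb, ub)
            f_new = objective(x_new)
            if f_new < F[i]:                          # greedy acceptance, Eq. (8)
                X[i], F[i] = x_new, f_new

    best = int(np.argmin(F))
    return X[best], F[best]
```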

LSSVM algorithm

Various machine learning algorithms, including backpropagation neural networks (BP), random forests (RF), and support vector machines (SVM), have been widely used to predict corrosion rates in subsea pipelines. While these methods have demonstrated varying degrees of success, they often face challenges in computational efficiency and model generalizability when dealing with small-to-medium scale datasets characterized by high dimensionality and strong nonlinearity. For instance, BP models typically require substantial computational resources and extensive hyperparameter tuning, while SVM and ensemble methods like RF may encounter overfitting risks in limited-data scenarios.

In contrast, least squares support vector machines (LSSVM) show clear advantages in this application setting. In our preliminary study (see Figs. 2 and 3; Tables 6 and 7), LSSVM consistently demonstrated superior performance in systematic comparisons with three representative algorithms (BP, RF, and traditional SVM), with much higher prediction accuracy and stability. This improved performance stems from the mathematical formulation of the LSSVM, which converts the quadratic optimization problem into a system of linear equations by means of equality constraints, ensuring a globally optimal solution while maintaining model simplicity. In addition, its structural risk minimization principle enhances generalization capability, which is particularly important for offshore engineering applications, where in situ data collection is costly and datasets are limited in size.

As an advanced variant of support vector machines (SVM), LSSVM addresses the original algorithm’s computational complexity through an innovative problem reformulation. Where conventional SVM solves convex quadratic programming problems, LSSVM transforms this into solving linear equations via kernel space mapping and regularization techniques. This fundamental improvement not only accelerates computation but also improves numerical stability, making it particularly suitable for handling the sparse, high-dimensional corrosion datasets typical of subsea pipeline monitoring systems.

The LSSVM is thus an advanced learning and prediction algorithm derived from the conventional SVM that streamlines the quadratic optimization problem by converting it into a system of linear equations24.

The steps for using the LSSVM algorithm are as follows:

For a given training dataset \(\left( {{x_i},{y_i}} \right)\), \({x_i}={\left( {{x_{i1}},{x_{i2}}, \cdots ,{x_{id}}} \right)^T}\) is the d-dimensional input vector, \({y_i}\) is the corresponding output value, and l is the total number of training samples.

(1) To transform the input space into the feature space, a nonlinear function \(\phi \left( {{x_i}} \right)\) is employed. The nonlinear function estimation is represented by Eq. (9)25:

$$f\left( x \right)=b+\left\langle {\phi \left( x \right),\omega } \right\rangle$$
(9)

Where \(\omega\) is the weight vector, b is the bias term, and \(\left\langle { \cdot , \cdot } \right\rangle\) denotes the inner product operation.

(2) The values of the parameters \(\omega\) and b are determined according to the structural risk minimization principle:

$$\left\{ \begin{array}{l} \hbox{min} \,J\left( {\overrightarrow \omega ,\xi } \right)=\frac{1}{2}{\left\| {\overrightarrow \omega } \right\|^2}+c\sum\limits_{{i=1}}^{l} {\xi _{i}^{2}} \\ s.t.\;{y_i}={\overrightarrow \omega ^T}\phi \left( {{x_i}} \right)+b+{\xi _i},\quad i=1, \cdots ,l \end{array} \right.$$
(10)

Where c is the penalty factor and \({\xi _i}\) is the slack variable.

(3) Introducing the Lagrange multipliers \(\alpha\) yields the Lagrangian function:

$$L\left( {\overrightarrow \omega ,b,\xi ,\alpha } \right)=\frac{1}{2}{\left\| {\overrightarrow \omega } \right\|^2}+c\sum\limits_{{i=1}}^{l} {\xi _{i}^{2}} - \sum\limits_{{i=1}}^{l} {{\alpha _i}} \left[ {{\overrightarrow \omega ^T}\phi \left( {{x_i}} \right)+b+{\xi _i} - {y_i}} \right]$$
(11)
(4) Setting the partial derivatives with respect to \(\overrightarrow \omega\), b, \({\xi _i}\) and \(\alpha\) to zero provides the conditions for the optimal solution:

$$\left\{ \begin{array}{l} \frac{{\partial L}}{{\partial \omega }}=0 \to \overrightarrow \omega =\sum\limits_{{i=1}}^{l} {{\alpha _i}\phi \left( {{x_i}} \right)} \\ \frac{{\partial L}}{{\partial b}}=0 \to \sum\limits_{{i=1}}^{l} {{\alpha _i}} =0 \\ \frac{{\partial L}}{{\partial {\xi _i}}}=0 \to {\alpha _i}=c{\xi _i} \\ \frac{{\partial L}}{{\partial {\alpha _i}}}=0 \to {\overrightarrow \omega ^T}\phi \left( {{x_i}} \right)+b+{\xi _i} - {y_i}=0 \end{array} \right.$$
(12)
(5) Eliminating the parameters \(\overrightarrow \omega\) and \({\xi _i}\) from Eqs. (11) and (12) yields the linear system and the resulting decision function:

$$\left[ {\begin{array}{*{20}{c}} 0&1& \cdots &1 \\ 1&{K\left( {{x_1},{x_1}} \right)+\frac{1}{c}}& \cdots &{K\left( {{x_1},{x_l}} \right)} \\ \vdots & \vdots &{}& \vdots \\ 1&{K\left( {{x_l},{x_1}} \right)}& \cdots &{K\left( {{x_l},{x_l}} \right)+\frac{1}{c}} \end{array}} \right]\left[ {\begin{array}{*{20}{c}} b \\ {{\alpha _1}} \\ \vdots \\ {{\alpha _l}} \end{array}} \right]=\left[ {\begin{array}{*{20}{c}} 0 \\ {{y_1}} \\ \vdots \\ {{y_l}} \end{array}} \right] \to f\left( x \right)=\sum\limits_{{i=1}}^{l} {{\alpha _i}K\left( {x,{x_i}} \right)} +b$$
(13)

Where \(K\left( {{x_i},{x_j}} \right)\) is the kernel function, expressed as

$$K\left( {{x_i},{x_j}} \right)=\exp \left( { - \frac{{{{\left\| {{x_i} - {x_j}} \right\|}^2}}}{{2{\sigma ^2}}}} \right)$$
(14)

Here, \({\sigma ^2}\) is the kernel parameter. Predictions can be obtained by solving for the unknowns b and \({\alpha _i}\) in Eq. (13)26,27.
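As an illustration of the training and prediction steps above, the following Python sketch assembles the RBF kernel of Eq. (14) and solves the linear system of Eq. (13) directly with NumPy. The function names are illustrative assumptions, not an established API.

```python
import numpy as np

def rbf_kernel(A, B, sigma2):
    """Eq. (14): K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 * sigma^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma2))

def lssvm_train(X, y, c, sigma2):
    """Solve the (l+1) x (l+1) linear system of Eq. (13) for b and alpha."""
    l = X.shape[0]
    A = np.zeros((l + 1, l + 1))
    A[0, 1:] = 1.0                                   # top row of ones
    A[1:, 0] = 1.0                                   # left column of ones
    A[1:, 1:] = rbf_kernel(X, X, sigma2) + np.eye(l) / c
    sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
    return sol[0], sol[1:]                           # bias b, multipliers alpha

def lssvm_predict(X_train, alpha, b, sigma2, X_new):
    """Decision function of Eq. (13): f(x) = sum_i alpha_i K(x, x_i) + b."""
    return rbf_kernel(X_new, X_train, sigma2) @ alpha + b
```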

KPCA-NGO-LSSVM model

Kernel principal component analysis

High-dimensional feature values can lead to the curse of dimensionality; therefore, to extract the essential features and improve the predictive accuracy while minimizing the model complexity, kernel principal component analysis (KPCA) is applied to reduce the data dimensions28.

Kernel principal component analysis (KPCA) is a nonlinear dimensionality reduction method that transforms raw data into a high-dimensional feature space by utilizing a kernel function, followed by the application of principal component analysis (PCA) within that feature space.

The principles of the KPCA algorithm are as follows:

(1) The sample set of original operating data \({x_k}\) is nonlinearly transformed by the mapping \(\Phi\), which carries \({x_k}\) into a high-dimensional linear feature space, and the covariance matrix of the new sample set is then computed; i.e.,

$$\overrightarrow C =\frac{1}{m}\sum\limits_{{j=1}}^{m} {\overrightarrow \varphi \left( {{x_j}} \right)} \overrightarrow \varphi {\left( {{x_j}} \right)^T}$$
(15)
(2) The eigenvalues \(\lambda\) and eigenvectors \(\overrightarrow v\) of matrix C are calculated, which must satisfy:

$$\lambda \overrightarrow v - C\overrightarrow v =0$$
(16)
(3) Both sides are multiplied by the nonlinear function \(\overrightarrow \varphi \left( {{x_i}} \right)\), and the eigenvector \(\overrightarrow v\) is expressed as a linear combination of the \(\overrightarrow \varphi \left( {{x_i}} \right)\); i.e.,

$$\overrightarrow v =\sum\limits_{{i=1}}^{m} {{\alpha _i}} \overrightarrow \varphi \left( {{x_i}} \right)$$
(17)
(4) The kernel function matrix \(\overrightarrow K \left( {i,j} \right)=\left\langle {\overrightarrow \varphi \left( {{x_i}} \right),\overrightarrow \varphi \left( {{x_j}} \right)} \right\rangle\) is defined, and Eq. (16) is transformed into:

$$m\lambda \overrightarrow \alpha - K\overrightarrow \alpha =0$$
(18)

where \(\overrightarrow \alpha\) is the eigenvector of K with eigenvalue \(m\lambda\), and the subscript i denotes an element of the input dataset.

For any sample x, the projection of its image \(\overrightarrow \varphi \left( x \right)\) onto a principal component \(\overrightarrow v\) in the feature space F is29:

$$\overrightarrow v \cdot \overrightarrow \varphi \left( x \right)=\sum\limits_{{i=1}}^{m} {{\alpha _i}} \left\langle {\overrightarrow \varphi \left( {{x_i}} \right),\overrightarrow \varphi \left( x \right)} \right\rangle =\sum\limits_{{i=1}}^{m} {{\alpha _i}} K\left( {{x_i},x} \right)$$
(19)
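A minimal Python sketch of this procedure follows, assuming an RBF kernel and the usual centring of the kernel matrix in feature space (a step implicit in the derivation above); the function name and the retention threshold are illustrative.

```python
import numpy as np

def kpca_transform(X, sigma2, var_threshold=0.85):
    """Sketch of KPCA (Eqs. (15)-(19)): eigendecompose the centred kernel
    matrix and keep components up to the cumulative-contribution threshold."""
    m = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-d2 / (2.0 * sigma2))              # kernel matrix K(i, j)
    J = np.eye(m) - np.ones((m, m)) / m
    Kc = J @ K @ J                                # centre the data in feature space
    eigval, eigvec = np.linalg.eigh(Kc)           # eigenproblem of K, Eq. (18)
    order = np.argsort(eigval)[::-1]              # sort descending
    eigval = np.clip(eigval[order], 0.0, None)
    eigvec = eigvec[:, order]
    ratio = eigval / eigval.sum()                 # variance contribution of each PC
    n_comp = int(np.searchsorted(np.cumsum(ratio), var_threshold)) + 1
    alpha = eigvec[:, :n_comp] / np.sqrt(eigval[:n_comp] + 1e-12)
    return Kc @ alpha                             # projections, Eq. (19)
```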

Predictive modelling

Initially, the KPCA algorithm is employed to reduce the dimensionality of the data, and the NGO algorithm is then applied to optimize the penalty parameter γ and the kernel parameter \({\sigma ^2}\) of the LSSVM algorithm, thereby yielding a composite corrosion rate prediction model for subsea pipelines, referred to as the KPCA-NGO-LSSVM model. The flowchart of this process is shown in Fig. 1. The NGO-LSSVM, KPCA-PSO-LSSVM, and PSO-LSSVM models are also established for validation against the integrated KPCA-NGO-LSSVM model.

Fig. 1 Flowchart of the KPCA-NGO-LSSVM model.
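To show how the blocks of Fig. 1 fit together, the following schematic sketch (reusing the hypothetical helpers from the earlier snippets) lets NGO search the (γ, σ²) plane, scoring each candidate by the validation MAPE of an LSSVM trained on the KPCA-reduced features. It illustrates the workflow only and is not the exact implementation used in this study.

```python
import numpy as np

def fitness(params, X_tr, y_tr, X_val, y_val):
    """Validation MAPE (Eq. (20)) of an LSSVM with candidate (gamma, sigma2)."""
    gamma, sigma2 = params
    b, alpha = lssvm_train(X_tr, y_tr, c=gamma, sigma2=sigma2)
    pred = lssvm_predict(X_tr, alpha, b, sigma2, X_val)
    return np.mean(np.abs((pred - y_val) / y_val)) * 100.0

# Z = kpca_transform(X_raw, sigma2=1.0)                   # step 1: reduce dimensions
# best, _ = ngo_minimize(lambda p: fitness(p, Z_tr, y_tr, Z_val, y_val),
#                        lb=[1e-2, 1e-2], ub=[1e3, 1e2])  # step 2: tune (gamma, sigma2)
```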

Model evaluation indicators

To thoroughly assess the predictive accuracy of the KPCA-NGO-LSSVM corrosion rate model for subsea oil and gas pipelines, the mean absolute percentage error (MAPE), root mean squared error (RMSE) and coefficient of determination (R2) were employed as evaluation metrics:

$$MAPE=\frac{1}{n}\sum\limits_{{i=1}}^{n} {\left| {\frac{{{y_i} - {x_i}}}{{{x_i}}}} \right|} \times 100\%$$
(20)
$$RMSE=\sqrt {\frac{1}{n}\sum\limits_{{i=1}}^{n} {{{\left( {{y_i} - {x_i}} \right)}^2}} }$$
(21)
$${R^2}=1 - \frac{{\sum\limits_{{i=1}}^{n} {{{\left( {{x_i} - {y_i}} \right)}^2}} }}{{\sum\limits_{{i=1}}^{n} {{{\left( {{x_i} - \bar x} \right)}^2}} }}$$
(22)

Where \({x_i}, {y_i}\) are the true and predicted values of the ith sample, respectively, for \(i=1,2, \cdots ,n\); n is the total number of samples; MAPE indicates the model’s overall relative error; RMSE denotes the deviation of the predicted values from the actual values; and R2 measures how well the predictions reproduce the variance of the measured values. Lower MAPE and RMSE values and an R2 closer to 1 indicate greater prediction accuracy and better predictive performance of the model.
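For reference, the three metrics follow directly from Eqs. (20)-(22); a short Python sketch (the function name is illustrative) mirroring those definitions is given below.

```python
import numpy as np

def evaluate(x_true, y_pred):
    """MAPE, RMSE and R^2 of Eqs. (20)-(22); x are measured values, y predictions."""
    x, y = np.asarray(x_true, float), np.asarray(y_pred, float)
    mape = np.mean(np.abs((y - x) / x)) * 100.0
    rmse = np.sqrt(np.mean((y - x) ** 2))
    r2 = 1.0 - np.sum((x - y) ** 2) / np.sum((x - x.mean()) ** 2)
    return mape, rmse, r2
```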

Example analysis

Dataset segmentation

Three distinct sets of pipeline corrosion rate data from the literature were selected for algorithmic prediction. Owing to space constraints, the prediction process is detailed only for data 1, while the results for data 2 and data 3 are presented in summary form.

Data 1 comprise 50 subsea pipeline corrosion records from reference30; selected values are shown in Table 1. Forty records are chosen as the training set, and the remaining 10 are used as the test set for model prediction and error checking.

Table 1 Corrosion rate data for selected subsea pipelines.

Data 2 comprise 100 overseas oil and gas pipeline corrosion records from reference31; selected values are shown in Table 2. Eighty records are chosen as the training set, and the remaining 20 are used as the test set for model prediction and error checking.

Table 2 Corrosion rate data for selected overseas oil and gas pipelines.

Data 3 comprise 28 subsea multiphase flow pipeline corrosion records from reference32; selected values are shown in Table 3. Twenty-two records are chosen as the training set, and the remaining 6 are used as the test set for model prediction and error checking.

Table 3 Corrosion rate data for selected subsea multiphase flow pipelines.
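The source texts do not state how the training and test records were selected; assuming a random partition, the three splits above (40/10, 80/20 and 22/6) could be reproduced with a sketch such as the following (names illustrative):

```python
import numpy as np

def split(X, y, n_train, seed=42):
    """Random train/test partition of a corrosion dataset (assumed scheme)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    tr, te = idx[:n_train], idx[n_train:]
    return X[tr], y[tr], X[te], y[te]

# e.g. data 1: X1_tr, y1_tr, X1_te, y1_te = split(X1, y1, n_train=40)
```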

Data preprocessing

In kernel principal component analysis, the kernel function maps the original data into a high-dimensional space, enabling nonlinear dimensionality reduction and the mining of nonlinear information in the data21. Therefore, the nine factors influencing the subsea pipeline corrosion rate in data 1 were reduced in dimension via KPCA. The variance contributions of the nine principal components were computed in MATLAB 2020a, as shown in Table 4.

The magnitudes of the eigenvalues and the cumulative contributions reflect the influence of the principal components, as shown in Table 4. In this work, the first six principal components, F1 through F6, whose cumulative contribution exceeds 85%, were extracted.

Table 4 Analysis of variance contribution ratios of nine principal components of the corrosion factors of a subsea pipeline.

The eigenvectors of the first six principal components selected in this paper are shown in Table 5. The eigenvector of each principal component indicates each factor’s explanatory ability: the closer its absolute value is to one, the stronger the explanatory ability, implying that the factor has a greater influence on subsea pipeline corrosion. As shown in Table 5, F1 correlates most strongly with system pressure, F2 with the medium flow rate, F3 with pH, F4 with water content, F5 with temperature, and F6 with CO2 partial pressure.

Table 5 Eigenvectors of each factor of the first 6 principal components of the corrosion factors of a subsea pipeline.

In summary, for data 1, the system pressure, water content, medium flow rate, pH, temperature, and CO2 partial pressure have greater impacts on the subsea pipeline corrosion rate than the other factors. These corrosion factors are substituted into the combined model for subsequent prediction.

Analysis of the forecast results

Given the small sample size and high-dimensional features of the dataset, we initially selected four machine learning algorithms (BP, LSSVM, SVM, and RF) for comparative analysis. The results (Figs. 2 and 3; Tables 6 and 7) demonstrate that the LSSVM model outperforms the RF, BP, and SVM models in terms of prediction accuracy and stability. Therefore, we adopted the LSSVM algorithm as the predictive model for estimating the corrosion rate of subsea pipelines.

Fig. 2 Comparison of the predicted and real values for single models.

Fig. 3 Comparison of the relative errors for single models.

Table 6 Statistics of the predicted values and relative errors for single models.
Table 7 Comparison of the evaluation indicators for the single-model prediction results.

Four optimized combined models based on the LSSVM algorithm were subsequently developed and compared. The predictive outcomes of the integrated KPCA-NGO-LSSVM model and the three other combined models after training are shown in Figs. 4 and 5 and Table 8. Figure 4 shows that the KPCA-NGO-LSSVM model yields the best predictions and stability, followed by the NGO-LSSVM model, whereas the KPCA-PSO-LSSVM and PSO-LSSVM models yield the worst predictions and the least stability.

Fig. 4 Comparison of the predicted and real values for combined models.

Fig. 5 Comparison of the relative errors for combined models.

Table 8 Statistics of the predicted values and relative errors for combined models.

As shown in Figs. 4 and 5 and Table 8, the stability of the predicted values of the KPCA-PSO-LSSVM and PSO-LSSVM models is low overall, with maximum relative errors of 5.42% and 5.81% and average relative errors of 3.59% and 4.27%, respectively. The predicted values of the NGO-LSSVM model exhibit good stability, with a maximum relative error of 5.10% and an average relative error of 2.49%. The KPCA-NGO-LSSVM model performs best, exhibiting optimal stability of the predicted values, with a maximum relative error of 4.80% and an average relative error of 1.80%, both lower than those of the other models, indicating the most effective predictions.

Table 9 Comparison of evaluation indicators for the prediction results of combined models based on data 1.

Similarly, prediction studies were performed for data 2 and data 3 using the same four combined models. The comparative indicators of their prediction results are presented in Tables 10 and 11.

Table 10 Comparison of evaluation indicators for the predictive results of combined models based on data 2.
Table 11 Comparison of evaluation indicators for the predictive results of combined models based on data 3.

As shown in Tables 9, 10 and 11, across all three datasets the KPCA-NGO-LSSVM algorithm exhibits the most stable predicted values and significantly outperforms the NGO-LSSVM, KPCA-PSO-LSSVM, and PSO-LSSVM algorithms in prediction accuracy.

Conclusion

This paper presents the fundamental principles of the KPCA, NGO, and LSSVM algorithms and establishes a composite model for predicting the corrosion rates of subsea pipelines utilizing the KPCA-NGO-LSSVM approach. The following conclusions are derived from the validation and error analysis of the corrosion rate data pertaining to subsea pipelines:

(1) The KPCA algorithm was utilized for dimensionality reduction to obtain the six factors with the greatest influence on the corrosion rates of subsea pipelines, i.e., system pressure, water content, medium flow rate, pH, temperature, and CO2 partial pressure. The multiple correlations between the influencing factors were eliminated, the complexity of the data was reduced, and the efficiency of the modelling operation was improved.

(2) Based on the data characteristics, four algorithms (BP, SVM, LSSVM, and RF) were selected and compared. The LSSVM model achieved a MAPE of 7.1398%, an RMSE of 0.1939 and an R2 of 0.8047, demonstrating significantly better predictive performance and stability than the other algorithms; LSSVM was therefore adopted as the base model for predicting the corrosion rate of subsea pipelines.

(3) Based on the prediction results from three distinct datasets, the combined KPCA-NGO-LSSVM model achieves significantly higher prediction accuracy and stability than the other three combined models. The model thus provides robust technical support for accurately predicting subsea pipeline corrosion rates, with significant potential for extending the service life of subsea pipelines and reducing operational and maintenance costs.

(4) The predictive accuracy and stability of the model improve with larger data samples. Consequently, a comprehensive pipeline corrosion database could be developed to derive a corrosion rate prediction model with broader applicability and enhanced efficacy.