Advancing LightGBM with data augmentation for predicting the residual strength of corroded pipelines

Wang, Qiankun; Lu, Hongfang; Li, Fan; Cheng, Y. Frank

doi:10.1038/s41529-025-00673-9

Article
Open access
Published: 22 October 2025

npj Materials Degradation volume 9, Article number: 128 (2025)

Abstract

Machine learning methods have been widely applied in predicting the residual strength of corroded pipelines due to their powerful predictive capabilities. However, the effective application of these techniques is constrained by the limited availability of high-quality data, as traditional pipeline burst tests are both costly and time-consuming. This study addresses the challenge of data limitations by applying and comparing three advanced data augmentation models—Tabular Variational Autoencoder (TVAE), Copula Generative Adversarial Network (CopulaGAN), and conditional tabular generative adversarial network (CTGAN)—to enhance the corroded pipeline dataset. The augmented datasets were used to train a LightGBM model for residual strength prediction. Among the three, the CopulaGAN-LightGBM data augmentation yielded the best improvement, increasing the model’s R² by 4.46%. Additionally, SHapley Additive exPlanations (SHAP) analysis was conducted on the CopulaGAN-LightGBM model to interpret feature importance, identifying wall thickness, defect depth, and pipe diameter as the most influential factors affecting residual strength. Finally, a practical online platform implementing the proposed model has been developed to enable real-time residual strength prediction. The results demonstrate that combining LightGBM with effective data augmentation techniques provides a reliable solution to overcome data limitations in pipeline corrosion assessment.

Introduction

Pipelines play a critical role in the transportation system, serving as the primary means of transporting oil and gas¹⁻⁵. Due to the complex and variable operational conditions of oil and gas pipelines, they are susceptible to corrosion and other defects⁶⁻⁹. Corrosion can greatly reduce the structural integrity of pipelines, potentially leading to leakage or fracture¹⁰,¹¹. The failure of oil and gas pipelines can lead to severe economic and environmental consequences. The residual strength of a pipeline refers to the maximum internal pressure that the pipeline can withstand in the presence of corrosion defects before structural failure occurs. It is commonly represented by the burst pressure of corroded pipelines¹². Given the serious consequences of failure, accurate prediction of the residual strength of corroded pipelines is essential.

Traditional methods for predicting residual strength primarily consist of empirical formulas and finite element analysis. However, empirical formulas tend to be conservative, potentially resulting in significant prediction errors¹³. Although finite element analysis is widely recognized for its accuracy and reliability in residual strength prediction¹⁴,¹⁵, its application often requires case-specific modeling and meshing strategies tailored to different corrosion geometries¹⁶, a process that can be time-consuming and resource-intensive. In recent years, machine learning has substantially advanced the prediction of residual strength in corroded pipelines, with data-driven approaches increasingly replacing traditional methods. Xiao integrated key physical factors associated with the failure mechanisms of corroded pipelines and employed various machine learning models to predict residual strength¹⁷. The results indicate that the proposed method outperforms traditional empirical models in predictive accuracy. Ma et al. integrated empirical formulas with ensemble learning techniques, using these expressions to guide the training of machine learning models¹⁸. This approach achieves higher accuracy than five traditional empirical formulas and demonstrates strong applicability across pipelines with varying strength levels. Wang and Lu applied a meta-learner to the outputs of base learners, which significantly improved predictive accuracy and reduced overfitting¹⁹. Miao and Zhao optimized the deep extreme learning machine (DELM) using hybrid teaching-learning-based optimization (HTLBO)²⁰. The proposed model predicts residual strength with a relative error within 6%.

Despite significant progress in applying machine learning to predict the residual strength of corroded pipelines, the performance of these models remains highly dependent on the quality of the residual strength data. Therefore, collecting relevant and comprehensive experimental data is essential not only for developing machine learning models for residual strength prediction, but also for uncovering the nonlinear relationships between residual strength and physical factors such as pipeline sizing parameters, corrosion parameters, and material parameters. However, conducting full-scale burst tests to obtain data on corroded pipeline samples is often impractical due to the high cost and safety risks¹⁶. In such cases, the finite element method represents a vital tool for acquiring pipeline residual strength data⁸. However, finite element models demand high-performance computing resources, and generating large-scale datasets involves substantial time and financial costs. Consequently, the availability of such data in public literature remains limited. For instance, the full-scale burst tests on corroded pipelines conducted by Benjamin include only six experimental cases, which nonetheless have served as benchmark references for many subsequent finite element analyses²¹. Similarly, the datasets constructed by Liu et al.²² and Lo et al.²³ using finite element methods for machine learning contain fewer than 100 instances. In addition, much of the industry’s field inspection data remains inaccessible due to confidentiality. For data-driven models to achieve reliable performance in residual strength prediction, a sufficiently large and diverse dataset is essential. Typically, hundreds to thousands of labeled samples are needed, depending on the dimensionality of input features and the complexity of the model. The scarcity of relevant training data, combined with the heavy reliance of predictive models on data quality, can lead to overfitting.
To address this issue, deep learning-based data augmentation techniques have emerged as an effective strategy to generate new samples that enhance feature space coverage. Zhao et al. utilized tabular generative adversarial networks (TGAN) to augment limited training data and trained an XGBoost model on the augmented dataset to predict the limit states of recycled aggregate concrete (RAC)²⁴. The model exhibited excellent predictive performance. Marani et al. successfully predicted the compressive strength of ultra-high-performance concrete by training machine learning models with 6513 data points generated using TGAN²⁵. Experimental results confirm the feasibility of the data augmentation model. In the field of pipeline structural integrity assessment, Woldesellasse and Tesfamariam generated a corrosion dataset using conditional generative adversarial networks (cGAN), and demonstrated that training an artificial neural network (ANN) with the cGAN-augmented dataset significantly improved predictive accuracy²⁶. Ma et al. addressed the issues of data scarcity and distribution imbalance by utilizing CTGAN to augment pipeline corrosion data, achieving accurate prediction of corrosion depth²⁷. Soomro et al. addressed the issue of class imbalance by applying the synthetic minority over-sampling technique (SMOTE) and integrating multiple machine learning methods to improve the accuracy of burst pressure prediction for oil and gas pipelines²⁸.

Building upon previous research, this study further explores the applicability of data augmentation models for predicting the residual strength of corroded pipelines. It employs three generative algorithms (TVAE, CopulaGAN, and CTGAN) to synthesize residual strength data for corroded pipelines. The synthesized data is incorporated into the training process of machine learning models to enable accurate predictions of the residual strength of corroded pipelines under limited data conditions. SHAP was then applied to interpret the established machine learning model and identify key factors affecting the residual strength of corroded pipelines. Additionally, a web GUI for the model was developed using Streamlit, enhancing the practical usability of the proposed model. The proposed framework opens new avenues for future research and underscores the considerable potential of data augmentation techniques in addressing data scarcity challenges.

Results and discussion

Synthetic data analysis

Ensuring data quality while expanding the dataset presents a significant challenge in data augmentation. To ensure the validity of the generated data, physically unreasonable samples were manually removed as outliers. Specifically, samples were excluded if: (1) the defect depth exceeded the wall thickness; (2) the ultimate tensile strength was lower than the yield strength; or (3) any physical parameter was negative. After this cleaning process, the final dataset sizes were 1748 (TVAE), 1749 (CopulaGAN), and 1747 (CTGAN). To verify that the synthetic data resemble the real data, statistical histograms were used to compare the distributions of individual features, while Pearson correlation analysis was conducted to evaluate the consistency of linear relationships among variables. In addition, Table 1 summarizes the statistical characteristics of both synthetic and real data, comparing the minimum, maximum, mean, and standard deviation of each feature. Although minor differences exist, the statistical features of each parameter in the real and synthetic data are largely comparable, indicating that the constructed data generation models effectively captured the key statistical properties of the real data.
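The three exclusion rules can be expressed as a simple filter. The following Python sketch is illustrative only: the field names and sample values are hypothetical, and it assumes that defect depth and wall thickness share one unit, as do the two strength values.

```python
def is_physically_valid(sample):
    """Return True if a synthetic sample passes the three plausibility rules."""
    # Rule 3: no physical parameter may be negative.
    if any(value < 0 for value in sample.values()):
        return False
    # Rule 1: a defect cannot be deeper than the pipe wall.
    if sample["defect_depth"] > sample["wall_thickness"]:
        return False
    # Rule 2: ultimate tensile strength cannot be below yield strength.
    if sample["ultimate_tensile_strength"] < sample["yield_strength"]:
        return False
    return True


# Hypothetical synthetic rows (only four of the features, for brevity).
synthetic = [
    {"wall_thickness": 10.0, "defect_depth": 3.0,
     "yield_strength": 450.0, "ultimate_tensile_strength": 535.0},
    {"wall_thickness": 10.0, "defect_depth": 12.0,   # violates rule 1
     "yield_strength": 450.0, "ultimate_tensile_strength": 535.0},
    {"wall_thickness": 10.0, "defect_depth": 3.0,    # violates rule 2
     "yield_strength": 450.0, "ultimate_tensile_strength": 400.0},
    {"wall_thickness": 10.0, "defect_depth": -1.0,   # violates rule 3
     "yield_strength": 450.0, "ultimate_tensile_strength": 535.0},
]

cleaned = [sample for sample in synthetic if is_physically_valid(sample)]
print(f"kept {len(cleaned)} of {len(synthetic)} samples")
```

Because the first rule is stated as the depth exceeding the wall thickness, the sketch keeps a sample whose defect depth equals the wall thickness; whether such through-wall samples were retained in the study is not stated.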

Table 1 Statistics of real and synthetic data

Figure 1 presents histograms depicting the distributions of both real and synthetic data. The similarity between their frequency distributions and kernel density plots indicates that the synthetic data effectively replicates the characteristics of the real data across various features. Furthermore, the data generated by the models showed no significant anomalies or unnatural characteristics, indicating high quality without substantial errors or unrealistic values.

**Fig. 1: Histograms of real and synthetic data.**

Figure 2 presents a comparison of Pearson correlation coefficients for multiple attributes in both the synthetic and real datasets. The coefficients demonstrate a close correspondence between variable relationships in the synthetic and real data. For example, wall thickness exhibits a clear positive correlation with residual strength, with coefficients of 0.41 in the real data and 0.49 (TVAE), 0.45 (CopulaGAN), and 0.39 (CTGAN) in the synthetic data. Similarly, defect length shows a negative correlation with residual strength, with coefficients of −0.24 for real data and −0.22 (TVAE), −0.19 (CopulaGAN), and −0.20 (CTGAN) for synthetic data. These results indicate that all three data augmentation models effectively captured the underlying relationships among variables.

**Fig. 2: Correlation coefficient matrix of real and synthetic data.**
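As a minimal sketch of this comparison, the Python snippet below implements the Pearson coefficient from its definition and then computes the absolute gap between the real and synthetic coefficients quoted above. The function and variable names are illustrative; only the six synthetic and two real coefficients come from the text.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(var_x * var_y)


# Correlations with residual strength as reported in the text.
reported = {
    "wall thickness": {"real": 0.41, "TVAE": 0.49, "CopulaGAN": 0.45, "CTGAN": 0.39},
    "defect length": {"real": -0.24, "TVAE": -0.22, "CopulaGAN": -0.19, "CTGAN": -0.20},
}

# Absolute difference between each synthetic coefficient and the real one.
gaps = {
    feature: {
        model: round(abs(value - values["real"]), 2)
        for model, value in values.items()
        if model != "real"
    }
    for feature, values in reported.items()
}

for feature, per_model in gaps.items():
    print(feature, per_model)
```

For the two features quoted, the largest gap is 0.08 (TVAE, wall thickness), so all three generators stay within 0.1 of the real coefficients on these examples.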

The above analysis indicates that the data generated by TVAE, CopulaGAN, and CTGAN closely resemble the real experimental data. Consequently, this synthetic data can be effectively used to develop subsequent models, mitigating the scarcity of residual strength data for corroded pipelines.

Model performance analysis

To verify whether the data generated by the data augmentation models enhances the predictive performance of the LightGBM model, comparative experiments were conducted. The experiments involved training the model using both the original dataset and a mixed dataset composed of real and synthetic data. Figure 3 illustrates the model’s performance under different training dataset conditions. Although the model was enhanced using synthetic data, its generalization capability was evaluated on 91 previously unseen experimental samples. As shown in Fig. 3, models trained with data generated by TVAE, CopulaGAN, and CTGAN demonstrated notable improvements over the baseline model. Specifically, the R² scores increased by 3.12%, 4.46%, and 3.60%, respectively. These results confirm that the data generated through augmentation techniques can effectively enhance the predictive performance of residual strength models for corroded pipelines.

**Fig. 3: LightGBM model performance on the training and testing datasets.**

To further assess the effectiveness of data augmentation methods, scatter plots of the four models on the test set were generated, as shown in Fig. 4. The identity line indicates the ideal scenario where predicted values equal real values. The proximity of the scatter points to this line reflects the model’s prediction accuracy—the closer the points, the more accurate the predictions. Figure 4 shows that the scatter plots of the three models enhanced by data augmentation are more tightly clustered around the identity line on the test set compared to the baseline model, indicating improved prediction accuracy. This demonstrates that the models trained with augmented data possess stronger generalization capabilities when predicting previously unseen data. This improved generalization is particularly beneficial in engineering applications, where models are expected to perform reliably under diverse and untested corrosion scenarios. To assess practical utility, it is important to consider how the proposed approach aligns with existing engineering standards. For instance, DNV-RP-F101 provides semi-empirical formulations for residual strength assessment of corroded pipelines²⁹. While effective, these equations are constrained by assumptions about defect geometry and pipeline material. In contrast, the data-driven model developed here can capture more complex defect interactions and broader parameter variations. Thus, the model can serve as a complementary tool to DNV-based assessment, offering refined predictions in scenarios that fall outside the code’s conservative boundaries. This integration can enhance the efficiency and accuracy of pipeline integrity evaluations in real-world applications.

**Fig. 4: Comparison chart of real values and predicted values.**

The experimental findings demonstrate that data augmentation models are capable of learning intricate patterns and distributions in diverse scenarios, producing high-quality synthetic data that substantially improves the robustness, generalization, and predictive performance of machine learning models. Among the models, CopulaGAN-LightGBM (LightGBM model enhanced with CopulaGAN data) achieved the best performance, with an R² of 0.9710, MSE of 1.9316, MAE of 0.8707, and MAPE of 0.0693. Accordingly, CopulaGAN-LightGBM was selected as the primary method for subsequent analysis. Figure 4e compares the real residual strength values in the test set with those predicted by the CopulaGAN-LightGBM model. The evaluation results indicate that this model exhibits strong learning and predictive capabilities, enabling accurate estimation of the residual strength of corroded pipelines.

Model interpretability analysis

Although the proposed model demonstrates strong predictive performance for the residual strength of corroded pipelines, it functions as a black box, limiting its interpretability. To address this issue and improve understanding of how various parameters influence residual strength, the SHAP method is employed to interpret the model’s predictions. SHAP is a very practical and effective method for interpreting machine learning models³⁰. It generates predictions for each sample and assigns a SHAP value to each feature, representing its contribution to the model’s output. These values quantitatively indicate whether an input feature has a positive or negative impact on the prediction, as shown in Eq. (1)³¹.

$$f({x}_{i})={f}_{base}+f({x}_{i},1)+f({x}_{i},2)+\cdots +f({x}_{i},k)$$

(1)

where $f({x}_{i})$ is the predicted value for sample ${x}_{i}$; ${f}_{base}$ is the average predicted value of all samples; $f({x}_{i},j)$ is the SHAP value of the $j$-th feature of the sample; and $k$ is the number of features. $f({x}_{i},j) > 0$ indicates that the feature has a positive contribution to the target value, while $f({x}_{i},j) < 0$ indicates that the feature has a negative contribution to the target value. SHAP therefore provides both the magnitude and the direction of each feature’s impact, which has led to its wide use across various industries in recent years.
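The additive decomposition in Eq. (1) can be checked numerically on a small example. The sketch below computes exact Shapley values by enumerating feature orderings for a hypothetical three-feature toy function, with absent features filled in from a small background set. The toy function, the background rows, and all names are assumptions made for illustration; the study itself applies SHAP to the trained LightGBM model, and dedicated SHAP libraries use faster tree-specific algorithms rather than this brute-force enumeration.

```python
from itertools import permutations


def toy_model(x):
    """Hypothetical stand-in for a trained regressor (not a physical formula)."""
    thickness, depth, diameter = x
    return 2.0 * thickness - 1.5 * depth - 0.01 * diameter + 0.1 * thickness * depth


def coalition_value(model, x, background, coalition):
    """Mean prediction when features in `coalition` take the values of x
    and the remaining features are taken from each background row."""
    total = 0.0
    for row in background:
        mixed = [x[j] if j in coalition else row[j] for j in range(len(x))]
        total += model(mixed)
    return total / len(background)


def shapley_values(model, x, background):
    """Exact Shapley values: average marginal contribution over all orderings."""
    k = len(x)
    phi = [0.0] * k
    orders = list(permutations(range(k)))
    for order in orders:
        coalition = set()
        previous = coalition_value(model, x, background, coalition)
        for j in order:
            coalition.add(j)
            current = coalition_value(model, x, background, coalition)
            phi[j] += current - previous
            previous = current
    return [value / len(orders) for value in phi]


background = [[10.0, 3.0, 500.0], [12.0, 6.0, 600.0], [8.0, 2.0, 400.0]]
x = [9.0, 5.0, 450.0]

f_base = coalition_value(toy_model, x, background, set())  # mean background prediction
phi = shapley_values(toy_model, x, background)
reconstructed = f_base + sum(phi)  # right-hand side of Eq. (1)

print("base value:", f_base)
print("contributions:", phi)
print("reconstructed vs. predicted:", reconstructed, toy_model(x))
```

The base value plus the per-feature contributions reproduces the prediction up to floating-point rounding, and the sign of each contribution gives the direction of that feature's effect for this sample.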

Figure 5 presents the SHAP value distribution for the CopulaGAN-LightGBM model. The X-axis indicates the SHAP value corresponding to each feature, while the color represents the magnitude of the feature value for each sample. For example, increasing wall thickness leads to higher SHAP values, reflecting its significant positive influence on residual strength and resulting in greater predicted strength. In contrast, increasing defect depth leads to lower SHAP values, reflecting a greater negative impact and a corresponding decrease in residual strength. Figure 6 illustrates the feature importance for all samples. The three most influential features in predicting the residual strength of corroded pipelines are wall thickness, defect depth, and pipe diameter. Additionally, the defect parameters and pipe size parameters have a significant impact on the residual strength of corroded pipelines, while the material parameters of the pipeline have a relatively smaller influence on the residual strength.

**Fig. 5: Distribution of SHAP value of features.**

**Fig. 6: Contribution analysis of features.**

In addition to offering a global interpretation of the real dataset, SHAP can also intuitively illustrate the impact of each feature on the prediction for individual samples. The parameters for the two samples named Sample1 and Sample2 in the test set are listed in Table 2, while the SHAP analysis is illustrated in Fig. 7. In Fig. 7, red features indicate a positive contribution to residual strength, whereas blue features indicate a negative contribution. The length of each band reflects the magnitude of the feature’s influence. This detailed visualization of individual samples highlights which features are changing and how they influence the residual strength of pipelines, providing critical insights for pipeline safety maintenance. As shown in Fig. 7 and Table 2, when the defect depth is 2.920 mm, its impact on residual strength is positive. However, as the defect depth increases to 10.795 mm, its effect becomes negative. This indicates that larger defect depths significantly reduce the residual strength, which is consistent with physical expectations. Similarly, a decrease in wall thickness leads to a shift in its contribution from positive to negative, further weakening the pipeline. Such individual-level SHAP analysis can be integrated with real-time monitoring data to estimate residual strength dynamically and provide interpretable visual explanations. Based on the SHAP outputs, maintenance personnel can identify and prioritize critical features responsible for the reduction in residual strength. To further illustrate the effectiveness of the proposed model, Fig. 8 presents the predicted values and absolute errors of the two samples. The relatively low prediction errors shown in Fig. 8 further confirm the reliability of the proposed model. Combined with the insights from the SHAP analysis, this suggests that the model effectively captures the influence of key features on residual strength. 
Therefore, the model can be confidently applied to predictive tasks in practical pipeline safety assessments.

**Fig. 7: Local SHAP analysis for single samples.**

**Fig. 8: The predicted values and errors of the samples.**

Table 2 Parameters of samples used in the experiment

Although the internal mechanisms of the machine learning model can be interpreted using the above method, certain limitations still exist that may impede its broader application. Since most engineers lack expertise in programming machine learning models, this study developed a visualization Web GUI (https://residualstrengthpredictor.streamlit.app/) (Fig. 9) based on the proposed model to facilitate its practical application. The program requires users to input eight key characteristics of the corroded pipeline. Based on the CopulaGAN-LightGBM model proposed in this study, it then predicts the pipeline’s residual strength and generates corresponding force plots and bar plots for interpretation. In these visualizations, red features represent positive contributions to residual strength, while blue features indicate negative contributions.

**Fig. 9: Visualization web GUI for the residual strength prediction model.**

Nevertheless, this work has certain limitations. It does not investigate feature optimization based on interpretability results or analyze the effects of feature interactions on model performance. Additionally, a broader range of pipeline-related features should be considered in the machine learning modeling process. Future work could focus on these aspects to improve both the predictive accuracy and robustness of the model.

In summary, the proposed model not only achieves improved predictive performance but also exhibits strong practical applicability. A lightweight web application was developed to facilitate real-time prediction of residual strength, highlighting its potential for field deployment and integration into digital pipeline integrity management systems. Furthermore, the model demonstrates strong generalization capability, underscoring its applicability across a wide range of materials and defect conditions.

Methods

Proposed framework

This study aims to develop a novel predictive model for the residual strength of corroded pipelines using data augmentation techniques. The overall framework is illustrated in Fig. 10 and consists of four main steps: generation of residual strength datasets, statistical data analysis, machine learning modeling, and performance and interpretability analysis of the model. The detailed process is described as follows:

(1) First, a dataset comprising 453 data points was collected from the public literature³² and subsequently split into a training set (80%) and a test set (20%). Data synthesis for the eight input features in the training set was performed using TVAE, CopulaGAN, and CTGAN. The eight features are pipe diameter (D), wall thickness (t), yield strength (${\sigma }_{y}$), ultimate tensile strength (${\sigma }_{u}$), elastic modulus (E), defect depth (d), defect length (l), and defect width (w). To account for the influence of material variation on residual strength, material properties were explicitly included as continuous input features in the model. This design allows the model to capture the effects of different material grades represented in the dataset. The corresponding residual strength (${P}_{b}$) was predicted using the Stacking-XGBoost model developed by Wang and Lu¹⁹. This stacking ensemble model uses XGBoost as the meta-learner and integrates seven base learners: k-nearest neighbors (KNN), support vector regression (SVR), random forest (RF), multilayer perceptron (MLP), extremely randomized trees (ETR), LightGBM, and XGBoost. The hyperparameters of each base learner were optimized using the FLAML and Optuna frameworks. The model achieved an R² of 0.9571 on the test set, indicating excellent predictive accuracy. More detailed information about the development and structure of the Stacking-XGBoost model can be found in the work by Wang and Lu¹⁹.
(2) Second, the synthesized data were statistically analyzed using histograms, kernel density plots, and Pearson correlation coefficients.
(3) Subsequently, both the original data and the augmented mixed data were used to train the LightGBM model, which was optimized through Bayesian optimization.
(4) Finally, the model’s predictive performance was evaluated using both the original data and data generated through various augmentation strategies, and the optimal model was selected for interpretability analysis.
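The 80/20 split in step (1) can be sketched as follows. This is an illustrative Python snippet, not the authors' code: the seed, the function name, and the choice to round the test size up are assumptions. Rounding up is consistent with the 91 test samples mentioned in the results.

```python
import math
import random


def split_indices(n_samples, test_fraction=0.2, seed=0):
    """Shuffle sample indices and hold out a fraction of them as a test set."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_samples * test_fraction)  # assumption: round up
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(453)
print(len(train_idx), len(test_idx))  # 362 training rows, 91 test rows

# Only the training rows would be passed to the generative models; the test
# rows stay untouched so that evaluation uses real, unseen samples.
```

Keeping the held-out indices fixed across the baseline and the three augmented runs is what makes the four models directly comparable.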

**Fig. 10: The overview of our research framework.**

Data augmentation methods

The variational autoencoder (VAE) is a deep generative model that is widely used to generate synthetic data from real-world datasets³³. At the core of the model are two components: an encoder and a decoder, both implemented using neural network architectures. The encoder compresses the real input data and learns its latent probability distribution, while the decoder generates new data instances based on this inferred latent space distribution. Specifically, the encoder approximates the posterior distribution of the latent variables using variational inference, while the decoder generates new samples resembling the training data by sampling from the latent space and reconstructing the original inputs. This architecture effectively captures the intrinsic features and structures of data, demonstrating excellent performance across various applications such as image generation, data augmentation, and anomaly detection. Although traditional variational autoencoders have achieved considerable success with unstructured data such as images and text, they face challenges when generating and analyzing tabular data. TVAE, proposed by Xu³⁴, is a generative model specifically designed for handling structured tabular data. The hyperparameter values for the TVAE model used in this study are shown in Table 3.
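For reference, the encoder and decoder of a VAE are usually trained jointly by maximizing the evidence lower bound (ELBO). The expression below is the standard textbook form and uses notation that does not appear in the original text: $x$ is a data sample, $z$ the latent variable, ${q}_{\phi }(z|x)$ the encoder, ${p}_{\theta }(x|z)$ the decoder, and $p(z)$ the latent prior.

$${\mathcal{L}}(\theta ,\phi ;x)={{\mathbb{E}}}_{{q}_{\phi }(z|x)}\left[\log {p}_{\theta }(x|z)\right]-{D}_{{\rm{KL}}}\left({q}_{\phi }(z|x)\parallel p(z)\right)$$

The first term rewards accurate reconstruction of the input, while the Kullback-Leibler term keeps the learned latent distribution close to the prior, which is what allows new samples to be drawn from the latent space.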

Table 3 Hyperparameters of the TVAE model

The generative adversarial network (GAN) was introduced by Goodfellow at the NIPS conference. A GAN is trained using two adversarial neural networks, the generator and the discriminator, to produce high-quality samples³⁵. The generator aims to produce realistic samples that can deceive the discriminator, while the discriminator’s task is to distinguish between real and generated samples³⁶. This adversarial process establishes a game-theoretic mechanism that enables the generator to progressively improve the quality of the generated samples through continuous optimization. Copula theory, introduced by Sklar in 1959, is a mathematical tool for modeling the dependency relationships among multivariate random variables. It models the marginal distribution of each variable independently and uses a copula function to capture the dependencies among variables. CopulaGAN is a model that integrates GAN with copula theory. It adopts a separable approach that decouples marginal and joint distributions, allowing the model to learn the marginal properties of each variable independently while using copula functions to model the complex dependencies among them³⁷. The hyperparameter values for the CopulaGAN model used in this study are shown in Table 4.
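The two ideas combined in CopulaGAN can each be written compactly. Both expressions below are standard forms given for reference, and their notation is not taken from the original text: $G$ and $D$ denote the generator and discriminator, ${p}_{data}$ and ${p}_{z}$ the data and noise distributions, $F$ the joint distribution function of $d$ variables, ${F}_{1},\ldots ,{F}_{d}$ their marginals, and $C$ the copula.

The adversarial game between generator and discriminator is the minimax problem

$$\mathop{\min }\limits_{G}\mathop{\max }\limits_{D}V(D,G)={{\mathbb{E}}}_{x\sim {p}_{data}}\left[\log D(x)\right]+{{\mathbb{E}}}_{z\sim {p}_{z}}\left[\log \left(1-D(G(z))\right)\right]$$

and Sklar's theorem states that any joint distribution can be expressed through its marginals and a copula:

$$F({x}_{1},\ldots ,{x}_{d})=C\left({F}_{1}({x}_{1}),\ldots ,{F}_{d}({x}_{d})\right)$$

The second identity is what justifies treating the marginals and the dependence structure separately.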

Table 4 Hyperparameters of the CopulaGAN model

CTGAN is an improved variant of GAN specifically designed for generating structured data, demonstrating superior performance in handling tabular datasets³⁷. CTGAN addresses non-Gaussian and multimodal distributions by incorporating conditional variables and mode-specific normalization, enabling the generated samples to better capture the underlying data distribution while being adjustable based on specific conditions³⁸. This design allows CTGAN to generate high-quality synthetic data while effectively enhancing its diversity and preserving its structure and feature relationships. The hyperparameter values for the CTGAN model used in this study are shown in Table 5.

Table 5 Hyperparameters of the CTGAN model

Machine learning methods

Hyperparameters play a critical role in the performance of machine learning models. In this study, Bayesian optimization combined with five-fold cross-validation was used to determine the optimal hyperparameters of the machine learning model. Bayesian optimization is a powerful global optimization technique. Compared to traditional methods, it can identify a near-optimal configuration with a limited number of evaluations by constructing a probabilistic model of the objective function³⁹. The core concept involves employing a Gaussian process as a surrogate model to estimate the objective function’s distribution across the input space and iteratively updating this model based on newly acquired observations⁴⁰. Bayesian optimization uses the performance of previously tested hyperparameters to infer promising new candidates, exploiting historical information to improve search efficiency. In this study, Bayesian optimization is implemented using the scikit-optimize (skopt) package in Python.
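The five-fold cross-validation that scores each hyperparameter candidate can be sketched in plain Python as below. The function names and the `evaluate` callback are hypothetical, the folds are contiguous for simplicity (in practice the rows would normally be shuffled first), and the sample count of 362 is an assumption derived from 453 samples minus a 91-sample test set.

```python
def k_fold_indices(n_samples, n_folds=5):
    """Split range(n_samples) into n_folds contiguous validation folds."""
    base, extra = divmod(n_samples, n_folds)
    folds = []
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)  # spread the remainder
        validation = list(range(start, start + size))
        training = list(range(start)) + list(range(start + size, n_samples))
        folds.append((training, validation))
        start += size
    return folds


def cross_validated_score(evaluate, params, n_samples, n_folds=5):
    """Average a user-supplied score over the folds.

    `evaluate(params, training, validation)` is a hypothetical callback that
    fits a model on the training indices and returns a validation score.
    """
    folds = k_fold_indices(n_samples, n_folds)
    scores = [evaluate(params, training, validation)
              for training, validation in folds]
    return sum(scores) / len(scores)


folds = k_fold_indices(362)
fold_sizes = [len(validation) for _, validation in folds]
print(fold_sizes)
```

An optimizer would call a function like `cross_validated_score` once per candidate and use the returned average as the objective value to update its surrogate model.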

LightGBM is a highly efficient gradient boosting framework that employs histogram-based decision tree learning⁴¹, allowing it to handle large-scale feature sets and datasets with high computational efficiency. The gradient-based one-side sampling (GOSS) and exclusive feature bundling (EFB) techniques introduced by LightGBM significantly improve training speed and computational efficiency on large datasets without sacrificing much predictive accuracy⁴². The LightGBM predictive model used in this study is implemented using the open-source LightGBM package in Python. LightGBM provides a flexible hyperparameter tuning interface, and Bayesian optimization is used in this study to determine the hyperparameters and thereby improve predictive performance. To ensure fair and comparable experimental results, an identical hyperparameter search space was applied across all trials to determine the optimal LightGBM configurations. The Bayesian optimization hyperparameter search space is “Number of trees” (low = 50, high = 500), “Number of leaves” (low = 20, high = 150), “Minimum child samples” (low = 1, high = 50), “Learning rate” (low = 0.01, high = 0.3), “Feature sampling ratio” (low = 0.1, high = 1.0), “L1 regularization” (low = 0, high = 10), and “L2 regularization” (low = 0, high = 10). The hyperparameter values set in LightGBM are given in Table 6.
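The search space above can be written down as a small configuration, shown here as a Python sketch. The mapping from the descriptive names to LightGBM's scikit-learn-style parameter names (`n_estimators`, `num_leaves`, `min_child_samples`, `learning_rate`, `colsample_bytree`, `reg_alpha`, `reg_lambda`) is an assumption, as is treating the two regularization terms as continuous. The random draw only checks the bounds; it is not Bayesian optimization.

```python
import random

# (kind, low, high) for each hyperparameter; bounds are taken from the text.
SEARCH_SPACE = {
    "n_estimators": ("int", 50, 500),          # Number of trees
    "num_leaves": ("int", 20, 150),            # Number of leaves
    "min_child_samples": ("int", 1, 50),       # Minimum child samples
    "learning_rate": ("float", 0.01, 0.3),     # Learning rate
    "colsample_bytree": ("float", 0.1, 1.0),   # Feature sampling ratio
    "reg_alpha": ("float", 0.0, 10.0),         # L1 regularization
    "reg_lambda": ("float", 0.0, 10.0),        # L2 regularization
}


def draw_candidate(space, rng):
    """Draw one random configuration inside the bounds (a sanity check only)."""
    candidate = {}
    for name, (kind, low, high) in space.items():
        if kind == "int":
            candidate[name] = rng.randint(low, high)
        else:
            candidate[name] = rng.uniform(low, high)
    return candidate


candidate = draw_candidate(SEARCH_SPACE, random.Random(0))
print(candidate)
```

Reusing one such definition for the baseline and for each augmented dataset is a straightforward way to apply the identical search space across all trials.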

Table 6 Hyperparameters of the LightGBM model

Error metrics

This study uses statistical analysis to assess the quality of the data produced by the augmentation models. The coefficient of determination (R²), mean squared error (MSE), mean absolute error (MAE), and mean absolute percentage error (MAPE) are used to assess the predictive performance of the LightGBM model and are defined as follows¹⁹:

$${R}^{2}=1-\frac{\displaystyle {\sum }_{i=1}^{n}{\left({y}_{i}-{p}_{i}\right)}^{2}}{\displaystyle {\sum }_{i=1}^{n}{\left({y}_{i}-\bar{y}\right)}^{2}}$$

(2)

$${\rm{MSE}}=\frac{1}{n}\mathop{\sum }\limits_{i=1}^{n}{({y}_{i}-{p}_{i})}^{2}$$

(3)

$${\rm{MAE}}=\frac{1}{n}\mathop{\sum }\limits_{i=1}^{n}\left|{y}_{i}-{p}_{i}\right|$$

(4)

$${\rm{MAPE}}=\frac{1}{n}\mathop{\sum }\limits_{i=1}^{n}\left|\frac{{p}_{i}-{y}_{i}}{{y}_{i}}\right|\times 100 \%$$

(5)

where $i$ indexes the samples; $n$ denotes the total number of samples; ${y}_{i}$ is the true value; ${p}_{i}$ is the predicted value; and $\bar{y}$ is the mean of the true values.
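Eqs. (2) to (5) translate directly into code. The following Python sketch implements them from the definitions above; the three true and predicted values are made-up numbers chosen so that the results are easy to verify by hand.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination, Eq. (2)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


def mse(y_true, y_pred):
    """Mean squared error, Eq. (3)."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error, Eq. (4)."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean absolute percentage error in percent, Eq. (5)."""
    ratios = [abs((p - y) / y) for y, p in zip(y_true, y_pred)]
    return 100.0 * sum(ratios) / len(y_true)


y_true = [2.0, 4.0, 6.0]   # made-up true values
y_pred = [2.5, 3.5, 6.0]   # made-up predictions

print(r2_score(y_true, y_pred))  # 1 - 0.5 / 8 = 0.9375
print(mse(y_true, y_pred))       # 0.5 / 3
print(mae(y_true, y_pred))       # 1.0 / 3
print(mape(y_true, y_pred))      # (0.25 + 0.125 + 0) / 3 * 100 = 12.5
```

Note that Eq. (5) includes the factor of 100%, whereas the MAPE of 0.0693 reported for CopulaGAN-LightGBM appears to be expressed as a fraction, which would correspond to 6.93%.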

Implementation details

This study uses the Synthetic Data Vault (SDV), created by the Massachusetts Institute of Technology’s artificial intelligence laboratory⁴³, to build the TVAE, CopulaGAN, and CTGAN models. The hyperparameters of these generative models were determined through an extensive trial-and-error process. SDV performs data preprocessing automatically, so no manual normalization is needed when fitting the generative models. Before the LightGBM model was developed, the relevant data were standardized using the StandardScaler from Scikit-learn. All experiments were conducted in a Jupyter notebook on a laptop with a 2.30 GHz 12th Gen Intel Core i7-12700H processor and 16 GB of RAM, running the Windows 11 operating system.
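
Scikit-learn's StandardScaler rescales each feature to zero mean and unit variance using the population standard deviation. The sketch below reproduces that transformation in plain Python to make the preprocessing step concrete. It is an illustration under stated assumptions rather than the study's code: the function names are hypothetical, the feature values are invented, and fitting the statistics on the training split only is an assumed practice that the text does not specify.

```python
from statistics import fmean, pstdev


def fit_standardizer(rows):
    """Compute the per-column mean and population standard deviation."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    # A constant column has zero spread; a scale of 1 keeps it at zero
    # after centering instead of dividing by zero.
    scales = [pstdev(col) or 1.0 for col in columns]
    return means, scales


def standardize(rows, means, scales):
    """Apply z = (x - mean) / scale to every value, column by column."""
    return [[(x - m) / s for x, m, s in zip(row, means, scales)] for row in rows]


# Invented feature rows: wall thickness (mm), defect depth (mm), constant flag.
train = [[8.0, 2.0, 1.0], [10.0, 4.0, 1.0], [12.0, 6.0, 1.0]]
test = [[10.0, 4.0, 1.0]]

means, scales = fit_standardizer(train)  # statistics from the training split only
train_z = standardize(train, means, scales)
test_z = standardize(test, means, scales)  # reuse the training statistics
```

Reusing the training statistics for the test rows keeps information from the held-out samples out of the preprocessing step.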

Data availability

Some or all data, models, or code that support the findings of this study are available from the corresponding author upon reasonable request.

References

Li, X., Jia, R., Zhang, R., Yang, S. & Chen, G. A KPCA-BRANN based data-driven approach to model corrosion degradation of subsea oil pipelines. Reliab. Eng. Syst. Saf. 219, 108231 (2022).
Lu, H. & Cheng, Y. F. Detecting urban gas pipeline leaks using a vehicle–canine collaboration strategy. Nat. Cities 2, 281–282 (2025).
Lu, H., Xi, D., Xiang, Y., Su, Z. & Cheng, Y. F. Vehicle-canine collaboration for urban pipeline methane leak detection. Nat. Cities 2, 336–343 (2025).
Shaik, N. B., Jongkittinarukorn, K., Benjapolakul, W. & Bingi, K. A novel neural network-based framework to estimate oil and gas pipelines life with missing input parameters. Sci. Rep. 14, 4511 (2024).
Shaik, N. B. et al. Recurrent neural network-based model for estimating the life condition of a dry gas pipeline. Process Saf. Environ. Prot. 164, 639–650 (2022).
Xu, L. et al. The research progress and prospect of data mining methods on corrosion prediction of oil and gas pipelines. Eng. Fail. Anal. 144, 106951 (2023).
Kumari, P., Halim, S. Z., Kwon, J. S.-I. & Quddus, N. An integrated risk prediction model for corrosion-induced pipeline incidents using artificial neural network and Bayesian analysis. Process Saf. Environ. Prot. 167, 34–44 (2022).
Soomro, A. A. et al. Analysis of machine learning models and data sources to forecast burst pressure of petroleum corroded pipelines: a comprehensive review. Eng. Fail. Anal. 155, 107747 (2024).
Lu, H., Xu, Z.-D., Iseley, T. & Matthews, J. C. Novel data-driven framework for predicting residual strength of corroded pipelines. J. Pipeline Syst. Eng. Pract. 12, 04021045 (2021).
Wu, T., Miao, X. & Song, F. Residual strength prediction of corroded pipelines based on physics-informed machine learning and domain generalization. npj Mater. Degrad. 9, 12 (2025).
Shaik, N. B., Pedapati, S. R. & Dzubir, F. A. B. A. Remaining useful life prediction of a piping system using artificial neural networks: a case study. Ain Shams Eng. J. 13, 101535 (2022).
Lu, H., Iseley, T., Matthews, J., Liao, W. & Azimi, M. An ensemble model based on relevance vector machine and multi-objective salp swarm algorithm for predicting burst pressure of corroded pipelines. J. Pet. Sci. Eng. 203, 108585 (2021).
Su, Y., Li, J., Yu, B., Zhao, Y. & Yao, J. Fast and accurate prediction of failure pressure of oil and gas defective pipelines using the deep learning model. Reliab. Eng. Syst. Saf. 216, 108016 (2021).
Li, S., Zhang, Z., Qian, H., Wang, H. & Fan, F. Research on remaining bearing capacity evaluation method for corroded pipelines with complex shaped defects. Ocean Eng. 296, 116805 (2024).
Shuai, Y. et al. A novel framework for predicting the burst pressure of energy pipelines with clustered corrosion defects. Thin-Walled Struct. 205, 112413 (2024).
Wang, Q. & Lu, H. Machine learning methods for predicting residual strength in corroded oil and gas steel pipes. npj Mater. Degrad. 9, 30 (2025).
Xiao, R., Zayed, T., Meguid, M. A. & Sushama, L. Predicting failure pressure of corroded gas pipelines: A data-driven approach using machine learning. Process Saf. Environ. Prot. 184, 1424–1441 (2024).
Ma, H. et al. A new hybrid approach model for predicting burst pressure of corroded pipelines of gas and oil. Eng. Fail. Anal. 149, 107248 (2023).
Wang, Q. & Lu, H. A novel stacking ensemble learner for predicting residual strength of corroded pipelines. npj Mater. Degrad. 8, 87 (2024).
Miao, X. & Zhao, H. Novel method for residual strength prediction of defective pipelines based on HTLBO-DELM model. Reliab. Eng. Syst. Saf. 237, 109369 (2023).
Benjamin, A. C., Freire, J. L. F., Vieira, R. D., Diniz, J. L. C. & de Andrade, E. Q. Burst tests on pipeline containing interacting corrosion defects. In: 24th International Conference on Offshore Mechanics and Arctic Engineering (ASME, 2005).
Liu, X. et al. An ANN-based failure pressure prediction method for buried high-strength pipes with stray current corrosion defect. Energy Sci. Eng. 8, 248–259 (2020).
Lo, M., Karuppanan, S. & Ovinis, M. ANN- and FEA-based assessment equation for a corroded pipeline with a single corrosion defect. J. Mar. Sci. Eng. 10, 476 (2022).
Zhao, X.-Y., Chen, J.-X., Chen, G.-M., Xu, J.-J. & Zhang, L.-W. Prediction of ultimate condition of FRP-confined recycled aggregate concrete using a hybrid boosting model enriched with tabular generative adversarial networks. Thin-Walled Struct. 182, 110318 (2023).
Marani, A., Jamali, A. & Nehdi, M. L. Predicting ultra-high-performance concrete compressive strength using tabular generative adversarial networks. Materials 13, 4757 (2020).
Woldesellasse, H. & Tesfamariam, S. Data augmentation using conditional generative adversarial network (cGAN): application for prediction of corrosion pit depth and testing using neural network. J. Pipeline Sci. Eng. 3, 100091 (2023).
Ma, H. et al. Data augmentation of a corrosion dataset for defect growth prediction of pipelines using conditional tabular generative adversarial networks. Materials 17, 1142 (2024).
Soomro, A. A. et al. Data augmentation using SMOTE technique: application for prediction of burst pressure of hydrocarbons pipeline using supervised machine learning models. Results Eng. 24, 103233 (2024).
Qin, G. & Cheng, Y. F. A review on defect assessment of pipelines: principles, numerical solutions, and applications. Int. J. Press. Vessels Pip. 191, 104329 (2021).
Qin, G., Zhang, C., Wang, B., Ni, P. & Wang, Y. An interpretable machine learning model for failure pressure prediction of blended hydrogen natural gas pipelines containing a crack-in-dent defect. Energy 320, 135401 (2025).
Ben Seghier, M. E. A., Mohamed, O. A. & Ouaer, H. Machine learning-based Shapley additive explanations approach for corroded pipeline failure mode identification. Structures 65, 106653 (2024).
Amaya-Gómez, R., Munoz Giraldo, F., Schoefs, F., Bastidas-Arteaga, E. & Sanchez-Silva, M. Recollected burst tests of experimental and FEM corroded pipelines (Mendeley Data, 2019).
Inan, M. S. K., Hossain, S. & Uddin, M. N. Data augmentation guided breast cancer diagnosis and prognosis using an integrated deep-generative framework based on breast tumor’s morphological information. Inform. Med. Unlocked 37, 101171 (2023).
Xu, L., Skoularidou, M., Cuesta-Infante, A. & Veeramachaneni, K. Modeling tabular data using conditional GAN. In: 33rd Conference on Neural Information Processing Systems (NeurIPS, 2019).
Goodfellow, I. et al. Generative adversarial networks. Commun. ACM 63, 139–144 (2020).
He, G., Zhao, Y. & Yan, C. Application of tabular data synthesis using generative adversarial networks on machine learning-based multiaxial fatigue life prediction. Int. J. Press. Vessels Pip. 199, 104779 (2022).
Chia, M. Y., Koo, C. H., Huang, Y. F., Di Chan, W. & Pang, J. Y. Artificial intelligence generated synthetic datasets as the remedy for data scarcity in water quality index estimation. Water Resour. Manag. 37, 6183–6198 (2023).
Zeng, S. et al. Prediction of compressive strength of FRP-confined concrete using machine learning: a novel synthetic data driven framework. J. Build. Eng. 94, 109918 (2024).
Marani, A. & Nehdi, M. L. Predicting shear strength of FRP-reinforced concrete beams using novel synthetic data driven deep learning. Eng. Struct. 257, 114083 (2022).
Chen, J. et al. An error-corrected deep Autoformer model via Bayesian optimization algorithm and secondary decomposition for photovoltaic power prediction. Appl. Energy 377, 124738 (2025).
Ke, G. et al. LightGBM: a highly efficient gradient boosting decision tree. In: 31st Annual Conference on Neural Information Processing Systems (NIPS, 2017).
Janizadeh, S. et al. Advancing the LightGBM approach with three novel nature-inspired optimizers for predicting wildfire susceptibility in Kaua’i and Moloka’i Islands, Hawaii. Expert Syst. Appl. 258, 124963 (2024).
Patki, N., Wedge, R. & Veeramachaneni, K. The synthetic data vault. In: 3rd IEEE/ACM International Conference on Data Science and Advanced Analytics (DSAA, 2016).

Acknowledgements

This study was financially supported by the National Natural Science Foundation of China (grant no. 52402421 and W2531036), the Natural Science Foundation of Jiangsu Province (grant no. BK20220848), and the Start Grant for Talent Attraction of the Ningbo Institute of Materials Technology and Engineering, Chinese Academy of Sciences.

Author information

Authors and Affiliations

State Key Laboratory of Advanced Marine Materials, Ningbo Institute of Materials Technology and Engineering, Chinese Academy of Sciences, Ningbo, China
Qiankun Wang, Hongfang Lu & Y. Frank Cheng
Department of Mathematics and Computer Science, Lawrence Technological University, Southfield, MI, USA
Fan Li

Contributions

Qiankun Wang: conceptualization, methodology, data curation, and writing—original draft. Hongfang Lu: conceptualization and writing—reviewing and editing. Fan Li: visualization and investigation. Y. Frank Cheng: writing—reviewing and editing.

Corresponding author

Correspondence to Hongfang Lu.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Wang, Q., Lu, H., Li, F. et al. Advancing LightGBM with data augmentation for predicting the residual strength of corroded pipelines. npj Mater Degrad 9, 128 (2025). https://doi.org/10.1038/s41529-025-00673-9

Received: 08 June 2025
Accepted: 05 September 2025
Published: 22 October 2025
Version of record: 22 October 2025
DOI: https://doi.org/10.1038/s41529-025-00673-9
