Abstract
Carbon dioxide (CO2) is the main greenhouse gas that drives global warming, climate change, and other environmental issues. CO2 absorption using amine solvents stands out as one of the best-known industrial technologies for CO2 capture. Accurate prediction of CO2 absorption in aqueous amine solutions under different operating conditions is crucial for designing an efficient amine scrubbing system in power plants. In this work, CO2 solubility in aqueous piperazine (PZ) solutions was modeled using 517 experimental data points covering a temperature range of 298 to 373 K, PZ concentration of 0.1 to 6.2 mol/L (M), and CO2 partial pressure of 0.03 to 7399 kPa. To this end, four robust machine learning algorithms, namely gradient boosting with categorical features support (CatBoost), light gradient boosting machine (LightGBM), extreme gradient boosting (XGBoost), and adaptive boosting decision trees (AdaBoost-DT), were utilized. Among the developed models, the CatBoost model presented the highest accuracy, with an overall determination coefficient (R2) of 0.9953 and an average absolute relative error of 2.36%. Sensitivity analysis revealed that CO2 partial pressure had the greatest influence on CO2 absorption in aqueous PZ solutions, followed by PZ concentration and temperature. Moreover, CO2 partial pressure positively influenced CO2 absorption in aqueous PZ solutions, while PZ concentration and temperature exhibited negative effects. Finally, the leverage technique indicated that both the experimental data bank used for modeling and the model’s estimates were statistically acceptable and valid, with only 8 points (∼1.5% of total data) identified as possibly suspect.
Introduction
Climate change and global warming are worsening, and one of the major concerns in this regard is anthropogenic carbon dioxide (CO2) emissions worldwide1,2. CO2 emissions have progressively increased in recent years3, mainly due to fossil fuels and industry. Global warming must be mitigated to prevent reductions in agricultural output and extreme weather patterns4, while industries must meet the energy demand of an additional 900 million people by 2035 5. Various technologies have been developed so far for capturing CO2 from flue gas, such as cryogenic distillation, membrane separation, chemical/physical absorption, adsorption, and bioremediation1,2,6,7,8, among which chemical absorption is the most attractive option for post-combustion CO2 capture at ambient pressure and temperature owing to its low cost and easy implementation2. Among various organic and inorganic solvents proposed for CO2 chemical absorption, aqueous solutions of amines containing reactive nitrogen atoms, which can absorb CO2 in a reversible and selective process, are the most appealing options. Amine structures significantly impact the CO2 capture process9. Amines are also cheap and have low vapor pressures2. Amine-based aqueous solutions can potentially be applied to capture CO2 in power plants10,11,12. Specifically, piperazine (PZ) has shown great potential to absorb CO2 owing to its high absorption capacity (almost twice that of monoethanolamine (MEA))2. Moreover, the products of its reaction with CO2 include PZ carbamate at low loadings and PZ bicarbamate within the operational range of the concentrated PZ process, thereby enhancing the overall rate of CO2 absorption under varying operational parameters13,14.
The CO2 equilibrium absorption capacity, or CO2 solubility, in an amine-based solution is the most significant parameter directly affecting the performance of the solvent in the CO2 absorption process15. This crucial parameter has traditionally been obtained through experimental tests or thermodynamic models developed on the basis of vapor-liquid equilibrium (VLE) theory16,17,18,19. Although experimental procedures are reliable methods for evaluating CO2 loading in amine solutions, they are costly and time-consuming. Thermodynamic-based methodologies, for their part, are not capable of assessing CO2 solubility over broad ranges of operating conditions20. By contrast, recent advancements in computer science have led to the development of powerful and robust machine learning approaches that have been extensively employed in energy and environmental sciences21,22,23,24,25,26,27,28. They have also proven their competency in modeling the CO2 capture process using alkanolamine solvents29. Salooki et al.30 attempted to predict the output variables of a stripper operating in one of the Iranian gas refineries using artificial neural networks (ANN). The output temperature and flow rate of this stripper were also modeled by the support vector machine (SVM) framework31. The process of steady-state CO2 capture in monoethanolamine (MEA) aqueous solution was also modeled using ANN32 and optimized through statistical methods33. Further analogous studies on CO2 capture can be found in the literature34,35,36,37,38.
The CO2 loading in amine-based aqueous solutions was another point of interest among researchers. Ghiasi and Mohammadi39 developed a least-squares support vector machine (LSSVM) to estimate CO2 solubility in a variety of amine solutions with respect to the concentration of amine, temperature, and CO2 partial pressure. A similar study was then conducted by using an adaptive neuro-fuzzy inference system (ANFIS)40. Daneshvar et al.41 implemented an ANN algorithm to estimate CO2 loading in triisopropanolamine (TIPA), TIPA/PZ, and TIPA/MEA solvents. In another study, the absorption capacity of CO2 in diethanolamine (DEA) and methyl-diethanolamine (MDEA) was estimated using radial basis function and multilayer perceptron networks42. More recently, Dashti et al.20 implemented four intelligent approaches to forecast CO2 solubility in twelve amine-based solvents. They concluded that the LSSVM model optimized by coupled simulated annealing (CSA) optimization technique could provide the most reliable results in comparison to the other models.
Given the potential of PZ aqueous solutions in capturing CO2, many researchers have focused on proposing reliable approaches for accurate estimation of the CO2 absorption capacity of PZ solvents. Tatar et al.43 proposed two intelligent approaches, namely CSA-LSSVM and ANFIS coupled with Conjugate Hybrid-Particle Swarm Optimization (CHPSO-ANFIS) to predict CO2 solubility in PZ solutions and reported the superiority of CHPSO-ANFIS model. A similar study was conducted by Yarveicy et al.44 using four intelligent approaches including LSSVM, ANFIS, ANN, and adaptive boosting-classification and regression tree. Dashti et al.45 developed genetic programming (GP) and GA-ANFIS models to predict CO2 solubility in aqueous solutions of PZ using CO2 partial pressure, PZ concentration, and temperature as input variables. Their models were developed using a databank gathered from the literature consisting of 390 data points. They reported average absolute relative deviations (AARDs) of 9.7% and 5.3% for the developed GA-ANFIS and GP models, respectively. To the best of our knowledge, this database represents the most extensive collection utilized for developing predictive models of CO2 loading in aqueous PZ solutions. Furthermore, a thorough literature review indicates that existing models for CO2 solubility in PZ solvents employ outdated algorithms, highlighting the necessity to enhance their applicability across broader operational conditions and to develop novel intelligent approaches using cutting-edge algorithms for estimating CO2 loading in PZ aqueous solutions based on an expanded database.
In this work, an extended databank comprising 517 data points gathered from the open literature is utilized to develop several novel intelligent approaches for estimating CO2 loading in PZ aqueous solutions. To achieve this goal, four robust machine learning algorithms, namely gradient boosting with categorical features support (CatBoost), light gradient boosting machine (LightGBM), extreme gradient boosting (XGBoost), and adaptive boosting decision trees (AdaBoost-DT), are utilized. Then, the performance of the models is evaluated by employing a variety of statistical and graphical assessments. Furthermore, trend analyses are conducted to assess the validity of the best-developed model. Sensitivity analysis is also performed to examine the relationships between the inputs and the outcomes of the model. Finally, the Leverage technique is employed to evaluate the credibility and application range of the best predictive model.
Data gathering
In this work, 517 experimental data points on the absorption of CO2 into aqueous PZ solutions were gathered from the literature46,47,48,49,50,51,52,53. This data bank contains more than 120 additional data points compared with those used in the studies of Dashti et al.20,45. Three independent variables, namely temperature (K), PZ concentration (M), and CO2 partial pressure (kPa), were considered as inputs to the models, while CO2 loading (mol CO2 / mol PZ) is the output. Table 1 reports the statistical description of the data bank used for modeling in this work. As is evident, the solubility of CO2 in aqueous PZ solutions was modeled over a wide range of influencing parameters, including PZ molarities up to 6.2 M, temperatures between 298 and 373 K, and pressures up to about 7400 kPa. A snapshot of how CO2 solubility changes with the three input parameters is displayed in the 2D contour plots of Fig. 1. A quick glance at the contour plots shows that higher CO2 solubility in PZ solutions corresponds to higher CO2 partial pressures, lower temperatures, and lower PZ concentrations.
Figure 2 shows the correlation matrix between all variables in the gathered data bank in this work. The correlation coefficients shown in the matrix can specify the relationship between two variables, where an absolute value close to 1 is deemed a strong relationship and 0 is neutral. Also, positive and negative values demonstrate direct and inverse relationships between the two variables, respectively54. The following formula was used to compute the linear correlation coefficient between two variables55:
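$$r=\frac{\sum_{i=1}^{n}\left(x_{i}-x_{a}\right)\left(y_{i}-y_{a}\right)}{\sqrt{\sum_{i=1}^{n}\left(x_{i}-x_{a}\right)^{2}\sum_{i=1}^{n}\left(y_{i}-y_{a}\right)^{2}}}$$
where \(x_i\) and \(y_i\) are the i-th values of the x-variable and y-variable in the two data sets, respectively, n is the number of data points, and \(x_a\) and \(y_a\) are the averages of the x-variable and y-variable in those data sets, respectively.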
Based on Fig. 2, CO2 partial pressure has a direct relationship with CO2 loading, whereas PZ concentration and temperature have inverse relationships with it. It is important to remember that correlation coefficients measure the strength and direction of a linear relationship but do not imply causation. For example, a correlation coefficient of 0.33 between PZ concentration and temperature suggests a weak positive relationship between these two variables. However, the presence of a correlation does not mean that changes in PZ concentration cause changes in temperature or vice versa. The correlation coefficient simply indicates the degree to which the two variables move together in a linear fashion. This analysis only provides an overview of the correlation coefficient matrix for the data collected in this research, focusing on the linear relationships between the inputs (temperature, PZ concentration, and CO2 partial pressure) and the target variable (CO2 loading). No general conclusions about causation or trends are drawn at this stage. Further analysis, including trend analysis and other statistical methods, is presented later in the manuscript to provide more comprehensive insights and conclusions.
Model development
In this study, four powerful tree-based machine learning algorithms are implemented to predict the CO2 solubility in PZ aqueous solutions accurately considering CO2 partial pressure, temperature, and PZ concentration using a databank comprised of 517 data points. The theoretical concepts behind these intelligent models are described in what follows.
Extreme gradient boosting (XGBoost)
This algorithm is a supervised machine learning approach based on the tree-boosting method and is capable of solving regression tasks as well as ranking and classification problems56,57. XGBoost operates based on the Newton-Raphson method. Analogous to the structure of a decision tree (DT), XGBoost consists of different types of nodes58. In the initial step of model training, the entire databank is divided into k datasets, which then form two distinct internal nodes followed by leaf nodes after the last classification59,60. Once the model structure is complete, the model outputs are calculated as follows:
$$\hat{y}_{i}=\sum_{k=1}^{K}f_{k}\left(x_{i}\right),\quad f_{k}\in F$$
and
$$F=\left\{f\left(x\right)=\omega_{h\left(x\right)}\right\}\quad\left(h:\mathbb{R}^{m}\to T,\ \omega\in\mathbb{R}^{T}\right)$$
where h(x) maps an example x to its leaf index, F represents the space of regression trees, T stands for the number of leaves of the tree, \(f_k\) is the k-th of the K trees, m is the number of input features, and \(\omega\) denotes the leaf weights59. Afterwards, the objective function (L) is iteratively minimized59:
$$L=\sum_{i}l\left(\hat{y}_{i},y_{i}\right)+\sum_{k}\Omega\left(f_{k}\right),\quad \Omega\left(f\right)=\gamma T+\frac{1}{2}\lambda{\Vert \omega \Vert}^{2}$$
where the regularization and loss functions are denoted by Ω and l, respectively, λ signifies the regularization coefficient, and γ is the minimum loss reduction required to create a new leaf. The model uses the parameters γ and λ to control its variance and avoid overfitting. Figure 3 shows a schematic of the XGBoost algorithm.
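To make the roles of γ and λ more concrete, the sketch below computes the Newton-step leaf weight and the gain of one candidate split from first- and second-order gradients, following the standard XGBoost derivation. The squared-error loss, the function names, and the toy numbers are illustrative assumptions and are not taken from this study.

```python
def leaf_weight(grads, hess, lam):
    """Newton-step leaf weight: -G / (H + lambda)."""
    return -sum(grads) / (sum(hess) + lam)


def leaf_score(grads, hess, lam):
    """Structure score G^2 / (H + lambda) of a single leaf."""
    return sum(grads) ** 2 / (sum(hess) + lam)


def split_gain(g_left, h_left, g_right, h_right, lam, gamma):
    """Loss reduction obtained by splitting one leaf in two, minus the penalty gamma."""
    parent = leaf_score(g_left + g_right, h_left + h_right, lam)
    children = leaf_score(g_left, h_left, lam) + leaf_score(g_right, h_right, lam)
    return 0.5 * (children - parent) - gamma


# Toy data (assumed): squared-error loss 0.5 * (pred - target)^2,
# so the gradient is (pred - target) and the Hessian is 1.
targets = [1.0, 2.0, 3.0, 10.0]
preds = [0.0, 0.0, 0.0, 0.0]
grads = [p - t for p, t in zip(preds, targets)]
hess = [1.0] * len(targets)

weight = leaf_weight(grads, hess, lam=1.0)
gain = split_gain(grads[:3], hess[:3], grads[3:], hess[3:], lam=1.0, gamma=0.0)
print(weight, gain)
```

A larger λ shrinks the leaf weights toward zero, and a split is only worthwhile when its gain stays positive after subtracting γ.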
Light gradient boosting machine (LightGBM)
Like XGBoost, LightGBM is a tree-based learning model applicable to a variety of machine learning tasks59,61. LightGBM builds a histogram by splitting feature values into ‘P’ distinct bins so as to reduce memory consumption and speed up the model’s development steps59. This algorithm reduces memory consumption even further by storing the binned values as eight-bit integers62. LightGBM is trained through a leaf-wise process, which is more effective than the traditional level-wise method63,64. It minimizes the error by growing the leaf with the maximum branching gain. However, this process produces a deeper and more complex model that is more prone to overfitting, which should be prevented by defining an upper limit on the tree depth59,65. A schematic of LightGBM is depicted in Fig. 4.
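The histogram idea can be illustrated with a few lines of code. The sketch below assigns feature values to P bins, stores the bin indexes as bytes (eight-bit integers), and accumulates per-bin gradient sums. Equal-width bins, the helper names, and the toy temperatures are simplifying assumptions; the actual binning strategy inside LightGBM may differ.

```python
from bisect import bisect_right


def bin_edges(values, n_bins):
    """Interior edges of n_bins equal-width bins spanning the observed range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + width * i for i in range(1, n_bins)]


def to_bins(values, edges):
    """Map each value to its bin index; bytes() stores one 8-bit integer per value."""
    return bytes(bisect_right(edges, v) for v in values)


def gradient_histogram(bins, grads, n_bins):
    """Per-bin gradient sums and counts, the quantities scanned when searching for a split."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for b, g in zip(bins, grads):
        sums[b] += g
        counts[b] += 1
    return sums, counts


# Toy feature column (assumed): temperatures in K, binned with P = 4.
temperatures = [298.0, 313.0, 333.0, 353.0, 373.0]
edges = bin_edges(temperatures, n_bins=4)
bins = to_bins(temperatures, edges)
sums, counts = gradient_histogram(bins, [0.5, -0.5, 1.0, 2.0, -1.0], n_bins=4)
print(list(bins), counts)
```

With at most 256 bins, candidate split points are evaluated per bin rather than per distinct feature value, which is where the speed and memory savings come from.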
Categorical boosting (CatBoost)
As another variant of gradient boosting, CatBoost handles categorical columns by means of target-based statistics and the one_hot_max_size (OHMS) parameter66,67,68. The algorithm employs a greedy method to split a tree and to handle the exponential growth of feature combinations66. If a feature has more categories than OHMS, the following steps are applied in the algorithm:
1. Making random subsets from the available records.
2. Converting labels into integers.
3. Using the equation below to transform categorical features to numeric ones66:
$$avgTarget=\frac{CountInClass+prior}{totalCount+1}$$
where CountInClass and totalCount denote the number of targets and the number of preceding objects, respectively, while prior is specified by the starting parameters to count objects66. CatBoost prevents overfitting through ordered boosting, regularization, and early stopping, ensuring effective handling of categorical features and robust model performance. This algorithm is schematically illustrated in Fig. 5.
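The avgTarget transformation can be sketched as follows. For each record, only the records that precede it in the (optionally shuffled) order contribute to the counts, so a record never sees its own target. The function name, the toy solvent labels, and the prior of 0.5 are hypothetical; the three inputs used in this work are all numeric, so the example is purely illustrative.

```python
import random


def ordered_target_statistic(categories, targets, prior, seed=None):
    """Encode a categorical column as (CountInClass + prior) / (totalCount + 1),
    using only the records that precede each one in the processing order."""
    order = list(range(len(categories)))
    if seed is not None:
        random.Random(seed).shuffle(order)  # random ordering of the records
    count_in_class = {}  # running sum of targets seen so far, per category
    total_count = {}     # running number of records seen so far, per category
    encoded = [0.0] * len(categories)
    for i in order:
        c = categories[i]
        encoded[i] = (count_in_class.get(c, 0.0) + prior) / (total_count.get(c, 0) + 1)
        count_in_class[c] = count_in_class.get(c, 0.0) + targets[i]
        total_count[c] = total_count.get(c, 0) + 1
    return encoded


# Toy example (assumed labels and binary targets), processed in the given order.
solvent = ["PZ", "MEA", "PZ", "PZ"]
label = [1, 0, 0, 1]
encoded = ordered_target_statistic(solvent, label, prior=0.5)
print(encoded)
```

The first occurrence of every category is encoded with the prior alone, and later occurrences move toward the running mean target of that category.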
Adaptive boosting decision tree (AdaBoost-DT)
AdaBoost was first introduced by Freund and Schapire69 as a powerful tool that is capable of learning from the mistakes of weak learners and building a strong classifier/regressor. In this algorithm, an initial group of learners is developed based on weighted datasets, and different weights are assigned to each learner with respect to its accuracy70. Training samples that are predicted poorly receive higher weights, so that subsequent learners concentrate on them. The algorithm typically follows the steps below71:
1. Allocating initial weights: \(w_j = 1/n\), j = 1, 2, …, n.
2. Developing weak learners based on training data and obtaining weighted errors of each learner.
3. Assigning weights to each learner.
4. Updating the weight of the training samples.
5. Testing the learners with testing data.
In this study, decision trees (DTs) were employed as weak learners. A schematic of the AdaBoost-DT algorithm is shown in Fig. 6.
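The five steps can be sketched in their simplest form, the classic discrete AdaBoost for binary labels with one-split decision stumps as weak learners. This is a simplification chosen for brevity: the present study addresses a regression problem with decision trees, and the function names and toy data below are assumptions.

```python
import math


def stump_predict(x, threshold, sign):
    """One-split decision stump: predicts `sign` at or below the threshold, `-sign` above."""
    return sign if x <= threshold else -sign


def train_stump(xs, ys, weights):
    """Step 2: pick the stump with the smallest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for sign in (1, -1):
            error = sum(w for w, x, y in zip(weights, xs, ys)
                        if stump_predict(x, threshold, sign) != y)
            if best is None or error < best[0]:
                best = (error, threshold, sign)
    return best


def adaboost(xs, ys, rounds):
    n = len(xs)
    weights = [1.0 / n] * n                            # step 1: w_j = 1/n
    ensemble = []
    for _ in range(rounds):
        error, threshold, sign = train_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)      # guard against log(0)
        alpha = 0.5 * math.log((1 - error) / error)    # step 3: learner weight
        ensemble.append((alpha, threshold, sign))
        # Step 4: raise the weights of misclassified samples, lower the others.
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, sign))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Step 5: weighted vote of all weak learners."""
    score = sum(alpha * stump_predict(x, t, s) for alpha, t, s in ensemble)
    return 1 if score >= 0 else -1


# Toy data (assumed): no single stump separates these labels.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]
model = adaboost(xs, ys, rounds=3)
print([predict(model, x) for x in xs])
```

No single stump can separate the toy labels, but the weighted vote of three stumps can, which is the sense in which weak learners are combined into a strong one.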
Assessment of models
Using five statistical indicators, namely determination coefficient (R2), average absolute percent relative error (AAPRE), average percent relative error (APRE), standard deviation (SD), and root mean square error (RMSE), the accuracy of the proposed models was assessed. These statistical criteria are listed below72:
In these formulas, \(Y_{i,exp}\), \(Y_{i,pred}\), and N denote the experimental CO2 solubility data, the CO2 solubility data predicted by the proposed models, and the number of data points, respectively.
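As a minimal sketch, the five indicators can be computed as shown below, using definitions that are common in the literature. The sign convention of the relative error (experimental minus predicted), the N − 1 divisor in SD, and the toy values are assumptions and may differ from the exact expressions used by the authors.

```python
import math


def relative_errors(y_exp, y_pred):
    """Relative deviation of each prediction (assumed sign: experimental minus predicted)."""
    return [(e - p) / e for e, p in zip(y_exp, y_pred)]


def apre(y_exp, y_pred):
    """Average percent relative error; signed, so it reveals systematic bias."""
    rel = relative_errors(y_exp, y_pred)
    return 100.0 * sum(rel) / len(rel)


def aapre(y_exp, y_pred):
    """Average absolute percent relative error."""
    rel = relative_errors(y_exp, y_pred)
    return 100.0 * sum(abs(r) for r in rel) / len(rel)


def rmse(y_exp, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((e - p) ** 2 for e, p in zip(y_exp, y_pred)) / len(y_exp))


def sd(y_exp, y_pred):
    """Standard deviation of the relative errors (assumed N - 1 divisor)."""
    rel = relative_errors(y_exp, y_pred)
    return math.sqrt(sum(r ** 2 for r in rel) / (len(rel) - 1))


def r2(y_exp, y_pred):
    """Determination coefficient."""
    mean_exp = sum(y_exp) / len(y_exp)
    ss_res = sum((e - p) ** 2 for e, p in zip(y_exp, y_pred))
    ss_tot = sum((e - mean_exp) ** 2 for e in y_exp)
    return 1.0 - ss_res / ss_tot


# Toy loadings (assumed values, mol CO2 / mol PZ).
y_exp = [1.0, 2.0, 4.0]
y_pred = [1.1, 1.8, 4.0]
print(apre(y_exp, y_pred), aapre(y_exp, y_pred), rmse(y_exp, y_pred),
      sd(y_exp, y_pred), r2(y_exp, y_pred))
```

In the toy case the signed errors cancel, so APRE is close to zero while AAPRE is not, which is why both indicators are reported together.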
In tandem with statistical analysis, this work incorporates graphical evaluation of model outcomes, outlined succinctly as follows:
Cross-plot: This analysis allows the cross-plotting of two sets of data (experimental and modeling data). The more data concentrated around the unit-slope line, the better the estimates of the model.
Trend plot: In this analysis, the validity of the model is assessed by plotting both experimental and modeling data according to the inputs.
Error distribution graph: In this analysis, the error distribution around the zero error line is assessed to specify the possible error trend of the model. In this graphical analysis, the percent relative error (Ei) values are used, which can be calculated according to the following formula:
Cumulative frequency plot: This analysis allows checking the accuracy of models by plotting the absolute relative error (Ea), as calculated using the following formula, versus the proportion of the data.
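The quantities behind the error distribution graph and the cumulative frequency plot can be sketched as follows. The sign convention of \(E_i\) (experimental minus predicted), the function names, and the toy data are assumptions.

```python
def percent_relative_errors(y_exp, y_pred):
    """E_i in percent for every data point (assumed sign: experimental minus predicted)."""
    return [100.0 * (e - p) / e for e, p in zip(y_exp, y_pred)]


def cumulative_frequency(abs_errors):
    """Pairs (error, fraction of the data whose error is at or below it)."""
    ordered = sorted(abs_errors)
    n = len(ordered)
    return [(err, (i + 1) / n) for i, err in enumerate(ordered)]


def error_at_fraction(abs_errors, fraction):
    """Smallest error bound that covers at least `fraction` of the data."""
    for err, covered in cumulative_frequency(abs_errors):
        if covered >= fraction:
            return err
    return max(abs_errors)


# Toy data (assumed): ten points whose predictions are 1% to 10% too low.
y_exp = [100.0] * 10
y_pred = [99.0, 98.0, 97.0, 96.0, 95.0, 94.0, 93.0, 92.0, 91.0, 90.0]
e_i = percent_relative_errors(y_exp, y_pred)
e_a = [abs(e) for e in e_i]
print(error_at_fraction(e_a, 0.7))
```

Reading a cumulative frequency plot at a given fraction, such as 70% of the data, corresponds to calling `error_at_fraction` with that fraction.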
Results and discussion
Developed models
In this work, CO2 absorption in aqueous PZ solutions was modeled using robust boosting machine learning algorithms. In this regard, 517 experimental data points were used in the modeling process, with temperature, PZ concentration, and CO2 partial pressure as input parameters. Here, 80% of the data was utilized for model training, while the remaining 20% served as the test subset. To evaluate model performance and ensure unbiased predictions, the widely used 10-fold cross-validation method was applied. This technique involves partitioning the dataset into ten equal segments, referred to as “folds.” In each iteration, one fold is set aside for validation, while the other nine are used for training the model73. This process repeats ten times, with each fold serving as the validation set once. Consequently, the model is trained on nine folds before each validation, cycling through all folds to provide a comprehensive assessment. A grid search was employed for optimizing the hyperparameters of each model throughout the modeling process. Grid search is a method for optimizing hyperparameters by exhaustively evaluating all combinations on a predefined grid of values, using cross-validation to assess model performance. It systematically trains and evaluates the model for each combination to identify the best-performing parameters. The search range and tuned values of the principal hyperparameters obtained in the modeling process are reported in Table 2. Hyperparameter tuning plays a crucial role in minimizing prediction errors in machine learning models74. In addition to grid search, metaheuristic optimization methods such as genetic algorithms, particle swarm optimization, and grey wolf optimization can effectively navigate vast hyperparameter spaces to quickly discover optimal solutions, as demonstrated in the literature73.
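The interplay of k-fold cross-validation and grid search can be sketched in plain Python. A tiny nearest-neighbour regressor stands in for the boosting models so that the example stays self-contained; the stand-in model, the function names, the grid values, and the toy data are all assumptions and do not reproduce the tuning reported in Table 2.

```python
import itertools
import random


def k_fold_indices(n_samples, n_folds=10, seed=0):
    """Yield (training, validation) index lists; each sample is validated exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, validation in enumerate(folds):
        training = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield training, validation


def knn_predict(x_train, y_train, x_query, n_neighbors):
    """Stand-in model: mean target of the nearest training points."""
    nearest = sorted(range(len(x_train)), key=lambda j: abs(x_train[j] - x_query))
    nearest = nearest[:n_neighbors]
    return sum(y_train[j] for j in nearest) / len(nearest)


def cross_validated_mse(x, y, params, n_folds=10):
    """Average validation mean squared error over all folds."""
    fold_errors = []
    for training, validation in k_fold_indices(len(x), n_folds):
        x_tr = [x[j] for j in training]
        y_tr = [y[j] for j in training]
        squared = [(knn_predict(x_tr, y_tr, x[j], **params) - y[j]) ** 2
                   for j in validation]
        fold_errors.append(sum(squared) / len(squared))
    return sum(fold_errors) / len(fold_errors)


def grid_search(x, y, grid, n_folds=10):
    """Evaluate every combination in the grid and keep the one with the lowest CV error."""
    names = sorted(grid)
    best = None
    for combo in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = cross_validated_mse(x, y, params, n_folds)
        if best is None or score < best[1]:
            best = (params, score)
    return best


# Toy data (assumed): a smooth one-dimensional response.
x = [float(i) for i in range(40)]
y = [v ** 2 for v in x]
best_params, best_score = grid_search(x, y, {"n_neighbors": [1, 3, 5]})
print(best_params, best_score)
```

Replacing `knn_predict` with the fit-and-predict step of a boosting library and extending the grid to several hyperparameters gives the general pattern described above.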
Statistical and graphical evaluation of models
To evaluate the accuracy of the proposed models, Table 3 summarizes the values of R2, RMSE, APRE, AAPRE, and SD. According to statistical principles, the closer the R2 of a model is to 1 and the lower its values of RMSE, AAPRE, APRE, and SD, the more accurate and valid that model is. As shown in Table 3, the CatBoost model yields AAPRE values of 2.36%, 2.06%, and 3.54% for the total, train, and test collections, respectively, which are the lowest error values among the four models developed in this work. Furthermore, this model shows the highest overall R2 value of 0.9953 along with the lowest values of APRE, RMSE, and SD compared to the remaining three models. Hence, the CatBoost model can be considered the most accurate model developed in this study for predicting CO2 absorption in aqueous PZ solutions. Summing up the statistical analyses, the models rank from best to weakest performance as CatBoost, LightGBM, XGBoost, and AdaBoost-DT.
Moreover, the performance of the suggested models was compared using graphical error analyses. First, Fig. 7 illustrates cross-plots of the data predicted by the developed models versus the experimental data. As is evident, all the boosting models perform well, with most of the data points lying around the unit-slope line. However, the CatBoost model delivers the cloud of data closest to this line, suggesting that its estimations match the experimental values better than those of the other models.
In the subsequent stage, the distributions of the CO2 solubility estimation errors applying developed models against the experimental data were plotted in Fig. 8. As can be seen, the predictions of the models developed in this work show relative errors close to zero, which confirms their accuracy and reliability. However, again, the CatBoost model delivers relatively lower errors than others, and the formed cloud of errors is more concentrated near the zero error line.
Next, Fig. 9 depicts the cumulative frequency of the absolute relative error for different models when applied to the whole data bank. If the yellow horizontal dashed line that defines 70% of the data in the figure is considered, it can be seen that the CatBoost, LightGBM, XGBoost, and AdaBoost-DT models show absolute relative errors of 2.7%, 3.2%, 3.9%, and 4.9%, respectively, which means that the error of the models for predicting 70% of the data is less than the mentioned values. Similarly, about 90% of the estimated values by the CatBoost model had an absolute relative error of less than 5%, while the error values of other models are more than this. These observations along with other statistical and graphical analyses prove that the CatBoost model is highly accurate for predicting CO2 absorption into aqueous PZ solutions.
Trend analysis
At this stage, it is checked how well the CatBoost model captures the physical trend of CO2 absorption in aqueous PZ solutions with respect to the influencing variables. First, the predictions of the CatBoost model for the solubility of CO2 in 0.2 M PZ solution, as studied experimentally in the literature48, were investigated with respect to temperature and partial pressure of CO2. As illustrated in Fig. 10, CO2 absorption values increased with increasing CO2 partial pressure. This behavior is due to the larger driving force for absorption at higher CO2 partial pressure. Experimental studies showed that when CO2 or a sour gas is added to an aqueous PZ solution, since the gas is mainly dissolved in non-volatile and ionic form, the total pressure initially rises very slightly with an increasing amount of gas in the liquid. For higher gas loadings, the total pressure, and with it the CO2 partial pressure, increases steeply once PZ has been consumed in the liquid phase by chemical reactions. This means that more sour gas can no longer be absorbed chemically but must be dissolved physically75,76. By contrast, temperature has an adverse effect on the CO2 solubility in PZ aqueous solution, and the CO2 loading decreases significantly as the temperature increases. Indeed, CO2 absorption in aqueous PZ (amine) solutions decreases at higher temperatures due to the exothermic nature of the chemisorption mass transfer process. Moreover, lower temperatures raise the viscosity of the liquid phase, thus lowering CO2 diffusion coefficients and consequently decreasing CO2 solubility50,77. Considering the modeling results illustrated in the figure, the proposed CatBoost model accurately recognizes the absorption trend of the gas and forecasts the CO2 loading in aqueous PZ solution at various temperatures and pressures.
Fig. 10. The effect of temperature on CO2 solubility in 0.2 M PZ solution; experimental data48 and CatBoost model predictions.
Next, the impact of PZ concentration on CO2 solubility in PZ solutions was investigated at a fixed temperature of 303.15 K with respect to the partial pressure of CO2, as experimentally studied in the literature53, and compared with CatBoost model predictions in Fig. 11. As shown in Fig. 11, increasing the PZ concentration decreases the CO2 loading at constant temperature and pressure. The free amine concentration, being a component of the mass transfer coefficient, has the potential to influence CO2 mass transfer. With the increase of PZ concentration, the viscosity of the liquid phase increases, and therefore the CO2 diffusion coefficient decreases slightly, leading to a decrease in the solubility of CO2 at constant temperature and pressure50. Moreover, at higher CO2 partial pressures for a more dilute solution, more physical absorption of the gas can be observed, which can ultimately lead to greater solubility of the gas compared to lower CO2 partial pressures. Actually, a stronger PZ solution cannot be loaded to a high extent since the physically absorbed CO2 is negligible in comparison to the chemically absorbed CO253. Again, the modeling results shown in the figure exhibit that the CatBoost model has an outstanding prediction for CO2 solubility in PZ solutions with different concentrations and at different pressures.
Fig. 11. The effect of PZ concentration on CO2 solubility in aqueous PZ solutions; experimental data53 and CatBoost model predictions.
Sensitivity analysis
In this study, the Pearson and Spearman correlation coefficients were calculated to check the impact of the three inputs, namely temperature, CO2 partial pressure, and PZ concentration, on the output of the CatBoost model (i.e., CO2 solubility in aqueous PZ solutions). For the Pearson correlation coefficient, the formula used to compute the linear effect of the input parameters is given below55,78:
$$r_{i}=\frac{\sum_{j=1}^{n}\left(z_{i,j}-z_{a,i}\right)\left(y_{j}-y_{a}\right)}{\sqrt{\sum_{j=1}^{n}\left(z_{i,j}-z_{a,i}\right)^{2}\sum_{j=1}^{n}\left(y_{j}-y_{a}\right)^{2}}}$$
Here, \({z}_{i,j}\) and \({z}_{a,i}\) stand for the j-th and average values of the i-th input parameter, respectively, and n is the number of data points. The index \(i\) refers to temperature, CO2 partial pressure, or PZ concentration. In addition, \({y}_{a}\) and \({y}_{j}\) denote the average and the j-th values of the estimated CO2 solubility in aqueous PZ solutions.
The Spearman correlation coefficient measures the association between the rankings of two variables using a monotonic function, enabling detection of non-linear relationships. It is robust against sample data distribution, unlike parametric methods, and uses a specific formula given below for rank correlation analysis79:
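$$\rho=\frac{\frac{1}{n}\sum_{i=1}^{n}\left(R\left(z_{i}\right)-R_{a}\left(z\right)\right)\left(R\left(y_{i}\right)-R_{a}\left(y\right)\right)}{\sqrt{\left(\frac{1}{n}\sum_{i=1}^{n}\left(R\left(z_{i}\right)-R_{a}\left(z\right)\right)^{2}\right)\left(\frac{1}{n}\sum_{i=1}^{n}\left(R\left(y_{i}\right)-R_{a}\left(y\right)\right)^{2}\right)}}$$
Here, n denotes the number of data points and ρ is the Spearman rank correlation coefficient. R(z) stands for the rank of variable z, while \(R_a(z)\) represents its average rank. Likewise, R(y) is the rank of variable y, and \(R_a(y)\) is its average rank.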
The correlation coefficients range from −1 to 1; the higher the absolute value for a parameter, the greater its impact on the output of the model80. Positive or negative values of the correlation coefficients for a parameter indicate an increasing or decreasing effect of that parameter on the model’s output, respectively81,82. The Pearson and Spearman correlation coefficients for all inputs, calculated using the results of the CatBoost model as the best paradigm developed in this work, are shown in Fig. 12. Among the input parameters, CO2 partial pressure had the greatest influence on CO2 absorption in aqueous PZ solutions. PZ concentration and temperature followed, in that order, with only a slight difference between them. Both temperature and PZ concentration exhibit inverse relationships with CO2 solubility, as indicated by negative Pearson coefficients (−0.299 and −0.355, respectively) and even stronger negative Spearman coefficients (−0.361 and −0.383, respectively), suggesting the presence of non-linear elements in these relationships. Conversely, CO2 partial pressure shows a strong positive correlation with CO2 solubility, with a Pearson coefficient of 0.621 and an even higher Spearman coefficient of 0.862, highlighting significant non-linear dynamics. In summary, while temperature and PZ concentration negatively influence CO2 solubility with some non-linear effects, CO2 partial pressure positively affects solubility, predominantly through non-linear effects.
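The difference between the two coefficients can be reproduced with a short sketch. Spearman's coefficient is computed here as the Pearson coefficient of the ranks, with tied values receiving their average rank. The helper names and the toy data, a response that grows with the logarithm of pressure, are assumptions chosen to mimic a monotonic but non-linear dependence.

```python
import math


def pearson(x, y):
    """Linear correlation coefficient between two equally long sequences."""
    xa = sum(x) / len(x)
    ya = sum(y) / len(y)
    cov = sum((a - xa) * (b - ya) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - xa) ** 2 for a in x) * sum((b - ya) ** 2 for b in y))


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Rank correlation: Pearson coefficient applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data (assumed): the response grows with the logarithm of pressure.
pressure = [0.1, 1.0, 10.0, 100.0, 1000.0]
loading = [math.log10(p) for p in pressure]
print(pearson(pressure, loading), spearman(pressure, loading))
```

For the toy data the Spearman coefficient equals 1 because the relationship is perfectly monotonic, whereas the Pearson coefficient is clearly lower because the relationship is not linear. A gap of this kind between the two coefficients is what points to non-linear behavior.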
Leverage approach
To appraise the validity region of the proposed CatBoost model and to detect any dubious data, the leverage technique83,84,85 was utilized in this study. In this approach, the differences between the model’s estimates and the experimental data are scaled into standardized residuals (SR). Taking \(H_i\) as the i-th leverage value, \(e_i\) as the error value, and MSE as the mean square error, SR values are given below86,87:
$$SR_{i}=\frac{e_{i}}{\sqrt{MSE\left(1-H_{i}\right)}}$$
The leverage values are the hat indexes, i.e., the elements on the main diagonal of the Hat matrix. Considering X as a (k × l) matrix with k rows (data points) and l columns (input parameters), and T as the transpose operator, the Hat matrix is determined as follows86:
$$H=X{\left(X^{T}X\right)}^{-1}X^{T}$$
In addition, the critical leverage (H*) is a fixed value for a given data bank and can be computed as follows85,88:
$$H^{*}=\frac{3\left(l+1\right)}{k}$$
With k = 517 and l = 3, this gives H* ≈ 0.0232.
Williams’s plot is usually drawn for a visual representation of the applicability scope of a model and of the doubtful data existing in the data bank, as shown in Fig. 13 for the CatBoost model developed in this study. Here, bad high leverage points are those having SR values greater than 3 or less than −3, regardless of their Hat values. As shown in Fig. 13, only 8 data points (∼1.5% of data) were identified as suspected data, which means that these data may be affected by experimental error. Moreover, the data points having SR values between −3 and 3 with a Hat value higher than H* (0.0232) are termed good high leverage points. As Williams’s plot shows, 20 data points were identified as probable outliers, which means that, despite being estimated accurately, these data lie beyond the applicability scope of the model and differ from most of the data. In conclusion, both the experimental data bank utilized for modeling and the model’s estimates were statistically acceptable and valid. Table 4 provides a list of the suspected data along with the outliers identified for the proposed CatBoost model using the leverage technique.
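A minimal sketch of the leverage calculation is given below. It obtains the hat values from the diagonal of X(XᵀX)⁻¹Xᵀ using a small Gaussian-elimination solver, so that no external library is needed, and then applies the ±3 and H* limits described above. The function names, the category labels, and the toy matrix are assumptions; taking MSE as the plain mean of the squared errors is also an assumption.

```python
import math


def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (m[r][n] - sum(m[r][c] * z[c] for c in range(r + 1, n))) / m[r][r]
    return z


def hat_values(x):
    """Diagonal of X (X^T X)^-1 X^T for a k-by-l input matrix given as a list of rows."""
    l = len(x[0])
    xtx = [[sum(row[a] * row[b] for row in x) for b in range(l)] for a in range(l)]
    return [sum(v * z for v, z in zip(row, solve(xtx, row))) for row in x]


def standardized_residuals(errors, hats):
    """SR_i = e_i / sqrt(MSE * (1 - H_i)), with MSE taken as the mean squared error."""
    mse = sum(e * e for e in errors) / len(errors)
    return [e / math.sqrt(mse * (1.0 - h)) for e, h in zip(errors, hats)]


def critical_leverage(k, l):
    """H* = 3 (l + 1) / k for k data points and l input parameters."""
    return 3.0 * (l + 1) / k


def classify(sr, hat, h_star):
    """Labels (assumed wording) following the limits used in a Williams plot."""
    if abs(sr) > 3.0:
        return "suspected"
    if hat > h_star:
        return "good high leverage"
    return "valid"


# Toy input matrix (assumed): four data points, two input parameters.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
hats = hat_values(x)
print(hats, critical_leverage(517, 3))
```

The hat values of a full-rank input matrix sum to the number of columns, which is a quick sanity check on any implementation.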
In reviewing the literature43,44,45, various models such as GP, GA-ANFIS, LSSVM, ANFIS, AdaBoost-CART, CHPSO-ANFIS, and CSA-LSSVM have been effectively utilized to address similar problems. These models have shown considerable success in their respective applications. The present work introduces the application of tree-based boosting algorithms to this domain, which have proven to be highly effective in regression problems but have not been previously applied to this specific subject of study. Through meticulous hyperparameter tuning using grid search and cross-validation, significant improvements in prediction accuracy were achieved, underscoring the potential of these algorithms in this context. For future work, incorporating new datasets to further validate and enhance the model’s robustness is proposed. Additionally, exploring advanced metaheuristic optimization techniques and developing novel algorithms could offer further performance gains, ensuring the models remain at the forefront of predictive accuracy and reliability.
Conclusions
In this study, CO2 solubility in aqueous PZ solutions was modeled using 517 experimental data points and four robust machine learning algorithms, namely CatBoost, LightGBM, XGBoost, and AdaBoost-DT. The CatBoost model yielded the lowest error values among the four models developed in this work, with AAPRE values of 2.36%, 2.06%, and 3.54% for the total, train, and test collections, respectively. After the CatBoost model, the LightGBM, XGBoost, and AdaBoost-DT models ranked from best to weakest performance, in that order. Based on sensitivity analysis, CO2 partial pressure had the greatest influence on CO2 absorption in aqueous PZ solutions among the input parameters. PZ concentration and temperature followed, in that order, with only a slight difference between them. Furthermore, both temperature and PZ concentration exhibited inverse relationships with CO2 solubility, as indicated by negative Pearson and even stronger negative Spearman coefficients, suggesting the presence of non-linear elements. Conversely, CO2 partial pressure showed a strong positive correlation with CO2 solubility, with a higher Spearman coefficient highlighting significant non-linear dynamics. Finally, data assessment using the Leverage approach showed that 20 data points were probable outliers, which means that, despite being estimated accurately, these data lie beyond the applicability scope of the model and are statistically different from most of the data. Moreover, both the experimental data bank used for modeling and the model’s estimates were statistically acceptable and valid, with only 8 points (∼1.5% of total data) identified as possibly suspect.
Data availability
The databank utilized during this research is available from the corresponding author on reasonable request.
References
Aghel, B., Behaein, S., Wongwises, S. & Shadloo, M. S. A review of recent progress in biogas upgrading: With emphasis on carbon capture. Biomass Bioenergy 160, 106422 (2022).
Aghel, B., Janati, S., Wongwises, S. & Shadloo, M. S. Review on CO2 capture by blended amine solutions. Int. J. Greenh. Gas Control 119, 103715 (2022).
Friedlingstein, P. et al. Global carbon budget 2022. In Earth System Science Data Discussions 1–159 (2022).
Gelles, T., Lawson, S., Rownaghi, A. A. & Rezaei, F. Recent advances in development of amine functionalized adsorbents for CO2 capture. Adsorption 26, 5–50 (2020).
Zhang, F., Zhao, P., Niu, M. & Maddy, J. The survey of key technologies in hydrogen energy storage. Int. J. Hydrog. Energy 41, 14535–14552 (2016).
Chen, P. C., Cho, H. H., Jhuang, J. H. & Ku, C. H. Selection of mixed amines in the CO2 capture process. Carbon. 7, 25 (2021).
Wu, S. Y., Liu, Y. F., Chu, C. Y., Li, Y. C. & Liu, C. M. Optimal absorbent evaluation for the CO2 separating process by absorption loading, desorption efficiency, cost, and environmental tolerance. Int. J. Green Energy 12, 1025–1030 (2015).
Olabi, A. et al. Membrane-based carbon capture: Recent progress, challenges, and their role in achieving the sustainable development goals. Chemosphere 320, 137996 (2023).
Dai, N. & Mitch, W. A. Influence of amine structural characteristics on N-nitrosamine formation potential relevant to postcombustion CO2 capture systems. Environ. Sci. Technol. 47, 13175–13183 (2013).
Bui, M. et al. Carbon capture and storage (CCS): The way forward. Energy Environ. Sci. 11, 1062–1176 (2018).
Liang, Z. H. et al. Recent progress and new developments in post-combustion carbon-capture technology with amine based solvents. Int. J. Greenh. Gas Control 40, 26–54 (2015).
Wang, Z., Zhang, Z. & Mitch, W. A. Role of absorber and desorber units and operational conditions for N-nitrosamine formation during amine-based carbon capture. Water Res. 170, 115299 (2020).
Aghel, B., Sahraie, S., Heidaryan, E. & Varmira, K. Experimental study of carbon dioxide absorption by mixed aqueous solutions of methyl diethanolamine (MDEA) and piperazine (PZ) in a microreactor. Process Saf. Environ. Prot. 131, 152–159 (2019).
Kim, Y. E., Choi, J. H., Nam, S. C. & Yoon, Y. I. CO2 absorption characteristics in aqueous K2CO3/piperazine solution by NMR spectroscopy. Ind. Eng. Chem. Res. 50, 9306–9313 (2011).
Rochelle, G. T. Amine scrubbing for CO2 capture. Science 325, 1652–1654 (2009).
Aronu, U. E. et al. Solubility of CO2 in 15, 30, 45 and 60 mass% MEA from 40 to 120 °C and model representation using the extended UNIQUAC framework. Chem. Eng. Sci. 66, 6393–6406 (2011).
Chen, C. C. & Evans, L. B. A local composition model for the excess Gibbs energy of aqueous electrolyte systems. AIChE J. 32, 444–454 (1986).
Fouad, W. A. & Berrouk, A. S. Prediction of H2S and CO2 solubilities in aqueous triethanolamine solutions using a simple model of Kent–Eisenberg type. Ind. Eng. Chem. Res. 51, 6591–6597 (2012).
Haghtalab, A. & Dehghani Tafti, M. Electrolyte UNIQUAC-NRF model to study the solubility of acid gases in alkanolamines. Ind. Eng. Chem. Res. 46, 6053–6060 (2007).
Dashti, A., Raji, M., Alivand, M. S. & Mohammadi, A. H. Estimation of CO2 equilibrium absorption in aqueous solutions of commonly used amines using different computational schemes. Fuel 264, 116616 (2020).
Song, Z., Shi, H., Zhang, X. & Zhou, T. Prediction of CO2 solubility in ionic liquids using machine learning methods. Chem. Eng. Sci. 223, 115752 (2020).
Amar, M. N., Larestani, A., Lv, Q., Zhou, T. & Hemmati-Sarapardeh, A. Modeling of methane adsorption capacity in shale gas formations using white-box supervised machine learning techniques. J. Pet. Sci. Eng. 109226 (2021).
Naghizadeh, A., Larestani, A., Amar, M. N. & Hemmati-Sarapardeh, A. Predicting viscosity of CO2–N2 gaseous mixtures using advanced intelligent schemes. J. Pet. Sci. Eng. 109359 (2021).
Hashemizadeh, A., Maaref, A., Shateri, M., Larestani, A. & Hemmati-Sarapardeh, A. Experimental measurement and modeling of water-based drilling mud density using adaptive boosting decision tree, support vector machine, and K-nearest neighbors: A case study from the South pars gas field. J. Pet. Sci. Eng. 109132 (2021).
Larestani, A., Hemmati-Sarapardeh, A. & Naseri, A. Experimental measurement and compositional modeling of bubble point pressure in crude oil systems: Soft computing approaches, correlations, and equations of state. J. Pet. Sci. Eng. 110271 (2022).
Lv, Q. et al. Modelling minimum miscibility pressure of CO2-crude oil systems using deep learning, tree-based, and thermodynamic models: Application to CO2 sequestration and enhanced oil recovery. Sep. Purif. Technol. 123086 (2023).
Tian, Y., Wang, X., Liu, Y. & Hu, W. Prediction of nitrogen solubility in ionic liquids by machine learning methods based on COSMO-derived descriptors. Chem. Eng. Sci. 284, 119482 (2024).
Wang, C. et al. Integrating experimental study and intelligent modeling of pore evolution in the Bakken during simulated thermal progression for CO2 storage goals. Appl. Energy 359, 122693 (2024).
Saghafi, H. & Arabloo, M. Modeling of CO2 solubility in MEA, DEA, TEA, and MDEA aqueous solutions using AdaBoost-Decision Tree and Artificial neural network. Int. J. Greenh. Gas Control 58, 256–265 (2017).
Salooki, M. K., Abedini, R., Adib, H. & Koolivand, H. Design of neural network for manipulating gas refinery sweetening regenerator column outputs. Sep. Purif. Technol. 82, 1–9 (2011).
Adib, H., Sharifi, F., Mehranbod, N., Kazerooni, N. M. & Koolivand, M. Support vector machine based modeling of an industrial natural gas sweetening plant. J. Nat. Gas Sci. Eng. 14, 121–131 (2013).
Sipöcz, N., Tobiesen, F. A. & Assadi, M. The use of artificial neural network models for CO2 capture plants. Appl. Energy 88, 2368–2376 (2011).
Sahraie, S., Rashidi, H. & Valeh-e-Sheyda, P. An optimization framework to investigate the CO2 capture performance by MEA: Experimental and statistical studies using Box-Behnken design. Process Saf. Environ. Prot. 122, 161–168 (2019).
Wu, Y. & Chan, C. W. Analysis of data for the carbon dioxide capture domain. Eng. Appl. Artif. Intell. 24, 154–163 (2011).
Zhou, Q., Chan, C. W., Tontiwachwuthikul, P., Idem, R. & Gelowitz, D. Application of neuro-fuzzy modeling technique for operational problem solving in a CO2 capture process system. Int. J. Greenh. Gas Control 15, 32–41 (2013).
Zhou, Q., Wu, Y., Chan, C. W. & Tontiwachwuthikul, P. Modeling of the carbon dioxide capture process system using machine intelligence approaches. Eng. Appl. Artif. Intell. 24, 673–685 (2011).
Hsiao, Y. D. & Chang, C. T. Expandable neural networks for efficient modeling of various amine scrubbing configurations for CO2 capture. Chem. Eng. Sci. 281, 119191 (2023).
Wang, X., Chan, C. W. & Li, T. High accuracy prediction of the Post-combustion Carbon capture process parameters using the decision Forest Approach. Chem. Eng. Sci. 119878 (2024).
Ghiasi, M. M. & Mohammadi, A. H. Rigorous modeling of CO2 equilibrium absorption in MEA, DEA, and TEA aqueous solutions. J. Nat. Gas Sci. Eng. 18, 39–46 (2014).
Ghiasi, M. M., Arabloo, M., Mohammadi, A. H. & Barghi, T. Application of ANFIS soft computing technique in modeling the CO2 capture with MEA, DEA, and TEA aqueous solutions. Int. J. Greenh. Gas Control 49, 47–54 (2016).
Daneshvar, N., Moattar, M. Z., Abdi, M. A. & Aber, S. Carbon dioxide equilibrium absorption in the multi-component systems of CO2 + TIPA + MEA + H2O, CO2 + TIPA + pz + H2O and CO2 + TIPA + H2O at low CO2 partial pressures: Experimental solubility data, corrosion study and modeling with artificial neural network. Sep. Purif. Technol. 37, 135–147 (2004).
Shahsavand, A., Fard, F. D. & Sotoudeh, F. Application of artificial neural networks for simulation of experimental CO2 absorption data in a packed column. J. Nat. Gas Sci. Eng. 3, 518–529 (2011).
Tatar, A. et al. Comparison of two soft computing approaches for predicting CO2 solubility in aqueous solution of piperazine. Int. J. Greenh. Gas Control 53, 85–97 (2016).
Yarveicy, H., Ghiasi, M. M. & Mohammadi, A. H. Performance evaluation of the machine learning approaches in modeling of CO2 equilibrium absorption in Piperazine aqueous solution. J. Mol. Liq. 255, 375–383 (2018).
Dashti, A. et al. Efficient hybrid modeling of CO2 absorption in aqueous solution of piperazine: Applications to energy and environment. Chem. Eng. Res. Des. 144, 405–417 (2019).
Bishnoi, S. & Rochelle, G. T. Absorption of carbon dioxide into aqueous piperazine: Reaction kinetics, mass transfer and solubility. Chem. Eng. Sci. 55, 5531–5543 (2000).
Dash, S. K., Samanta, A., Samanta, A. N. & Bandyopadhyay, S. S. Vapour liquid equilibria of carbon dioxide in dilute and concentrated aqueous solutions of piperazine at low to high pressure. Fluid Phase Equilibria 300, 145–154 (2011).
Derks, P., Dijkstra, H., Hogendoorn, J. & Versteeg, G. Solubility of carbon dioxide in aqueous piperazine solutions. AIChE J. 51, 2311–2327 (2005).
Dugas, R. & Rochelle, G. Absorption and desorption rates of carbon dioxide with monoethanolamine and piperazine. Energy Procedia 1, 1163–1169 (2009).
Dugas, R. E. Carbon Dioxide Absorption, Desorption, and Diffusion in Aqueous Piperazine and Monoethanolamine (The University of Texas at Austin, 2009).
Haghtalab, A., Eghbali, H. & Shojaeian, A. Experiment and modeling solubility of CO2 in aqueous solutions of diisopropanolamine + 2-amino-2-methyl-1-propanol + piperazine at high pressures. J. Chem. Thermodyn. 71, 71–83 (2014).
Kadiwala, S., Rayer, A. V. & Henni, A. High pressure solubility of carbon dioxide (CO2) in aqueous piperazine solutions. Fluid Phase Equilibria 292, 20–28 (2010).
Aroua, M. K. & Mohd Salleh, R. Solubility of CO2 in aqueous piperazine and its modeling using the Kent-Eisenberg approach. Chem. Eng. Technol. Ind. Chem. Plant Equip. Process. Eng. Biotechnol. 27, 65–70 (2004).
Hadavimoghaddam, F. et al. Modeling crude oil pyrolysis process using advanced white-box and black-box machine learning techniques. Sci. Rep. 13, 22649 (2023).
Chen, G. et al. The genetic algorithm based back propagation neural network for MMP prediction in CO2-EOR process. Fuel 126, 202–212 (2014).
Chen, T. & Guestrin, C. In Proceedings of the 22nd ACM Sigkdd International Conference on Knowledge Discovery and Data Mining 785–794.
Chen, T. et al. Xgboost: Extreme gradient boosting. R package version 0.4-2 1, 1–4 (2015).
Nakhaei-Kohani, R. et al. Machine learning assisted structure-based models for predicting electrical conductivity of Ionic liquids. J. Mol. Liq. 119509 (2022).
Abdi, J., Hadavimoghaddam, F., Hadipoor, M. & Hemmati-Sarapardeh, A. Modeling of CO2 adsorption capacity by porous metal organic frameworks using advanced decision tree-based models. Sci. Rep. 11, 1–14 (2021).
Mohammadi, M. R. et al. Modeling hydrogen solubility in hydrocarbons using extreme gradient boosting and equations of state. Sci. Rep. 11, 17911 (2021).
Sun, X., Liu, M. & Sima, Z. A novel cryptocurrency price trend forecasting model based on LightGBM. Finance Res. Lett. 32, 101084 (2020).
Yang, X., Dindoruk, B. & Lu, L. A comparative analysis of bubble point pressure prediction using advanced machine learning algorithms and classical correlations. J. Petrol. Sci. Eng. 185, 106598 (2020).
Gu, Y. et al. Data-driven estimation for permeability of simplex pore-throat reservoirs via an improved light gradient boosting machine: A demonstration of sand-mud profile, Ordos Basin, northern China. J. Petrol. Sci. Eng. 110909 (2022).
Mahmoudzadeh, A. et al. Modeling CO2 solubility in water using gradient boosting and light gradient boosting machine. Sci. Rep. 14, 13511 (2024).
Qi, M. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. Neural Inform. Process. Syst. Curran Associates Inc (2017).
Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A. V. & Gulin, A. CatBoost: Unbiased boosting with categorical features. Adv. Neural. Inf. Process. Syst. 31 (2018).
Morozov, A. D. et al. Data-driven model for hydraulic fracturing design optimization: Focus on building digital database and production forecast. J. Petrol. Sci. Eng. 194, 107504 (2020).
Duplyakov, V. et al. Data-driven model for hydraulic fracturing design optimization. Part II: Inverse problem. J. Petrol. Sci. Eng. 208, 109303 (2022).
Freund, Y. & Schapire, R. E. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 55, 119–139 (1997).
Dargahi-Zarandi, A., Hemmati-Sarapardeh, A., Shateri, M., Menad, N. A. & Ahmadi, M. Modeling minimum miscibility pressure of pure/impure CO2-crude oil systems using adaptive boosting support vector regression: Application to gas injection processes. J. Petrol. Sci. Eng. 184, 106499 (2020).
Zerrouki, N., Harrou, F., Sun, Y. & Houacine, A. Vision-based human action classification using adaptive boosting algorithm. IEEE Sens. J. 18, 5115–5121 (2018).
Mohammadi, M. R. et al. Modeling the solubility of light hydrocarbon gases and their mixture in brine with machine learning and equations of state. Sci. Rep. 12, 14943 (2022).
Nair, P. et al. AI-driven digital twin model for reliable lithium‐ion battery discharge capacity predictions. Int. J. Intell. Syst. 8185044 (2024).
Shawki, N., Nunez, R. R., Obeid, I. & Picone, J. In 2021 IEEE Signal Processing in Medicine and Biology Symposium (SPMB) 1–7 (IEEE).
Kamps, Á. P. S., Xia, J. & Maurer, G. Solubility of CO2 in (H2O + piperazine) and in (H2O + MDEA + piperazine). AIChE J. 49, 2662–2670 (2003).
Ermatchkov, V., Pérez-Salado Kamps, Á., Speyer, D. & Maurer, G. Solubility of carbon dioxide in aqueous solutions of piperazine in the low gas loading region. J. Chem. Eng. Data 51, 1788–1796 (2006).
Jahangiri, A. & Nabipoor Hassankiadeh, M. Effects of piperazine concentration and operating conditions on the solubility of CO2 in AMP solution at low CO2 partial pressure. Sep. Sci. Technol. 54, 1067–1078 (2019).
Mohammadi, M. R., Hemmati-Sarapardeh, A., Schaffie, M., Husein, M. M. & Ranjbar, M. Application of cascade forward neural network and group method of data handling to modeling crude oil pyrolysis during thermal enhanced oil recovery. J. Petrol. Sci. Eng. 205, 108836 (2021).
Xu, M., Wong, T. C. & Chin, K. S. Modeling daily patient arrivals at Emergency Department and quantifying the relative importance of contributing variables using artificial neural network. Decis. Support Syst. 54, 1488–1498 (2013).
Ansari, S. et al. Experimental measurement and modeling of asphaltene adsorption onto iron oxide and lime nanoparticles in the presence and absence of water. Sci. Rep. 13, 122 (2023).
Mohammadi, M. R. et al. On the evaluation of crude oil oxidation during thermogravimetry by generalised regression neural network and gene expression programming: Application to thermal enhanced oil recovery. Combust. Theor. Model. 25, 1268–1295 (2021).
Salehi, E. et al. Modeling interfacial tension of N2/CO2 mixture + n-alkanes with machine learning methods: Application to EOR in conventional and unconventional reservoirs by flue gas injection. Minerals 12, 252 (2022).
Leroy, A. M. & Rousseeuw, P. J. Robust Regression and Outlier Detection (1987).
Goodall, C. R. 13 Computation Using the QR Decomposition (1993).
Gramatica, P. Principles of QSAR models validation: Internal and external. QSAR Comb. Sci. 26, 694–701 (2007).
Rousseeuw, P. J. & Leroy, A. M. Robust Regression and Outlier Detection (Wiley, 2005).
Hadavimoghaddam, F. et al. Data-driven modeling of H2 solubility in hydrocarbons using white-box approaches. Int. J. Hydrog. Energy 47, 33224–33238 (2022).
Ansari, S. et al. Prediction of hydrogen solubility in aqueous solutions: Comparison of equations of state and advanced machine learning-metaheuristic approaches. Int. J. Hydrog. Energy 47, 37724–37741 (2022).
Author information
Contributions
M-R.M: Investigation, data curation, modeling, visualization, writing-original draft; A.L: Investigation, conceptualization, validation, modeling, writing-original draft; M.S: Writing-review and editing, methodology, validation, supervision; A.H-S: Methodology, validation, supervision, writing-review and editing; M.R: Writing-review and editing, methodology, validation, supervision.
Ethics declarations
Competing interests
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Mohammadi, MR., Larestani, A., Schaffie, M. et al. Predictive modeling of CO2 solubility in piperazine aqueous solutions using boosting algorithms for carbon capture goals. Sci Rep 14, 22112 (2024). https://doi.org/10.1038/s41598-024-73070-y
DOI: https://doi.org/10.1038/s41598-024-73070-y