Abstract
Accurate prediction of fuel deposition during crude oil pyrolysis is pivotal for sustaining the combustion front and ensuring the effectiveness of in-situ combustion enhanced oil recovery (ISC EOR). Employing 2071 experimental TGA datasets from 13 diverse crude oil samples extracted from the literature, this study sought to precisely model crude oil pyrolysis. A suite of robust machine learning techniques, encompassing three black-box approaches (Categorical Gradient Boosting—CatBoost, Gaussian Process Regression—GPR, Extreme Gradient Boosting—XGBoost), and a white-box approach (Genetic Programming—GP), was employed to estimate crude oil residue at varying temperature intervals during TGA runs. Notably, the XGBoost model emerged as the most accurate, boasting a mean absolute percentage error (MAPE) of 0.7796% and a determination coefficient (R2) of 0.9999. Subsequently, the GPR, CatBoost, and GP models demonstrated commendable performance. The GP model, while displaying slightly higher error in comparison to the black-box models, yielded acceptable results and proved suitable for swift estimation of crude oil residue during pyrolysis. Furthermore, a sensitivity analysis was conducted to reveal the varying influence of input parameters on residual crude oil during pyrolysis. Among the inputs, temperature and asphaltenes were identified as the most influential factors in the crude oil pyrolysis process. Higher temperatures and oil °API gravity were associated with a negative impact, leading to a decrease in fuel deposition. On the other hand, increased values of asphaltenes, resins, and heating rates showed a positive impact, resulting in an increase in fuel deposition. These findings underscore the importance of precise modeling for fuel deposition during crude oil pyrolysis, offering insights that can significantly benefit ISC EOR practices.
Introduction
In-situ combustion (ISC) is a challenging thermal enhanced oil recovery (EOR) technique in which oil is recovered by burning part of the heavy oil in the reservoir1. In this technique, pure oxygen or oxygen-enriched gas is injected into the reservoir to combust a portion of the crude. In other words, a portion of the oil-in-place is oxidized and used as fuel to generate heat2. The combustion front is initiated by either artificial or spontaneous ignition of the oil. Artificial methods, such as gas/air burners, steam/hot fluid injection, or electric ignition, can be employed to ignite the oil deliberately. Alternatively, spontaneous ignition occurs at or near the injection well, often facilitated by downhole igniters3. As oxygen-enriched gas is continuously injected, the combustion front propagates toward the production well. This releases a large amount of heat within the reservoir, which reduces oil viscosity and enables oil recovery. One of the prerequisites for ISC is that enough fuel is available in the reservoir to sustain the combustion front. The fuel consumed during ISC consists of carbonaceous residues (mainly coke) deposited around the combustion front as a result of thermal cracking, pyrolysis, and distillation of crude oil. Eventually, the recovery of unburned oil is enhanced by the displacing action of the gases and heat released from combustion, along with changes in the physical and chemical properties of the reservoir oil1,4,5. During ISC, oxidation and pyrolysis of hydrocarbons take place, and both strongly affect the quantity and quality of the fuel required to sustain the combustion front. Pyrolysis is a chemical reaction in which crude oil is exposed to heat in the absence of an oxidizing medium5,6,7. Pyrolysis, cracking, vaporization, condensation, and dehydrogenation may all occur during ISC; they affect the physical and chemical properties of the carbonaceous residue and are important for oil production5,7.
Over time, thermogravimetric analysis (TGA) and differential thermogravimetry (DTG) techniques have been employed as investigative instruments for studying ISC processes5,8. Specifically, TGA can monitor weight changes during the combustion of fuels or residues, yet it is important to acknowledge that it cannot fully simulate the intricate phenomena that occur during ISC. Ciajolo and Barbella6 investigated the oxidation and pyrolysis of several heavy oils and their fractions using DTG profiles. Low-temperature (< 400 °C) and high-temperature phases were found in the thermal behavior of the fuel: the first phase involves the volatilization of paraffinic and aromatic fractions, and the second involves the pyrolysis of polar and asphaltene fractions, which produces a particulate carbon residue. Ranjbar and Pusch9 experimentally showed that heat transfer and transferability characteristics of the pyrolysis medium, as well as the colloidal composition of the oil (such as asphaltenes and resins), have a noticeable impact on fuel formation and composition. In another study, Ranjbar10 showed that clay minerals in the matrix raise fuel deposition during the pyrolysis process and catalyze the oxidation of fuel. Kok11 studied two heavy crudes by differential scanning calorimetry (DSC) and TGA and showed that the heavier oil deposited larger quantities of residue/fuel after distillation was complete. Karacan and Kok12 analyzed the pyrolysis of crude oils and their fractions using TGA and DSC and showed that asphaltenes contribute most to coke formation, followed by resins. In another laboratory study, Kok and Karacan13 showed that as the °API gravity of crude oils decreases, the cracking activation energy increases. They also identified two main mechanisms of mass loss along with their temperature ranges: distillation (20–400 °C), and thermal cracking and vis-breaking (400–600 °C).
Ambalae et al.14 experimentally indicated that asphaltenes have the largest role in the formation of coke (fuel) among other fractions of crude oil. Kok15 showed that the heating rate influenced the reaction region peak, intervals, and burn-out temperatures in TG-DTG experiments of crude oil combustion. Li et al.16 showed that pyrolyzed and oxidized cokes are the main types of coke in the ISC process, releasing more heat than crude oil under similar conditions. A lot of research has been done on the catalytic impact of different compounds on oxidation and pyrolysis of various crudes and cokes17,18,19,20,21,22. The kinetics of combustion and pyrolysis of crudes and their fractions and the deposited coke have been investigated in some studies13,22,23,24,25,26,27,28.
In the realm of crude oil pyrolysis and oxidation, despite extensive laboratory research, recent attention has shifted to modeling approaches, particularly through machine learning regression. This artificial intelligence technique proves valuable in understanding the complex relationships within crude oil pyrolysis and oxidation processes, given the multitude of influencing parameters in ISC. Rasouli et al.29 investigated the pyrolysis of six crudes and represented a multilayer perceptron model for predicting the crude oil residue on the basis of TGA with a 3.5% error. Norouzpour et al.30 modeled crude oil pyrolysis employing a radial basis function neural network based on TGA of six crudes with a 5.8% error. Mohammadi et al.31 collected TGA experimental data from nine crude oils’ oxidation and presented a model using a generalized regression neural network with an error of 2.3%. In another study, Mohammadi et al.32 modeled the pyrolysis of 11 crude oils based on TGA data by applying a cascade forward neural network with an error of 1.04%. Despite the existence of several models for predicting crude oil pyrolysis, gathering more experimental data and applying cutting-edge and robust black-box and white-box machine learning techniques have the potential to engender streamlined mathematical correlations, thereby yielding more precise intelligent models.
In this study, 2071 experimental TGA data points for 13 different crude oils are gathered from the literature in order to precisely represent crude oil pyrolysis, which is a crucial reaction in the ISC EOR process. Four robust machine learning techniques, comprising three black-box approaches (Categorical Gradient Boosting (CatBoost), Gaussian Process Regression (GPR), and Extreme Gradient Boosting (XGBoost)) and one white-box approach (Genetic Programming (GP)), are used to model the residual mass of crude oils at various temperatures obtained from TGA. Statistical and graphical error analyses are used to validate the developed models and the mathematical correlation. Finally, a sensitivity analysis is carried out to quantify the relative effect of each input on the crude oil residue obtained during pyrolysis.
Data gathering and preparation
In this study, 2071 experimental TGA findings related to 13 distinct crude oils were gathered from the literature12,13,24,29,30,33,34,35 in order to precisely represent crude oil pyrolysis, which is a crucial reaction in the ISC EOR process. The database used in this work is more comprehensive than the one used in Mohammadi et al.’s study31 (i.e. 2015 TGA data for 11 distinct crude oils). Since the kind of crude oil affects how it is pyrolyzed, a variety of crude oils with the characteristics mentioned in Table 1 were chosen to serve as input data for our models.
For model training, factors identified in the literature as being important during the pyrolysis of crude oil9,13,33,36,37 were taken into consideration. In this research, the models' input parameters were the temperature, heating rate, weight percentages of asphaltenes and resins, and oil °API gravity. Since these values are commonly reported, a sufficiently large database is available for training the models. The model output was the residual mass of crude oil at various temperatures. Table 1 lists the characteristics of the crudes and the heating rates used in this study. Additionally, Table 2 gives the statistical description of every input variable and of the output parameter, and Fig. 1 depicts the distribution of all the variables.
An asymmetrical distribution, in contrast to a symmetrical one, deviates from the regular, balanced pattern typically illustrated by a bell curve. Skewness quantifies this asymmetry. The skewness is positive when the majority of the data lie on the left side of the probability function, and vice versa. Kurtosis, in turn, describes the shape of the distribution relative to the normal distribution. For instance, a positive kurtosis means that the distribution has a sharper peak than the normal distribution38. According to the data in Table 2 and Fig. 1, the distribution and variation range of the input variables are broad enough to provide a generic model for forecasting the pyrolysis of crude oil. It should be mentioned that oil °API gravity, heating rate, and especially asphaltene content contain a number of outliers, which can influence the precision of the models. However, the vast majority of observations lie within the box borders, so the impact of these outliers is expected to be minor. A thorough examination confirms that the outliers are valid measurements that simply differ statistically from the majority of the data. As evident from the modeling results, they do not significantly skew the errors during the modeling process. Furthermore, the residual crude oil and temperature data are essentially continuous, with negligible spacing between observations, whereas oil °API gravity, heating rate, asphaltene, and resin contents take values separated by considerable gaps. Moreover, the medians of the temperature and the residual crude oil used to develop the models are 385.71 and 40.22, respectively. The medians of resin, asphaltene, heating rate, and oil °API gravity are 14, 9.66, 8, and 20.26, respectively.
Figure 2 shows the correlation matrix of the input data. Temperature has by far the strongest relationship with the residual mass, with a correlation of about 92% in magnitude. The correlation is negative, meaning that the higher the temperature, the lower the residual mass, and vice versa. The other parameters have a much smaller effect on the target than temperature, but they are essential for distinguishing between crude oils and for modeling crude pyrolysis.
The dataset was partitioned into a training set comprising 80% of the total data and a test set with 20% randomly. Training involved using the training subset, while the test subset assessed the model's predictive performance. Here, K-fold cross-validation, specifically K-fold 6, was implemented to ensure each observation had an equal chance in training and validation. This involved randomly splitting the training data into 6 folds, fitting the model using 5 folds, and validating it with the remaining fold, a practice tailored to our dataset size.
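The partitioning described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the authors' implementation; the function names, the seed, and the way folds are formed are assumptions made for the example.

```python
import random

def train_test_split_indices(n_samples, test_fraction=0.2, seed=0):
    """Shuffle sample indices and split them into train and test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]

def k_fold_indices(train_indices, k=6, seed=0):
    """Yield (fit, validation) index lists for k-fold cross-validation."""
    shuffled = list(train_indices)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]  # k folds of nearly equal size
    for i in range(k):
        validation = folds[i]
        fit = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield fit, validation

# 2071 data points, 80/20 split, then 6-fold cross-validation on the training part
train_idx, test_idx = train_test_split_indices(2071)
splits = list(k_fold_indices(train_idx, k=6))
print(len(train_idx), len(test_idx), len(splits))
```

Each training observation appears in exactly one validation fold and in the fitting set of the other five, which is what gives every observation an equal role in training and validation.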
Model development
In this study, four different machine learning approaches were used to calculate the residual crude oil during pyrolysis. One of these techniques is a white-box method, and the other three are black-box methods. The flowchart in Fig. 3 gives the general schematic of the research and shows the main steps of each stage.
Gaussian process regression (GPR)
GPR is a common nonparametric modeling approach that uses a Gaussian process prior to perform regression analysis39. The model combines this prior, which is updated through Bayesian inference, with a regression residual term.
A Gaussian process is a collection of random variables that describes a distribution over functions. Since a mean function \(m(x)\) and a covariance function \(k(x, x{\prime})\) fully specify a real process \(f(x)\), the process can be expressed as40:
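In the standard Gaussian process notation, which is assumed here to be the intended form, this relation reads:

\[
f(x) \sim \mathcal{GP}\left(m(x),\, k(x, x^{\prime})\right) \tag{1}
\]

with \(m(x)=\mathbb{E}\left[f(x)\right]\) and \(k(x, x^{\prime})=\mathbb{E}\left[\left(f(x)-m(x)\right)\left(f(x^{\prime})-m(x^{\prime})\right)\right]\).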
The objective of GPR is to determine the mapping correlation between the input vector x and the observable y for a specific training dataset \(D={\left\{{x}_{i}, {y}_{i} \right\}}_{i=1}^{n}\), where \({x}_{i}\) is the input vector of the ith sample and \({y}_{i}\) is the observation value of the ith sample41:
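Using the symbols defined in the surrounding text, the usual observation model (assumed here) is:

\[
y = f(x) + \zeta \tag{2}
\]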
where \(\zeta \) is additive noise that follows a Gaussian distribution with zero mean and variance \({\sigma }_{n}^{2}\). The covariance of the noisy measurements y is \(K\left(X, X\right)+ {\sigma }_{n}^{2}{I}_{n}\), where \(y={[{y}_{1}, {y}_{2}, \dots ,{y}_{n}]}^{T}\), \(X={[{x}_{1}, {x}_{2}, \dots ,{x}_{n}]}^{T}\), K represents the covariance matrix, and \({I}_{n}\) is the n-dimensional identity matrix. Therefore, the joint distribution of the observations and the function value at the test input \({x}_{*}\) under the prior may be written as41:
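Assuming a zero-mean prior, as is common in the GPR literature, the joint distribution takes the form:

\[
\begin{bmatrix} y \\ f(x_{*}) \end{bmatrix} \sim \mathcal{N}\left(0,\; \begin{bmatrix} K(X, X)+\sigma_{n}^{2} I_{n} & K(X, x_{*}) \\ K(x_{*}, X) & k(x_{*}, x_{*}) \end{bmatrix}\right) \tag{3}
\]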
According to Eq. (3), the mean of \(f\left({x}_{*}\right)\) and covariance of \(f\left({x}_{*}\right)\) may be written as41:
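Conditioning the joint Gaussian distribution on the observations gives the standard predictive expressions (stated here under the same zero-mean assumption):

\[
\bar{f}(x_{*}) = K(x_{*}, X)\left[K(X, X)+\sigma_{n}^{2} I_{n}\right]^{-1} y
\]

\[
\operatorname{cov}\left(f(x_{*})\right) = k(x_{*}, x_{*}) - K(x_{*}, X)\left[K(X, X)+\sigma_{n}^{2} I_{n}\right]^{-1} K(X, x_{*})
\]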
In conventional GPR (CGPR), the entire training dataset \(D={\left\{{x}_{i}, {y}_{i} \right\}}_{i=1}^{n}\) is used to build the nonparametric model and to compute the prediction for a given test sample. The dimension of the covariance matrix K in CGPR is therefore \(n\times n\).
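To make the prediction step concrete, the following sketch evaluates the predictive mean and variance for a scalar input. It is illustrative only: the squared-exponential kernel, its length scale, the noise variance, and the toy data are assumptions, and a plain Gaussian elimination stands in for the Cholesky-based solvers used by real GPR libraries.

```python
import math

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential covariance k(a, b) for scalar inputs (assumed kernel)."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def gpr_predict(X, y, x_star, noise_var=1e-6):
    """Posterior mean and variance of f(x_star) under a zero-mean GP prior."""
    n = len(X)
    # K(X, X) + sigma_n^2 * I_n, an n-by-n matrix built from all training points
    K = [[rbf_kernel(X[i], X[j]) + (noise_var if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    k_star = [rbf_kernel(x_star, xi) for xi in X]   # K(x*, X)
    alpha = solve(K, y)                             # [K + sigma_n^2 I]^-1 y
    v = solve(K, k_star)                            # [K + sigma_n^2 I]^-1 K(X, x*)
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = rbf_kernel(x_star, x_star) - sum(k * w for k, w in zip(k_star, v))
    return mean, var

# Toy, made-up data: scaled temperature versus residual mass fraction
X_train = [0.0, 1.0, 2.0]
y_train = [1.0, 0.5, 0.2]
mean_near, var_near = gpr_predict(X_train, y_train, 1.0)
mean_far, var_far = gpr_predict(X_train, y_train, 50.0)
```

Near a training point the prediction reproduces the observation with very small variance, while far from the data it reverts to the prior mean of zero with the prior variance. The n-by-n solve also shows why the cost of CGPR grows quickly with the size of the training set.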
Extreme gradient boosting (XGBoost)
Boosting refers to a family of learning techniques that improve the fit of the final model by combining simple base models42. The combination of base models with fairly low individual precision43 creates a scalable solution that can identify deep interactions and is less susceptible to anomalies44. The gradient boosting approach, which consists of an efficient linear model solver and a tree learning algorithm, is used to develop the model. Several objective functions, including regression, classification, and ranking, are supported by the boosting method. XGBoost, a free software package, delivers state-of-the-art solutions to a variety of problems, notably climate projections45,46. With its scalable tree-boosting method, XGBoost runs more than 10 times faster than other popular solutions on a single computer44. XGBoost has many parameters, which makes it a complicated model. In addition, hyperparameters must be tuned to limit the danger of over-fitting and the variability of the forecasts47. The number of iterations (n_estimators) and the learning rate are the two key hyperparameters for controlling overfitting in XGBoost. The n_estimators parameter relates to the complexity of the model; raising it may give a more expressive model, but it may also lead to overfitting. The number of iterations governs the degree of fit and so influences the optimal learning rate, and vice versa. Generalization is often improved by lowering the learning rate, and a decreased learning rate may significantly enhance predictive accuracy48. The regularization term, proposed by Friedman43, helps to avoid overfitting and controls the model's complexity. During tuning, regularization factors such as lambda and alpha should be adjusted to the required regularization weight in order to improve the quality of the model.
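The interplay between the number of iterations and the learning rate can be illustrated with a deliberately simplified sketch. The code below is plain gradient boosting for squared error with one-split trees (stumps); it is not XGBoost, which additionally uses second-order gradient information and the regularization terms mentioned above. The data values are hypothetical.

```python
def fit_stump(x, residuals):
    """Find the threshold on x whose two-leaf fit minimises squared error."""
    best = None
    for threshold in sorted(set(x))[:-1]:      # keep the right leaf non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def boost(x, y, n_estimators=50, learning_rate=0.1):
    """Fit an additive model of stumps to y by repeatedly fitting residuals."""
    base = sum(y) / len(y)
    prediction = [base] * len(y)
    stumps = []
    for _ in range(n_estimators):
        residuals = [yi - pi for yi, pi in zip(y, prediction)]
        threshold, left_mean, right_mean = fit_stump(x, residuals)
        stumps.append((threshold, left_mean, right_mean))
        # shrink each stump's contribution by the learning rate
        prediction = [p + learning_rate * (left_mean if xi <= threshold else right_mean)
                      for xi, p in zip(x, prediction)]
    return base, stumps, prediction

def mse(y, prediction):
    return sum((yi - pi) ** 2 for yi, pi in zip(y, prediction)) / len(y)

# Hypothetical TGA-like curve: temperature (deg C) versus residual mass (%)
temperature = [50, 100, 150, 200, 250, 300, 350, 400, 450, 500, 550, 600]
residue = [100, 98, 93, 85, 72, 55, 40, 30, 24, 21, 20, 20]
_, _, few = boost(temperature, residue, n_estimators=10)
_, _, many = boost(temperature, residue, n_estimators=200)
print(mse(residue, few), mse(residue, many))
```

With a small learning rate, each stump corrects only a fraction of the remaining error, so more iterations are needed to reach the same training fit. This is the trade-off that has to be tuned jointly, ideally against validation rather than training error.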
Categorical gradient boosting (CatBoost)
CatBoost is a machine learning technique founded on gradient boosting decision tree (GBDT) that was developed by Yandex researchers in 201749,50. Through ranking promotion, it enhances GBDT, assures that all datasets may be utilized for training and learning, and decreases the over-fitting of training51. Due to its strong effectiveness, CatBoost has been employed in various sectors, notably driving style identification52 and diabetes diagnosis53. The traditional GBDT method substitutes the category feature with the average label value related to that category. In a decision tree, the mean label value is used as the segmentation criteria for nodes. This technique is referred to as greedy target-based statistics (greedy TBS) and is described as follows49:
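In the notation commonly used in the CatBoost literature (assumed here), where \(x_{i,k}\) is the kth categorical feature of the ith sample, \(Y_{j}\) is the label of the jth sample, and \([\cdot]\) equals 1 when its argument is true and 0 otherwise, greedy TBS can be written as:

\[
\hat{x}_{i,k} = \frac{\sum_{j=1}^{n}\left[x_{j,k}=x_{i,k}\right]\cdot Y_{j}}{\sum_{j=1}^{n}\left[x_{j,k}=x_{i,k}\right]}
\]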
In general, though, features carry more information than labels. When a feature is forcibly represented by the mean label value, a conditional shift occurs. To address this, given the set of observations \(D=\{X_{i}, Y_{i}\},\; i=1,\ldots,n\), and a permutation \(\sigma=(\sigma_{1},\ldots,\sigma_{n})\), the value \(x_{\sigma_{p},k}\) may be replaced with49:
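Following the usual formulation of ordered target statistics (assumed here), in which only the samples that precede position p in the permutation contribute, the replacement value is:

\[
\hat{x}_{\sigma_{p},k} = \frac{\sum_{j=1}^{p-1}\left[x_{\sigma_{j},k}=x_{\sigma_{p},k}\right]\cdot Y_{\sigma_{j}} + a\cdot P}{\sum_{j=1}^{p-1}\left[x_{\sigma_{j},k}=x_{\sigma_{p},k}\right] + a}
\]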
Here, P is the prior value and a is its weight (a > 0). Adding the prior reduces the noise produced by low-frequency categories.
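A small sketch may help to show how the prior and the permutation act together. The function below is an illustration of ordered target statistics written for this text, not CatBoost's internal code; the category names and label values are made up, and the inputs of the present study are in fact all numerical.

```python
def ordered_target_statistics(categories, labels, permutation, prior, prior_weight=1.0):
    """Encode each sample's category from the labels of samples that precede it.

    Samples are visited in the order given by `permutation`, so the encoding
    of a sample never uses its own label.
    """
    label_sums, counts = {}, {}
    encoded = [0.0] * len(categories)
    for idx in permutation:
        cat = categories[idx]
        s = label_sums.get(cat, 0.0)
        c = counts.get(cat, 0)
        encoded[idx] = (s + prior_weight * prior) / (c + prior_weight)
        label_sums[cat] = s + labels[idx]   # update only after encoding
        counts[cat] = c + 1
    return encoded

# Hypothetical categorical feature and target values
oil_type = ["light", "heavy", "heavy", "light", "heavy"]
target = [10.0, 40.0, 50.0, 20.0, 60.0]
prior = sum(target) / len(target)           # global mean used as the prior P
encoded = ordered_target_statistics(oil_type, target, [0, 1, 2, 3, 4], prior)
print(encoded)
```

The first occurrence of each category falls back to the prior, and later occurrences move toward the running category mean, which is how the prior damps the noise of rarely seen categories.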
Genetic programming (GP)
GP is a widely used method in evolutionary computation54,55. It may be used to search for globally optimal solutions in a complex search space, and it generates candidate solutions through a process inspired by Darwin's theory of evolution56. GP follows an evolutionary path of selection, crossover, mutation, and cloning to search for symbolic expressions that best relate a set of independent (input) variables to the dependent (output) variable57. GP is capable of optimizing the model structure on its own, and its results are symbolic in nature. Moreover, its representation is flexible. These qualities make GP an excellent method for symbolic regression. In contrast to black-box models, GP-evolved solutions are highly interpretable in terms of how features are learned or extracted from the signals and how they influence the outcome58.
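The evolutionary loop described above can be sketched as a toy symbolic regression. This is a minimal illustration under stated assumptions, not the configuration used in this study: the operator set, tree depth limit, population size, selection scheme, and target function are all chosen only for the example.

```python
import math
import operator
import random

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def random_tree(rng, depth):
    """Grow a random expression tree over the variable 'x' and constants."""
    if depth == 0 or rng.random() < 0.3:
        return "x" if rng.random() < 0.5 else round(rng.uniform(-2.0, 2.0), 2)
    return (rng.choice(sorted(OPS)), random_tree(rng, depth - 1), random_tree(rng, depth - 1))

def tree_depth(tree):
    return 1 + max(tree_depth(tree[1]), tree_depth(tree[2])) if isinstance(tree, tuple) else 0

def evaluate(tree, x):
    if isinstance(tree, tuple):
        return OPS[tree[0]](evaluate(tree[1], x), evaluate(tree[2], x))
    return x if tree == "x" else tree

def fitness(tree, xs, ys):
    """Mean squared error of the expression on the data (lower is better)."""
    total = 0.0
    for x, y in zip(xs, ys):
        diff = evaluate(tree, x) - y
        total += diff * diff
    return total / len(xs) if math.isfinite(total) else float("inf")

def random_subtree(rng, tree):
    while isinstance(tree, tuple) and rng.random() < 0.7:
        tree = rng.choice(tree[1:])
    return tree

def replace_random_node(rng, tree, new_subtree):
    """Return a copy of tree with one randomly chosen node replaced."""
    if not isinstance(tree, tuple) or rng.random() < 0.3:
        return new_subtree
    op, left, right = tree
    if rng.random() < 0.5:
        return (op, replace_random_node(rng, left, new_subtree), right)
    return (op, left, replace_random_node(rng, right, new_subtree))

def evolve(xs, ys, population_size=60, generations=30, max_depth=6, seed=0):
    rng = random.Random(seed)
    population = [random_tree(rng, 3) for _ in range(population_size)]
    history = []
    for _ in range(generations):
        population.sort(key=lambda t: fitness(t, xs, ys))
        history.append(fitness(population[0], xs, ys))
        elite = population[: population_size // 5]      # selection and cloning
        children = []
        while len(elite) + len(children) < population_size:
            parent_a, parent_b = rng.choice(elite), rng.choice(elite)
            # crossover: graft a subtree of parent_b into parent_a
            child = replace_random_node(rng, parent_a, random_subtree(rng, parent_b))
            if rng.random() < 0.3:                      # mutation
                child = replace_random_node(rng, child, random_tree(rng, 2))
            children.append(child if tree_depth(child) <= max_depth else parent_a)
        population = elite + children
    best = min(population, key=lambda t: fitness(t, xs, ys))
    return best, history

xs = [i / 10 for i in range(11)]
ys = [1.0 - 0.8 * x * x for x in xs]        # made-up target relationship
best_tree, history = evolve(xs, ys)
```

Because the best individuals are cloned unchanged into the next generation, the best error recorded in `history` can never get worse. The returned `best_tree` is an explicit expression, which is the property that makes the white-box correlation easy to inspect and reuse.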
Model optimization and tuning
Optimal hyperparameter selection is crucial for algorithm performance. Tuning these parameters fine-tunes the model, significantly impacting accuracy and ensuring the algorithm is well-suited to the specific characteristics of the data, ultimately enhancing predictive capabilities59. In constructing each model and addressing overfitting, grid search was utilized to optimize the hyperparameters. The hyperparameters selected for each model differed, with their importance grounded in both theoretical principles and practical considerations. Table 3 provides a comprehensive overview of the selected hyperparameters for the algorithms implemented in this work.
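Grid search itself is simple to express. The sketch below enumerates every combination in a parameter grid and keeps the one with the lowest score; the grid values and the scoring function are hypothetical stand-ins, since in practice the score would be a cross-validated error of the model being tuned.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Evaluate score_fn on every parameter combination; return the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_validation_error(params):
    """Hypothetical stand-in for a cross-validated RMSE."""
    return ((params["learning_rate"] - 0.1) ** 2
            + (params["n_estimators"] - 300) ** 2 / 1e6)

grid = {
    "learning_rate": [0.01, 0.05, 0.1, 0.3],
    "n_estimators": [100, 300, 500],
}
best_params, best_score = grid_search(grid, toy_validation_error)
print(best_params, best_score)
```

The cost grows with the product of the grid sizes (12 evaluations here), which is why the grid is usually restricted to the few hyperparameters that matter most for each algorithm.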
Evaluation of models
Utilizing seven statistical indicators, the accuracy of the suggested models was evaluated. The following metrics have been employed in the research: MAPE, SD, RMSE, R2, MAE, MBE, and NSE. The selection of such indicators is based on the fact that they are commonly considered to be the most representative and effective ones in the fields of statistics and machine learning. These are the descriptions for the measures listed below60:
Mean absolute percentage error (MAPE, %):
Standard deviation (SD):
Root mean square error (RMSE):
Determination coefficient (R2):
where N shows the count of data, yexp refers to the experimental data, and ypred stands for predicted data by presented models.
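For reference, these four measures are commonly written as follows in terms of the symbols just defined. The expressions for MAPE, RMSE, and R2 are the standard ones; the form of SD, taken here as the standard deviation of the relative error, is an assumption, since several definitions are in use.

\[
\text{MAPE} = \frac{100}{N}\sum_{i=1}^{N}\left|\frac{y_{\text{exp},i}-y_{\text{pred},i}}{y_{\text{exp},i}}\right|
\]

\[
\text{SD} = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(\frac{y_{\text{exp},i}-y_{\text{pred},i}}{y_{\text{exp},i}}\right)^{2}}
\]

\[
\text{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_{\text{exp},i}-y_{\text{pred},i}\right)^{2}}
\]

\[
R^{2} = 1-\frac{\sum_{i=1}^{N}\left(y_{\text{exp},i}-y_{\text{pred},i}\right)^{2}}{\sum_{i=1}^{N}\left(y_{\text{exp},i}-\overline{y}_{\text{exp}}\right)^{2}}
\]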
Mean absolute error (MAE):
This estimate is a risk measure equivalent to the anticipated value of the absolute error loss or \(l1\)-norm loss. If \(\hat{y}_{i}\) is the anticipated value of the ith sample and \({y}_{i}\) is the matching real value, then the calculated MAE over \({n}_{\text{samples}}\) is given by:
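\[
\text{MAE}(y, \hat{y}) = \frac{1}{n_{\text{samples}}}\sum_{i=0}^{n_{\text{samples}}-1}\left|y_{i}-\hat{y}_{i}\right|
\]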
Mean bias error (MBE):
This parameter quantifies the average mistake in a forecast and is computed as:
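A common form is given below; the sign convention (predicted minus experimental, so that a positive value indicates over-prediction) is an assumption, as the opposite convention is also used.

\[
\text{MBE} = \frac{1}{N}\sum_{i=1}^{N}\left(y_{\text{pred},i}-y_{\text{exp},i}\right)
\]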
The Nash–Sutcliffe efficiency (NSE):
It is a normalized measurement that compares the residual variation (or "noise") to the variation of the observed data.
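In its standard form, with T denoting the number of observations (a symbol introduced here for the example), it is:

\[
\text{NSE} = 1-\frac{\sum_{t=1}^{T}\left(y_{o}^{t}-y_{m}^{t}\right)^{2}}{\sum_{t=1}^{T}\left(y_{o}^{t}-\overline{y}_{o}\right)^{2}}
\]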
Here, \({\overline{y} }_{o}\) represents the mean of observed data, while \({y}_{m}\) signifies the simulated data. Additionally, \({y}_{o}^{t}\) denotes the data being released at time instant t.
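The seven indicators can be computed together, as in the following sketch. It uses the standard definitions where these are unambiguous; the SD form (standard deviation of the relative error) and the MBE sign convention (predicted minus experimental) are assumptions, and the data values are made up.

```python
import math

def regression_metrics(y_exp, y_pred):
    """Return the seven error indicators for paired experimental/predicted data."""
    n = len(y_exp)
    errors = [e - p for e, p in zip(y_exp, y_pred)]
    relative = [err / e for err, e in zip(errors, y_exp)]   # requires y_exp != 0
    mean_exp = sum(y_exp) / n
    ss_res = sum(err ** 2 for err in errors)
    ss_tot = sum((e - mean_exp) ** 2 for e in y_exp)
    return {
        "MAPE": 100.0 / n * sum(abs(r) for r in relative),
        "SD": math.sqrt(sum(r ** 2 for r in relative) / (n - 1)),   # assumed form
        "RMSE": math.sqrt(ss_res / n),
        "R2": 1.0 - ss_res / ss_tot,
        "MAE": sum(abs(err) for err in errors) / n,
        "MBE": sum(p - e for e, p in zip(y_exp, y_pred)) / n,       # assumed sign
        "NSE": 1.0 - ss_res / ss_tot,
    }

# Hypothetical residual-mass values (%)
y_exp = [100.0, 80.0, 50.0, 20.0]
y_pred = [98.0, 82.0, 49.0, 21.0]
metrics = regression_metrics(y_exp, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Under these definitions, R2 and NSE are numerically identical, so reporting both mainly serves comparison with studies that quote one or the other. MAPE and SD are relative measures and become unstable when the experimental value approaches zero, which matters near the end of a TGA run.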
In combination with the statistical method, graphical analysis was utilized to verify the accuracy of the models. The following is a brief summary of what these graphical analyses imply61:
The error distribution plot shows the percent relative error (Ei), computed from the given equations, against the experimental values or an input variable. This graph illustrates the error pattern and how the Ei values are distributed around the zero-error line.
In the cross-plot, predicted values are plotted against experimental values; the more data points that lie close to the Y = X line, the more accurate the model.
A graph of cumulative frequency vs absolute relative error (Ea) displays the accuracy of the model in anticipating any percentage of data. Ea is computed using the following equation:
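As a sketch of how these quantities are obtained, the code below assumes the usual definitions \(E_{i}=100\,(y_{\text{exp},i}-y_{\text{pred},i})/y_{\text{exp},i}\) and \(E_{a}=|E_{i}|\), and reports the cumulative frequency as the percentage of data whose absolute relative error does not exceed a threshold. The data and thresholds are hypothetical.

```python
def percent_relative_errors(y_exp, y_pred):
    """Percent relative error E_i of each prediction (assumed sign convention)."""
    return [100.0 * (e - p) / e for e, p in zip(y_exp, y_pred)]

def cumulative_frequency(abs_errors, thresholds):
    """Percentage of data points with absolute relative error <= each threshold."""
    n = len(abs_errors)
    return [100.0 * sum(1 for a in abs_errors if a <= t) / n for t in thresholds]

# Hypothetical residual-mass values (%)
y_exp = [100.0, 80.0, 50.0, 20.0]
y_pred = [98.0, 82.0, 49.0, 21.0]
e_i = percent_relative_errors(y_exp, y_pred)
e_a = [abs(e) for e in e_i]
cum = cumulative_frequency(e_a, thresholds=[1, 2, 3, 5])
print(e_i, cum)
```

Reading the result is the same as reading the cumulative frequency plot: here half of the points have an absolute relative error of at most 2%, and all of them are within 5%.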
Results and discussion
Developed correlation
For the GP algorithm, the following correlation which can accurately predict the target parameter was developed. In order to optimize the model, a thorough grid search was done to find the optimum population size, tree depth, tree length, maximum generations, etc. As a result, the comprehensible equation consisting of 3 input parameters and 10 additional coefficients was established.
Figure 4 represents the schematic of the GP employed for estimating the residual crude oil during pyrolysis.
Statistical evaluation of models
According to the statistical analysis in Table 4, the XGBoost model provides the highest accuracy and reliability in terms of all the indicators. Its R2 and NSE are almost equal to 1, and its RMSE, SD, MAPE, MBE, and MAE are extremely small for the training, testing, and whole datasets. The GP approach is the least accurate of the four developed techniques, yet its precision is still high, with an R2 of 0.9820 for all portions of the data. GPR and CatBoost hold the middle positions. Both perform well, with RMSE, SD, MBE, and MAE close to zero and R2 above 0.99; GPR is the better of the two on every indicator except MBE. All of these models outperform previously published models and correlations in terms of the precision of the forecasts. In summary, the models rank from best to weakest as follows: XGBoost, GPR, CatBoost, and GP.
Graphical evaluation of models
In this regard, the graphical assessment of the models’ results was performed first by displaying the cross-plot of algorithms outcomes vs real data points, as shown in Fig. 5. Based on these diagrams, the spread of data points forms a line with a unit slope, indicating that the predicted and objective data points in all models except for GP are in excellent conformity. Having a unit slope though, XGBoost, GPR, and CatBoost differ from each other. As seen in the pictures, GPR and CatBoost have a number of insignificant outliers, whereas the XGBoost line is the smoothest among all.
A residual graph depicts the residuals along the vertical axis and the independent variable along the horizontal axis. The residual is the difference between the observed and predicted values. According to Fig. 6, the best performance is that of XGBoost, which has the smallest y-axis range (from approximately − 1 to around 1) and the fewest outliers. GP is the least accurate, with residuals spreading from − 20 to 20. GPR and CatBoost are similar, with the majority of observations located in the same range, yet the outliers of CatBoost make it less precise than GPR.
A histogram of the error distribution shows how frequently errors of each magnitude occur. Based on Fig. 7, the distributions are tightly centered with little deviation for all the approaches employed. The majority of the observations lie at zero relative error for both training and testing. However, XGBoost again leads the assessment, as its relative error spread is the narrowest, from roughly − 0.14 to 0.08 for both training and testing.
Figure 8 illustrates the relative deviation graph of the developed models' results. The horizontal axis of this graph represents the experimental values, while the vertical axis represents the relative deviation of the model results from the experimental values. This graph demonstrates that the relative deviations of the proposed models are generally spread around the zero-deviation line, indicating that the models can predict the target data with tolerable error rates. As in all the previous cases, XGBoost is the most effective of the techniques used, with the smallest relative error range, from around − 0.14 to 0.08. GP has the widest range, from ~ − 0.5 to ~ 0.5, which indicates the lowest level of accuracy.
A cumulative frequency plot shows the share of data falling within successive absolute relative error intervals. As depicted in Fig. 9, XGBoost, GPR, and CatBoost are the most effective methods for predicting the target parameter, as the relative error of 90% of the data does not exceed 10–15%. The best precision is obtained with XGBoost, for which the relative error of around 99% of the data does not exceed roughly 5–7%. GP performs worse than the black-box models, as shown by the graph. Although the lower accuracy of the developed correlation is evident and was to be expected, the advantage of the correlation is fast prediction without the artificial intelligence-related knowledge that is usually required to use black-box models.
In comparison with previous research, this study used a dataset comprising 2071 experimental TGA data points for 13 distinct crude oil samples, which provides a comprehensive foundation for the modeling approach. This dataset represents the most extensive compilation used for modeling crude oil pyrolysis to date. The application of advanced machine learning techniques led to the development of models with high accuracy. Specifically, the XGBoost model achieved an overall MAPE of 0.7796% and an R2 of 0.9999, signifying a remarkable level of precision. This result compares favorably with prior investigations, which have also sought to model crude oil pyrolysis and predict fuel deposition. Notably, Rasouli et al.29 developed a multilayer perceptron model with a 3.5% error for the pyrolysis of 6 crudes, Norouzpour et al.30 employed a radial basis function neural network with a 5.8% error for the pyrolysis of 6 crudes, and Mohammadi et al.32 used a cascade forward neural network with a 1.04% error for the pyrolysis of 11 crudes. While these models showed respectable performance, the current study not only extends the dataset size but also applies a variety of machine learning techniques, enhancing accuracy and robustness in modeling. Furthermore, this study introduces a straightforward mathematical correlation that remains reasonably accurate, with a 9.73% error. Formulating an explicit correlation between input and output data is difficult with black-box methods. Moreover, the application of black-box models demands suitable computing resources and specialized expertise, which limits their accessibility. Consequently, the development of user-friendly mathematical correlations using advanced white-box algorithms can streamline the prediction of fuel formation during crude oil pyrolysis, offering rapid predictions without the need for specialized tools.
Trend analysis
Lastly, Figs. 10, 11 and 12 illustrate how the XGBoost model predicts residual crude oil during pyrolysis as a function of temperature for various crudes and heating rates. The oil samples in each graph were taken from the same study to guarantee that the TG experimental settings were identical. Table 1 provides a summary of the heating rates and characteristics of these crude oils. The TG curves of several crudes (Oil 1, 2, 3, 5, and 6) with respect to temperature are shown in Fig. 10a,b. As shown in Fig. 10, the XGBoost model successfully reproduces the experimental trend for various heating rates and oils. Because crude oils have diverse constituents, their TG curves are likewise distinct. Heavy crudes often leave more residue because they contain more asphaltenes. In this instance, the proposed XGBoost model accurately captures the TG curve trend and predicts the quantity of residue for each crude oil sample at various temperatures.
Figure 11 displays the TG curves for crude sample #5 at three diverse heating rates. As shown in Fig. 11, when the heating rate decreases, the crude oil's TG curve shifts to the left as a consequence of a longer exposure period to heat. The XGBoost model successfully predicts the experimental trend and tracks the influence of the heating rate.
Figure 12 displays the TG curves for crude samples #4 and #7 at the same heating rate, which is 10 °C/min. As shown in Fig. 12, the XGBoost model successfully predicts the experimental trend.
Sensitivity analysis
To assess the comparative significance of input variables on residual crude oil, the relevance factor (r) and the XGBoost model results are utilized. The accompanying method is utilized to calculate the r values for each input parameter32,62:
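The relevance factor is commonly defined as the Pearson correlation coefficient between an input and the model output. Assuming that definition, with n the number of data points and the symbols as explained immediately below, it can be written as:

\[
r\left(inp_{i}, \sigma\right) = \frac{\sum_{j=1}^{n}\left(inp_{i,j}-inp_{m,i}\right)\left(\sigma_{j}-\sigma_{m}\right)}{\sqrt{\sum_{j=1}^{n}\left(inp_{i,j}-inp_{m,i}\right)^{2}\,\sum_{j=1}^{n}\left(\sigma_{j}-\sigma_{m}\right)^{2}}}
\]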
where \({\sigma }_{m}\) is the average value of the calculated residual crude oil and \({\sigma }_{j}\) is the jth value of the calculated crude oil residue; \(in{p}_{i,j}\) and \(in{p}_{m,i}\) are the jth and average values of the ith input parameter, respectively, where the inputs are oil °API gravity, heating rate, resins, asphaltenes, and temperature. Figure 13 depicts the relative effect and relevance of the input parameters on residual crude oil. The greatest impact in the XGBoost model is that of temperature, with a relevance factor of approximately − 0.92. The other parameters (resin, asphaltene, heating rate, and oil °API gravity) are much less influential, each with an absolute relevance factor below 0.16. Overall, among the mentioned inputs, temperature and asphaltenes have the highest influence on the crude oil pyrolysis process. In addition, temperature and oil °API gravity have a negative impact on fuel deposition, while asphaltenes, resins, and heating rate have a positive impact on fuel deposition during crude oil pyrolysis. This means that the higher the asphaltene and resin contents of the crude, the greater the amount of fuel (coke) formed.
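Assuming the relevance factor is the Pearson correlation between an input and the model output, it can be computed as in the following sketch. The data values are made up for illustration and are not taken from the study's database.

```python
import math

def relevance_factor(inputs, outputs):
    """Pearson correlation between one input variable and the model output."""
    n = len(inputs)
    mean_in = sum(inputs) / n
    mean_out = sum(outputs) / n
    covariance = sum((a - mean_in) * (b - mean_out) for a, b in zip(inputs, outputs))
    spread_in = sum((a - mean_in) ** 2 for a in inputs)
    spread_out = sum((b - mean_out) ** 2 for b in outputs)
    return covariance / math.sqrt(spread_in * spread_out)

# Hypothetical values: predicted residue (%) against two inputs
temperature = [100.0, 200.0, 300.0, 400.0]
asphaltenes = [2.0, 9.0, 4.0, 6.0]
residue = [90.0, 70.0, 50.0, 30.0]
r_temperature = relevance_factor(temperature, residue)
r_asphaltenes = relevance_factor(asphaltenes, residue)
print(r_temperature, r_asphaltenes)
```

The factor lies between − 1 and 1: its sign gives the direction of the effect and its magnitude the strength of the linear association. Being a linear measure, it can understate the importance of inputs whose effect on the residue is strongly nonlinear.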
The strong negative impact of temperature on fuel deposition is attributed to the fundamental principles of pyrolysis. Elevated temperatures promote the thermal cracking and vaporization of hydrocarbons in crude oil, leading to a large reduction in the mass of residual crude oil. This behavior is consistent with the well-established pyrolysis process. Asphaltenes and resins are complex, high-molecular-weight components of crude oil. They tend to break down and contribute to coke formation during pyrolysis. Their positive impact on fuel deposition can be attributed to their transformation into solid carbonaceous residues, which enhance the overall fuel availability for sustaining the combustion front in ISC. Overall, an increase in asphaltene and resin content reduces the mass loss during the pyrolysis of crude oil, consequently increasing fuel deposition. Although the heating rate governs how quickly the temperature rises, its impact is relatively low in this model. As the heating rate increases, the TG curve for crude oil shifts to the right, signifying an increase in the mass of residual crude oil at a given temperature. This result is linked to the reduced exposure time of the crude oil to heat. Finally, the lower relevance of oil °API gravity implies that its effect on fuel deposition is less pronounced. Typically, heavier crude oils, characterized by lower °API gravity, tend to leave more residue, primarily because of their higher asphaltene concentration. In summary, these sensitivity analysis outcomes are rooted in the complex chemistry of crude oil pyrolysis. Understanding the behavior of these parameters can aid in optimizing ISC processes and improving the recovery of unburned oil.
Conclusions
Crude oil pyrolysis analysis through TGA runs offers insights into fuel deposition during ISC EOR. This study aimed to precisely model crude oil pyrolysis by leveraging 2071 experimental TGA datasets obtained from literature sources. A suite of robust machine learning techniques, encompassing three black-box approaches (CatBoost, GPR, and XGBoost) and a white-box approach (GP), was employed to estimate crude oil residue at varying temperature intervals during TGA runs. Among the developed models and the mathematical correlation, the XGBoost model was the most accurate, achieving an overall MAPE of 0.7796% and an R² of 0.9999. The GPR, CatBoost, and GP models provided the next best results, in that order. Notably, the GP model, despite displaying a slightly higher error than the black-box models, provided satisfactory results, making it a viable option for rapid estimation of crude oil residue during pyrolysis. Moreover, a sensitivity analysis was conducted to explore the relative impact and significance of the inputs on residual crude oil during pyrolysis. Among these inputs, temperature and asphaltenes were identified as the most influential factors in the crude oil pyrolysis process. Higher temperatures and oil °API gravity were associated with a negative impact, leading to a decrease in fuel deposition. On the other hand, increased values of asphaltenes, resins, and heating rates showed a positive impact, resulting in an increase in fuel deposition.
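For readers who wish to reproduce error metrics of the kind quoted above, the sketch below computes MAPE and R² using their most common textbook definitions. This is an assumption: the exact expressions used in the study may differ (for example in the denominator of R²). The function names and the measured and predicted values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def mape(measured, predicted):
    """Mean absolute percentage error, in percent (measured values must be nonzero)."""
    n = len(measured)
    return 100.0 / n * sum(abs((m - p) / m) for m, p in zip(measured, predicted))


def r_squared(measured, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot (common definition)."""
    mean_m = sum(measured) / len(measured)
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    ss_tot = sum((m - mean_m) ** 2 for m in measured)
    return 1.0 - ss_res / ss_tot


# Hypothetical residue values in mass % (not from the paper's dataset).
measured = [100.0, 80.0, 50.0, 20.0]
predicted = [99.0, 80.8, 49.5, 20.2]   # each prediction is off by 1% of the measurement

error_pct = mape(measured, predicted)
r2 = r_squared(measured, predicted)
print(f"MAPE = {error_pct:.4f}%, R2 = {r2:.6f}")
```

Note that MAPE weights errors relative to the measured value, so small residues at high temperatures can dominate the metric even when the absolute error is small.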
Data availability
The datasets used during the current study are available from the corresponding author on reasonable request.
References
Green, D. W. & Willhite, G. P. Enhanced oil recovery. Vol. 6 (Henry L. Doherty Memorial Fund of AIME, Society of Petroleum Engineers, 1998).
Tarek, A. & Nathan, M. Advanced reservoir management and engineering (Gulf Professional Pub, 2012).
Fazlyeva, R. et al. In situ combustion. Thermal Methods, 155–215 (2023).
Sarathi, P. S. In-situ combustion handbook--principles and practices (National Petroleum Technology Office, Tulsa, OK (US), 1999).
Mahinpey, N., Ambalae, A. & Asghari, K. In situ combustion in enhanced oil recovery (EOR): A review. Chem. Eng. Commun. 194, 995–1021 (2007).
Ciajolo, A. & Barbella, R. Pyrolysis and oxidation of heavy fuel oils and their fractions in a thermogravimetric apparatus. Fuel 63, 657–661 (1984).
Ramey, H. (Gulf Publishing Company, Texas, 1985).
Vossoughi, S. TGA/DSC techniques as research tools for the study of the in-situ combustion process. Thermochim. Acta 106, 63–69 (1986).
Ranjbar, M. & Pusch, G. Pyrolysis and combustion kinetics of crude oils, asphaltenes and resins in relation to thermal recovery processes. J. Anal. Appl. Pyrolysis 20, 185–196 (1991).
Ranjbar, M. Influence of reservoir rock composition on crude oil pyrolysis and combustion. J. Anal. Appl. Pyrolysis 27, 87–95 (1993).
Kok, M. V. Use of thermal equipment to evaluate crude oils. Thermochim. Acta 214, 315–324 (1993).
Karacan, O. & Kok, M. V. Pyrolysis analysis of crude oils and their fractions. Energy Fuels 11, 385–391 (1997).
Kök, M. & Karacan, O. Pyrolysis analysis and kinetics of crude oils. J. Thermal Anal. Calorimetry 52, 781–788 (1998).
Ambalae, A., Mahinpey, N. & Freitag, N. Thermogravimetric studies on pyrolysis and combustion behavior of a heavy oil and its asphaltenes. Energy Fuels 20, 560–565 (2006).
Kok, M. V. Clay concentration and heating rate effect on crude oil combustion by thermogravimetry. Fuel Process. Technol. 96, 134–139 (2012).
Li, Y.-B. et al. Characteristics and properties of coke formed by low-temperature oxidation and thermal pyrolysis during in situ combustion. Ind. Eng. Chem. Res. 59, 2171–2180 (2020).
Kök, M. & Iscan, A. Catalytic effects of metallic additives on the combustion properties of crude oils by thermal analysis techniques. J. Thermal Anal. Calorimetry 64, 1311–1318 (2001).
Rezaei, M., Schaffie, M. & Ranjbar, M. Thermocatalytic in situ combustion: Influence of nanoparticles on crude oil pyrolysis and oxidation. Fuel 113, 516–521 (2013).
Zhang, X., Liu, Q. & Fan, Z. Enhanced in situ combustion of heavy crude oil by nickel oxide nanoparticles. Int. J. Energy Res. 43, 3399–3412 (2019).
Li, Y.-B. et al. Study of the catalytic effect of copper oxide on the low-temperature oxidation of Tahe ultra-heavy oil. J. Thermal Anal. Calorimetry 135, 3353–3362 (2019).
Abaas, M., Yuan, C., Emelianov, D. A., Varfolomeev, M. A. & Ariskina, K. A. Effect of calcite on crude oil combustion characterized by high-pressure differential scanning calorimetry (HP-DSC). Pet. Sci. Technol. 37, 1216–1221 (2019).
Li, Y.-B. et al. A comprehensive investigation of the influence of clay minerals on oxidized and pyrolyzed cokes in in situ combustion for heavy oil reservoirs. Fuel 302, 121168 (2021).
Ren, Y., Freitag, N. & Mahinpey, N. A simple kinetic model for coke combustion during an in-situ combustion (ISC) process. J. Can. Pet. Technol. 46 (2007).
Murugan, P., Mahinpey, N., Mani, T. & Freitag, N. Pyrolysis and combustion kinetics of Fosterton oil using thermogravimetric analysis. Fuel 88, 1708–1713 (2009).
Gundogar, A. S. & Kok, M. V. Thermal characterization, combustion and kinetics of different origin crude oils. Fuel 123, 59–65 (2014).
Karimian, M., Schaffie, M. & Fazaelipoor, M. H. A kinetic investigation into the in situ combustion reactions of Iranian heavy oil from Kuh-E-Mond reservoir. Iran. J. Oil Gas Sci. Technol. 6, 18–33 (2017).
Zhao, S., Pu, W., Sun, B., Gu, F. & Wang, L. Comparative evaluation on the thermal behaviors and kinetics of combustion of heavy crude oil and its SARA fractions. Fuel 239, 117–125 (2019).
Wang, J.-X., Wang, L.-L., Wang, T.-F. & Peng, X.-Q. Effects of SARA fractions on pyrolysis behavior and kinetics of heavy crude oil. Pet. Sci. Technol. 38, 945–954 (2020).
Rasouli, A., Dabiri, A. & Nezamabadi-pour, H. A multi-layer perceptron-based approach for prediction of the crude oil pyrolysis process. Energy Sour. Part A Recov. Util. Environ. Effects 37, 1464–1472 (2015).
Norouzpour, M., Rasouli, A. R., Dabiri, A., Azdarpour, A. & Karaei, M. A. Prediction of crude oil pyrolysis process using radial basis function networks. Revista QUID, 567–576 (2017).
Mohammadi, M.-R. et al. On the evaluation of crude oil oxidation during thermogravimetry by generalised regression neural network and gene expression programming: Application to thermal enhanced oil recovery. Combust. Theory Model. 25, 1268–1295 (2021).
Mohammadi, M.-R., Hemmati-Sarapardeh, A., Schaffie, M., Husein, M. M. & Ranjbar, M. Application of cascade forward neural network and group method of data handling to modeling crude oil pyrolysis during thermal enhanced oil recovery. J. Pet. Sci. Eng. 205, 108836 (2021).
Alvarez, E. et al. Pyrolysis kinetics of atmospheric residue and its SARA fractions. Fuel 90, 3602–3607 (2011).
Coriolano, A. C., Oliveira, A. A., Bandeira, R. A., Fernandes, V. J. & Araujo, A. S. Kinetic study of thermal and catalytic pyrolysis of Brazilian heavy crude oil over mesoporous Al-MCM-41 materials. J. Thermal Anal. Calorimetry 119, 2151–2157 (2015).
Wang, Y. et al. New insights into the oxidation behaviors of crude oils and their exothermic characteristics: Experimental study via simultaneous TGA/DSC. Fuel 219, 141–150 (2018).
Coriolano, A. C., Oliveira, A. A., Bandeira, R. A., Fernandes, V. J. & Araujo, A. S. Kinetic study of thermal and catalytic pyrolysis of Brazilian heavy crude oil over mesoporous Al-MCM-41 materials. J. Therm. Anal. Calorimetry 119, 2151–2157 (2015).
Bae, J. Characterization of crude oil for fireflooding using thermal analysis methods. Soc. Pet. Eng. J. 17, 211–218 (1977).
Hemmati-Sarapardeh, A., Varamesh, A., Husein, M. M. & Karan, K. On the evaluation of the viscosity of nanofluid systems: Modeling and data assessment. Renew. Sustain. Energy Rev. 81, 313–329 (2018).
Rasmussen, C. E. & Williams, C. K. Gaussian processes in machine learning. Lect. Notes Comput. Sci. 3176, 63–71 (2004).
Rasmussen, C. E. & Williams, C. K. Gaussian processes for machine learning. Vol. 1 (Springer, 2006).
Ouyang, Z.-L., Chen, G. & Zou, Z.-J. Identification modeling of ship maneuvering motion based on local Gaussian process regression. Ocean Eng. 267, 113251 (2023).
Schapire, R. E. & Freund, Y. Boosting: Foundations and algorithms. Kybernetes 42, 164–166 (2013).
Friedman, J. H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 1189–1232 (2001).
Chen, T. & Guestrin, C. in Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining. 785–794.
Zheng, H. & Wu, Y. A xgboost model with weather similarity analysis and feature engineering for short-term wind power forecasting. Appl. Sci. 9, 3019 (2019).
Ma, X., Fang, C. & Ji, J. in IOP Conference Series: Earth and Environmental Science. 012013 (IOP Publishing).
Madani, S. A. et al. Modeling of nitrogen solubility in normal alkanes using machine learning methods compared with cubic and PC-SAFT equations of state. Sci. Rep. 11, 24403 (2021).
Shi, Y., Li, J. & Li, Z. Gradient boosting with piece-wise linear regression trees. arXiv preprint arXiv:1802.05640 (2018).
Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A. V. & Gulin, A. CatBoost: Unbiased boosting with categorical features. Adv. Neural Inf. Process. Syst. 31 (2018).
Dorogush, A. V., Ershov, V. & Gulin, A. CatBoost: gradient boosting with categorical features support. arXiv preprint arXiv:1810.11363 (2018).
Pham, T. D. et al. Comparison of machine learning methods for estimating mangrove above-ground biomass using multiple source remote sensing data in the red river delta biosphere reserve Vietnam. Remote Sens. 12, 1334 (2020).
Liu, W. et al. A semi-supervised tri-catboost method for driving style recognition. Symmetry 12, 336 (2020).
Fengshun, M., Yan, L., Cen, G., Meiji, W. & Dongmei, L. Diabetes prediction method based on CatBoost algorithm. Comput. Syst. Appl. 28, 215–218 (2019).
Al-Sahaf, H. et al. A survey on evolutionary machine learning. J. R. Soc. N. Zeal. 49, 205–228 (2019).
Poli, R., Langdon, W. B., McPhee, N. F. & Koza, J. R. A field guide to genetic programming. Lulu.com. With contributions by J. R. Koza (2008).
Koza, J. R. Genetic programming: On the programming of computers by means of natural selection (complex adaptive systems). A Bradford Book 1, 18 (1993).
Emigdio, Z. et al. Modeling the adsorption of phenols and nitrophenols by activated carbon using genetic programming. J. Clean. Prod. 161, 860–870 (2017).
Bi, Y., Xue, B. & Zhang, M. Genetic programming for image classification: An automated approach to feature learning. Vol. 24 (Springer Nature, 2021).
Mohammadi, M.-R. et al. Modeling hydrogen solubility in hydrocarbons using extreme gradient boosting and equations of state. Sci. Rep. 11, 17911 (2021).
Liu, B. et al. Pore structure characterization of solvent extracted shale containing kerogen type III during artificial maturation: Experiments and tree-based machine learning modeling. Energy 283, 128885 (2023).
Rashidi-Khaniabadi, A., Rashidi-Khaniabadi, E., Amiri-Ramsheh, B., Mohammadi, M.-R. & Hemmati-Sarapardeh, A. Modeling interfacial tension of surfactant–hydrocarbon systems using robust tree-based machine learning algorithms. Sci. Rep. 13, 10836 (2023).
Ansari, S. et al. Experimental measurement and modeling of asphaltene adsorption onto iron oxide and lime nanoparticles in the presence and absence of water. Sci. Rep. 13, 122 (2023).
Author information
Contributions
F.H.: writing—original draft, methodology, visualization, A.R.: writing—original draft, validation, methodology, M.-R.M.: data curation, writing—original draft, M.M.G.: conceptualization, methodology, visualization, P.P.: validation, methodology, reviewing and editing, A.H.-S.: supervision, conceptualization, methodology, reviewing and editing.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Hadavimoghaddam, F., Rozhenko, A., Mohammadi, MR. et al. Modeling crude oil pyrolysis process using advanced white-box and black-box machine learning techniques. Sci Rep 13, 22649 (2023). https://doi.org/10.1038/s41598-023-49349-x
DOI: https://doi.org/10.1038/s41598-023-49349-x