Introduction

Although the use of nonrenewable energy may be declining, this sector still holds significant potential, and improving oil recovery methods remains a pivotal step toward a more sustainable future for the industry. Experimental studies, supported by recent technological advances, make it possible to predict key behaviors of hydrocarbons. Among oil recovery methodologies, in-situ combustion (ISC) presents a promising thermal recovery technique. This process involves injecting air into a reservoir, which generates energy through oxidation reactions with crude oil. ISC is considered an effective approach for extracting heavy oil due to its strong displacement capabilities and associated cost efficiencies1,2,3. The ISC process has also proven effective as a follow-up to steam injection. Injected hot air oxidizes hydrocarbons or coke in the reservoir, creating a combustion front that generates substantial heat; the resulting rise in oil temperature enhances mobility, maximizing recovery efficiency and making ISC an appealing option for oil extraction4,5. Before designing an ISC process, it is essential to understand the oxidation behavior of heavy oil with oxygen; in other words, improved ISC development requires modeling the fundamental reaction dynamics6,7. Despite its potential benefits, industrial application of ISC has faced criticism due to unclear oxidation mechanisms and combustion processes, leading to project failures8,9. Understanding the dynamic nature of coke deposition and combustion is therefore key: examining the structural changes and combustion behavior of coke during the low-temperature oxidation (LTO) process provides insights that can improve the ISC process for challenging heavy oil reservoirs.
Therefore, the characteristics and quantity of coke produced are crucial for sustaining the ISC process10,11,12. From this viewpoint, it is indispensable to understand how LTO affects the coking mechanism and the formation of coke, as this strongly impacts the combustion process13. Another reaction region that may occur in ISC is the negative temperature gradient region, a transitional zone between LTO and high-temperature oxidation (HTO) in which the oxygen absorption rate decreases with rising temperature14,15. In this region, heavy components are converted into lighter components to produce fuel for combustion16,17. HTO reactions are the most critical aspect of the ISC process because they release a considerable amount of heat. Typically, coke produced by LTO or thermal cracking is thought to serve as the primary fuel source for HTO reactions18,19.

Numerous studies have been conducted from both laboratory and system modeling perspectives to achieve a deeper understanding of the phenomenon of residue formation from crude oil during ISC. Belgrave et al.20 built a reaction model including thermal cracking, LTO, and HTO reactions, in which the bitumen was divided into two components, asphaltene and maltene. Yamamoto et al.21,22 built a Lattice Boltzmann model for propane combustion and subsequently simulated combustion, although they did not consider the effects of thermal expansion. Xu et al.23 studied coke deposition at varying oxidation temperatures for conventional heavy oil from Xinjiang, finding that coke deposition peaked under LTO conditions while coke combustion took place under HTO conditions. Zhao et al.24,25 combined experimental and theoretical results to summarize LTO reaction pathways, significantly enhancing the understanding of LTO reactions. Askarova et al.26 conducted medium-pressure coking tests on crushed and consolidated core samples to examine ISC feasibility in a heavy-oil carbonate reservoir. Their experiments analyzed phase behavior under high pressures and temperatures, highlighting that consolidated cores require increased air flux due to lower porosity.

In recent years, significant advancements have been made in the modeling and simulation of underground reservoirs, covering a wide range of topics including enhanced oil recovery (EOR) processes, complex pore-scale fluid-rock interactions, the thermal behavior of insulating oils, shales, and coals, as well as ignition and combustion sciences27,28,29,30,31,32,33,34,35. Notably, the advent of artificial intelligence has reshaped various aspects of thermal EOR. Rasouli et al.36 applied thermogravimetric analysis (TGA) to investigate the pyrolysis of six crude oils in a nitrogen atmosphere; a simple neural network model developed to predict the residual crude oil as a function of temperature during pyrolysis achieved an acceptable average relative error of about 3.5%. Norouzpour et al.37 then created radial basis function network models to predict the remaining mass of crudes during pyrolysis, using input parameters such as API gravity, viscosity, resin and asphaltene content, temperature, and heating rate; their results indicated that the developed model can precisely predict the remaining weight of crude oil during pyrolysis. Along the same lines, Mohammadi et al.38 created general models, based on various neural network architectures, for predicting residual mass during crude oil pyrolysis; their sensitivity analysis revealed that temperature and asphaltene content substantially influence mass loss. Furthermore, Mohammadi et al.39 modeled oxidation reactions during ISC with advanced machine-learning techniques and 2289 experimental TGA data points. Hadavimoghaddam et al.40 modeled crude oil pyrolysis during TGA using various machine learning techniques; their results showed that higher temperatures and lower oil gravity decreased fuel deposition. Fuel formation from crude oil pyrolysis and oxidation is essential in ISC, as it generates the coke needed to sustain the combustion front.
Proper fuel generation ensures continuous heat release, supporting effective oil recovery. The literature review reveals limited studies attempting to model these processes, with a focus primarily on pyrolysis over oxidation and a reliance on minimal crude oil TGA data. Hence, incorporating diverse crude oil data, emphasizing oxidation profiles, and utilizing advanced machine learning algorithms could aid in developing a robust model to predict residue formation during ISC.

In this study, the thermo-oxidative profiles and residue formation of crude oils during TGA are modeled using 3075 experimental data points from 18 crude oils with API gravities between 5 and 42. To develop accurate predictive models, four advanced tree-based machine learning algorithms were utilized: gradient boosting with categorical features support (CatBoost); light gradient boosting machine (LightGBM); random forest (RF); and extreme gradient boosting (XGBoost). To assess model performance, statistical metrics and graphical error analysis are utilized. Moreover, sensitivity analysis is conducted to investigate the relationships between input variables and the predictions of the top-performing model. Eventually, the leverage method is applied to assess the model’s validity and its applicable scope.

Data gathering and preprocessing

In the current work, a collection of 3075 TGA experimental data points, sourced from credible literature41,42,43,44,45,46,47,48,49,50,51,52,53, was gathered for the purpose of modeling crude oil oxidation. The models take into account the °API gravity of the oil, asphaltenes (wt%), resins (wt%), the heating rate, and temperature as input variables. Meanwhile, the output of the models is residual crude oil (wt%) measured at varying temperatures. The selection of model input parameters is crucial for accurately capturing the key factors influencing the thermal decomposition of crude oil in TGA. The °API gravity, asphaltenes, and resins are essential for characterizing the crude oil, as these properties significantly impact its oxidation behavior. The inclusion of the heating rate and temperature as input variables reflects the operational conditions of TGA, providing a realistic representation of the thermal environment. This combination of parameters ensures a comprehensive model that effectively captures both the intrinsic characteristics of the crude oil and the external thermal conditions, aligning with the approaches demonstrated in prior studies39,52,54. Table 1 presents a summary of the properties of crude oils analyzed through TGA along with the heating rates and temperature ranges of the experiments. Moreover, Table 2 presents the statistical analysis of both the target and input variables for the database gathered in this work. The ranges of the input variables shown in Table 2 are as follows: °API gravity (5–42.03), asphaltene content (0.53–49.12 wt%), resin content (1.09–51.30 wt%), heating rate (2–30 °C/min), and temperature (25.25–821.62 °C). The target variable, crude oil residue, ranges from 0.03 to 100.00 wt%. The data utilized for developing the machine learning models are included in the Supplementary Information accompanying this manuscript. Also, Fig. 
1 depicts box plots for all variables in the dataset, showcasing key statistical indicators such as the quartiles, mean, median, minimum, and maximum. A box plot, or whisker plot, is a graphical representation applied in data preprocessing to reveal the distribution of numerical data through their quartiles. This type of plot provides a graphical summary of the minimum, first quartile (Q1), third quartile (Q3), median, and maximum values, making it easy to identify outliers and understand the data’s spread. The “box” indicates the interquartile range (IQR), containing the 50% of the data between Q1 and Q3. The whiskers extend from the box to the minimum and maximum values within 1.5 times the IQR from Q1 and Q3, respectively. These plots offer a clear visual summary, making them an effective tool for analyzing the dataset’s statistical properties. As is evident, the broad ranges of both the input and target parameters are sufficient for developing a reliable model to anticipate crude oil oxidation behavior in TGA experiments.
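The quartile and whisker definitions above can be made concrete in a few lines of Python; the residue values below are hypothetical and serve only to illustrate the 1.5 × IQR outlier rule:

```python
import statistics

def box_plot_stats(values):
    """Summary statistics visualized by a box plot."""
    q1, median, q3 = statistics.quantiles(values, n=4)  # Q1, median, Q3
    iqr = q3 - q1                                       # interquartile range
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr             # whisker limits
    outliers = [v for v in values if v < lo or v > hi]
    return {"Q1": q1, "median": median, "Q3": q3, "IQR": iqr,
            "outliers": outliers}

# Hypothetical residue values (wt%), for illustration only
sample = [12.0, 15.5, 18.2, 20.1, 22.4, 25.0, 27.3, 30.8, 95.0]
stats = box_plot_stats(sample)
```

Here the extreme value 95.0 falls beyond the upper whisker limit and would be drawn as an individual outlier point in the box plot.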

Table 1 The properties of crude oils analyzed through TGA considered for modeling.
Table 2 Statistical summary of the database applied for modeling in this work.
Fig. 1
figure 1

Box plots of the variables used for modeling in this work.

Model development

CatBoost

CatBoost is a gradient boosting decision tree technique that is particularly adept at managing categorical features, which can have a substantial impact on predictive accuracy55. Introduced by Prokhorenkova et al.56, the CatBoost algorithm effectively transforms categorical variables into numerical representations utilizing greedy target-based statistics (Greedy TBS), necessitating limited hyperparameter adjustments57,58. Within its gradient-boosting architecture, it employs both non-symmetric and symmetric tree construction methods, initiating the process with a root node that encompasses all data. This method enhances the feature space by generating combinations of categorical features based on their interrelationships, employing greedy target-based statistics59. Additionally, CatBoost incorporates “ordered boosting” to mitigate the target leakage issue commonly associated with gradient boosting (GB) methods and demonstrates strong performance with small datasets by utilizing stochastic permutations of the training data, thereby improving its robustness.

Given a training set with N samples, where each sample is represented as \(({x}_{i}, {y}_{i})\), with \({x}_{i}\) a feature vector and \({y}_{i}\) the corresponding target variable, CatBoost aims to learn a function \(F(x)\) that estimates the target variable \(y\)60.

$$F\left(x\right)={F}_{0}\left(x\right)+\sum_{m=1}^{M}{f}_{m}\left(x\right)$$
(1)

where \(F\left(x\right)\) represents the overall prediction function, \({F}_{0}\left(x\right)\) is the initial guess or baseline prediction, \(M\) is the number of boosting iterations (trees), and \({f}_{m}(x)\) represents the prediction of the mth tree for input \(x\).

This framework employs both symmetric and non-symmetric methods for tree construction61,62. This algorithm implements a non-symmetric splitting technique for numerical attributes, while utilizing a symmetric splitting approach for categorical attributes, evaluating all potential split points and generating a new branch for each category63. This methodology improves the GB algorithm and effectively develops a robust model by iteratively reducing training errors64. Figure 2 demonstrates a schematic representation of the CatBoost algorithm.
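The additive structure of Eq. (1) can be illustrated with a minimal pure-Python gradient-boosting sketch. This is not CatBoost itself (no ordered boosting, no categorical handling, no symmetric trees); it only shows the generic boosting scheme of fitting each new learner to the current residuals, using one-split regression stumps on a hypothetical 1-D toy dataset:

```python
def fit_stump(x, residuals):
    """One-split regression stump on 1-D inputs, chosen by least SSE."""
    best = None
    for t in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        if not left or not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - ml) ** 2 for r in left) + sum((r - mr) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    _, t, ml, mr = best
    return lambda v, t=t, ml=ml, mr=mr: ml if v <= t else mr

def predict(v, f0, trees, lr=0.5):
    """Eq. (1): baseline prediction plus the (shrunken) sum of tree outputs."""
    return f0 + lr * sum(tree(v) for tree in trees)

def boost(x, y, n_trees=20, lr=0.5):
    """Fit each new stump to the residuals of the current ensemble."""
    f0 = sum(y) / len(y)                      # baseline prediction F0
    trees = []
    for _ in range(n_trees):
        resid = [yi - predict(xi, f0, trees, lr) for xi, yi in zip(x, y)]
        trees.append(fit_stump(x, resid))
    return f0, trees

# Hypothetical step-shaped target for illustration
x = list(range(1, 9))
y = [1.0, 1.0, 1.0, 1.0, 5.0, 5.0, 5.0, 5.0]
f0, trees = boost(x, y)
```

After a few rounds, the ensemble reproduces the step function closely, since each stump removes half of the remaining residual at the learning rate used here.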

Fig. 2
figure 2

Schematic image of a CatBoost algorithm.

XGBoost

XGBoost is a powerful supervised machine learning algorithm65 that combines classification and regression trees (CARTs) to improve fit to the learning dataset66. XGBoost is popular due to its strong generalization ability and its differences from other boosting methods, making it effective for regression and classification problems67,68. It combines several weak classifiers/regressors into a strong classifier/regressor using the decision tree (DT) algorithm69. XGBoost is fundamentally based on an optimized distributed GB framework. While traditional GB constructs trees in a sequential manner, XGBoost enhances this process by parallelizing tree construction. This parallel computation occurs at a more granular level than in bagging methods, specifically during the tree-building phase of each iteration. There are three categories of nodes in this algorithm70. As presented in Fig. 3, the root node plays a crucial role in each CART tuning procedure, followed by the interior nodes and, finally, the leaf nodes.

Fig. 3
figure 3

Schematic illustration of an XGBoost model.

During the XGBoost development process, a series of CARTs is built sequentially, and each CART estimator is assigned a weight that is tuned during training to produce a robust final prediction. After aggregating the tree results, the model’s prediction is calculated as follows71:

$$\widehat{{y}_{i}}=\sum_{k=1}^{K}{f}_{k}\left({x}_{i}\right), {f}_{k}\in F$$
(2)
$$F=\left\{f\left(x\right)={\omega }_{q\left(x\right)}\right\}, \left(q:{R}^{m}\to T,\omega \in {R}^{T}\right)$$
(3)

in which \(\widehat{{y}_{i}}\) denotes the predicted value, \({f}_{k}\) is the output of the kth regression tree, \(T\) is the number of leaf nodes, \(q\) denotes the tree structure mapping an input to a leaf, and \(\omega\) is the leaf weight vector.

In the context of regression tasks, XGBoost incrementally adds new regression trees, each designed to fit the residuals produced by the preceding model. The ultimate prediction is derived by aggregating the outputs of all trained DTs across the training iterations72. This paradigm improves on plain GB by integrating a regularization component into the objective function, which helps reduce the risk of overfitting70,73:

$${Obj}^{(r)}=\sum_{i=1}^{n}L\left({y}_{i},{\widehat{y}}_{i}^{(r)}\right)+\sum_{i=1}^{r}\Omega ({g}_{r})$$
(4)

where \({y}_{i}\) and \({\widehat{y}}_{i}^{(r)}\) denote the experimental value and the estimated value at the r-th step, respectively, \(L\left({y}_{i},{\widehat{y}}_{i}^{(r)}\right)\) denotes the loss function, \(n\) stands for the number of training samples, \({g}_{r}\) represents the tree structure, and \(\Omega ({g}_{r})\) is the regularization term calculated as below70:

$$\Omega \left({g}_{r}\right)=\gamma T+\frac{1}{2}\lambda \sum_{j=1}^{T}{\omega }_{j}^{2}$$
(5)

where ω denotes the leaf weights, T is the total number of leaves, and γ and λ are constant coefficients with default values of 0 and 1, respectively.
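As a quick numeric check of Eq. (5), the regularization term can be computed directly; the leaf weights below are arbitrary illustrative values:

```python
def xgb_regularization(leaf_weights, gamma=0.0, lam=1.0):
    """Eq. (5): Omega(g) = gamma * T + 0.5 * lambda * sum(w_j^2)."""
    T = len(leaf_weights)
    return gamma * T + 0.5 * lam * sum(w * w for w in leaf_weights)
```

With the default γ = 0 and λ = 1, a tree with leaf weights (2.0, −1.0, 0.5) contributes 0.5 × (4 + 1 + 0.25) = 2.625 to the objective; a nonzero γ additionally penalizes each extra leaf.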

RF

RF is an ensemble machine learning technique introduced in 200174, recognized for its interpretability, ease of use, and rapid computational efficiency. This robust algorithm is adept at addressing various tasks, including unsupervised learning, regression, and classification75,76. The RF methodology is built upon a collection of DTs, where each tree functions as a relatively straightforward model characterized by root nodes, split nodes, and leaf nodes. In this context, each DT is treated as an individual model output, which is then synthesized to produce a comprehensive new model77. The inherent randomness in the selection of nodes is fundamental to the RF approach. In other words, this technique integrates multiple individual learners to create a cohesive model78,79. This paradigm employs a distinctive sampling technique known as bootstrap sampling to enhance the diversity of the selected samples. This method generates two types of data: out-of-bag (OOB) data and in-bag data80. OOB data consists of the roughly one-third of the original sample that is excluded from the bag, while in-bag data pertains to the portion of the sample that remains within the bag (see Fig. 4)81. RF operates not on a single DT but aggregates predictions from multiple individual trees, determining the final response by majority vote among them.
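The bootstrap split described above is easy to reproduce: with n samples drawn with replacement, the expected OOB fraction is (1 − 1/n)^n ≈ 36.8%, which the "one-third" figure approximates. A sketch with a hypothetical sample size:

```python
import random

def bootstrap_split(n, seed=0):
    """Draw n samples with replacement; untouched indices form the OOB set."""
    rng = random.Random(seed)
    in_bag = [rng.randrange(n) for _ in range(n)]
    oob = sorted(set(range(n)) - set(in_bag))
    return in_bag, oob

in_bag, oob = bootstrap_split(10_000, seed=42)
oob_fraction = len(oob) / 10_000   # expected (1 - 1/n)^n, about 0.368
```

Each tree in the forest gets its own bootstrap draw, and its OOB indices provide an internal validation set at no extra data cost.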

Fig. 4
figure 4

Flowchart of a RF technique.

When Dt is the training data for tree ht, the training dataset of the model is expressed as \(D = \left\{ {\left( {x_{1} ,y_{1} } \right),\left( {x_{2} ,y_{2} } \right),...,\left( {x_{n} ,y_{n} } \right)} \right\}\). Moreover, the following equation calculates the OOB prediction for a sample x74:

$${H}^{OOB}(x)=\underset{y}{argmax}\sum_{t=1}^{T}I\left({h}_{t}\left(x\right)=y\right)$$
(6)

also, the error of the OOB dataset is measured as follows:

$${\varepsilon }^{OOB}(x)=\frac{1}{\left|D\right|}{\sum }_{x,y\epsilon D}I({H}^{OOB}(x)\ne y)$$
(7)

The randomness of the RF model is controlled by the parameter q, commonly set as q = log2 d, where d is the number of features. The feature significance of the variable Xi is calculated as below74:

$$I\left({X}_{i}\right)=\frac{1}{B}\sum_{t}^{B}\widetilde{OOB}{err}_{{t}^{i}}-OOB{err}_{t}$$
(8)

in which B is the number of trees, \({X}_{i}\) stands for the ith component of the vector X, \(\widetilde{OOB}{err}_{{t}^{i}}\) denotes the estimation error on the OOB samples of tree t after permuting \({X}_{i}\), and \(OOB{err}_{t}\) is the estimation error on the original (unpermuted) OOB samples of tree t.
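The permutation-importance idea behind Eq. (8) can be demonstrated on a toy problem: shuffling an informative feature inflates the prediction error, while shuffling an irrelevant one leaves it unchanged. The linear "model" below is a hypothetical stand-in for a fitted tree, used purely for illustration:

```python
import random

def mse(pred, y):
    return sum((p - t) ** 2 for p, t in zip(pred, y)) / len(y)

def permutation_importance(model, X, y, col, seed=0):
    """Error increase after shuffling one feature column (idea of Eq. 8)."""
    rng = random.Random(seed)
    base = mse([model(row) for row in X], y)
    shuffled = [row[col] for row in X]
    rng.shuffle(shuffled)
    X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
    return mse([model(row) for row in X_perm], y) - base

# Toy data: only feature 0 drives the target; the "model" is a stand-in
rng = random.Random(1)
X = [[rng.random(), rng.random()] for _ in range(500)]
y = [3 * a for a, _ in X]
model = lambda row: 3 * row[0]
imp_informative = permutation_importance(model, X, y, col=0)
imp_irrelevant = permutation_importance(model, X, y, col=1)
```

In a real RF, the same comparison is made per tree on its OOB samples only, and the differences are averaged over all trees as in Eq. (8).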

LightGBM

LightGBM represents an innovative GB framework that employs tree-based algorithms characterized by vertical or leaf-wise growth, in contrast to traditional algorithms that typically exhibit horizontal or level-wise growth82,83. This framework prioritizes the expansion of leaves associated with significant loss, achieving a greater reduction in loss than traditional level-wise algorithms84. Figure 5 illustrates the distinctions between level-wise and leaf-wise tree growth more effectively.

Fig. 5
figure 5

Schematic illustration of a LightGBM.

LightGBM implements three key strategies to ensure rapid, efficient, and accurate model training59. First, it adopts a leaf-wise growth strategy for constructing DTs. To enhance training efficiency and minimize the risk of overfitting, LightGBM imposes constraints on the tree’s depth and the minimum number of data points required for each leaf node. The use of histogram-based methods contributes to reducing loss, accelerating training, and minimizing memory consumption85. Second, LightGBM utilizes the gradient-based one-side sampling (GOSS) technique for splitting internal nodes, informed by variance gain. This method discards a large share of instances with small gradients before computing the information gain, retaining the more informative large-gradient instances86. It is noteworthy that the histogram-based approach is computationally more intensive than GOSS. Lastly, LightGBM incorporates exclusive feature bundling to decrease the dimensionality of the input features, thereby expediting the training stage while maintaining precision87.
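The GOSS step described above (keep the top-a fraction of instances by gradient magnitude, subsample a fraction b of the rest, and reweight the latter by (1 − a)/b so gain estimates stay unbiased) can be sketched as follows; the gradient values are hypothetical:

```python
import random

def goss_sample(gradients, a=0.2, b=0.1, seed=0):
    """GOSS: keep top-a by |gradient|, subsample b of the rest, reweight."""
    rng = random.Random(seed)
    n = len(gradients)
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    top = order[:int(a * n)]                  # always keep large gradients
    rest = order[int(a * n):]
    sampled = rng.sample(rest, int(b * n))    # subsample small gradients
    weights = {i: 1.0 for i in top}
    # Reweight sampled small-gradient instances to keep gain estimates unbiased
    weights.update({i: (1 - a) / b for i in sampled})
    return weights

# Hypothetical gradients, increasing in magnitude with index
grads = [0.01 * i for i in range(100)]
w = goss_sample(grads, a=0.2, b=0.1)
```

With a = 0.2 and b = 0.1 on 100 instances, only 30 instances survive per split evaluation, and each retained small-gradient instance carries a weight of (1 − 0.2)/0.1 = 8.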

The following formula shows the training subset of the LightGBM paradigm88:

$$X={\left\{\left({x}_{j},{y}_{j}\right)\right\}}_{j=1}^{N}$$
(9)

Then, the estimator \(\widehat{f}(x)\) is obtained by minimizing the expected loss \(L\):

$$L\left(y,f\left(x\right)\right): \widehat{f}\left(x\right)=argmin {E}_{y,x}.L\left(y,f\left(x\right)\right)$$
(10)

Finally, the training process of each tree can be presented as below88:

$${W}_{q(x)} , q \in \left\{\text{1,2},3,\dots ,N\right\}$$
(11)

In the equation above, W stands for the weight of each leaf node, q denotes the decision rules mapping a sample to a leaf in a single tree, and N indicates the number of leaves in a tree. Applying Newton’s method to minimize the objective function, the final training result of each step is computed as:

$${G}_{t}\cong \sum_{i=1}^{N}L[{y}_{i}, {F}_{t-1}\left({x}_{i}\right)+{f}_{t}({x}_{i})]$$
(12)

Results and discussion

Developed models

For training the intelligent algorithms, various train/test ratios, such as 70/30, 75/25, 80/20, 85/15, and 90/10, were evaluated to obtain the optimum proportion. As a consequence, the dataset utilized in this work was randomly partitioned into training and testing subsets using an 80/20 split, which proved the best ratio. Moreover, to achieve robust model validation, K-fold cross-validation was applied to the training data to prevent overfitting and to make reliable predictions. After testing various K values, an optimum value of 10 was obtained. Hence, the training data are divided into ten parts, with each part used as a validation set once, ensuring comprehensive model evaluation89. Additionally, grid search was applied to fine-tune the hyperparameters, selecting values grounded in both theoretical insights and practical relevance. The optimal hyperparameter values for all models are presented in Table 3.
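The 10-fold partitioning described above amounts to splitting the training indices into ten disjoint validation folds; a minimal sketch, assuming the 80% training share of the 3075 points (2460 samples):

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Shuffle indices 0..n-1 once and split them into k validation folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

# Assumed training-set size: 80% of the 3075 data points = 2460 samples
folds = k_fold_indices(2460, k=10)
```

Each fold serves as the validation set exactly once, while the remaining nine folds are used for training, so every training sample is validated on precisely one round.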

Table 3 The tuned hyperparameters obtained for all models.

Evaluation of developed models

The performance of the developed models was assessed using several statistical indicators: the standard deviation (SD), mean absolute relative error (MARE, %), mean relative error (MRE, %), root mean square error (RMSE), and the coefficient of determination (R2). These statistical metrics are defined as follows90:

$$RMSE = \sqrt {\frac{1}{N}\sum\limits_{i = 1}^{N} {\left( {Y_{i,\exp } - Y_{i,pred} } \right)}^{2} }$$
(13)
$$SD = \sqrt {\frac{1}{N - 1}\sum\limits_{i = 1}^{N} {\left( {\frac{{Y_{i,\exp } - Y_{i,pred} }}{{Y_{i,\exp } }}} \right)}^{2} }$$
(14)
$$R^{2} = 1 - \frac{{\sum\nolimits_{i = 1}^{N} {(Y_{i,\exp } - Y_{i,pred} )^{2} } }}{{\sum\nolimits_{i = 1}^{N} {(Y_{i,\exp } - \overline{{Y_{\exp } }} )^{2} } }}$$
(15)
$$MRE = \frac{100}{N}\sum\limits_{i = 1}^{N} {\left( {\frac{{Y_{i,\exp } - Y_{i,pred} }}{{Y_{i,\exp } }}} \right)}$$
(16)
$$MARE = \frac{100}{N}\sum\limits_{i = 1}^{N} {\left| {\frac{{Y_{i,\exp } - Y_{i,pred} }}{{Y_{i,\exp } }}} \right|}$$
(17)

here, Yi,exp stands for the experimental crude oil residue in the oxidation process, Yi,pred denotes the residue predicted by the models, and N is the number of data points.
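The metrics of Eqs. (13)–(17) are straightforward to implement; the sketch below evaluates them on a few hypothetical experimental/predicted residue pairs:

```python
import math

def rmse(exp, pred):
    """Eq. (13): root mean square error."""
    return math.sqrt(sum((e - p) ** 2 for e, p in zip(exp, pred)) / len(exp))

def sd(exp, pred):
    """Eq. (14): standard deviation of relative errors."""
    n = len(exp)
    return math.sqrt(sum(((e - p) / e) ** 2 for e, p in zip(exp, pred)) / (n - 1))

def r2(exp, pred):
    """Eq. (15): coefficient of determination."""
    mean_e = sum(exp) / len(exp)
    ss_res = sum((e - p) ** 2 for e, p in zip(exp, pred))
    ss_tot = sum((e - mean_e) ** 2 for e in exp)
    return 1 - ss_res / ss_tot

def mre(exp, pred):
    """Eq. (16): mean relative error, %."""
    return 100 / len(exp) * sum((e - p) / e for e, p in zip(exp, pred))

def mare(exp, pred):
    """Eq. (17): mean absolute relative error, %."""
    return 100 / len(exp) * sum(abs((e - p) / e) for e, p in zip(exp, pred))

# Hypothetical experimental vs. predicted residue values (wt%)
y_exp = [100.0, 50.0, 25.0]
y_pred = [90.0, 55.0, 25.0]
```

Note that MRE allows positive and negative deviations to cancel (here the +10% and −10% relative errors cancel exactly), which is why MARE is the primary ranking criterion.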

Table 4 presents the computed values of each statistical parameter for the modeling processes. Analysis of the statistical criteria in Table 4 yields an accuracy ranking of the models, with CatBoost, XGBoost, RF, and LightGBM in descending order based on MARE values. Notably, the CatBoost model achieved the most accurate predictions for crude oil residue during oxidation, with MARE values of 4.95% across the entire database, 5.92% for the testing set, and 4.71% for the training set. Moreover, an R2 value of 0.9993 highlights the model’s strong predictive capability, supported by its minimal SD, RMSE, and MRE values compared to the other models, underscoring its robustness in prediction.

Table 4 Statistical criteria calculated for all models.

Next, Table 5 shows the comparison of the proposed model in this work with existing models in the literature for crude oil TGA estimation. Most studies in the literature have primarily focused on modeling the pyrolysis behavior of crude oils, with comparatively fewer works addressing oxidation. Additionally, the number of data points and crude oil samples considered in these studies has generally been limited. However, this work significantly expands the data size and number of crude oil samples, achieving high prediction accuracy despite the increased dataset size.

Table 5 Assessment of developed and literature models for predicting TGA behavior of crude oils.

To conduct a deeper evaluation of the suggested models’ accuracy, a graphical analysis was conducted by plotting predicted crude oil residue values versus the corresponding laboratory values, as illustrated in Fig. 6. This figure reveals a dense clustering of points near the Y = X line for all models. Nonetheless, the CatBoost model displays superior performance, exhibiting greater alignment between experimental and predicted values, which underscores its reliability for forecasting crude oil oxidation.

Fig. 6
figure 6

Cross plots of the proposed models.

Subsequently, the error distribution of the models is evaluated visually, based on the premise that a model demonstrating lower error dispersion around the zero-error line in the plots indicates higher reliability. Figure 7 illustrates that the CatBoost model exhibits less error spread near the zero-error line in comparison with other models. Additionally, the absence of any discernible error trends in both the training and testing datasets highlights the accuracy and robustness of the proposed models.

Fig. 7
figure 7

Error distribution plots of the suggested models.

To complement the previous analyses, a cumulative frequency plot over the entire dataset was created for the suggested models, as shown in Fig. 8. The absolute relative error (ARE, %) was calculated using the formula below:

$$ARE = \left| {\frac{{Y_{i,e} - Y_{i,p} }}{{Y_{i,e} }}} \right| \times 100\quad i \, = 1,2,3,...,n$$
(18)
Fig. 8
figure 8

Cumulative frequency plot of absolute relative errors for all models.

Figure 8 reveals that over 70% and 90% of the crude oil residue estimates made by the CatBoost model had errors below 2.5% and 10.5%, respectively. In comparison, the RF, XGBoost, and LightGBM models showed errors of 12.43%, 16.75%, and 18.13% for 90% of the data, respectively, underscoring the high reliability of the CatBoost model in predicting crude oil oxidation.
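The cumulative-frequency reading of Fig. 8 corresponds to computing, for each error threshold, the fraction of AREs that fall below it; a minimal sketch with hypothetical experimental/predicted values:

```python
def are(exp, pred):
    """Absolute relative error (Eq. 18), in percent, per data point."""
    return [abs((e - p) / e) * 100 for e, p in zip(exp, pred)]

def cumulative_coverage(are_values, threshold):
    """Fraction of predictions whose ARE is below a given threshold (%)."""
    return sum(1 for a in are_values if a < threshold) / len(are_values)

# Hypothetical experimental vs. predicted residue values (wt%)
errs = are([100.0, 80.0, 60.0, 40.0], [99.0, 82.0, 66.0, 20.0])
```

Sweeping the threshold from 0 upward and plotting the coverage traces out exactly the curves shown in Fig. 8.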

Trend analysis

Finally, CatBoost, identified as the best-performing model in this work, is evaluated by comparing its predictions of the thermo-oxidative behavior trends of crude oils with laboratory data. First, the impact of the heating rate on crude oil #9, with an API gravity of 16.4, is assessed and illustrated in Fig. 9. Evidently, increasing the heating rate shifts the reaction regions toward higher temperatures due to thermal lag, resulting in reduced mass loss at elevated heating rates49,53. Literature findings confirm that at higher heating rates, the peak, burnout, and ignition temperatures move to higher values51. In this context, the CatBoost model accurately captures the rate of mass loss of crude oil across various heating rates, demonstrating its robustness in predicting the thermo-oxidative behavior trends of crude oil.

Fig. 9
figure 9

Comparison of experimental data45 with CatBoost model predictions for TG curves of a heavy oil at various heating rates.

Figure 10 presents the TG curves for two crude oils: light oil (#17) and heavy oil (#18), plotted against temperature at a constant heating rate of 10 °C/min. Due to their distinct compositions and properties, the TG profiles of these oils vary significantly. Generally, heavier crude oils, which contain higher asphaltene and resin contents, tend to leave more residual mass54,91. The TG analysis reveals distinct mass loss patterns for light and heavy crude oils, as indicated in the literature53. Light oil undergoes a considerable mass loss of 85% in the LTO stage, almost double that of heavy oil (48.9%), while heavy oil experiences greater mass loss during the fuel deposition and HTO stages, driven by its higher asphaltene content and complex intermolecular interactions53. Again, the CatBoost model effectively captures the mass loss rates of both crude oils, demonstrating its strong predictive accuracy for oxidation behavior of crude oils.

Fig. 10
figure 10

Comparison of experimental data53 with CatBoost model predictions for TG curves of heavy and light oils at a heating rate of 10 ˚C/min.

Sensitivity analysis

In this study, sensitivity analysis based on the correlation coefficients92,93 is utilized to quantify the impact of input variables on the CatBoost model’s predictions. A higher correlation coefficient calculated between any input parameter and the output indicates a more substantial impact of that parameter on crude oil residue during oxidation. The Pearson correlation coefficient measures the strength and direction of the linear relationship between two continuous variables. Here, the Pearson correlation coefficients are determined by applying the following equation38,94:

$$r(inp_{i} ,Y) = \frac{{\sum\limits_{j = 1}^{n} {\left( {inp_{i,j} - inp_{a,i} } \right)\left( {Y_{j} - Y_{a} } \right)} }}{{\left( {\sum\limits_{j = 1}^{n} {\left( {inp_{i,j} - inp_{a,i} } \right)^{2} \sum\limits_{j = 1}^{n} {\left( {Y_{j} - Y_{a} } \right)^{2} } } } \right)^{0.5} }}$$
(19)

here, inpi,j shows the jth value of the ith input variable and inpa,i is the average of the ith input variable. Also, i can be any of the inputs considered in the modeling. Finally, the average value and the jth value of predicted crude oil residue are shown by Ya and Yj, respectively.

Moreover, the Spearman correlation coefficient is a rank-based measure that captures the strength and direction of monotonic relationships between two variables. It is resilient to outliers and can reveal nonlinear associations often missed by traditional linear methods. It is calculated using the following formula93:

$$\rho \left(x,y\right)=\frac{\frac{1}{n}\sum_{i=1}^{n}(R({x}_{i})-{R}_{m}(x))(R({y}_{i})-{R}_{m}(y)) }{{\left((\frac{1}{n}\sum_{i=1}^{n}{\left(R({x}_{i})-{R}_{m}(x)\right)}^{2})(\frac{1}{n}\sum_{i=1}^{n}{\left(R({y}_{i})-{R}_{m}(y)\right)}^{2})\right)}^{0.5}}$$
(20)

here, ρ represents the Spearman rank correlation coefficient, while n is the number of observations. The terms R(x) and Rm(x) denote the rank and mean rank of the x variable, respectively, with similar definitions for R(y) and Rm(y).
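Both coefficients are simple to compute from Eqs. (19) and (20); the sketch below (hypothetical temperature/residue pairs, no tie handling in the rank function) shows how a strictly monotone but nonlinear decline yields a Spearman coefficient of exactly −1 while the Pearson coefficient remains between −1 and −0.95:

```python
def pearson(x, y):
    """Pearson correlation (Eq. 19)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def ranks(values):
    """Ranks 1..n by value; ties are not averaged in this minimal sketch."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = float(rank)
    return r

def spearman(x, y):
    """Spearman correlation (Eq. 20) = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical TGA-like data: residue declines monotonically with temperature
temps = [100.0, 200.0, 300.0, 400.0, 500.0]
residue = [95.0, 80.0, 40.0, 15.0, 5.0]
```

This gap between the two coefficients is exactly the signature used in the analysis below: a larger Spearman magnitude indicates a monotone but nonlinear dependence.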

The sensitivity analysis presented in Fig. 11 highlights the influence of the input parameters on residue formation in crude oils, using both Pearson and Spearman correlation coefficients. Temperature stands out as the most impactful factor, showing a strong negative correlation with residue formation (Pearson −0.76, Spearman −0.93), indicating a substantial reduction in residual content as temperature increases. The gap between the two coefficients suggests that temperature's effect involves both linear and nonlinear components, with the latter being more pronounced. API gravity also exhibits a negative effect (Pearson −0.19, Spearman −0.18), suggesting that lighter oils produce less residue. In contrast, asphaltene and resin contents show positive correlations, with Pearson coefficients of 0.16 and 0.13 and slightly higher Spearman coefficients of 0.17 and 0.16, respectively, implying that higher asphaltene and resin levels contribute to increased residue formation; the slightly higher Spearman values point to nonlinear relationships that this rank-based method captures better. The heating rate has only a weak positive influence, with Pearson and Spearman coefficients of 0.032 and 0.026, respectively. Overall, temperature has the most substantial impact, followed in descending order by API gravity, asphaltene, resin, and heating rate, emphasizing the critical roles of crude oil composition and operating conditions in residue formation. Crude oil type, together with its asphaltene and resin content, therefore plays a key role in coke deposition, as the sensitivity analysis confirms. For the ISC process to remain sustainable, managing fuel consumption and residue formation is essential. Thus, precisely calculating the amount of air injected to ignite the coke or residue within the porous medium is vital to enhance the heat generated during high-temperature oxidation.

Fig. 11. The influence of inputs on residue formation of crude oils in TGA.

Leverage technique

The leverage technique95,96,97 is useful for identifying outliers and potentially anomalous data points that deviate from the main trends of the dataset. Additionally, this technique plays an important role in evaluating the reliability and accuracy of the database used for modeling. Standardized residuals (SR) are derived from the differences between the model's estimates and the actual laboratory measurements. In this calculation, ei is the error for the ith data point, MSE is the mean square error, and Hi denotes the leverage value of the ith observation98:

$${SR}_{i}=\frac{{e}_{i}}{{[MSE\left(1-{H}_{i}\right)]}^{0.5}}$$
(21)

Furthermore, in this approach, leverage values, represented by the diagonal elements of the hat matrix, are calculated based on the structure of the hat matrix, as outlined in the following99:

$$H=X ( {X}^{T} X {)}^{-1} {X}^{T}$$
(22)

in this context, X represents an n × i matrix that includes n data points and i input parameters, while T stands for the transpose of the matrix. Furthermore, the critical leverage (H*) is computed as follows97:

$${H}^{*}= \frac{3\times (i+1)}{n}$$
(23)

Here, data with SR within the interval [− 3, + 3] and H ≤ H* are regarded as “valid”. The model is statistically sound if the majority of data satisfy these conditions. Points with H > H* and − 3 ≤ SR ≤ 3 are identified as “out-of-leverage” points, meaning they lie outside the applicability domain yet remain accurately predicted. In contrast, points with SR values beyond [− 3, + 3] are labeled as “suspected” points, signifying potential experimental uncertainty and placing them outside the model’s applicability domain97,98. As illustrated in the Williams plot in Fig. 12, only 2.14% of the empirical data (i.e., 66 data points) were identified as suspected, and no out-of-leverage points were detected. This is a minimal fraction of the dataset, indicating that the vast majority of data points were classified as valid. These findings confirm that the CatBoost model proposed in this work is highly reliable.
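The leverage workflow of Eqs. (21)–(23) can be sketched in a few lines of NumPy. This is a minimal illustration using a synthetic least-squares model in place of the CatBoost predictions; the helper name `leverage_diagnostics` and all data are hypothetical:

```python
import numpy as np

def leverage_diagnostics(X, y, y_pred):
    """Williams-plot diagnostics.

    X      : n x i design matrix (n data points, i input parameters)
    y      : measured values; y_pred : model estimates
    Returns standardized residuals (Eq. 21), leverages (diagonal of the
    hat matrix, Eq. 22), and the critical leverage H* (Eq. 23).
    """
    n, i = X.shape
    # Hat matrix H = X (X^T X)^-1 X^T; only its diagonal is needed.
    h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
    e = y - y_pred
    mse = np.mean(e**2)
    sr = e / np.sqrt(mse * (1.0 - h))          # Eq. (21)
    h_star = 3.0 * (i + 1) / n                 # Eq. (23)
    valid = (np.abs(sr) <= 3) & (h <= h_star)
    out_of_leverage = (np.abs(sr) <= 3) & (h > h_star)
    suspected = np.abs(sr) > 3
    return sr, h, h_star, valid, out_of_leverage, suspected

# Synthetic demonstration: 50 observations, 5 input parameters.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=50)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # stand-in predictor
sr, h, h_star, valid, ool, susp = leverage_diagnostics(X, y, X @ beta)
print(h_star)  # 3 * (5 + 1) / 50 = 0.36
```

Plotting `sr` against `h` with horizontal lines at ±3 and a vertical line at H* reproduces the Williams plot of Fig. 12; a useful sanity check is that the leverages sum to the number of model parameters (the trace of the hat matrix).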

Fig. 12. The Williams plot of the total data predicted by the CatBoost model.

Conclusions

In this study, thermo-oxidative profiles and residue formation of crude oils during TGA are modeled using 3075 experimental data points collected from the oxidation of 18 crude oils spanning an API gravity range of 5–42. Four advanced tree-based machine learning algorithms, namely CatBoost, XGBoost, LightGBM, and RF, were utilized for modeling. The findings of this work support the following conclusions:

  • Ranked in descending order of accuracy, the models were CatBoost, XGBoost, Random Forest, and LightGBM. The CatBoost model demonstrated the highest predictive accuracy for crude oil residue during oxidation, achieving MARE values of 4.95% across the entire database, 5.92% for the testing collection, and 4.71% for the training collection. In addition, an R2 value of 0.9993 underscores the model’s exceptional predictive capability.

  • Temperature strongly and negatively influences residual crude oil content, and API gravity also has a negative impact. In contrast, asphaltene content, resin content, and heating rate correlate positively with residual content. Temperature is the most influential factor, followed by API gravity, asphaltene, resin, and heating rate. Crude oil type and composition notably affect coke deposition, as shown by the sensitivity analysis.

  • The leverage method identified only 2.14% of the data as suspected, with no out-of-leverage points detected, highlighting the high reliability of the CatBoost model suggested in this work.

To ensure the sustainability of the ISC process, effective management of fuel consumption and residue formation is crucial, a task in which the developed CatBoost model demonstrated exceptional capability.