Introduction

With the rapid growth of energy demand and the sharp decline of conventional petroleum resources, unconventional resources, e.g., tight oil and gas, shale oil and gas, and heavy oil, are gradually coming into focus1,2,3. Among these, shale oil is one of the most important unconventional resources and occurs widely in sedimentary basins4,5,6. Shale is characterized by abundant organic matter and acts as the source rock for conventional petroleum accumulations7,8,9. Shale plays have been found around the globe, including in Latin America, the Middle East, Russia and China3,10. Total organic carbon (TOC) content is a measure of the organic richness, and hence the hydrocarbon potential, of the studied strata and is generally used to assess the quality of a shale play. TOC content is related to the hydrocarbon saturation and porosity of shale. Therefore, TOC is considered one of the key parameters for shale oil assessment. TOC data are generally obtained by laboratory measurement of rock samples11, in which the organic carbon is combusted and converted to carbon dioxide. However, TOC datasets are limited by the time and cost of experiments and by the number of available core samples, and therefore cannot support a fine-scale evaluation of shale oil.

To address this problem, numerous authors have established relationships between TOC and well logging parameters, taking advantage of the vertically continuous nature of well logging data12,13,14. For example, Schmoker and Hester (1983) showed that the TOC content of sediments has a positive linear correlation with the corresponding density13. Passey et al.12 showed that ΔlogR, obtained from the porosity and resistivity logs relative to their baselines, can be used to calculate the TOC content. Before using this approach, it is essential to determine the baselines for the sonic and resistivity logs, defined as the interval transit time and resistivity of non-hydrocarbon-bearing beds (sandstones), which are likely to vary between strata and geological settings. In addition, this method must be calibrated when it is used in unconventional plays. Moreover, in a complex geological setting, the method is inapplicable because of rapid changes in lithology. A 3D surface fitting method has also been developed for TOC content estimation15. The above methods employ limited logging data, which leads to inaccurate results. With the development of artificial intelligence and big data technology, many machine learning models have been applied to source rock organic richness evaluation16,17,18,19,20,21, thermal maturity prediction22,23, kerogen type identification24,25,26, reservoir pore-pressure estimation27,28, reservoir oil saturation estimation29, oil and gas distribution prediction27,30,31, lithology identification32,33,34,35, and reservoir porosity and permeability estimation36,37. Various machine learning methods have been used to predict TOC content by combining laboratory measurements with well logging data38,39,40,41,42. For instance, neural networks, e.g., artificial neural networks and back-propagation (BP) neural networks, have been widely employed to estimate TOC values because of their flexibility in handling complex, nonlinear problems43,44. Support vector regression is well suited to small-sample problems and has been applied to TOC content prediction under small-sample conditions45,46. Gaussian process regression, a kernel-based probabilistic model, has been successfully used to estimate TOC content owing to its ability to deal with complex problems (e.g., high dimensions, small samples, and non-linearity)47,48,49. The Xgboost method was employed to predict the TOC of source rocks from the Espírito Santo Basin, SE Brazil50. However, those methods have mostly been applied to conventional clastic and carbonate reservoirs deposited under stable depositional conditions. Less work has been done on shale reservoirs composed of a mixture of siliceous and carbonate rocks deposited in a tectonically complex setting.

In this work, a sample set from the Hashan area, Junggar Basin, provides an excellent opportunity to relate the TOC content of shale reservoirs to well logging data using machine learning methods. The resulting understanding can then be extended to shale reservoirs deposited in similar structural and sedimentary settings.

Geological settings

The Junggar Basin is one of the most petroliferous basins in northwestern China51. It can be divided into the Southern Piedmont Thrust Zone, Central Depression, Western Uplift, Eastern Uplift, Luliang Uplift and Wulungu Depression from south to north52. The Hashan area, located on the northwestern margin of the Junggar Basin, is adjacent to the Hala’alate Mountains (Fig. 1a). This area has undergone multistage tectonic movements and is mainly influenced by compressional tectonic regimes53,54, which resulted in the formation of many large-scale thrust nappe structures55,56 (Fig. 1b).

Fig. 1

(a) Location of the study area in the Junggar Basin, (b) structural overview of the Hashan area, and (c) transverse section in the study area showing the strata distribution and lithology (location of section shown in b).

The stratigraphic framework comprises a succession of deposits from the Carboniferous to the Quaternary57. Overlying the Carboniferous basement are the Permian deposits, including the Jiamuhe, Fengcheng, Xiazijie, Lower Wuerhe and Upper Wuerhe Formations. The Triassic strata contain the Baikouquan, Kelamayi and Baijiantan Formations (Fig. 2). The Fengcheng Formation (P1f) consists of organic-rich shale, mudstone and dolomitic mudstone and is hydrocarbon-rich throughout58,59, making it an important target for shale oil exploration60,61. A variety of lithologies, including mudstone, dolomitic mudstone, silty fine sandstone, muddy siltstone, dolomitic siltstone and siltstone, were identified from careful core observations62,63,64,65. Moreover, very frequent lithological variations at the meter to centimeter scale are also present in the well profiles62,65.

Fig. 2

Stratigraphic chart for the Hashan area in the Junggar Basin.

Samples and methods

Samples

A total of 82 core samples of the Fengcheng Formation were collected from five wells, i.e., HQ101, HQ6, HS1, HS2 and HSX1, in the Hashan area. The locations of these wells are shown in Fig. 1a. Basic information on the core samples, including sampling depth and main lithology, is listed in Table S1 (Supplementary material).

TOC measurement

The selected rock samples were ground into powder with particle diameters of less than 0.2 mm. Then, 100 mg of each powdered sample was washed with diluted HCl and then with deionized water to remove inorganic carbon and other contaminants. All samples were dried in an oven at 80 °C for one day. The samples were then heated to 900 °C in a LECO CS-230 analyzer so that the organic carbon was fully combusted and converted into carbon dioxide.

Model data

In this study, the TOC data of the core samples, together with the corresponding log data, were randomly subdivided into a training subset and a testing subset, the former comprising 70% of the data and the latter the remaining 30%. Before being input to the model, all data were normalized to the range 0–1 using min–max normalization in order to remove the effect of the different units of the different logs.
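The split and scaling described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the sample values are hypothetical, the function names are invented here, and the scaling bounds are taken from the training subset alone, which is one common choice rather than a procedure confirmed by this study.

```python
import random


def train_test_split(rows, train_fraction=0.7, seed=0):
    """Randomly split rows into a training and a testing subset."""
    shuffled = list(rows)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


def fit_min_max(rows):
    """Return one (minimum, maximum) pair per column."""
    return [(min(col), max(col)) for col in zip(*rows)]


def apply_min_max(rows, bounds):
    """Rescale every column with x' = (x - min) / (max - min)."""
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, (lo, hi) in zip(row, bounds)]
        for row in rows
    ]


# Hypothetical readings, one row per sample: GR, RT, AC, DEN, CNL.
samples = [
    [85.0, 12.0, 70.0, 2.55, 18.0],
    [92.0, 30.0, 75.0, 2.48, 22.0],
    [60.0, 8.0, 62.0, 2.62, 12.0],
    [110.0, 45.0, 80.0, 2.40, 26.0],
    [75.0, 15.0, 68.0, 2.58, 16.0],
    [98.0, 25.0, 77.0, 2.45, 24.0],
    [66.0, 10.0, 64.0, 2.60, 14.0],
    [105.0, 40.0, 79.0, 2.42, 25.0],
    [80.0, 18.0, 71.0, 2.53, 19.0],
    [70.0, 11.0, 66.0, 2.59, 15.0],
]

train, test = train_test_split(samples)       # 70% / 30% random split
bounds = fit_min_max(train)                   # assumption: bounds from training rows only
train_scaled = apply_min_max(train, bounds)   # every training column now spans 0-1
test_scaled = apply_min_max(test, bounds)     # test rows reuse the training bounds
```

With bounds taken from the training subset, test values can fall slightly outside 0–1; scaling the whole dataset at once avoids this but lets information from the test subset influence the preprocessing.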

The relationships between measured TOC and well log data, including GR (gamma ray), RT (resistivity), AC (acoustic), DEN (density) and CNL (compensated neutron) logs, have been documented in previous reports13,18,19,20,21,66. Therefore, these logs were selected as inputs for TOC estimation in this study. The GR log measures the natural radioactivity of formations; organic-rich strata are characterized by relatively high radioactivity and hence high GR values. The DEN tool measures the bulk formation density, which is affected by the matrix and fluid compositions. Because organic matter has a low density, its presence decreases the bulk density of source beds. The neutron log responds to the concentration of hydrogen atoms in rocks. The hydrogen content and porosity of sedimentary rocks are related to their organic matter content; consequently, organic-rich intervals have a high neutron porosity. The RT log reflects the conductivity of fluids in the strata; mature source rocks show enhanced resistivity because of the abundant hydrocarbons generated. The AC log, like the CNL log, also responds to hydrogen content and porosity.

The workflow consists of several key steps (Fig. 3). First, the depths of the core samples are corrected by core observation and well log analysis. Next, the relationship between TOC values and the corresponding well log data is examined by statistical and visualization methods to remove abnormal data points. Then, the dataset is subdivided into training and testing subsets. After that, the well log data are reduced from five dimensions to two by principal component analysis (PCA). Finally, the model is trained, saved and applied. Table 1 lists the main parameters of the Xgboost model. Among these, the parameter “min_child_weight” is equivalent to the “minimum leaf size”.
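The dimension-reduction step can be illustrated with a small, dependency-free sketch of PCA. It assumes the five logs have already been scaled to 0–1 and finds the two leading principal components by power iteration with deflation. The input values and function names are hypothetical, and the study does not state which PCA implementation was used; a library routine would normally be preferred in practice.

```python
import math


def mean_center(rows):
    """Subtract the column mean from every value."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]


def covariance(centered):
    """Sample covariance matrix of mean-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def leading_eigenpair(matrix, iterations=500):
    """Largest eigenvalue and unit eigenvector by power iteration."""
    d = len(matrix)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    mv = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
    return sum(a * b for a, b in zip(v, mv)), v


def pca(rows, n_components=2):
    """Return the principal axes and the projected (reduced) data."""
    centered = mean_center(rows)
    cov = covariance(centered)
    d = len(cov)
    components = []
    for _ in range(n_components):
        eigenvalue, v = leading_eigenpair(cov)
        components.append(v)
        # Deflation: remove the variance already explained by this axis.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    scores = [[sum(r[k] * c[k] for k in range(d)) for c in components]
              for r in centered]
    return components, scores


# Hypothetical scaled logs, one row per sample: GR, RT, AC, DEN, CNL.
logs = [
    [0.50, 0.11, 0.44, 0.68, 0.43],
    [0.64, 0.59, 0.72, 0.36, 0.71],
    [0.00, 0.00, 0.00, 1.00, 0.00],
    [1.00, 1.00, 1.00, 0.00, 1.00],
    [0.30, 0.19, 0.33, 0.82, 0.29],
    [0.76, 0.46, 0.83, 0.23, 0.86],
    [0.12, 0.05, 0.11, 0.91, 0.14],
    [0.90, 0.86, 0.94, 0.09, 0.93],
]

components, scores = pca(logs, n_components=2)  # scores: 8 rows x 2 columns
```

The two columns of `scores` would then replace the five original logs as model inputs. Because the principal axes are mutually orthogonal, the strong linear relationships among logs such as AC, DEN and CNL no longer appear as correlated inputs.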

Fig. 3

A flowchart showing the general work process in this study.

Table 1 Main parameters of the Xgboost method used to establish the model.

Xgboost algorithm

The Xgboost algorithm, an important ensemble learning algorithm, was first developed by Chen and Guestrin67 and has been applied in various fields because of its ability to deal with nonlinear problems. The main idea of the algorithm is to combine multiple simple tree models, iteratively adding a new tree at each step, which can significantly enhance the accuracy of the model. Training can be regarded as an optimization problem with the following objective function:

$$L\left( \Phi \right) = \Sigma_{i } l\left( {\hat{y}_{i} , y_{i} } \right) + \Sigma_{k} \Omega \left( {f_{k} } \right)$$
(1)

where \(l\left({\widehat{y}}_{i}, {y}_{i}\right)\) is a loss function that measures the difference between the observed values \({y}_{i}\) and the predicted values \({\widehat{y}}_{i}\), and \(\Omega \left({f}_{k}\right)\) is a regularization term that penalizes model complexity to avoid overfitting. The regularization term is expressed as follows:

$$\Omega \left( f \right) = \gamma T + \frac{1}{2}\lambda \left\| \omega \right\|^{2}$$
(2)

where T is the number of leaves in the tree, ω is the vector of leaf weights, and γ and λ are regularization coefficients.

At the t-th iteration, a new tree \(f_{t}\) is added, and the objective function is approximated by a second-order Taylor expansion of the loss function; terms that do not depend on \(f_{t}\) are constants and are omitted:

$$L^{\left( t \right)} = {\Sigma }_{i} l\left( {\hat{y}_{i}^{\left( t \right)} , y_{i} } \right) + {\Sigma }_{i = 1}^{t} \Omega \left( {f_{i} } \right)$$
(3)
$$= {\Sigma }_{i} l\left( {\hat{y}_{i}^{{\left( {t - 1} \right)}} + f_{t} \left( {x_{i} } \right), y_{i} } \right) + \Omega \left( {f_{t} } \right)$$
(4)
$$\simeq {\Sigma }_{i} \left[ {g_{i} f_{t} \left( {x_{i} } \right) + \frac{1}{2}h_{i} f_{t}^{2} \left( {x_{i} } \right)} \right] + \Omega \left( {f_{t} } \right)$$
(5)

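where \(g_{i}\) and \(h_{i}\) are the first- and second-order derivatives of the loss function with respect to the prediction of the previous iteration, \(\hat{y}_{i}^{(t-1)}\).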
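As a worked illustration, assuming a squared-error loss (an assumption made here for concreteness; the loss function used in this study is not stated in the text), the two derivatives reduce to simple expressions:

$$l\left( {\hat{y}_{i} , y_{i} } \right) = \left( {\hat{y}_{i} - y_{i} } \right)^{2} \;\Rightarrow\; g_{i} = 2\left( {\hat{y}_{i}^{{\left( {t - 1} \right)}} - y_{i} } \right),\quad h_{i} = 2$$

In this case \(g_{i}\) is twice the residual left by the previous iteration and \(h_{i}\) is constant, so each new tree is effectively fitted to the current residuals.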

Define \({I}_{j}\) as the set of instances assigned to leaf j. The objective function can then be rewritten as follows:

$$L^{\left( t \right)} = {\Sigma }_{j = 1}^{T} \left[ {{\text{G}}_{j} {\upomega }_{j} + \frac{1}{2} \left( {{\text{H}}_{j} + \lambda } \right){\upomega }^{2}_{j} } \right] + \gamma T$$
(6)

where \({\text{G}}_{j}= {\Sigma }_{i\in {I}_{j}}{g}_{i}\) and \({\text{H}}_{j}= {\Sigma }_{i\in {I}_{j}}{h}_{i}\).

For a fixed tree structure q(x), the optimal weight \({{\upomega }^{*}}_{j}\) of leaf j is given by:

$$\begin{array}{*{20}c} {{\upomega }^{*}_{j} = - \frac{{{\text{G}}_{j} }}{{{\text{H}}_{j} + \lambda }}} \\ \end{array}$$
(7)

and the corresponding optimal value can be calculated by:

$$\begin{array}{*{20}c} {\tilde{L}^{\left( t \right)} = - \frac{1}{2} {\Sigma }_{j = 1}^{T} \left( {\frac{{{\text{G}}^{2}_{j} }}{{{\text{H}}_{j} + \lambda }}} \right) + \gamma T} \\ \end{array}$$
(8)
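Equations (7) and (8) can be checked numerically with a short sketch. The example assumes a squared-error loss, for which the first derivative is twice the residual and the second derivative is 2; the TOC values, the constant previous prediction and the values of λ and γ are hypothetical and are not taken from Table 1.

```python
def leaf_weight(G, H, lam):
    """Optimal leaf weight, Eq. (7): w* = -G / (H + lambda)."""
    return -G / (H + lam)


def structure_score(leaves, lam, gamma):
    """Optimal objective value, Eq. (8), for leaves given as (G_j, H_j) pairs."""
    return -0.5 * sum(G * G / (H + lam) for G, H in leaves) + gamma * len(leaves)


def gradient_sums(y_true, y_pred):
    """Sum g_i and h_i over one leaf, assuming squared-error loss."""
    G = sum(2.0 * (p - y) for y, p in zip(y_true, y_pred))  # g_i = 2 * (pred - y)
    H = 2.0 * len(y_true)                                   # h_i = 2
    return G, H


# Hypothetical TOC values (%) and a constant previous prediction of 2.0 %.
toc = [1.0, 1.2, 3.0, 3.4]
previous = [2.0] * 4
lam, gamma = 1.0, 0.0  # assumed regularization coefficients

# Candidate structures: all samples in one leaf, or low/high TOC in two leaves.
one_leaf = [gradient_sums(toc, previous)]
two_leaves = [gradient_sums(toc[:2], previous[:2]),
              gradient_sums(toc[2:], previous[2:])]

weights = [leaf_weight(G, H, lam) for G, H in two_leaves]
gain = structure_score(one_leaf, lam, gamma) - structure_score(two_leaves, lam, gamma)
```

Here the two leaf weights are about -0.72 and 0.96, which move the previous prediction toward the low and high TOC groups respectively, and the positive `gain` (about 3.52) shows that the two-leaf structure lowers the objective. A larger λ shrinks the weights toward zero, and a larger γ penalizes each additional leaf.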

Results and discussions

Petrological features

Thin-section observations reveal the lithology of the Fengcheng Formation in the study area. The Fengcheng Formation is characterized by carbonate, siltstone and mudstone, and generally comprises either interbedded carbonate and mudstone or interbedded siltstone and mudstone68 (Fig. 4). Figure 5 illustrates the relative contents of calcareous, felsic and clay minerals. The results indicate that the Fengcheng Formation shales in the study area are dominated by brittle minerals, i.e., quartz, feldspar, calcite and dolomite. The relative content of clay minerals ranges from 5.10 to 27.10%, with a mean of 15.30%. Overall, according to the ternary diagram of mineral compositions, the Fengcheng Formation shales are mixed shales69,70.

Fig. 4

Representative photomicrographs showing the lithological characteristics of the Fengcheng Formation in the study area. (a) Well HS1, 2100.30 m, silty mudstone interbedded with calcite; (b) well HS1, 2103.00 m, mudstone interbedded with siltstone; (c) well HQ6, 2702.20 m, micritic dolomite; (d) well HQ6, 2698.50 m, dolomite; (e) well HSX1, 3683.00 m, mudstone interbedded with organic matter; (f) well HSX1, 3302.10 m, dolomite.

Fig. 5

Ternary diagram of the mineral composition of the Fengcheng Formation shales in the Hashan area (modified from Zeng et al.69).

Correlation analysis

Figure 6 shows the relationships between the well log parameters and TOC values. The well log data and the corresponding TOC values of the selected samples all follow a Gaussian distribution. The scatter plots show no significant correlation between the TOC values and any individual well log parameter. TOC content was the only attribute predicted by the Xgboost model. A conventional normalization approach was used to remove the influence of the different units of the well log parameters before the Xgboost prediction model was built.
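The strength of a linear relationship such as those plotted in the cross plots can be quantified with the Pearson correlation coefficient. The sketch below uses the standard definition; the GR and TOC values are hypothetical and only illustrate the calculation, not the results of this study.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical GR readings (API) and measured TOC values (%).
gr = [85.0, 92.0, 60.0, 110.0, 75.0, 98.0]
toc = [0.9, 0.6, 1.1, 1.4, 0.5, 0.8]

r_gr_toc = pearson_r(gr, toc)  # a value near 0 means a weak linear relationship
```

A coefficient close to zero for every individual log is consistent with scatter plots that show no clear trend, which is the situation in which a nonlinear, multi-input model is most useful.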

Fig. 6

Cross plots of TOC and well log parameters.

TOC prediction

As shown in Fig. 7, the R2 between the measured and estimated TOC values is 0.54 using the Xgboost model. Strong correlations between some of the well log parameters lead to poor performance of the model on the test set. For example, AC has a clear negative linear relationship with DEN and a positive linear relationship with CNL, which may lead to an inaccurate prediction model. Therefore, principal component analysis was applied to reduce overfitting16,29. The results show that the coefficient of determination between the measured and estimated TOC values increases when the Xgboost model is combined with principal component analysis. Other evaluation indicators, including the mean squared error (MSE), root mean squared error (RMSE) and mean absolute error (MAE), were calculated to assess the model comprehensively. The MSE, RMSE and MAE of the model with PCA are 0.05, 0.39 and 0.28, respectively, which are clearly lower than those of the model without PCA (0.14, 0.63 and 0.39, respectively). Moreover, the R2 between the measured and estimated TOC values for the test subset is much higher with PCA (0.57) than without PCA (0.12). This indicates that PCA can reduce the influence of overfitting on the model to a certain extent and thereby improve its generalization ability.
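For reference, the four evaluation indicators can be computed as in the sketch below, which uses the standard definitions (the study does not give explicit formulas). The measured and predicted TOC values are hypothetical.

```python
import math


def regression_metrics(measured, predicted):
    """Return MSE, RMSE, MAE and the coefficient of determination R2."""
    n = len(measured)
    errors = [p - m for m, p in zip(measured, predicted)]
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)  # RMSE is the square root of MSE by definition
    mae = sum(abs(e) for e in errors) / n
    mean_measured = sum(measured) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((m - mean_measured) ** 2 for m in measured)
    r2 = 1.0 - ss_res / ss_tot
    return {"MSE": mse, "RMSE": rmse, "MAE": mae, "R2": r2}


# Hypothetical measured and predicted TOC values (%).
measured = [0.5, 1.0, 1.5, 2.0]
predicted = [0.6, 0.9, 1.7, 1.8]

metrics = regression_metrics(measured, predicted)
```

Computing the indicators separately on the training and testing subsets is what exposes overfitting: a model that fits the training data closely but scores a much lower R2 on the test subset has poor generalization ability.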

Fig. 7

Cross plots of measured and estimated TOC values (a) with PCA and (b) without PCA.

Implication for shale oil exploration

Profile distribution features

A visual comparison of the estimated and measured TOC values is given in Figs. 8 and 9. The TOC prediction curves estimate the organic richness of each well in intervals where no samples were taken. The sediments of the second member and of the top of the third member of the Fengcheng Formation have high TOC contents. In well HSX1, the sediments at depths of 3325–3375 m are mainly mudstone, which is characterized by high TOC values. The second member of the Fengcheng Formation is dominated by calcareous dolomite. Volcanic ash is rich in magnesium- and iron-bearing materials, which can fertilize algae and bacteria. Therefore, the sediments of the second member contain abundant organic matter and have high TOC values. In well HS1 in the northern study area, the second and third members were eroded as a result of massive thrusting (Fig. 1b). The organic matter in the first member was well preserved in the thrust nappe, leading to a high TOC content (Fig. 9). Therefore, the first member in the thrust nappe, the second member and the top of the third member, all characterized by higher TOC contents, may be potential shale oil layers in the Fengcheng Formation in the study area.

Fig. 8

Well-logging curves and TOC distribution for the well HSX1. GR: Gamma-Ray logging; SP: Spontaneous Potential logging; CNL: Compensated Neutron logging; DEN: Density logging; AC: Acoustic logging; RT: Resistivity logging.

Fig. 9

Profile of wells HSX1-HQ101-HQ6-HS1 showing the TOC distribution. GR: Gamma-Ray logging; D: Depth; Litho: Lithology; RT: Resistivity logging; Md TOC: Measured TOC; Ed TOC: Estimated TOC.

Plane distribution features

The planar distribution of dark mudstone in the Fengcheng Formation shows that it is mainly distributed in the HS11 well block in the western area and the HQ23 well block in the central area (Fig. 10a). However, according to data from CNPC, the occurrence of Fengcheng Formation shale in the Hashan area is very limited and is restricted to the margin of the Mahu Sag (Fig. 10b)71,72. In areas where no assay data were available, the geophysical logs were used to estimate the TOC content on the basis of the prediction model established above. The TOC contours of the Fengcheng Formation in the study area are illustrated in Fig. 10c. The TOC values are more than 3.00% around well HS11 and more than 2.00% in the HQ23 well block, which is consistent with the distribution of dark mudstone. The Fengcheng shale with TOC > 5.00%, covering about 300 m2, is more than 200 m thick close to wells HS11 and HQ23 (Fig. 10d). This also suggests that the center of the paleolake during deposition of the Early Permian Fengcheng Formation may have been in the well HQ23 zone in the Hashan area and the Fengnan area in the Mahu Sag, where a confined water body was separated from seawater regressing toward the southeast. As a result, these sediments formed in a saline, reducing environment favorable for aquatic organisms, which can contribute a large amount of organic matter to the source rock. Therefore, the well HQ23 and HS1 blocks can be shale oil targets for future exploration in this area. However, strong fault movements resulted in heterogeneity of the shale reservoirs, which causes engineering problems and relatively lower oil production compared with the Mahu Sag.

Fig. 10

Isopleth maps of (a) thickness of dark mudstone of Fengcheng Formation (m) in this work; (b) thickness of dark mudstone of Fengcheng Formation (m) drawn by CNPC; (c) total organic carbon content of Fengcheng Formation (%) and (d) thickness of dark mudstone of Fengcheng Formation with TOC > 1.5% (m).

Comprehensive assessment

Table S1 also lists the lithology of the selected samples. A comparison of the siliceous and carbonate rock data clearly shows that the prediction model performs similarly on both groups of samples. Several authors have demonstrated the relationship between lithology and well log data, and lithology prediction models based on well log data have proved effective. The effects of lithology are therefore already captured implicitly through that relationship. Consequently, the influence of lithology on TOC estimation is minor, owing to the input of multiple petrophysical logs and the strong relationship between those logs and lithology.

As shown in Fig. 6b, the predicted TOC values have a positive linear correlation with the measured TOC values, with a correlation coefficient of 0.68. The model performance is not excellent because the dataset is limited and most of the core samples have relatively low TOC values. Even so, the model response for the samples with high TOC values is clear, indicating that the model can still be used for TOC estimation.

To enhance the reliability of TOC estimation from well log data, the quality of the original data should be sufficient, which was verified by comparing the observed well log with the inspection log. The tolerable errors of the GR, RT, AC, DEN and CNL log data are 5%, 10%, 5%, 5% and 10%, respectively. The depth error between the original observation well and the inspection well is less than ± 20 cm/200 m. Input data with a skewed distribution would decrease the generalization ability of the model and result in misleading MSE values for the training data, which appear to be very low although the true values may be high. Therefore, standardization of the data should be performed.
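The tolerance check on the log data can be sketched as follows. The tolerances are those quoted above; the readings are hypothetical, and defining the error relative to the inspection reading is an assumption made for this illustration, since the text does not specify how the error is normalized.

```python
# Tolerable relative errors quoted in the text, as fractions.
TOLERANCE = {"GR": 0.05, "RT": 0.10, "AC": 0.05, "DEN": 0.05, "CNL": 0.10}


def relative_error(observed, inspection):
    """Relative difference, normalized by the inspection reading (assumed)."""
    return abs(observed - inspection) / abs(inspection)


def check_logs(observed, inspection):
    """Return True for every log whose relative error is within tolerance."""
    return {
        name: relative_error(observed[name], inspection[name]) <= limit
        for name, limit in TOLERANCE.items()
    }


# Hypothetical readings at one depth from the observed and inspection logs.
observed = {"GR": 100.0, "RT": 20.0, "AC": 70.0, "DEN": 2.50, "CNL": 20.0}
inspection = {"GR": 103.0, "RT": 23.0, "AC": 71.0, "DEN": 2.52, "CNL": 21.0}

within_tolerance = check_logs(observed, inspection)
```

In this example only the RT reading exceeds its 10% tolerance, so that interval would be flagged for review before its data are used for training.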

As reviewed in the Introduction, machine learning methods including Xgboost have been widely used to predict TOC contents and other physicochemical properties of subsurface rocks. However, the samples in previous studies were mainly pure shales and/or mudstones with high TOC contents deposited in stable water bodies such as deep-lacustrine and marine settings (e.g., Yu et al.41, Mahmoud et al.44, Rui et al.45), which provide a conceptual framework. In this study, by contrast, the samples are of mixed lithology, with shales and carbonates deposited in a rapidly changing environment and distributed in a tectonically complex geological setting, which poses challenges for depth correction of the core samples and for data processing. Meanwhile, only a small dataset was obtained because of the limited exploration in the study area. In addition, PCA was employed to reduce the overfitting likely caused by the small dataset, an approach that has rarely been reported. Therefore, this work presents a case study for TOC estimation in similar geological settings. The distribution of the Fengcheng Formation shales in this area has been updated on the basis of the predicted data and the constraints of the geological model, which provides more detailed information for further shale oil exploration.

Conclusions

In this paper, the validity of using the Xgboost machine learning algorithm to predict TOC values from wireline logs was tested. The results show that the Xgboost algorithm can handle the nonlinear relationship between TOC values and well log parameters and estimates TOC with acceptable correlation. The correlation coefficient between measured and predicted TOC values is 0.54, which is not excellent, probably because of the small core sample dataset. To address this issue, principal component analysis (PCA) was used to reduce the dimension of the data from five to two. By combining the Xgboost algorithm with PCA, the correlation coefficient between the predicted and measured TOC increases from 0.54 to 0.68, indicating that PCA can be applied to reduce overfitting. The model in this work provides reliable data for shale oil evaluation in the study area and a useful example for similar geological settings. Based on the data and model, the second member of the Fengcheng Formation has considerable potential for shale oil exploration in the Hashan area. The method presented in this study can help to save time and avoids the difficulty of identifying the “baseline” corresponding to the non-source rock interval in a complex lithology of mixed carbonate and siliceous rocks. In the future, more sample data should be included and a combined method should be adopted to solve the overfitting problem.