Introduction

Precise lithofacies predictions are critical for adequate reservoir characterization and management in the petroleum industry. Classifying lithologies is challenging because different rocks have different porosity, permeability, and fluid saturation. Traditional methods for identifying lithology using wireline logs have long been recognized as labor-intensive, often hampered by subjective human interpretation, and reliant on core samples collected from drilled wells1,2. Moreover, the nonlinear relationships observed among wireline logs, together with the diverse lithologies and petrophysical properties within geological formations, add further complexity to manual interpretation techniques3. Nevertheless, recent advancements in computational methods, particularly machine learning algorithms, offer promising avenues for expediting and enhancing the lithology identification process.

In lithology prediction, researchers have traditionally relied on well-log data and empirical relationships to identify subsurface rock types. Wireline data such as gamma-ray, sonic, resistivity, neutron and density logs, are crucial for providing continuous downhole measurements that indicate the chemical composition and physical characteristics of rock layers. Historically, manual interpretation of well logs was the primary method for petrophysical characterization of reservoirs, with expert geologists analyzing log signatures based on established cutoffs and patterns; for example, gamma-ray logs distinguish between sandstones (low values) and shales (high values), as outlined by Asquith and Gibson4, with empirical cutoffs refined in subsequent studies. Advancements were made with rock physics models introduced by Avseth et al.5, linking log responses to the mechanical properties of rocks and enabling lithology prediction through the analysis of rock stiffness (e.g., Young’s modulus) and bulk properties. For instance, plotting Young’s modulus against Poisson’s ratio helps differentiate sand, shale, and intermediate lithologies like sandy shale and shaly sand. In the 1980s and 1990s, the rise of multivariate statistical methods allowed for more sophisticated lithology prediction techniques, including discriminant analysis and principal component analysis (PCA), which group rock types based on multiple log responses. Rider6 pioneered statistical techniques for automating lithology prediction, reducing the subjectivity of manual interpretations and providing a more data-driven approach to classification. Overall, various methods have been developed and refined over the past few decades to improve the accuracy and reliability of lithology prediction.

As industry embraces digital transformation, machine learning remains at the forefront of creative approaches to improving decision-making and operational efficiency in hydrocarbon exploration. These machine learning algorithms have been developed as useful tools for evaluating complex exploration and development data, allowing geoscientists to spot patterns and links that traditional techniques would miss. By processing vast volumes of seismic, well log, and core sample data, machine learning can accurately estimate lithofacies. This predictive capability deepens our understanding of subsurface formations and optimizes exploration and production strategies, reducing risks and costs associated with drilling and reservoir development. Leveraging machine learning facilitates the automated categorization and identification of lithologies, making it feasible to discern complex geological formations and their variations7,8,9,10,11. Notably, non-parametric approaches such as artificial neural networks (ANN), K-nearest neighbors (KNN), decision trees (DT), support vector machines (SVM), logistic regression (LR), and fuzzy logic have gained traction for their utility in lithology identification within petroleum reservoirs12,13,14. These increasingly explored methods represent a departure from traditional manual interpretation, offering potential solutions to the challenges posed by laborious and subjective lithology identification processes in the petroleum industry15,16.

While research in reservoir characterization predominantly focuses on ensemble machine-learning techniques, such as random forest (RF) and extreme gradient boosting, for forecasting porosity based on seismic and well-log characteristics, these methods have also proven effective for lithology prediction17,18,19,20. Notably, machine learning algorithms have become proficient in interpreting lithologies, mainly when applied to large well-log datasets21. This highlights the versatility and utility of machine learning algorithms in addressing various challenges within reservoir characterization, including lithology identification and prediction. By utilizing advanced computational techniques, researchers and industry professionals can achieve enhanced accuracy and efficiency in characterizing reservoirs and optimizing hydrocarbon recovery strategies.

Handling and processing extensive well-log datasets pose significant challenges that must be overcome before utilizing them for lithology prediction22. The application of various machine learning techniques generates a substantial quantity of data, further augmenting the volume available for lithology interpretation. To address the complexities of dataset management, Sircar et al.23 and Saporetti et al.24 explored the use of artificial neural networks (ANN), fuzzy logic, and genetic algorithms (GA). These methodologies offer promising avenues for efficiently handling and extracting valuable insights from large datasets, facilitating more accurate and robust lithology prediction processes.

In the context of lithology prediction in petroleum reservoir characterization, integrating geological descriptions and rock physics models is crucial in enhancing our understanding of reservoir quality25,26,27. By incorporating well-log data and geological characteristics, these models offer accurate insights into lithology distribution within a reservoir. Notably, lithological models are essential for capturing the complex relationships between lithologies and other properties of petroleum reservoirs (e.g., porosity), thus enabling effective lithology prediction28. Moreover, establishing a physical foundation for the relationships between porosity, lithology, and rock physics, including elastic properties, further improves lithology prediction accuracy29. By leveraging these models, both researchers and industry practitioners can deepen their understanding of reservoir characteristics and enhance the accuracy of lithology predictions, thereby facilitating improved reservoir management and more successful hydrocarbon exploration endeavours.

However, it is important to note that studies using seismic attributes operate at a significantly coarser resolution and often aim to characterize broader-scale heterogeneities, whereas well-log-based approaches, such as the present study, deal with high-resolution vertical data at the wellbore scale. As such, direct comparisons between the two must consider differences in data resolution, sampling density, and domain focus. In this context, the present study restricts its evaluation to well-log-based lithofacies classification, with references to seismic-based studies serving only to highlight broader methodological trends in machine learning applications across reservoir characterization.

This study applied six machine learning algorithms, namely SVM, DT, RF, ANN, KNN, and LR, to predict lithologies within the Basal Sand of the Lower Goru Formation in the Lower Indus Basin, Pakistan. While machine learning for lithofacies classification is not new, this study compares multiple algorithms (RF, SVM, ANN, etc.) on the Lower Goru Formation’s well logs, integrating traditional geological methods to enhance prediction accuracy. Additionally, the study explores insights into lithology distribution within the Basal Sand and how these insights can inform reservoir management strategies. By addressing these objectives, the research contributes to advancing machine learning applications in reservoir characterization and provides a roadmap for integrating computational techniques into petroleum exploration workflows.

Theoretical background

This section provides a theoretical overview of the machine learning algorithms utilized in this study, including Support Vector Machine (SVM), Decision Tree (DT), Random Forest (RF), Artificial Neural Network (ANN), K-Nearest Neighbor (KNN), and Logistic Regression (LR). Each algorithm offers unique approaches to handling classification tasks like lithology prediction, and their underlying principles are briefly outlined here.

  1. Support Vector Machine (SVM)

SVM is a supervised learning algorithm primarily used for classification tasks30. The main objective of SVM is to find an optimal hyperplane that separates data points of different classes with the maximum possible margin. The algorithm works well in high-dimensional spaces and remains effective when the relationship between the input data and the class labels is nonlinear. To handle non-linearity, SVM uses kernel functions (such as the radial basis function or polynomial kernel) to map the input data into a higher-dimensional feature space, where a linear separation is possible31. SVM is especially suitable for binary classification problems but can be extended to multi-class tasks through one-vs-one or one-vs-all approaches.
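As a minimal illustration of the kernel idea, the sketch below evaluates the radial basis function kernel, K(x, z) = exp(-gamma * ||x - z||^2), for two log vectors. The function name `rbf_kernel`, the value of `gamma`, and the sample values are illustrative assumptions, not settings taken from this study.

```python
import math

def rbf_kernel(x, z, gamma=0.5):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

# Hypothetical standardized (GR, RHOB, DT) vectors for two depth samples
sample_a = [0.2, -1.0, 0.5]
sample_b = [0.1, -0.8, 0.7]

same = rbf_kernel(sample_a, sample_a)   # identical points give 1.0
close = rbf_kernel(sample_a, sample_b)  # nearby points give a value in (0, 1)
print(same, close)
```

Larger values of `gamma` make the similarity decay faster with distance, which is why this parameter is usually tuned together with the regularization strength.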

  2. Decision Tree (DT)

A Decision Tree is a non-parametric, tree-based learning model used for both classification and regression32. It splits the dataset into subsets based on feature values, recursively partitioning the data into nodes until a specific stopping criterion is met (e.g., all leaves contain homogeneous classes). DT aims to create a model that predicts the target variable by learning simple decision rules inferred from the data features. Each internal node represents a feature, each branch a decision rule, and each leaf a class label. Gini impurity or entropy are commonly used as measures of node splitting33,34. One key advantage of DTs is their interpretability, making them highly useful for geological data like well logs.
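The two split criteria mentioned above can be written in a few lines. This is a minimal sketch with hypothetical label lists; the helper names `gini` and `entropy` are illustrative only.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy (bits) of the class proportions."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

pure = ["sand"] * 4                          # homogeneous node
mixed = ["sand", "sand", "shale", "shale"]   # evenly mixed node

print(gini(mixed))     # 0.5
print(entropy(mixed))  # 1.0
```

A homogeneous node scores zero on both measures, so a split is preferred when it moves the child nodes closer to that state.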

  3. Random Forest (RF)

Random Forest is an ensemble learning method built on the idea of combining multiple Decision Trees to improve accuracy and prevent overfitting35,36. It works by generating multiple DTs during training and taking the majority vote of their predictions for classification tasks (or averaging their predictions for regression tasks). Each tree in the forest is trained on a random subset of the data (using bootstrapping), and a random subset of features is considered at each split, ensuring diversity among the trees. This bagging technique reduces the variance of the model, making Random Forest robust against overfitting and noise in the data34. RF is particularly effective in handling high-dimensional data and complex relationships, which are common in lithology prediction using well logs.
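The two ingredients described above, bootstrap sampling and majority voting, can be sketched as follows. The tree-fitting step itself is omitted; the votes, the seed, and the helper names are hypothetical.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (the bagging step)."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(predictions):
    """Return the most frequent class label among the trees' predictions."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)            # fixed seed for reproducibility
rows = list(range(10))            # stand-in for ten training samples
sample = bootstrap_sample(rows, rng)

# Hypothetical predictions of five trees for one depth sample
tree_votes = ["sand", "shale", "sand", "shaly sand", "sand"]
print(majority_vote(tree_votes))  # sand
```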

  4. Artificial Neural Network (ANN)

An Artificial Neural Network (ANN) is a model inspired by the structure and function of biological neural networks; an early example was designed by Marvin Minsky in 195137. ANNs consist of layers of interconnected nodes (neurons), where each input node represents a feature in the dataset. ANNs are particularly powerful for capturing nonlinear relationships in data. The most common type of ANN is the feed-forward neural network, which consists of input, hidden, and output layers. Each neuron applies an activation function (e.g., ReLU, sigmoid) to the weighted sum of its inputs. The model learns by adjusting the weights during the training process using algorithms like backpropagation and gradient descent38. While ANNs are highly flexible and capable of modeling complex patterns, they often require large datasets and careful tuning of hyperparameters, and they are prone to overfitting, especially with smaller datasets like those often used in lithology prediction.
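A forward pass through a small feed-forward network can be sketched as below. The layer sizes, weights, and inputs are arbitrary illustrative values, and training (backpropagation) is not shown.

```python
import math

def relu(v):
    return max(0.0, v)

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(w . x + b) for each neuron."""
    return [
        activation(sum(w * x for w, x in zip(ws, inputs)) + b)
        for ws, b in zip(weights, biases)
    ]

# Three hypothetical scaled log values feeding two hidden neurons and one output
x = [0.5, -1.0, 2.0]
hidden = dense(x, [[0.2, -0.4, 0.1], [-0.3, 0.1, 0.5]], [0.0, 0.1], relu)
output = dense(hidden, [[1.0, -1.0]], [0.0], sigmoid)
print(hidden, output)
```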

  5. K-Nearest Neighbor (KNN)

K-Nearest Neighbor (KNN) is a simple, instance-based learning algorithm that classifies a data point based on the majority class of its k-nearest neighbors in the feature space39,40. KNN does not build an explicit model; instead, it stores the entire training dataset and makes predictions by calculating the distance (typically Euclidean distance) between the new data point and all training samples. The algorithm works well when there is a clear local structure in the data, making it useful for lithology prediction in cases where facies exhibit clear, local variations. However, KNN can be computationally expensive with large datasets since it requires calculating distances for every prediction38.
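The prediction rule described above is short enough to write out in full. The sketch below assumes two hypothetical features per sample and made-up training points; `knn_predict` is an illustrative name.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    nearest = [label for _, label in distances[:k]]
    return Counter(nearest).most_common(1)[0][0]

# Hypothetical scaled feature pairs for two well-separated lithologies
train_X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]]
train_y = ["sand", "sand", "sand", "shale", "shale", "shale"]

print(knn_predict(train_X, train_y, [0.15, 0.10]))  # sand
print(knn_predict(train_X, train_y, [1.00, 0.95]))  # shale
```

Every prediction scans the whole training set, which is the computational cost noted above.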

  6. Logistic Regression (LR)

Logistic regression is a widely used classification algorithm designed to predict categorical outcomes, particularly binary classification tasks41. Unlike linear regression, which predicts a continuous output, logistic regression models the probability of a data point belonging to a particular class using the logistic function (sigmoid). The model assumes a linear relationship between the input features and the log-odds of the target class, which makes it suitable for simple, linearly separable data42. In cases where lithology prediction involves more complex, nonlinear relationships, LR may underperform compared to nonlinear models like Decision Trees or SVM.
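The link between the linear log-odds and the predicted probability can be illustrated with a short sketch. The weights, bias, and feature values are arbitrary and are not fitted to any data from this study.

```python
import math

def sigmoid(z):
    """Logistic function mapping log-odds to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of the positive class from a linear log-odds model."""
    log_odds = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(log_odds)

# Hypothetical coefficients for two scaled log features
p = predict_proba([1.0, 2.0], weights=[0.5, -0.25], bias=0.1)
recovered_log_odds = math.log(p / (1.0 - p))  # inverts the sigmoid
print(p, recovered_log_odds)
```

Because the log-odds are a linear function of the inputs, the decision boundary is a hyperplane, which is the reason for the linear-separability caveat above.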

In addition to traditional supervised learning methods, recent studies have increasingly adopted deep learning architectures, such as convolutional neural networks (CNNs) for capturing spatial features in log images and recurrent neural networks (RNNs) for modeling sequential dependencies in well-log data (e.g., Mishra et al.11; Prajapati et al.38). These models, though data-intensive, have demonstrated improved performance in lithology classification and petrophysical property prediction tasks. Furthermore, there is growing interest in uncertainty-aware modeling and explainable AI (XAI) techniques, including SHAP (SHapley Additive exPlanations) and LIME (Local Interpretable Model-agnostic Explanations), which enhance model transparency and support geologically consistent interpretations.

Geological setting

The study area is located in the central part of the Lower Indus Basin (LIB) between the eastern boundary of the Indian Shield and the western boundary formed by the marginal zone of the Indian plate. To the south, it extends into the Sanghar field, supported by observations from the Offshore Murray Ridge (OMR) and Oven Fracture Plate (OFP)43. The Sanghar field is located within the Sinjhoro concession, covering an area of 180 square kilometers, as shown in Fig. 1a. Notably, significant discoveries of oil, gas, and condensate have been made from the Lower Goru Formation. The field produces 900 barrels per day of oil, with ongoing expansion efforts. This research focuses on the Lower Goru Formation, which is widespread in the Lower Indus Basin (LIB), Pakistan. The formation exhibits considerable lithological diversity, primarily due to fluctuations in sediment supply and environmental conditions44.

Fig. 1

(a) Map of the study area in Sindh, Pakistan. The base image was extracted using Google Earth Pro (Version 7.3), Google LLC. (2025). (URL: https://www.google.com/earth/versions/#earth-pro). The background basemap is the World Imagery layer sourced from ArcGIS Online (URL: https://www.arcgis.com/apps/mapviewer/index.html?layers=10df2279f9684e4a9f6a7f08febac2a9). Sources: Esri, Maxar, Earthstar Geographics, and the GIS User Community. Map imagery © Google. (b) Locations of the studied wells.

The stratigraphy of the Lower Indus Basin (LIB) is characterized by a thick sedimentary package ranging in age from the Triassic Wulagi Formation to Quaternary alluvium. The Lower Cretaceous Sembar Formation, mainly consisting of shale, is the primary source rock, whereas the late Early Cretaceous to Late Cretaceous Upper Shale acts as a seal, and the Lower Goru Formation is the reservoir rock45. This Sembar-Goru petroleum system in the LIB is notable for its efficiency and comprises several geological units.

The Early Cretaceous in the LIB is characterized by warmer latitude sedimentation of the Sembar and Goru formations over large erosional surfaces45. The Lower Goru Formation is a deltaic and shallow-marine sandstone characterized by different types of stratified sand representing a regressive, prograding sequence of topset strata. These intervals were deposited during rapid sea-level fluctuations, transitioning from high to low levels. This sequence exhibits aggradational patterns (Fig. 2). The uppermost portion of the Lower Goru Formation (LGF) comprises clastic sediments, primarily sandstone, which possess favorable reservoir characteristics and may indicate the presence of commercial hydrocarbons (oil and gas) in various parts of the LIB. The LGF was mainly deposited in deep marine settings, although some minor shallow benthic-rich fauna have been reported by researchers46. In the Lower Goru Formation, as in many other marine and deltaic basins, mixed lithologies are common, and informal terms such as sandy shale and shaly sand help describe the gradational nature of the deposits. The use of such terminology allows for a more accurate and practical classification of lithofacies in this context, aligning with previous research in the study area (e.g., Khan et al.47; Hussain et al.43).

On a regional scale, sedimentary strata dip from east to west, with these structures serving as significant conduits for hydrocarbon accumulation. The Lower Indus Basin harbors numerous sandstone reservoirs spanning from the Cretaceous to the Paleogene units, including the Goru Formation, Pab Sandstone, and Ranikot Group, exhibiting substantial hydrocarbon potential. In the study area, the Lower Goru Formation of the middle Cretaceous serves as the primary confirmed reservoir. Shales of the Upper Goru Formation act as seal rocks for both the Goru and Sembar petroleum systems48. The reservoir quality generally reduces as Goru thickness increases westward45.

Fig. 2

Generalized stratigraphy of the study area, redrawn based on data from Krois et al.49 and Azeem et al.50.

Methods

The workflow utilized in this study, as shown in Fig. 3, comprises several crucial steps tailored to predict lithology effectively.

Fig. 3

Workflow for evaluating machine learning algorithms.

Data collection

The first step in this workflow was to gather well log data from six specific wells: Chak 7, Hakeem, Resham, Chak 63, Chak 66, and Chak 5, as shown in Fig. 1b, targeting the Basal Sand interval of the Lower Goru Formation (LGF). The logs available from these wells included gamma-ray (GR), bulk density (RHOB or ZDEN), sonic (compressional sonic travel time = DTP or DT), laterolog deep (RD or LLD), and neutron porosity (PHIN), with some logs (i.e., PHIN) missing for specific wells (Table 1). These available logs provide complementary data that are widely recognized for distinguishing lithologies in deltaic and shallow marine environments. The goal was to create a dataset that would support lithology prediction and reservoir characterization. The data for Chak 7 and Hakeem were reserved for model testing, while the rest were used for model training.

Data exploration and analysis

After data collection, exploratory data analysis (EDA) was performed to understand the characteristics of the dataset. EDA helped identify patterns and relationships between well logs and lithologies, which informed further data processing. Data visualization techniques, such as correlation matrices (Fig. 4), were employed to assess the relationships between key logs (GR, sonic, bulk density, resistivity) and lithologies. The scatter plots showed substantial overlap, complicating lithology interpretation, which emphasized the need for advanced data processing and machine learning models to better delineate lithological boundaries. While formal grain-size classifications (e.g., sand, silt, clay) are important for scientific accuracy, they sometimes fail to capture the real-world complexity of sedimentary facies, especially in transitional zones where both sand and shale co-exist. In these cases, informal terms like “sandy shale” and “shaly sand” provide a more accurate representation of the physical properties observed in the field and well logs.
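A correlation matrix of the kind used in the exploratory analysis can be computed directly from the log curves. The sketch below uses the Pearson coefficient on a handful of made-up readings; the values and the `pearson` helper are illustrative assumptions and do not reproduce Fig. 4.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical readings at five depths (GR in API, RHOB in g/cm3, DT in us/ft)
logs = {
    "GR": [35.0, 50.0, 80.0, 110.0, 95.0],
    "RHOB": [2.30, 2.35, 2.50, 2.58, 2.52],
    "DT": [70.0, 74.0, 88.0, 95.0, 90.0],
}

matrix = {a: {b: pearson(logs[a], logs[b]) for b in logs} for a in logs}
for name, row in matrix.items():
    print(name, {k: round(v, 2) for k, v in row.items()})
```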

Table 1 Availability of logging data for machine learning in studied wells.

Fig. 4

Correlation matrix between gamma-ray (GR), sonic (DT), bulk density (RHOB or ZDEN) and deep-resistivity (RD) logs, illustrated for two test wells, Hakeem and Chak 7.

Data preprocessing

Preprocessing is a crucial step in machine learning workflows to ensure the data is clean and consistent. This involved the following tasks:

  • Data Cleaning: Missing values were addressed using domain-specific statistical methods, such as correlation-based imputation. Outliers were handled by applying standard deviation thresholds, and erroneous entries were corrected based on the physical limits of the logs.

  • Data Transformation: Categorical variables were encoded, numerical data was scaled, and skewness in the data was corrected. This transformation ensured the data was in a format suitable for machine learning models, allowing for consistent input across all algorithms.
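Two of the cleaning steps listed above, rejecting physically impossible readings and flagging outliers by a standard deviation threshold, can be sketched as follows. The density limits (1.0 to 3.2 g/cm3), the two-sigma threshold, and the sample values are assumptions chosen for illustration; the study's actual limits and imputation method are not reproduced here.

```python
import statistics

def clip_to_limits(values, low, high):
    """Replace missing or physically impossible readings with None."""
    return [v if v is not None and low <= v <= high else None for v in values]

def flag_outliers(values, n_std=3.0):
    """Mark readings more than n_std standard deviations from the mean."""
    valid = [v for v in values if v is not None]
    mean, std = statistics.mean(valid), statistics.stdev(valid)
    return [v is not None and abs(v - mean) > n_std * std for v in values]

# Hypothetical bulk-density curve with a null flag (-999.25) and a spurious spike
rhob = [2.45, 2.50, -999.25, 2.48, 2.52, 9.9]
cleaned = clip_to_limits(rhob, low=1.0, high=3.2)

# Hypothetical gamma-ray curve with one anomalous reading
gr = [60, 62, 58, 61, 59, 63, 60, 61, 59, 200]
flags = flag_outliers(gr, n_std=2.0)
print(cleaned, flags)
```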

After cleaning and transforming the data, the class distribution for each lithofacies type was assessed so that the models’ performance metrics could be interpreted in the context of class imbalance. Table 2 provides the distribution of samples for each lithofacies class used in this study. The code used is provided as a supplementary file, together with Figs. S1 and S2.

Table 2 Number of samples for each lithofacies class in the dataset.

Feature engineering

Feature engineering was applied to derive new informative variables from the available data. For instance, Young’s modulus was calculated using sonic logs and combined with quartz content to improve lithology classification. A shear sonic log (shear wave travel time, DTS) was also available in the testing and training wells and was used along with DTP to calculate Young’s modulus. Detailed calculations can be found in Sohail et al.51. Cross-plots of these parameters helped identify distinct clusters, particularly for differentiating between sandstone, shale, and mixed lithologies like sandy shale and shaly sand (see Fig. 5 for the rock physics model of Chak 7 well). Sandy shale typically refers to rocks where shale is the dominant matrix with a significant sand fraction, while shaly sand involves a higher proportion of sand with some interspersed shaly components. Both can be challenging to distinguish based solely on well-log data; hence, to differentiate between them, the model considers not only gamma-ray and density logs but also advanced rock-physics modeling that incorporates Young’s modulus and other log-derived features.
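For orientation, the standard isotropic relations for dynamic elastic moduli can be sketched as below: the shear modulus is G = rho * Vs^2, Poisson’s ratio is nu = (Vp^2 - 2Vs^2) / (2(Vp^2 - Vs^2)), and Young’s modulus is E = 2G(1 + nu). The sketch assumes slowness in microseconds per foot and density in g/cm3, and the input values are hypothetical; the exact procedure used in this study is the one described in Sohail et al.51 and may differ.

```python
FT_TO_M = 0.3048

def slowness_to_velocity(dt_us_per_ft):
    """Convert slowness (microseconds per foot) to velocity (m/s)."""
    return FT_TO_M * 1e6 / dt_us_per_ft

def elastic_moduli(dtp, dts, rhob):
    """Return (Young's modulus in GPa, Poisson's ratio) from DTP, DTS, RHOB."""
    vp = slowness_to_velocity(dtp)
    vs = slowness_to_velocity(dts)
    rho = rhob * 1000.0                       # g/cm3 -> kg/m3
    shear = rho * vs ** 2                     # shear modulus G in Pa
    poisson = (vp ** 2 - 2 * vs ** 2) / (2 * (vp ** 2 - vs ** 2))
    young = 2 * shear * (1 + poisson)         # E = 2G(1 + nu) in Pa
    return young / 1e9, poisson

# Hypothetical readings for a single depth sample
young_gpa, poisson = elastic_moduli(dtp=80.0, dts=140.0, rhob=2.45)
print(round(young_gpa, 1), round(poisson, 3))
```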

Fig. 5

Rock physics model for the classification of lithologies using well log data of Chak 7.

Data partitioning

Once the data was preprocessed and engineered, it was partitioned into training and testing sets. Chak 5, Resham, Chak 63, and Chak 66 wells were used for training, while Chak 7 and Hakeem were set aside for testing. This partitioning ensured the models were trained on a representative dataset and validated on unseen data for unbiased performance evaluation.
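A well-based split of this kind can be expressed in a few lines. The record layout (a `well` key on each sample) and the sample values are hypothetical; only the well names come from the text.

```python
TEST_WELLS = {"Chak 7", "Hakeem"}

def split_by_well(samples):
    """Separate samples into training and testing sets by well name."""
    train = [s for s in samples if s["well"] not in TEST_WELLS]
    test = [s for s in samples if s["well"] in TEST_WELLS]
    return train, test

# Hypothetical one-sample-per-well records
samples = [
    {"well": "Chak 5", "GR": 55.0, "facies": "sand"},
    {"well": "Resham", "GR": 98.0, "facies": "shale"},
    {"well": "Chak 63", "GR": 72.0, "facies": "shaly sand"},
    {"well": "Chak 66", "GR": 85.0, "facies": "sandy shale"},
    {"well": "Chak 7", "GR": 60.0, "facies": "sand"},
    {"well": "Hakeem", "GR": 101.0, "facies": "shale"},
]

train, test = split_by_well(samples)
print(len(train), len(test))  # 4 2
```

Splitting by well rather than by random depth sample keeps neighbouring, highly correlated samples from the same borehole out of both sets at once.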

Data scaling

To ensure consistent input across all machine learning models, the numerical features (well logs) were scaled. Standardization was applied to ensure that all features had the same scale, which is particularly important for algorithms like Support Vector Machine (SVM) and K-Nearest Neighbor (KNN) that are sensitive to feature scaling.
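Standardization can be sketched as follows. The key point is that the mean and standard deviation are estimated on the training data only and then reused for the test data; the gamma-ray values and helper names are illustrative assumptions.

```python
import statistics

def fit_scaler(values):
    """Estimate the mean and population standard deviation from training data."""
    return statistics.mean(values), statistics.pstdev(values)

def transform(values, mean, std):
    """Standardize values to zero mean and unit variance (z-scores)."""
    return [(v - mean) / std for v in values]

train_gr = [45.0, 60.0, 75.0, 90.0, 120.0]   # hypothetical training-well GR
test_gr = [50.0, 100.0]                      # hypothetical test-well GR

mean, std = fit_scaler(train_gr)
scaled_train = transform(train_gr, mean, std)
scaled_test = transform(test_gr, mean, std)  # reuses training statistics
print(scaled_train, scaled_test)
```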

Model training

Multiple machine learning algorithms were trained on the well log data, including:

  1. Logistic Regression (LR): Used for binary classification between lithologies.

  2. K-Nearest Neighbor (KNN): Employed for nonlinear relationships in the data.

  3. Random Forest (RF): A robust algorithm that handles non-linearity and missing values well.

  4. Decision Tree (DT): Used for interpretable model outputs.

  5. Artificial Neural Network (ANN): Applied to capture complex patterns and relationships between logs and lithologies.

  6. Support Vector Machine (SVM): Used with a radial basis function (RBF) kernel for nonlinear classification.

The models were trained using stratified k-fold cross-validation to limit overfitting and improve generalizability52,53,54. This technique was chosen in view of potential concerns related to the spatial dependency of facies variability within the Lower Goru Formation. Stratified k-fold cross-validation preserves the proportion of each lithofacies class in every fold, so that all classes are represented in both the training and validation splits, providing a more comprehensive evaluation of the model’s performance across diverse geological conditions.
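The stratification idea can be sketched without any library: indices are grouped by class and dealt to the folds in turn, so each fold receives roughly the same class proportions. This is a simplified illustration with hypothetical labels and no shuffling, not the implementation used in this study.

```python
from collections import Counter, defaultdict

def stratified_folds(labels, k):
    """Assign sample indices to k folds, keeping class proportions similar."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)   # deal round-robin per class
    return folds

# Hypothetical imbalanced label set with four lithofacies classes
labels = ["sand"] * 6 + ["shale"] * 6 + ["sandy shale"] * 3 + ["shaly sand"] * 3
folds = stratified_folds(labels, k=3)

for fold in folds:
    print(Counter(labels[i] for i in fold))
```

Each fold serves once as the validation split while the remaining folds are used for fitting.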

Hyperparameter tuning

Hyperparameters for each model were optimized using grid search (e.g., GridSearchCV) to determine the best configuration for lithofacies prediction, as shown in Table 3, and their impact on model performance was evaluated using cross-validation techniques. This tuning process involved adjusting parameters such as the number of neighbors for KNN, the number of trees for the Random Forest, the learning rate for ANN, and the kernel type for SVM. The aim was to identify the configuration that would yield the best performance for each algorithm.
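The logic of an exhaustive grid search can be sketched for a small KNN classifier as follows. For brevity the sketch scores each configuration on a single validation split rather than with cross-validation, and the grid values, data, and function names are hypothetical; they are not the settings reported in Table 3.

```python
from collections import Counter
from itertools import product

def knn_predict(train, query, k, p):
    """train: list of (features, label); p: Minkowski order (1 or 2)."""
    def dist(a, b):
        return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)
    nearest = sorted(train, key=lambda row: dist(row[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(train, valid, k, p):
    hits = sum(knn_predict(train, x, k, p) == y for x, y in valid)
    return hits / len(valid)

def grid_search(train, valid, grid):
    """Try every (k, p) combination and keep the best-scoring one."""
    best_params, best_score = None, -1.0
    for k, p in product(grid["k"], grid["p"]):
        score = accuracy(train, valid, k, p)
        if score > best_score:            # ties keep the earlier combination
            best_params, best_score = {"k": k, "p": p}, score
    return best_params, best_score

train = [([0.0, 0.0], "sand"), ([0.2, 0.1], "sand"), ([0.1, 0.3], "sand"),
         ([1.0, 1.0], "shale"), ([0.9, 1.2], "shale"), ([1.1, 0.8], "shale")]
valid = [([0.1, 0.1], "sand"), ([1.0, 0.9], "shale")]
grid = {"k": [1, 3], "p": [1, 2]}

best_params, best_score = grid_search(train, valid, grid)
print(best_params, best_score)
```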

Table 3 Optimum hyperparameters used in the studied models.

Model testing

The trained models were evaluated on the reserved test set (Chak 7 and Hakeem) to assess their performance on unseen data. The testing phase ensured that the models were not overfitted and generalized well beyond the training data.

Model evaluation and selection

Model performance was evaluated using several metrics:

  • Accuracy: To measure the proportion of correctly classified lithologies.

  • Precision: To assess the proportion of samples predicted as a given lithology (e.g., sand or shale) that truly belong to it.

  • Recall: To measure the model’s capability in identifying true positives (sand or shale).

  • F1 Score: To balance precision and recall in cases where data is imbalanced between lithologies.
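The four metrics listed above can be computed per lithofacies class as in the following sketch. The true and predicted labels are made up for illustration and do not correspond to the results reported in this study.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def per_class_scores(y_true, y_pred, cls):
    """Return (precision, recall, F1) for one class in a multi-class problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)
    fp = sum(t != cls and p == cls for t, p in pairs)
    fn = sum(t == cls and p != cls for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical labels in which mixed facies are confused with each other
y_true = ["sand", "sand", "shale", "shale", "sandy shale", "shaly sand"]
y_pred = ["sand", "shaly sand", "shale", "shale", "shaly sand", "shaly sand"]

print(accuracy(y_true, y_pred))
for cls in ["sand", "shale", "sandy shale", "shaly sand"]:
    print(cls, per_class_scores(y_true, y_pred, cls))
```

Reporting the scores per class makes it visible when a high overall accuracy hides poor performance on a minority facies.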

Statistical tests, such as paired t-tests, were conducted to compare the performance of different models. The best-performing model was selected for lithology prediction based on the evaluation metrics.
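A paired t-test compares two models evaluated on the same folds by testing whether the mean of their score differences is zero, with t = mean(d) / (sd(d) / sqrt(n)) and n - 1 degrees of freedom. The sketch below computes only the statistic; the fold accuracies are hypothetical, and the p-value would be obtained from a t-distribution table or a statistics library.

```python
import math
import statistics

def paired_t_statistic(scores_a, scores_b):
    """Return (t statistic, degrees of freedom) for paired score lists."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    standard_error = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / standard_error, n - 1

# Hypothetical per-fold accuracies of two models on the same five folds
model_a = [0.90, 0.92, 0.88, 0.91, 0.89]
model_b = [0.85, 0.88, 0.86, 0.87, 0.84]

t_stat, dof = paired_t_statistic(model_a, model_b)
print(round(t_stat, 2), dof)
```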

Results

Manual interpretation of lithologies

The manual interpretation of lithologies involves a systematic examination of core samples and well logs to discern and classify various lithological units of the Basal Sands of the Lower Goru Formation, as illustrated in Fig. 6. This analysis relies on mineral composition, texture, and depositional environment specific to each lithology. Four main lithologies within the Basal Sands of the Lower Goru Formation—sand, shale, sandy shale, and shaly sand—have been identified based on a combination of geological knowledge of the Lower Goru Formation and analysis using the elbow method in the K-means clustering algorithm. The details of this manual interpretation fall beyond the scope of the present study and are therefore not delineated here. The lithologies interpreted within the Basal Sand intervals (2800–2845 m in Chak 7 well and 2955–2985 m in Hakeem well), as depicted in Fig. 6, serve as a benchmark and necessary input for the machine learning algorithms. These interpreted lithologies constrain the algorithms, ensuring accurate predictions by aligning them with the reference outputs obtained through this interpretation.

Fig. 6

Well logs and manually interpreted lithologies of the Basal Sand interval of the Lower Goru Formation encountered in the (a) Chak 7 and (b) Hakeem wells.

The lithological correlation between Chak 63, Chak 66, and Resham wells provides critical insights into the subsurface geology of the Basal Sand within the Lower Goru Formation in the Lower Indus Basin, as illustrated in Fig. 7. In Chak 63 and Chak 66, the consistent sequence of sand overlying shale suggests relatively stable depositional conditions, with minor variations in sand thickness likely influenced by localized changes in sediment supply or energy regimes, post-depositional compaction, or slight variations in subsidence rates. In contrast, the more complex depth distribution of sand and shale in the Resham well indicates that geological factors such as faulting, folding, or variations in paleotopography may have played a role in controlling sediment deposition and preservation. This implies that Resham may be situated near a structural feature like an anticline or a fault zone, which could have influenced sedimentation patterns. Alternatively, these variations might reflect lateral shifts in the depositional environment, likely due to changes in shoreline position or sediment supply, which are common in fluvio-deltaic to shallow marine settings. In such settings, transgressive-regressive cycles play a key role in shaping sedimentary facies distributions55, as reflected in the distinct lithofacies observed across these wells.

The lithological profile of the Chak 5 well, as depicted in Fig. 8, is distinguished by a higher proportion of sandstone, thinner sandy shale units, and an overall dominance of sandstone compared to other wells in the area. Gamma-ray values above 80 API (e.g., 100 or 125 API) are attributed to glauconitic sands in the Lower Goru Formation, as noted by Khan et al.47, which explains the higher-than-expected readings for typical sandstones. Although Khan et al.47 also examine the Lower Goru Formation, their lithological interpretation differs due to regional variations in depositional environments, as well as differing methods of gamma-ray log calibration. This distinctive profile suggests a more energetic depositional environment, possibly driven by fluctuations in sea level or variations in sediment supply, which could reflect shifts in sedimentary facies or the influence of structural features, such as faulting, that have affected the distribution and thickness of the sedimentary layers. The clear predominance of sandstone in the upper and lower sections, along with the reduced thickness of the intervening sandy shale and shale units, may indicate proximity to a sediment source, such as a deltaic channel or shoreface system. Additionally, diagenetic processes like compaction and cementation may have influenced the porosity and permeability of the sandstone and shale units, further differentiating them from other wells in the region. Further analysis, such as biostratigraphic correlation, seismic interpretation, or advanced petrophysical studies, would help clarify the underlying factors driving this lithological variation, but these analyses are beyond the scope of the current study.

Fig. 7

Well logs and manually interpreted lithologies of the Basal Sand interval of the Lower Goru Formation encountered in the (a) Chak 63, (b) Chak 66 and (c) Resham wells.

Fig. 8

Well logs and manually interpreted lithologies of the Basal Sand interval of the Lower Goru Formation encountered in the Chak 5 well.

Lithology prediction using machine learning

The study employed a range of machine learning algorithms, including Support Vector Machine (SVM), Random Forest (RF), Artificial Neural Network (ANN), K-Nearest Neighbor (KNN), Decision Tree (DT), and Logistic Regression (LR), to model and predict lithofacies within wells Chak 7 and Hakeem. The results obtained from these algorithms were compared both visually and mathematically with the manually interpreted lithologies, as illustrated in Figs. 9 and 10. The Chak 7 well exhibited two thick shale patches, contrasting with the thinner shale patches observed in the Hakeem well. This disparity suggests a gradual thinning of shale towards the southwest of the study area, a phenomenon supported by the depositional style of the Lower Goru Formation in this region. Moreover, the lithologies within the Basal Sands revealed several complexities, leading to the observation of sand-shale intercalations within the Hakeem well. Despite these complexities, all examined machine learning algorithms effectively captured the intricate lithological patterns in the study area. Figures 9 and 10 depict how these algorithms delineate lithological boundaries and capture the variations and complexities inherent in the Basal Sands of the Lower Goru Formation. This highlights the robustness of machine learning techniques in handling and interpreting geological data, particularly in contexts marked by significant lithological heterogeneity.

Fig. 9

Manually interpreted lithologies (first column) followed by the lithologies predicted by the machine learning algorithms for the Chak 7 well: (a) SVM, (b) RF, (c) NNC, (d) KNN, (e) DT and (f) LR.

Fig. 10

Manually interpreted lithologies (first column) followed by the lithologies predicted by the machine learning algorithms for the Hakeem well: (a) SVM, (b) RF, (c) NNC, (d) KNN, (e) DT and (f) LR.

Performance of machine learning algorithms

The performance of the machine learning algorithms was evaluated using four key metrics: accuracy, precision, recall, and F1 score. To provide a more granular comparison, Fig. 11 (a-f) presents the F1 scores for the individual lithofacies classes (Sand, Shale, Sandy Shale, and Shaly Sand) across the six machine learning algorithms: Support Vector Machine (SVM), Decision Tree (DT), K-Nearest Neighbor (KNN), Artificial Neural Network (ANN), Logistic Regression (LR), and Random Forest (RF).

Fig. 11

F1 Scores for Individual Lithofacies Classes (Sand, Shale, Sandy Shale, and Shaly Sand) across Different Machine Learning Algorithms (SVM, Decision Tree, KNN, ANN, Logistic Regression, and Random Forest). Each bar represents the F1 score for a specific lithofacies class, illustrating the comparative performance of the algorithms in predicting lithology types.
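The per-class scores summarized in Fig. 11 can be illustrated with a minimal sketch that computes precision, recall, and F1 for each lithofacies class from paired true and predicted labels. The function name `per_class_scores` and the toy labels are assumptions made for illustration; they are not the study's code or data.

```python
from collections import Counter

CLASSES = ["Sand", "Shale", "Sandy Shale", "Shaly Sand"]

def per_class_scores(y_true, y_pred, classes=CLASSES):
    """Return {class: (precision, recall, f1)} for multi-class labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # class p was predicted but is not the true class
            fn[t] += 1  # true class t was missed
    scores = {}
    for c in classes:
        precision = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        recall = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[c] = (precision, recall, f1)
    return scores

# Toy labels, invented for illustration only (not from the study's wells).
y_true = ["Sand", "Sand", "Shale", "Shale", "Sandy Shale", "Shaly Sand"]
y_pred = ["Sand", "Shaly Sand", "Shale", "Shale", "Shaly Sand", "Shaly Sand"]

scores = per_class_scores(y_true, y_pred)
# Macro averaging: unweighted mean of the per-class F1 scores.
macro_f1 = sum(f1 for _, _, f1 in scores.values()) / len(scores)

for name, (p, r, f1) in scores.items():
    print(f"{name:12s} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
print(f"macro F1 = {macro_f1:.2f}")
```

In this toy example the two transitional classes pull the macro-averaged F1 down even though Shale is predicted perfectly, which mirrors the kind of class-level contrast that a per-class plot is meant to expose.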

Although Table 4 shows high accuracy, certain lithologies, such as sandy shale and shaly sand, were misclassified because of their overlapping physical characteristics. This is a common challenge in lithofacies classification and is particularly evident in models that struggle to distinguish these two facies, as shown in Fig. 9. SVM had difficulty correctly classifying the green cluster, likely because of its sensitivity to outliers among the overlapping data points. Further tuning of the SVM model or testing with different kernels could improve performance. Notably, among all algorithms, the random forest model achieved the highest scores for Chak 7, while the decision tree model performed best for Hakeem. These results underscore the strength of the Decision Tree (DT) and Random Forest (RF) algorithms in extracting robust insights from well-log data. Random Forest’s strong performance is attributed to its ability to handle large datasets, its resilience against overfitting, and its feature importance ranking. Moreover, the lower lithological heterogeneity of the Chak 7 well compared to the Hakeem well likely contributed to RF’s success. Conversely, DT was effective in the more heterogeneous Hakeem well, which reflects its simplicity and non-parametric nature. The other algorithms yielded satisfactory predictions of lithological facies, but their performance metrics were comparatively lower, indicating limitations in handling overlapping lithologies such as sandy shale and shaly sand.

Table 4 Machine learning algorithms and their prediction accuracy for the Chak 7 and Hakeem wells.

Figure 12 presents performance metrics with confidence intervals for the six machine learning models (SVM, Decision Tree, KNN, ANN, Logistic Regression, and Random Forest) across four metrics: accuracy, precision, recall, and F1 score. Random Forest achieves the highest accuracy at 96%, followed by Decision Tree at 93%; at the lower end, SVM and Logistic Regression reach 89% and 88%, respectively. In terms of precision, Decision Tree and Random Forest perform comparably, at 95% and 94%, respectively, while SVM has the lowest precision at 86%. For recall, Random Forest and Decision Tree again outperform the other models, both scoring 96%, while Logistic Regression has the lowest recall at 88%. The F1 score follows a similar trend, with Random Forest leading at 95%, followed by Decision Tree at 93%; SVM and Logistic Regression show the lowest F1 scores, at 86% and 84%, respectively. The confidence intervals (represented by the error bars) provide an estimate of the uncertainty in these performance values. For example, SVM has an accuracy confidence interval of [87%, 91%], while Random Forest has an interval of [94%, 98%]. The analysis includes both micro and macro averaging, which helps account for class imbalance in the multi-class classification task and gives a more robust evaluation of model performance across the lithofacies classes.

Fig. 12

Performance metrics with confidence intervals for the machine learning models.
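The text does not state how the confidence intervals in Fig. 12 were obtained. One common option is a percentile bootstrap over the test samples, sketched below for accuracy. The function `bootstrap_ci`, the number of resamples, and the toy labels are all assumptions for illustration and should not be read as the study's procedure.

```python
import random

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def bootstrap_ci(y_true, y_pred, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for accuracy (assumed method)."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        # Resample sample indices with replacement, keeping pairs aligned.
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(sum(y_true[i] == y_pred[i] for i in idx) / n)
    stats.sort()
    lower = stats[int(round((alpha / 2) * n_boot))]
    upper = stats[int(round((1 - alpha / 2) * n_boot)) - 1]
    return lower, upper

# Toy data, invented for illustration: 200 samples, 180 predicted correctly.
y_true = ["Shale"] * 200
y_pred = ["Shale"] * 180 + ["Sand"] * 20

acc = accuracy(y_true, y_pred)
ci_low, ci_high = bootstrap_ci(y_true, y_pred)
print(f"accuracy = {acc:.3f}, 95% CI = [{ci_low:.3f}, {ci_high:.3f}]")
```

The same resampling loop can be reused for precision, recall, or F1 by swapping the statistic computed on each resample; the interval narrows as the number of test samples grows.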

The confusion matrices for the Random Forest (RF) and Decision Tree (DT) algorithms, shown in Fig. 13, indicate that shale is the easiest lithology to predict, with a high number of correct classifications in all cases (e.g., 479 and 1031 correct predictions for shale in RF). This suggests that the models can readily identify distinct lithologies such as shale from well log data, for which gamma-ray and density logs are often clear indicators. However, both algorithms struggle with more ambiguous lithologies such as sandy shale and shaly sand, which are frequently misclassified. This difficulty may be due to the overlapping physical characteristics of these lithologies, which make them hard to distinguish from wireline log inputs. Random Forest outperforms Decision Tree, particularly in larger datasets (e.g., RF correctly predicts 1031 instances of shale vs. DT’s 1014), although both models demonstrate robustness in handling nonlinear data. Random Forest’s ensemble learning framework provides an edge in dealing with noise and complex patterns, which explains its overall better performance in terms of accuracy and precision. These results emphasize the importance of algorithm selection and parameter tuning when predicting lithologies from well logs, especially when distinguishing between closely related rock types.

Fig. 13

Confusion matrices demonstrating the classification performance of random forest and decision tree algorithms for lithology prediction, highlighting accuracy in predicting shale and challenges in distinguishing between Sandy Shale and Shaly Sand.
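To make the reading of Fig. 13 concrete, the sketch below builds a confusion matrix (rows are true classes, columns are predicted classes) and reports the largest off-diagonal cell, i.e. the most frequent confusion. The helper names and the toy labels are hypothetical; the counts are invented and do not reproduce the values in Fig. 13.

```python
CLASSES = ["Sand", "Shale", "Sandy Shale", "Shaly Sand"]

def confusion_matrix(y_true, y_pred, classes=CLASSES):
    """Rows index the true class, columns index the predicted class."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

def most_confused_pair(matrix, classes=CLASSES):
    """Return (true_class, predicted_class, count) for the largest off-diagonal cell."""
    best = None
    for i, row in enumerate(matrix):
        for j, count in enumerate(row):
            if i != j and (best is None or count > best[2]):
                best = (classes[i], classes[j], count)
    return best

# Toy labels, invented for illustration only.
y_true = (["Shale"] * 5 + ["Sand"] * 3
          + ["Sandy Shale"] * 3 + ["Shaly Sand"] * 3)
y_pred = (["Shale"] * 5 + ["Sand"] * 3
          + ["Sandy Shale", "Shaly Sand", "Shaly Sand"]
          + ["Shaly Sand", "Shaly Sand", "Sandy Shale"])

matrix = confusion_matrix(y_true, y_pred)
worst = most_confused_pair(matrix)
for name, row in zip(CLASSES, matrix):
    print(f"{name:12s} {row}")
print("most frequent confusion:", worst)
```

In this toy case the end-member classes sit entirely on the diagonal, while the errors concentrate in the Sandy Shale and Shaly Sand cells, which is the pattern the paragraph above describes qualitatively.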

Feature importance for lithofacies prediction

Figure 14 compares the feature importance of the different well logs in predicting lithofacies for two machine learning models: Random Forest and Decision Tree. Both models show that Gamma-Ray and Density are the most significant features, with Gamma-Ray having the highest importance in the Random Forest model (0.35) and Gamma-Ray and Density being equally important in the Decision Tree model (0.30 each). The Sonic log also contributes significantly, particularly in the Decision Tree model (0.20), but has a slightly lower importance in the Random Forest model (0.15). Neutron Porosity and Resistivity have the least impact, each with an importance score of 0.10 in both models. This indicates that Gamma-Ray and Density are the most critical logs for lithofacies classification, although their relative importance varies slightly between the two models. The figure enhances the transparency and interpretability of the models by showing which well logs have the greatest impact on lithofacies classification.

Fig. 14

Comparison of feature importance for lithofacies prediction using random forest and decision tree models.
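The text does not specify how the importance scores in Fig. 14 were computed. For tree-based models a common choice is impurity-based importance, in which a feature is credited with the reduction in Gini impurity produced by the splits that use it. The sketch below, which assumes that definition, evaluates a single threshold split on two hypothetical logs; the log values, thresholds, and function names are invented for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def impurity_decrease(values, labels, threshold):
    """Gini reduction obtained by splitting samples at values <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - weighted

# Toy samples, invented for illustration only.
labels = ["Sand", "Sand", "Sand", "Shale", "Shale", "Shale"]
gamma_ray = [40, 55, 60, 110, 120, 130]   # API units, separates the classes
resistivity = [5, 20, 8, 6, 18, 9]        # ohm-m, does not separate them

gr_gain = impurity_decrease(gamma_ray, labels, threshold=80)
res_gain = impurity_decrease(resistivity, labels, threshold=10)
print(f"gamma-ray split gain   = {gr_gain:.2f}")
print(f"resistivity split gain = {res_gain:.2f}")
```

In a full tree or forest these gains would be weighted by the number of samples reaching each split, summed per feature over all splits, and normalized so that the importances add up to one.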

Receiver operating characteristic (ROC) curve

In addition to the confusion matrix analysis, the accuracy of the predicted lithology types was assessed using receiver operating characteristic (ROC) curves, as shown in Fig. 15. An ROC curve plots the true positive rate against the false positive rate of an ML model for each lithology, and the area under the ROC curve (AUC) is used as a measure of accuracy in the present analysis. The AUC values obtained for the individual lithologies range from 0.84 to 0.93 for DT and from 0.97 to 0.99 for RF.

Additionally, the ROC curves and AUC values reinforce that all ML models achieved over 80% accuracy during the training phase, making them well prepared for lithology prediction in testing. These findings highlight the importance of selecting algorithms with robust classification abilities and underscore the role of the ROC curve as a supplementary tool for evaluating and confirming model performance beyond conventional accuracy metrics. The ROC curves for the ML algorithms are provided in the Appendix (Figs. A1, A2, A3, and A4).

Fig. 15

Receiver Operating Characteristic (ROC) curves of the (a) DT and (b) RF classifier for four lithology types, Sand, Shale, Sandy Shale and Shaly Sand.
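For a multi-class problem such as this one, a per-lithology AUC is typically obtained in a one-vs-rest fashion: the samples of one class are treated as positives and all others as negatives. The sketch below assumes that setup and uses the rank interpretation of AUC, namely the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The function name and the toy scores are hypothetical.

```python
def auc_one_vs_rest(y_true, scores, positive):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy data, invented for illustration: predicted probability of "Shale".
y_true = ["Shale", "Shale", "Shale", "Sand", "Sandy Shale", "Shaly Sand"]
shale_prob = [0.9, 0.8, 0.4, 0.1, 0.5, 0.2]

shale_auc = auc_one_vs_rest(y_true, shale_prob, positive="Shale")
print(f"one-vs-rest AUC for Shale = {shale_auc:.3f}")
```

An AUC of 1.0 means every positive sample is scored above every negative one, while 0.5 corresponds to random ranking; repeating the calculation with each lithology as the positive class yields one AUC per curve.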

Factors affecting prediction accuracy

Data quality and computational resources posed significant challenges, with noise and inconsistencies in well log measurements impacting model precision. While effective, models like RF require considerable computational power, especially in large datasets, underscoring the need for efficient optimization methods. Furthermore, missing or erroneous data necessitated robust imputation strategies to maintain prediction integrity. Incorporating these considerations into model selection processes remains crucial for optimizing lithology prediction in operational settings.

Comparison with literature

Table 5 compares RF and DT performance metrics across studies, showing close alignment with Banerjee et al.56 and Merembayev et al.57. Differences in F1 score and recall with Xie et al.7 highlight the influence of dataset quality and log sensitivity on model performance, reinforcing the importance of data quality and parameter tuning for lithology prediction. These findings align with the broader trend in the literature that supports ML models’ efficacy in managing complex, high-dimensional geological data, making them valuable tools for lithofacies classification in varied geological settings.

Table 5 A comparison of RF and DT with studies from the literature.

Moreover, this study aligns with broader trends in the literature, highlighting a growing preference for machine-learning models over traditional empirical methods for lithology prediction. These modern approaches excel at managing complex, high-dimensional datasets effectively, as supported by various researchers15,58,59. The evidence thus reinforces the need for continued exploration of machine learning applications in geosciences, as these techniques are proving to be instrumental in refining lithofacies classification.

In light of these challenges, this study underscores the transformative potential of machine learning algorithms in revolutionizing operational efficiency and decision-making within the oil and gas sector. By automating the lithology interpretation process, companies can significantly streamline operations, thereby curtailing the time and resources conventionally expended on manual log analysis and leading to substantial cost savings. Furthermore, the ability to swiftly and accurately predict lithologies across multiple wells empowers geoscientists and reservoir engineers to gain timely insights into subsurface geological properties, facilitating more informed decision-making throughout exploration and production activities. Through automated lithology prediction, companies can optimize well planning, refine reservoir characterization efforts, and enhance hydrocarbon recovery strategies, ultimately bolstering overall operational performance and maximizing asset value. Moreover, the integration of machine learning-based lithology prediction into existing workflows equips industry professionals with the tools to harness advanced analytics and data-driven insights, fostering a culture of continuous improvement and innovation within the oil and gas industry.

Suggestions for further study

Based on this study’s findings, several recommendations can guide future research and industry practices in lithology prediction within the oil and gas sector. Firstly, prioritize decision trees (DT) and random forests (RF) in modeling efforts due to their demonstrated effectiveness, robust performance, and interpretability for subsurface characterization. Further refinement of predictive models is crucial, involving ongoing research into optimization techniques to enhance accuracy and generalization capabilities. While Random Forest and Decision Tree showed the best results in this study, future work could explore hybrid models combining the strengths of various algorithms, such as ensemble methods or deep learning, to improve lithofacies classification. Additionally, integrating machine learning-based lithology prediction with other geophysical data sources, such as seismic surveys, can provide a more comprehensive understanding of reservoir properties and improve decision-making in exploration and production activities. Companies should invest in developing automated workflows and decision support systems based on machine learning algorithms to streamline lithology interpretation processes, reduce manual effort, and accelerate decision-making, thereby increasing productivity and cost-effectiveness for oil and gas companies. Lastly, thorough validation and benchmarking studies should be conducted across diverse datasets and geological settings to ensure the reliability and robustness of predictive models in real-world applications, facilitating their adoption by industry stakeholders. Implementing these recommendations can harness the full potential of machine learning in lithology prediction, driving innovation and efficiency in subsurface exploration and hydrocarbon recovery efforts within the oil and gas industry.

Conclusion

This study used multiple machine learning techniques, including Support Vector Machine (SVM), Random Forest (RF), Decision Tree (DT), Artificial Neural Network (ANN), K-Nearest Neighbor (KNN), and Logistic Regression (LR), to predict lithofacies from wireline logs in the deltaic depositional system of the Lower Goru Formation, Lower Indus Basin, Pakistan. Each algorithm offers distinct advantages and limitations depending on the characteristics of the data. While SVM excels at nonlinear classification, Decision Trees and Random Forests provide interpretability and robustness against noise. ANN is well suited to capturing complex patterns, whereas KNN is better suited to localized variations. Logistic Regression serves as a useful baseline for simpler problems. In this study, the Random Forest and Decision Tree models demonstrated superior performance, highlighting their strength in handling the complex, noisy, and multivariate nature of well log data for lithology prediction.

Among the evaluated models, Decision Tree (DT) and Random Forest (RF) demonstrated the highest classification accuracies, particularly in lithologically complex intervals. Specifically, for the Hakeem well, DT achieved up to 98% accuracy, followed closely by RF at 97%, KNN, ANN, and SVM at 96%, and LR at 94%. These results confirm the effectiveness of tree-based methods in handling the multivariate, nonlinear nature of well-log data. This performance ranking supports the recommendation of RF and DT as preferred models for lithofacies prediction in similar depositional settings.