Application of machine learning in forensic geochemistry using presalt oil samples from the Santos basin

Silva, Gil Marcio Avelino; de Miranda, Fernando Pellon; Guzzo, Jarbas Vicente Poley; Bastos, Wagner Leonel; Rocha, Ygor; de Souza, Igor Viegas Alves Fernandes; Matias, Italo Oliveira; Torres, Sarah Barron; de Araujo Ponte, Francisco Fabio

doi:10.1038/s41598-025-00084-5

Article
Open access
Published: 02 May 2025

Gil Marcio Avelino Silva¹,
Fernando Pellon de Miranda¹,
Jarbas Vicente Poley Guzzo¹,
Wagner Leonel Bastos¹,
Ygor Rocha¹,
Igor Viegas Alves Fernandes de Souza¹,
Italo Oliveira Matias²,
Sarah Barron Torres² &
Francisco Fabio de Araujo Ponte²

Scientific Reports volume 15, Article number: 15424 (2025)

Abstract

Identifying oil spills in offshore production areas presents a critical challenge, requiring reliable and efficient methodologies to minimize environmental and economic impacts. Traditional approaches are often time-consuming, subjective, and limited in their ability to provide accurate predictions. This study introduces a novel methodology that integrates geochemical data analysis with machine learning techniques to enhance the identification of oil spill origins. A dataset comprising 2200 presalt oil samples and 75 attributes from the Santos Basin underwent preprocessing and exploratory analysis, resulting in 2137 samples and 62 predictive attributes. Seven machine learning algorithms were evaluated, with the random forest model achieving the highest classification accuracy of 91%. The methodology was validated using three independent oil samples (spill events and one natural seep), demonstrating its robustness in accurately predicting field origins with high confidence. The integration of machine learning techniques and geochemical analysis reduced the subjectivity of human interpretation, significantly accelerated diagnostic workflows, and provided reliable results in minutes. This approach represents a scalable and innovative solution for both exploratory and forensic geochemistry, particularly in complex production areas along the Brazilian coast. The proposed methodology has the potential to enhance decision-making processes in environmental monitoring and oil exploration.

Introduction

Oil slicks of anthropogenic origin are classified as oil spills. Identifying and characterizing oil spills present critical challenges in offshore production areas, leading to significant environmental and economic implications. Geochemistry, the study of the distribution and migration of chemical elements in the geological spectrum, provides essential tools for addressing these challenges. Within this field, organic geochemistry focuses on the distribution of carbon in Earth’s systems, whereas petroleum geochemistry applies chemical principles to investigate the origin, generation, migration, accumulation, and alteration of petroleum. These fields are pivotal in both exploratory and forensic applications.

Oil is a complex mixture of hydrocarbons in solid, liquid, and gaseous states, along with trace amounts of oxygen, sulfur, nitrogen and metals. It is categorized into three primary classes of compounds: saturated, aromatic, and polar compounds. Among the organic species present in oil, biomarkers—molecular fossils derived from the decomposition of living organisms—are particularly valuable. These compounds retain their carbon skeletons with minimal alteration during the transformation of organic matter into petroleum, making them critical indicators of oil origin, quality, and maturity¹.

Geochemical analysis of oil typically relies on biomarker ratios rather than absolute concentrations, as the former are less susceptible to external variations and analytical inconsistencies. This approach underpins oil‒oil and oil‒rock correlations, contributing to exploratory investigations aimed at determining source rock characteristics, depositional paleoenvironments, thermal maturity, and the extent of biodegradation². In Brazil, marginal basins, such as the Santos Basin, contain oils derived from diverse paleoenvironments, including lacustrine, deltaic, and marine systems, formed during the separation of South American and African plates³. The application of petroleum geochemistry in these basins has been invaluable in identifying and evaluating oil accumulations with high accuracy and cost-effectiveness.

In forensic contexts, geochemical methods are used to characterize spilled oil and identify its origin, supporting investigations into environmental incidents, such as leaks from pipelines, operational failures on platforms, and tanker accidents⁴. Biomarker analysis, particularly of terpanes and steranes, is a cornerstone of these investigations, as it provides reliable data for distinguishing between potential sources. The ability to rapidly and accurately determine the origin of oil spills is critical for mitigating environmental damage and informing legal and regulatory actions.

Although traditional geochemical analysis is effective, it is often time-consuming and subject to interpretative biases. Recent advances in machine learning offer promising solutions to these limitations. By integrating numerical methods with expert-driven approaches, machine learning algorithms can increase predictive accuracy, reduce subjectivity, and accelerate analysis. This study introduces a novel methodology that (i) treats geochemical data with a numerical compositional approach and supervised classification of oils, (ii) uses multivariate statistics to analyze redundant variables and reduce dimensionality, and (iii) employs machine learning techniques to build models that classify the origins of oil spills, validating these classifications with independent samples.

This study aims to improve the reliability and efficiency of geochemical workflows, particularly in exploratory and forensic applications. By reducing the time required for analysis and increasing the objectivity of interpretations, the proposed methodology has the potential to transform oil spill diagnostics and decision-making processes in high-production areas such as the Santos Basin. This integration of artificial intelligence and geochemistry methods represents a significant advancement in the field, offering new opportunities for environmental monitoring and resource management.

Literature review

In recent years, the integration of machine learning techniques with geochemical data analysis has significantly advanced the fields of oil spill identification and source attribution. Several studies have demonstrated the efficacy of these approaches in enhancing the accuracy and efficiency of oil spill detection.

Miranda et al.⁵ developed predictive models to distinguish natural seepage slicks from anthropogenic oil spills on sea surfaces by employing synthetic aperture radar (SAR) sensors and artificial intelligence methods. Their approach utilized geometric pattern recognition under a transfer learning framework, achieving notable success in classifying oil slick sources.

Similarly, Genovez et al.⁶ applied machine learning methods to differentiate between natural and anthropogenic oil slicks in the Gulf of Mexico. By analyzing a robust dataset of validated oil samples detected via SAR, they evaluated multiple algorithms, including random forest and artificial neural networks, and achieved a maximum accuracy of 73.15% with the random forest classifier.

In another study, Liu et al.⁷ proposed a semiautomatic oil spill detection method that combines texture analysis, machine learning, and adaptive thresholding to process X-band marine radar images. Their methodology effectively identified oil spills, demonstrating the potential of integrating various computational techniques for environmental monitoring.

These studies underscore the transformative potential of combining machine learning algorithms with geochemical and remote sensing data to enhance oil spill detection and classification. The methodologies developed by researchers such as Miranda, Genovez, and Liu have laid a strong foundation for future advancements in this critical area of environmental science.

Materials and methods

This section outlines the methodology used to classify oil spill samples, including data collection, preprocessing, exploratory analysis, and machine learning model development. The workflow integrates geochemical principles with data science techniques to increase the reliability and efficiency of oil source identification.

Dataset description

The dataset comprises 2200 presalt oil samples collected from the Santos Basin, Brazil (Fig. 1). Each sample includes 75 attributes: 72 derived from diagnostic ratios of saturated geochemical biomarkers (Table 1) and 3 categorical parameters, ‘Sample’, ‘Well’, and ‘Field’.

Table 1 List of attributes and diagnostic ratios used with detailed calculations.

Biomarkers, including terpanes, steranes, and other hydrocarbon fractions, were selected on the basis of their diagnostic relevance in oil spill studies.

To maintain anonymity, samples and wells were assigned coded field names. Each name begins with the letter “C,” followed by a three-digit sequence (e.g., C001). The database contains 52 fields, of which 45 were used: C001, C003–8, C010–11, C014–18, C020–29, C031–37, and C039–52.
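As a quick consistency check on the field list above, the abbreviated ranges can be expanded and counted. This is a minimal sketch: the helper `expand_field_codes` is hypothetical and not part of the study, and the ranges are typed with ASCII hyphens instead of the en dashes used in the text.

```python
def expand_field_codes(spec: str) -> list[str]:
    """Expand a spec such as 'C003-8' into ['C003', ..., 'C008']."""
    codes = []
    for part in spec.split(","):
        part = part.strip().lstrip("C")
        if "-" in part:
            start, end = part.split("-")
            # The end of a range is abbreviated: '003-8' means 003..008.
            end_full = start[: len(start) - len(end)] + end
            first, last = int(start), int(end_full)
        else:
            first = last = int(part)
        codes.extend(f"C{n:03d}" for n in range(first, last + 1))
    return codes


fields = expand_field_codes(
    "C001, C003-8, C010-11, C014-18, C020-29, C031-37, C039-52"
)
print(len(fields))  # 45
```

The count of 45 agrees with the number of fields reported as used.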

Overview of the workflow

The methodology consists of 5 data-driven operations:

1. Data acquisition: Geochemical data were collected from the oil samples.
2. Data preprocessing: The dataset was cleaned, normalized, and optimized.
3. Exploratory data analysis (EDA): The most relevant features were identified and selected.
4. Machine learning: Predictive models were developed via machine learning methods.
5. Application: The models were applied to new samples to validate them and assess their robustness.

Operations guided by expert knowledge include analysis and interpretation (result validation) as well as operational use (information sharing and decision-making).

A flowchart illustrating the workflow is shown in Fig. 2.

Data acquisition (analytical procedures)

The saturated biomarker profiles of the samples were analyzed in a geochemical laboratory.

Gas chromatography coupled with mass spectrometry (GC–MS) was applied to study petroleum biomarker distributions. It is the most widely used method for evaluating petroleum and petroleum products⁸, with standardized and comparable laboratory methodologies and results⁹.

These biomarkers provide insights into the source rock, depositional paleoenvironment, and thermal maturation of organic matter.

The analysis focused on terpanes and steranes, as these compounds are well-established indicators for correlating oil samples and identifying their origins. The ion fractions monitored (m/z 177, 191, 217, 218, and 259) are listed in Table 2.

Table 2 Some of the compounds monitored in the saturated fraction.

Data preprocessing

The preprocessing phase aims to clean and optimize the dataset by addressing inconsistencies, duplicates, and outliers:

1. Outliers: Anomalous compositional data that may result from contamination or misregistration. The isolation forest algorithm was employed for detection.
2. Missing values: Rows and columns containing absent data, which require replacement or removal.
3. Duplicates: Samples recorded more than once, which need to be removed to avoid redundancy.
4. Data transformation: Normalization is applied to avoid mismatches in scale and measurement units. The normal score function (mean = 0, standard deviation = 1) was used to ensure consistency across attributes. This step brings the variables to the same order of magnitude and places them within a predefined range, which helps preserve the integrity of the results.
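The four preprocessing steps can be sketched with pandas and scikit-learn as follows. This is an illustrative sketch on synthetic data, not the authors' code. Several choices are assumptions: the function name `preprocess`, the `contamination` value, the 80% missing-data cutoff used as a default, and the reading of the normal score function as plain z-score standardization (a rank-based normal score transform would be a different choice). Imputation is done before outlier detection because the isolation forest cannot handle missing values.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest


def preprocess(df: pd.DataFrame, max_missing: float = 0.8,
               contamination: float = 0.01, seed: int = 0) -> pd.DataFrame:
    """Clean a table of numeric diagnostic ratios (one row per sample)."""
    # 1. Drop exact duplicate rows.
    df = df.drop_duplicates()
    # 2. Drop columns that are mostly empty, then impute the rest with the median.
    df = df.loc[:, df.isna().mean() <= max_missing]
    df = df.fillna(df.median())
    # 3. Flag outliers with an isolation forest (-1 = outlier, 1 = inlier).
    labels = IsolationForest(contamination=contamination,
                             random_state=seed).fit_predict(df)
    df = df[labels == 1]
    # 4. Standardize each column to mean 0 and standard deviation 1.
    return (df - df.mean()) / df.std(ddof=0)


# Synthetic stand-in for the confidential data (hypothetical column names).
rng = np.random.default_rng(0)
raw = pd.DataFrame(rng.normal(size=(200, 4)), columns=["r1", "r2", "r3", "r4"])
raw.loc[0, "r1"] = np.nan              # one missing value to impute
raw["mostly_empty"] = np.nan           # column with 95% missing values
raw.loc[:9, "mostly_empty"] = 1.0
raw = pd.concat([raw, raw.iloc[1:6]])  # five duplicated rows

clean = preprocess(raw)
print(raw.shape, "->", clean.shape)
```

In practice, rows flagged by the isolation forest would be reviewed by an expert before removal instead of being dropped automatically.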

Exploratory data analysis (EDA)

EDA was performed to examine the dataset’s statistical properties and reduce dimensionality:

Correlation matrix: A table that displays the collinearity between variables, with values ranging from −1 to 1. The color of each cell at the intersection of two variables visually indicates the degree of collinearity between their measurements.
Multidimensional scaling (MDS): This method measures the degree of similarity or dissimilarity between multivariate structures. The correlation between the variables is used as the basis for calculating the distance matrix; the greater the distance, the greater the dissimilarity. Grouped variables are highly correlated, exhibit high similarity, and have shorter distances between them. MDS facilitates the visualization of the closest or grouped attributes in two and three dimensions.
Principal component analysis (PCA): PCA is applied to transform multivariate distributions into ‘principal components’ (PCs), which are orthogonal to each other in a system of reduced dimensionality¹⁰. These components are uncorrelated (r = 0) and reveal the relative contribution of each transformed variable to the multivariate system. PCA is a linear spectral decomposition technique that is particularly attractive for situations involving many variables.
K-means clustering: This method groups similar samples into K distinct clusters on the basis of their features. The number of clusters is defined beforehand. The center of each cluster (centroid) is the arithmetic mean of all the points belonging to it, and each data point is assigned to the closest centroid. The assignment and centroid update steps are iterated, with the centroids moving at each pass, until the points no longer change clusters and the convergence criteria are met.

This process ensures that only the most informative and nonredundant attributes are retained for machine learning.
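A minimal sketch of three of these EDA steps on synthetic data is given below, assuming scikit-learn and pandas. The helper `drop_correlated`, the 0.95 default threshold, the column names, and the choice of three clusters are illustrative assumptions; the MDS step is omitted for brevity.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA


def drop_correlated(df: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    """Drop the later column of every pair whose |r| exceeds the threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    redundant = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=redundant)


rng = np.random.default_rng(0)
a, b = rng.normal(size=300), rng.normal(size=300)
# Column "c" is almost a copy of "a", so it carries no new information.
demo = pd.DataFrame({"a": a, "b": b, "c": a + 0.01 * rng.normal(size=300)})

reduced = drop_correlated(demo)                   # removes "c"
scores = PCA(n_components=2).fit_transform(reduced)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(reduced)

print(list(reduced.columns))
print(np.corrcoef(scores.T)[0, 1])                # principal components are uncorrelated
```

Dropping the later column of each correlated pair is only one possible rule; which member of a pair to keep is a judgment that benefits from geochemical expertise.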

Machine learning

The machine learning component of this study was implemented in Python, using several specialized libraries to facilitate data analysis and model development:

Scikit-learn: Utilized for implementing various machine learning algorithms and evaluation metrics.
Pandas: Employed for data manipulation and preprocessing tasks.
NumPy: Used for numerical computations and array operations.
Matplotlib: Applied for data visualization and plotting.

Seven machine learning algorithms were evaluated for their ability to classify oil spill samples:

1. Decision tree (DT)
2. Random forest (RF)
3. Gaussian naive Bayes (GNB)
4. K-nearest neighbors (KNN)
5. Artificial neural network (ANN)
6. Support vector machine (SVM)
7. Linear discriminant analysis (LDA)

The dataset was partitioned into 80% for training and 20% for testing to ensure robust model evaluation. Each algorithm was trained to predict the field of origin on the basis of geochemical attributes. To assess the performance of these models comprehensively, several metrics were calculated:

Accuracy: The proportion of correct predictions over the total predictions made.
Precision: The ratio of true positive predictions to total positive predictions, reflecting the model’s exactness.
Recall (Sensitivity): The ratio of true positive predictions to all actual positive cases, reflecting the model’s ability to identify positive instances.
F1-Score: The harmonic mean of precision and recall, which serves as a single metric that balances both measures.

For each machine learning algorithm, a classification model was developed by optimizing the combination of parameters and selecting the most predictive attributes. Feature importance analysis was conducted, particularly for ensemble methods such as random forest, to identify which attributes significantly influenced the model’s predictions. Visual representations, such as feature importance plots, were generated to illustrate the impact of each feature, where longer bars denote greater importance.
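The feature importance analysis can be illustrated with scikit-learn's `feature_importances_` attribute of a fitted random forest. The sketch below uses synthetic data in which only one (hypothetical) ratio determines the class, and prints a text bar chart in place of a plot; the feature names and the number of trees are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 2] > 0).astype(int)   # only the third (hypothetical) ratio matters
names = ["ratio_a", "ratio_b", "ratio_c", "ratio_d", "ratio_e"]

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based importances are normalized so that they sum to 1.
ranking = sorted(zip(names, rf.feature_importances_),
                 key=lambda pair: pair[1], reverse=True)
for name, score in ranking:
    # Longer bars denote greater importance.
    print(f"{name:8s} {'#' * int(round(score * 40))} {score:.3f}")
```

Impurity-based importances can be biased when features are strongly correlated, which is one reason to remove redundant attributes beforehand.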

Additionally, confusion matrices were constructed to visualize the performance of each model, displaying the distributions of true positives, true negatives, false positives, and false negatives. These matrices provide insights into the model’s ability to correctly classify samples across different categories.
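For a multiclass problem such as field prediction, the metrics above are computed per class from the confusion matrix. The following NumPy sketch works through a toy example; the six predictions and the function names are invented for illustration.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    cm = np.zeros((len(labels), len(labels)), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[index[t], index[p]] += 1
    return cm


def per_class_metrics(cm):
    tp = np.diag(cm).astype(float)
    predicted = cm.sum(axis=0)   # column sums: TP + FP
    actual = cm.sum(axis=1)      # row sums: TP + FN
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom,
                   out=np.zeros_like(tp), where=denom > 0)
    accuracy = tp.sum() / cm.sum()
    return accuracy, precision, recall, f1


labels = ["C001", "C003", "C004"]
y_true = ["C001", "C001", "C001", "C003", "C003", "C004"]
y_pred = ["C001", "C001", "C003", "C003", "C003", "C004"]

cm = confusion_matrix(y_true, y_pred, labels)
accuracy, precision, recall, f1 = per_class_metrics(cm)
print(cm)
print(round(accuracy, 3), precision.round(3), recall.round(3), f1.round(3))
```

Here one C001 sample is misassigned to C003, which lowers the recall of C001 and the precision of C003 to 2/3 each while the overall accuracy is 5/6.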

Application of new samples for model validation

To validate the methodology, three independent oil samples unrelated to the training dataset were analyzed:

Sample A: A natural oil seep sample collected from the sea floor as a result of leakage from subsurface geologic features.
Sample B: An oil spill sample.
Sample C: An oil spill sample.

These samples were processed using the same attributes and normalization techniques as the training data, allowing the machine learning model with the best performance to predict their origins with high accuracy.
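The key point of this step is that a new sample must pass through the normalization fitted on the training data before prediction. A minimal sketch with scikit-learn follows, assuming a `StandardScaler` as the normalization and using invented training data for three coded fields; the function `rank_fields` is hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical training set: three coded fields with shifted ratio distributions.
fields = np.repeat(["C001", "C003", "C004"], 100)
centers = {"C001": 0.0, "C003": 3.0, "C004": 6.0}
X_train = np.vstack([rng.normal(centers[f], 1.0, size=4) for f in fields])

scaler = StandardScaler().fit(X_train)   # fitted on the training data only
rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(scaler.transform(X_train), fields)


def rank_fields(sample, top=2):
    """Return the `top` most probable fields for one raw (unscaled) sample."""
    proba = rf.predict_proba(scaler.transform(np.atleast_2d(sample)))[0]
    order = np.argsort(proba)[::-1][:top]
    return [(str(rf.classes_[i]), float(proba[i])) for i in order]


print(rank_fields([3.1, 2.8, 3.3, 2.9]))
```

Reporting the two or three most probable fields, instead of a single label, gives the interpreter a sense of how decisive the prediction is.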

Results

This section presents the findings of the study, covering data preprocessing, exploratory analysis, machine learning model evaluation, and validation of the proposed methodology via independent samples. Each step is detailed to illustrate the robustness of the workflow and its applicability in identifying oil spill sources. The results demonstrate the success of integrating geochemical data with machine learning techniques, highlighting the accuracy of predictive models and their ability to generalize beyond the training dataset.

Data preprocessing

This stage included handling missing values, eliminating duplicate records and removing outliers, in addition to normalization.

Outliers detected via the isolation forest algorithm (Fig. 3A) were associated with anomalous compositional data, likely caused by contamination or sample misclassification.

Notably, two samples from Well X were initially flagged as outliers. Upon expert review, these samples revealed evidence of an independent accumulation linked to a secondary petroleum system in field C004. This finding underscores the importance of combining automated techniques with expert validation. Figure 3B shows, on the left side, that Well X was initially thought to contain a single oil accumulation, with two samples of differing compositions considered outliers. On the right side, however, the final interpretation of Well X indicates the presence of two distinct oil accumulations separated by a water interval.

Missing data were imputed using the median value. In one column, ‘25NOR_Hopane’, more than 80% of the values were missing, so this feature was removed. In addition, 25 rows were classified as outliers and eliminated from the dataset, as were 38 entries of duplicate samples.

Figure 3C presents, as an example, the normalization of one of the attributes with the normal score function, displaying the frequency histograms before and after the transformation as well as the probability–probability (P–P) plot. Before normalization, the dataset contained an isolated frequency peak, which could indicate inconsistencies in the data or lead to misinterpretation by the algorithm.

After preprocessing, of the initial 2,200 oil samples and 75 attributes, the dataset was reduced to 2137 samples with 74 attributes.

Exploratory data analysis (EDA)

EDA was conducted to reduce dimensionality and improve the predictive ability of the dataset. The key methods used included the following:

Correlation matrix analysis: Seven highly correlated variables (r > 0.9545, i.e., 95.45%), such as ‘27_29BBRS218’ and ‘TRIC_HOP’, were identified and removed (Fig. 4A).
Multidimensional scaling (MDS): A distance less than or equal to 0.05 on the map was set as the threshold for characterizing strong similarity (Fig. 4B), leading to the removal of two redundant parameters, ‘TS_TS + TM’ and ‘DIAH_H30’.
Principal component analysis (PCA): This method reduces the dimensionality of the dataset while preserving variance.
K-means clustering: Data are separated into clusters to optimize feature selection.

These techniques reduced the number of attributes from 74 to 65. For the subsequent classification stage, the three categorical parameters ‘Sample’, ‘Well’, and ‘Field’ were excluded. Therefore, 62 predictive attributes were retained for the 2137 oil samples, keeping only the most relevant features for machine learning.

The optimum subset was as follows: ‘P27BBS218’, ‘P28BBS218’, ‘P29BBS218’, ‘27_29BBS218’, ‘28_29BBS218’, ‘28_29BBRS218’, ‘P27AAAR’, ‘P28AAAR’, ‘P29AAAR’, ‘DIA_C27AA’, ‘GAM_H30’, ‘H35_H34’, ‘H29_H30’, ‘H28_H29’, ‘TET24_H30’, ‘TET24_26TRI’, ‘23_24TRI’, ‘19_23TRI’, ‘STER_HOP’, ‘HOP_STER’, ‘TRIC_STER’, ‘TS_TM’, ‘NORNEO_H29’, ‘21 + 22_STER’, ‘S_S + R’, ‘C29BBS_C29R’, ‘21_23TRI’, ‘24_25TRI’, ‘26_25TRI’, ‘26_28TRI’, ‘DITERP_H30’, ‘29_30H’, ‘H_HM2930’, ‘TR23_H30’, ‘NOR25H_H29’, ‘NOR25H_H30’, ‘H28_TR23’, ‘GAM_TR23’, ‘H29_C29TS’, ‘H31S_H31’, ‘H32S_H32’, ‘H33S_H33’, ‘PH31’, ‘PH32’, ‘PH33’, ‘PH34’, ‘PH35’, ‘C29BBR_C29R’, ‘C29BB_C29’, ‘P28aBBReS218’, ‘P29aBBReS218’, ‘P27ST’, ‘P28ST’, ‘P29ST’, ‘H30_C27AA’, ‘DIA30_C27AA’, ‘TPP’, ‘TNH_TNH + 25NOR’, ‘BNH_BNH + 25NOR’, ‘BNH_BNH + C29HOP’, ‘NORC29Ts_NORC29Ts + TsC29’, and ‘25NOR_25NOR + C29HOP’.

This refinement improved model performance by minimizing redundancy and maximizing interpretability.
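The sample and attribute counts reported across the preprocessing and EDA stages can be reconciled with simple bookkeeping. The snippet below only restates figures given in the text.

```python
samples = 2200
samples -= 25           # rows classified as outliers
samples -= 38           # duplicate entries

attributes = 75
attributes -= 1         # '25NOR_Hopane', more than 80% missing values
after_preprocessing = attributes
attributes -= 7         # highly correlated variables (correlation matrix)
attributes -= 2         # redundant parameters identified by MDS
after_eda = attributes
predictive = after_eda - 3   # 'Sample', 'Well', and 'Field' excluded

print(samples, after_preprocessing, after_eda, predictive)  # 2137 74 65 62
```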

Machine learning classification

In this study, we evaluated seven machine learning algorithms to classify oil spill samples: DT, RF, GNB, KNN, ANN, SVM, and LDA. The dataset was randomly divided into 80% for training (1,709 samples) and 20% for testing (428 samples) via the train_test_split function from Python’s scikit-learn library, ensuring an unbiased selection process.
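The reported split sizes follow from how scikit-learn's `train_test_split` handles a fractional `test_size`: the test count is rounded up and the remainder goes to training. The arithmetic is sketched here without the library.

```python
import math

n_samples = 2137
test_fraction = 0.20

n_test = math.ceil(n_samples * test_fraction)   # 427.4 rounds up to 428
n_train = n_samples - n_test                    # 1709

print(n_train, n_test)  # 1709 428
```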

Accuracy was chosen as the primary performance metric to provide an initial overview of the models’ effectiveness. To ensure a comprehensive understanding of the classifiers’ capabilities, we also calculate additional metrics such as precision, recall, and the F1 score. These complementary metrics are particularly useful for monitoring performance across different classes and identifying potential model biases.

The RF model achieved the highest overall accuracy of 91%, outperforming other methods such as ANN (88%), LDA (87%), KNN (86%), GNB (84%), and DT (82%). The SVM exhibited the lowest accuracy at 79%. Figure 5A presents the relative importance of features in the RF model, while Fig. 5B shows the confusion matrix of the RF model, validating its reliability for geochemical classification tasks.
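A comparison of this kind can be sketched with scikit-learn as shown below. Because the study's data are confidential, the example uses a synthetic dataset from `make_classification`, so the resulting accuracies bear no relation to the reported values. The hyperparameters are library defaults or arbitrary choices, not the tuned settings of the study.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 600 samples, 20 features, 4 classes.
X, y = make_classification(n_samples=600, n_features=20, n_informative=8,
                           n_classes=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

models = {
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(n_estimators=200, random_state=0),
    "GNB": GaussianNB(),
    "KNN": KNeighborsClassifier(),
    "ANN": MLPClassifier(max_iter=1000, random_state=0),
    "SVM": SVC(),
    "LDA": LinearDiscriminantAnalysis(),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

# Print the models from most to least accurate on the held-out test set.
for name, acc in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:4s} {acc:.2%}")
```

A single random split gives only one estimate per model; repeating the split or using cross-validation would show how stable the ranking is.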

The RF algorithm correctly classified 100% of the samples in 31 of the 45 fields. In three other fields, more than 90% of the samples were accurately classified.

Application of new samples for model validation

To evaluate the generalizability of the proposed methodology, three independent oil samples (two spills and one natural seep) were analyzed via the trained RF model (Fig. 6):

Sample A (Natural Oil Seep): Predicted as originating from field C051 with a probability of 55%, followed by C008 (24%).
Sample B (Oil Spill): Correctly classified as belonging to field C045 with 86% probability.
Sample C (Oil Spill): Also associated with field C045, with a high confidence of 91%.

These results demonstrate the robustness of the RF model in accurately predicting oil origins, even for samples not included in the training dataset.

Discussion

The findings of this study demonstrate the potential of integrating geochemical analysis with machine learning techniques to improve oil spill identification. This discussion evaluates the implications of the results, compares the methodology to existing approaches, and highlights the study’s novelty and limitations.

Performance of the proposed methodology

The RF model outperformed all other machine learning algorithms, achieving an overall accuracy of 91%. This high performance reflects the effectiveness of the data preprocessing and feature selection strategies, which reduce redundancy and improve predictive power. In comparison, traditional statistical methods often fail to handle complex, high-dimensional datasets as effectively, particularly when collinearity between variables is present.

The validation of the RF model using independent samples further underscores its robustness. The accurate classification of Samples B and C (oil spills) as originating from field C045 and the plausible prediction for Sample A (natural seep) demonstrate the model’s ability to generalize beyond the training data (Fig. 6). These results confirm the reliability of the proposed methodology in real-world scenarios.

Comparative advantage of the approach

Traditional geochemical approaches rely heavily on manual interpretation, which is time-consuming and prone to subjectivity. By integrating machine learning, this study reduces human bias and accelerates the analysis process, transforming tasks that once required hours or days into minutes. The automation of diagnostic workflows provides a scalable solution for large-scale geochemical datasets, making this methodology particularly advantageous for high-production areas such as the Santos Basin.

Furthermore, the use of advanced dimensionality reduction techniques, such as PCA and MDS, allows the methodology to focus on the most relevant geochemical attributes. This contrasts with traditional methods, which often include redundant or irrelevant variables, potentially skewing results.

Geological implications

The findings of this study align with the known geological characteristics of the Santos Basin. The ability to differentiate between oil accumulations and secondary petroleum systems, as seen in the case of outliers in Well X, highlights the utility of combining machine learning methods with expert-guided validation. The accurate classification of oil spills also contributes to environmental monitoring efforts, providing actionable insights for mitigating the impact of offshore incidents.

The results also highlight the compositional similarity of oils from the Santos Basin, reflecting the geological simplicity of its reservoirs. This contrasts with the neighboring Campos Basin, where more complex geological conditions create greater compositional diversity. These insights emphasize the importance of tailoring geochemical and machine learning approaches to the specific characteristics of different basins.

Study novelty and contributions

This study introduces a novel workflow that combines expert knowledge with machine learning techniques to increase the reliability and efficiency of geochemical analysis. By validating the methodology with independent samples, the study bridges the gap between theoretical modeling and practical application. The proposed approach represents a significant advancement over traditional methods by offering the following:

Greater accuracy in oil source identification.
A scalable framework for geochemical analysis.
Reduced subjectivity through automation and standardization.

These contributions highlight the transformative potential of artificial intelligence in geochemistry, particularly for exploratory and forensic applications.

Limitations and future directions

While the methodology demonstrated high accuracy and robustness, several limitations must be addressed. This study focused on a single dataset from the Santos Basin, limiting the generalizability of the findings to other regions or basins with different geological characteristics. Expanding the dataset to include more diverse samples would improve the model’s adaptability.

Additionally, while machine learning reduces subjectivity, the success of the methodology still relies on the quality of the input data and expert validation. Future studies could explore the integration of unsupervised learning techniques to further automate the discovery of patterns in geochemical datasets.

Conclusion

This study successfully integrates geochemical analysis with machine learning techniques to address the critical challenge of identifying oil spill origins in offshore production areas. By leveraging advanced statistical techniques and machine learning algorithms, the proposed methodology offers a reliable and efficient approach for oil source identification, significantly reducing the subjectivity and time constraints associated with traditional methods.

From an initial dataset of 2200 samples and 75 attributes, rigorous preprocessing and exploratory analysis reduced the data to 2137 samples and 62 predictive attributes. Among the seven machine learning models tested, RF demonstrated the highest accuracy (91%), validating its suitability for geochemical classification tasks. The methodology was validated with three independent samples—two spills and one natural seep—demonstrating its robustness and practical applicability in real-world scenarios. Notably, the predictions aligned with geological contexts, reinforcing the model’s reliability.

The integration of machine learning methods into geochemical workflows represents a transformative advancement, allowing for the rapid, objective, and accurate diagnosis of oil origins. By reducing the analysis time from hours or days to mere minutes, the methodology provides a scalable solution for both exploratory and forensic applications. Its implementation is particularly relevant for high-production areas such as the Santos Basin, where accurate and efficient spill diagnostics are critical for environmental monitoring and resource management.

This study also highlights the importance of expert-guided validation in complementing automated approaches, ensuring the accuracy and interpretability of results. The successful identification of secondary petroleum systems, as evidenced in Well X, underscores the utility of combining machine learning with domain expertise to derive actionable geological insights.

While the methodology has demonstrated high accuracy and reliability, further research is needed to generalize the approach to other basins with varying geological characteristics. Expanding the dataset to include more diverse samples and incorporating unsupervised learning techniques could further enhance the model’s adaptability and robustness.

In conclusion, this study presents a novel, scalable, and efficient framework for oil spill diagnostics that combines the strengths of geochemical analysis and machine learning. It offers significant potential for transforming geochemical workflows, providing value for both environmental and exploratory geochemistry.

Data availability

The datasets generated and analyzed during the current study are not publicly available due to industrial confidentiality considerations. The data contain confidential information related to commercial and proprietary processes, which are protected by nondisclosure agreements. Access to these datasets is restricted exclusively to technical staff within the company’s exploration department, under agreed-upon conditions and with the necessary authorization. The datasets are available from the corresponding author upon reasonable request.

References

1. Brocks, J. J. & Summons, R. E. Sedimentary hydrocarbons: Biomarkers for early life. In Treatise on Geochemistry 2nd edn, Vol. 8 (eds Holland, H. & Turekian, K.) 65–103 (Elsevier, 2003).
2. Peters, K. E., Walters, C. C. & Moldowan, J. M. The Biomarker Guide. Volume 1: Biomarkers and Isotopes in the Environment and Human History. Volume 2: Biomarkers and Isotopes in Petroleum Exploration and Earth History, 2nd edn, 1132 pp. (Cambridge University Press, 2005).
3. Mello, M. R., Gaglianone, P. C., Brassel, S. C. & Maxwell, J. R. Geochemical and biological marker assessment of depositional environments using Brazilian offshore oils. Mar. Pet. Geol. 5(3), 205–223 (1988).
4. Wang, Z. et al. Petroleum biomarker fingerprinting for oil spill characterization and source identification. In Standard Handbook of Oil Spill Environmental Forensics: Fingerprinting and Source Identification, Chapter 4 (eds Stout, S. & Wang, Z.) 131–254 (Elsevier, 2016).
5. de Miranda, F. P. et al. Improved classification models to distinguish natural from anthropic oil slicks in the gulf of Mexico using machine learning. Rem. Sens. 13(22), 4568 (2021).
6. Genovez, P. C. et al. Development and application of predictive models to distinguish seepage slicks from oil spills on sea surfaces employing SAR sensors and artificial intelligence: Geometric patterns recognition under a transfer learning approach. Rem. Sens. 15(6), 1496 (2023).
7. Liu, P., Li, Y., Liu, B., Chen, P. & Xu, J. Semi-automatic oil spill detection on x-band marine radar images using texture analysis, machine learning, and adaptive thresholding. Rem. Sens. 11(7), 756 (2019).
8. Roman-Hubers, A. T., Cordova, A. C., Barrow, M. P. & Rusyn, I. Analytical chemistry solutions to hazard evaluation of petroleum refining products. Regul. Toxicol. Pharmacol. 137, 105310 (2023).
9. Weiss, H. M., Wilhelms, A., Mills, N., Scotchmer, J., Hall, P. B., Lind, K. & Brekke, T. NIGOGA: The Norwegian Industry Guide to Organic Geochemical Analyses [online], Edition 4.0, 102 pp. (Norsk Hydro, Statoil, Geolab Nor, SINTEF Petroleum Research and the Norwegian Petroleum Directorate, 2000).
10. Bishop, C. M. Pattern Recognition and Machine Learning (Springer, 2006). https://doi.org/10.1007/978-0-387-45528-0.

Download references

Acknowledgements

We express our gratitude to Petroleo Brasileiro S.A. (Petrobras) for the opportunity to contribute to this research project and for supporting this study within the framework of the Cooperation Agreement with the Pontifical Catholic University of Rio de Janeiro (PUC-Rio). We also extend our thanks to the PUC-Rio team for their collaboration and valuable contributions to this work.

Author information

Authors and Affiliations

Petroleo Brasileiro S.A., Rio de Janeiro, Brazil
Gil Marcio Avelino Silva, Fernando Pellon de Miranda, Jarbas Vicente Poley Guzzo, Wagner Leonel Bastos, Ygor Rocha & Igor Viegas Alves Fernandes de Souza
PUC-Rio, Rio de Janeiro, Brazil
Italo Oliveira Matias, Sarah Barron Torres & Francisco Fabio de Araujo Ponte

Contributions

All authors contributed to writing the main manuscript, including the figures and tables. Additionally, all authors reviewed and approved the final version of the manuscript.

Corresponding author

Correspondence to Gil Marcio Avelino Silva.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Silva, G.M.A., de Miranda, F.P., Guzzo, J.V.P. et al. Application of machine learning in forensic geochemistry using presalt oil samples from the Santos basin. Sci Rep 15, 15424 (2025). https://doi.org/10.1038/s41598-025-00084-5

Received: 18 November 2024
Accepted: 24 April 2025
Published: 02 May 2025
DOI: https://doi.org/10.1038/s41598-025-00084-5
