Introduction

Oil slicks of anthropogenic origin are classified as oil spills. Identifying and characterizing oil spills is a critical challenge in offshore production areas, with significant environmental and economic implications. Geochemistry, the study of the distribution and migration of chemical elements in geological systems, provides essential tools for addressing this challenge. Within this field, organic geochemistry focuses on the distribution of carbon in Earth’s systems, whereas petroleum geochemistry applies chemical principles to investigate the origin, generation, migration, accumulation, and alteration of petroleum. These fields are pivotal in both exploratory and forensic applications.

Oil is a complex mixture of hydrocarbons in solid, liquid, and gaseous states, along with trace amounts of oxygen, sulfur, nitrogen and metals. It is categorized into three primary classes of compounds: saturated, aromatic, and polar compounds. Among the organic species present in oil, biomarkers—molecular fossils derived from the decomposition of living organisms—are particularly valuable. These compounds retain their carbon skeletons with minimal alteration during the transformation of organic matter into petroleum, making them critical indicators of oil origin, quality, and maturity1.

Geochemical analysis of oil typically relies on biomarker ratios rather than absolute concentrations, as the former are less susceptible to external variations and analytical inconsistencies. This approach underpins oil‒oil and oil‒rock correlations, contributing to exploratory investigations aimed at determining source rock characteristics, depositional paleoenvironments, thermal maturity, and the extent of biodegradation2. In Brazil, marginal basins, such as the Santos Basin, contain oils derived from diverse paleoenvironments, including lacustrine, deltaic, and marine systems, formed during the separation of South American and African plates3. The application of petroleum geochemistry in these basins has been invaluable in identifying and evaluating oil accumulations with high accuracy and cost-effectiveness.

In forensic contexts, geochemical methods are used to characterize spilled oil and identify its origin, supporting investigations into environmental incidents, such as leaks from pipelines, operational failures on platforms, and tanker accidents4. Biomarker analysis, particularly of terpanes and steranes, is a cornerstone of these investigations, as it provides reliable data for distinguishing between potential sources. The ability to rapidly and accurately determine the origin of oil spills is critical for mitigating environmental damage and informing legal and regulatory actions.

Although traditional geochemical analysis is effective, it is often time-consuming and subject to interpretative biases. Recent advances in machine learning offer promising solutions to these limitations. By integrating numerical methods with expert-driven approaches, machine learning algorithms can increase predictive accuracy, reduce subjectivity, and accelerate analysis. This study introduces a methodology that (i) combines geochemical data under a numerical compositional approach with supervised classification of oils, (ii) uses multivariate statistics to identify redundant variables and reduce dimensionality, and (iii) employs machine learning techniques to build models that classify the origins of oil spills, with the classifications then validated on independent samples.

This study aims to improve the reliability and efficiency of geochemical workflows, particularly in exploratory and forensic applications. By reducing the time required for analysis and increasing the objectivity of interpretations, the proposed methodology has the potential to transform oil spill diagnostics and decision-making processes in high-production areas such as the Santos Basin. This integration of artificial intelligence and geochemistry methods represents a significant advancement in the field, offering new opportunities for environmental monitoring and resource management.

Literature review

In recent years, the integration of machine learning techniques with geochemical data analysis has significantly advanced the fields of oil spill identification and source attribution. Several studies have demonstrated the efficacy of these approaches in enhancing the accuracy and efficiency of oil spill detection.

Miranda et al.5 developed predictive models to distinguish natural seepage slicks from anthropogenic oil spills on sea surfaces by employing synthetic aperture radar (SAR) sensors and artificial intelligence methods. Their approach utilized geometric pattern recognition under a transfer learning framework, achieving notable success in classifying oil slick sources.

Similarly, Genovez et al.6 applied machine learning methods to differentiate between natural and anthropogenic oil slicks in the Gulf of Mexico. By analyzing a robust dataset of validated oil samples detected via SAR, they evaluated multiple algorithms, including random forest and artificial neural networks, and achieved a maximum accuracy of 73.15% with the random forest classifier.

In another study, Liu et al.7 proposed a semiautomatic oil spill detection method that combines texture analysis, machine learning, and adaptive thresholding to process X-band marine radar images. Their methodology effectively identified oil spills, demonstrating the potential of integrating various computational techniques for environmental monitoring.

These studies underscore the transformative potential of combining machine learning algorithms with geochemical and remote sensing data to enhance oil spill detection and classification. The methodologies developed by researchers such as Miranda, Genovez, and Liu have laid a strong foundation for future advancements in this critical area of environmental science.

Materials and methods

This section outlines the methodology used to classify oil spill samples, including data collection, preprocessing, exploratory analysis, and machine learning model development. The workflow integrates geochemical principles with data science techniques to increase the reliability and efficiency of oil source identification.

Dataset description

The dataset comprises 2200 presalt oil samples collected from the Santos Basin, Brazil (Fig. 1). Each sample includes 75 attributes: 72 derived from diagnostic ratios of saturated geochemical biomarkers (Table 1) and 3 categorical parameters, ‘Sample’, ‘Well’, and ‘Field’.

Fig. 1 Location map of the presalt reservoirs in the Santos Basin, Brazil. The map was created using ArcGIS Pro 3.1.2 (available at https://www.esri.com/en-us/arcgis/products/arcgis-pro/), with public data sources including bathymetry from the Geological Survey of Brazil (CPRM) and city locations from the Brazilian Institute of Geography and Statistics (IBGE). The satellite imagery background is provided by ArcGIS (source indicated in the figure).

Table 1 List of attributes and diagnostic ratios used with detailed calculations.

Biomarkers, including terpanes, steranes, and other hydrocarbon fractions, were selected on the basis of their diagnostic relevance in oil spill studies.

To maintain anonymity, samples and wells were assigned coded field names. Each name begins with the letter “C,” followed by a three-digit sequence (e.g., C001). The database contains 52 fields, of which 45 were used: C001, C003–C008, C010–C011, C014–C018, C020–C029, C031–C037, and C039–C052.

Overview of the workflow

The methodology consists of five data-driven operations:

  1. Data acquisition: Geochemical data were collected from the oil samples.

  2. Data preprocessing: The dataset was cleaned, normalized, and optimized.

  3. Exploratory data analysis (EDA): The most relevant features were identified and selected.

  4. Machine learning: Predictive models were developed via machine learning methods.

  5. Application: The models were applied to new datasets to validate them and assess their robustness.

Operations guided by expert knowledge include analysis and interpretation (result validation) as well as operational use (information sharing and decision-making).

A flowchart illustrating the workflow is shown in Fig. 2.

Fig. 2 Flowchart of the methodology used in data-based operations.

Data acquisition (analytical procedures)

The saturated biomarker profiles of the samples were analyzed in a geochemical laboratory.

Gas chromatography coupled with mass spectrometry (GC–MS) is the most widely used method for evaluating petroleum and petroleum products8 and is routinely applied to study petroleum biomarker distributions, with standardized and comparable laboratory methodologies and results9.

These biomarkers provide insights into the source rock, depositional paleoenvironment, and thermal maturation of organic matter.

The analysis focused on terpanes and steranes, as these compounds are well-established indicators for correlating oil samples and identifying their origins. The ion fractions monitored (m/z 177, 191, 217, 218, and 259) are listed in Table 2.

Table 2 Some of the compounds monitored in the saturated fraction.

Data preprocessing

The preprocessing phase aims to clean and optimize the dataset by addressing inconsistencies, duplicates, and outliers:

  1. Outliers: Anomalous compositional data that may result from contamination or misregistration. The isolation forest algorithm was employed to detect them.

  2. Missing values: Rows and columns containing absent data, which require replacement or removal.

  3. Duplicates: Samples recorded more than once, which must be removed to avoid redundancy.

  4. Data transformation: Normalization is applied to avoid mismatches in scale and measurement units. The normal score function (mean = 0, standard deviation = 1) was used to ensure consistency across attributes. This step is important for the integrity of the results because it brings the variables to the same order of magnitude and places them on a common, predefined scale.
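The four preprocessing steps can be sketched in a few lines of Python. This is a minimal illustration on synthetic data, not the authors' code. The helper name `preprocess`, the `contamination` value, and the column names are hypothetical, and the order of operations is an assumption (imputation comes before outlier detection because scikit-learn's isolation forest does not accept missing values). The normal score transform is approximated here with scikit-learn's rank-based `QuantileTransformer`; if a plain z-score was intended, `StandardScaler` would be the equivalent choice.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import QuantileTransformer


def preprocess(df, max_missing=0.8, contamination=0.01, seed=0):
    """Clean a table of numeric diagnostic ratios (hypothetical helper)."""
    # Remove records that appear more than once.
    df = df.drop_duplicates().reset_index(drop=True)
    # Drop columns that are mostly empty, then fill remaining gaps with the median.
    df = df.loc[:, df.isna().mean() <= max_missing]
    df = df.fillna(df.median())
    # Isolation forest labels: -1 = flagged as outlier, 1 = inlier.
    forest = IsolationForest(contamination=contamination, random_state=seed)
    labels = forest.fit_predict(df)
    flagged, kept = df[labels == -1], df[labels == 1]
    # Rank-based mapping of every attribute onto an approximately standard normal scale.
    transformer = QuantileTransformer(
        n_quantiles=min(1000, len(kept)),
        output_distribution="normal",
        random_state=seed,
    )
    scaled = pd.DataFrame(
        transformer.fit_transform(kept), columns=kept.columns, index=kept.index
    )
    return scaled, flagged


# Synthetic stand-in for the biomarker table (values are random, not real data).
rng = np.random.default_rng(0)
raw = pd.DataFrame(
    rng.lognormal(size=(200, 5)), columns=[f"ratio_{i}" for i in range(5)]
)
raw["sparse_ratio"] = np.nan
raw.loc[:9, "sparse_ratio"] = 1.0  # attribute with far more than 80% missing values
raw.iloc[20, 0] = np.nan  # an isolated gap, filled by the median
raw = pd.concat([raw, raw.iloc[:5]], ignore_index=True)  # five duplicated records

scaled, flagged = preprocess(raw)
print(scaled.shape, len(flagged))
```

Returning the flagged rows instead of discarding them silently leaves room for the expert review that the workflow calls for before any sample is removed.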

Exploratory data analysis (EDA)

EDA was performed to examine the dataset’s statistical properties and reduce dimensionality:

  • Correlation matrix: A table that displays the collinearity between variables, with values ranging from −1 to 1. The color of each cell at the intersection of two variables visually indicates the degree of collinearity between their measurements.

  • Multidimensional scaling (MDS): This method measures the degree of similarity or dissimilarity between multivariate structures. The correlation between the variables is used as the basis for calculating the distance matrix; the greater the distance, the greater the dissimilarity. Highly correlated variables are similar, lie at short distances from one another, and therefore plot as groups. MDS facilitates the visualization of the closest (grouped) attributes in two and three dimensions.

  • Principal component analysis (PCA): PCA transforms multivariate distributions into ‘principal components’ (PCs), which are orthogonal to each other in a system of reduced dimensionality10. These components are uncorrelated (r = 0) and reveal the relative contribution of each transformed variable to the multivariate system. PCA is a linear spectral decomposition technique that is particularly attractive for situations involving many variables.

  • K-means clustering: This method groups similar samples into K distinct clusters on the basis of their features. The number of clusters is defined beforehand, and each data point is assigned to the closest centroid. The center of each cluster (centroid) is the arithmetic mean of all the points belonging to it. The algorithm iterates, moving the centroids, until the points no longer change cluster and the convergence criteria are met.

This process ensures that only the most informative and nonredundant attributes are retained for machine learning.
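The redundancy screening described above can be sketched as follows. This is an illustration under assumptions rather than the authors' implementation. The helper names are hypothetical, the distance is taken as 1 − |r| (the text only states that correlation is the basis of the distance matrix), and the defaults reuse the thresholds reported in the Results (a correlation of 95.45%, written as 0.9545, and a map distance of 0.05).

```python
import numpy as np
import pandas as pd
from sklearn.manifold import MDS


def redundant_by_correlation(df, threshold=0.9545):
    """Return one column from every pair whose |r| exceeds the threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so that each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]


def mds_map(df, seed=0):
    """Embed the variables in 2D, using 1 - |r| as the dissimilarity (assumed)."""
    dist = np.clip(1.0 - df.corr().abs().to_numpy(), 0.0, None)
    np.fill_diagonal(dist, 0.0)
    mds = MDS(n_components=2, dissimilarity="precomputed", random_state=seed)
    return pd.DataFrame(mds.fit_transform(dist), index=df.columns, columns=["x", "y"])


def close_pairs(coords, max_dist=0.05):
    """List variable pairs that plot within max_dist of each other on the map."""
    xy, names = coords.to_numpy(), list(coords.index)
    return [
        (names[i], names[j])
        for i in range(len(names))
        for j in range(i + 1, len(names))
        if np.linalg.norm(xy[i] - xy[j]) <= max_dist
    ]


# Synthetic example: 'b' is nearly a copy of 'a', while 'c' is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=300)
demo = pd.DataFrame(
    {"a": a, "b": a + rng.normal(scale=0.01, size=300), "c": rng.normal(size=300)}
)

to_drop = redundant_by_correlation(demo)
coords = mds_map(demo)
print(to_drop, close_pairs(coords))
```

Dropping the later member of each correlated pair is an arbitrary convention. In practice the choice of which attribute to keep would presumably rest on geochemical judgment.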

Machine learning

The machine learning component of this study was implemented in Python, using several specialized libraries for data analysis and model development:

  • Scikit-learn: Utilized for implementing various machine learning algorithms and evaluation metrics.

  • Pandas: Employed for data manipulation and preprocessing tasks.

  • NumPy: Used for numerical computations and array operations.

  • Matplotlib: Applied for data visualization and plotting.

Seven machine learning algorithms were evaluated for their ability to classify oil spill samples:

  1. Decision tree (DT)

  2. Random forest (RF)

  3. Gaussian naive Bayes (GNB)

  4. K-nearest neighbors (KNN)

  5. Artificial neural network (ANN)

  6. Support vector machine (SVM)

  7. Linear discriminant analysis (LDA)

The dataset was partitioned into 80% for training and 20% for testing to ensure robust model evaluation. Each algorithm was trained to predict the field of origin on the basis of geochemical attributes. To assess the performance of these models comprehensively, several metrics were calculated:

  • Accuracy: The proportion of correct predictions over the total predictions made.

  • Precision: The ratio of true positive predictions to total positive predictions, reflecting the model’s exactness.

  • Recall (Sensitivity): The ratio of true positive predictions to all actual positive cases, reflecting the model’s ability to identify positive instances.

  • F1-Score: The harmonic mean of precision and recall, which serves as a single metric that balances both measures.
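A compact way to train the seven classifiers on an 80/20 split and compute the four metrics is sketched below. The data are synthetic (`make_classification`), every model uses scikit-learn defaults apart from the settings shown, and macro averaging across classes is an assumption, since the averaging scheme and hyperparameters are not stated in the text.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic multiclass data standing in for normalized biomarker ratios.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=10,
    n_classes=5, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(n_estimators=200, random_state=0),
    "GNB": GaussianNB(),
    "KNN": KNeighborsClassifier(),
    "ANN": MLPClassifier(max_iter=1000, random_state=0),
    "SVM": SVC(),
    "LDA": LinearDiscriminantAnalysis(),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Macro averaging weights every class equally (an assumption of this sketch).
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_test, y_pred, average="macro", zero_division=0
    )
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Print the models from most to least accurate.
for name, scores in sorted(results.items(), key=lambda kv: -kv[1]["accuracy"]):
    print(name, {metric: round(value, 3) for metric, value in scores.items()})
```

With 45 fields of unequal size, macro-averaged scores can differ noticeably from accuracy, which is one reason to report them together.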

For each machine learning algorithm, a classification model was developed by optimizing the combination of parameters and selecting the most predictive attributes. Feature importance analysis was conducted, particularly for ensemble methods such as random forest, to identify which attributes significantly influenced the model’s predictions. Visual representations, such as feature importance plots, were generated to illustrate the impact of each feature, where longer bars denote greater importance.

Additionally, confusion matrices were constructed to visualize the performance of each model, displaying the distributions of true positives, true negatives, false positives, and false negatives. These matrices provide insights into the model’s ability to correctly classify samples across different categories.
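Feature importances and the confusion matrix for a random forest can be obtained as in the sketch below. The data and the `ratio_*` feature names are synthetic placeholders, and the per-class rate computed at the end is simply the diagonal of the confusion matrix divided by its row sums.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic three-class problem with placeholder feature names.
X, y = make_classification(
    n_samples=600, n_features=12, n_informative=6,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
feature_names = [f"ratio_{i}" for i in range(X.shape[1])]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Impurity-based importances, sorted so the most influential attribute comes first.
importances = pd.Series(rf.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)

# Rows are true classes and columns are predicted classes.
cm = confusion_matrix(y_test, rf.predict(X_test), labels=rf.classes_)

# Fraction of each true class that was classified correctly.
per_class_recall = cm.diagonal() / cm.sum(axis=1)

print(importances.head())
print(cm)
print(per_class_recall.round(2))
```

Calling `importances.plot.barh()` (which requires Matplotlib) would draw the bar chart in which longer bars denote greater importance.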

Application of new samples for model validation

To validate the methodology, three independent oil samples unrelated to the training dataset were analyzed:

  • Sample A: A natural oil seep sample collected from the sea floor as a result of leakage from subsurface geologic features.

  • Sample B: An oil spill sample.

  • Sample C: An oil spill sample.

These samples were processed using the same attributes and normalization techniques as the training data, so that the best-performing machine learning model could predict their fields of origin.
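One way to guarantee that new samples receive exactly the same transformation as the training data is to wrap the normalization and the classifier in a single scikit-learn pipeline, as sketched below. The field labels `F0` to `F3`, the synthetic data, and the use of `QuantileTransformer` for the normal score step are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import QuantileTransformer

# Synthetic training set: four hypothetical fields described by eight ratios.
X, y_idx = make_blobs(n_samples=400, centers=4, n_features=8, random_state=0)
fields = np.array(["F0", "F1", "F2", "F3"])
y = fields[y_idx]

# The transformer is fitted on the training data only and reused for new samples.
model = make_pipeline(
    QuantileTransformer(n_quantiles=100, output_distribution="normal", random_state=0),
    RandomForestClassifier(n_estimators=200, random_state=0),
)
model.fit(X, y)

# Three "new" samples, simulated here by perturbing known training rows.
rng = np.random.default_rng(1)
new_samples = X[:3] + rng.normal(scale=0.1, size=(3, 8))

# One row per sample, one column per candidate field; each row sums to 1.
proba = pd.DataFrame(
    model.predict_proba(new_samples), columns=model.classes_, index=["A", "B", "C"]
)

# Report the two most probable fields for every sample.
for name, row in proba.iterrows():
    best = row.sort_values(ascending=False).head(2)
    print(name, {field: round(float(p), 2) for field, p in best.items()})
```

For a random forest, these probabilities are the class probabilities averaged over the trees, so a split such as 55% versus 24% indicates disagreement within the ensemble rather than a calibrated likelihood.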

Results

This section presents the findings of the study, covering data preprocessing, exploratory analysis, machine learning model evaluation, and validation of the proposed methodology via independent samples. Each step is detailed to illustrate the robustness of the workflow and its applicability in identifying oil spill sources. The results demonstrate the success of integrating geochemical data with machine learning techniques, highlighting the accuracy of predictive models and their ability to generalize beyond the training dataset.

Data preprocessing

This stage included handling missing values, eliminating duplicate records and removing outliers, in addition to normalization.

Outliers detected via the isolation forest algorithm (Fig. 3A) were associated with anomalous compositional data, likely caused by contamination or sample misclassification.

Fig. 3 Outlier analysis and normalization: (A) isolation forest, (B) sample analysis for Well X, and (C) frequency histograms before and after normalization, with the probability–probability (p–p) plot.

Notably, two samples from Well X were initially flagged as outliers. Upon expert review, these samples revealed evidence of an independent accumulation linked to a secondary petroleum system in field C004. This finding underscores the importance of combining automated techniques with expert validation. Figure 3B shows, on the left side, that Well X was initially thought to contain a single oil accumulation, with two samples of differing compositions considered outliers. On the right side, however, the final interpretation of Well X indicates the presence of two distinct oil accumulations separated by a water interval.

Missing data were imputed using the median value. In one column, ‘25NOR_Hopane’, more than 80% of the values were missing, so this feature was removed. In addition, 25 rows were classified as outliers and eliminated from the dataset, as were 38 entries of duplicate samples.

Figure 3C illustrates the normal score transformation for one attribute, showing the frequency histograms before and after the transformation together with the probability–probability (p–p) plot. Before normalization, the dataset contained an isolated frequency peak, which could indicate inconsistencies in the data or lead to misinterpretation by the algorithm.

Preprocessing reduced the dataset from the initial 2200 oil samples and 75 attributes to 2137 samples and 74 attributes.

Exploratory data analysis (EDA)

EDA was conducted to reduce dimensionality and improve the predictive ability of the dataset. The key methods used included the following:

  • Correlation matrix analysis: Seven highly correlated variables (r > 0.9545), such as ‘27_29BBRS218’ and ‘TRIC_HOP’, were identified and removed (Fig. 4A).

  • Multidimensional scaling (MDS): A distance less than or equal to 0.05 on the map was set as the threshold for characterizing strong similarity (Fig. 4B), leading to the removal of two redundant parameters, ‘TS_TS + TM’ and ‘DIAH_H30’.

  • Principal component analysis (PCA): This method reduced the dimensionality of the dataset while preserving variance.

  • K-means clustering: Data were separated into clusters to optimize feature selection.
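The PCA and K-means steps in the list above can be illustrated as follows. The data are synthetic, and the numbers of components and clusters are arbitrary choices for the example, since the values used in the study are not reported here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Synthetic stand-in: two groups of samples described by six standardized ratios.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (100, 6)), rng.normal(2, 1, (100, 6))])

# Project onto three orthogonal principal components.
pca = PCA(n_components=3)
scores = pca.fit_transform(X)
print("explained variance ratio:", pca.explained_variance_ratio_.round(3))

# The component scores are mutually uncorrelated (off-diagonal r is ~0).
pc_corr = np.corrcoef(scores, rowvar=False)
off_diagonal = pc_corr[~np.eye(3, dtype=bool)]

# K-means with a predefined number of clusters, run on the PC scores.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scores)

# At convergence each centroid is the arithmetic mean of its assigned points.
centroid_means = np.array(
    [scores[km.labels_ == k].mean(axis=0) for k in range(km.n_clusters)]
)
print(np.abs(off_diagonal).max(), np.bincount(km.labels_))
```

Running K-means on the component scores instead of the raw attributes is a design choice of this sketch; clustering the normalized attributes directly would be equally consistent with the description.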

Fig. 4 (A) Correlation matrix and (B) multidimensional scaling.

These techniques reduced the number of attributes from 74 to 65. For the subsequent classification stage, the three categorical parameters ‘Sample’, ‘Well’, and ‘Field’ were excluded from the predictors. Therefore, 62 attributes were selected for the 2137 oil samples, retaining only the most relevant features for machine learning.
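The sample and attribute counts reported so far can be checked with simple bookkeeping. The short script below only restates numbers given in the text.

```python
samples, attributes = 2200, 75

# Preprocessing: 25 outlier rows and 38 duplicate entries were removed,
# and one attribute ('25NOR_Hopane') was dropped for missing values.
samples -= 25 + 38
attributes -= 1
assert (samples, attributes) == (2137, 74)

# EDA: seven attributes removed via the correlation matrix and two via MDS.
attributes -= 7 + 2
assert attributes == 65

# Classification: 'Sample', 'Well', and 'Field' are identifiers, not predictors.
predictors = attributes - 3
assert predictors == 62

print(samples, predictors)
```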

The optimum subset was as follows: ‘P27BBS218’, ‘P28BBS218’, ‘P29BBS218’, ‘27_29BBS218’, ‘28_29BBS218’, ‘28_29BBRS218’, ‘P27AAAR’, ‘P28AAAR’, ‘P29AAAR’, ‘DIA_C27AA’, ‘GAM_H30’, ‘H35_H34’, ‘H29_H30’, ‘H28_H29’, ‘TET24_H30’, ‘TET24_26TRI’, ‘23_24TRI’, ‘19_23TRI’, ‘STER_HOP’, ‘HOP_STER’, ‘TRIC_STER’, ‘TS_TM’, ‘NORNEO_H29’, ‘21 + 22_STER’, ‘S_S + R’, ‘C29BBS_C29R’, ‘21_23TRI’, ‘24_25TRI’, ‘26_25TRI’, ‘26_28TRI’, ‘DITERP_H30’, ‘29_30H’, ‘H_HM2930’, ‘TR23_H30’, ‘NOR25H_H29’, ‘NOR25H_H30’, ‘H28_TR23’, ‘GAM_TR23’, ‘H29_C29TS’, ‘H31S_H31’, ‘H32S_H32’, ‘H33S_H33’, ‘PH31’, ‘PH32’, ‘PH33’, ‘PH34’, ‘PH35’, ‘C29BBR_C29R’, ‘C29BB_C29’, ‘P28aBBReS218’, ‘P29aBBReS218’, ‘P27ST’, ‘P28ST’, ‘P29ST’, ‘H30_C27AA’, ‘DIA30_C27AA’, ‘TPP’, ‘TNH_TNH + 25NOR’, ‘BNH_BNH + 25NOR’, ‘BNH_BNH + C29HOP’, ‘NORC29Ts_NORC29Ts + TsC29’, and ‘25NOR_25NOR + C29HOP’.

This refinement improved model performance by minimizing redundancy and maximizing interpretability.

Machine learning classification

In this study, we evaluated seven machine learning algorithms to classify oil spill samples: DT, RF, GNB, KNN, ANN, SVM, and LDA. The dataset was randomly divided into 80% for training (1,709 samples) and 20% for testing (428 samples) via the train_test_split function from Python’s scikit-learn library, ensuring an unbiased selection process.
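The reported partition sizes follow from how `train_test_split` rounds a fractional test size: the test count is rounded up and the remainder goes to training. A minimal check on an index array of the same length (the `random_state` value is an arbitrary choice for reproducibility):

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 2137
idx_train, idx_test = train_test_split(
    np.arange(n_samples), test_size=0.2, random_state=0
)

# 0.2 * 2137 = 427.4, which is rounded up to 428 test samples.
print(len(idx_train), len(idx_test))
```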

Accuracy was chosen as the primary performance metric to provide an initial overview of the models’ effectiveness. To give a fuller picture of the classifiers’ capabilities, we also calculated precision, recall, and the F1 score. These complementary metrics are particularly useful for monitoring performance across different classes and identifying potential model biases.

The RF model achieved the highest overall accuracy of 91%, outperforming other methods such as ANN (88%), LDA (87%), KNN (86%), GNB (84%), and DT (82%). The SVM exhibited the lowest accuracy at 79%. Figure 5A presents the relative importance of features in the RF model, while Fig. 5B shows the confusion matrix of the RF model, validating its reliability for geochemical classification tasks.

Fig. 5 Random forest classification: (A) feature importance and (B) confusion matrix.

The RF algorithm correctly classified 100% of the samples in 31 of the 45 fields. In three other fields, more than 90% of the samples were accurately classified.

Application of new samples for model validation

To evaluate the generalizability of the proposed methodology, three independent oil samples (two spills and one natural seep) were analyzed via the trained RF model (Fig. 6):

  • Sample A (Natural Oil Seep): Predicted as originating from field C051 with a probability of 55%, followed by C008 (24%).

  • Sample B (Oil Spill): Correctly classified as belonging to field C045 with 86% probability.

  • Sample C (Oil Spill): Also associated with field C045, with a high confidence of 91%.

Fig. 6 Prediction probabilities for Sample A (natural oil seep), Sample B (oil spill), and Sample C (oil spill).

These results demonstrate the robustness of the RF model in accurately predicting oil origins, even for samples not included in the training dataset.

Discussion

The findings of this study demonstrate the potential of integrating geochemical analysis with machine learning techniques to improve oil spill identification. This discussion evaluates the implications of the results, compares the methodology to existing approaches, and highlights the study’s novelty and limitations.

Performance of the proposed methodology

The RF model outperformed all other machine learning algorithms, achieving an overall accuracy of 91%. This high performance reflects the effectiveness of the data preprocessing and feature selection strategies, which reduce redundancy and improve predictive power. In comparison, traditional statistical methods often fail to handle complex, high-dimensional datasets as effectively, particularly when collinearity between variables is present.

The validation of the RF model using independent samples further underscores its robustness. The accurate classification of Samples B and C (oil spills) as originating from field C045 and the plausible prediction for Sample A (natural seep) demonstrate the model’s ability to generalize beyond the training data (Fig. 6). These results confirm the reliability of the proposed methodology in real-world scenarios.

Comparative advantage of the approach

Traditional geochemical approaches rely heavily on manual interpretation, which is time-consuming and prone to subjectivity. By integrating machine learning, this study reduces human bias and accelerates the analysis process, transforming tasks that once required hours or days into minutes. The automation of diagnostic workflows provides a scalable solution for large-scale geochemical datasets, making this methodology particularly advantageous for high-production areas such as the Santos Basin.

Furthermore, the use of multivariate techniques such as PCA and MDS allows the methodology to focus on the most relevant geochemical attributes. Traditional approaches, by contrast, often include redundant or irrelevant variables, potentially skewing results.

Geological implications

The findings of this study align with the known geological characteristics of the Santos Basin. The ability to differentiate between oil accumulations and secondary petroleum systems, as seen in the case of outliers in Well X, highlights the utility of combining machine learning methods with expert-guided validation. The accurate classification of oil spills also contributes to environmental monitoring efforts, providing actionable insights for mitigating the impact of offshore incidents.

The results also highlight the compositional similarity of oils from the Santos Basin, reflecting the geological simplicity of its reservoirs. This contrasts with the neighboring Campos Basin, where more complex geological conditions create greater compositional diversity. These insights emphasize the importance of tailoring geochemical and machine learning approaches to the specific characteristics of different basins.

Study novelty and contributions

This study introduces a novel workflow that combines expert knowledge with machine learning techniques to increase the reliability and efficiency of geochemical analysis. By validating the methodology with independent samples, the study bridges the gap between theoretical modeling and practical application. The proposed approach represents a significant advancement over traditional methods by offering the following:

  • Greater accuracy in oil source identification.

  • A scalable framework for geochemical analysis.

  • Reduced subjectivity through automation and standardization.

These contributions highlight the transformative potential of artificial intelligence in geochemistry, particularly for exploratory and forensic applications.

Limitations and future directions

While the methodology demonstrated high accuracy and robustness, several limitations must be addressed. This study focused on a single dataset from the Santos Basin, limiting the generalizability of the findings to other regions or basins with different geological characteristics. Expanding the dataset to include more diverse samples would improve the model’s adaptability.

Additionally, while machine learning reduces subjectivity, the success of the methodology still relies on the quality of the input data and expert validation. Future studies could explore the integration of unsupervised learning techniques to further automate the discovery of patterns in geochemical datasets.

Conclusion

This study successfully integrates geochemical analysis with machine learning techniques to address the critical challenge of identifying oil spill origins in offshore production areas. By leveraging advanced statistical techniques and machine learning algorithms, the proposed methodology offers a reliable and efficient approach for oil source identification, significantly reducing the subjectivity and time constraints associated with traditional methods.

From an initial dataset of 2200 samples and 75 attributes, rigorous preprocessing and exploratory analysis reduced the data to 2137 samples and 62 predictive attributes. Among the seven machine learning models tested, RF demonstrated the highest accuracy (91%), validating its suitability for geochemical classification tasks. The methodology was validated with three independent samples—two spills and one natural seep—demonstrating its robustness and practical applicability in real-world scenarios. Notably, the predictions aligned with geological contexts, reinforcing the model’s reliability.

The integration of machine learning methods into geochemical workflows represents a transformative advancement, allowing for the rapid, objective, and accurate diagnosis of oil origins. By reducing the analysis time from hours or days to mere minutes, the methodology provides a scalable solution for both exploratory and forensic applications. Its implementation is particularly relevant for high-production areas such as the Santos Basin, where accurate and efficient spill diagnostics are critical for environmental monitoring and resource management.

This study also highlights the importance of expert-guided validation in complementing automated approaches, ensuring the accuracy and interpretability of results. The successful identification of secondary petroleum systems, as evidenced in Well X, underscores the utility of combining machine learning with domain expertise to derive actionable geological insights.

While the methodology has demonstrated high accuracy and reliability, further research is needed to generalize the approach to other basins with varying geological characteristics. Expanding the dataset to include more diverse samples and incorporating unsupervised learning techniques could further enhance the model’s adaptability and robustness.

In conclusion, this study presents a novel, scalable, and efficient framework for oil spill diagnostics that combines the strengths of geochemical analysis and machine learning. It offers significant potential for transforming geochemical workflows, providing value for both environmental and exploratory geochemistry.