Background & Summary

Since the Industrial Revolution, human activities such as large-scale deforestation, land-use and land-cover changes, cement production, and burning fossil fuels have significantly increased atmospheric carbon dioxide (CO2) levels. Over half of these human-caused CO2 emissions are offset by natural sinks that absorb CO2 on land and in oceans. According to the Global Carbon Budget (2024), the ocean alone absorbed about 26% of the total CO2 emissions during the decade from 2014 to 20231.

The Antarctic region plays a critical role in the global climate system and the carbon cycle. Its cold, nutrient-rich waters are essential areas for the absorption and storage of atmospheric CO2. The dynamics of its coastal zones, influenced by changes in sea ice, glacial melt, and complex ocean-atmosphere interactions, have significant effects on ocean biogeochemistry2,3. Additionally, the Southern Ocean surrounding Antarctica is one of the least sampled and fastest-changing marine environments on Earth4,5.

Beyond its role in the carbon cycle, the Southern Ocean is also the most significant heat sink in the world’s oceans. It absorbs approximately 60–90% of the excess heat stored in the global ocean since the mid-20th century, and around 83 ± 33% during the historical period of the climate models6. This remarkable capacity results from its unique circulation patterns, driven by the Antarctic Circumpolar Current and strong westerly winds, which bring cold, deep waters to the surface. These waters efficiently transfer heat to the atmosphere before sinking into the ocean’s interior. By redistributing heat northward and ventilating the deep ocean, the Southern Ocean plays a key role in regulating Earth’s energy balance and mitigating the full impacts of surface warming7.

Better observational data from this region are vital for understanding local processes and their broader climate feedbacks, especially in the face of accelerating global change8,9.

Deception Island, situated in the South Shetland Islands north of the Antarctic Peninsula (Fig. 1A), is located in the maritime Antarctic between the South American and Antarctic continents, making it one of the most climate-sensitive regions10. Its unique setting and dynamic environment have turned it into an exceptional natural laboratory for studying climate change, seismicity, geology, flora, and fauna, as well as for testing experimental instrumentation11,12,13,14,15. Over recent decades, these activities have helped build a comprehensive scientific database, thereby positioning the island among the most extensively studied areas in Antarctica, particularly in volcanology, seismology, and terrestrial biodiversity16.

Fig. 1

(A) Location of Deception Island in the South Shetlands Archipelago, Antarctica, and (B) detail of the location of the AEMET meteorological station at the Gabriel de Castilla Spanish Base (GdC; yellow dot), the seismic stations (grey triangles), and the ship track in Deception Island, colored by day of the year 2025 (color bar).

Geologically, Deception Island is the exposed rim of a submerged stratovolcano that ranks among the most active volcanoes in the region17. More than half of its surface (57%) is covered with ice, and various geological and geothermal processes co-occur, including hydrothermal activity, fumarolic emissions, ground deformation cycles and highly variable seismic activity18. Historical records document several eruptions over the past 200 years, with the most recent in 1970, which reshaped and expanded parts of the island19.

At the island’s center lies Port Foster (PF), a semi-enclosed bay formed by the flooding of a volcanic caldera (Fig. 1B). The basin covers about 38 km², has a perimeter of nearly 30 km, and connects to the Bransfield Strait via Neptune’s Bellows, a narrow, shallow channel averaging 22 meters in depth. The deepest parts of PF exceed 160 meters. Two permanent Antarctic research stations are located on the island: the Spanish Antarctic Base Gabriel de Castilla and the Argentine Antarctic Base Decepción.

Tides within PF are mixed, with predominantly semidiurnal components, and follow a mesotidal regime, characterized by ranges exceeding 2 m, due to the influence of astronomical forces. Average wind speeds are around 5 m s−1, but gusts can reach 25 m s−1, with both prevailing and extreme winds primarily coming from the southwest. Water circulation results from the combined effects of tides, weather variability, and thermal forcing, creating fjord-like patterns that arise from the contrast between the deep central basin and the shallow sill at Neptune’s Bellows20.

Several shallow coastal sites in PF, such as Pendulum Cove, Fumarole Bay, Whalers Bay, and Murature Beach, show intense geothermal activity. Fumaroles in these areas release gases from waters heated by convective flows from a subsurface magma chamber12. Emissions mainly consist of CO2 (75–90%) with smaller amounts of CH4, and interstitial water temperatures can reach nearly 100 °C. This geothermal activity creates strong, localized biogeochemical anomalies, including elevated levels of trace metals (Fe, Ni, Co and V) and increased greenhouse gas fluxes14.

This dataset compiles high-resolution measurements taken in February 2025 from PF. It combines surface seawater pCO2 levels with oceanographic variables (temperature, salinity), meteorological variables (wind speed, solar radiation, air temperature), and geophysical variables (seismic events, tidal elevation). Spatio-temporal metadata, such as geographic coordinates and date and time, are also included.

To complement direct observations, a modeling framework based on deep learning was applied to estimate seawater pCO2 using a Bidirectional Long Short-Term Memory (Bi-LSTM) neural network.

The resulting dataset combines direct observations and model-improved estimates of surface pCO2, offering a unique and robust resource for a wide range of scientific applications. It supports biogeochemical assessments, air-sea CO2 flux calculations, and regional climate analyses in polar coastal systems. Furthermore, the modelling approach, based on explainable deep learning, can serve as a valuable tool for improving the representation of CO2 dynamics in coupled ocean-atmosphere models, thereby contributing to more accurate climate projections. This dataset also enables the validation of satellite-based observations, the benchmarking of machine learning methodologies, and the extension of similar modeling frameworks to other data-scarce polar or volcanic marine environments.

Methods

Data acquisition

Environmental data were collected from February 11th to 18th, 2025, from coordinated marine and land sources to document high-resolution geophysical, chemical, and biological dynamics in the PF area of Deception Island (Fig. 1B).

Seawater was continuously sampled through a flow-through system onboard the Spanish R/V “Sarmiento de Gamboa”; the ship track, colored by day of the year 2025, is shown in Fig. 1B. Surface pCO2 in µatm (Fig. 2A) was measured using a ship-mounted CO2-Pro™ CV pCO2 Sensor (Pro-Oceanus Systems Inc.) that employs non-dispersive infrared detection. The sensor has an accuracy of ±0.5%.

Fig. 2

(A) Seawater CO₂ concentration (μatm), (B) Seawater temperature (°C), (C) Salinity and (D) Solar radiation (W m−2) over the day of the year of the ship track (color bar); (E) Tidal elevation (m); (F) Air temperature (°C); (G) Wind speed (m s−1); and (H) Number of detected seismic events.

Sea surface temperature, salinity, and solar radiation data were also collected onboard R/V “Sarmiento de Gamboa” by the Marine Technology Unit (UTM-CSIC). Seawater temperature and salinity (Fig. 2B,C) were measured with an SBE21 thermosalinograph sensor (Sea-Bird Scientific Inc.). Its accuracy is ±0.01 °C for temperature and ±0.001 S m−1 for conductivity, and its precision is ±0.01 °C and 0.0001 S m−1, respectively. Solar radiation in W m−2 (Fig. 2D) was recorded using the onboard multispectral radiometer PRR-800 sensor (Biospherical). All shipboard measurements were taken at a 1-second resolution, georeferenced, and synchronized via GPS.

Tidal elevation (Fig. 2E) was simulated using the Delft3D-FLOW model, with water level boundary conditions defined through astronomical forcing based on the main tidal constituents in the area.

Seismic data were obtained from the Spanish National Geographic Institute (IGN) seismic catalog, which draws on a network of broadband seismic stations located within the bay or on the inner coast of the island (Fig. 1B) and continuously monitors local volcanic activity with high sensitivity. In this study, we focused on two types of seismic signals: Long Period (LP) events and spasmodic tremors (TR). LP events are the dominant type of seismicity at Deception Island volcano. They are characterized by emergent arrivals and a spindle-shaped envelope, with dominant frequencies below 5 Hz and durations of only a few seconds18. Spasmodic tremors share certain traits with hybrid events, such as the presence of identifiable P- and occasionally S-wave arrivals, followed by a sustained low-frequency signal that contains energy at frequencies up to 8 Hz21. LP events and TR share characteristics that indicate similarities in their source regions and processes22. Thus, TR and LP events are assumed to be different manifestations of the same source process21. TR differ from isolated LP events in their longer duration and more complex spectral content, making them essential indicators of subsurface volcanic activity. Both signals were combined for inclusion in the model (Fig. 2H).

Air temperature and wind speed data (Fig. 2F,G) were collected from the meteorological station WMO-89070 operated by the Agencia Estatal de Meteorología (AEMET), Spain, at the Gabriel de Castilla Spanish Antarctic Base on Deception Island (GdC in Fig. 1B). Wind speed and direction were measured every 10 minutes at the reference height of 10 m above ground.

All time series were quality-checked using instrument-specific thresholds and were post-processed to ensure consistency. The dataset includes a total of 30,404 data points, combining features and targets. The data were synchronized to UTC and uniformly resampled at a 1-minute interval.

Data Records

The full dataset is publicly available23 on the figshare repository (https://figshare.com/) and can be accessed via the following URL: https://doi.org/10.6084/m9.figshare.30112708.

Data were collected during the DICHOSO Project (https://doi.org/10.20351/29HE20240312), an interdisciplinary field campaign conducted in and around Deception Island (Antarctica) in February 2025. The dataset integrates georeferenced time-series measurements from oceanographic, meteorological, and geophysical sources, and is organized as a single CSV file in tabular format. A brief overview of the dataset structure and associated sub-datasets is provided in Table 1. Detailed descriptions of the variables included in the main file, including units, data types, and interpretation notes, are presented in Table 2.

Table 1 Overview of the data files and sub-datasets available in the figshare repository.
Table 2 Description of variables included in the primary dataset Deception_2025_CO2_ocean_meteo_seismic.csv.

The primary data file contains high-frequency (1-minute resolution) observations recorded during the vessel’s navigation through PF and from the inland seismic and meteorological stations. Each row corresponds to a unique timestamp and geographic position derived from the vessel’s GPS track. However, some variables, such as wind speed, air temperature, and seismic activity, originate from fixed stations on Deception Island. The geographic location of these stations is not included as a separate field in the dataset, but their measurements are temporally aligned and integrated into the unified data structure. Most variables were initially collected at 1-minute intervals; however, wind speed and air temperature were measured every 10 minutes and then replicated across each 1-minute time step within the corresponding 10-minute interval to ensure temporal consistency across all variables. The dataset includes measurements such as seawater temperature, salinity, surface pCO2, tidal elevation, solar radiation, air temperature, wind speed, and binary indicators of seismic activity. Additionally, a derived variable captures the number of seismic events within a 10-minute moving window, calculated as the sum of binary event occurrences (see Table 2 for details). Missing values are denoted by blank fields.
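The temporal alignment described above can be illustrated with a minimal pandas sketch. It is not the authors' code: the function name, the column name seismic_events_10min, and the choice of a trailing (rather than centered) 10-minute window are assumptions made for illustration only.

```python
import pandas as pd


def align_to_minute(met_10min: pd.DataFrame, seismic_flag: pd.Series) -> pd.DataFrame:
    """Place 10-minute station data on a 1-minute grid and count seismic events.

    met_10min    : station variables indexed by 10-minute timestamps (hypothetical input)
    seismic_flag : binary event indicator indexed by 1-minute timestamps (hypothetical input)
    """
    # Replicate each 10-minute value across the 1-minute steps that follow it.
    met_1min = met_10min.reindex(seismic_flag.index, method="ffill")
    # Assumption: trailing window covering the current and previous nine minutes.
    event_count = seismic_flag.rolling(window=10, min_periods=1).sum()
    return met_1min.assign(seismic_events_10min=event_count)
```

Users who need a differently defined window (for example, centered) can recompute the count from the binary indicator provided in the main file.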

Additionally, seismic event metadata can be accessed through the Spanish Polar Data Center seismic catalog (https://doi.org/10.7419/162.03.2022; last accessed Nov 2025). The raw SBE21 and meteorological data collected onboard the R/V Sarmiento de Gamboa, together with the meteorological data from the AEMET GdC station, are archived in the CSIC Marine Technology Unit repository, which can be accessed through the Centro Nacional de Datos Polares website (https://cndp.utm.csic.es/geonetwork/srv/eng/catalog.search#/home).

Technical Validation

To assess the potential of the dataset for predicting surface seawater pCO2 concentrations, we employed a deep learning approach using a Bi-LSTM architecture. This model was trained and validated with the complete set of multivariate features, including oceanographic data (such as seawater temperature, salinity, and tidal elevation), meteorological data (such as wind speed, solar radiation, and air temperature), seismic events (including the number of LP and TR events), and location descriptors (latitude and longitude). Seismic signals were preprocessed to produce 10-minute rolling sums of event occurrence, enabling the model to incorporate information on previous subsurface activity. Although the data are arranged chronologically, this is not a typical time series task; instead, the model processes 3D windows of data (samples × time × variables), where time reflects sampling order but not continuous temporal flow.

To prevent data leakage and ensure a reliable evaluation, the dataset was divided using a 5-fold GroupKFold cross-validation strategy, with groups defined by scaled latitude values. This method ensures that geographically similar samples remain within the same fold and that model generalization is evaluated across different regions.
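A minimal sketch of such a grouped split, using scikit-learn's GroupKFold, is given below. The text does not state how continuous scaled latitude values are turned into discrete groups, so the min-max scaling and the binning into n_bins latitude bands are assumptions, as is the function name.

```python
import numpy as np
from sklearn.model_selection import GroupKFold


def latitude_group_folds(latitude, n_splits=5, n_bins=20):
    """Return latitude-band group labels and GroupKFold (train, validation) indices."""
    lat = np.asarray(latitude, dtype=float)
    span = lat.max() - lat.min()
    if span == 0:
        raise ValueError("latitude must vary to define groups")
    # Assumption: min-max scale latitude to [0, 1], then cut into equal-width bands.
    scaled = (lat - lat.min()) / span
    groups = np.minimum((scaled * n_bins).astype(int), n_bins - 1)
    splitter = GroupKFold(n_splits=n_splits)
    folds = list(splitter.split(lat.reshape(-1, 1), groups=groups))
    return groups, folds
```

Because every latitude band is assigned to exactly one validation fold, no band contributes samples to both the training and validation sets of the same fold.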

Input features (Fig. 2A–H) were normalized within each fold using StandardScaler, which applies z-score normalization. It transforms the data so that each feature has a mean of 0 and a standard deviation of 1. Scalers were fitted exclusively on the training data and applied separately to both training and validation sets. This fold-specific normalization prevents any information from the validation set from leaking into the training process.
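The fold-specific normalization can be sketched as follows, assuming the features are still in 2D form (rows × variables) when scaled; the helper name scale_fold is hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler


def scale_fold(X_train, X_val):
    """Fit z-score parameters on the training rows only and apply them to both sets."""
    scaler = StandardScaler().fit(X_train)  # mean and std come from training data only
    return scaler.transform(X_train), scaler.transform(X_val), scaler
```

The validation set is therefore transformed with the training mean and standard deviation, so its own scaled mean will generally differ from 0.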

The Bi-LSTM model is implemented in TensorFlow/Keras24 with two stacked Bidirectional LSTM layers featuring dropout regularization and L2 kernel regularization to prevent overfitting, followed by dense layers to produce the final pCO2 estimate. All input sequences were generated using a sliding window of 10 time steps, i.e., 10 consecutive records of the input variables. A fixed random seed (42) was applied across all experiments to ensure reproducibility of data partitioning and weight initialization.
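The construction of the 3D input (samples × time × variables) from a sliding window can be sketched with NumPy as below. The alignment of each target value with the last row of its window is an assumption, since the text does not specify it, and the function name is hypothetical.

```python
import numpy as np


def make_windows(features, target, window=10):
    """Stack overlapping windows of consecutive rows into a 3D array."""
    features = np.asarray(features, dtype=float)
    target = np.asarray(target, dtype=float)
    n_windows = len(features) - window + 1
    if n_windows <= 0:
        raise ValueError("need at least `window` rows")
    # Window i holds rows i .. i + window - 1 of every variable.
    X = np.stack([features[i:i + window] for i in range(n_windows)])
    # Assumption: each window predicts the target at its final row.
    y = target[window - 1:]
    return X, y
```

For a table with 25 rows and 3 variables, this yields 16 windows of shape 10 × 3.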

The model was trained with the RMSprop optimizer and early stopping based on validation loss. For each fold, we recorded the Root Mean Squared Error (RMSE) and R² values between the observed and predicted pCO2, and saved the training and validation loss trajectories (Fig. 3A–C). The best fold yielded an RMSE of 8.655 and an R² of 0.968 (Fig. 3B), and the resulting dataset is described in Table 3.
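For users who wish to recompute these scores from the observed and predicted values, a minimal sketch is given below. It assumes that R² denotes the coefficient of determination (1 − SS_res/SS_tot) rather than a squared correlation coefficient; the function name is hypothetical.

```python
import numpy as np


def rmse_and_r2(observed, predicted):
    """Root mean squared error and coefficient of determination."""
    obs = np.asarray(observed, dtype=float)
    pred = np.asarray(predicted, dtype=float)
    residual = obs - pred
    rmse = float(np.sqrt(np.mean(residual ** 2)))
    ss_res = float(np.sum(residual ** 2))            # unexplained variation
    ss_tot = float(np.sum((obs - obs.mean()) ** 2))  # total variation of observations
    return rmse, 1.0 - ss_res / ss_tot
```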

Fig. 3

(A) Observed and predicted surface seawater pCO2 (μatm) across the training data sequence. (B) Relationship between observed and predicted surface seawater pCO2 (μatm) for the best-performing configuration. (C) Validation and training loss (mean squared error) for the best fold in the Bi-LSTM model, with the shaded area indicating the standard deviation among the five folds.

Table 3 Description of variables included in the secondary dataset Deception_2025_CO2_predictions.csv.

To further assess feature relevance, we calculated permutation importance across folds (Fig. 4). For each input variable, values were randomly permuted within the validation set (considering each time window and feature index), and the increase in RMSE was measured relative to baseline predictions. This process was repeated for each fold, and the average ± standard deviation across folds was reported for all features. The results highlighted the dominant influence of diel variability, as represented by solar radiation and seawater temperature. Although seismic signals ranked lowest in global permutation importance, further analysis showed that their inclusion improved the model's sensitivity to abrupt changes in pCO2 levels that could not be explained solely by environmental and physical variables. This analysis confirms the model's ability to identify meaningful drivers of pCO2 variability in the surface waters of the central PF area (Fig. 1B), supporting the robustness of the model interpretation.
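The permutation procedure can be sketched for windowed 3D inputs as follows. This is one possible reading of the description rather than the repository implementation: it assumes that each feature is shuffled across validation samples while the order of time steps inside every window is preserved, and the function name, the predict callable, and the n_repeats parameter are hypothetical.

```python
import numpy as np


def permutation_importance(predict, X, y, n_repeats=5, seed=42):
    """Mean and std of the RMSE increase when each feature of X is shuffled.

    predict : callable mapping an array (samples, time, variables) to predictions
    X, y    : validation windows and their targets
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).ravel()
    rng = np.random.default_rng(seed)

    def rmse(pred):
        return float(np.sqrt(np.mean((y - np.ravel(pred)) ** 2)))

    baseline = rmse(predict(X))
    increases = np.zeros((X.shape[2], n_repeats))
    for j in range(X.shape[2]):
        for r in range(n_repeats):
            X_perm = X.copy()
            # Assumption: shuffle feature j across samples, keeping window order intact.
            X_perm[:, :, j] = X[rng.permutation(len(X)), :, j]
            increases[j, r] = rmse(predict(X_perm)) - baseline
    return increases.mean(axis=1), increases.std(axis=1)
```

With a trained Keras model, predict could be passed as model.predict; a feature the model does not use yields an RMSE increase of zero.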

Fig. 4

Barplot summary of the permutation importance across folds showing the influence of each input variable on model RMSE increase. Black lines represent the confidence interval (95%) for each fold.

All preprocessing steps, model training, and evaluation procedures were implemented in Python and are available in the accompanying code repository and supplementary files to support reproducibility and future benchmarking.

Usage Notes

Scripts and reproducibility

All scripts used to train, validate, and interpret the Bi-LSTM model are available in a public GitHub repository (https://github.com/susafle/CO2_prediction_BiLSTM). They are organized to facilitate reproducibility and adaptation to related datasets or modeling tasks. While detailed validation and methodology are provided in the Technical Validation section, the repository serves as a practical entry point for users wishing to replicate or expand upon this work.

Reuse potential

This dataset and modeling workflow have broad reuse potential in the fields of ocean biogeochemistry, climate modeling, and machine learning. The combination of physical, chemical, and seismic variables offers a valuable resource for studying the drivers of pCO2 variability in coastal and volcanic ocean regions. The structured time series format and model-ready input files enable other researchers to test different deep learning architectures (e.g., GRU, Transformer) or compare results with traditional statistical models (e.g., GAM, Cubist).

The dataset is particularly useful for benchmarking explainable AI techniques in environmental science because it includes permutation importance. Users can also modify the pipeline for other target variables (e.g., pH, carbonate saturation, dissolved inorganic carbon) or apply it to similar observational datasets from different regions. Including seismological signals provides a new opportunity to explore geophysical-biogeochemical interactions, particularly in tectonically or volcanically active coastal areas, such as Deception Island.

Researchers working on model interpretability, predictive performance evaluation, or transfer learning in environmental fields may find the modular structure of the scripts helpful, as it facilitates easier adaptation to new spatiotemporal settings or additional input features.

Limitations

The dataset concentrates on a specific region and time frame, which restricts the relevance of the findings to other oceanographic settings. Additionally, although several predictive variables were included, some potentially relevant parameters (e.g., dissolved oxygen, fluorescence) were not available and thus excluded from this version.

Finally, the significance of seismic variables may differ depending on the specific volcanic or tectonic setting, and caution should be taken when applying this approach to non-volcanic systems. As with any data-driven model, the results also rely on the quality of the data and the preprocessing decisions, which are documented in the GitHub repository for transparency.