Background & Summary

Coastal regions are closely tied to human life and have intricate ecological and economic importance. Anthropogenic activities have led to a variety of environmental issues in coastal waters1. Hypoxia (oxygen concentration lower than 2 mg L−1) has frequently been reported in the Coastal China Sea (CCS) over recent decades2,3,4, driven by both anthropogenic activities and climate change5,6,7. The Yangtze River estuary is the largest estuary in the CCS, and shallow hypoxia has been observed in its adjacent coastal ocean in summer and autumn in recent years8,9,10. Hypoxia has a significantly negative impact on environmental health and, subsequently, on ecological community composition and fisheries11,12. Therefore, understanding the underlying mechanisms of hypoxia and exploring solutions to alleviate it are urgent tasks4,13. Continental shelves absorb atmospheric CO2 at a rate of about 0.2 Pg C yr−1, accounting for approximately 13%–15% of the current global oceanic CO2 uptake14,15,16. The CCS, one of the largest continental shelves on Earth17, is considered a region with significant carbon sink potential18,19,20 and deserves great effort in quantifying its carbon fluxes and predicting its response to future climate change.

To understand the physicochemical characteristics of coastal waters, researchers use coupled physical–biogeochemical ocean models, which serve as crucial tools for testing hypotheses and quantifying fluxes in elemental cycles. Model evaluation and parameter calibration require a large volume of observational data. However, accessing these data is challenging because they are dispersed across many repositories. Furthermore, strict quality control (QC) of the different types of in-situ observational data is necessary.

Considering the recently recognised importance of the CCS in the carbon cycle and the increasing number of reported hypoxia events, a regional ocean database for the CCS (RODCCS) is compiled in this study to offer comprehensive and reliable observational data for both modelling and field research21. RODCCS includes data from six repositories, one of which comprises unpublished data from the authors (Table 1). The database covers the region of 116° to 135° E in longitude and 20° to 42° N in latitude, encompassing the Bohai Sea, the Yellow Sea, and the East China Sea, as well as part of the Sea of Japan, with sampling depths spanning from the surface to 6984 meters (Fig. 1a,b). The sampling years span from 1985 to 2021 (Fig. 1b). The database includes twelve variables: temperature (4,348,536 data points), salinity (4,325,295), dissolved oxygen (DO, 4,235,725), silicate (726,086), nitrate (745,908), nitrite (246,347), ammonium (29,526), phosphate (729,507), Chlorophyll a (Chl a, 73,715), dissolved inorganic carbon (DIC, 128,744), dissolved organic carbon (DOC, 196,919), and particulate organic carbon (POC, 25,190) concentrations.

Table 1 Summary of in-situ datasets and value ranges of variables collected from the literature.
Fig. 1
figure 1

Spatio-temporal distribution of RODCCS. (a) Blue, black, green, purple, pink and yellow points represent data from Argo, NESSDC, GLODAPv2, CCHDO, CoastDOM and R2R, respectively. (b) Hovmoller diagram of spatiotemporal distribution of RODCCS. The shadings indicate the log10-transformed numbers of data points. The white shading indicates the absence of data. The orange line indicates the log10-transformed numbers of data points for each year during the period from 1985 to 2021.

We apply six strict QC checks to each variable of the collected in-situ data: a location check, depth check, constant value check, value range check, vertical gradient check and time reversal check (Table 2). After QC, irrelevant, failed and acceptable data are marked with flags of 1, 2 and 3, respectively (Table 3). The numbers of data points that failed each of the six QC checks for the twelve variables are presented in Table 4. We store the data in twelve NetCDF files, one variable per file. Each file has a uniform structure that includes longitude, latitude, depth, sampling time, data source ID, QC flag and variable values, providing the spatial and temporal attributes, original repository information, QC result, and value of each data point. RODCCS provides quality-controlled observational data for model evaluation as well as inter-comparisons among databases, making it suitable for a wide range of marine modelling and field research.

Table 2 Details of each QC check for RODCCS.
Table 3 Flags of QC results and their interpretations in the RODCCS.
Table 4 Original data point number of 12 variables and failed data number identified by the 6 QC checks.

Methods

The original data of RODCCS are from six repositories, encompassing observational data in various formats, including Comma-Separated Values (.csv), NetCDF, Excel Open XML Spreadsheet (.xlsx), Sea-Bird converted data (.cnv), and Tab-Separated Values (.tab). The procedures of RODCCS compilation are summarised in Fig. 2.

Fig. 2
figure 2

Workflow of RODCCS compilation.

Array for Real-time Geostrophic Oceanography (Argo) is an international program that measures water properties across the world's ocean using a fleet of robotic instruments that drift with the ocean currents and move up and down between the surface and a mid-water level22. On top of every Argo float is a conductivity, temperature, and pressure sensor, which measures temperature within an accuracy of 0.001 °C and pressure within 0.1 dbar, and calculates salinity from conductivity, temperature, and pressure within 0.001 psu (practical salinity units). Biogeochemical-Argo (BGC-Argo) is the extension of the Argo array of profiling floats to include floats equipped with biogeochemical sensors for pH, oxygen, Chl a, and nitrate concentrations, suspended particles, and downwelling irradiance. On the Euro-Argo European Research Infrastructure Consortium (ERIC) website (https://fleetmonitoring.euro-argo.eu)23, we select data from the China Sea Institute of Oceanology (CSIO), Korea Meteorological Administration (KMA), and Korea Ocean Research & Development Institute (KORDI) data centres for download; all of these are BGC-Argo data. All data from CSIO, KMA, and KORDI are adjusted data rather than raw sensor outputs, which remain institutionally archived; the adjusted data have undergone algorithmic processing and environmental compensation procedures24. CSIO, KMA and KORDI provide depth data rather than sensor-measured pressure data; these depth values are derived directly from pressure sensors rather than being synthetically reconstructed from multi-sensor fusion25. CSIO delivers both real-time (automatically quality-controlled) and delayed-mode (expert-validated) adjusted data. KMA and KORDI provide only real-time adjusted data, transmitted via satellite in near real-time, without delayed-mode products owing to shortened float deployments. Delayed-mode data undergo rigorous quality control protocols, including sensor calibration, salinity bias adjustment, and outlier removal25,26; for BGC-Argo, delayed-mode processing further integrates laboratory analytical validation25. Real-time QC of Argo detects physically implausible values and ensures vertical profile consistency, while delayed-mode QC combines expert manual verification with regional climatological datasets to identify biases24. We obtain the original data in NetCDF files that include sampling sites, sampling times, DO concentration, salinity and temperature, and extract them with MATLAB. The depth of each data point is converted from pressure using the seawater toolbox in MATLAB.
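As a minimal sketch of this extraction step, the snippet below shows how an adjusted Argo profile might be read and its pressure converted to depth in MATLAB; the file name is hypothetical, the variable names follow the standard Argo NetCDF convention, and sw_dpth is assumed to be the CSIRO seawater toolbox routine referred to above.

% Minimal sketch (hypothetical file name; standard Argo NetCDF variable names assumed).
fname = 'argo_profile_example.nc';
pres = ncread(fname, 'PRES_ADJUSTED');   % adjusted pressure (dbar)
temp = ncread(fname, 'TEMP_ADJUSTED');   % adjusted temperature (degC)
psal = ncread(fname, 'PSAL_ADJUSTED');   % adjusted practical salinity (psu)
lat  = ncread(fname, 'LATITUDE');        % profile latitude (degrees_north)

% Convert pressure (dbar) to depth (m) with the CSIRO seawater toolbox;
% sw_dpth accounts for the latitude dependence of gravity.
depth = sw_dpth(pres, repmat(lat(1), size(pres)));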

Climate and Ocean Variability, Predictability and Change (CLIVAR) and Carbon Hydrographic Data Office (CCHDO) support oceanographic research by providing access to high-quality, global, vessel-based Conductivity, Temperature and Depth (CTD) and hydrographic data from the Global Ocean Ship-based Hydrographic Investigations Program (GO-SHIP), the World Ocean Circulation Experiment (WOCE), CLIVAR and other repeat hydrography programs27,28. The electrochemical Sea-Bird SBE43 sensor is utilised to measure DO concentration in CCHDO29. Data are retrieved from the CCHDO database through its advanced search platform (https://cchdo.ucsd.edu/search/advanced) in .xlsx format. Ten variables, i.e., Chl a, DIC, DO, DOC, nitrate, nitrite, phosphate, and silicate concentrations, salinity, and temperature, are obtained from the CCHDO database. The depth of each data point is also converted from pressure using the seawater toolbox in MATLAB.

We retrieve four datasets from the National Earth System Science Data Center (NESSDC, http://www.geodata.cn/) and integrate them into our database. We conduct a targeted search for cruise expeditions along the coastal regions of China, with a specific emphasis on Chl a concentration. Following the submission of a formal request and the subsequent grant of access by the website administrators, we obtain the Yellow Sea and East China Sea Chl a concentration data for 2011–2013, the Bohai Sea Chl a concentration data for 2015 and 2017, and the China Coastal Chl a concentration data for 2009–2012 (offshore CTD Chl a concentration measurements in the CCS), in four .xlsx files. The in-situ Chl a concentration data from NESSDC are unpublished in the scientific literature. The providers of these data are authors of this study, and permission to use the data has been granted. The Chl a concentrations in these datasets are all determined using the Trilogy fluorometer technique. We extract the Chl a concentration data along with their corresponding sampling location and time information.

The Coast Dissolved Organic Matter (CoastDOM) database includes comprehensive coastal DOM concentration data in a single repository, making it openly and freely available to different research communities30. In CoastDOM, the DOC concentrations of 81% of the samples are determined using a High-Temperature Catalytic Oxidation (HTCO) analyser, with the remaining 19% determined by wet chemical oxidation (WCO) and/or UV digestion. Data from CoastDOM are downloaded from the website (https://doi.pangaea.de/10.1594/PANGAEA.964012) in .tab format. Five variables, i.e., Chl a, DIC, DOC, POC, and ammonium concentrations, together with their corresponding sampling location and time information, are extracted with MATLAB.

Global Ocean Data Analysis Project Version 2 (GLODAPv2) is a synthesis activity for ocean surface-to-bottom biogeochemical data collected through chemical analysis of water samples31,32,33. GLODAP deals only with bottle data and CTD data at bottle trip depths. The consistency of its data product is estimated to be better than 0.005 for salinity, 1% for oxygen, 2% for nutrients, 4 μmol/kg for DIC concentration and total alkalinity, and 0.01–0.02 for pH, indicating a high level of precision and reliability across these measurements. We download data from the Pacific Ocean part of GLODAPv2.2023 (released in 2023) via the GLODAPv2 portal (https://glodap.info/index.php/merged-and-adjusted-data-product-v2-2023/) as a .csv file34. We extract nine variables, i.e., Chl a, DO, DOC, nitrate, nitrite, phosphate, and silicate concentrations, salinity and temperature, together with their corresponding sampling location and time information.

The Rolling Deck to Repository (R2R) program, with its global capability and diverse array of sensors and research vessels, is an essential mobile observing platform for ocean science35,36. Temperature and salinity are measured with a CTD profiler, and DO concentration is measured by an oxygen sensor. R2R provides essential documentation and standard products for each expedition, as well as tools to document shipboard data acquisition activities while underway. Data collected on every expedition are of high value, given the high cost and increasingly limited resources for ocean exploration. We download the cruise data from the Pacific Ocean on the R2R website (https://www.rvdata.us/search?keyword=ctd&zoom=1&x=0&y=2646652.0332176173&projection=M). Twenty-three cruise files in .cnv format from this repository within the region of RODCCS are selected for further analysis. Finally, we extract the DO concentration, salinity and temperature values, together with the location and time information of each data point, and include them in RODCCS.

To control the quality of the in-situ data, we apply six types of checks to each variable: a location check, depth check, constant value check, value range check, vertical gradient check and time reversal check. The six QC checks are listed in Table 2 and explained below. Data points are flagged with 1, 2, or 3 when they are irrelevant to, have failed, or have passed a specific check, respectively (Table 3). The flags of each data point are saved in the NetCDF files for all variables.

Location check

The location check ensures the accuracy of the sampling locations within the CCS region defined in this study37,38. Because of the diversity of data sources and the varying sampling locations across cruises, the collected dataset includes data outside the CCS region. Data points outside the study area fail this check and are excluded from the subsequent five checks; only data points that pass undergo the subsequent checks.
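A minimal MATLAB sketch of this check, assuming lon and lat are vectors of sampling longitudes and latitudes (illustrative names) and using the domain bounds stated above, could be:

% Location check sketch: 3 = passed, 2 = failed (illustrative variable names).
inDomain = lon >= 116 & lon <= 135 & lat >= 20 & lat <= 42;
locFlag = 3*ones(size(lon));
locFlag(~inDomain) = 2;   % points outside the CCS domain fail this check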

Depth check

The depth check assesses whether the sampling depth is shallower than the corresponding seabed depth estimated with the General Bathymetric Chart of the Oceans (GEBCO) data37,38,39. Data points either above sea level or deeper than the seabed fail this check and are not used in the subsequent QC checks (Fig. 3).
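A minimal sketch of the comparison against GEBCO bathymetry, assuming the GEBCO grid has been read into vectors gebcoLon and gebcoLat and a matrix gebcoElev (elevation in meters, negative below sea level, sized [nLat x nLon]) and that sampling depths are positive downward, could be:

% Depth check sketch (illustrative variable names).
seabedDepth = -interp2(gebcoLon, gebcoLat, gebcoElev, lon, lat);  % seabed depth (m)
depthFlag = 3*ones(size(depth));                  % 3 = passed
depthFlag(depth < 0 | depth > seabedDepth) = 2;   % above sea level or below seabed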

Fig. 3
figure 3

Results of the depth check of temperature, salinity, DO, silicate, nitrate, nitrite, ammonium, phosphate, Chl a, DIC, DOC, and POC concentrations (a–l). Black lines are the seabed topography at 123°E (a–i) and 23.08°N (j–l) estimated from the GEBCO dataset. Black, blue and red circles are original data, data that passed the QC, and data that failed the QC (data below the seabed or above the sea level), respectively.

Constant value check

This check identifies consecutive identical values in a vertical profile, which suggest instrument malfunction or data corruption38, resulting in sampled values that do not change with depth or time. We analyse the data distribution of each profile and find that anomalous values typically occur three or more times consecutively. As a result, data points in runs of three or more consecutive identical values in a vertical profile are identified as failed data and are not used in the subsequent checks.
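The sketch below illustrates this check for a single vertical profile, assuming v is the vector of values sorted by depth (an illustrative name) and that runs of at least three identical values are flagged:

% Constant value check sketch for one profile (illustrative variable names).
flag = 3*ones(size(v));                 % 3 = passed
runStart = 1;
for i = 2:numel(v)+1
    if i > numel(v) || v(i) ~= v(runStart)
        if i - runStart >= 3            % run of three or more identical values
            flag(runStart:i-1) = 2;     % 2 = failed
        end
        runStart = i;
    end
end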

Value range check

For variables that conform to or approximate a normal distribution (log-transformed DOC, Chl a, and ammonium concentrations, and salinity), we employ Chauvenet’s criterion to identify outliers. For a dataset comprising N measurements, any value that deviates from the mean with a probability of less than 1/(2N) is classified as a suspicious outlier40,41. We determine the critical value using MATLAB’s norminv function, which requires the mean (mu) and standard deviation (sigma) of the data (Eq. 1). Because Chauvenet’s criterion is a two-sided test but only data in the high-value tail are to be identified here, the threshold probability is calculated as 1 − 1/(4N)42,43 (Eq. 1). Any measurements exceeding the critical value are taken as outliers, and the remaining measurements pass (Fig. 4).

$${Critical\; value}={norminv}(1\,-\,1/(4{\rm{N}}),{mu},{sigma})$$
(1)

Where mu and sigma are the mean and standard deviation, respectively.

Fig. 4
figure 4

Results of the value range check of salinity, Chl a, DOC and ammonium concentrations, temperature, and DO, silicate, phosphate, nitrate, nitrite, DIC, and POC concentrations (a–l). Black, blue and red bars represent numbers of original, passed and failed data, respectively. Black dashed lines and blue solid lines represent the fitted lines of the data distribution before and after QC, respectively (a–d). Red dashed lines indicate Chauvenet’s criterion thresholds (a–d) and magenta dashed lines indicate IQR thresholds (e–l), respectively. Subfigures are used to better show the distributions of failed data, with the y-axis values in each subfigure representing the quantities of the respective data points (a,e,f,j).

Log-transformed DOC, Chl a and ammonium concentrations, and salinity undergo the value range check with Chauvenet’s criterion. Since Chl a, ammonium, and DOC concentrations cannot be negative and may have very low values, the threshold for these variables is set between 0 and the critical value defined by Eq. 1. We also compare these thresholds with the ranges collected from the literature (Table 1) to validate the rationality of the thresholds applied in this check.
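As an illustration, a minimal MATLAB sketch of this check for a log-transformed, positive-valued variable (Chl a is used here, with chla as an illustrative variable name) could be:

% Chauvenet's criterion sketch (norminv requires the Statistics and Machine Learning Toolbox).
x = log10(chla(chla > 0));                 % log-transform positive values
N = numel(x);
critLog = norminv(1 - 1/(4*N), mean(x), std(x));   % Eq. 1, high-value tail only
critVal = 10^critLog;                      % back-transform to original units
pass = chla > 0 & chla <= critVal;         % lower bound of 0 for concentrations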

For variables that do not follow a normal or approximately normal distribution (temperature, and DO, silicate, nitrate, nitrite, phosphate, DIC, and POC concentrations), we utilise the Interquartile Range (IQR) method44 for outlier identification (Eqs. 2–4). Iqr, defined as the difference between the 75% quantile and the 25% quantile of a variable, together with published value ranges45,46,47, is applied in determining the upper (Upper Bound) and lower (Lower Bound) bounds for outlier identification (Eqs. 3, 4). Through comparative analysis of literature-reported ranges (Table 1) and bounds derived from varying Iqr coefficients, a coefficient of 2 is ultimately adopted in Eqs. 3, 4, which produces the best consistency between the two ranges. Data points outside the calculated bounds are marked as outliers (Fig. 4).

$${\rm{Iqr}}={\rm{Q}}2-{\rm{Q}}1$$
(2)
$${\rm{Lower\; Bound}}={\rm{Q}}1-2{\rm{Iqr}}$$
(3)
$${\rm{Upper\; Bound}}={\rm{Q}}2+2{\rm{Iqr}}$$
(4)

Where Q1 and Q2 are the 25% and 75% quantiles of a variable, respectively.
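A minimal MATLAB sketch of the IQR bounds for one variable v (an illustrative name), using the coefficient of 2 adopted above, could be:

% IQR value range check sketch (Eqs. 2-4; quantile requires the Statistics Toolbox).
Q1 = quantile(v, 0.25);          % 25% quantile
Q2 = quantile(v, 0.75);          % 75% quantile
Iqr = Q2 - Q1;                   % Eq. 2
lowerBound = Q1 - 2*Iqr;         % Eq. 3
upperBound = Q2 + 2*Iqr;         % Eq. 4
pass = v >= lowerBound & v <= upperBound;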

Vertical gradient check

This check is conducted to identify excessive decreases or increases in variable values over a depth range. A gradient is defined as:

$${\rm{gradient}}=\left|\frac{{\nu }_{i}-{\nu }_{i-1}}{{z}_{i}-{z}_{i-1}}\right|\quad \left(i=2,3,4,5,\ldots ,k\right)$$
(5)

Where \({\nu }_{i}\) and \({\nu }_{i-1}\) are the values of a variable at the current depth level and the previous (shallower) depth level, \({z}_{i}\) and \({z}_{i-1}\) are the corresponding depths (in meters), and k is the number of data points in a vertical profile.

We first sort data points with identical latitude and longitude coordinates into vertical profiles. We analyse the number of data points per vertical profile and find that most sampling events contain over ten data points. If the number of samples in a vertical profile is at least ten, the vertical gradient check is conducted; otherwise, all data points in the profile are marked as irrelevant. Since the data distributions of nitrite, ammonium, Chl a, DIC, DOC, and POC concentrations are highly dispersed and often do not meet this criterion, the vertical gradient check is performed only on the six variables of temperature, salinity, DO, silicate, nitrate, and phosphate concentrations.

For each vertical profile, a surface-to-bottom check sequence is adopted. The value of the shallowest sampling point is validated against the World Ocean Atlas 2013 (WOA13) annual climatological data48. If the value of a data point deviates from the value estimated from WOA13 at the corresponding location by more than ± n%, it is flagged as an outlier, and the data point at the next deeper level is examined until a value within the acceptable range is identified, which is then adopted as the first sampling point. Because the spatial distributions of salinity and DO concentration differ markedly from those of the other four variables, we set different n values to ensure that the selection of the first sampling point is appropriate. We determine n by comprehensively analysing the first sampling points obtained with different n values, and the n values resulting in accurate outlier identification are selected. The final n values used for salinity, DO concentration, and the other four variables are 20, 40, and 100, respectively. The next sampling point (νi) at depth zi is then compared with the starting point (νi−1) at depth zi−1. If the depth interval Δz (|zi − zi−1|) is greater than 10 meters, the vertical gradient check is performed; analysis of gradients in vertical profiles indicates that smaller Δz values (Δz < 10 m) reduce the denominator in Eq. 5 and produce unreasonably large gradients, so that numerous valid data points would be misidentified as outliers. If Δz is less than 10 meters, a shallower point that has already passed the vertical gradient check is instead used as the i−1 point. Data points within 10 meters vertically of the starting point are irrelevant for this check. To better represent the differences in gradient ranges between surface and deep waters (e.g. due to physical or biogeochemical influences), every data point is categorised into a shallow water group (depth ≤ 400 m) or a deep water group (depth > 400 m). Data points with gradients exceeding the maximum gradient value (MGV) fail this check and are flagged (Fig. 5). We compare the results obtained with different MGV values and verify the corresponding locations and values of the identified outliers; the MGV values with which obvious outliers are identified are applied in this check (Table 5).
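The sketch below outlines one way this procedure could be implemented for a single profile after the first valid point has been selected against WOA13; v and z are column vectors of values and depths sorted from shallow to deep, mgv is the MGV from Table 5, and all names are illustrative rather than the authors' actual code.

% Vertical gradient check sketch for one profile (Eq. 5; illustrative names).
flag = 3*ones(size(v));      % 3 = passed, 2 = failed, 1 = irrelevant
for i = 2:numel(v)
    % reference: deepest shallower point that passed and lies >= 10 m above
    ref = find(flag(1:i-1) == 3 & (z(i) - z(1:i-1)) >= 10, 1, 'last');
    if isempty(ref)
        flag(i) = 1;         % within 10 m of the starting point: irrelevant
        continue
    end
    grad = abs((v(i) - v(ref)) / (z(i) - z(ref)));   % Eq. 5
    if grad > mgv
        flag(i) = 2;         % gradient exceeds the MGV: failed
    end
end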

Fig. 5
figure 5

Vertical gradient check results of temperature, salinity, and DO, silicate, nitrate, and phosphate concentrations. Black, blue, red and yellow dots represent original, passed, failed, and irrelevant data, respectively (a,b,e,f,i,j,m,n,q,r,u,v). Blue, red and yellow circles represent passed, failed, and irrelevant data points in randomly selected vertical profiles (c,g,k,o,s,w) and their corresponding vertical gradients (d,h,l,p,t,x). Red dashed lines indicate MGVs as described in Table 5.

Table 5 Maximum gradient values (MGV) for vertical gradient check.

Time reversal check

This check identifies instances where data points are recorded out of temporal sequence, which can lead to misinterpretation of temporal trends. Within the same sampling event, data points that do not conform to an increasing chronological order are flagged as failed data. Data points that contain sampling time information of only year and month, without day, hour, and minute, are marked as irrelevant, as they do not provide the temporal resolution required for this check37.
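A minimal sketch of this check for the data points of one sampling event, assuming the sampling times have been converted to MATLAB serial date numbers t (NaN where only year and month are available; illustrative names), could be:

% Time reversal check sketch (illustrative variable names).
t = t(:);                                  % ensure a column vector
flag = 3*ones(size(t));                    % 3 = passed
flag(isnan(t)) = 1;                        % insufficient temporal resolution: irrelevant
valid = find(~isnan(t));
bad = valid([false; diff(t(valid)) < 0]);  % time decreases relative to previous point
flag(bad) = 2;                             % 2 = failed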

Evaluation of QCs for RODCCS

After QC, we employ the dichotomous metrics of True Positive Rate (TPR), False Positive Rate (FPR), and True Negative Rate (TNR) to evaluate QC performance (Fig. 2). TPR reflects the QC’s ability to correctly retain valid data points, FPR indicates the proportion of normal data erroneously flagged as anomalies, and TNR measures the specificity in preserving true negative instances38,49. These metrics quantify the trade-off between detection efficacy and error control, thereby providing a comprehensive evaluation of the discriminative capacity of the QC system. Optimal QC performance is achieved when both TPR and TNR are maximised and FPR is minimised38. These dichotomous metrics were initially proposed by Yerushalmy and are defined as follows38,49:

$${\rm{TPR}}=100 \% \times \frac{{N}_{{TP}}}{{N}_{{TP}}+{N}_{{FN}}}$$
(6)
$${\rm{FPR}}=100 \% \times \frac{{N}_{{FP}}}{{N}_{{TN}}+{N}_{{FP}}}$$
(7)
$${\rm{TNR}}=100 \% \times \frac{{N}_{{TN}}}{{N}_{{TN}}+{N}_{{FP}}}$$
(8)

In these equations, NTP, NFN, NFP and NTN represent the numbers of true positives, false negatives, false positives, and true negatives, respectively. For this evaluation, the benchmark dataset provides the true passed or rejected flags, which are compared against the QC results (pass or fail). We use GLODAPv2 as the benchmark dataset to evaluate the performance of our QC of RODCCS because of its comprehensive quality control procedures for multiple variables. Given that the GLODAPv2 dataset contains quality check indicators only for salinity and DO, silicate, nitrate, and phosphate concentrations, the evaluation of the QC checks for RODCCS is conducted exclusively on these five variables.
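Under the convention that a data point accepted by QC is treated as positive, a minimal MATLAB sketch of Eqs. 6–8, with qcPass and refPass as illustrative logical vectors holding the RODCCS QC result and the GLODAPv2 benchmark flags, could be:

% Evaluation metrics sketch (Eqs. 6-8; illustrative variable names).
NTP = sum( qcPass &  refPass);   % retained by QC and valid in the benchmark
NFN = sum(~qcPass &  refPass);   % rejected by QC but valid in the benchmark
NFP = sum( qcPass & ~refPass);   % retained by QC but rejected in the benchmark
NTN = sum(~qcPass & ~refPass);   % rejected by both
TPR = 100 * NTP / (NTP + NFN);   % Eq. 6
FPR = 100 * NFP / (NTN + NFP);   % Eq. 7
TNR = 100 * NTN / (NTN + NFP);   % Eq. 8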

Data Records

RODCCS is stored in twelve NetCDF files; each file encompasses ten variables, with nine common foundational information variables and one unique variable. The variable descriptors in the RODCCS_temperature.nc file are listed below, and the other files in RODCCS follow the same format:

Variables:

Longitude

Size: 4348536×1

Dimensions: data_number

Datatype: double

Attributes:

standard_name = ‘longitude’

units = ‘degrees_east’

FillValue = NaN

valid_min = −180

valid_max = 179.9984

variable properties = ‘common foundational information variable’

Latitude

Size: 4348536 × 1

Dimensions: data_number

Datatype: double

Attributes:

standard_name = ‘latitude’

units = ‘degrees_north’

FillValue = NaN

valid_min = −78.643

valid_max = 89.9909

variable properties = ‘common foundational information variable’

Depth

Size: 4348536 × 1

Dimensions: data_number

Datatype: double

Attributes:

standard_name = ‘depth’

units = ‘m’

FillValue = NaN

valid_min = −4.639

valid_max = 61228.213

variable properties = ‘common foundational information variable’

Year

Size: 4348536 × 1

Dimensions: data_number

Datatype: double

Attributes:

standard_name = ‘year’

units = ‘years’

FillValue = NaN

valid_min = 1978

valid_max = 2022

variable properties = ‘common foundational information variable’

Month

Size: 4348536 × 1

Dimensions: data_number

Datatype: double

Attributes:

standard_name = ‘month’

units = ‘months’

FillValue = NaN

valid_min = 1

valid_max = 12

variable properties = ‘common foundational information variable’

Day

Size: 4348536 × 1

Dimensions: data_number

Datatype: double

Attributes:

standard_name = ‘day’

units = ‘days’

FillValue = NaN

valid_min = 1

valid_max = 31

variable properties = ‘common foundational information variable’

Time

Size: 4348536 × 1

Dimensions: data_number

Datatype: double

Attributes:

standard_name = ‘sampling time’

units = ‘minute’

FillValue = NaN

valid_min = 0

valid_max = 2400

variable properties = ‘common foundational information variable’

QC flag

Size: 4348536 × 1

Dimensions: data_number

Datatype: double

Attributes:

standard_name = ‘quality control of sampling data point’

units = ‘xxxxxx, x equals 1 or 2 or 3’

FillValue = NaN

valid_min = 211111

valid_max = 333333

variable properties = ‘common foundational information variable’

Data Source ID

Size: 4348536 × 1

Dimensions: data_number

Datatype: double

Attributes:

standard_name = ‘Source of data point’

units = ‘constant’

FillValue = NaN

valid_min = 1

valid_max = 6

variable properties = ‘common foundational information variable’

Temperature

Size: 4348536 × 1

Dimensions: data_number

Datatype: double

Attributes:

standard_name = ‘Temperature of seawater’

units = ‘°C’

FillValue = NaN

valid_min = −57.5261

valid_max = 99

variable properties = ‘unique variable’

In the attribute description of each NetCDF file, a comprehensive summary is provided to introduce the sources of the in-situ data, where each data point is distinctly recognised by a unique source identifier (Data Source ID). The Data Source ID values are consecutive integers from 1 to 6, denoting the six data repositories of Argo, CCHDO, NESSDC, CoastDOM, GLODAPv2 and R2R, respectively (Table 1). The QC flag is a six-digit integer, with each digit (x) representing the outcome of one QC check (Table 3). Longitude, Latitude and Depth serve as the location descriptors, and Year, Month, Day and Time as the time descriptors, of each data point. ‘NaN’ denotes missing data. The twelve NetCDF files of RODCCS can be accessed on Figshare21 via https://doi.org/10.6084/m9.figshare.28532210.
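As a usage illustration, the sketch below reads RODCCS_temperature.nc and keeps only Argo data points (Data Source ID = 1) for which no QC check failed; the variable names follow the listing above, and the flag-splitting step is one possible way users might decompose the six-digit QC flag.

% Sketch of reading and filtering a RODCCS file (variable names as listed above).
fname = 'RODCCS_temperature.nc';
temp = ncread(fname, 'Temperature');
lon  = ncread(fname, 'Longitude');
lat  = ncread(fname, 'Latitude');
dep  = ncread(fname, 'Depth');
qc   = ncread(fname, 'QC flag');
src  = ncread(fname, 'Data Source ID');

digits = mod(floor(qc ./ 10.^(5:-1:0)), 10);   % split the six-digit QC flag
noFail = all(digits ~= 2, 2);                  % keep points with no failed check (flag 2)
keep = noFail & (src == 1);                    % Argo data only (Data Source ID = 1)
tempArgo = temp(keep);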

Technical Validation

The six QC checks significantly improve the data quality, and the numbers of data points that failed each check are shown in Table 4.

Location check

The location check excludes the largest number of data points among the six checks (Table 4). Among the twelve variables, the largest number of failed data points is observed for temperature (1,496,903), while the smallest is recorded for POC concentration (24,996). DIC concentration has the highest percentage of failed data points relative to its total number of data points (99.42%), whereas the lowest percentage is found for DO concentration (32.21%).

Depth check

During the depth check, anomalies above sea level are found exclusively in temperature, salinity, and DO concentration. The remaining outliers exceed the seabed depth estimated with the GEBCO data, predominantly occurring at the deepest points of individual vertical sampling events. Given the dispersed distribution of data points, we select the vertical sections at 123°E and 23.08°N of the domain, which have higher numbers of data points, to demonstrate the effectiveness of the depth check. The spatial distributions of outliers in seawater temperature, salinity, and DO concentration exhibit a high degree of similarity, indicating a strong likelihood that these outliers derive from the same sampling event (Fig. 3a–c). A similar pattern is also observed for the concentrations of nitrate, nitrite, and phosphate (Fig. 3e,f,h). No significant outliers are detected in the concentrations of silicate, DIC, DOC, and POC (Fig. 3d,g–l) in the selected transects.

Constant value check

The disparity in the number of outliers identified by the constant value check across variables is substantial. A total of 89,004 and 232,935 outliers are identified for temperature and salinity, accounting for 2.05% and 5.39% of the original data, respectively. In contrast, no outliers are identified for DIC and POC concentrations.

Value range check

In the value range check, far fewer outliers are identified for salinity than for the remaining variables. Figure 4a–d present the outcomes of Chauvenet’s criterion. Figure 4a shows unusually low salinity values (below 23.71 psu), while Fig. 4c shows excessively high DOC concentrations (outside the range of 0 to 371.53 µmol/L). Chl a and ammonium concentrations (Fig. 4b,d) do not yield any outliers in this check, indicating that the sampled values for these variables are within a reasonable range. Comparison of the fitted lines of the original and passed salinity distributions reveals that the distribution conforms more closely to a normal distribution after eliminating outliers (Fig. 4a).

Figure 4e–l demonstrate the effectiveness of the IQR method. Figure 4e shows that the IQR method successfully identifies excessively high temperature values (e.g., values greater than 42.07 °C), and Fig. 4f,j,l show that unusually high DO, phosphate, and POC concentrations (e.g., values greater than 338.56, 7.4, and 899.75 µmol/L, respectively) are identified.

Vertical gradient check

The vertical gradient check identifies values that exhibit substantial discrepancies from the values of their adjacent data points. The highest number of failed data points, 85,526, is observed for DO concentration (Table 4), which also has the highest percentage of failed data points relative to its total number of data points (2.02%).

Irrelevant values of temperature, salinity, and DO concentration are predominantly distributed in shallow waters (Fig. 5b,f,j). This is because sampling events for these three variables in shallow waters often have fewer than ten sampling points and thus cannot be treated as continuous vertical profiles in this check. Randomly selected temperature and salinity sampling events show that the vertical gradient check accurately identifies anomalous values within a set of sampling data (Fig. 5c,d,g,h).

Anomalous values of DO concentration are mostly located in shallow water and are often among the first few points of a single sampling event (Fig. 5j). This can be attributed to measurement errors that occur when the sampling instrument is initially activated. For nutrient concentrations, our vertical gradient check successfully identifies anomalously low values in the deep sea (Fig. 5n,r,v) and high values in the shallow sea (Fig. 5r), thereby rendering the vertical distribution of the data more reasonable (Fig. 5m,n,q,r,u,v).

Time reversal check

This check identifies a modest number of outliers. The highest number of failed data points is observed for temperature (27,232; Table 4), while the highest percentage of failed data points relative to the total number of data points is observed for phosphate concentration (2.98%). No outliers are identified by this check for ammonium, DIC, DOC, and POC concentrations (Table 4). Because the majority of the sampling points are not time-series data, most of the data are marked as irrelevant in this check.

Evaluation of QCs for RODCCS

Table 6 presents a comprehensive assessment of the QC checks for five variables. Overall, the QC checks demonstrate excellent performance, with high TPR and TNR values for the five variables, indicating a robust capacity to accurately identify true positives and true negatives. The value range check achieves a TNR of 100% for all five variables, i.e., the outliers identified by our value range check are identical to the outliers identified by the default QC check of GLODAPv2, suggesting the high efficacy of the value range check employed.

Table 6 The True Positive Rate (TPR), False Positive Rate (FPR) and True Negative Rate (TNR) for different QCs of RODCCS.