Background & Summary

Coastal regions are closely tied to human life and have intricate ecological and economic importance. Anthropogenic activities have led to a variety of environmental issues in coastal waters1. Hypoxia (oxygen concentration lower than 2 mg L−1) has frequently been reported in the Coastal China Sea (CCS) over recent decades2,3,4, driven by both anthropogenic activities and climate change5,6,7. The Yangtze River estuary is the largest estuary in the CCS, and shallow hypoxia has been observed in its adjacent coastal ocean in summer and autumn in recent years8,9,10. Hypoxia has a significantly negative impact on environmental health and, subsequently, on ecological community composition and fisheries11,12. Therefore, understanding the underlying mechanisms of hypoxia and exploring solutions to alleviate it are urgent tasks4,13. Continental shelves absorb atmospheric CO2 at a rate of about 0.2 Pg C yr−1, accounting for approximately 13%–15% of the current global oceanic CO2 uptake14,15,16. The CCS, one of the largest continental shelves on Earth17, is considered a region with significant carbon sink potential18,19,20 and deserves great effort in quantifying its carbon fluxes and predicting its response to future climate change.

To understand the physicochemical characteristics of coastal waters, researchers use coupled physical–biogeochemical ocean models, which serve as crucial tools for testing hypotheses and quantifying fluxes in elemental cycles. Model evaluation and parameter calibration require a large volume of observational data. However, accessing these data is challenging because they are dispersed across many repositories. Furthermore, strict quality control (QC) of the different types of in-situ observational data is necessary.

Considering the recently recognised importance of the CCS in the carbon cycle and the increasing number of reported hypoxia events, a regional ocean database for the CCS (RODCCS) is compiled in this study to offer comprehensive and reliable observational data for both modelling and field research21. RODCCS includes data from six repositories, one of which comprises unpublished data from the authors (Table 1). The database covers the region of 116° to 135° E in longitude and 20° to 42° N in latitude, encompassing the Bohai Sea, the Yellow Sea, and the East China Sea, as well as part of the Sea of Japan, with sampling depths spanning from the surface to 6984 meters (Fig. 1a,b). The sampling years span from 1985 to 2021 (Fig. 1b). The database includes twelve variables: temperature (4,348,536 data points), salinity (4,325,295), dissolved oxygen (DO, 4,235,725), silicate (726,086), nitrate (745,908), nitrite (246,347), ammonium (29,526), phosphate (729,507), Chlorophyll a (Chl a, 73,715), dissolved inorganic carbon (DIC, 128,744), dissolved organic carbon (DOC, 196,919), and particulate organic carbon (POC, 25,190) concentrations.

Table 1 Summary of in-situ datasets and value ranges of variables collected from the literature.
Fig. 1
figure 1

Spatio-temporal distribution of RODCCS. (a) Blue, black, green, purple, pink and yellow points represent data from Argo, NESSDC, GLODAPv2, CCHDO, CoastDOM and R2R, respectively. (b) Hovmoller diagram of spatiotemporal distribution of RODCCS. The shadings indicate the log10-transformed numbers of data points. The white shading indicates the absence of data. The orange line indicates the log10-transformed numbers of data points for each year during the period from 1985 to 2021.

We apply six strict QC checks to each variable of the collected in-situ data: a location check, depth check, constant value check, value range check, vertical gradient check and time reversal check (Table 2). After QC, irrelevant, failed and acceptable data are marked with flags of 1, 2 and 3, respectively (Table 3). The numbers of data points that failed each of the six QC checks for the twelve variables are presented in Table 4. We store the data in twelve NetCDF files, one variable per file. Each file has a uniform structure that includes longitude, latitude, depth, sampling time, data source ID, QC flag and variable values, providing the spatial and temporal attributes, original repository information, QC result, and value of each data point. RODCCS provides quality-controlled observational data for model evaluation as well as inter-comparisons among databases, making it suitable for a wide range of marine modelling and field research.

Table 2 Details of each QC check for RODCCS.
Table 3 Flags of QC results and their interpretations in the RODCCS.
Table 4 Original data point number of 12 variables and failed data number identified by the 6 QC checks.

Methods

The original data of RODCCS are from six repositories, encompassing observational data in various formats, including Comma-Separated Values (.csv), NetCDF, Excel Open XML Spreadsheet (.xlsx), Sea-Bird converted data (.cnv), and Tab-Separated Values (.tab). The procedures of RODCCS compilation are summarised in Fig. 2.

Fig. 2
figure 2

Workflow of RODCCS compilation.

Array for Real-time Geostrophic Oceanography (Argo) is an international program that measures water properties across the world's ocean using a fleet of robotic instruments that drift with the ocean currents and move up and down between the surface and a mid-water level22. On top of every Argo float is a conductivity, temperature, and pressure sensor, which measures temperature within an accuracy of 0.001 °C and pressure within 0.1 dbar, and calculates salinity from conductivity, temperature, and pressure within 0.001 psu (practical salinity units). Biogeochemical-Argo (BGC-Argo) is the extension of the Argo array of profiling floats to include floats equipped with biogeochemical sensors for pH, oxygen, Chl a, and nitrate concentrations, suspended particles, and downwelling irradiance. On the Euro-Argo European Research Infrastructure Consortium (ERIC) website (https://fleetmonitoring.euro-argo.eu)23, we select data from the China Sea Institute of Oceanology (CSIO), Korea Meteorological Administration (KMA), and Korea Ocean Research & Development Institute (KORDI) data centres for download; all of these are BGC-Argo data. All data from CSIO, KMA, and KORDI are adjusted data rather than raw sensor outputs, which remain institutionally archived; the adjusted data have undergone algorithmic processing and environmental compensation procedures24. CSIO, KMA and KORDI provide depth data rather than sensor-measured pressure data; these depth values are derived directly from pressure sensors rather than being synthetically reconstructed from multi-sensor fusion25. CSIO delivers both real-time (automatically quality-controlled) and delayed-mode (expert-validated) adjusted data. KMA and KORDI provide only real-time adjusted data, transmitted via satellite in near real-time, without delayed-mode products owing to shortened float deployments. Delayed-mode data undergo rigorous quality control protocols, including sensor calibration, salinity bias adjustment, and outlier removal25,26; for BGC-Argo, delayed-mode processing further integrates laboratory analytical validation25. Real-time QC of Argo detects physically implausible values and ensures vertical profile consistency, while delayed-mode QC combines expert manual verification with regional climatological datasets to identify biases24. We obtain the original data in NetCDF files that include sampling sites, sampling times, DO concentration, salinity and temperature, and extract them with MATLAB. The depth of each data point is converted from pressure using the seawater toolbox in MATLAB.
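As a minimal sketch of this extraction step, the snippet below shows how an adjusted Argo profile might be read and its pressure converted to depth in MATLAB; the file name is hypothetical, the variable names follow the standard Argo NetCDF convention, and sw_dpth is assumed to be the CSIRO seawater toolbox routine referred to above.

% Minimal sketch (hypothetical file name; standard Argo NetCDF variable names assumed).
fname = 'argo_profile_example.nc';
pres = ncread(fname, 'PRES_ADJUSTED');   % adjusted pressure (dbar)
temp = ncread(fname, 'TEMP_ADJUSTED');   % adjusted temperature (degC)
psal = ncread(fname, 'PSAL_ADJUSTED');   % adjusted practical salinity (psu)
lat  = ncread(fname, 'LATITUDE');        % profile latitude (degrees_north)

% Convert pressure (dbar) to depth (m) with the CSIRO seawater toolbox;
% sw_dpth accounts for the latitude dependence of gravity.
depth = sw_dpth(pres, repmat(lat(1), size(pres)));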

Climate and Ocean Variability, Predictability and Change (CLIVAR) and Carbon Hydrographic Data Office (CCHDO) support oceanographic research by providing access to high-quality, global, vessel-based Conductivity, Temperature and Depth (CTD) and hydrographic data from the Global Ocean Ship-based Hydrographic Investigations Program (GO-SHIP), the World Ocean Circulation Experiment (WOCE), CLIVAR and other repeat hydrography programs27,28. The electrochemical Sea-Bird SBE43 sensor is utilised to measure DO concentration in CCHDO29. Data are retrieved from the CCHDO database through its advanced search platform (https://cchdo.ucsd.edu/search/advanced) in .xlsx format. Ten variables, i.e., Chl a, DIC, DO, DOC, nitrate, nitrite, phosphate, and silicate concentrations, salinity, and temperature, are obtained from the CCHDO database. The depth of each data point is also converted from pressure using the seawater toolbox in MATLAB.

We retrieve four datasets from the National Earth System Science Data Center (NESSDC, http://www.geodata.cn/) and integrate them into our database. We conduct a targeted search for cruise expeditions along the coastal regions of China, with a specific emphasis on Chl a concentration. Following the submission of a formal request and the subsequent grant of access by the website administrators, we obtain the Yellow Sea and East China Sea Chl a concentration data for 2011–2013, the Bohai Sea Chl a concentration data for 2015 and 2017, and the China Coastal Chl a concentration data for 2009–2012 (offshore CTD Chl a concentration measurements in the CCS), in four .xlsx files. The in-situ Chl a concentration data from NESSDC are unpublished in the scientific literature. The providers of these data are authors of this study, and permission to use the data has been granted. The Chl a concentrations in these datasets are all determined using the Trilogy fluorometer technique. We extract the Chl a concentration data along with their corresponding sampling location and time information.

The Coast Dissolved Organic Matter (CoastDOM) database includes comprehensive coastal DOM concentration data in a single repository, making it openly and freely available to different research communities30. In CoastDOM, the DOC concentrations of 81% of the samples are determined using a High-Temperature Catalytic Oxidation (HTCO) analyser, with the remaining 19% determined by wet chemical oxidation (WCO) and/or UV digestion. Data from CoastDOM are downloaded from the website (https://doi.pangaea.de/10.1594/PANGAEA.964012) in .tab format. Five variables, i.e., Chl a, DIC, DOC, POC, and ammonium concentrations, together with their corresponding sampling location and time information, are extracted with MATLAB.

Global Ocean Data Analysis Project Version 2 (GLODAPv2) is a synthesis activity for ocean surface-to-bottom biogeochemical data collected through chemical analysis of water samples31,32,33. GLODAP deals only with bottle data and CTD data at bottle trip depths. The consistency of its data product is estimated to be better than 0.005 for salinity, 1% for oxygen, 2% for nutrients, 4 μmol/kg for DIC concentration and total alkalinity, and 0.01–0.02 for pH, indicating a high level of precision and reliability across these measurements. We download data from the Pacific Ocean part of GLODAPv2.2023 (released in 2023) via the GLODAPv2 portal (https://glodap.info/index.php/merged-and-adjusted-data-product-v2-2023/) as a .csv file34. We extract nine variables, i.e., Chl a, DO, DOC, nitrate, nitrite, phosphate, and silicate concentrations, salinity and temperature, together with their corresponding sampling location and time information.

The Rolling Deck to Repository (R2R) program, with its global capability and diverse array of sensors and research vessels, is an essential mobile observing platform for ocean science35,36. Temperature and salinity are measured with a CTD profiler, and DO concentration is measured by an oxygen sensor. R2R provides essential documentation and standard products for each expedition, as well as tools to document shipboard data acquisition activities while underway. Data collected on every expedition are of high value, given the high cost and increasingly limited resources for ocean exploration. We download the cruise data from the Pacific Ocean on the R2R website (https://www.rvdata.us/search?keyword=ctd&zoom=1&x=0&y=2646652.0332176173&projection=M). Twenty-three cruise files in .cnv format from this repository within the region of RODCCS are selected for further analysis. Finally, we extract the DO concentration, salinity and temperature values, together with the location and time information of each data point, and include them in RODCCS.

To control the quality of the in-situ data, we apply six types of checks to each variable: a location check, depth check, constant value check, value range check, vertical gradient check and time reversal check. The six QC checks are listed in Table 2 and explained below. Data points are flagged with 1, 2, or 3 when they are irrelevant to, have failed, or have passed a specific check, respectively (Table 3). The flags of each data point are saved in the NetCDF files for all variables.

Location check

The location check ensures the accuracy of the sampling locations within the CCS region defined in this study37,38. Because of the diversity of data sources and the varying sampling locations across cruises, the collected dataset includes data outside the CCS region. Data points outside the study area fail this check and are excluded from the subsequent five checks; only data points that pass undergo the subsequent checks.
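A minimal MATLAB sketch of this check, assuming lon and lat are vectors of sampling longitudes and latitudes (illustrative names) and using the domain bounds stated above, could be:

% Location check sketch: 3 = passed, 2 = failed (illustrative variable names).
inDomain = lon >= 116 & lon <= 135 & lat >= 20 & lat <= 42;
locFlag = 3*ones(size(lon));
locFlag(~inDomain) = 2;   % points outside the CCS domain fail this check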

Depth check

The depth check assesses whether the sampling depth is shallower than the corresponding seabed depth estimated with the General Bathymetric Chart of the Oceans (GEBCO) data37,38,39. Data points either above sea level or deeper than the seabed fail this check and are not used in the subsequent QC checks (Fig. 3).
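A minimal sketch of the comparison against GEBCO bathymetry, assuming the GEBCO grid has been read into vectors gebcoLon and gebcoLat and a matrix gebcoElev (elevation in meters, negative below sea level, sized [nLat x nLon]) and that sampling depths are positive downward, could be:

% Depth check sketch (illustrative variable names).
seabedDepth = -interp2(gebcoLon, gebcoLat, gebcoElev, lon, lat);  % seabed depth (m)
depthFlag = 3*ones(size(depth));                  % 3 = passed
depthFlag(depth < 0 | depth > seabedDepth) = 2;   % above sea level or below seabed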

Fig. 3
figure 3

Results of the depth check of temperature, salinity, DO, silicate, nitrate, nitrite, ammonium, phosphate, Chl a, DIC, DOC, and POC concentrations (a–l). Black lines are the seabed topography at 123°E (a–i) and 23.08°N (j–l) estimated from the GEBCO dataset. Black, blue and red circles are original data, data that passed the QC, and data that failed the QC (data below the seabed or above the sea level), respectively.

Constant value check

This check identifies consecutive identical values in a vertical profile, which suggest instrument malfunction or data corruption38, resulting in sampled values that do not change with depth or time. We analyse the data distribution of each profile and find that anomalous values typically occur three or more times consecutively. As a result, data points in runs of three or more consecutive identical values in a vertical profile are identified as failed data and are not used in the subsequent checks.
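The sketch below illustrates this check for a single vertical profile, assuming v is the vector of values sorted by depth (an illustrative name) and that runs of at least three identical values are flagged:

% Constant value check sketch for one profile (illustrative variable names).
flag = 3*ones(size(v));                 % 3 = passed
runStart = 1;
for i = 2:numel(v)+1
    if i > numel(v) || v(i) ~= v(runStart)
        if i - runStart >= 3            % run of three or more identical values
            flag(runStart:i-1) = 2;     % 2 = failed
        end
        runStart = i;
    end
end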

Value range check

For variables that conform to or approximate a normal distribution (log-transformed DOC, Chl a, and ammonium concentrations, and salinity), we employ Chauvenet’s criterion to identify outliers. For a dataset comprising N measurements, any value that deviates from the mean with a probability of less than 1/(2N) is classified as a suspicious outlier40,41. We determine the critical value using MATLAB’s norminv function, which requires the mean (mu) and standard deviation (sigma) of the data (Eq. 1). Because Chauvenet’s criterion is a two-sided test but only data in the high-value tail are to be identified here, the threshold probability is calculated as 1 − 1/(4N)42,43 (Eq. 1). Any measurements exceeding the critical value are taken as outliers, and the remaining measurements pass (Fig. 4).

$${Critical\; value}={norminv}(1\,-\,1/(4{\rm{N}}),{mu},{sigma})$$
(1)

Where mu and sigma are the mean and standard deviation, respectively.

Fig. 4
figure 4

Results of the value range check of salinity, Chl a, DOC and ammonium concentrations, temperature, and DO, silicate, phosphate, nitrate, nitrite, DIC, and POC concentrations (a–l). Black, blue and red bars represent numbers of original, passed and failed data, respectively. Black dashed lines and blue solid lines represent the fitted lines of the data distribution before and after QC, respectively (a–d). Red dashed lines indicate Chauvenet’s criterion thresholds (a–d) and magenta dashed lines indicate IQR thresholds (e–l), respectively. Subfigures are used to better show the distributions of failed data, with the y-axis values in each subfigure representing the quantities of the respective data points (a,e,f,j).

Log-transformed DOC, Chl a and ammonium concentrations, and salinity undergo the value range check with Chauvenet’s criterion. Since Chl a, ammonium, and DOC concentrations cannot be negative and may have very low values, the threshold for these variables is set between 0 and the critical value defined by Eq. 1. We also compare these thresholds with the ranges collected from the literature (Table 1) to validate the rationality of the thresholds applied in this check.
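As an illustration, a minimal MATLAB sketch of this check for a log-transformed, positive-valued variable (Chl a is used here, with chla as an illustrative variable name) could be:

% Chauvenet's criterion sketch (norminv requires the Statistics and Machine Learning Toolbox).
x = log10(chla(chla > 0));                 % log-transform positive values
N = numel(x);
critLog = norminv(1 - 1/(4*N), mean(x), std(x));   % Eq. 1, high-value tail only
critVal = 10^critLog;                      % back-transform to original units
pass = chla > 0 & chla <= critVal;         % lower bound of 0 for concentrations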

For variables that do not follow a normal or approximately normal distribution (temperature, and DO, silicate, nitrate, nitrite, phosphate, DIC, and POC concentrations), we utilise the Interquartile Range (IQR) method44 for outlier identification (Eqs. 2–4). Iqr, defined as the difference between the 75% quantile and the 25% quantile of a variable, together with published value ranges45,46,47, is applied in determining the upper (Upper Bound) and lower (Lower Bound) bounds for outlier identification (Eqs. 3, 4). Through comparative analysis of literature-reported ranges (Table 1) and bounds derived from varying Iqr coefficients, a coefficient of 2 is ultimately adopted in Eqs. 3, 4, which produces the best consistency between the two ranges. Data points outside the calculated bounds are marked as outliers (Fig. 4).

$${\rm{Iqr}}={\rm{Q}}2-{\rm{Q}}1$$
(2)
$${\rm{Lower\; Bound}}={\rm{Q}}1-2{\rm{Iqr}}$$
(3)
$${\rm{Upper\; Bound}}={\rm{Q}}2+2{\rm{Iqr}}$$
(4)

Where Q1 and Q2 are the 25% and 75% quantiles of a variable, respectively.
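A minimal MATLAB sketch of the IQR bounds for one variable v (an illustrative name), using the coefficient of 2 adopted above, could be:

% IQR value range check sketch (Eqs. 2-4; quantile requires the Statistics Toolbox).
Q1 = quantile(v, 0.25);          % 25% quantile
Q2 = quantile(v, 0.75);          % 75% quantile
Iqr = Q2 - Q1;                   % Eq. 2
lowerBound = Q1 - 2*Iqr;         % Eq. 3
upperBound = Q2 + 2*Iqr;         % Eq. 4
pass = v >= lowerBound & v <= upperBound;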

Vertical gradient check

This check is conducted to identify excessive decreases or increases in variable values over a depth range. A gradient is defined as:

$${\rm{gradient}}=\left|\frac{{\nu }_{i}-{\nu }_{i-1}}{{z}_{i}-{z}_{i-1}}\right|\quad \left(i=2,3,4,5,\ldots ,k\right)$$
(5)

Where \({\nu }_{i}\) and \({\nu }_{i-1}\) are the values of a variable at the current depth level and the previous (shallower) depth level, \({z}_{i}\) and \({z}_{i-1}\) are the corresponding depths (in meters), and k is the number of data points in a vertical profile.

We first sort data points with identical latitude and longitude coordinates into vertical profiles. We analyse the number of data points per vertical profile and find that most sampling events contain over ten data points. If the number of samples in a vertical profile is at least ten, the vertical gradient check is conducted; otherwise, all data points in the profile are marked as irrelevant. Since the data distributions of nitrite, ammonium, Chl a, DIC, DOC, and POC concentrations are highly dispersed and often do not meet this criterion, the vertical gradient check is performed only on the six variables of temperature, salinity, DO, silicate, nitrate, and phosphate concentrations.

For each vertical profile, a surface-to-bottom check sequence is adopted. The value of the shallowest sampling point is validated against the World Ocean Atlas 2013 (WOA13) annual climatological data48. If the value of a data point deviates from the value estimated from WOA13 at the corresponding location by more than ± n%, it is flagged as an outlier, and the data point at the next deeper level is examined until a value within the acceptable range is identified, which is then adopted as the first sampling point. Because the spatial distributions of salinity and DO concentration differ markedly from those of the other four variables, we set different n values to ensure that the selection of the first sampling point is appropriate. We determine n by comprehensively analysing the first sampling points obtained with different n values, and the n values resulting in accurate outlier identification are selected. The final n values used for salinity, DO concentration, and the other four variables are 20, 40, and 100, respectively. The next sampling point (νi) at depth zi is then compared with the starting point (νi−1) at depth zi−1. If the depth interval Δz (|zi − zi−1|) is greater than 10 meters, the vertical gradient check is performed; analysis of gradients in vertical profiles indicates that smaller Δz values (Δz < 10 m) reduce the denominator in Eq. 5 and produce unreasonably large gradients, so that numerous valid data points would be misidentified as outliers. If Δz is less than 10 meters, a shallower point that has already passed the vertical gradient check is instead used as the i−1 point. Data points within 10 meters vertically of the starting point are irrelevant for this check. To better represent the differences in gradient ranges between surface and deep waters (e.g. due to physical or biogeochemical influences), every data point is categorised into a shallow water group (depth ≤ 400 m) or a deep water group (depth > 400 m). Data points with gradients exceeding the maximum gradient value (MGV) fail this check and are flagged (Fig. 5). We compare the results obtained with different MGV values and verify the corresponding locations and values of the identified outliers; the MGV values with which obvious outliers are identified are applied in this check (Table 5).
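The sketch below outlines one way this procedure could be implemented for a single profile after the first valid point has been selected against WOA13; v and z are column vectors of values and depths sorted from shallow to deep, mgv is the MGV from Table 5, and all names are illustrative rather than the authors' actual code.

% Vertical gradient check sketch for one profile (Eq. 5; illustrative names).
flag = 3*ones(size(v));      % 3 = passed, 2 = failed, 1 = irrelevant
for i = 2:numel(v)
    % reference: deepest shallower point that passed and lies >= 10 m above
    ref = find(flag(1:i-1) == 3 & (z(i) - z(1:i-1)) >= 10, 1, 'last');
    if isempty(ref)
        flag(i) = 1;         % within 10 m of the starting point: irrelevant
        continue
    end
    grad = abs((v(i) - v(ref)) / (z(i) - z(ref)));   % Eq. 5
    if grad > mgv
        flag(i) = 2;         % gradient exceeds the MGV: failed
    end
end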

Fig. 5
figure 5

Vertical gradient check results of temperature, salinity, and DO, silicate, nitrate, and phosphate concentrations. Black, blue, red and yellow dots represent original, passed, failed, and irrelevant data, respectively (a,b,e,f,i,j,m,n,q,r,u,v). Blue, red and yellow circles represent passed, failed, and irrelevant data points in randomly selected vertical profiles (c,g,k,o,s,w) and their corresponding vertical gradients (d,h,l,p,t,x). Red dashed lines indicate MGVs as described in Table 5.

Table 5 Maximum gradient values (MGV) for vertical gradient check.

Time reversal check

This check identifies instances where data points are recorded out of temporal sequence, which can lead to misinterpretation of temporal trends. Within the same sampling event, data points that do not conform to an increasing chronological order are flagged as failed data. Data points that contain sampling time information of only year and month, without day, hour, and minute, are marked as irrelevant, as they do not provide the temporal resolution required for this check37.
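A minimal sketch of this check for the data points of one sampling event, assuming the sampling times have been converted to MATLAB serial date numbers t (NaN where only year and month are available; illustrative names), could be:

% Time reversal check sketch (illustrative variable names).
t = t(:);                                  % ensure a column vector
flag = 3*ones(size(t));                    % 3 = passed
flag(isnan(t)) = 1;                        % insufficient temporal resolution: irrelevant
valid = find(~isnan(t));
bad = valid([false; diff(t(valid)) < 0]);  % time decreases relative to previous point
flag(bad) = 2;                             % 2 = failed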

Evaluation of QCs for RODCCS

After QC, we employ the dichotomous metrics of True Positive Rate (TPR), False Positive Rate (FPR), and True Negative Rate (TNR) to evaluate QC performance (Fig. 2). TPR reflects the QC’s ability to correctly retain valid data points, FPR indicates the proportion of normal data erroneously flagged as anomalies, and TNR measures the specificity in preserving true negative instances38,49. These metrics quantify the trade-off between detection efficacy and error control, thereby providing a comprehensive evaluation of the discriminative capacity of the QC system. Optimal QC performance is achieved when both TPR and TNR are maximised and FPR is minimised38. These dichotomous metrics were initially proposed by Yerushalmy and are defined as follows38,49:

$${\rm{TPR}}=100 \% \times \frac{{N}_{{TP}}}{{N}_{{TP}}+{N}_{{FN}}}$$
(6)
$${\rm{FPR}}=100 \% \times \frac{{N}_{{FP}}}{{N}_{{TN}}+{N}_{{FP}}}$$
(7)
$${\rm{TNR}}=100 \% \times \frac{{N}_{{TN}}}{{N}_{{TN}}+{N}_{{FP}}}$$
(8)

In these equations, NTP, NFN, NFP and NTN represent the numbers of true positives, false negatives, false positives, and true negatives, respectively. For this evaluation, the benchmark dataset provides the true passed or rejected flags, which are compared against the QC results (pass or fail). We use GLODAPv2 as the benchmark dataset to evaluate the performance of our QC of RODCCS because of its comprehensive quality control procedures for multiple variables. Given that the GLODAPv2 dataset contains quality check indicators only for salinity and DO, silicate, nitrate, and phosphate concentrations, the evaluation of the QC checks for RODCCS is conducted exclusively on these five variables.
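Under the convention that a data point accepted by QC is treated as positive, a minimal MATLAB sketch of Eqs. 6–8, with qcPass and refPass as illustrative logical vectors holding the RODCCS QC result and the GLODAPv2 benchmark flags, could be:

% Evaluation metrics sketch (Eqs. 6-8; illustrative variable names).
NTP = sum( qcPass &  refPass);   % retained by QC and valid in the benchmark
NFN = sum(~qcPass &  refPass);   % rejected by QC but valid in the benchmark
NFP = sum( qcPass & ~refPass);   % retained by QC but rejected in the benchmark
NTN = sum(~qcPass & ~refPass);   % rejected by both
TPR = 100 * NTP / (NTP + NFN);   % Eq. 6
FPR = 100 * NFP / (NTN + NFP);   % Eq. 7
TNR = 100 * NTN / (NTN + NFP);   % Eq. 8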

Data Records

RODCCS is stored in twelve NetCDF files; each file encompasses ten variables, with nine common foundational information variables and one unique variable. The variable descriptors in the RODCCS_temperature.nc file are listed below, and the other files in RODCCS follow the same format:

Variables:

Longitude

Size: 4348536×1

Dimensions: data_number

Datatype: double

Attributes:

standard_name = ‘longitude’

units = ‘degrees_east’

FillValue = NaN

valid_min = −180

valid_max = 179.9984

variable properties = ‘common foundational information variable’

Latitude

Size: 4348536 × 1

Dimensions: data_number

Datatype: double

Attributes:

standard_name = ‘latitude’

units = ‘degrees_north’

FillValue = NaN

valid_min = −78.643

valid_max = 89.9909

variable properties = ‘common foundational information variable’

Depth

Size: 4348536 × 1

Dimensions: data_number

Datatype: double

Attributes:

standard_name = ‘depth’

units = ‘m’

FillValue = NaN

valid_min = −4.639

valid_max = 61228.213

variable properties = ‘common foundational information variable’

Year

Size: 4348536 × 1

Dimensions: data_number

Datatype: double

Attributes:

standard_name = ‘year’

units = ‘years’

FillValue = NaN

valid_min = 1978

valid_max = 2022

variable properties = ‘common foundational information variable’

Month

Size: 4348536 × 1

Dimensions: data_number

Datatype: double

Attributes:

standard_name = ‘month’

units = ‘months’

FillValue = NaN

valid_min = 1

valid_max = 12

variable properties = ‘common foundational information variable’

Day

Size: 4348536 × 1

Dimensions: data_number

Datatype: double

Attributes:

standard_name = ‘day’

units = ‘days’

FillValue = NaN

valid_min = 1

valid_max = 31

variable properties = ‘common foundational information variable’

Time

Size: 4348536 × 1

Dimensions: data_number

Datatype: double

Attributes:

standard_name = ‘sampling time’

units = ‘minute’

FillValue = NaN

valid_min = 0

valid_max = 2400

variable properties = ‘common foundational information variable’

QC flag

Size: 4348536 × 1

Dimensions: data_number

Datatype: double

Attributes:

standard_name = ‘quality control of sampling data point’

units = ‘xxxxxx, x equals 1 or 2 or 3’

FillValue = NaN

valid_min = 211111

valid_max = 333333

variable properties = ‘common foundational information variable’

Data Source ID

Size: 4348536 × 1

Dimensions: data_number

Datatype: double

Attributes:

standard_name = ‘Source of data point’

units = ‘constant’

FillValue = NaN

valid_min = 1

valid_max = 6

variable properties = ‘common foundational information variable’

Temperature

Size: 4348536 × 1

Dimensions: data_number

Datatype: double

Attributes:

standard_name = ‘Temperature of seawater’

units = ‘°C’

FillValue = NaN

valid_min = −57.5261

valid_max = 99

variable properties = ‘unique variable’

In the attribute description of each NetCDF file, a comprehensive summary is provided to introduce the sources of the in-situ data, where each data point is distinctly recognised by a unique source identifier (Data Source ID). The Data Source ID values are consecutive integers from 1 to 6, denoting the six data repositories of Argo, CCHDO, NESSDC, CoastDOM, GLODAPv2 and R2R, respectively (Table 1). The QC flag is a six-digit integer, with each digit (x) representing the outcome of one QC check (Table 3). Longitude, Latitude and Depth serve as the location descriptors, and Year, Month, Day and Time as the time descriptors, of each data point. ‘NaN’ denotes missing data. The twelve NetCDF files of RODCCS can be accessed on Figshare21 via https://doi.org/10.6084/m9.figshare.28532210.
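As a usage illustration, the sketch below reads RODCCS_temperature.nc and keeps only Argo data points (Data Source ID = 1) for which no QC check failed; the variable names follow the listing above, and the flag-splitting step is one possible way users might decompose the six-digit QC flag.

% Sketch of reading and filtering a RODCCS file (variable names as listed above).
fname = 'RODCCS_temperature.nc';
temp = ncread(fname, 'Temperature');
lon  = ncread(fname, 'Longitude');
lat  = ncread(fname, 'Latitude');
dep  = ncread(fname, 'Depth');
qc   = ncread(fname, 'QC flag');
src  = ncread(fname, 'Data Source ID');

digits = mod(floor(qc ./ 10.^(5:-1:0)), 10);   % split the six-digit QC flag
noFail = all(digits ~= 2, 2);                  % keep points with no failed check (flag 2)
keep = noFail & (src == 1);                    % Argo data only (Data Source ID = 1)
tempArgo = temp(keep);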

Technical Validation

The six QC checks significantly improve the data quality, and the numbers of data points that failed each check are shown in Table 4.

Location check

The location check excludes the largest number of data points among the six checks (Table 4). Among the twelve variables, the largest number of failed data points is observed for temperature (1,496,903), while the smallest is recorded for POC concentration (24,996). DIC concentration has the highest percentage of failed data points relative to its total number of data points (99.42%), whereas the lowest percentage is found for DO concentration (32.21%).

Depth check

During the depth check, anomalies above sea level are found exclusively in temperature, salinity, and DO concentration. The remaining outliers exceed the seabed depth estimated with the GEBCO data, predominantly occurring at the deepest points of individual vertical sampling events. Given the dispersed distribution of data points, we select the vertical sections at 123°E and 23.08°N of the domain, which have higher numbers of data points, to demonstrate the effectiveness of the depth check. The spatial distributions of outliers in seawater temperature, salinity, and DO concentration exhibit a high degree of similarity, indicating a strong likelihood that these outliers derive from the same sampling event (Fig. 3a–c). A similar pattern is also observed for the concentrations of nitrate, nitrite, and phosphate (Fig. 3e,f,h). No significant outliers are detected in the concentrations of silicate, DIC, DOC, and POC (Fig. 3d,g–l) in the selected transects.

Constant value check

The disparity in the number of outliers identified by the constant value check across variables is substantial. A total of 89,004 and 232,935 outliers are identified for temperature and salinity, accounting for 2.05% and 5.39% of the original data, respectively. In contrast, no outliers are identified for DIC and POC concentrations.

Value range check

In the value range check, far fewer outliers are identified for salinity than for the remaining variables. Figure 4a–d present the outcomes of Chauvenet’s criterion. Figure 4a shows unusually low salinity values (below 23.71 psu), while Fig. 4c shows excessively high DOC concentrations (outside the range of 0 to 371.53 µmol/L). Chl a and ammonium concentrations (Fig. 4b,d) do not yield any outliers in this check, indicating that the sampled values for these variables are within a reasonable range. Comparison of the fitted lines of the original and passed salinity distributions reveals that the distribution conforms more closely to a normal distribution after eliminating outliers (Fig. 4a).

Figure 4e–l demonstrate the effectiveness of the IQR method. Figure 4e shows that the IQR method successfully identifies excessively high temperature values (e.g., values greater than 42.07 °C), and Fig. 4f,j,l show that unusually high DO, phosphate, and POC concentrations (e.g., values greater than 338.56, 7.4, and 899.75 µmol/L, respectively) are identified.

Vertical gradient check

The vertical gradient check identifies values that exhibit substantial discrepancies from the values of their adjacent data points. The highest number of failed data points, 85,526, is observed for DO concentration (Table 4), which also has the highest percentage of failed data points relative to its total number of data points (2.02%).

Irrelevant values of temperature, salinity, and DO concentration are predominantly distributed in shallow waters (Fig. 5b,f,j). This is because sampling events for these three variables in shallow waters often have fewer than ten sampling points and thus cannot be treated as continuous vertical profiles in this check. Randomly selected temperature and salinity sampling events show that the vertical gradient check accurately identifies anomalous values within a set of sampling data (Fig. 5c,d,g,h).

Anomalous values of DO concentration are mostly located in shallow water and are often among the first few points of a single sampling event (Fig. 5j). This can be attributed to measurement errors that occur when the sampling instrument is initially activated. For nutrient concentrations, our vertical gradient check successfully identifies anomalously low values in the deep sea (Fig. 5n,r,v) and high values in the shallow sea (Fig. 5r), thereby rendering the vertical distribution of the data more reasonable (Fig. 5m,n,q,r,u,v).

Time reversal check

This check identifies a modest number of outliers. The highest number of failed data points is observed for temperature (27,232; Table 4), while the highest percentage of failed data points relative to the total number of data points is observed for phosphate concentration (2.98%). No outliers are identified by this check for ammonium, DIC, DOC, and POC concentrations (Table 4). Because the majority of the sampling points are not time-series data, most of the data are marked as irrelevant in this check.

Evaluation of QCs for RODCCS

Table 6 presents a comprehensive assessment of the QC checks for five variables. Overall, the QC checks demonstrate excellent performance, with high TPR and TNR values for the five variables, indicating a robust capacity to accurately identify true positives and true negatives. The value range check achieves a TNR of 100% for all five variables, i.e., the outliers identified by our value range check are identical to the outliers identified by the default QC check of GLODAPv2, suggesting the high efficacy of the value range check employed.

Table 6 The True Positive Rate (TPR), False Positive Rate (FPR) and True Negative Rate (TNR) for different QCs of RODCCS.