Background & Summary

Water optical parameters and WST serve as crucial indicators for measuring and assessing lake ecosystems1,2. Water optical parameters offer valuable insights into the detection of organic and inorganic particles, light penetration, ecological risk assessment, and the influence of climate change on lakes3,4,5. Variations in WST can affect the growth, reproduction, and distribution of aquatic organisms and the lake’s water balance. These parameters are widely applied in monitoring, researching, and managing lake environments, facilitating the evaluation of the impact of human activities, climate change, and pollution on these ecosystems. Consequently, the accurate and long-term monitoring of changes in lake water optical parameters holds great significance. However, field water experiments are often limited by weather conditions, instrumentation constraints, and sampling time. Collecting long-term, multi-parameter water data requires substantial economic and temporal investments6. Furthermore, water quality parameter datasets are often dispersed among researchers and institutions. Therefore, consolidating such data into large-scale, long-term, and high-precision water quality datasets poses considerable challenges.

With the advancement of remote sensing satellites and the maturation of remote sensing technology, large-scale and long-term lake monitoring by remote sensing has become an inevitable trend7,8,9. Satellite sensors for water quality monitoring can be broadly categorized into marine and land sensors. The marine sensors include the Moderate Resolution Imaging Spectroradiometer (MODIS, Terra and Aqua), the Visible Infrared Imaging Radiometer Suite (VIIRS, NPP), and the Ocean and Land Colour Instrument (OLCI, Sentinel-3). The land sensors include the Operational Land Imager (OLI, Landsat 8–9), the Multi-Spectral Instrument (MSI, Sentinel-2), the Panchromatic and Multispectral camera (PMS, Gaofen), and the Charge-Coupled Device camera (CCD, HJ2A)10,11,12. These sensors substantially enhance lake monitoring capabilities in terms of spatial and temporal scales, thereby furnishing long-term water optical information for lakes. To extract water optical parameters from satellite images, limnologists, oceanographers, and other researchers have devised various retrieval models based on empirical, semi-empirical, analytical, and artificial intelligence approaches13,14,15,16. However, the applicability and generalizability of these algorithms across spatiotemporal scales require further investigation, especially considering spectral interactions in inland waters16. Optically active components interfere with one another in the water mixture’s inherent optical properties, making it challenging to separate the contributions of distinct water quality components from the overall water-leaving radiance17. Although specific studies have monitored water optical parameters across spatial and temporal scales, these retrieval models depend on large amounts of consistent in-situ data.

In recent years, several in-situ datasets have been published for aquatic ecological research, as presented in Table 1. These datasets can be categorized into ocean datasets and inland water datasets. The ocean datasets encompass SeaBASS18, CoastColour Round Robin datasets19, and a compilation of global bio-optical in-situ data for ocean color satellite applications20. These datasets effectively integrate global ocean water spectra and quality data, establishing a solid foundation for marine ecological research. Similarly, various datasets of inland water parameters have also been released, such as ReaLSAT21, LCC-TP v1.022, AquaSat23, GLORIA24, LakeTSim25, and an extensive spatiotemporal water quality dataset spanning four decades in China26. ReaLSAT is a global dataset that contains the location and surface area variations of 681,137 lakes from 1984 to 2015. AquaSat contains more than 600,000 matchups, covering 1984–2019, of ground-based water quality parameter measurements paired with spectral reflectance from Landsat 5, 7, and 8. GLORIA includes 7,572 curated hyperspectral remote sensing reflectance measurements at 1 nm intervals within the 350 to 900 nm wavelength range. LakeTSim combined numerical modeling and satellite thermal data to create a regional dataset of long-term water temperatures for 401 French lakes. LCC-TP v1.0 is the first dataset of lake-catchment characteristics for 1,525 lakes on the Tibetan Plateau, including six categories: lake characteristics, topography, climate, land cover/land use, soil and geology, and human activities. The 40-year water quality spatiotemporal dataset comprises over 330,000 observations encompassing daily, weekly, and monthly records of surface water quality covering the period from 1980 to 2022. Among inland water datasets, only two provide water parameter data specifically for China, whereas the GLORIA dataset contains relatively limited data in China.
Within GLORIA, only 113 in-situ records from the past five years cover China; they were collected in 2021 across four lakes in southeastern China (Hou Lake, Wuhan East Lake, Poyang Lake, and Liangzi Lake). In contrast, the 40-year water quality spatiotemporal dataset provides substantial long-term water parameter data for China. It serves as an extensive source of long-term series data for water quality monitoring of China’s offshore waters, lakes, reservoirs, and rivers. However, most of these data are concentrated on water quality monitoring in the eastern coastal areas, resulting in a limited number of monitoring points for inland lakes in China. Furthermore, the monitoring points for lakes are mainly situated along the shorelines, making it challenging to capture the full distribution of water parameters within the lakes. This limitation may introduce uncertainty when using remote sensing technology to retrieve lake water optical parameters.

Table 1 Progress in the sharing of water in situ datasets.

To fulfill the requirements of satellite remote sensing applications across various domains, China initiated the “National Civilian Space Infrastructure Terrestrial Observation Satellite Common Application Support Platform Project” in 2018. This project provides common and fundamental technical services, ensuring the provision of in-situ data measurements for six parameters across the country: radiation, geometry, water parameters, land surface, atmosphere, and vegetation. The project has accumulated a substantial volume of synchronous satellite-ground in-situ measurement data, supporting quantitative remote sensing research conducted by Chinese satellites. The initial batch of selected water-related common products included four water optical parameters (Rrs, Chl-a, TSM, SDD) and WST. More water quality parameters, such as CDOM, COD, TN and TP, will be measured with support from future projects.

After four years of development, the project has established a synchronized satellite-ground dataset of water optical parameters and WST for typical lakes across China. This dataset aims to construct a large-scale, long-term water dataset covering typical lakes in China. From 2020 to 2023, a total of 586 matched Rrs, Chl-a, TSM, SDD, and WST data points were collected from 18 lakes. The satellite-ground synchronous dataset serves as a validation resource for water remote sensing products. GaoFen or HJ-2A/B satellite simultaneous observations were coordinated during the in-situ data collection process. Additionally, global open satellite data, including Landsat 8-9 and Sentinel-2, were considered in the lake experiments. The synchronous observation method ensures a time difference within ±1.5 hours between in-situ and satellite observation data. Moreover, this dataset simultaneously captures Rrs, water optical parameters, and WST, enabling researchers to explore their intrinsic relationship and thus establish more reliable and credible retrieval models for water optical parameters. With the ongoing support of the project, the collection of water quality parameters in China’s typical lakes will continue, and regular updates to this dataset will be made. This dataset is a valuable complement to lake research, providing an effective and comprehensive data resource.

Methods

Parameter description

This dataset focuses on four water optical parameters and WST, which are essential for remote sensing studies of lakes. These parameters play a crucial role in understanding the ecological dynamics of lakes, and extensive studies utilizing remote sensing retrieval modeling have been conducted on them27.

Rrs provides a wealth of water information and can be utilized to infer essential water parameters. Data from airborne or satellite remote sensing platforms enable monitoring large-scale water regions and assessing water quality.

Chl-a is a photosynthetic pigment found in aquatic plants, algae, and certain bacteria28. It plays a crucial role in capturing light energy and initiating the process of photosynthesis, which is essential for estimating the primary productivity of aquatic ecosystems.

TSM refers to the concentration of suspended or floating organic and inorganic particles in water29. TSM measurements provide insights into water turbidity, sedimentation processes, and the transport of nutrients and pollutants.

SDD represents the depth to which light can penetrate through the water. It is influenced by factors such as TSM, phytoplankton abundance, and CDOM. SDD can provide information about the pollution level and overall water quality30.

WST is often employed as an indicator for water quality monitoring. Changes in WST are closely related to factors such as dissolved oxygen content, biological activity, water circulation, and water pollution. Furthermore, WST has significant implications for the survival and reproduction of aquatic organisms31.

By monitoring these parameters, researchers can gain insights into aquatic ecosystems’ ecological conditions, primary productivity, and nutrient dynamics. These parameters are widely utilized in environmental monitoring projects, research endeavors, and management efforts to assess the impact of human activities, climate change, and pollution on lakes.

Dataset description

This dataset encompasses 18 lakes located within China, as illustrated in Fig. 1. The longitude of the lakes ranges from 86.725° to 132.336°, while the latitude ranges from 24.859° to 49.546°. The dataset is categorized based on various lake characteristics, including the lake zone, latitude, altitude, water quality type, formation, hydrology, and surface area. Table 2 demonstrates the comprehensive representation of the dataset across different regions in China. Considering China’s location in the mid-latitude region, the dataset includes 15 lakes in the mid-latitude region and three in the low-latitude region. The altitudes of these lakes range from 0 to 3,200 meters. According to the Environmental Quality Standards for Surface Water (GB3838-2002)32, the lakes are classified as follows: four lakes as Class II, seven lakes as Class III, five lakes as Class IV, and one lake as Class V. These lakes comprise 16 freshwater lakes and two saline lakes (Lake Qinghai and Lake Tuosu). Furthermore, the sampling points were strategically designed to sample multiple water optical parameters, ensuring a more accurate representation of the spatial distribution of different parameters within the lakes33. Figure 2 depicts the sampling locations for each experiment.

Fig. 1 The geographical locations of the 18 lakes44.

Table 2 Lakes are classified by zone, latitude, altitude, type, formation, hydrology, and surface area.

Fig. 2 The spatial distribution of sampling points. Colors represent different water experiments.

This dataset compiles water experimental data from 2020 to 2023, as shown in Table 3. Among the 18 lakes included, water quality data for eight lakes was collected on a single occasion, while data for the remaining lakes was collected two or more times. Lakes, with their relatively limited spatial heterogeneity compared to other land covers, allow for the direct validation of medium-high-resolution remote sensing products. As for low-resolution satellite remote sensing products, in-situ data can be transformed into satellite image pixel scale data through scale conversion, enabling their utilization in product validation34.

Table 3 The collection time, number of sampling points, and synchronization status with remote sensing satellites.

This dataset’s advantage lies in its effective synchronization of in situ data collection with Chinese satellite observations. By coordinating synchronous observations of satellites over the lakes during in-situ data collection, the dataset ensures the most accurate “truth value” for water remote sensing products. Table 3 illustrates the synchronous satellite data during the water sampling experiment, ensuring a time difference between ground sampling and satellite observations of approximately ±1.5 hours.
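The ±1.5 hour matching window described above can be expressed as a simple timestamp comparison. The following is a minimal sketch only: the function name, the point identifiers, and the timestamps are hypothetical, and both times are assumed to be expressed in the same time zone.

```python
from datetime import datetime, timedelta

# Window taken from the text: about +/- 1.5 hours between sampling and overpass.
MAX_OFFSET = timedelta(hours=1.5)


def is_synchronous(sample_time, overpass_time, max_offset=MAX_OFFSET):
    """Return True when the absolute time difference is within the window."""
    return abs(sample_time - overpass_time) <= max_offset


# Hypothetical timestamps (assumed to share one time zone).
overpass = datetime(2021, 7, 15, 11, 0)
samples = {
    "P01": datetime(2021, 7, 15, 9, 45),   # 75 min before the overpass
    "P02": datetime(2021, 7, 15, 12, 20),  # 80 min after the overpass
    "P03": datetime(2021, 7, 15, 13, 0),   # 120 min after the overpass
}

# Keep only the sampling points that fall inside the window.
matched = [pid for pid, t in samples.items() if is_synchronous(t, overpass)]
print(matched)
```

Points outside the window would not be discarded from the dataset; they would simply not count as synchronous with that particular overpass.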

Chinese multispectral satellites such as GF1 (8 m/16 m), GF1B/C/D (8 m), GF2 (4 m), and GF6 (8 m/16 m) have high spatial resolutions and can provide high-temporal and high-spatial-resolution data for retrieving water optical parameters. In addition, HJ2A/B carries multispectral sensors (16 m) and infrared and thermal infrared sensors (48/96 m). The synchronicity between Sentinel-2, Landsat 8-9, and the in-situ dataset was also investigated. With a time window of one day before and after the in-situ sampling, 73% of the ground experiments were synchronized with Sentinel-2 observations. Expanding the time window to 3 days before and after the field sampling resulted in 90% synchronicity with Sentinel-2 data. Fewer synchronous Landsat images are available because Landsat satellites have a 16-day temporal resolution. Nonetheless, twelve water experiments coincided with Landsat satellite overpasses, allowing for the verification of water remote sensing products using this dataset.

Observation dataset acquisition

Rrs measurement method

The Rrs measurements were conducted using a TriOS RAMSES water spectrometer. The spectral resolution of the in-situ Rrs is 1 nm. This system comprises one irradiance sensor and two radiance sensors. The above-water measurement method was employed for data collection. The Rrs at each sampling station was calculated following the methodology proposed by Mobley35,36.

$${{\rm{R}}}_{{rs}}({{\rm{\theta }}}_{v},{{\rm{\varphi }}}_{v},{\rm{\lambda }})=\left(\frac{{L}_{t}^{0+}({{\rm{\theta }}}_{v},{{\rm{\varphi }}}_{v},{\rm{\lambda }})-{r}_{{sky}}({{\rm{\theta }}}_{v}^{{\prime} },{{\rm{\varphi }}}_{v},{{\rm{\theta }}}_{v},W)\times {L}_{{sky}}^{0+}({{\rm{\theta }}}_{v}^{{\prime} },{{\rm{\varphi }}}_{v},{\rm{\lambda }})}{{E}_{d}^{0+}({\rm{\lambda }})}\right)$$
(1)

where \({L}_{t}^{0+}\) (\({\rm{W}}\cdot {m}^{-2}\cdot {{sr}}^{-1}\cdot {{nm}}^{-1}\)) is the above-water upwelling radiance, \({L}_{{sky}}^{0+}\) (\({\rm{W}}\cdot {m}^{-2}\cdot {{sr}}^{-1}\cdot {{nm}}^{-1}\)) is the sky radiance, \({E}_{d}^{0+}\) (\({\rm{W}}\cdot {m}^{-2}\cdot {{nm}}^{-1}\)) is the above-surface downwelling irradiance, and \({r}_{{sky}}\) is the reflectance of the air-water interface for sky radiance, which depends on the viewing geometry and the wind speed W. The radiometric quantities were measured using TriOS RAMSES radiometers within the 350–900 nm wavelength range. The sensors were attached to a frame adhering to the viewing angles of Mobley. \({L}_{t}^{0+}\) has a relative azimuth angle (φv) within 90°–135° from the sun and a zenith angle (θv) of 45° from the nadir, and \({L}_{{sky}}^{0+}\) has a zenith angle (\({{\rm{\theta }}}_{v}^{{\prime} }\)) of θv + 90° from nadir.

The instrument setup and measurements are conducted as follows: (i) Before measurement, the TriOS RAMSES water spectrometer is mounted on a specialized device that fixes the angles of the three sensors. This device is depicted in Fig. 3(a). (ii) The device is placed horizontally at the edge of the boat deck, ensuring the target water body is not influenced by surrounding structures’ shadows, sun glints, water foam, or white caps. (iii) The observation geometry θv, φv is set to 40° and 135°, respectively. The zenith angles of the two radiance sensors are adjusted to 40° and −40°, with one sensor pointing toward the water surface and the other sensor pointing toward the sky. (iv) Each observation strictly follows the field observation geometry. Simultaneous spectral data from two radiance sensors and one irradiance sensor are collected, with multiple spectra acquired for each sensor. (v) Filtering is applied to the collected spectral data from each sensor to eliminate residual sun glint, sky reflection components, and the influence of clouds and surrounding structures. (vi) Remote sensing reflectance (Rrs) is computed based on the processed data.
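Step (vi) applies Eq. (1) wavelength by wavelength. The sketch below illustrates that arithmetic on a hypothetical three-band spectrum. The function name and the input values are illustrative, and the default r_sky of 0.028 is an assumed placeholder for the interface reflectance, not a value reported for this dataset; in practice it depends on viewing geometry and wind speed.

```python
def remote_sensing_reflectance(lt, lsky, ed, r_sky=0.028):
    """Apply Eq. (1) per wavelength: Rrs = (Lt - r_sky * Lsky) / Ed.

    lt, lsky: radiance spectra (W m^-2 sr^-1 nm^-1)
    ed:       downwelling irradiance spectrum (W m^-2 nm^-1)
    r_sky:    assumed placeholder for the air-water interface reflectance
    """
    if not (len(lt) == len(lsky) == len(ed)):
        raise ValueError("spectra must share the same wavelength grid")
    return [(t - r_sky * s) / e for t, s, e in zip(lt, lsky, ed)]


# Hypothetical three-band example (values are illustrative only).
lt = [0.012, 0.020, 0.008]
lsky = [0.100, 0.080, 0.050]
ed = [1.20, 1.30, 1.10]

rrs = remote_sensing_reflectance(lt, lsky, ed)
print([round(v, 5) for v in rrs])  # Rrs in sr^-1
```

The filtering of sun glint and cloud effects in step (v) is not shown here; it would operate on the repeated spectra before this calculation.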

Fig. 3 Water optical parameter collection equipment. (a) Water spectrometer, (b) water quality monitoring platform, (c) SDD measurement equipment, (di) TSM measurement equipment.

Chl-a and WST measurement method

Chl-a and WST were measured using a water quality monitoring platform manufactured by YSI37. We specifically used the EXO1/EXO2 instrument model. This platform includes a multi-parameter device designed to monitor water quality parameters. The measurement of Chl-a concentration relies on chlorophyll fluorescence. When excited by light of a specific wavelength, chlorophyll emits light at a longer wavelength. The excitation light source of the instrument is a light-emitting diode with a peak wavelength of approximately 470 nm. Upon stimulation by this light source, chlorophyll within intact cells emits fluorescence within the 650–700 nm spectrum. The instrument quantifies the fluorescence signal to determine the chlorophyll concentration in the water. WST was measured with a thermistor (thermal resistance) sensor, which provides a high level of accuracy and stability. During the measurement, the instrument was placed 10–30 cm underwater. The measurement time was 1 minute, and the interval between consecutive readings was 2–3 seconds. The average value of the multiple readings was recorded as the Chl-a concentration and WST at the sampling point.

TSM measurement method

TSM was determined gravimetrically in the laboratory through the following steps38: (i) Water sampling: Standard glass containers (300–500 ml) were used to collect samples at 30–50 cm below the water surface, and each sample was labeled. (ii) Preparation of filters: The filter membranes were pre-rinsed with distilled water and then placed into weighing bottles. The weighing bottles containing the filter membranes were dried in an oven at 103–105 °C for 60 minutes. After the weighing bottles cooled to room temperature, their weights (B) were measured and recorded using a balance. (iii) Sample processing: A 100 ml portion of a well-mixed water sample was filtered using filter membranes and a vacuum pump. The weighing bottle with the filtered sample on the filter membrane was then placed into an oven set at 103–105 °C for 60 minutes. After the sample cooled to room temperature, its weight (A) was measured and recorded using a balance. (iv) TSM (mg/L) calculation: The TSM concentration was calculated from the two weights and the filtered volume according to Eq. (2):

$${\rm{C}}=\frac{(A-B)\times {10}^{6}}{V}$$
(2)

where C is TSM concentration (mg/L); A (g) is the weight of TSM, filter membrane, and weighing bottle; B (g) is the weight of the filter membrane and weighing bottle; V (ml) represents the sample volume.
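As a quick unit check on Eq. (2), a mass difference in grams divided by a volume in millilitres gives g/ml, and 1 g/ml equals 10^6 mg/L, which is where the factor 10^6 comes from. The sketch below works one example; the function name and the weights are hypothetical, chosen only to illustrate the arithmetic.

```python
def tsm_mg_per_l(weight_after_g, weight_before_g, volume_ml):
    """Eq. (2): C = (A - B) * 1e6 / V, with A and B in g and V in ml."""
    if volume_ml <= 0:
        raise ValueError("sample volume must be positive")
    return (weight_after_g - weight_before_g) * 1e6 / volume_ml


# Hypothetical weighing results for a 100 ml filtered sample.
a_g = 25.0123  # bottle + membrane + dried residue
b_g = 25.0100  # bottle + membrane
c = tsm_mg_per_l(a_g, b_g, 100.0)
print(round(c, 1))  # TSM in mg/L
```

With these illustrative weights, a residue of 2.3 mg in 100 ml corresponds to a TSM concentration of about 23 mg/L.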

SDD measurement method

SDD measurements are conducted using the Secchi disk method39. The Secchi disk, typically a white or black-and-white disk with a diameter of approximately 30 cm, is employed to assess the transparency of lake waters. The SDD refers to the depth at which the disk is no longer visible to an observer when lowered into the water. This depth serves as a quantitative indicator of the water body’s transparency or vertical visibility. It is recommended to record the average of the two measurements taken at the bow and stern during the assessment.

Sensor corrections

TriOS calibration

The TriOS radiance and irradiance sensors were calibrated at wavelengths of 460 nm, 550 nm, and 640 nm for four integration times. The radiance source used for the integrating sphere was the USLR-V12F-NMNN model manufactured by Labsphere Inc. The irradiance sensor utilized the FEL-C Irradiance Standard (F-1703) manufactured by Optronic Laboratories. The results demonstrate excellent fitting outcomes for the radiance and irradiance sensors at the three specified wavelengths (Fig. 4). After one year of TriOS radiance and irradiance sensor operation, the calibration coefficient changes were examined, and no significant deviations were found.

Fig. 4 Calibration of TriOS radiance (a) and irradiance (c) sensor. Example of calibration coefficient changes for the radiance (b) and irradiance (d) sensor after one year running.

YSI EXO1/EXO2 calibration

Chl-a and WST were measured using the YSI water quality parameter monitoring platform. The sensor calibration was performed before each measurement. We used a dual-channel fluorescence sensor for the Chl-a measurement and expressed the measurements in relative fluorescence units (RFU). The Chl-a values were derived from the correlation between Chl-a concentration and RFU. The RFU channel was calibrated using a two-point calibration method (Fig. 5)40. The instrument’s conductivity and temperature were directly calibrated for WST using standard conductivity solutions.

Fig. 5 Two-point calibration of Chl-a (a) and conductivity (b). x is the measured value before sensor calibration, and y is the value after sensor calibration. The linear relationship between the two is used to correct the sensor readings. For example, a Chl-a concentration of 50 μg/L measured before calibration corresponds to 42.128 μg/L after calibration.

Chlorophyll calibration was performed using distilled water (0 μg/L) and water with a known chlorophyll concentration. A standard solution containing 625 μg/L of Rhodamine WT dye can be utilized as the reference solution with a known chlorophyll concentration. For the conductivity/temperature sensor, calibration is primarily required for conductivity, while temperature does not need calibration. Conductivity calibration is performed using standard conductivity solutions. The two-point calibration procedure is as follows: (i) First, immerse the sensor in distilled water and input 0 as the value for the first standard solution in the instrument. (ii) Next, place the sensor in the Rhodamine WT standard solution and observe the data readings. Accept this calibration point once the data becomes stable (or shows no significant changes for approximately 40 seconds). (iii) Complete the calibration process.
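A two-point calibration amounts to fitting the straight line through the readings obtained in the zero standard and in the known standard, then applying that line to later readings. The sketch below shows the arithmetic with hypothetical raw readings and a hypothetical reference value; it is not the instrument's internal routine, and the numbers are not taken from Fig. 5.

```python
def two_point_calibration(raw_zero, raw_std, true_zero, true_std):
    """Return (slope, intercept) of the line through the two calibration points."""
    if raw_std == raw_zero:
        raise ValueError("the two raw readings must differ")
    slope = (true_std - true_zero) / (raw_std - raw_zero)
    intercept = true_zero - slope * raw_zero
    return slope, intercept


def apply_calibration(raw, slope, intercept):
    """Map an uncalibrated reading onto the calibrated scale."""
    return slope * raw + intercept


# Hypothetical readings: 2.0 in distilled water (true value 0.0) and
# 52.0 in a standard solution whose reference value is taken as 40.0.
slope, intercept = two_point_calibration(2.0, 52.0, 0.0, 40.0)
calibrated = apply_calibration(27.0, slope, intercept)
print(slope, intercept, calibrated)
```

The same linear form applies to the conductivity channel, with the standard conductivity solutions supplying the reference points.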

Data quality control

Specific quality control measures must be implemented to ensure the generation of a high-quality dataset for water optical parameters. These measures encompass data matching, completeness checks, and quality control, all of which collectively contribute to producing a high-quality dataset.

1) Before quality control, all in-situ data must be reviewed to ensure the collected data’s accuracy and proper matching. This review examines naming conventions, such as replacing abbreviations like “Lon” for longitude with “Longitude”. Additionally, units are standardized to ensure consistency and conformity to the appropriate unit of measurement. For example, for SDD measurements, “meters (m)” may be changed to “centimeters (cm)”.

2) Completeness checks are essential to ensure the usability of the in-situ dataset. These checks encompass metadata completeness, including OID, sampling time, sampling point coordinates, and other relevant information. Furthermore, the identification of missing or invalid values is crucial. Due to organizational differences, completeness checks are conducted separately for Rrs and water quality data. Initially, the completeness of metadata is assessed, and data with incomplete metadata is excluded. The presence of invalid or missing values is then determined. The corresponding data point is flagged if a sampling point has multiple missing values for water optical parameters. Moreover, if a sampling point exhibits substantial missing data within the spectral range of 350–940 nm, the data point is also flagged.

3) Data matching ensures a one-to-one correspondence between Rrs and water optical parameter data. Spatial matching is conducted to exclude data that do not align spatially. This step ensures the consistency and reliability of the dataset.

4) Quality control: The reflectance spectra undergo a thorough quality control process to identify outliers and unrealistic spectral shapes. A set of quality control indicators (refer to Table 4) was utilized, derived from the GLORIA in-situ dataset24. These indicators address high-frequency variability, baseline shifts, and negative slopes, particularly in the ultraviolet-to-blue region of the spectrum. The QWIP indicator was employed to identify spectral shapes that deviate from the typical central tendencies observed in aquatic reflectance data. Reasonable threshold ranges for maximum and minimum values were established through a comprehensive literature review and extensive sampling experience. Spectra flagged as ‘Suspect’ were evaluated by hydrology and remote sensing specialists.

Table 4 Quality control flag names for Rrs and water optical parameters.
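The completeness check in item 2 can be sketched as a small decision function. This is an illustration under stated assumptions, not the authors' implementation: the field names, the reading of "multiple" as two or more missing parameters, and the 10% threshold for "substantial" missing spectral data are all hypothetical choices.

```python
REQUIRED_METADATA = ("OID", "time", "lon", "lat")  # assumed field names
PARAMETERS = ("Chl-a", "TSM", "SDD", "WST")
MAX_MISSING_SPECTRAL_FRACTION = 0.10               # assumed threshold


def completeness_status(record):
    """Return 'excluded', 'flagged', or 'ok' for one sampling point."""
    # Records with incomplete metadata are excluded outright.
    if any(record.get(key) is None for key in REQUIRED_METADATA):
        return "excluded"
    # Two or more missing parameters (assumed meaning of "multiple") flag the point.
    missing_params = sum(record.get(p) is None for p in PARAMETERS)
    if missing_params >= 2:
        return "flagged"
    # A spectrum that is absent or has too many gaps also flags the point.
    spectrum = record.get("rrs")
    if not spectrum:
        return "flagged"
    missing_fraction = sum(v is None for v in spectrum) / len(spectrum)
    if missing_fraction > MAX_MISSING_SPECTRAL_FRACTION:
        return "flagged"
    return "ok"


# Hypothetical record with complete metadata, parameters, and spectrum.
example = {
    "OID": 1, "time": "2021-07-15T10:00", "lon": 120.1, "lat": 31.2,
    "Chl-a": 12.0, "TSM": 30.0, "SDD": 45.0, "WST": 27.5,
    "rrs": [0.01] * 100,
}
print(completeness_status(example))
```

The spectral-shape indicators of item 4 are not covered by this sketch; they operate on records that have already passed these completeness checks.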

The quality control identified 551 high-quality in-situ records, all of which remain unflagged. These records include spatial details such as longitude and latitude, temporal details such as sampling time, and water parameter information such as Chl-a, TSM, SDD, and WST. Table 5 presents the preliminary statistics for these data.

Table 5 Statistical information for in-situ data.

Data Records

The main tasks have been completed, including lake selection, sampling design, data collection, sensor calibration, and data archiving. All data have undergone consistency checks and quality control to flag suspicious or erroneous data.

This dataset is hosted in the Zenodo database, an open-access scientific data storage and sharing platform41. The data is contained in several comma-separated values (CSV) files, and a Microsoft Excel file provides keys to column names (see Table 6).

Table 6 Files of the dataset and their content.

Technical Validation

Validation of water remote sensing reflectance

To verify the usability of the Rrs data, this experiment compares the Rrs data with satellite images. The Rrs data is equated to the image value of a synchronous satellite42. First, water reflectance (Rw) is calculated as:

$${R}_{w}=\pi \cdot {R}_{{rs}}$$
(3)

In-situ Rw spectra were weighted by spectral response functions SRF(λ) of sensor bands, thus deriving multi-spectral data comparable to atmospherically corrected sensor reflectance from the image.

$${R}_{{eq}}({\lambda }_{i})=\frac{{\sum }_{j=1}^{n}{F}_{i}({\lambda }_{j}){R}_{w}({\lambda }_{j})}{{\sum }_{j=1}^{n}{F}_{i}({\lambda }_{j})}$$
(4)

where \({R}_{{eq}}\left({\lambda }_{i}\right)\) is the sensor reflectance simulated from in-situ reflectance (water equivalent spectrum), and \({F}_{i}\) represents the spectral response function of the i-th band of the satellite sensor.
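Eqs. (3) and (4) together amount to a weighted average of π·Rrs, with the band's spectral response as weights. The sketch below shows this for a single band, assuming the in-situ spectrum and the response function have already been resampled to the same wavelength grid. The function name and the sample values are hypothetical.

```python
import math


def band_equivalent_reflectance(rrs, srf):
    """Eqs. (3)-(4): SRF-weighted mean of Rw = pi * Rrs for one sensor band.

    rrs: in-situ remote sensing reflectance (sr^-1) on a wavelength grid
    srf: spectral response of the band on the same grid (assumed resampled)
    """
    if len(rrs) != len(srf):
        raise ValueError("rrs and srf must share the same wavelength grid")
    total_weight = sum(srf)
    if total_weight <= 0:
        raise ValueError("spectral response must have positive total weight")
    return sum(f * math.pi * r for f, r in zip(srf, rrs)) / total_weight


# Hypothetical three-point band: triangular response over three wavelengths.
rrs = [0.010, 0.012, 0.011]
srf = [0.5, 1.0, 0.5]
r_eq = band_equivalent_reflectance(rrs, srf)
print(round(r_eq, 5))
```

Because the weights are normalized by their sum, the absolute scaling of the response function does not affect the result.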

Remote sensing images are obtained from satellites that transit simultaneously with the ground sampling time. The Sentinel-2 data are from the Copernicus Data Space Ecosystem (https://dataspace.copernicus.eu/), and the Landsat 8 data are from the United States Geological Survey (https://earthexplorer.usgs.gov/). GF and HJ satellite data are downloaded from the China Center for Resources Satellite Data and Application (https://data.cresda.cn/#/2dMap). To obtain the surface reflectance (SR), the Sentinel-2 and Landsat 8 data were atmospherically corrected using ACOLITE DSF, and the GF and HJ data were atmospherically corrected using FLAASH.

The Pearson correlation coefficient (r), the mean absolute percentage error (MAPE, %), and the root mean square error (RMSE, \({{sr}}^{-1}\)) are used to evaluate the deviations between \({R}_{{eq}}\) and the remote sensing image value.

$$r=\frac{{\sum }_{i=1}^{N}({E}_{i}-\bar{E})({M}_{i}-\bar{M})}{\sqrt{{\sum }_{i=1}^{N}{({E}_{i}-\bar{E})}^{2}}\sqrt{{\sum }_{i=1}^{N}{({M}_{i}-\bar{M})}^{2}}}$$
(5)
$${MAPE}=\frac{100}{N}\mathop{\sum }\limits_{i=1}^{N}\left|\frac{{E}_{i}-{M}_{i}}{{M}_{i}}\right|$$
(6)
$${RMSE}=\sqrt{\frac{1}{N}\mathop{\sum }\limits_{i=1}^{N}{({E}_{i}-{M}_{i})}^{2}}$$
(7)

where N is the total number of measurements; i represents a single measurement; M and E are measured values and estimated values, respectively.
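The three metrics of Eqs. (5) to (7) can be written directly from their definitions. The sketch below uses plain Python and hypothetical values; E denotes the estimated values and M the measured values, as in the text.

```python
import math


def pearson_r(estimated, measured):
    """Eq. (5): Pearson correlation coefficient between E and M."""
    n = len(estimated)
    mean_e = sum(estimated) / n
    mean_m = sum(measured) / n
    cov = sum((e - mean_e) * (m - mean_m) for e, m in zip(estimated, measured))
    var_e = sum((e - mean_e) ** 2 for e in estimated)
    var_m = sum((m - mean_m) ** 2 for m in measured)
    return cov / (math.sqrt(var_e) * math.sqrt(var_m))


def mape(estimated, measured):
    """Eq. (6): mean absolute percentage error, in percent (M must be nonzero)."""
    n = len(estimated)
    return 100.0 / n * sum(abs((e - m) / m) for e, m in zip(estimated, measured))


def rmse(estimated, measured):
    """Eq. (7): root mean square error, in the units of the inputs."""
    n = len(estimated)
    return math.sqrt(sum((e - m) ** 2 for e, m in zip(estimated, measured)) / n)


# Hypothetical values for illustration.
E = [1.1, 1.9, 3.2, 3.8]
M = [1.0, 2.0, 3.0, 4.0]
print(pearson_r(E, M), mape(E, M), rmse(E, M))
```

Note that MAPE is undefined when any measured value is zero, which matters for near-infrared bands where water reflectance approaches zero.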

We verified the correlation and accuracy of each band between the water equivalent spectrum and satellite image values using six lakes in different regions of China (Fig. 6 & Table 7). The satellites utilized for validation encompass Landsat, Sentinel, and China’s HJ and GF satellites, ensuring a highly representative validation. The SR data of the satellite and the water equivalent spectrum are evenly distributed on both sides of the 1:1 line, indicating that the satellite data can monitor the water-leaving spectral radiance. However, in some lakes, the SR data of the satellite and the water equivalent spectrum do not show an obvious correlation. For example, in Fig. 6(c,f), the SR data of the satellite and the water equivalent spectrum show a weak correlation and negative correlation, indicating inconsistency between the satellite data and the water spectrum data. In addition, the Sentinel-2 data exhibit weaker consistency with the water equivalent spectrum data in Bands 6, 7, and 8 in Fig. 6(a). This discrepancy can be attributed to the low remote sensing reflectivity of the water spectrum in the near-infrared band. These situations will cause significant uncertainty in the retrieval of water quality parameters from remote sensing satellites.

Fig. 6 The correlation between SR value and water equivalent spectra. (a) Sentinel-2 MSI at Lake Bosten, (b) Sentinel-2 MSI at Lake Gaoyou, (c) Landsat 8 OLI at Lake Hulun, (d) HJ2B CCD at Lake Khanka, (e) GF1C PMS at Lake Taihu, (f) GF1D PMS at Lake Dianchi.

Table 7 shows each band’s correlation coefficient, MAPE, and RMSE in six lakes.

Remote sensing inversion of water optical parameters

The in-situ data collected from the lakes in China can be effectively employed for remote sensing retrieval of water optical parameters. To assess the applicability of this dataset, a water optical parameter retrieval model was constructed using simulated sensor reflectance derived from in-situ reflectance (Eq. 4) and water optical parameters (Chl-a, TSM, and SDD)43 as exemplified through GF1 PMS and Sentinel-2 MSI. The retrieval model uses the simulated PMS and MSI sensor reflectance based on in-situ reflectance as training data. The remote sensing retrieval of water optical parameters was performed using a random forest regression algorithm. We employed the RandomForestRegressor model from the Scikit-learn library in Python. The model is based on ensemble learning and utilizes the random forest algorithm to fit and predict continuous target variables. The model parameter configuration is set as n_estimators=500, max_depth=10, random_state=3. The training and validation datasets consist exclusively of quality-controlled in-situ data. The independent variable is the reflectance of the PMS sensor (Band 1–4) and MSI sensor (Band 2, 3, 4 and 8) at different spectral bands (blue, green, red, and near-infrared). The dependent variable is the corresponding water optical parameter. The data were divided into a training dataset (385 samples) and a testing dataset (166 samples) in a 7:3 ratio.
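The retrieval setup described above can be sketched with scikit-learn as follows. The model parameters (n_estimators=500, max_depth=10, random_state=3) and the 7:3 split of 551 samples come from the text; everything else is an assumption made for illustration. In particular, the reflectances and the target variable are synthetic, and the random seed used for the train/test split is arbitrary, so the scores printed here say nothing about the real dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for 551 quality-controlled samples with four simulated
# band reflectances (blue, green, red, near-infrared). Not real data.
X = rng.uniform(0.0, 0.15, size=(551, 4))
# Hypothetical target loosely tied to the red and near-infrared bands.
y = 200.0 * X[:, 2] + 50.0 * X[:, 3] + rng.normal(0.0, 1.0, size=551)

# 7:3 split; the split seed is an assumption, not taken from the paper.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=3
)

# Model configuration as stated in the text.
model = RandomForestRegressor(n_estimators=500, max_depth=10, random_state=3)
model.fit(X_train, y_train)

# Evaluate on the held-out samples.
pred = model.predict(X_test)
rmse = float(np.sqrt(np.mean((pred - y_test) ** 2)))
r = float(np.corrcoef(pred, y_test)[0, 1])
print(len(X_train), len(X_test), round(r, 3), round(rmse, 3))
```

With real data, X would hold the band-equivalent reflectances for the chosen sensor and y one of Chl-a, TSM, or SDD, with a separate model trained per parameter and per sensor.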

To verify the effectiveness of remote sensing retrieval for Chl-a, TSM, and SDD, the retrieval results of PMS and MSI are presented in Fig. 7. On both sensors, the training and validation correlations of the retrieval models for the three water optical parameters exceeded 80%, except for the Chl-a validation results of GF1 PMS. Furthermore, the correlations for SDD in the validation dataset surpass 90%. Additionally, because the spectral response functions of the two sensors are relatively similar in the visible to near-infrared range, the difference in root mean square error (RMSE) between the retrieval results of the two sensors is minimal, within 5%. These results indicate a high accuracy in retrieving the three water optical parameters. Notably, there is a strong correlation between water optical parameters and sensor reflectance. Overall, the findings demonstrate the usability and effectiveness of both in-situ and remote sensing data in water optical parameter retrieval.

Fig. 7 The accuracy of the training and validation datasets. On the left is the validation accuracy of the GF1 PMS retrieval model, and on the right is the validation accuracy of the Sentinel-2 MSI retrieval model.

Usage Notes

Quality flags

Every record in the dataset is accompanied by a quality flag, and the interpretation of the quality flag can be found in the “Variable” file. Missing values for water optical parameters were designated as outliers.

Data update

The dataset will undergo regular updates with the project’s ongoing support, ensuring the latest information is included. Any updates to the dataset will be thoroughly documented and described in the Zenodo database, providing a comprehensive account of the changes made.