Abstract
Water quality indicators (WQI) reflect both the current state and the changing trends of water quality. Extracting these indicators from remote sensing data enables rapid and efficient inversion of water quality conditions, providing a key step in predicting water pollution. The low-precision inversion results of WQI limit the understanding of the ecological safety of water resources. In this paper, the Advanced Hyperspectral Imager (AHSI) data from the GF-5 satellite was used to analyze the Dianchi Lake. A spectrum processing method based on the Savitzky-Golay Standard Normal Variate transformation (SG-SNV) was developed, alongside optimal inversion techniques, to enhance the accuracy of predictions for 6 WQI in Dianchi, including CODcr, NH3-N, TP, TN, pH, and Chl. The results indicated that (1) For all inversion models, the determination coefficients (R2) exceeded 0.85. The Back Propagation Nondominated Sorting Genetic Algorithm-II (BP-NGA) model consistently yielded positive results for most WQI. (2) Water quality in Dianchi varied significantly by region and season. (3) It was recommended to build wetlands and ecological parks on the southwest side of Dianchi and improve sewage interception pipelines on the northeast side to lessen the risk of eutrophication by reducing the inflow of nitrogen and phosphorus.
Introduction
Lakes are essential forms of surface coverage and vital resources for the survival of humankind1. In recent decades, numerous water-related environmental challenges have emerged as a consequence of intensified global warming, urbanization, industrialization, and rapid social development2. Statistics indicate that the water quality of more than 50% of the world’s lakes and reservoirs is deteriorating, with eutrophication identified as a primary contributing factor3. China is one of the nations most adversely affected by the resulting algal blooms, particularly in Dianchi, a plateau lake experiencing significant eutrophication pressure4. Algal blooms significantly affect the natural climate, biological resources, social livelihood, and national policies5,6,7. Remote sensing technology offers a means to conduct real-time and accurate assessments of water quality on a large scale, enabling the detection of pollution sources and the timely evaluation of pollution levels. Consequently, using high-quality remote sensing data to investigate novel methods of water quality inversion can advance water quality monitoring technology, thereby providing essential technical support for enhancing national environmental management capabilities.
Although traditional manual water quality testing methods can yield accurate results, remotely sensed data more effectively meet the requirements for high-frequency and large-scale water quality monitoring8. For example, Zhu et al.9 utilized the extensive and simultaneous monitoring capabilities of remote sensing to establish baselines for the spectral indices of phycocyanin and chlorophyll-a, facilitating the identification of algal blooms and aquatic vegetation across the entire littoral zone of Taihu Lake. Their combined model achieved accuracies of 93% and 95%, respectively. In addition, spaceborne hyperspectral devices provide information that is not limited to specific points or locations, unlike handheld and aircraft-based sensors. Zhang Shouxuan et al.10 proposed extracting algal blooms using a single-band threshold approach combined with China-Brazil Earth Resources Satellite (CBERS), Moderate-resolution Imaging Spectroradiometer (MODIS), and Enhanced Thematic Mapper (ETM) data. Kong Yuxiang11 analyzed the spatial and temporal distribution of chlorophyll-a concentration in the water column utilizing CBERS data. Hyperspectral satellite imagery has been established as a critical data source supporting water quality assessment, as confirmed by researchers such as Zhang Peng12 and Zhang Zhijun13. The hyperspectral imaging of Gaofen-5 (GF-5) enables the acquisition of hundreds of spectral bands, each with a bandwidth of less than 10 nm, thereby minimizing spectral confusion. The advancements by GF-5 represent significant technological progress in the domain of water quality research, facilitating a transition from qualitative assessments to semi-quantitative or quantitative analyses14. Currently, there is a limited amount of research on the use of hyperspectral remote sensing to quantify water quality in inland plateau lakes. Liu Yiming15, Bao Qifan16 et al. conducted suspended sediment inversion studies of the Yangtze River estuary and coastal waters based on hyperspectral imagery from GF-5. Zhang Yongyong17 conducted bathymetric inversion of Hengsha, a shallow shoal area in Shanghai, based on GF-5 remote sensing data. Thus, most current studies using GF-5 data have focused on oceans and estuaries.
The indicators that reflect water quality, along with the algorithm of the mathematical model, play a crucial role in achieving effective remote sensing inversion of water quality. The physicochemical indicators commonly used for assessing water quality include potential of hydrogen, nitrogen, and phosphorus, among others18. In prior research on water quality in Dianchi, conventional band combinations19 or linear models20 were typically employed for inversion; however, these approaches often lack the sensitivity required to adequately address multiple interacting variables, resulting in errors. Machine learning algorithms, by contrast, are adept at analyzing complex multi-dimensional spectral data and can establish non-linear relationships that more accurately represent real-world conditions21,22. For instance, a study conducted by Hayder et al. in 2020 demonstrated that artificial neural network methods generally yield satisfactory results in predicting and estimating parameters in the Kelantan River Basin23. Moreover, machine learning facilitates real-time monitoring by utilizing continuously updated data, thus accommodating the dynamic characteristics of diverse water bodies. Sun et al.24 presented a near-remote sensing water quality monitoring model in 2021, aiming to enhance result accuracy by integrating near-ground hyperspectral data. Jemeljanova et al.25 found that back-propagation neural networks, support vector machines, and random forests generally exhibited superior predictive performance, with the random forest model demonstrating the highest consistency. Notably, the random forest model was employed in 66 out of 94 articles, establishing it as the machine learning model with the most reliable results.
Therefore, this study focused on Dianchi, developing inversion models based on measured data and spectral reflectance for 6 water quality indicators (WQI): chemical oxygen demand (CODcr), ammonia nitrogen (NH3-N), total phosphorus (TP), total nitrogen (TN), potential of hydrogen (pH), and chlorophyll (Chl). Two improved inversion algorithms were used: a back propagation neural network model optimized by the NSGA II genetic algorithm, and an improved support vector machine regression. These were compared with the classical random forest model. The spatial and temporal distributions of the 6 WQI obtained from the 6 most effective models were interpolated using full-area kriging to simulate the continuous transitions of WQI across various locations in real-world scenarios. Finally, field surveys were carried out in selected areas, and the experimental results were validated by combining high-resolution satellite imagery with measured data. The study could serve as a robust scientific basis for the accurate monitoring and management of water bodies in areas with karst landforms, as well as for the enhancement of the water environment in Dianchi.
Method
Study area and data collection
Dianchi Lake, known as the Pearl of the Plateau, lies at an elevation of 1886 m in the heart of the Kunming Basin in southwestern China, extending across 24°39′56″ to 25°01′32″N and 102°35′42″ to 102°47′19″E. It is flanked by DaGuan Park to the north and JinNing County to the south, stretching from the ChengGong District in the east to the foothills of XiShan in the west (Fig. 1). Dianchi, encompassing an area of 311 km2 and with a shoreline length of approximately 150 km, is the largest lake in the basin of southwest China. The lake is partitioned into two sections, Caohai (Ch) and Waihai (Wh), by a longitudinal embankment. The Wh encompasses an area of 289.065 km2, which constitutes more than 90% of the overall extent of Dianchi. As a fault basin lake, Dianchi is believed to have originated during the transition from the Mesozoic to the Cenozoic eras26. Dianchi has experienced persistent year-round eutrophication, primarily because fertilizer inputs associated with urban expansion have surpassed the organic load capacity of this unique geological formation. Changes in land use and the unpredictable impacts of precipitation have further contributed to this condition27. Additionally, due to the recurrent interplay between surface water and groundwater, Dianchi Lake harbors high concentrations of calcium28. Karst water bodies are more susceptible to pollution than other inland water bodies. Consequently, incidents of ‘cyanobacterial water bloom’ and ‘black-odorous water’ frequently occur. Concerns regarding the water quality of Dianchi have escalated in recent years, prompting the implementation of various large-scale engineering projects aimed at expeditiously improving the lake’s water quality.
Survey Region. ArcGIS 10.8 was used to create the map containing satellite imagery, with the URL: https://www.esri.com, http://www.gscloud.cn/ and https://www.mnr.gov.cn/.
Launched on May 9, 2018, the GF-5 satellite is the world’s first spacecraft to simultaneously and extensively observe the land and atmosphere. Its notable features include a wide coverage area, high resolution, and high load spectral calibration accuracy. GF-5 is equipped with an AHSI that operates in the visible and shortwave infrared regions, with a swath width of 60 km, spatial resolution of 30 m, and spectral precision of 5 nm. In comparison to conventional cameras, it offers more than a 100-fold increase in spectral range and channel number, capturing 330 spectral channels ranging from the visible to the short-wave infrared (400–2500 nm). For this study, high-quality GF-5 data with minimal cloud cover were selected for March 22, 2019, and December 16, 2019. Detailed parameters of the remotely sensed image data can be found in Table 1.
Methodological overview
The methodology employed in this study encompasses the pre-processing of hyperspectral data obtained from the GF-5 satellite, the development of a machine learning-based inversion model, and the subsequent visualization of the inversion results derived from the optimal model (Fig. 2). During the modeling phase, this research contrasts the improved support vector regression model and the improved back propagation neural network against the classical random forest algorithm, which serves as the benchmark. Taking into account the differing areas of the Ch and Wh, as well as the slower rate of water circulation between the two regions, the Ch is divided into about 11,102 grids of 30 m × 30 m and the Wh into about 35,641 grids of 90 m × 90 m to improve computational efficiency. The most effective model for each WQI was applied to the center of each grid. To mitigate spatial heterogeneity and facilitate smoother transitions, maps depicting the distribution of WQI were generated using kriging interpolation.
Preprocessing of satellite image data
Image preprocessing was carried out in ENVI software and included radiometric calibration, atmospheric correction, geometric correction, water body extraction, and visual interpretation. Radiometric calibration establishes a correlation between the satellite sensor’s associated field of view and its brightness gray-scale data29. It is imperative to eliminate strong atmospheric influences, as they often constitute more than 80% of the total signal in water remote sensing inversion30. Geometric correction was used to mitigate geometric distortions present in the remote sensing image while providing geographic and projection coordinates. In this study, the files GF5_AHSI_VNIR_Spectralresponse.raw and GF5_AHSI_VNIR_RadCal.raw were utilized as constraining factors for radiometric calibration calculations. The MODTRAN4+ radiative transfer model was employed for pixel-level atmospheric correction31. Finally, the Normalized Difference Water Index (NDWI), which is computed from the difference in reflectance between the green band and the near-infrared band, was used for the initial extraction of water bodies32. To achieve a more precise delineation of the study area, manual visual interpretation was incorporated to exclude terrestrial regions that may have been erroneously classified.
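The NDWI step can be illustrated with a minimal sketch. It assumes the commonly used form NDWI = (Green - NIR) / (Green + NIR) and a threshold of 0, neither of which is stated in the text; the function names and the toy reflectance values are hypothetical.

```python
def ndwi(green, nir):
    """Normalized Difference Water Index for one pixel (assumed standard form)."""
    total = green + nir
    if total == 0:
        return 0.0  # avoid division by zero on empty pixels
    return (green - nir) / total


def water_mask(green_band, nir_band, threshold=0.0):
    """Return True where NDWI exceeds the threshold (candidate water pixels)."""
    return [
        [ndwi(g, n) > threshold for g, n in zip(g_row, n_row)]
        for g_row, n_row in zip(green_band, nir_band)
    ]


# Toy 2 x 2 reflectance rasters (hypothetical values, not GF-5 data).
green = [[0.30, 0.10], [0.25, 0.12]]
nir = [[0.10, 0.40], [0.05, 0.36]]
mask = water_mask(green, nir)
```

Pixels flagged by such a mask would still need the manual visual check described above, since a fixed threshold can misclassify shadows or built-up areas.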
Spectral data processing
To mitigate the influence of water movement on the correspondence between WQI and spectral values at the same sampling points, we adopted an improved measurement method. A square buffer with a side length of three image pixels was applied around each measurement point. The hyperspectrum at each measurement point is represented by the mean de-trended curve remaining after the removal of anomalous curves. De-trending is a polynomial smoothing technique that eliminates baseline drifts from spectral curves exhibiting similar trends33. Spectral curves may change dramatically due to various factors, such as abrupt changes in the surrounding environment or human intervention34. These factors can compromise the model’s stability and limit its capacity to respond effectively to different situations. Consequently, it is necessary to remove noise present in the original spectra. In the present study, a combination of spectral de-noising techniques and spectral transformations was utilized. For example, the spectral curves were filtered using the Savitzky-Golay (SG) filtering method, with weights derived from least-squares fitting of a specified high-order polynomial. All peak points on the spectral curves, identified through polynomial searching, were connected. The trend lines exhibiting the highest slopes were subsequently chosen from the peak lines. These peaks were then subtracted to acquire the continuum-removed reflectance values, as shown in Eqs. (1), (2), and (3).
where \(y\) is the polynomial fitting result; \(t\) is the observation point (-n, -n+1, …, 0, …, n-1, n); \(a\) is the smoothing parameter; \(A\) is the fitting parameter determined by the least-squares method; \(Y\) is the fitting result; and \(X\) is the design matrix that contains the independent variable terms.
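As an illustration of SG smoothing, the sketch below fits a polynomial to each sliding window by least squares (solving the normal equations built from the design matrix) and keeps the fitted value at the window center. The window half-width, polynomial order, edge handling, and function names are assumptions made for this example and are not taken from the study.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def sg_smooth(y, half_window=2, order=2):
    """Savitzky-Golay smoothing; edge points are left unchanged."""
    t = list(range(-half_window, half_window + 1))
    # Design matrix X with columns t^0, t^1, ..., t^order.
    x = [[ti ** k for k in range(order + 1)] for ti in t]
    xtx = [
        [sum(row[i] * row[j] for row in x) for j in range(order + 1)]
        for i in range(order + 1)
    ]
    out = list(y)
    for c in range(half_window, len(y) - half_window):
        window = y[c - half_window:c + half_window + 1]
        xty = [
            sum(x[r][i] * window[r] for r in range(len(t)))
            for i in range(order + 1)
        ]
        coeffs = solve_linear(xtx, xty)
        out[c] = coeffs[0]  # value of the fitted polynomial at t = 0
    return out


spike = sg_smooth([0.0, 0.0, 35.0, 0.0, 0.0])
```

With a 5-point window and a quadratic fit, the center weight is 17/35, so the isolated spike of 35 is reduced to 17 while a purely quadratic signal would pass through unchanged.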
In addition, spectral data demonstrating nonlinearity or heteroscedasticity was eliminated through the utilization of the Standard Normal Variate (SNV). In order for the spectra to follow a standard normal distribution, the centralization was divided by the standard deviation, shown as Eq. (4). Furthermore, the Multiplicative Scatter Correction (MSC), resembling SNV, enhanced data stability by calculating MSC factors that corresponded to the sample wavelengths, shown as Eq. (5).
where \(n\) is the total number of samples, \(\Phi^{-1}\) is the inverse function of the standard normal cumulative distribution function, \(Z_{ij}\) is the data of the jth sample point of the ith wavelength after scale scaling, \(Y_{i}\) is the data of the ith wavelength after standard normal transformation, and \(\Phi(Z_{ij})\) is the cumulative distribution function value of the scaled data.
\(K_{i}\) is the MSC factor, \(c_{ij}\) is the MSC standard sample spectral value of the jth sample point at the ith wavelength, and \(s_{j}\) is the MSC spectral value of the jth standard sample.
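For orientation, the sketch below shows the textbook forms of the two corrections: SNV centers each spectrum and divides by its standard deviation, and MSC regresses a spectrum on a reference spectrum and removes the fitted offset and slope. These are common formulations and may differ in detail from Eqs. (4) and (5); the use of the sample standard deviation and the function names are assumptions.

```python
import math


def snv(spectrum):
    """Standard Normal Variate: center a spectrum and scale by its sample std."""
    n = len(spectrum)
    mean = sum(spectrum) / n
    variance = sum((v - mean) ** 2 for v in spectrum) / (n - 1)
    std = math.sqrt(variance)
    return [(v - mean) / std for v in spectrum]


def msc(spectrum, reference):
    """Multiplicative Scatter Correction against a reference (e.g. mean) spectrum."""
    n = len(reference)
    ref_mean = sum(reference) / n
    spec_mean = sum(spectrum) / n
    sxx = sum((r - ref_mean) ** 2 for r in reference)
    sxy = sum((r - ref_mean) * (s - spec_mean)
              for r, s in zip(reference, spectrum))
    slope = sxy / sxx                        # multiplicative scatter term
    intercept = spec_mean - slope * ref_mean  # additive offset term
    return [(s - intercept) / slope for s in spectrum]


normalized = snv([1.0, 2.0, 3.0])
corrected = msc([3.0, 5.0, 7.0, 9.0], [1.0, 2.0, 3.0, 4.0])
```

In the second call the input equals 2 × reference + 1, so MSC recovers the reference spectrum exactly.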
The partial least squares regression (PLSR) method is particularly adept at addressing issues of high correlation and collinearity in spectral data, while also identifying latent factors associated with the target variable35. Compared with other fundamental regression techniques, it is therefore better suited for evaluating the suitability of results obtained from hyperspectral data processing. In this study, the outcomes of the SG + SNV and SG + MSC techniques were compared using PLSR.
Previous studies have demonstrated the potential of utilizing a combination of spectral bands and logarithmic derivative transformed bands for pixel-by-pixel inversion36,37,38. In this study, the spectral dataset was divided into red, green, blue, and yellow bands. These bands were utilized individually or in combination to perform a range of computations, including band addition, band subtraction, first-order derivative of logarithmic bands, and the ratio of band reflectance difference to band reflectance sum. Based on the various spectral bands, a total of 25 spectral combinatorial variables (SCV) were categorized into four groups: 10 location variables (Supplementary Table A1), 4 area variables (Supplementary Table A2), 6 band combination variables (Supplementary Table A3), and 5 common vegetation indices (Supplementary Table A4). Lastly, in order to explore additional potential bands that could be used for inversion, logarithmic and derivative variations were implemented on the spectral data after de-noising process. These variations encompassed first-order derivative (1Der), second-order derivative (2Der), and third-order derivative (3Der), as well as logarithmic first-order derivative (Log + 1Der), logarithmic second-order derivative (Log + 2Der), and logarithmic third-order derivative (Log + 3Der).
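The derivative and logarithmic-derivative transformations can be sketched with central finite differences. The differencing scheme, the function names, and the toy wavelengths below are assumptions made for illustration; the study does not state how its derivatives were computed.

```python
import math


def nth_derivative(values, wavelengths, order):
    """Repeated central differences; each pass drops the two end points."""
    v, w = list(values), list(wavelengths)
    for _ in range(order):
        v = [(v[i + 1] - v[i - 1]) / (w[i + 1] - w[i - 1])
             for i in range(1, len(v) - 1)]
        w = w[1:-1]
    return w, v


def log_nth_derivative(values, wavelengths, order):
    """Derivative of the natural logarithm of reflectance (requires values > 0)."""
    return nth_derivative([math.log(x) for x in values], wavelengths, order)


def normalized_difference(band_a, band_b):
    """Ratio of the band reflectance difference to the band reflectance sum."""
    return (band_a - band_b) / (band_a + band_b)


# Toy spectrum sampled every 5 nm (hypothetical values).
wl = [400, 405, 410, 415, 420]
refl = [float(x * x) for x in wl]
wl_1der, first = nth_derivative(refl, wl, 1)
wl_2der, second = nth_derivative(refl, wl, 2)
```

Each differentiation pass shortens the spectrum by two bands, so higher-order variants cover a slightly narrower wavelength range than the input.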
To identify variables that are responsive to changes in different WQI, Pearson correlation analyses were conducted between each WQI and both the spectral data (before and after logarithmic derivative transformations) and the 25 SCV. Only those spectral band variables, spectral derivative variables, logarithmic derivative variables, and SCV that met the significance requirement (significance level P < 0.001) were selected.
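A minimal sketch of this screening step follows. It computes Pearson's r for each candidate variable and keeps those whose absolute correlation reaches a critical value. The critical value corresponding to P < 0.001 depends on the number of sampling points and would normally come from the t distribution with n - 2 degrees of freedom; here it is passed in as a hypothetical parameter, and the data are toy values.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def screen_variables(candidates, wqi, r_critical):
    """candidates[k] holds variable k at every sampling point.

    Returns (index, r) pairs whose |r| reaches r_critical.
    """
    selected = []
    for k, values in enumerate(candidates):
        r = pearson_r(values, wqi)
        if abs(r) >= r_critical:
            selected.append((k, r))
    return selected


# Toy example: three candidate bands at five sampling points.
wqi_values = [1.0, 2.0, 3.0, 4.0, 5.0]
bands = [
    [2.0, 4.0, 6.0, 8.0, 10.0],   # strongly positive
    [5.0, 1.0, 4.0, 2.0, 3.0],    # weak
    [10.0, 8.0, 6.0, 4.0, 2.0],   # strongly negative
]
kept = screen_variables(bands, wqi_values, r_critical=0.9)
```

Using the absolute value keeps both positively and negatively correlated variables, since either can carry predictive information.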
Improved support vector regression models
The Support Vector Machine (SVM) model serves as the basis for Support Vector Regression (SVR). Its distinctive feature is that losses are not computed within the data’s margin; losses are only computed when the absolute difference between f(x) and y exceeds the margin39. This implies that the function model is affected only by the losses incurred by non-boundary support vectors. Consequently, designing the widest margin also requires fulfilling the minimum loss requirement. The function is shown as Eq. (6):
where \(Z\) is the optimal model, \(w\) is the model coefficients, \(b\) is the model intercept, \(C\) is the support vector machine loss function, \(f(x)\) is the model output value, \(y\) is the real value, \(y_{i}\) is the actual value of the i-th sample, and \(x_{i}\) is the i-th sample.
The function’s margin constraints were relaxed by incorporating linear soft interval, allowing certain samples to exceed the boundary40. This ensured that every data point follows Eq. (7).
where \(\epsilon\) is the original margin, \(\zeta\) is the linear soft interval variable, and \(\forall_{i}\) is the measure of whether the i-th sample exceeds the margin.
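The margin-based loss can be made concrete with a small sketch of the standard epsilon-insensitive formulation for a linear model. This is the textbook form and is assumed rather than taken from Eqs. (6) and (7); in particular, `penalty` here plays the role usually given to C as a weight on the summed losses, and all names and numbers are illustrative.

```python
def epsilon_insensitive_loss(prediction, target, epsilon):
    """Zero inside the margin, linear outside it (the slack of the sample)."""
    return max(0.0, abs(prediction - target) - epsilon)


def svr_objective(weights, bias, samples, targets, epsilon, penalty):
    """0.5 * ||w||^2 + penalty * sum of epsilon-insensitive losses."""
    margin_term = 0.5 * sum(w * w for w in weights)
    loss_term = 0.0
    for x, y in zip(samples, targets):
        prediction = sum(w * xi for w, xi in zip(weights, x)) + bias
        loss_term += epsilon_insensitive_loss(prediction, y, epsilon)
    return margin_term + penalty * loss_term


inside = epsilon_insensitive_loss(1.0, 1.05, epsilon=0.1)   # within the margin
outside = epsilon_insensitive_loss(2.0, 1.0, epsilon=0.1)   # exceeds the margin
objective = svr_objective(
    weights=[1.0], bias=0.0,
    samples=[[1.0], [2.0]], targets=[1.0, 3.0],
    epsilon=0.5, penalty=2.0,
)
```

Only the second sample in the objective example falls outside the margin, so it alone contributes to the loss term.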
We achieved improved SVR modeling in MATLAB. The SVM predict function, utilizing the Gaussian radial basis kernel function, was employed for SVR prediction. The gamma parameter within this kernel function governed the data distribution in the novel feature space, consequently impacting both the training process and prediction efficiency. Moreover, the C parameter modulated the model’s generalization capability. We employed a grid search methodology for cross-validation, systematically evaluating each parameter combination. This process identified C = 2 and gamma = 0.0221 as the optimal parameters.
BP neural network model optimized by NSGA II genetic algorithm
The BP neural network is built upon the back propagation technique, which adjusts the weights and biases of the network in response to the error values41. The learning process of the artificial neural network is driven by gradient descent, which progressively aligns the output values of the network with the actual values (Fig. 3b). In genetic algorithms, three genetic operators are utilized: gene crossover, gene mutation, and gene selection42. Gene crossover uses various strategies, such as multipoint crossover and uniform crossover, to enhance the diversity of the population (Fig. 3c).
In this study, we proposed an enhanced BP neural network framework that utilized the NSGA II algorithm for multi-objective optimization. The aim was to improve the adaptive and global advantages of genetic algorithms in efficiently finding the optimal solution. By harnessing the global optimization capabilities of evolutionary algorithms, we could prevent local minima during training and obtain optimal initial weights and thresholds for the BP network. The algorithmic approach is illustrated in Fig. 3. The constructed BP neural network inversion model utilized reflectance data from GF-5 remote sensing satellite as input layer data. The data was split into training and testing sets with a 7:3 ratio for modeling the neural network with five hidden layers. Through 20 iterations, a crossover rate of 0.2 and a mutation probability of 0.001 were identified as the optimal genetic optimization parameters. The use of the NSGA II algorithm for multi-objective optimization resulted in a lower termination mutation rate compared to the unimproved model. Furthermore, the optimized model showed smaller gradient values than the pre-optimization ones, indicating a more stable convergence.
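The core ranking step of NSGA II, non-dominated sorting, can be sketched as follows. The example assumes that all objectives are minimized and uses two hypothetical objectives per candidate network (for instance, training error and testing error); the objectives actually optimized in the study are not specified in the text.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and better in at least one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def non_dominated_sort(points):
    """Group candidate indices into successive Pareto fronts (minimization)."""
    remaining = list(range(len(points)))
    fronts = []
    while remaining:
        front = [
            i for i in remaining
            if not any(dominates(points[j], points[i])
                       for j in remaining if j != i)
        ]
        fronts.append(front)
        remaining = [i for i in remaining if i not in front]
    return fronts


# Hypothetical (objective_1, objective_2) values for five candidates.
candidates = [(1.0, 5.0), (2.0, 2.0), (5.0, 1.0), (3.0, 3.0), (6.0, 6.0)]
fronts = non_dominated_sort(candidates)
```

Candidates in the first front are not outperformed on both objectives by any other candidate; the full algorithm additionally uses crowding distance to choose among members of the same front, which is omitted here.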
Random forest model
One of the most widely used machine learning algorithms is Random Forest (RF), which employs a recursive feature selection process to segment data from reclassified samples. The predictive reliability of RF could be enhanced by adjusting parameters such as maximum depth, halting condition, feature subset size, and splitting criterion43. In decision tree learning methods, the determination of the maximum depth threshold or the number of samples at a node was influenced by the segmentation criterion. The Gini coefficient was employed as the primary evaluation criterion for splitting decision tree nodes, with a lower Gini coefficient indicating a more effective data segmentation44. The function is defined as Eq. (8):
For a decision tree based on binary splits, it can be expressed as Eq. (9) and Eq. (10):
where \(Gini\) is the Gini coefficient, \(n\) is the number of categories of the decision tree, and \(P_{n}\) is the probability that the sample belongs to class n. \(p\) is the probability of the sample. \(D_{1}\) is the set of samples that satisfy the feature condition, and \(D_{2}\) is the set of samples that do not.
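A short sketch of the Gini criterion follows, using the common definition (one minus the sum of squared class proportions) and the size-weighted average for a binary split. The function names and labels are illustrative, and the exact notation of Eqs. (8) to (10) may differ.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_gini(d1, d2):
    """Size-weighted Gini impurity of a binary split into subsets d1 and d2."""
    n = len(d1) + len(d2)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)


mixed = gini(["a", "a", "b", "b"])                 # maximally mixed, two classes
clean_split = split_gini(["a", "a"], ["b", "b"])   # perfectly separating split
poor_split = split_gini(["a", "b"], ["a", "b"])    # split that separates nothing
```

A lower value after the split indicates a more effective segmentation, which is why the perfectly separating split scores 0.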
RF models were built using the R programming language. The optimal parameter was taken as the minimum number of decision trees required to stabilize the model error rate at a relatively low level. When the number of decision trees reached 130, the out-of-bag error was both small and stable.
Error metrics
In this paper, k-fold cross-validation is employed to assess the effectiveness of the model. The dataset is partitioned into k subsets, with one subset designated as the validation set during each iteration. The metrics obtained from the k validation processes are averaged to derive a comprehensive performance of the model. The comprehensive performance evaluation metrics include mean square error (MSE), coefficient of determination (R2), root mean square error (RMSE) and mean absolute error (MAE). R2, which measured the goodness of fit, exhibited a positive correlation with model accuracy, with higher values indicating better precision. Conversely, lower MAE, MSE, and RMSE values implied higher model accuracy and less variation between predicted and actual values. These indicators are defined as Eq. (11), Eq. (12), Eq. (13), Eq. (14):
where \(n\) is the number of samples, \(x_{i}\) is the independent variable, \(\overline{x}\) is the average of all independent variables, \(y_{i}\) is the value of measured data, \(\overline{y}\) is the average of all measured data, and \(\overline{y_{i}}\) is the predicted data.
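The four metrics and the fold partitioning can be sketched as below. The R2 shown is the common 1 - SS_res / SS_tot form, which may differ from the definition in Eq. (12), and the contiguous, unshuffled folds are an assumption made for brevity.

```python
import math


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    return math.sqrt(mse(y_true, y_pred))


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination as 1 - SS_res / SS_tot (assumed form)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def k_fold_indices(n_samples, k):
    """Yield (train, validation) index lists for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, validation
        start = stop


measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 5.0]
scores = {
    "MSE": mse(measured, predicted),
    "RMSE": rmse(measured, predicted),
    "MAE": mae(measured, predicted),
    "R2": r2(measured, predicted),
}
folds = list(k_fold_indices(5, 2))
```

In practice the metrics would be computed on each validation fold and then averaged over the k folds, as described above.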
Results
SG with SNV outperforms SG with MSC in spectral processing
The spectral reflectance profile of the Dianchi water body was derived by utilizing the regions of interest (ROIs) from the winter and spring data points. It was observed that the visible to near-infrared wavelength range exhibited the most prominent characteristics of our research subject. So, data within the wavelength range of 400 to 1000 nm were exclusively analyzed. Notably, the yellow-green spectrum corresponding to wavelengths between 560 and 580 nm showed the highest level of light reflection by water. Peaks in the red band between 690 and 710 nm, as well as between 790 and 810 nm, were attributed to the presence of algae and suspended particles in Dianchi. Furthermore, two troughs were identified between these peaks, one between 650 and 700 nm and another between 730 and 790 nm, which could be attributed to water absorption caused by carotenoids and chlorophyll.
Figure 4a shows the spectrogram for each sampling point. Figure 4b presents the plots before and after SG smoothing, SNV transformation, and MSC transformation for the data from the Baiyukou monitoring station. Figure 4c illustrates the results of 1Der, 2Der, 3Der, Log + 1Der, Log + 2Der, and Log + 3Der processing, also using data from the Baiyukou monitoring station as an example. The spectral fluctuations obtained after different treatments showed a similar trend in the range of 400–650 nm, but there was a significant difference in the range of 650–1000 nm. Comparison using the PLSR method shows that the regression outcomes of the SNV-transformed hyperspectral data are more robust and possess broader applicability (Fig. 5).
(a) Original spectrogram. (b) Comparison of spectral pre-processing. Spectrum acquired at point 10 (BaiYuKou). The blue line represents the original spectrum, the orange line represents the spectrum after SG smoothing, the red line represents the spectrum after SNV transformation based on the orange line, the green line represents the spectrum after MSC transformation based on the orange line; (c) Spectral derivative and logarithmic transformations. The spectral changes from top to bottom are first-order derivative, second-order derivative, third-order derivative, logarithmic first-order derivative, logarithmic second-order derivative, and logarithmic third-order derivative. All six spectral variations are based on preprocessed spectra at point 10.
Comparison of the effects of SNV and MSC when utilized in conjunction with SG. The red dashed line represents the degree of applicability of the spectra obtained by SG + MSC preprocessing in the 6 WQI. The black solid line represents the degree of applicability of the spectra obtained by SG + SNV preprocessing in the 6 WQI.
Development of models for 6 WQI
Screening characteristic bands
The correlation heat maps presented in Supplementary Figs. A1-A6 indicate that there is a correlation between the various WQI and the optical properties of Dianchi. The 6 WQI all exhibit more than two significant correlation bands within the original spectral wavelength range of 400 nm to 600 nm (Fig. 6). After the spectral transformations, the significant correlation bands for the 6 WQI were dispersed across the wavelength range (Fig. 7). Moreover, as the order of the derivative increases, the number of significant correlation bands diminishes. Only a limited number of correlations between the 25 SCV and the 6 WQI met the criteria, with Blue-edge-amplitude (\(D_{b}\)), Blue-edge-area (\(SD_{b}\)), and Perpendicular Vegetation Index (PVI) emerging more frequently as indicators exhibiting high correlations (Fig. 8).
Correlation diagram of original spectral. The black line represents the correlation between pH and the spectral bands after preprocessing. The red line represents the correlation between CODcr and the spectral bands after preprocessing. The blue line represents the correlation between TP and the spectral bands after preprocessing. The orange line represents the correlation between NH3-N and the spectral bands after preprocessing. The purple line represents the correlation between TN and the spectral bands after preprocessing. The green line represents the correlation between Chl and the spectral bands after preprocessing. Values above the solid line indicate a strong correlation between the band and the WQI. Values above the dashed line indicate a stronger correlation between the band and the WQI.
Correlation diagram of spectral derivatives and logarithmic transformations. (a) first-order derivative; The black line represents the correlation between pH and the spectral bands after preprocessing and first-order derivative. The red line represents the correlation between CODcr and the spectral bands after preprocessing and first-order derivative. The blue line represents the correlation between TP and the spectral bands after preprocessing and first-order derivative. The orange line represents the correlation between NH3-N and the spectral bands after preprocessing and first-order derivative. The purple line represents the correlation between TN and the spectral bands after preprocessing and first-order derivative. The green line represents the correlation between Chl and the spectral bands after preprocessing and first-order derivative. (b) second-order derivative; The WQI indicated by the line colors are the same, but the spectral bands are preprocessed and second-order derivatived. (c) third-order derivative; The WQI indicated by the line colors are the same, but the spectral bands are preprocessed and third-order derivatived. (d) logarithmic first-order derivative; The WQI indicated by the line colors are the same, but the spectral bands are preprocessed and logarithmic first-order derivatived. (e) logarithmic second-order derivative; The WQI indicated by the line colors are the same, but the spectral bands are preprocessed and logarithmic second-order derivatived. (f) logarithmic third-order derivative. The WQI indicated by the line colors are the same, but the spectral bands are preprocessed and logarithmic third-order derivatived. Values above the solid line indicate a strong correlation between the band and the WQI. Values above the dashed line indicate a stronger correlation between the band and the WQI.
Correlation diagram of 25 SCV. The black line represents the correlation between pH and the SCV obtained from preprocessed spectra. The red line represents the correlation between CODcr and the SCV obtained from preprocessed spectra. The blue line represents the correlation between TP and the SCV obtained from preprocessed spectra. The orange line represents the correlation between NH3-N and the SCV obtained from preprocessed spectra. The purple line represents the correlation between TN and the SCV obtained from preprocessed spectra. The green line represents the correlation between Chl and the SCV obtained from preprocessed spectra. Values above the solid line indicate a strong correlation between the band and the WQI. Values above the dashed line indicate a stronger correlation between the band and the WQI.
Screening the best model for each of the 6 WQI
In this study, we carried out a statistical analysis to evaluate the accuracy of WQI models constructed using three different algorithms under 7 spectral variations: 1Der, 2Der, 3Der, Log + 1Der, Log + 2Der, Log + 3Der, and SCV (Fig. 9). After eliminating overfitted models, the BP-NGA model proved suitable for the inversion of most WQI with a high degree of accuracy. Conversely, the improved SVR model was unstable across the various WQI, with a substantial gap between its highest and lowest accuracy. In contrast, the RF model remained stable. Figure 10 shows, for each WQI, the correspondence between the algorithms used in the models and the optimal hyperspectral processing methods, together with the key evaluation criterion, the maximum R2. The red curve represents the best inversion model for each WQI, and the green and yellow curves are progressively less applicable. Specific model parameters and expressions for each WQI are listed in Supplementary Table A5. For CODcr, the R2 values of the three models all exceed 0.88; however, the improved SVR model exhibits the highest MSE at 15.86. The R2 value of the optimal model for NH3-N is 0.99 and the MSE is \(2.65\times 10^{-3}\). Overall, all three inversion models exhibited relatively high accuracy for NH3-N. For TN, the minimum R2 value for the three inversion models is 0.89, and the maximum MSE is 0.65. For TP, the minimum R2 value among the three inversion models is 0.94, while the maximum MSE is \(2.70\times 10^{-4}\). For pH, the maximum MSE and the maximum R2 of the three inversion models are 0.16 and 0.92, respectively. These values indicate that pH exhibits inferior overall inversion performance in comparison to the other WQI. Lastly, for Chl, the BP-NGA model performed poorly, showing a lower R2 and a higher MSE.
In summary, the best inversion models for CODcr, NH3-N, TN, and TP were BP-NGA models built on the Log + Der2, Der3, Der1, and Der3 spectrum transformations, respectively.
Comparison of precision results based on 126 models. Gray is the accuracy of CODcr modeled based on different spectral features (first-order derivative, second-order derivative, third-order derivative, logarithmic first-order derivative, logarithmic second-order derivative, logarithmic third-order derivative and 25 SCV). Red is the accuracy of NH3-N modeled based on different spectral features. Blue is the accuracy of TP modeled based on different spectral features. Green is the accuracy of TN modeled based on different spectral features. Purple is the accuracy of pH modeled based on different spectral features. Yellow is the accuracy of Chl modeled based on different spectral features.
Optimal modeling of the 6 WQI. The red color indicates the spectral variation and modelling required for the best inversion model for each WQI. The green color indicates the spectral variation and modelling required for the second best inversion model for each WQI. The yellow color indicates the spectral variation and modelling required for the third best inversion model for each WQI.
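The screening procedure behind Figs. 9 and 10 can be sketched as below. The 126 candidates follow directly from the text (6 WQI × 3 algorithms × 7 spectral variations). The overfitting filter is an assumption: the article does not state its criterion, so this sketch hypothetically discards any model whose training R2 exceeds its test R2 by more than `max_gap`. The names `candidate_grid` and `rank_models` are illustrative only.

```python
from itertools import product

ALGORITHMS = ["BP-NGA", "improved SVR", "RF"]
TRANSFORMS = ["1Der", "2Der", "3Der",
              "Log + 1Der", "Log + 2Der", "Log + 3Der", "SCV"]
WQI = ["CODcr", "NH3-N", "TN", "TP", "pH", "Chl"]


def candidate_grid():
    """Every (WQI, algorithm, spectral variation) combination to be trained."""
    return list(product(WQI, ALGORITHMS, TRANSFORMS))


def rank_models(results, max_gap=0.1, top=3):
    """Rank candidate models per WQI by test R2 after an overfitting filter.

    `results` maps (wqi, algorithm, transform) -> (train_r2, test_r2).
    A model is dropped when train_r2 - test_r2 > max_gap; this threshold
    rule is a hypothetical stand-in for the article's unstated criterion.
    """
    ranked = {}
    for (wqi, algorithm, transform), (train_r2, test_r2) in results.items():
        if train_r2 - test_r2 > max_gap:
            continue  # treated as overfitted under the assumed rule
        ranked.setdefault(wqi, []).append((test_r2, algorithm, transform))
    # Highest test R2 first; keep the best `top` models (red, green, yellow).
    return {wqi: sorted(models, reverse=True)[:top]
            for wqi, models in ranked.items()}
```

Keeping the three best models per WQI mirrors the red, green, and yellow ranking used in Fig. 10.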
Validation of the 6 best inversion models from screening
Figure 11 illustrates the validation results of the test set for the BP-NGA model. The fitted line closely follows the Y = T line. The validation results for the RF model show strong agreement between the predicted and actual values (Fig. 12). Figure 13 presents the validation results for the improved SVR model, indicating only minor discrepancies between the predicted and actual values. These findings suggest that the BP-NGA model was effective in simulating CODcr, NH3-N, TN, and TP, that the RF model was suitable for determining pH, and that the improved SVR model could be used for simulating Chl.
BP-NGA model validation diagram: (a) modelling of CODcr; (b) modelling of NH3-N; (c) modelling of TN; (d) modelling of TP. The horizontal and vertical coordinates are the actual and predicted values of the validation set, respectively. The black solid line is the fitted line for the validation set. The red dashed line is the control line, where R2 is close to 1.
Analysis of spatial and temporal variations in Dianchi water quality and validation of results
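The quantities read off such a validation diagram can be computed as in the following sketch. It assumes that R2 is the usual coefficient of determination, that MSE is the mean squared error, and that the fitted line is an ordinary least-squares regression of predicted values (Y) on actual values (T); the article does not spell out these definitions, and the function names are hypothetical.

```python
def mse(actual, predicted):
    """Mean squared error between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r2(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def fit_line(actual, predicted):
    """Least-squares line predicted = slope * actual + intercept."""
    n = len(actual)
    mean_t, mean_y = sum(actual) / n, sum(predicted) / n
    sxx = sum((t - mean_t) ** 2 for t in actual)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(actual, predicted))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_t
```

A slope near 1 and an intercept near 0 mean that the fitted line lies close to the Y = T control line, which is the visual criterion used in the validation figures.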
The winter and spring inversion maps of Dianchi were obtained by kriging the results of the optimal inversion models (Fig. 14). The results revealed significantly lower concentrations of most WQI in Ch than in Wh. Notably, in the normalized water quality maps, TP, NH3-N, CODcr, pH and Chl displayed similar spatial patterns, suggesting that the water quality of Ch was superior to that of Wh (Figs. 15 and 16). Overall, water quality was poorer in the edge areas than in the center. For the seasonal study, the inversion results for March and December were compared. The risk of eutrophication was higher in spring than in winter, with both the extent and rate of nutrient enrichment increasing more rapidly. In the northeastern region in particular, the dominant southerly winds in spring exacerbate the likelihood of eutrophication.
CODcr concentrations decreased during the winter, particularly in the southwestern region, while new aggregation zones emerged in the northeast. The northeast and southwest coasts of Wh recorded the highest concentrations of NH3-N; these high-concentration areas gradually diminished from spring to winter, resulting in significantly reduced concentration peaks. Ch exhibited higher levels of TN than Wh. Notably high concentrations of TN were observed in the northern regions of Wh, with a declining trend from north to south, and the maximum concentrations were significantly lower during winter. In terms of TP, Ch generally exhibited low concentrations, with a few coastal locations displaying elevated levels. From spring to winter, the high-concentration area on the northeast coast of Wh shifted northward, accompanied by a rapid decline in overall TP concentration across Dianchi. On the southeastern bank of Ch, there was an arcuate zone of high pH values. Moreover, there was a significant positive correlation between pH and water temperature45. The pH in Wh decreased gradually from winter to spring. Across Ch, Chl concentrations remained consistently low. In spring, the high concentrations of Chl were primarily situated in the central region of Wh and gradually extended to the north and south.
The monitoring station observations were compared with the interpolation results for the different WQI (Fig. 17). The inversion results at most sites show good precision and strong consistency with the measured values, with R2 up to 0.8. However, some indicators deviate significantly from the true values: the R2 of pH is only 0.7, indicating that the model or the interpolation method may still need to be improved.
Discussion
In this study, optimal inversion models were developed for the 6 WQI based on BP-NGA, improved SVR, and RF, all of which demonstrated high accuracy. These models provide a practical option for policymakers, facilitating more sustainable management of water resources. We also conducted a field survey of the conditions surrounding Dianchi to evaluate how accurately the proposed predictive models forecast the various WQI. The primary potential sources of pollution in the study area were identified from the field survey (Fig. 18). Anthropogenic pollution, shoreline destruction, and low vegetation coverage all contribute to the generation of pollution sources. Anthropogenic pollution sources included pollutants carried in by rivers, tertiary industry pollutants, agricultural pollutants, and domestic pollutants. Sporadic factories and numerous waterfront buildings were present. Large-scale agricultural production and free-range animal breeding continued in areas where the “no aquaculture” law was enforced. In the absence of vegetation cover, nutrients from the land could seep into the river. The results indicate that the spatial and temporal variations in water quality in Dianchi from winter to spring are consistent with findings documented in the existing literature. Nie46 evaluated water quality pollution in Dianchi using the theory of standard curves and set-pair analysis, revealing that the levels of CODcr and TN were significantly higher in January compared to December of that year, while the level of TP was moderately elevated. Zhu et al.47 employed a geographical detector to assess the spatial variability of the lake’s ecological health, finding that water quality in the southern lake area, as well as in the northern and eastern bay areas, was considerably inferior to that in the central lake area.
Thus, the inversion model based on hyperspectral satellite data achieved effective water quality monitoring results for lakes in karstic landscapes and serves as a valuable complement to macro-scale water quality monitoring efforts.
The results of the analyses indicate that the various WQI exhibit distinct fluctuation patterns across different regions of Ch and Wh. Field surveys suggest that the inflowing rivers, specifically the PanLong River and BaoXiang River on the northern coast of Wh, contribute significantly to the CODcr burden of the Dianchi Lake basin, accounting for 44% of the total input. This finding is consistent with the observed formation of a CODcr aggregation zone in the northeastern region of Wh. Satellite imagery has identified several areas in the eastern part of Wh with minimal or no vegetation cover, rendering them susceptible to soil erosion and acting as conduits for the entry of phosphate fertilizers, pesticides, and sewage from adjacent lands. Furthermore, research conducted by Wang et al.48 indicates that rising temperatures accelerate the degradation rate of organic matter, resulting in an increased outflow of phosphorus. This observation aligns with the documented regional distribution and the spatial and temporal variation of TP in this study. Riparian rocks and soils possess the capacity to neutralize acids, thereby mitigating the influx of pollutants into Dianchi49,50. Extensive field investigations have demonstrated that the majority of the clay-based shoreline substrate in the northeastern area displays negatively charged surfaces and exhibits greater efficacy in neutralizing acidic cations compared to the rocky soil substrate along the southeastern shoreline. This is consistent with the elevated pH levels found for the southeastern shore in this study.
TP, NH3-N, CODcr, pH, and Chl exhibited substantially similar regional patterns. The findings of the study demonstrated that only TN concentrations were markedly higher in the Ch area than in the Wh area. Investigations revealed that the silt present in the shallow waters of Ch was rich in nitrogen compounds, thereby contributing to the elevated TN levels observed. Furthermore, significant external TN inputs within the northern region were primarily influenced by urbanization and the development of tertiary-industry infrastructure, such as DaGuan Park. Biological processes, particularly photosynthesis occurring in wetlands like GanGouWei Wetland, facilitated the conversion and release of nitrogen. This biological activity may account for the pronounced disparity between TN and the other WQI in their distributions across Ch and Wh. Additionally, a temporal regularity was noted among the 6 WQI, which may be attributed to increased runoff and rainfall during the spring, as well as the influences of sunlight and temperature. These environmental factors not only promoted the release of nutrients that had accumulated throughout the preceding winter but also led to a more uniform distribution of WQI. Research conducted by Guo et al.51 identified that certain cyanobacteria are capable of both fixing nitrogen and absorbing phosphorus from sediments, which can facilitate the co-production and cycling of these nutrients within aquatic ecosystems. Consequently, the observed similarity in the patterns of change among WQI may also result from the fact that these WQI share common sources. This finding aligns with the results reported by Yang et al.52.
The application of hyperspectral data in water quality inversion is garnering increasing scholarly attention, primarily due to its high spectral resolution and extensive spectral information. Rallo et al.53 indicated that technological advancements in the industrial production of hyperspectral sensors, characterized by a substantial number of contiguous spectral bands, have prompted scientists to conduct more precise analyses. The study presented in this paper demonstrates that the AHSI sensor onboard the GF-5 is capable of providing reliable spectral data for various band combinations, thereby offering improved options for model construction. However, it is important to note that the model results may not be entirely accurate, as they are derived from individual satellite images collected in March and December. They therefore do not take into account the summer and autumn seasons, nor do they consider the specific optical properties of the atmosphere, both of which may significantly influence the results.
Related studies54 have employed machine learning techniques, such as improved SVR and RF, to analyze complex spectral data. Nonetheless, the effective selection and combination of spectral bands that are appropriate for machine learning models in the context of water quality inversion continues to present a significant challenge. This study identifies that the optimal spectral variation for predicting pH and TN using the RF model is SCV, which encompasses a considerable number of yellow-green bands. However, the presence of algae and aquatic plants in the water can influence the spectral response within this spectral region. Carter et al.55 reported that leaf health can dictate the variability of spectral response in the yellow region. The spectral response is intrinsically linked to the natural environment of the study object. Consequently, the model developed in this paper may not be generalizable, and additional studies across other lakes situated in karstic landscapes are necessary to validate or refute our findings.
Due to the impact of the epidemic, this study was unable to acquire a substantial amount of measured hyperspectral data during the specified timeframe. As a result, the training sample for the model was constrained, which hindered the model’s generalizability. To enhance the applicability of the model, future research should consider integrating data from satellites with shorter imaging intervals, such as Sentinel-2, in conjunction with ground-based spectrometer data. Additionally, conducting comprehensive analyses of spectral properties, including the inclusion of additional blue light bands and the expansion of the spectral bands utilized, would improve the overall reliability of these models.
Conclusions
This study presents an improved methodology aimed at acquiring high-precision inversion outcomes based on hyperspectral data. It specifically focused on identifying the most efficient models for the inversion of the 6 WQI (CODcr, NH3-N, TP, TN, pH, and Chl) in Dianchi Lake by utilizing GF-5 AHSI data in combination with improved BP-NGA, improved SVR, and RF models. The spatial and temporal fluctuations of water quality in Dianchi were obtained based on these optimal model results. This confirms that these methodologies can be applied to the study of the intricate optical characteristics of karst lakes. The main conclusions of the research are as follows:
(1) The R2 values in this study all exceed 0.85, suggesting that the water quality inversion models developed using GF-5 data exhibit a high level of accuracy. The improved BP-NGA approach was the most widely applicable. The optimized models for each WQI are as follows: BP-NGA model with Log + Der2 for CODcr; BP-NGA model with Der3 for NH3-N and TP; BP-NGA model with Der1 for TN; RF model based on SCV for pH; improved SVR model with Log + Der1 for Chl.
(2) Water quality is generally better in winter than in spring. During the spring, the extreme values of most WQI increased by more than 10%, and the WQI of polluted areas were relatively high and widely dispersed. Conversely, during the winter, the distinctions in water quality between the coastal and central regions were significantly diminished. In addition, Dianchi Lake’s water quality exhibited spatial differences. Ch consistently demonstrated superior water quality in comparison to Wh, and the WQI in Ch displayed greater uniformity. Conversely, the coastal sectors of Wh showed greater variability, while its central regions remained stable.
(3) Urban development is more concentrated in the eastern region. During winter, there has been a noticeable rise in TP concentrations in the northeastern part, which might suggest the presence of illicit sewage discharges. Therefore, it is crucial to prioritize the improvement of interceptor drains and monitoring systems in this specific zone. Furthermore, the establishment of ecological parks and wetlands as buffer zones could serve as effective measures in mitigating pollutant input.
Data availability
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation. Please contact the corresponding author if needed.
References
McDougall, C. W., Hanley, N., Quilliam, R. S. & Oliver, D. M. Blue space exposure, health and well-being: does freshwater type matter. Landsc. Urban Plan. 224, 104446 (2022).
Wu, Z. et al. Water quality assessment based on the water quality index method in lake Poyang: the largest freshwater lake in China. Sci. Rep. 7, 17999 (2017).
Bonansea, M., Ledesma, M., Rodriguez, C. & Pinotti, L. Using new remote sensing satellites for assessing water quality in a reservoir. Hydrol. Sci. J. 64, 34–44 (2019).
Benzhan, Z. et al. Ecological consequences of cyanobacterial blooms in lakes and their countermeasures. Adv. Earth Sci. 23, 1115–1123 (2008).
Tao, F. Study on the water ecological compensation mechanism and policy recommendations of dianchi Valley. Ecol. Econ. 1, 154–158 (2010).
Liu, J. N., Cheng, H. & Gao, K. Identification and network construction of bird habitats around dianchi lake based on urban biodiversity conservation. Chin. Landsc. Archit. 38, 32–37 (2022).
Liu, Y. et al. Dynamic variation characteristics of NDVI in dianchi lake basin and its response to climate. Agricultural Eng. 11, 34–41 (2021).
He, Y., Gong, Z., Yanhui, Z. & Zhang, Y. Inland reservoir water quality inversion and eutrophication evaluation using BP neural network and remote sensing imagery: A case study of dashahe reservoir. Water 13, 2844 (2021).
Qing, Z. et al. Distinguishing cyanobacteria bloom and aquatic plants in lake Taihu based on hyperspectral imager for the coastal ocean images. Remote Sens. Technol. Appl. 31, 879–885 (2016).
Hongtao, D., Shouxuan, Z. & Yuanzhi, Z. Cyanobacteria bloom monitoring with remote sensing in lake Taihu. J. Lake Sci. 20, 145–152 (2008).
Kong, Y. X. Inversion study of Chlorophyll a concentration based on Zhuhai-1 hyperspectral remote sensing data (Shanghai Ocean University, 2024).
Zhang, P., Guo, Z. X. & Liu, Z. J. Remote sensing inversion of water quality in inland waters based on hyperspectral satellite images. Bull. Surv. Mapp. 2, 206–211 (2022).
Zhang, Z. J., Wang, R., Yao, Y., Du, C. Y. & Shen, X. Retrieval study of total suspended matter concentration in Qinghai lake based on ZY1 02D hyperspectral satellite images. Remote Sens. Technol. Appl. 38, 1159–1166 (2023).
Da, Z. Hyperspectral remote sensing and its development and application review. Opt. Optoelectron. Technol. 11, 72–78 (2013).
Liu, Y. M. et al. A neural networks based method for suspended sediment concentration retrieval from GF-5 hyperspectral images. J. Infrared Millim. Waves. 41, 323–336 (2022).
Bao, Q. F., Peng, D. Y., Bao, D. Y. & Lou, F. Research on hydrological data modeling of the Yangtze river estuary based on satellite ground simultaneous measurement. Port Waterway Eng. 166, 146–151 (2020).
Zhang, Y. Y. Inversion of shallow water depth in Hengsha based on GF5-AHSI remote sensing data. J. Mar. Sci. 40, 93–101 (2022).
Wu, H. H. et al. Study on water quality parameter inversion based on landsat 8 and measured data. Remote Sens. Technol. Appl. 36, 898–907 (2021).
Lin, H. et al. Spatial differentiation analysis of water quality in dianchi lake based on GF-5 NDVI characteristic optimization. J. Spectrosc. 1–11 (2021).
Zhang, H., Hu, W. & Jiao, Y. Water quality parameter retrieval with GF5-AHSI imagery for Dianchi Lake (China). Water 16, 225 (2024).
Rahat, S. H. et al. Remote sensing-enabled machine learning for river water quality modeling under multidimensional uncertainty. Sci. Total Environ. 898, 165504 (2023).
Cao, Z. et al. A decade-long chlorophyll-a data record in lakes across China from VIIRS observations. Remote Sens. Environ. 301, 113953 (2024).
Hayder, G., Kurniawan, I. & Mustafa, H. M. Implementation of machine learning methods for monitoring and predicting water quality parameters. Biointerface Res. Appl. Chem. 11, 9285–9295 (2020).
Sun, X., Zhang, Y., Shi, K., Zhang, Y. & Qin, B. Monitoring water quality using proximal remote sensing technology. Sci. Total Environ. 803, 149805 (2021).
Jemeļjanova, M., Kmoch, A. & Uuemaa, E. Adapting machine learning for environmental spatial data - A review. Ecol. Inf. 81, 102634 (2024).
Run-Hai, Y. & Jian-Guo, Z. Reflection data processing for base faulting in Kunming basin. J. Seismol. Res. 31, 377–381 (2008).
Zhang, H. S., Jiao, Y. M., Xu, Q. E., Zhang, Z. N. & Tao, Y. A review of land use changes and their impact on water quality in dianchi basin. Yangtze River. 54, 65–73 (2023).
Yang, W. et al. Monitoring multi-water quality of internationally important karst wetland through deep learning, multi-sensor and multi-platform remote sensing images: A case study of Guilin, China. Ecol. Indic. 154, 110755 (2023).
Xu, H. Q., Sun, F. Q. & Xu, G. Z. Cross comparison of radiance data between hyperspectral AHSI and multispectral VIMI sensors of Gaofen-5 satellite. Geomatics Inform. Sci. Wuhan Univ. 46, 1032–1043 (2021).
Cheng, L. F. et al. Mission overview of the GF-5 satellite for atmospheric parameter monitoring. Natl. Remote Sens. Bull. 25, 1917–1931 (2021).
Guobing, L. A review on detection methods of chemical oxygen demand in water bodies. Rock. Min. Anal. 32, 860–874 (2013).
McFeeters, S. K. The use of the normalized difference water index (NDWI) in the delineation of open water features. Int. J. Remote Sens. 17, 1425–1432 (1996).
Mishra, P. et al. Near-infrared hyperspectral imaging for non-destructive classification of commercial tea products. J. Food Eng. 238, 70–77 (2018).
Sankararao, A. U. G., Saikiran, K. & Rajalakshmi, P. Hyperspectral image denoising: A comparative study on uav based vegetation data. WHISPERS 13, 1–5 (2023).
Wold, S., Sjöström, M. & Eriksson, L. PLS-regression: a basic tool of chemometrics. Chemom Intell. Lab. Syst. 58, 109–130 (2001).
Zhang, P. L., Song, L. C., Wang, Y., Song, X. Q. & Gu, B. H. Establishment of inversion model for water quality parameters in typical urban rivers based on unmanned aerial vehicle multispectral data. Environ. Pollution Control. 44, 1351–1356 (2022).
Yanjun, L., Kai, X., Hailin, F. & Yiming, F. Inversion of water quality elements in small and micro-size water region using multispectral image by UAV. Acta Sci. Circum. 39, 1241–1249 (2019).
Yunfang, Z. et al. The study of inversion of chlorophyll a in Taihu based on GF-1 WFV image and BP neural network. Acta Sci. Circum. 37, 130–137 (2017).
Smola, A. & Schölkopf, B. A tutorial on support vector regression. Stat. Comput. 14, 199–222 (2004).
Kisi, O. & Parmar, K. S. Application of least square support vector machine and multivariate adaptive regression spline models in long term prediction of river water pollution. J. Hydrol. 534, 104–112 (2016).
Witaszek, J. Backpropagation: Theory, architectures, applications. Neurocomputing 9, 358–359 (1995).
Qing, L. S. et al. Evaluation of deep buried groundwater based on genetic algorithm and Bp neural network. Water Resour. Power. 37, 49–52 (2019).
Wager, S. Asymptotic theory for random forests. Eprint Arxiv. 8, 1831–1845 (2014).
Quinlan, J. R. Induction of decision trees. Mach. Learn. 1, 81–106 (1986).
Mo, M. X. et al. pH characters and influencing factors in dianchi and Xingyun lakes of Yunnan plateau. J. Agro-Environment Sci. 26, 269–273 (2007).
Nie, J. F. Analysis of nitrogen and phosphorus pollution and water quality evaluation in lakes: a case study of dianchi lake. Low Carbon World 14, 4–6 (2024).
Zhu, T. Y., Zhao, H. X., Fan, J. D., Wang, J. Q. & Gu, B. J. Differences in the spatial and temporal distribution and influencing factors in the water environment of rivers and lakes in the dianchi basin. Resour. Environ. Yangtze Basin. 32, 1305–1316 (2023).
Wang, H., Zhang, Z. J., Li, J. J. & Xu, X. Characteristics of phosphorus cycling between sediment of the wetlands and water under warming in simulated wetland habitat. Wetl. Sci. 9, 345–354 (2011).
Yun, W. Research progress in surface water acidification and study on environmental information from lake acidification. Adv. Earth Sci. 16, 421–426 (2001).
Liu, L., Cai, M., Chen, F. Z., Yang, S. Y. & Li, Y. Effects of simulated acid rain on pH in lakes with different trophic levels. J. Ecol. Rural Environ. 34, 917–923 (2018).
Guo, W., Yang, F., Li, Y. & Wang, S. New insights into the source of decadal increase in chemical oxygen demand associated with dissolved organic carbon in dianchi lake. Sci. Total Environ. 603–604, 699–708 (2017).
Yang, F., Ma, W., Chen, X., Wang, Y. F. & Wang, J. L. Identifying major influencing factors and their driving mechanisms of the abnormal pH level rise in dianchi lake. J. Changjiang River Sci. Res. 41, 75–82 (2024).
Rallo, M., Provenzano & Ciraolo. Detecting crop water status in mature Olive groves using vegetation spectral measurements. Biosyst. Eng. 128, 52–68 (2014).
Yan, X. et al. A comprehensive review of machine learning for water quality prediction over the past five years. J. Mar. Sci. Eng. 12, 159 (2024).
Carter, G. A. Responses of leaf spectral reflectance to plant stress. Am. J. Bot. 80, 239–243 (1993).
Acknowledgements
This research received funding from the High-resolution Special Project: Yunnan Provincial Government Comprehensive Governance Deep Application and Scale Industrialization Demonstration Project (89-Y50G31-9001-22/23), the Graduate Innovative Talent Training Project of Yunnan University—Quality Improvement Plan for Graduate Teaching Materials Construction (HXKC202112). Chinese border in Fig. 1 is derived from the standard map with review number GS (2019)1822, obtained from the Standard Map Service Website of the Map Technical Review Center of the Ministry of Natural Resources (https://www.mnr.gov.cn/). The base map has remained unaltered.
Author information
Contributions
Yuewen Feng: Conceptualization, Methodology, Analysis, Writing - original draft, Data curation and Software, Writing - review & editing. Jun Zhang: Methodology, Analysis, Writing - original draft. Sanjie Guo: Investigation, Writing – editing. Yunbai Zhang: Conceptualization, Methodology, Writing - review & editing. Zhongwei Zhang: Writing – editing.
Ethics declarations
Competing interests
The authors declare no competing interests.
Consent to participate
All authors confirmed their agreement to participate.
Consent to publish
All authors have read and agreed to the published version of the manuscript.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Feng, Y., Zhang, J., Guo, S. et al. High precision water quality retrieval in Dianchi Lake using Gaofen 5 data and machine learning methods. Sci Rep 15, 6760 (2025). https://doi.org/10.1038/s41598-025-91011-1
DOI: https://doi.org/10.1038/s41598-025-91011-1