Background & Summary

Lakes and reservoirs are important freshwater resources and play important roles in climate regulation, ecological function, and environmental quality1,2. On the one hand, lakes and reservoirs are crucial for maintaining the health and sustainability of ecosystems3. As key suppliers of drinking water, lakes and reservoirs offer a higher-quality and more stable source of freshwater than rivers and groundwater4,5. On the other hand, water quality, including pH, dissolved oxygen (DO), dissolved organic carbon (DOC), permanganate index (CODMn), total nitrogen (TN), total phosphorus (TP), turbidity (Tur), and electrical conductivity (EC), controls greenhouse gas (GHG) emissions in lakes and reservoirs. Water pH directly influences the solubility of carbon dioxide (CO2), and CO2 fluxes in lakes and reservoirs have been reported to show a significant negative correlation with pH6,7. Because organic matter is an important substrate for CO2 and CH4 production, its content, indicated by DOC and CODMn, is generally correlated with the emissions of CO2 and CH47,8. As important nutrients for microbes and algae, TN and TP modulate photosynthesis, respiration, methanogenesis, nitrification, and denitrification; TN and TP contents have been reported to correlate positively with CH4 and N2O emissions from lakes and reservoirs6,7,8,9,10. Higher Tur can alter light penetration, which in turn affects photosynthetic rates11. EC indicates the contents of dissolved minerals and salts12, which affect water pH and buffering capacity and thereby modulate gas solubility and emissions13. Therefore, water quality parameters are important predictors of GHG emissions in lakes and reservoirs. However, current monitoring programs are limited to hotspots and sensitive areas and fail to provide a picture of these water quality parameters at high temporal and spatial resolution for all of the roughly 180,000 lakes and reservoirs across China2,10,14,15,16.

Water quality in lakes and reservoirs is determined by the interactions of climate, soil properties, and anthropogenic activities within their respective watersheds. Global warming intensifies thermal and DO stratification of water bodies, which further modulates the processes of GHG production and consumption5,17,18,19. Rising air temperature causes water bodies to absorb more heat and warm, which lowers the saturation solubility of oxygen in water and decreases DO concentration20. Torrential rain and strong winds can alter the redox conditions and surface turbulence of the water body, which modulates respiration, methanogenesis, and nitrification-denitrification in aquatic ecosystems and enhances the water vapor exchange rate at the water surface, thereby accelerating the production and emission of GHG21,22. Soils contain essential nutrients such as nitrogen and phosphorus, which can be transferred to water bodies through runoff and leaching23. The enrichment of these nutrients in water can lead to eutrophication and associated environmental problems such as algal blooms and oxygen depletion24. Additionally, soil pH affects how efficiently plants absorb and utilize nutrients: when soil pH is out of balance, it can limit plant uptake of essential nutrients, leading to nutrient leaching and runoff, which in turn influences water quality25.

Human activities, such as population explosion, the extensive use of nitrogen fertilizers, and land use changes, are important anthropogenic factors controlling water quality parameters in lakes and reservoirs26,27. Changes in land use patterns and vegetation cover alter the content of soil organic matter, as well as its degradation rate, which further controls the lateral transformation of organic carbon, nitrogen, and phosphorus from lands to inland waters28,29,30. Compared with cropland, the dissolved organic matter in forest ecosystems is more bioavailable31,32, and even though cropland accounted for only 11.2% of the global terrestrial surface, it contributed more than 50% of the total soil erosion in 201233. Furthermore, the growing population is usually associated with increasing discharge of domestic waste and wastewater, resulting in the overload of organic carbon, nitrogen, and phosphorus into nearby lakes and reservoirs27. Agricultural nitrogen fertilization and atmospheric nitrogen deposition are the primary sources of nitrogen pollution in water bodies. Due to the excessive fertilization and low nitrogen utilization, part of nitrogen in cropland directly enters the runoffs and results in agricultural nitrogen non-point source pollution and an increase in nitrogen load in surface water34. The global annual application of nitrogen fertilizer in croplands reached 120 Tg in 2015, but more than half of the applied nitrogen was lost to the environment because the nitrogen utilization efficiency was less than 50%35. Nitrogen deposition has increased twofold over the past 100 years globally and by 60% from 1980 to 2010 in China36. 
Reactive nitrogen in the form of ammonia (NH3) and nitrogen oxides (NOx), which originates mainly from nitrogen fertilizer, livestock farming, and fossil fuel combustion, is deposited in aquatic ecosystems after a series of chemical conversions and physical transport, where it can alter biogeochemical cycling and affect nutrient levels37,38. Overall, climate, soil properties, and anthropogenic activities jointly influence water quality, so these factors are promising predictors of water quality parameters in lakes and reservoirs.

Most previous water quality research has concentrated on the comprehensive determination of the River Water Quality Index (WQI)39,40. This method is particularly helpful for rivers, as it simplifies complex water quality data into an understandable metric for management and policy-making. By contrast, water quality data for lakes and reservoirs are mainly produced by in-situ observation, remote sensing, or mathematical simulation41,42, and these approaches have several limitations: (1) most mathematical simulation studies use linear regression models, which struggle to capture the variation of water quality parameters that are characterized by nonlinearity and randomness43; (2) most studies focus only on the water quality characteristics of specific regions, for instance the lakes in the middle and lower reaches of the Yangtze River, so high-resolution nationwide datasets are largely unavailable44,45,46; (3) water quality datasets with high spatial and temporal resolution are hard to obtain because in-situ observations require a huge investment of labor and money43; (4) remote sensing can only monitor optically active parameters, such as DOC and algae contents, and cannot produce national datasets of pH, DO, and EC.

In this study, we aim to produce high resolution datasets of water quality parameters (pH, DO, DOC, CODMn, TN, TP, Tur, and EC) in nearly 180,000 lakes and reservoirs in China from 2000 to 2023. Machine learning was used to interpret the influences of climate, soil properties, and anthropogenic activities on water quality parameters in lakes and reservoirs, and further to produce monthly water quality datasets of the whole country. The results of this study will not only deepen our understanding of how climate, soil properties, and human activities affect the water quality of lakes and reservoirs, but also provide valuable datasets for assessing their impact on greenhouse gas emissions and evaluating the conditions of lakes and reservoirs through the Water Quality Evaluation System47,48.

Methods

Study area

China’s inland water system was divided into 6 regions according to their hydrological, geomorphological, and climatic characteristics (Fig. 1), namely Greater Pearl, Yangtze, Huang-Huai-Hai, Northeast China (NE China), Northwest China (NW China), and Tibetan Plateau45. Rivers in different regions influenced the formation, characteristics, and ecological health of lakes and reservoirs by affecting the water supply, water quality, sediment deposition, and ecological connectivity of water bodies. The Yangtze River and the Pearl River were the first and fourth largest rivers in China, respectively, and both were located in the humid and warm subtropical monsoon climate zone. The Huang-Huai-Hai region included 3 major rivers, namely the Yellow River, the Huaihe River, and the Haihe River, and belonged to the semi-arid and semi-humid zone. The last three regions were located in northern China and had a relatively dry and cold climate. In general, the five regions other than the Tibetan Plateau mostly had low to medium elevations, with terrain dominated by plains, basins, and hills, whereas the Tibetan Plateau had elevations above 3000 m and an extremely cold plateau climate49.

Fig. 1

Geographical distribution of lakes and reservoirs in China, as well as the China National Environmental Monitoring Center (CNEMC) and DOC observing stations used in this study.

Lakes and reservoirs in China were numerous and diverse. Natural lakes covered a total area of approximately 80,000 km2, including about 2,800 lakes larger than 1 km2 each, and China currently had nearly 100,000 reservoirs with a total water surface area of about 50,000 km2 (Fig. 1)44,50,51. The distribution density of lakes and reservoirs varied significantly among the 6 regions, influenced primarily by terrain and climate44. Lakes were mostly distributed in the NW China, NE China, Tibetan Plateau, and Huang-Huai-Hai regions, while reservoirs were mostly distributed in the Greater Pearl and Yangtze regions (Fig. 1). The Greater Pearl and Yangtze regions, particularly the middle and lower reaches of the Yangtze River, had the largest clusters of freshwater lakes and reservoirs in China, and these regions were characterized by a large demand for agricultural irrigation51. By contrast, lakes on the Tibetan Plateau were mostly densely distributed inland saline lakes fed by mountain snow and glacier meltwater, which were significantly affected by global warming44. The distribution of lakes in NW China was highly uneven, with approximately 40% of the lakes concentrated in the area surrounding the Tianshan Mountains, where meltwater from high-mountain ice and snow served as a crucial replenishment source52. Lakes in NE China were mainly distributed in plains dominated by agriculture, where there was a significant demand for agricultural irrigation53. Reservoirs in the Huang-Huai-Hai region were mainly located in urban areas and served as the major urban water sources51.

Driving dataset

To predict the water quality of lakes and reservoirs nationwide, we used gridded datasets encompassing climate, soil properties, and anthropogenic activities as input features to build the machine learning model (Table 1). Monthly climate data (including 2 m temperature, precipitation, 10 m wind speed, and surface solar radiation) was from ECMWF Reanalysis v5 (ERA5), with a spatial resolution about 28 km × 28 km54. Annual content of soil organic carbon was obtained from the Global Carbon Budget 2023, with a spatial resolution of about 55 km × 55 km55. Annual land cover data was derived from the China Land Cover Dataset (CLCD), with a spatial resolution of 30 m × 30 m56. Monthly soil respiration data was simulated using Integrated Biosphere Simulator (IBIS) model with a spatial resolution of about 10 km × 10 km57. Soil properties (including soil pH, soil total phosphorus (soil TP), and soil total nitrogen (soil TN)) was obtained from the China High-Resolution National Soil Information Grid Basic Attribute Dataset (2010–2018), with a spatial resolution of 90 m × 90 m58. Monthly Normalized Difference Vegetation Index (NDVI) data was acquired from The Terra Moderate Resolution Imaging Spectroradiometer (MODIS) Vegetation Indices (MOD13A3) Version 6, featuring a 1 km × 1 km spatial resolution59. Annual population distribution data was obtained from the LandScan dataset provided by Oak Ridge National Laboratory (ORNL), with a 1 km × 1 km spatial resolution60. Annual nitrogen fertilizer application data was sourced from the global crop-specific nitrogen fertilization dataset, with an approximate spatial resolution of 10 km × 10 km61. Monthly nitrogen deposition data was obtained from the Chemistry Climate Model Initiative (CCMI) (2000–2014), with a spatial resolution of approximately 250 km × 250 km62. For the period 2015–2025, monthly nitrogen deposition data was available at every five years with a spatial resolution of approximately 250 km × 250 km. 
These data were simulated using the GEOS-Chem model63, which incorporated NOx emissions under the SSP3-7.0 scenario and NH3 emissions under the BAU scenario64,65. We calculated the monthly total nitrogen deposition for 2015, 2020, and 2025, and then linearly interpolated these totals to derive the monthly total nitrogen deposition for 2016–2019 and 2021–2023. The interpolated totals were distributed to the corresponding grid points following the spatial distribution patterns of 2015 (for 2016–2019) and 2020 (for 2021–2023).
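The interpolation and redistribution of nitrogen deposition described above can be sketched as follows. This is a minimal illustration only, assuming that the monthly national totals for the anchor years are already available; the function names and all numbers are hypothetical and are not taken from the dataset.

```python
def interpolate_total(anchors: dict[int, float], year: int) -> float:
    """Linearly interpolate a monthly total for `year` between anchor years."""
    if year in anchors:
        return anchors[year]
    years = sorted(anchors)
    for lo, hi in zip(years, years[1:]):
        if lo < year < hi:
            frac = (year - lo) / (hi - lo)
            return anchors[lo] + frac * (anchors[hi] - anchors[lo])
    raise ValueError(f"{year} lies outside the anchor years {years}")


def distribute_total(total: float, pattern: list[float]) -> list[float]:
    """Spread a total over grid cells in proportion to a reference pattern."""
    pattern_sum = sum(pattern)
    if pattern_sum <= 0:
        raise ValueError("reference pattern must have a positive sum")
    return [total * value / pattern_sum for value in pattern]


# Hypothetical July totals (arbitrary units) for the three anchor years.
july_totals = {2015: 10.0, 2020: 20.0, 2025: 30.0}
# Hypothetical 2015 spatial pattern over two grid cells.
pattern_2015 = [1.0, 3.0]

total_2017 = interpolate_total(july_totals, 2017)
grid_2017 = distribute_total(total_2017, pattern_2015)
```

Scaling a fixed reference pattern keeps the spatial structure of the reference year while letting the total follow the interpolated value.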

Table 1 Information of climate, soil properties, and anthropogenic activities for machine learning in this study (T2m, 2 m temperature; Precip, precipitation; Si10, 10 m wind speed; Ssrd, surface solar radiation downwards.).

Water quality data

Water quality data for lakes and reservoirs in China, including pH, DO, TP, TN, CODMn, Tur, and EC, were sourced from 217 national automatic surface water quality monitoring stations managed by the China National Environmental Monitoring Center (CNEMC) (https://www.cnemc.cn/sssj/), of which 127 stations were in lakes and 90 in reservoirs (Fig. 1). These data were collected every 4 hours from 00:00 to 24:00 on monitoring days, giving a total of 666,836 monitoring records from January 1, 2021, to December 31, 2022. We also obtained 1,326 DOC records from 83 lake and reservoir observation stations between October 2017 and June 2022. The spatial information of lakes and reservoirs in China was derived from the Global Lakes Dataset (GLAKES) and the China Reservoir Dataset (CRD), respectively50,51. In total, 82,759 lakes and 97,414 reservoirs were extracted from GLAKES and CRD. The base map for the watershed division of each lake or reservoir came from HydroBASINS, which contained geometric attributes such as sub-basin area, distance from upstream headwaters and ocean outlets, and coding information for river basin identification and classification66.

Watershed division

The watershed of each lake and reservoir was determined from the geospatial datasets CRD, GLAKES, and HydroBASINS level 12 (lev12). The CRD integrated the geographic coordinates and maximum water surface of existing reservoirs in China from the Georeferenced global Dams And Reservoirs (GeoDAR), the GlObal geOreferenced Database of Dams (GOODD), and the Global Reservoir and Dam (GRanD) datasets51. Overlaps between GLAKES and CRD were removed to avoid double counting. The HydroBASINS lev12 dataset encompassed 1.0 million individual sub-basin polygons66. The complete watershed of each lake or reservoir was delineated by identifying its nearest sub-basin in HydroBASINS lev12 and then tracing and merging the upstream sub-basins based on the HYBAS_ID.
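The upstream tracing step can be illustrated with a short sketch. It assumes that each sub-basin record carries its own HYBAS_ID together with the HydroBASINS NEXT_DOWN attribute (the HYBAS_ID of the next downstream polygon, 0 at an outlet); whether the original workflow used this exact field is an assumption, and the toy network below is invented for illustration.

```python
from collections import defaultdict, deque


def upstream_basins(next_down: dict[int, int], start_id: int) -> set[int]:
    """Return start_id plus every sub-basin that drains into it.

    next_down maps each HYBAS_ID to the HYBAS_ID of its downstream
    neighbour (0 means the basin drains to an outlet).
    """
    # Invert the downstream links so each basin lists its direct upstream basins.
    direct_upstream = defaultdict(list)
    for basin_id, down_id in next_down.items():
        direct_upstream[down_id].append(basin_id)

    # Breadth-first search upstream from the basin nearest the lake or reservoir.
    watershed = {start_id}
    queue = deque([start_id])
    while queue:
        current = queue.popleft()
        for parent in direct_upstream[current]:
            if parent not in watershed:
                watershed.add(parent)
                queue.append(parent)
    return watershed


# Hypothetical network: basins 1 and 2 drain to 3; 3 and 5 drain to 4; 4 is the outlet.
toy_links = {1: 3, 2: 3, 3: 4, 5: 4, 4: 0}
lake_watershed = upstream_basins(toy_links, 3)
```

The polygons whose IDs are returned would then be merged (dissolved) into a single watershed geometry in a GIS step that is not shown here.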

Data processing

Monthly averages of pH, DO, CODMn, TN, TP, Tur, and EC from the 217 CNEMC stations and of DOC from the 83 observing stations were calculated, and the station latitudes and longitudes were corrected using the coordinate picker of Google Maps (https://www.google.com/maps). To assess the impacts of climate, soil properties, and anthropogenic activities on pH, DO, DOC, CODMn, TN, TP, Tur, and EC in lakes and reservoirs, we calculated the inverse distance weighted (IDW) value of each driving variable within each watershed using the following equations67,68:

$$w_i=\frac{1}{d_i^{p}}$$
(1)
$$d_i=\sqrt{(x-x_i)^{2}+(y-y_i)^{2}}$$
(2)
$$\mathrm{IDW}=\frac{\sum_{i=1}^{N}w_i z_i}{\sum_{i=1}^{N}w_i}$$
(3)

Here, wi is the weight of grid cell i within a watershed; p is the power exponent, usually set to 2; di is the distance between grid cell i and the lake or reservoir to which it belongs; x and y denote the longitude and latitude of the lake or reservoir, respectively, while xi and yi are the longitude and latitude of grid cell i; zi is the value of the climate, soil property, or anthropogenic activity variable in grid cell i; and N is the number of grid cells in the watershed.
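Equations (1)–(3) can be written as a few lines of code. The sketch below is illustrative only: the function name, the handling of a grid cell that coincides with the lake location, and the sample values are assumptions, and distances are Euclidean in longitude and latitude as in Eq. (2).

```python
import math


def idw_value(lake_xy, cells, p=2.0):
    """Inverse distance weighted mean of grid-cell values for one watershed.

    lake_xy: (x, y) longitude and latitude of the lake or reservoir.
    cells:   iterable of (x_i, y_i, z_i) tuples for grid cells in the watershed.
    p:       power exponent of Eq. (1).
    """
    x, y = lake_xy
    weighted_sum = 0.0
    weight_total = 0.0
    for xi, yi, zi in cells:
        d = math.hypot(x - xi, y - yi)  # Eq. (2)
        if d == 0.0:
            # Assumption: Eq. (1) is undefined at zero distance, so a cell
            # located exactly at the lake supplies the value directly.
            return zi
        w = 1.0 / d ** p  # Eq. (1)
        weighted_sum += w * zi
        weight_total += w
    if weight_total == 0.0:
        raise ValueError("no grid cells supplied")
    return weighted_sum / weight_total  # Eq. (3)


# Hypothetical watershed with two grid cells at distances 1 and 2 from the lake.
example = idw_value((0.0, 0.0), [(1.0, 0.0, 10.0), (2.0, 0.0, 20.0)])
```

With p = 2 the nearer cell receives four times the weight of the farther one, so the result is pulled toward the value closest to the water body.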

Model construction and verification

Random Forest (RF) is an ensemble learning technique that copes well with noise and outliers in the input data and improves predictive performance by aggregating the outputs of multiple decision trees69. In this approach, each decision tree is trained on a randomly selected subset of the input data drawn by bootstrap resampling, which improves the robustness and stability of the final RF model and reduces the risk of overfitting70. Therefore, we used RF models to predict the monthly values of pH, DO, DOC, TN, TP, CODMn, Tur, and EC in lakes and reservoirs from 16 features describing climate, soil properties, and anthropogenic activities in their watersheds (Fig. 2). Modelling was performed using the scikit-learn package in a virtual environment (Python 3.11.0).

Fig. 2

Flow chart of monthly water quality data production for lakes and reservoirs in China. The abbreviations of the input data name were explained as follows: CSoil, Soil organic carbon; RSoil, Soil respiration rate; Napp, nitrogen application; Ndep, nitrogen deposition; Pop, population; Cropland, Forest, and Grassland, represented the proportion of cropland, forest, and grassland in land use types, respectively.

To capture the monthly patterns of water quality parameters, we constructed a separate RF model for each month for pH, DO, TN, TP, CODMn, Tur, and EC. For DOC, only one RF model was constructed for the whole year because DOC observations were limited. We traversed all possible combinations of the 16 climate, soil property, and anthropogenic activity features and ran 100 iterations for each feature combination. In each iteration, the data were resampled so that 70% were used for training and 30% were withheld for testing, which guaranteed the stability of the model. The RF models for pH, DO, TN, TP, CODMn, Tur, and EC were constructed from the feature combinations with the highest average R2 over the 100 iterations.
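The feature-combination search described above could look roughly like the following scikit-learn sketch. It is not the authors' code: the function name, hyperparameters, and synthetic data are assumptions, and the toy run uses 3 features and 3 repeats rather than 16 features and 100 iterations so that it finishes quickly.

```python
from itertools import combinations

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split


def best_feature_subset(X, y, names, n_repeats=100, n_trees=100):
    """Return the feature subset with the highest mean test R2."""
    best_names, best_r2 = None, -np.inf
    for k in range(1, len(names) + 1):
        for idx in combinations(range(len(names)), k):
            cols = list(idx)
            scores = []
            for seed in range(n_repeats):
                # A fresh random 70/30 split for every repeat.
                X_tr, X_te, y_tr, y_te = train_test_split(
                    X[:, cols], y, test_size=0.3, random_state=seed
                )
                model = RandomForestRegressor(
                    n_estimators=n_trees, random_state=seed
                )
                model.fit(X_tr, y_tr)
                scores.append(r2_score(y_te, model.predict(X_te)))
            mean_r2 = float(np.mean(scores))
            if mean_r2 > best_r2:
                best_names = [names[i] for i in cols]
                best_r2 = mean_r2
    return best_names, best_r2


# Synthetic example: only the first (hypothetically named) feature drives y.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(120, 3))
y_toy = 3.0 * X_toy[:, 0] + rng.normal(scale=0.1, size=120)
toy_names = ["T2m", "Precip", "Pop"]

best_names, best_r2 = best_feature_subset(
    X_toy, y_toy, toy_names, n_repeats=3, n_trees=20
)
```

An exhaustive search over 16 features covers 2^16 − 1 = 65,535 non-empty subsets, each fitted 100 times per parameter and month, so in practice the loop over subsets would need to be parallelized or run in batches.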

Observed datasets of pH, DO, TN, TP, CODMn, Tur and EC from 217 CNEMC stations in 2021 were used to train their RF models, and datasets in 2022 were used to evaluate the generalization capability and predictive efficacy of these constructed models. Due to the limited DOC observations, RF model for DOC was trained and evaluated using all the observations from 2017 to 2022. Finally, monthly variations of pH, DO, DOC, TN, TP, CODMn, Tur and EC for lakes and reservoirs in China during 2000 to 2023 were predicted using constructed RF models and calculated IDWs of climate, soil properties, and anthropogenic activities in their watersheds.

Model evaluation index

The determination coefficient (R2), root mean square error (RMSE), and mean absolute error (MAE) were used to evaluate the performance of regression models. The closer the value of R2 to 1, the better the explanatory ability of the model. Both RMSE and MAE were expressed in the same units as the output variables, and the lower the RMSE and MAE, the higher the model prediction accuracy. The equations for these parameters were as follows:

$$R^{2}=1-\frac{\sum_{i=1}^{n}(y_i-\hat{y}_i)^{2}}{\sum_{i=1}^{n}(y_i-\bar{y})^{2}}$$
(4)
$$\mathrm{RMSE}=\sqrt{\frac{\sum_{i=1}^{n}(\hat{y}_i-y_i)^{2}}{n}}$$
(5)
$$\mathrm{MAE}=\frac{1}{n}\sum_{i=1}^{n}\left|\hat{y}_i-y_i\right|$$
(6)
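As a worked illustration of Eqs. (4)–(6), the three indices can be computed directly from paired observed and predicted values. The sketch assumes the usual reading of the symbols (observed value, predicted value, mean of the observed values, and number of samples); the function name and the sample numbers are made up.

```python
import math


def regression_metrics(observed, predicted):
    """Compute R2, RMSE and MAE following Eqs. (4)-(6)."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(observed)
    mean_obs = sum(observed) / n
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0.0:
        raise ValueError("R2 is undefined when all observed values are equal")
    return {
        "R2": 1.0 - ss_res / ss_tot,  # Eq. (4)
        "RMSE": math.sqrt(ss_res / n),  # Eq. (5)
        "MAE": sum(abs(p - o) for o, p in zip(observed, predicted)) / n,  # Eq. (6)
    }


# Hypothetical example: one of four predictions is off by 1 unit.
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0])
```

For this example the residual sum of squares is 1 and the total sum of squares is 5, which gives R2 = 0.8, RMSE = 0.5, and MAE = 0.25.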

Data Records

The data are available at https://doi.org/10.6084/m9.figshare.27626286.v2 (ref. 71). They are formatted as CSV files, each containing the ID number, latitude and longitude coordinates, and the monthly averages of pH, DO, DOC, TN, TP, CODMn, Tur, and EC for the respective lakes and reservoirs from 2000 to 2023. The files are named according to the water body type, year, and month.

Technical Validation

Model evaluation index

The computation and extraction of various datasets, the coding of machine learning models and the graph rendering of related results were all completed by using the software Visual Studio Code (Version 1.88, Microsoft, United States) and ArcMap (Version 10.8, Esri, United States).

The performance of the RF models for pH, DO, DOC, TN, TP, CODMn, Tur, and EC is shown in Table 2. All the RF models performed well, with high R2 and low RMSE and MAE. The average R2 values of all the RF models were higher than 0.64; the RF models for EC had the highest R2, followed by those for DO and pH. The average RMSE and MAE values of all the RF models were lower than 4.56 and 3.55, respectively. The RF models for EC showed the lowest RMSE and MAE, followed by those for TP and pH.

Table 2 Performance of RF models for EC, DO, DOC, pH, CODMn, TN, TP, and Tur.

The agreement between the predicted and observed values of the RF models is depicted in Fig. 3. Each subplot contains data for the four quarters, except for DOC, for which all the data are shown together. Overall, the predictions agreed well with the observed values for all the parameters (Fig. 3). In particular, the predicted values of DOC, DO, pH, and EC were very close to the observed values, and most of the predictions were tightly distributed around the 1:1 line, which indicated the high accuracy of these model predictions (Fig. 3b–d,h).

Fig. 3

Scatter plots of predicted values versus observed values.

Overall, validation against observations from CNEMC stations in 2022 showed that all the RF models had good stability and strong generalization ability, with RMSE and MAE lower than 1.52 and 1.11, respectively.

Monthly variations in water quality parameters

The monthly variations of pH, DO, DOC, TN, TP, CODMn, Tur, and EC are shown in Fig. 4. The pH ranged from 7.7 to 8.2 and was slightly higher in June, but showed little seasonal fluctuation (Fig. 4a). Consistent with our results, previous studies reported that the proliferation of algae under eutrophic conditions, along with enhanced photosynthesis and respiration, can slightly increase the pH of lakes by shifting the carbonate equilibrium in surface water during the humid summer season72,73. The DOC concentration increased from 9.2 mg L−1 in January to 13.1 mg L−1 in August and then declined to 8.9 mg L−1 in December (Fig. 4b). Similarly, the CODMn concentration reached a maximum of 5.1 mg L−1 in September and a minimum of 2.8 mg L−1 in November (Fig. 4e). The seasonal variations of DOC were affected not only by endogenous processes, namely the production and release of DOC through phytoplankton photosynthesis, but also by exogenous inputs such as surface runoff and river inflows74. Thus, the DOC concentration increased during the period of biological growth and rainfall75,76. Meanwhile, CODMn concentration was positively correlated with water temperature and organic pollutants, and the death and degradation of algae in summer was an important potential reason for the increase in CODMn concentration from June to September45. DO concentrations ranged from 7.9 to 10.4 mg L−1 and showed an obvious seasonal fluctuation from January to December (Fig. 4c). Water temperature dominated the monthly variation of DO concentration, with which it was negatively correlated77,78. Tur was significantly higher in summer, reaching 4.4 JTU in July, and was lowest in February at 0.7 JTU (Fig. 4d). The monthly EC ranged from 0.05 to 0.07 S m−1, with the highest value in May and the lowest in December (Fig. 4h).
Strong precipitation in summer transferred a significant amount of dissolved ions into lakes and reservoirs through runoffs, which further increased the values of EC12. Simultaneously, the increase in particulate matter carried by runoffs and the disturbance of sediment by wind and current during rainy season led to the increase of Tur in lakes and reservoirs11.

Fig. 4

Monthly variations of pH (a), DOC (b), DO (c), Tur (d), CODMn (e), TN (f), TP (g) and EC (h) in Chinese lakes and reservoirs during the 2010s simulated by the RF models.

TN concentrations were higher from September to February, with a maximum of 2.4 mg L−1, and lower from March to August, with a minimum of 1.4 mg L−1 (Fig. 4f). TP concentrations were higher from June to October, with a maximum of 0.08 mg L−1, and reached a minimum of 0.03 mg L−1 in December (Fig. 4g). The decrease in TN concentration from April to August was usually related to the dilution of accumulated nitrogen by runoff and to the increase in water temperature, which promoted microbial denitrification as well as the growth and propagation of nitrogen-consuming algae27,46. The metabolites of these biological activities also changed the pH of the water and promoted the release of phosphorus from the sediment, resulting in increased phosphorus content in the overlying water46,79.

Spatial profile of water quality parameters

The spatial profiles of the water quality parameters in lakes and reservoirs in China differed markedly from one another (Fig. 5). The pH of lakes and reservoirs followed a distinct regional pattern, increasing gradually from southeast to northwest and ranging from 7.2 to 8.7 (Fig. 5a). The concentrations of DOC and CODMn ranged nationally from 5.0 to 26.7 mg L−1 and from 1.0 to 7.0 mg L−1, respectively, and both were higher in the Tibetan Plateau, NW China, NE China, and Yangtze regions than in most areas of Huang-Huai-Hai and Greater Pearl (Fig. 5b,e). By contrast, DO was lower (7.5 to 8.5 mg L−1) in Greater Pearl, the southern part of Yangtze, and NW China than in other regions (8.5 to 10.5 mg L−1) (Fig. 5c). Nationally, Tur ranged from 0 to 5 JTU, and it was generally higher than 3 JTU in the Tibetan Plateau and the eastern part of the Huang-Huai-Hai and Yangtze regions (Fig. 5d). TN concentrations in NE China, Huang-Huai-Hai, and the western part of Greater Pearl ranged from 3.0 to 3.8 mg L−1, while they were usually less than 2.5 mg L−1 in NW China, the Tibetan Plateau, and most parts of Greater Pearl and Yangtze (Fig. 5f). The TP concentration was above 0.09 mg L−1 in most areas of the Tibetan Plateau, the western part of NE China, and the junction of the Huang-Huai-Hai and Yangtze regions, while it was less than 0.05 mg L−1 in most other regions (Fig. 5g). EC in Yangtze and Greater Pearl (<0.03 S m−1) was less than one third of that in the other four regions (>0.09 S m−1) (Fig. 5h).

Fig. 5

Spatial profiles of pH (a), DOC (b), DO (c), Tur (d), CODMn (e), TN (f), TP (g) and EC (h) in Chinese lakes and reservoirs during the 2010s simulated by RF models.

Most lakes on the Tibetan Plateau were enclosed saline lakes that continuously received pollutants from external sources and were therefore prone to the accumulation of nutrients. Salinity was the dominant factor affecting the abundance and biomass of phytoplankton communities and was positively and linearly related to organic carbon sources and soluble ion concentrations58,80,81,82,83. These interconnected factors collectively explained the high concentrations of DOC, CODMn, TP, and EC in the lakes of the Tibetan Plateau (Fig. 5b,e,g,h). Because of the arid and semi-arid environment created by water scarcity and intense surface evaporation, there were also many internally draining saline lakes in NW China84. Field observations showed that these saline lakes tended to be highly eutrophic, with an average DOC concentration approximately 5 times that of freshwater lakes44. Therefore, under the influence of high salinity and organic matter content, NW China showed higher concentrations of DOC, CODMn, and EC (Fig. 5b,e,h). In NW China, northern Huang-Huai-Hai, and western NE China, soils usually contained carbonate or heavy phosphate, which was easily lost to water and raised the pH through hydrolysis72. Owing to soil characteristics, vegetation cover, and agricultural management activities, water bodies in the agricultural regions of eastern NE China and the red soil areas of Yangtze and Greater Pearl exhibited low pH values72,85,86.

Excessive nutrient input from human activities increased the risk of eutrophication in lakes and reservoirs. For example, livestock manure discharge and excessive fertilization in NE China and the Huang-Huai-Hai plain led to a high nitrogen emission density, and this nitrogen was transferred to lakes and reservoirs by leaching and runoff, which increased the TN loads of water bodies24,87. Driven by population growth, urbanization, and intensive agricultural production at the junction of the Huang-Huai-Hai and Yangtze regions, large amounts of nitrogen and phosphorus were discharged into lakes and reservoirs along with domestic sewage and industrial and agricultural wastewater, resulting in eutrophication with high concentrations of DOC, TN, and TP in the water bodies44,79,88. Algal blooms, residual aquatic plant debris, and re-suspended sediments, coupled with the accumulation of feed residues and excrement from aquaculture, collectively led to elevated Tur in the lakes and reservoirs at the junction of the Huang-Huai-Hai and the middle and lower Yangtze regions88,89.

Annual variations in water quality parameters

Fig. 6 presents the annual variations of pH, DO, DOC, TN, TP, CODMn, Tur, and EC in Chinese lakes and reservoirs from 2000 to 2023. Over this period, the annual averages of Tur and EC showed significant temporal trends (slope k = 0.0042, p < 0.05 for Tur; k = −0.0001, p < 0.05 for EC) (Fig. 6d and h). Meanwhile, pH declined from 7.99 in 2000 to 7.97 in 2023 with a slope of −0.0015 (p < 0.05) (Fig. 6a). No significant changes were found for DOC, TN, and TP (p > 0.05) (Fig. 6b,f and g).

Fig. 6

Annual variations of pH (a), DOC (b), DO (c), Tur (d), CODMn (e), TN (f), TP (g) and EC (h) in Chinese lakes and reservoirs from 2000 to 2023.

Anthropogenic discharge was one of the main external factors that changed EC12; for example, a significant amount of dissolved salts was lost from agricultural soils to water bodies because of heavy fertilizer application90. Climate change affected EC in lakes and reservoirs in two ways: (1) extreme precipitation led to soil erosion and intensified the input of soluble ions through runoff; (2) extreme drought increased the concentration of certain ions91,92. The CODMn, TN, and TP concentrations of lakes and reservoirs in China decreased slightly from 2000 to 2023, which was consistent with previously reported results93. Excessive application of chemical fertilizers led to soil acidification, which in turn allowed dissolved acidic ions to enter water bodies through surface runoff and leaching, thereby lowering their pH94,95. Diminished atmospheric deposition of NH3 resulted in elevated deposition of acidic compounds, which intensified the occurrence of acid rain and consequently lowered the pH of aquatic environments96. The reduction of aquatic plants and the outbreaks of algal blooms caused by increased water temperature had shifted shallow water bodies from clear to turbid states97. The increase in extreme precipitation events, which rapidly washed out material from watersheds, raised benthic activity and Tur98. Additionally, anthropogenic activities such as deforestation, reclamation, and urban construction led to soil erosion, which in turn increased Tur92. Meanwhile, Fig. 7 presents the changes in pH, DOC, DO, Tur, CODMn, TN, TP, and EC in lakes and reservoirs in China from 2000 to 2023.

Fig. 7

The trends in pH (a), DOC (b), DO (c), Tur (d), CODMn (e), TN (f), TP (g), and EC (h) of lakes and reservoirs in China from 2000 to 2023. Slopes were evaluated by constructing regression curves based on the annual mean values of each water body.
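The slope calculation mentioned in the caption can be sketched for a single water body as below. The sketch assumes that the regression is an ordinary least-squares straight line fitted to annual means against year, which the text does not state explicitly; the function name and the annual values are invented for illustration, and no significance test is included.

```python
def ols_slope(years, values):
    """Least-squares slope of annual mean values against year."""
    if len(years) != len(values) or len(years) < 2:
        raise ValueError("need at least two paired observations")
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, values))
    return sxy / sxx


# Hypothetical annual means for one lake, declining by 0.002 units per year.
years = list(range(2000, 2024))
annual_means = [8.0 - 0.002 * (year - 2000) for year in years]
slope = ols_slope(years, annual_means)
```

Applying the same function to every lake and reservoir yields one slope per water body, which is the kind of quantity a trend map displays.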

Limitation analysis

In this study, we built RF models to interpret the influences of climate, soil properties, and anthropogenic activities on water quality parameters (pH, DOC, DO, Tur, CODMn, TN, TP, and EC) in lakes and reservoirs, and used these models to develop a monthly dataset of these parameters from 2000 to 2023. This dataset offers high-resolution, long-term information on the environmental characteristics of nearly 180,000 lakes and reservoirs across China, bridging the current gap between limited observations and the strong demand for water quality parameters in aquatic ecosystems. However, some limitations remain to be addressed in future studies: (1) no long-term, high-resolution dataset of soil properties is currently available other than the China High-Resolution National Soil Information Grid Basic Attribute Dataset (2010–2018); (2) fertilizer application is an important anthropogenic activity influencing water quality in lakes and reservoirs, but the data are currently available only annually rather than monthly; (3) few CNEMC observation stations are located on the Tibetan Plateau, which may limit the accuracy of the RF models in this region. In the future, efforts should be made to develop long-term datasets of soil properties and fertilizer application with high temporal resolution, and to build more CNEMC observation stations on the Tibetan Plateau, in order to improve the accuracy of the predicted pH, DOC, DO, Tur, CODMn, TN, TP, and EC in lakes and reservoirs.