Abstract
River discharge information is crucial for many applications, but in India the measurement process is often impeded by factors that hinder near real-time (NRT) data availability. Telemetry-based water surface elevation (WSE) data available across the country offer an opportunity to fill this gap, because WSE can be converted into river discharge using rating curves (RCs) derived from historical collocated measurements of WSE and discharge. In this study, NRT WSE is obtained from the Central Water Commission (CWC) flood portal using a web-scraping tool. By applying RCs, discharge data are extended across 210 gauging stations in India from 2020 to the present, giving sub-daily discharge series during the non-monsoon period and hourly discharge series during the monsoon (June-September). The study generates over 800,000 discharge data points for Indian rivers each year, or more than 4000 discharge values per station annually. These datasets support water resource and flood management research by offering NRT access to WSE and discharge, along with the local RCs.
Background & Summary
Floods are increasing in severity, duration, and frequency. The World Bank reports that flood early warning systems in developing countries could save an average of 23,000 lives per year1. Flood forecasting systems exist in different parts of the world, with many national agencies taking responsibility for issuing early warnings. Because the severity of floods cannot be overstated, it is imperative that agencies evaluate their predictions, and an important step towards this is archiving near real-time (NRT) river discharge information. NRT discharge measurements remain indispensable for managing flood and drought risk2. Unfortunately, many parts of the world do not have long-term river discharge observations at application-specific latency. In addition, existing observational networks have severe spatiotemporal gaps in river discharge observations, which pose a serious barrier to early warning of hydrological extremes such as floods. An uneven distribution of gauges compounds the problem, making NRT discharge measurements impossible to obtain for many countries3.
Delayed and incomplete gauge records have serious consequences, because the absence of NRT discharge observations often restricts the availability of the ensuing hydrological products at the required latency. A way forward is to combine incomplete in-situ gauge observations with satellite observations, which offer a complementary source of surface water levels and extents. Because satellites do not measure discharge directly, the literature presents several empirical relationships that relate discharge to satellite-derived water levels and river width4,5,6,7,8,9,10,11,12. Existing studies have also focused on generating rating curves (RCs) that represent an empirical relationship between river width, water surface elevation (WSE), and river discharge4,5,6,7. However, for the larger rivers flowing through transboundary river basins (Ganga, Indus, Brahmaputra), do the existing works satisfy the latency requirement of NRT discharge observations?
India is among the regions of Asia most exposed to flooding, yet it is underrepresented in the global flood database. Across India, the Central Water Commission (CWC) is the primary authority for river discharge monitoring and data availability; it oversees 1543 monitoring stations, including approximately 960 telemetry-based stations, in various river basins13. Based on data accessibility14, rivers in India are classified into two groups: unclassified rivers and classified rivers (Fig. 1). Notably, the classified rivers encompass major transboundary river basins such as the Ganga, Brahmaputra, and Indus. To support water resource management, CWC has established the Water Resources Information System (hereafter WRIS) portal (https://indiawris.gov.in), which offers daily historical WSE and discharge data for unclassified Indian rivers. However, the data are available only up to 2020 and contain a significant number of gaps.
In addition to the historical data, the CWC provides graphical representations of sub-daily WSE data (hourly during the monsoon season) for monitoring stations across the entire river network (classified and unclassified) through its flood monitoring and flood dissemination portal (hereafter CWC flood portal) (https://ffs.india-water.gov.in/). These graphs show the variation of WSE over 7-day moving windows and provide first-hand information on river dynamics. Unfortunately, the graphical data cannot be downloaded or ingested by modelling or analysis tools without human intervention, which hampers their integration by the hydrological modelling community. In the present work, we use web-scraping as an automated means of extracting data from the CWC flood portal and collecting it in a structured format15, which can greatly benefit research.
Web-scraping uses software to simulate human browsing behaviour. Very few studies demonstrate web-scraping for real-world environmental and water resources applications16. For example, Beran et al.17 harnessed web-scraping to extract data from different repositories (USGS’ National Water Information System, EPA’s Storage and Retrieval System) without human intervention and provided standardized access to hydrological data for machine-to-machine communication. Bonifacio et al.18 presented a climate data-scraping tool, the Canadian Climate Data Scraping Tool (CCDST), to simplify analysis and enhance access to climate data from Canada’s National Climate Data and Information Archive (NCDIA). Skoulikaris and Krestenitis16 developed a cloud data-scraping tool for extracting hydrological data and estimating the downstream water discharge (outflows from a dammed river) over the Aliakmonas river. Budiarti et al.19 developed an application to retrieve water quality data for a drinking water treatment plant from monitoring networks in the city of Surabaya. Web-scraping thus remains a potent yet underexplored technique for enhancing data availability in water resource applications.
Through this work, we introduce GUARDIAN, a framework to examine the potential of web-scraping to improve the latency of NRT discharge observations for India. Additionally, the framework converts the WSE information to streamflow through local rating curves (RCs)5. In this study, we rely on the largest known dataset of publicly available river gauge data over India from WRIS India and CWC flood portal to demonstrate three major themes namely: (1) Accessing historical to NRT WSE data using web-scraping, (2) Establishing RCs using historical WSE and discharge data, and (3) Generating an extended historical to NRT river discharge data over India.
Methods
An overview of the methodology is presented in Fig. 2. The method has three major parts: (1) extracting historical to NRT WSE data from the CWC flood portal, (2) generating RCs for different stations over India using historical WSE and collocated discharge obtained from the WRIS India portal, and (3) generating NRT discharge for stations on Indian rivers using the extracted WSE and the RCs.
Accessing historical to near real-time WSE data
The CWC flood portal provides real-time (RT) or NRT measured WSE data for more than 1300 gauging stations. As discussed earlier, the portal presents the data graphically in seven-day moving windows. Web-scraping makes it possible to navigate the CWC flood portal, access the relevant web pages, and extract the desired WSE data, which can then be stored in a structured format for further analysis and integration with other datasets. In the current study, a web-scraping framework is developed using BeautifulSoup, a Python library for parsing HTML and XML documents. The framework sends HTTP requests with headers constructed to emulate human-like browsing behaviour, processes the server’s responses, and handles potential errors20. It extracts both historical and NRT WSE data into individual station-wise Excel files. The collected information forms the basis for the subsequent generation of discharge data. This study demonstrates data extraction from 1365 stations spread across India (Fig. 3).
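The request, parse, and archive steps described above might be sketched roughly as follows. This is a minimal illustration that uses only the Python standard library so that it runs without extra packages, whereas the study itself uses BeautifulSoup. The header values, the two-column HTML table layout, the sample page, and all function names are assumptions made for illustration; the actual page structure of the portal is not described here.

```python
import urllib.request
from html.parser import HTMLParser

# Hypothetical header set meant to resemble an ordinary browser session.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)",
    "Accept": "text/html,application/json",
}


def fetch(url, timeout=30):
    """Send one GET request with browser-like headers; return text, or None on failure."""
    request = urllib.request.Request(url, headers=HEADERS)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except OSError as err:  # URLError and socket timeouts are OSError subclasses
        print(f"request failed for {url}: {err}")
        return None


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], None, False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr":
            if self._row:  # header rows without <td> cells are skipped
                self.rows.append(self._row)
            self._row = None


def parse_wse(html):
    """Return {timestamp: WSE in metres} from an assumed two-column table."""
    parser = CellCollector()
    parser.feed(html)
    return {row[0]: float(row[1]) for row in parser.rows if len(row) == 2}


def merge_windows(archive, window):
    """Fold one 7-day window into a station archive; overlapping stamps are overwritten."""
    merged = {**archive, **window}
    return dict(sorted(merged.items()))  # ISO timestamps sort chronologically


# Made-up sample page, not real portal output.
SAMPLE = """<table><tr><th>time</th><th>wse</th></tr>
<tr><td>2024-07-01T08:00</td><td>101.25</td></tr>
<tr><td>2024-07-01T09:00</td><td>101.40</td></tr></table>"""

archive = merge_windows({"2024-07-01T08:00": 101.25}, parse_wse(SAMPLE))
print(archive)
```

Because the portal exposes only a seven-day moving window, repeated runs overlap in time; keying the archive by timestamp is one simple way to build a continuous station record without duplicates.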
Generation of RC
The historical WSE and discharge data from 1970 to 2020 are obtained from the WRIS portal. To ensure data quality, outliers are removed using a threshold of two standard deviations from the mean. For RC generation, the most recent data, from 2000 to 2020, are used to better represent the current stage-discharge relationship. If recent data are not available, historical data from 1970 to 1999 are used instead. The data are divided into calibration (75%) and validation (25%) datasets, and the entire dataset is later used to recalibrate the algorithm. Four different RC algorithms (Eqs. 1 to 4) are evaluated for both monsoon and non-monsoon periods to estimate discharge dynamics accurately. The RC algorithm with the highest Nash-Sutcliffe Efficiency (NSE) is then selected. Finally, only RCs with NSE exceeding 0.6 are used to generate the extended discharge21.
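The workflow above (outlier removal, a 75/25 split, fitting candidate forms, selection by NSE, and recalibration on the full dataset) can be illustrated with the following sketch. It covers only the polynomial candidates (second degree and cubic) and is not the authors' implementation. Several choices are assumptions: the outlier filter is applied to discharge, every fourth pair is held out for validation, and the data are synthetic. The function names are hypothetical.

```python
import statistics


def drop_outliers(pairs, n_std=2.0):
    """Keep (wse, q) pairs whose discharge lies within n_std std devs of the mean."""
    q = [p[1] for p in pairs]
    mean, std = statistics.fmean(q), statistics.pstdev(q)
    return [p for p in pairs if abs(p[1] - mean) <= n_std * std]


def polyfit(h, q, degree):
    """Least-squares polynomial coefficients (lowest order first) via normal equations."""
    n = degree + 1
    # Augmented matrix [A^T A | A^T q] for the Vandermonde matrix A.
    m = [[sum(x ** (i + j) for x in h) for j in range(n)]
         + [sum(y * x ** i for x, y in zip(h, q))] for i in range(n)]
    for col in range(n):  # Gaussian elimination with partial pivoting
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            m[r] = [a - f * b for a, b in zip(m[r], m[col])]
    coef = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * coef[j] for j in range(i + 1, n))
        coef[i] = (m[i][n] - s) / m[i][i]
    return coef


def predict(coef, x):
    """Evaluate the fitted polynomial at stage x."""
    return sum(c * x ** k for k, c in enumerate(coef))


def nse(obs, sim):
    """Nash-Sutcliffe efficiency of simulated against observed discharge."""
    mean = statistics.fmean(obs)
    sse = sum((o - s) ** 2 for o, s in zip(obs, sim))
    return 1.0 - sse / sum((o - mean) ** 2 for o in obs)


def select_rc(pairs, degrees=(2, 3), threshold=0.6):
    """Return (degree, validation NSE, coefficients), or None if NSE <= threshold."""
    pairs = drop_outliers(pairs)
    calib = [p for i, p in enumerate(pairs) if i % 4 != 3]  # 75% (assumed split)
    valid = pairs[3::4]                                     # 25%
    best = None
    for degree in degrees:
        coef = polyfit([h for h, _ in calib], [q for _, q in calib], degree)
        score = nse([q for _, q in valid], [predict(coef, h) for h, _ in valid])
        if best is None or score > best[1]:
            best = (degree, score)
    if best[1] <= threshold:
        return None
    # Recalibrate the selected form on the full cleaned dataset.
    final = polyfit([h for h, _ in pairs], [q for _, q in pairs], best[0])
    return best[0], best[1], final


# Synthetic stage-discharge pairs (q = 2 + 3h + 0.5h^2) plus one gross outlier.
data = [(h / 10, 2 + 3 * (h / 10) + 0.5 * (h / 10) ** 2) for h in range(10, 51)]
data.insert(20, (3.0, 500.0))
result = select_rc(data)
print(result)
```

In practice a library routine such as `numpy.polyfit` would be the more usual choice for the fit; the hand-written solver is used here only to keep the sketch dependency-free. The study also applies this procedure separately to monsoon and non-monsoon data.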
A total of 210 valid RCs have been obtained across the Indian region, and discharge is extended accordingly (Fig. 3).
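Once a valid RC exists for a station, extending discharge amounts to evaluating the curve at each scraped WSE value, using the monsoon or non-monsoon curve according to the timestamp. The sketch below assumes polynomial RCs stored as coefficient lists (lowest order first) and assumes that WSE is fed to the curve directly; the coefficients, stage values, and function names are hypothetical and for illustration only.

```python
from datetime import datetime

MONSOON_MONTHS = {6, 7, 8, 9}  # June-September, the monsoon period used in this study


def evaluate_rc(coef, wse):
    """Evaluate a polynomial RC (coefficients lowest order first) with Horner's rule."""
    q = 0.0
    for c in reversed(coef):
        q = q * wse + c
    return q


def extend_discharge(wse_series, rc_by_season):
    """Map {ISO timestamp: WSE} to {ISO timestamp: discharge} using the seasonal RC."""
    out = {}
    for stamp, wse in wse_series.items():
        month = datetime.fromisoformat(stamp).month
        season = "monsoon" if month in MONSOON_MONTHS else "non_monsoon"
        out[stamp] = evaluate_rc(rc_by_season[season], wse)
    return out


# Hypothetical coefficients and stages, not taken from the published RC file.
rc = {"monsoon": [0.0, 10.0, 2.0], "non_monsoon": [0.0, 5.0, 1.0]}
wse = {"2024-07-01T08:00": 3.0, "2024-01-15T08:00": 2.0}
discharge = extend_discharge(wse, rc)
print(discharge)
```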
Data Records
The extended discharge data product is available at figshare22. The repository contains the extended WSE, the extended discharge, and the RCs used to generate the extended discharge. The extended WSE and extended discharge data are stored in separate folders in the compressed file data.rar, with each file named after its station. The RCs are provided in the file RC.xlsx, which lists the station name, RC algorithm used, RC coefficients, root mean square error (RMSE), and Nash-Sutcliffe efficiency (NSE).
Technical Validation
Stage-discharge RCs for extending river discharge are generated for more than 250 stations using historical gauge measurements of WSE and discharge. Across the Indian region, the second-degree polynomial (Eq. 1) and cubic (Eq. 4) relations demonstrate the highest accuracy, characterized by low RMSE and maximum NSE, while the exponential relation (Eq. 2) shows poor accuracy.
In the current study, stations with a Nash-Sutcliffe Efficiency (NSE) of more than 0.6 are considered valid21 for discharge data generation. A total of 210 stations meet this NSE threshold.
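For reference, the two skill scores used in this section follow their standard definitions. The symbols are chosen here for illustration: Q_obs,i is the observed discharge, Q_sim,i is the RC-estimated discharge, the overbar denotes the mean of the observations, and n is the number of paired values.

```latex
\mathrm{NSE} = 1 - \frac{\sum_{i=1}^{n} \left(Q_{\mathrm{obs},i} - Q_{\mathrm{sim},i}\right)^{2}}
                        {\sum_{i=1}^{n} \left(Q_{\mathrm{obs},i} - \bar{Q}_{\mathrm{obs}}\right)^{2}},
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(Q_{\mathrm{obs},i} - Q_{\mathrm{sim},i}\right)^{2}}
```

An NSE of 1 indicates a perfect match, while an NSE of 0 means the RC performs no better than the mean of the observations. RMSE carries the units of discharge.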
Figure 4 presents the NSE and root mean square error (RMSE) for the monsoon and non-monsoon seasons for all 210 stations, together with pie charts showing the share of stations in each NSE and RMSE range for both seasons.
Figure 4a,b,c indicate that NSE ranges from 0.6 to 0.99 across stations over India for monsoon and non-monsoon discharge. For the RCs used for monsoon discharge estimation from WSE data, 3.3% of stations (7) show NSE values between 0.6 and 0.7, 8.5% (18) between 0.7 and 0.8, and 16.1% (34) between 0.8 and 0.9, while a significant majority, approximately 71.9% (151), present an NSE exceeding 0.9. For the RCs used for non-monsoon discharge estimation, 6.1% of stations (13) show NSE values between 0.6 and 0.7, 12.3% (26) between 0.7 and 0.8, and 30% (63) between 0.8 and 0.9, while just over half, approximately 51.4% (108), present an NSE exceeding 0.9.
Figure 4d,e,f indicate that RMSE ranges from less than 5 m³ s⁻¹ to 1147 m³ s⁻¹ across stations over India. For the RCs used for monsoon discharge estimation from WSE data, 6.6% of stations (14) show RMSE values of more than 300 m³ s⁻¹, 7.1% (15) between 200 and 300 m³ s⁻¹, and 16.6% (35) between 100 and 200 m³ s⁻¹, while a significant majority, approximately 69.5% (146), present an RMSE of less than 100 m³ s⁻¹. For the RCs used for non-monsoon discharge estimation, 1.9% of stations (4) show RMSE values of more than 300 m³ s⁻¹, 1.9% (4) between 200 and 300 m³ s⁻¹, and 8% (17) between 100 and 200 m³ s⁻¹, while approximately 88% (185) present an RMSE of less than 100 m³ s⁻¹.
Figure 5a,b illustrate the spatial distribution of NSE for monsoon and non-monsoon discharge estimation for all 210 stations, while Fig. 5c,d show the corresponding spatial distribution of RMSE. The figure highlights that most stations exhibit NSE values exceeding 0.9 and RMSE values below 100 m³ s⁻¹ for the RCs used for discharge estimation from WSE data, and that the stations are distributed across the unclassified basins within India.
Usage Notes
In this study, real-time WSE is extracted by web-scraping from the CWC flood portal for more than 1200 stations, and discharge is generated for more than 200 stations; both are available in tabular format in the figshare repository22. The extended WSE and discharge data can be visualized and downloaded at http://indiariverflow.com/. Possible uses of the data22 include the following:
1. WSE and discharge data can support real-time monitoring of rivers for better management.
2. Users can generate their own discharge series by deriving local RCs and applying them to the provided WSE series.
3. The provided discharge and WSE data can be harnessed for flood frequency analysis.
4. The generated data can be used to calibrate and validate hydrodynamic and hydrological models.
5. The extended WSE data will be useful for validating the Surface Water and Ocean Topography (SWOT) mission, launched in December 2022, particularly the SWOT high-resolution WSE point cloud and node-based WSE data over the Indian region. Additionally, the generated discharge data can serve as a valuable resource for estimating local flow law parameters (FLP)23, facilitating precise discharge estimation from SWOT across Indian rivers.
Code availability
The code to extract WSE data from the CWC flood portal is written in Python and is available in the GitHub repository (https://github.com/girishpatidar/discharge_india).
References
Hallegatte, S. A cost effective solution to reduce disaster losses in developing countries: hydro-meteorological services, early warning, and evacuation. http://documents.worldbank.org/curated/en/190261468181486694/A-cost-effective-solution-to-reduce-disaster-losses-in-developing-countries-hydro-meteorological-services-early-warning-and-evacuation (2012).
Sheffield, J. et al. Satellite Remote Sensing for Water Resources Management: Potential for Supporting Sustainable Development in Data-Poor Regions. Water Resour. Res. 54, 9724–9758 (2018).
Hou, J., van Dijk, A. I. J. M. & Beck, H. E. Global satellite-based river gauging and the influence of river morphology on its application. Remote Sens. Environ. 239, 111629 (2020).
Riggs, R. M. et al. Extending global river gauge records using satellite observations. Environ. Res. Lett. 18, 064027 (2023).
Paris, A. et al. Stage-discharge rating curves based on satellite altimetry and modeled discharge in the Amazon basin. Water Resour. Res. 52, 3787–3814 (2016).
Kouraev, A. V., Zakharova, E. A., Samain, O., Mognard, N. M. & Cazenave, A. Ob’ river discharge from TOPEX/Poseidon satellite altimetry (1992-2002). Remote Sens. Environ. 93, 238–245 (2004).
Tarpanelli, A., Paris, A., Sichangi, A. W., O’Loughlin, F. & Papa, F. Water Resources in Africa: The Role of Earth Observation Data and Hydrodynamic Modeling to Derive River Discharge. Surv. Geophys. 44, 97–122 (2022).
Bjerklie, D. M., Dingman, S. L., Vorosmarty, C. J., Bolster, C. H. & Congalton, R. G. Evaluating the potential for measuring river discharge from space. J. Hydrol. 278, 17–38 (2003).
Papa, F. et al. Ganga-Brahmaputra river discharge from Jason-2 radar altimetry: An update to the long-term satellite-derived estimates of continental freshwater forcing flux into the Bay of Bengal. J. Geophys. Res. Ocean. 117 (2012).
Sridevi, T., Sharma, R., Mehra, P. & Prasad, K. V. S. R. Estimating discharge from the Godavari River using ENVISAT, Jason-2, and SARAL/AltiKa radar altimeters. Remote Sens. Lett. 7, 348–357 (2016).
Pan, F., Wang, C. & Xi, X. Constructing river stage-discharge rating curves using remotely sensed river cross-sectional inundation areas and river bathymetry. J. Hydrol. 540, 670–687 (2016).
Scherer, D., Schwatke, C., Dettmering, D. & Seitz, F. Long-Term Discharge Estimation for the Lower Mississippi River Using Satellite Altimetry and Remote Sensing Images. Remote Sensing vol. 12 (2020).
Ministry of Jal Shakti. Central Water Commission Annual Report 2021 - 2022. 1–366 (2022).
CWC. Hydrometeorological Data Dissemination Policy. 18 (2018).
Dogucu, M. & Çetinkaya-Rundel, M. Web Scraping in the Statistics and Data Science Curriculum: Challenges and Opportunities. J. Stat. Data Sci. Educ. 29, S112–S122 (2021).
Skoulikaris, C. & Krestenitis, Y. Cloud Data Scraping for the Assessment of Outflows from Dammed Rivers in the EU. A Case Study in South Eastern Europe. Sustain. 12, 7926 (2020).
Beran, B., Valentine, D., Zaslavsky, I., Goodall, J. & Piasecki, M. Standardizing access to hydrologic data repositories through web services. Proc. Int. Conf. Adv. Geogr. Inf. Syst. Web Serv. GEOWS 64–67, https://doi.org/10.1109/GEOWS.2009.27 (2009).
Bonifacio, C., Barchyn, T. E., Hugenholtz, C. H. & Kienzle, S. W. CCDST: A free Canadian climate data scraping tool. Comput. Geosci. 75, 13–16 (2015).
Budiarti, R. P. N., Widyatmoko, N., Hariadi, M. & Purnomo, M. H. Web scraping for automated water quality monitoring system: A case study of PDAM Surabaya. Proceeding - 2016 Int. Semin. Intell. Technol. Its Appl. ISITIA 2016 Recent Trends Intell. Comput. Technol. Sustain. Energy 641–648, https://doi.org/10.1109/ISITIA.2016.7828735 (2017).
Richardson, L. Beautiful Soup Documentation Release 4.4.0. Media.Readthedocs.Org 1–72 (2019).
Moriasi, D. N. et al. Model Evaluation Guidelines for Systematic Quantification of Accuracy in Watershed Simulations. Trans. ASABE 50, 885–900 (2007).
Patidar, G., Indu, J. & Karmakar, S. ExtendinG SUb-DAily River Discharge data over INdia (GUARDIAN). Figshare https://doi.org/10.6084/m9.figshare.27004282 (2024).
Durand, M. et al. A Framework for Estimating Global River Discharge From the Surface Water and Ocean Topography Satellite Mission. Water Resour. Res. 59 (2023).
Allen, G. H. & Pavelsky, T. M. Global extent of rivers and streams. Science 361, 585–588 (2018).
Acknowledgements
The authors express their gratitude and acknowledge the support of DST Centre of Excellence in Climate Studies, IIT Bombay, under project DST/CCP/CoE/140/2018 (G) and CEFIPRA CSRP Project 7000-W-1. Special thanks are extended to Ritik Badiya for providing insights into web-scraping techniques. The authors also acknowledge the Central Water Commission (CWC), Ministry of Jal Shakti, Government of India, and the Indian Space Research Organization (ISRO), Department of Space, Government of India, for providing historical water surface elevation (WSE) and discharge datasets on the India-WRIS portal (https://indiawris.gov.in/), and the Central Water Commission (CWC), Ministry of Jal Shakti, Government of India, for WSE data on the CWC flood portal (https://ffs.india-water.gov.in/#/).
Author information
Contributions
G.P., J.I. and S.K. designed the study. G.P. conducted the analysis. All authors wrote the first draft of the manuscript and revised it.
Ethics declarations
Competing interests
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Patidar, G., Indu, J. & Karmakar, S. ExtendinG SUb-DAily River Discharge data over INdia (GUARDIAN). Sci Data 11, 1155 (2024). https://doi.org/10.1038/s41597-024-03923-8
DOI: https://doi.org/10.1038/s41597-024-03923-8







