Background & Summary

Floods are increasing in severity, duration, and frequency. The World Bank reports that flood early warning systems in developing countries could save an average of 23,000 lives per year1. Flood forecasting systems exist in different parts of the world, with many national agencies taking responsibility for issuing early warnings. Given that the severity of floods cannot be overstated, it is imperative that agencies evaluate their predictions, and an important step towards this is archiving near real-time (NRT) river discharge information. NRT discharge measurements remain indispensable for managing flood and drought risk2. Unfortunately, many parts of the world simply do not have long-term river discharge observations at application-specific latency. In addition, existing observational networks suffer from severe spatial and temporal gaps in river discharge observations, which pose a serious barrier to early warning of hydrological extremes such as floods. The problem is compounded by an uneven distribution of gauges, which makes NRT discharge measurements impossible to obtain for many countries3.

Delayed and incomplete gauge records have serious consequences, as the absence of NRT discharge observations often restricts the availability of downstream hydrological products at the required latency. A way forward would be to combine incomplete in-situ gauge observations with satellite observations, which offer a complementary source of surface water levels and extents. As satellites do not directly measure discharge, the literature presents several empirical relationships that relate discharge to satellite-derived water levels and river width4,5,6,7,8,9,10,11,12. Existing studies have also focussed on generating rating curves (RCs) that represent an empirical relationship between river width, water surface elevation (WSE), and river discharge4,5,6,7. However, for the larger rivers flowing through transboundary river basins (Ganga, Indus, Brahmaputra), do the existing works satisfy the latency requirement of NRT discharge observations?

In Asia, India is one of the regions most exposed to flooding, yet it is underrepresented in the global flood database. Across India, the Central Water Commission (CWC) is the primary authority for river discharge monitoring and data availability; it oversees 1543 monitoring stations, approximately 960 of which are telemetry-based, in various river basins13. Based on data accessibility14, rivers in India are classified into two groups: unclassified rivers and classified rivers (Fig. 1). Notably, the classified rivers encompass major transboundary river basins such as the Ganga, Brahmaputra, and Indus. CWC has established the Water Resources Information System (hereafter WRIS) (https://indiawris.gov.in) portal to support water resource management, which offers daily historical WSE and discharge data for unclassified Indian rivers. However, the data are only available up to the year 2020 and contain a significant number of gaps.

Fig. 1

Indian river basins with the river network from the Global River Width from Landsat (GRWL)24 data, with classified (red) and unclassified (green) zones as per ref. 14.

In addition to the historical data, the CWC provides graphical representations of sub-daily WSE data (hourly during monsoon seasons) for monitoring stations across the entire river network (classified and unclassified) on its flood monitoring and flood dissemination portal (hereafter: CWC flood portal) (https://ffs.india-water.gov.in/). These graphs show the variation of WSE over 7-day moving windows, providing first-hand information for understanding river dynamics. Unfortunately, the graphical data cannot be downloaded or ingested by modelling or analysis tools without human intervention, which hampers their seamless integration by the hydrological modelling community. In the present work, we use web-scraping as an automated means of extracting data from the CWC flood portal, which can greatly support research by collecting the data automatically into a structured format15.

Web-scraping uses software to simulate human browsing behaviour. Very few studies demonstrate web-scraping for real-world environmental and water resources applications16. For example, ref. 17 harnessed web-scraping to extract data from different repositories (USGS’ National Water Information System, EPA’s Storage and Retrieval System) without human intervention and provided standardized access to hydrological data for machine-to-machine communication. Ref. 18 presented a new climate data-scraping tool, the Canadian Climate Data Scraping Tool (CCDST), to simplify analysis and enhance access to climate data from Canada’s National Climate Data and Information Archive (NCDIA). Ref. 16 developed a cloud data-scraping tool for extracting hydrological data and estimating the downstream water discharge (outflows from a dammed river) over the Aliakmonas river. Ref. 19 developed an application to retrieve water quality data for a drinking water treatment plant from monitoring networks in Surabaya city. Web-scraping thus stands as a potent yet underexplored technique for enhancing data availability in water resource applications.

Through this work, we introduce GUARDIAN, a framework to examine the potential of web-scraping to improve the latency of NRT discharge observations for India. Additionally, the framework converts the WSE information to streamflow through local RCs5. In this study, we rely on the largest known dataset of publicly available river gauge data over India, from WRIS India and the CWC flood portal, to demonstrate three major themes: (1) accessing historical to NRT WSE data using web-scraping, (2) establishing RCs using historical WSE and discharge data, and (3) generating an extended historical to NRT river discharge dataset over India.

Methods

An overview of the methodology is presented in Fig. 2. The method has three major parts: (1) extracting historical to NRT WSE data from the CWC flood portal, (2) generating RCs using historical WSE and collocated discharge obtained from the WRIS India portal for different stations over India, and (3) generating NRT discharge using the extracted WSE and RCs for different stations over Indian rivers.

Fig. 2

Overview of the methodology: extraction of WSE from the CWC flood portal (blue), RC generation using historical WSE and discharge data (orange), and extension of the discharge record (green).

Accessing historical to near real-time WSE data

The CWC flood portal provides real-time (RT) and NRT measured WSE data across more than 1300 gauging stations. As discussed earlier, the data on the CWC flood portal are presented in graphical format over 7-day moving windows. Web-scraping makes it possible to navigate the portal, access the relevant web pages, and extract the desired WSE data, which can then be stored in a structured format for further analysis and integration with other datasets. In the current study, a web-scraping framework is developed using BeautifulSoup, a Python library for parsing HTML and XML documents. The framework sends HTTP requests with headers constructed to emulate human-like browsing behaviour, processes the server’s responses, and handles potential errors20. It extracts both historical and near real-time WSE data into individual station-wise Excel files. The collected information serves as the foundation for the subsequent generation of discharge data. This study demonstrates data extraction from 1365 stations spread across the diverse hydrological landscape of India (Fig. 3).
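Because the portal exposes only a 7-day moving window, building a continuous station record presumably requires stitching together successive scrapes. The following is a minimal sketch of that step, not the authors' implementation: the function name `merge_windows`, the rule that a later scrape overwrites an earlier one, and all sample readings are assumptions made for illustration. It uses only the Python standard library and does not contact the portal.

```python
from datetime import datetime

def merge_windows(windows):
    """Merge overlapping scraped windows into one time-ordered series.

    Each window is a list of (timestamp, wse_m) pairs. When a timestamp
    occurs in several windows, the value from the window that appears
    later in `windows` is kept (an assumed rule, not from the source).
    """
    merged = {}
    for window in windows:              # oldest scrape first
        for timestamp, wse in window:
            merged[timestamp] = wse     # later scrapes overwrite earlier ones
    return sorted(merged.items())

# Hypothetical readings from two scrapes of one station, one day apart.
fmt = "%Y-%m-%d %H:%M"
scrape_1 = [(datetime.strptime("2023-07-01 08:00", fmt), 101.20),
            (datetime.strptime("2023-07-02 08:00", fmt), 101.45)]
scrape_2 = [(datetime.strptime("2023-07-02 08:00", fmt), 101.47),
            (datetime.strptime("2023-07-03 08:00", fmt), 101.90)]

series = merge_windows([scrape_1, scrape_2])
for timestamp, wse in series:
    print(timestamp.strftime(fmt), wse)
```

Keying the merged record on the timestamp keeps the series free of duplicates however often the portal is scraped, provided consecutive scrapes are less than seven days apart.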

Fig. 3

Location of extracted (extended) WSE (red) and discharge (green) over Indian rivers.

Generation of RC

The historical WSE and discharge data from 1970 to 2020 are obtained from the WRIS portal. To ensure data quality, outliers are removed using a threshold of two standard deviations from the mean. For RC generation, the latest data from 2000 to 2020 are used to better represent the stage-discharge relationship. If the latest data are not available, historical data from 1970 to 1999 are used instead. The data are divided into calibration (75%) and validation (25%) datasets, and the entire dataset is later used to recalibrate the algorithm. Four different RC algorithms (Eqs. 1 to 4) are evaluated for both monsoon and non-monsoon periods to accurately estimate discharge dynamics. The RC algorithm with the highest Nash-Sutcliffe Efficiency (NSE) is then selected. Furthermore, only RCs with NSE exceeding 0.6 are used to generate the extended discharge21.
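The screening, splitting, and scoring steps described above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: the outlier test is applied to discharge only, the 75/25 split follows the order of the record, and the two candidate curves are hypothetical, already-calibrated stand-ins (coefficient fitting is omitted). All data are synthetic.

```python
from statistics import mean, pstdev

def remove_outliers(pairs, n_std=2.0):
    """Keep (wse, discharge) pairs whose discharge lies within n_std
    standard deviations of the mean discharge (assumed screening variable)."""
    discharges = [q for _, q in pairs]
    mu, sigma = mean(discharges), pstdev(discharges)
    return [(h, q) for h, q in pairs if abs(q - mu) <= n_std * sigma]

def split(pairs, calibration_fraction=0.75):
    """Split the record in its given order (assumed; the text does not
    state whether the split is chronological or random)."""
    cut = int(len(pairs) * calibration_fraction)
    return pairs[:cut], pairs[cut:]

def nse(observed, simulated):
    """Nash-Sutcliffe Efficiency: 1 - SSE / sum of squared deviations
    of the observations from their mean."""
    obs_mean = mean(observed)
    sse = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    sst = sum((o - obs_mean) ** 2 for o in observed)
    return 1.0 - sse / sst

# Synthetic record: discharge = 2 * WSE^2, plus one spurious spike.
record = [(float(h), 2.0 * h * h) for h in range(1, 9)]
record.insert(5, (5.5, 900.0))

clean = remove_outliers(record)
# The calibration part would be used to fit coefficients (not shown here).
calibration, validation = split(clean)

# Two hypothetical, already-calibrated candidate curves.
candidates = {
    "quadratic": lambda h: 2.0 * h * h,
    "constant": lambda h: 100.0,
}
observed = [q for _, q in validation]
scores = {name: nse(observed, [rc(h) for h, _ in validation])
          for name, rc in candidates.items()}
best = max(scores, key=scores.get)      # highest validation NSE wins
is_valid = scores[best] > 0.6           # acceptance threshold from the text
print(best, scores[best], is_valid)
```

In practice the same scoring would be run separately for the monsoon and non-monsoon subsets, since the text evaluates the RCs per season.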

$${Discharge}=a\times {{WSE}}^{2}+b\times {WSE}+c$$
(1)
$${Discharge}=a\times \exp (b\times {WSE})$$
(2)
$${Discharge}=a\times \,{({WSE}-b)}^{2}$$
(3)
$${Discharge}=a\times \,{({WSE}-b)}^{3}$$
(4)

A total of 210 valid RCs have been obtained across the Indian region, and discharge is extended accordingly (Fig. 3).
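Once a curve has been selected for a station, extending the discharge record amounts to evaluating that curve on the scraped WSE series. The sketch below shows this for the four functional forms of Eqs. 1 to 4. The dictionary keys, the function name `wse_to_discharge`, and the coefficients and WSE values are hypothetical and are not taken from the published RC file.

```python
import math

# Rating-curve forms from Eqs. 1 to 4. The keys are illustrative labels,
# not the names used in the published RC file.
RC_FORMS = {
    "quadratic": lambda wse, a, b, c: a * wse**2 + b * wse + c,   # Eq. 1
    "exponential": lambda wse, a, b: a * math.exp(b * wse),       # Eq. 2
    "shifted_square": lambda wse, a, b: a * (wse - b) ** 2,       # Eq. 3
    "shifted_cube": lambda wse, a, b: a * (wse - b) ** 3,         # Eq. 4
}

def wse_to_discharge(wse_series, form, coefficients):
    """Evaluate the chosen rating curve on every WSE value."""
    rc = RC_FORMS[form]
    return [rc(wse, *coefficients) for wse in wse_series]

# Hypothetical WSE values (m) and coefficients (a, b) for Eq. 3.
wse_series = [102.0, 103.5, 105.0]
discharge = wse_to_discharge(wse_series, "shifted_square", (12.0, 100.0))
print(discharge)  # [48.0, 147.0, 300.0]
```

The sketch does not guard against WSE values outside the range used for calibration; for Eqs. 3 and 4 in particular, values of WSE below b would give physically meaningless results and would need separate handling.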

Data Records

The extended discharge data product is available at figshare22. The repository consists of the extended WSE, the extended discharge, and the RCs used to generate the extended discharge. Extended WSE and extended discharge data are available in separate folders in the data.rar compressed file, where each file name represents the station name. The RCs are provided in the RC.xlsx file, which lists the station name, RC algorithm used, RC coefficients, root mean square error (RMSE), and Nash-Sutcliffe efficiency (NSE).

Technical Validation

The stage-discharge RCs for extending river discharge at more than 250 stations are generated using historical gauge data of WSE and discharge. The second-degree polynomial (Eq. 1) and cubic (Eq. 4) relations demonstrate the highest accuracy, characterized by low RMSE and maximum NSE, while the exponential relation (Eq. 2) shows poor accuracy across the Indian region.

In the current study, stations with a Nash-Sutcliffe Efficiency (NSE) of more than 0.6 are considered valid21 for discharge data generation. A total of 210 stations were found to exceed this NSE threshold.

Figure 4 presents the NSE and root mean square error (RMSE) for the monsoon and non-monsoon seasons for all 210 stations, together with pie charts of the NSE and RMSE classes for both seasons.

Fig. 4

Plot and pie chart of NSE for monsoon (a,b), NSE for non-monsoon (a,c), RMSE for monsoon (d,e), and RMSE for non-monsoon (d,f) for different RCs across India.

Figure 4a,b,c indicate that NSE ranges from 0.6 to 0.99 across different stations over India for monsoon and non-monsoon discharge. For the RCs used for monsoon discharge estimation from WSE data, 3.3% of stations (7 stations) show NSE values between 0.6 and 0.7, 8.5% (18 stations) between 0.7 and 0.8, and 16.1% (34 stations) between 0.8 and 0.9, while a large majority, approximately 71.9% (151 stations), present an NSE exceeding 0.9. For the RCs used for non-monsoon discharge estimation, 6.1% of stations (13 stations) show NSE values between 0.6 and 0.7, 12.3% (26 stations) between 0.7 and 0.8, and 30% (63 stations) between 0.8 and 0.9, while a slight majority, approximately 51.4% (108 stations), present an NSE exceeding 0.9.

Figure 4d,e,f indicate that RMSE ranges from less than 5 m³ s⁻¹ to 1147 m³ s⁻¹ across different stations over India. For the RCs used for monsoon discharge estimation from WSE data, 6.6% of stations (14 stations) show RMSE values of more than 300 m³ s⁻¹, 7.1% (15 stations) between 200 and 300 m³ s⁻¹, and 16.6% (35 stations) between 100 and 200 m³ s⁻¹, while a large majority, approximately 69.5% (146 stations), present an RMSE of less than 100 m³ s⁻¹. For the RCs used for non-monsoon discharge estimation, 1.9% of stations (4 stations) show RMSE values of more than 300 m³ s⁻¹, 1.9% (4 stations) between 200 and 300 m³ s⁻¹, and 8% (17 stations) between 100 and 200 m³ s⁻¹, while the large majority, approximately 88% (185 stations), present an RMSE of less than 100 m³ s⁻¹.
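The class shares quoted for Fig. 4 follow directly from the station counts, and the short sketch below recomputes them as a consistency check. The counts are those given in the text; the dictionary layout and class labels are illustrative. Percentages rounded to one decimal place may differ from the quoted values in the last digit.

```python
TOTAL_STATIONS = 210

# Station counts per class as quoted in the text (labels are illustrative).
counts = {
    "NSE monsoon": {"0.6-0.7": 7, "0.7-0.8": 18, "0.8-0.9": 34, ">0.9": 151},
    "NSE non-monsoon": {"0.6-0.7": 13, "0.7-0.8": 26, "0.8-0.9": 63, ">0.9": 108},
    "RMSE monsoon": {">300": 14, "200-300": 15, "100-200": 35, "<100": 146},
    "RMSE non-monsoon": {">300": 4, "200-300": 4, "100-200": 17, "<100": 185},
}

def class_shares(class_counts):
    """Convert station counts per class into percentages of the total."""
    total = sum(class_counts.values())
    return {label: 100.0 * n / total for label, n in class_counts.items()}

for name, class_counts in counts.items():
    # Every breakdown should account for all 210 valid stations.
    assert sum(class_counts.values()) == TOTAL_STATIONS
    shares = class_shares(class_counts)
    print(name, {label: round(p, 1) for label, p in shares.items()})
```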

Figure 5a,b illustrate the spatial distribution of NSE for monsoon and non-monsoon discharge estimation for all 210 stations, while the spatial distribution of RMSE for monsoon and non-monsoon discharge estimation is shown in Fig. 5c,d. The figure highlights that most stations exhibit NSE values exceeding 0.9 and RMSE values below 100 m³ s⁻¹ for the RCs used for discharge estimation from WSE data, and that these stations are distributed across the unclassified basins within India.

Fig. 5

Spatial variation of NSE for monsoon (a) and non-monsoon (b), and of RMSE for monsoon (c) and non-monsoon (d), for RCs across India.

Usage Notes

In this study, real-time WSE data are extracted using web-scraping from the CWC flood portal for more than 1200 stations, and discharge is generated for more than 200 stations; both are available in tabular format in the figshare repository22. The extended WSE and discharge data may be visualized and downloaded from http://indiariverflow.com/. Potential applications of the data22 are as follows:

  1. WSE and discharge data can be used for real-time monitoring of rivers for better management.

  2. Users can also generate their own discharge series by generating local RCs and using the provided WSE series for discharge estimation.

  3. The provided discharge and WSE data can be harnessed for flood frequency analysis.

  4. The generated data can be used for calibrating and validating hydrodynamic/hydrological models.

  5. The extended WSE data will be useful for validating products from the Surface Water and Ocean Topography (SWOT) mission, launched in December 2022, particularly the SWOT high-resolution WSE point cloud and node-based WSE data over the Indian region. Additionally, the generated discharge data can serve as a valuable resource for local flow law parameters (FLP) estimation23, facilitating precise discharge estimation through the SWOT mission across Indian rivers.