Background & Summary

Large-sample hydrometeorological datasets have become fundamental to advancing hydrological research by enabling researchers to analyze streamflow patterns across diverse catchments1,2,3, develop robust predictive models4,5,6, and support critical applications in water resource management and climate change assessment2,7,8. The expansion of large-sample datasets, driven by the need for benchmark data in large-scale studies, increasing data availability, and improved data-sharing initiatives, began with the MOPEX (Model Parameter Estimation Experiment)9 dataset in the United States (US) and significantly expanded through the development of the first CAMELS (Catchment Attributes and MEteorology for Large-sample Studies)10 in 2017, which covered 671 U.S. catchments and integrated extensive hydrometeorological observations with catchment attributes. CAMELS served as a catalyst for similar datasets across multiple countries, including Chile11, Brazil12, Great Britain13, Australia14, Germany15, India16, and Switzerland17, ultimately leading to Caravan (a series of CAMELS)18, a comprehensive global dataset that synthesizes various CAMELS sources to provide harmonized, high-quality data for studying hydrological processes at regional and global scales.

Despite the widespread availability of shared datasets, it remains challenging to find large sample datasets at sub-daily or hourly temporal scales; most existing datasets (for continental or global scales) operate at a daily scale. This hinders large-scale research and model capability assessment at a sub-daily time scale, which is crucial for developing machine learning (ML) approaches relevant for decision-making and mitigation activities. While ML applications have shown effectiveness for simulating daily streamflow5,19,20, similar achievements in large-scale hourly streamflow prediction or Prediction in Ungauged Basins (PUB) remain limited to isolated case studies21,22. This limitation is particularly critical as hourly forecasts play a crucial role in disaster risk mitigation, especially for short-lived flooding events23,24,25,26. At hourly timescales, streamflow patterns and physical characteristics of streamflow generation become increasingly complex and less predictable as compared to the daily scale27,28,29. Besides, the potential benefits of hourly data extend beyond model development, offering information to provide vital insights into short-term variability, extreme events, and diurnal patterns in streamflow, precipitation, and temperature27,30,31.

While hourly hydrometeorological datasets exist, they lack comprehensive coverage in terms of the number of watersheds and diversity of basin characteristics. Specifically, Demir, et al.22 provided the WaterBench-Iowa dataset, which focuses on a specific region in Iowa, US, covering 125 locations. In another study, Gauch, et al.21 presented an hourly dataset for 516 CAMELS basins spanning the period 1990–2018. However, one of the most well-known limitations of the CAMELS dataset is its lack of diversity, as it primarily contains data for near-natural basins with minimal human impacts10. This limitation was highlighted in Ouyang, et al.32, which demonstrated that ML models trained on this dataset cannot generalize to regulated basins. It is worth noting that two-thirds of the river basins globally are effected by human activities, resulting in non-natural flow conditions33.

Given the needs outlined above, in this study, we present the first comprehensive hourly dataset, named CAMELSH (Catchment Attributes and Hourly hydroMEteorology for Large-sample Studies), which provides data for 9,008 basins across the contiguous United States (CONUS) under diverse conditions. The dataset is structured according to FAIR principles (Findable, Accessible, Interoperable, and Re-usable), featuring standardized units, time formats, time zones, and data formats to ensure seamless integration and analysis. All data are organized in a readily accessible format to facilitate straightforward extraction and utilization, with planned ongoing updates to expand the spatial coverage through additional basins, the temporal extent of the data records, forcing data from various sources, as well as different measurable parameters that can be collected such as water level or stream temperature.

The CAMELSH dataset comprises three main components: hourly hydrometeorological time series, basin attributes from multiple sources, and watershed boundary shapefiles. The hydrometeorological data spans from 1980 to 2024, incorporating 11 meteorological variables derived from the NLDAS-2 (North American Land Data Assimilation System) dataset, while the streamflow observations are sourced from the United States Geological Survey (USGS). Operating at a higher temporal resolution, CAMELSH encompasses 9,008 basins (the first version of CAMELSH contains 3,166 basins with observed streamflow data), representing diverse watersheds across different climatic regions, geological characteristics, and varying degrees of human influence.

Our vision in publishing high temporal resolution (e.g., hourly) datasets is to promote the convergence of science efforts, in which researchers worldwide can work with common datasets to reach shared goals. Using standardized datasets particularly benefits model development studies while also facilitating easier reproduction of results for benchmarking models used in evaluation and comparison. More importantly, beyond developing models solely to improve their forecasting performance, developing datasets at finer timescales is crucial for exploring the contribution of various physical mechanisms in the land-surface phase of the hydrological cycle, especially useful for (mostly) process-agnostic machine learning applications. It is worth noting that recent studies have extensively addressed timescale-related questions in process-based modeling and downscaling applications34,35. With the current remarkable advancement of ML and AI, similar efforts need to be undertaken to understand the scalability of these novel tools across various hydrologic regimes. However, improving model prediction performance at hourly temporal resolution still remains the dominant research topics.

One of the key criteria for large-sample datasets is the ability to provide uncertainty assessments, particularly for data not directly obtained from observations, such as meteorological forcing and basin attributes. In this version of the dataset, we provide two versions of basin attributes. Additionally, alongside the completed dataset, we also provided a second forcing dataset derived from ERA5-Land36 to enhance the number of options for available data sources. This also opens future opportunities for integrating CAMELSH into a global dataset, as ERA5-Land offers full global coverage, ensuring consistency. We also are searching for sources of hourly streamflow data at a global scale, but availability of such data is extremely limited. Current global datasets such as GRDC (Global Runoff Data Centre) only contain data at a daily temporal resolution. Another key plan in our efforts to improve the dataset quality is to enhance basin attribute data to serve diverse research applications, not only hydrological studies but also those related to ecological studies37. Also, basin characteristics (land cover, soil, vegetation, or infrastructure) should ideally be considered dynamic rather than static variables. With advancements in AI and remote sensing, processing such data has become feasible; however, it requires substantial resources and time, particularly for large, continental, or global-scale applications.

Method

Basin selection

CAMELSH uses the basins from the Geospatial Attributes of Gages for Evaluating Streamflow (GAGES-II) database38, which contains geospatial information for over 9,322 stream gages across the USGS network, including CONUS, Alaska, and island territories. Our dataset prioritizes diversity in terms of basin size, climate, topographical, geographical, geology, and human interventions that affect stream flow generation mechanisms, without imposing additional selection criteria. The only constraint is the geographical limitation to CONUS (Contiguous United States) basins due to the spatial coverage of our meteorological forcing data. This results in 9,008 basins in CAMELSH. All catchment shapefiles have been standardized to the WGS84 coordinate system.

Streamflow data collection

The streamflow data are obtained automatically from the USGS Water Information System’s Instantaneous Values Service (https://waterservices.usgs.gov/), which provides historical measurements at sub-daily resolutions (typically 15 to 60-minute intervals) that we aggregated to hourly timesteps for consistency. To ensure standardization, all data, originally recorded in different time zones, were converted to + 0 UTC timestamps, and the discharge units were transformed from cubic feet per second (cfs) to cubic meters per second (cms). Note that the streamflow data has timezone flags at each time step, so Daylight Saving Time was automatically considered in the process of standardizing the timezone. The data collection period was established from 1980/1/1 00:00:00 to 2024/12/31 23:00:00 + 0 UTC, with missing hourly observations denoted as “NaN” in the time series.

Meteorological data collection

Only a limited number of datasets provide high-resolution hourly meteorological data, such as NLDAS-2 and ERA5 (or ERA5-Land). For the first version of CAMELSH, we utilize NLDAS-2 data (https://ldas.gsfc.nasa.gov/nldas/v2/forcing), which is specifically designed for the CONUS. We acknowledge that the spatial resolution of NLDAS-2 is relative coarse for small basins and the accuracy of this data is not optimal, particularly for extreme events39,40. However, NLDAS-2 is the only dataset that meets our expectations that is specifically designed for CONUS and provides meteorological forcing data at hourly scale with the necessary spatial and temporal coverage for our study.

NLDAS-2 data features a spatial resolution of 0.125 × 0.125 degrees and 1-hour temporal resolution, sourced directly from NASA (https://hydro1.gesdisc.eosdis.nasa.gov/data/NLDAS/NLDAS_FORA0125_H.002/) with a default timestamp of +0 UTC. All eleven available forcing variables from this dataset are incorporated (Table 1), representing commonly used inputs for both land surface models and machine learning applications. From the original gridded data, we generate a single time series for each variable by computing area-weighted spatial averages based on the catchment boundaries as inferred from the GAGES-II dataset, specifically by identifying NLDAS-2 grid cells within river basins and then calculating averages based on these grid cells. Among the variables, air temperature data, initially provided in Kelvin degrees, was converted to Celsius degrees by subtracting 273.15.

Table 1 The list of NLDAS-2 meteorological variables.

Catchment attributes

CAMELSH includes a series of static catchment attributes that are considered time-invariant and contain information about topography, hydroclimatic signatures, and catchment characteristics covering land cover, soil, hydrogeology, and human influences. Specifically, this dataset includes three sets of catchment attributes: (1) attributes used directly from GAGES-II; (2) attributes derived from the global HydroATLAS dataset41; and (3) climate attributes derived from NLDAS-2 data, which are similar to the climate attributes provided in CAMELS. HydroATLAS was chosen as the source for these attributes due to its global coverage and redistribution-friendly license. The provision of basin attributes from the global HydroATLAS dataset also creates opportunities for easier integration of datasets on a global scale, as similar datasets from other regions can derive basin attributes from the same source to ensure consistency.

First, regarding the set of basin attributes from GAGES-II, since this dataset was already compiled and provided by USGS, we did not modify the data but rather filtered and selected information matching the basin coverage in this dataset for the CONUS region. A total of 439 attributes are listed in this set, including climate, geology, hydrology, river/stream morphology, landscape, nutrient, soil, topography, and anthropogenic influences (such as dams and population). Although this attribute dataset provides comprehensive information about the watersheds, some specific information related to river attributes such as river bankfull width and depth or slope is not available. Detailed descriptions of these attributes are provided in Table S.1 in the Supplementary Information (SI) file.

Second, in order to extract the basin attributes from the HydroATLAS dataset, we employed HydroATLAS’s highest-resolution shape file (level 12). The HydroATLAS polygonal units are smaller than the catchment boundaries (from the GAGES-II dataset) associated with individual gauge, wherein a singular drainage area polygon corresponding to a specific gauge station typically encompasses multiple HydroATLAS polygonal units. Therefore, we first aggregated the spatial join of the HydroATLAS subbasins to form the catchment boundary corresponding to each stream gauge and then derived the catchment attributes. Catchment attributes from HydroATLAS can be grouped into the following categories: hydrology, physiography, climatology, soils and geology, land cover, and anthropogenic influences. A list of 195 catchment attributes is provided in Table S.2 in the SI file.

Lastly, based on the NLDAS-2 data collected for each basin, we calculated climate attributes (corresponding to those described in the CAMELS datasets) as detailed in Table 2. The specific calculation methods can be referenced in Addor, et al.10.

Table 2 Climate attributes derived from NLDAS-2 time series.

Data Records

The current versions (v2 and v3) of the CAMELSH dataset42,43, containing data for 9,008 basins, are available as Zenodo repositories (https://doi.org/10.5281/zenodo.15066778 and https://doi.org/10.5281/zenodo.15070091). Due to the total data volume in the repository being approximately 57 GB, which exceeds maximum allowable Zenodo’s size limit, we split it into two different repositories. The first link contains data on attributes, shapefiles, and time series data for basins with observed streamflow. The second link contains time series data for the ‘non-obs’ basins. All data is compressed in 7zip format. After extraction, the dataset is organized into the following subfolders and summarized in Table 3:

  • The attributes folder contains 28 CSV (comma-separated values) files that store basin attributes with all files beginning with “attributes_” and one excel file. Of these, the ‘attributes_nldas2_climate.csv’ file contains nine climate attributes (Table 2) derived from NLDAS-2 data. The ‘attributes_hydroATLAS.csv’ file includes 195 basin attributes derived from the HydroATLAS dataset. 26 files with names starting with ‘attributes_gageii_’ contain a total of 439 basin attributes extracted from the GAGES-II dataset. The name of each file represents a distinct group of attributes, as described in Table S.1. The remaining file, named ‘Var_description_gageii.xlsx’, provides explanatory details regarding the variable names included in the 26 CSV files, with information similar to that presented in Table S.1. The first column in all CSV files, labeled ‘STAID’, contains the identification (ID) names of the stream gauges. These IDs are assigned by the USGS and are sourced from the original GAGES-II dataset.

  • The shapefiles folder contains two sets of shapefiles for the catchment boundary. The first set, CAMELSH_shapefile.shp, is derived from the original GAGES-II dataset and is used to obtain the corresponding climate forcing data for each catchment. The second set, CAMELSH_shapefile_hydroATLAS.shp, includes catchment boundaries derived from the HydroATLAS dataset. Each polygon in both shapefiles contains a field named GAGE_ID, which represents the ID of the stream gauges.

  • The timeseries (7zip) file contains a compressed archive (7zip) that includes time series data for 3,166 basins with observed streamflow data. Within this 7zip file, there are a total of 3,166 NetCDF files, each corresponding to a specific basin. The name of each NetCDF file matches the stream gauge ID. Each file contains an hourly time series from 1980-01-01 00:00:00 to 2024-12-31 23:00:00 for streamflow (denoted as “Streamflow” in the NetCDF file) and 11 climate variables (see Table 1). The streamflow data series includes missing values, which are represented as “NaN”. All meteorological forcing data and streamflow records have been standardized to the + 0 UTC time zone.

  • The timeseries_nonobs (7zip) file contains time series data for 5,842 basins without observed streamflow data. The structure of each NetCDF file is similar to the one described above.

  • The info.csv file, located in the main directory of the dataset, contains basic information for 9,008 stream stations. This includes the stream gauge ID, the total number of observed hourly data points over 45 years (from 1980 to 2024), and the number of observed hourly data points for each year from 1980 to 2024. Stations with and without observed data are distinguished by the value in the second column, where stations without observed streamflow data have a corresponding value of 0.

Table 3 Description of the CAMELSH data record.

Technical Validation

Streamflow observations

We selected 9,008 basins within the CONUS region to generate the CAMELSH dataset, with the locations of the corresponding stream gauge stations shown in Fig. 1. Based on the IDs of these stations, observed streamflow data were automatically retrieved from the USGS Water Services server over the period of 1979–2024.

Fig. 1
figure 1

Map showing the locations of 9,008 streamflow gauges over CONUS used to generate the CAMELSH dataset. Of these, 5,842 stations do not have observed hourly streamflow measurements, shown as gray dots labeled “Non-obs”, while 3,166 stations have measurements (shown as blue dots labeled “Obs”).

The data for each station were downloaded individually, with varying temporal resolutions ranging from 15 minutes to 1 hour, and the default unit being cubic feet per second. The USGS provides streamflow data flags to indicate periods of estimated flow, which are included in this dataset. These flags are as follows: [A] Approved for publication—processing and review completed; [e] Value has been estimated; [R] Records for these data have been revised; and [P] Provisional data subject to revision. However, additional data quality information is unavailable without further investigation and is not included in this dataset. Therefore, we opted to use the original data in its entirety rather than removing any records. Notably, this dataset includes basins near coastal areas, where streamflow measurements may be influenced by tidal effects, making negative discharge values possible.

The data series were converted to cubic meters per second and standardized into uniform hourly time series, all adjusted to the UTC + 0 time zone. To ensure no missing data in the first few hours of 1980 due to the time zone conversion, data retrieval began in 1979 as mentioned above. After compiling the entire dataset, we found that 5,842 stations had no hourly data records available in the USGS Water Information System throughout the period from 1980 to 2024, while 3,166 stations contained streamflow records (Fig. 1). The stations with available data are distributed across most regions of CONUS, though with varying densities. Notably, the Central U.S. (West Plains) and Eastern Highlands have a higher concentration of stations with available data. Figure 2 presents the statistics on streamflow data availability, revealing that the length of the time series varies across stations, ranging from less than one year to nearly 40 years. The highest number of stations falls within the 15- to 35-year range, totaling 2,292 out of the 3,166 stations. Only 11 stations have more than 35 years of recorded hourly streamflow data.

Fig. 2
figure 2

The availability of streamflow records. Map (a) shows the total number of available hourly data records corresponding to each stream gauge from a total of 3,166 stream gauges. Histogram (b) shows the distribution of the number of stations corresponding to the total available data records (measured in years). Stations without available observed data are not shown in this figure.

Climate attributes

As described in the methodology, the meteorological forcing data were aggregated from the NLDAS-2 dataset, comprising a total of 11 variables (Table 1). Initially, these data were downloaded as gridded datasets, with each hourly time step stored in a separate file. The data were then processed for each catchment using the catchment boundaries provided by the GAGES-II dataset. All meteorological forcing data are available for every basin without any missing values. From this dataset, we computed nine climate attributes following Addor, et al.10. Since the original data are hourly, we processed key variables, including precipitation, temperature, and potential evapotranspiration, into daily time series for attribute calculations.

The results of the 9 climate attributes are presented in Fig. 3. For instance, the results show that the Northwestern and Southeastern regions of the U.S. experience the highest average daily precipitation, whereas the Central U.S., particularly the Great Plains and Rocky Mountains, receive significantly lower amounts. The potential evapotranspiration exhibits a clear latitudinal gradient, increasing toward the south of the U.S. Aridity is highest in the Southwest, High Plains, and Great Plains, while the Northwest, Northeast, and Appalachian regions are the most humid. The spatial variation in the annual precipitation cycle is strongest (with high absolute values of p_seasonality) along the Pacific coast, the northern Great Plains, and Florida, while the weakest seasonality is observed along the Atlantic coast. Additionally, the snow_fraction index (fraction of precipitation falling as snow) highlights areas with significant snowfall, with the highest values found in the high-altitude regions of the Central U.S.–West Mountains, followed by the Northeast region.

Fig. 3
figure 3

Climate Attributes for 9,008 Catchments in the CONUS. The maps display the values (indicated by the color of the dots) of various climate attributes for individual basins. The histograms show the frequency distribution of the corresponding climate attributes, using 10 bins. The maps, arranged from left to right and top to bottom, represent: mean daily precipitation (p_mean), mean daily potential evaporation (pet_mean), aridity index (aridity_index), seasonality and timing of precipitation (p_seasonality), fraction of precipitation falling as snow (frac_snow), frequency of high precipitation days (high_prec_freq), average duration of high precipitation events (high_prec_dur), frequency of low precipitation days (low_prec_freq), and average duration of low precipitation events (low_prec_dur).

Aggregating HydroAtlast basin boundary

The key challenge in extracting data from HydroAtlas is to define which HydroATLAS polygons are within a given gauge’s drainage area. The primary complication is that HydroATLAS basin polygons are not derived based on stream gauge locations. This means that catchment boundaries from the source datasets do not perfectly align with the polygons in HydroATLAS. An example of this is shown in Fig. 4, which compares HydroATLAS polygons with GAGES-II polygons. After manually inspecting a random sample of 100 basins, we identified four distinct cases:

  • The basin boundaries between GAGES-II and aggregated HydroATLAS coincide, with the stream gauge located at either the outlet or along the basin boundary edges (Fig. 4a).

  • The stream gauge is situated within the interior of a HydroATLAS polygon, rather than along its boundaries (Fig. 4b).

  • While the stream gauge is positioned at the outlet in both GAGES-II and HydroATLAS polygons, the area and boundary of the basin aggregated from HydroATLAS exceed those of the GAGES-II polygon.

  • While the stream gauge is positioned at the outlet in both GAGES-II and HydroATLAS polygons, the area and boundary of the basin aggregated from HydroATLAS are smaller than those of the GAGES-II polygon.

Fig. 4
figure 4

Examples of different cases encountered in identifying corresponding catchment for each stream gauge using HydroATLAS polygons. The polygon with black outline and gray fill color represents the basin area from GAGES-II data and is considered the correct polygon. Polygons with blue boundaries and white fill color represent sub-basins from HydroATLAS. The green dot indicates the stream gauge location. Each map shows data for station locations with the name shown in the plot title. Plot (a) shows the overlap between GAGES-II and HydroATLAS polygons. Plot (b) shows the discrepancy when the stream gauge location does not align with the outlet (or boundary) of the HydroATLAS polygon. Plot (c,d) shows the difference in the boundary of the HydroATLAS polygon compared to the GAGES-II polygon despite the stream gauge location being at the outlet.

For cases 1, 3, and 4, we directly used the aggregated basins from ATLAS, attributing the boundary mismatches at stream gauge control points between the two datasets to the use of different Digital Elevation Map sources. However, case 2, where the stream gauge is located within the polygon interior, required an additional adjustment, as including the entire polygon would incorporate downstream catchment areas. To address this, we calculated the overlap between this sub-basin and the GAGES-II polygon, selecting only the overlapping portion to merge with upstream polygons to form the final basin boundary. The non-overlapping portion was excluded. After completing the basin aggregation based on the HydroATLAS dataset, our analysis revealed that 5,887 basins demonstrated over 90% boundary overlap (Fig. 5). Additionally, 302 and 854 basins showed areas smaller and larger than the GAGES-II dataset by more than 50%, respectively. For the selected sub-basins used in aggregation, basin attributes were derived as described in the methods section. For the final sub-basin containing the stream gauge, in scenario 2 cases (as shown in Fig. 4b), the new overlapping area was used to derive basin attributes.

Fig. 5
figure 5

A comparison of catchment polygons between GAGES-II and HydroATLAS datasets. Plot (a) shows a 1:1 comparison of the catchment areas between GAGES-II and HydroATLAS dataset. Plot (b) shows a 1:1 comparison of overlap fraction, calculated as the ratio of intersecting area to the catchment area from GAGES-II and HydroATLAS datasets. Scatter plot values closer to 1 indicate that the catchment polygons from the two datasets have more similar boundaries.

Since the original attributes were independently calculated for each individual sub-basin, we recalculated them when analyzing the entire watershed area controlled by stream gauges. For the 195 attributes, we implemented three different approaches based on their characteristics. First, we used an area-weighted aggregate method for attributes expressed as averaged values of sub-basins to calculate their magnitudes for the whole basin. Second, for attributes categorized into classes (denoted by integer values), we determined the total area for each class and selected the class with the largest areal proportion as representative of the total catchment area. Last, for attributes that were pre-calculated in the original HydroATLAS dataset based on the total dranage area (such as inundation extent, percent lake area, and lake volume), we used these pre-calculated values from the sub-catchment containing the streamgage directly. In cases where a portion of the last sub-basin containing the stream gauge was excluded (as the case shown in Fig. 4b), we adjusted these attributes by multiplying them by the ratio of the catchment area derived from HydroATLAS to the original catchment area with the full area of the last sub-basin.

Basin attribute uncertainty

One of the key aspects of providing a large sample dataset is evaluating dataset uncertainty, particularly for basin attributes when a single attribute can be calculated based on different climate data or natural geographical conditions. For example, the comparison of basin areas above demonstrates the uncertainty in basin attributes. After filtering basin attribute data between GAGES-II and HydroATLAS, we identified 37 attributes with overlapping information including data on climate (air temperature, precipitation, potential evapotranspiration), population density, road density, urban extent, soil characteristics (clay, silt, and sand fractions), and topographic features (elevation and slope). Figure 6 presents a 1:1 comparison of 11 attributes, clearly showing differences between the two datasets. Only attributes such as mean annual precipitation (p-mean), mean annual air temperature (Ta-mean), and mean elevation mean show better correlation with R² values of 0.85, 0.98, and 0.98 respectively. Mean annual potential evapotranspiration (pet-mean), despite having a correlation as high as 0.8, shows values from HydroATLAS that are generally higher than values calculated from the GAGES-II dataset. The remaining attributes exhibit high uncertainty with R² values all below 0.7.

Fig. 6
figure 6

A comparison of 11 basin attributes from GAGES-II and HydroATLAS datasets. The attributes include mean annual precipitation (p-mean), mean annual air temperature (Ta-mean), mean annual potential evapotranspiration (pet-mean), urban extent, population density, road density, clay, silt, and sand fractions, mean elevation and mean slope.

Basin diversity

Figure 7 demonstrates the diversity of the dataset when selecting all CONUS basins from GAGES-II. Specifically, Fig. 7a shows the diversity of watersheds in terms of flow disturbance levels, as measured by the hydrological disturbance index. The inclusion of managed basins represents a key difference from the previous daily CAMELS dataset, which primarily used near-natural basins (labeled as “ref” basins in Fig. 7a). In our dataset, the proportion of “Non-ref” stations (i.e., those with significantly disturbed flow) is substantially larger than ref basins (7,080 compared to 1,928). When considering only basins with available streamflow data, the proportional difference remains nearly similar with 2,454 (disturbed) versus 712 (natural) catchments. Additionally, this dataset offers variability in basin characteristics such as catchment area, specifically ranging from 3.8 km2 to 5 × 104 km2 based on GAGES-II polygons (Fig. 7b). The basins were also collected from all 9 different ecoregions and represent 7 distinct geological characteristics (Fig. 7c,d).

Fig. 7
figure 7

The diversity of the CAMELSH dataset. Map (a) shows the hydrological disturbance index corresponding to each stream gauge. Subplot (b) shows the CDF (cummulative distribution function) of basin areas corresponding to stream gauge locations. Stacked bar plots (c,d) show the number of basins belonging to different groups of ecoregion and geology characteristics. All these basin classification attributes were collected from the GAGES-II dataset. A detailed explanation of ecoregions is provided in Table S.3. In subplot (d) at the location of the intermediate group, the number of ‘non-obs’ and ‘obs’ basins is both 1. Due to these very small quantities, they do not show up.