Abstract
We present CAMELSH (Catchment Attributes and Hourly HydroMeteorology for Large-Sample Studies), the first large-sample hydrometeorological dataset at the hourly scale for the contiguous United States. CAMELSH intergrates hourly meteorological time series, catchment attributes and boundaries from GAGES-II and HydroATLAS for 9,008 catchments across diverse climatic, hydrological, and anthropogenic conditions. In addition, hourly streamflow time series is provided for 3,166 catchments. The dataset spans 45 years (1980–2024) with 11 meteorological variables from the NLDAS-2 forcing dataset, from which we compute nine climate indices related to precipitation, evapotranspiration, seasonality, and snow fraction. Additionally, CAMELSH includes two sets of catchment attributes: 439 from GAGES-II and 195 derived from HydroATLAS. These attributes include factors related to climate, geology, hydrology, river/stream morphology, landscape, nutrient, soil, topography, and anthropogenic influences. Developed in accordance with FAIR (Findability, Accessibility, Interoperability, and Reusability) principles, CAMELSH is the first large-sample dataset at an hourly timescale, supporting machine learning applications for short-term streamflow (flood) prediction and advancing data-driven hydrological research across multiple timescales.
Similar content being viewed by others
Background & Summary
Large-sample hydrometeorological datasets have become fundamental to advancing hydrological research by enabling researchers to analyze streamflow patterns across diverse catchments1,2,3, develop robust predictive models4,5,6, and support critical applications in water resource management and climate change assessment2,7,8. The expansion of large-sample datasets, driven by the need for benchmark data in large-scale studies, increasing data availability, and improved data-sharing initiatives, began with the MOPEX (Model Parameter Estimation Experiment)9 dataset in the United States (US) and significantly expanded through the development of the first CAMELS (Catchment Attributes and MEteorology for Large-sample Studies)10 in 2017, which covered 671 U.S. catchments and integrated extensive hydrometeorological observations with catchment attributes. CAMELS served as a catalyst for similar datasets across multiple countries, including Chile11, Brazil12, Great Britain13, Australia14, Germany15, India16, and Switzerland17, ultimately leading to Caravan (a series of CAMELS)18, a comprehensive global dataset that synthesizes various CAMELS sources to provide harmonized, high-quality data for studying hydrological processes at regional and global scales.
Despite the widespread availability of shared datasets, it remains challenging to find large sample datasets at sub-daily or hourly temporal scales; most existing datasets (for continental or global scales) operate at a daily scale. This hinders large-scale research and model capability assessment at a sub-daily time scale, which is crucial for developing machine learning (ML) approaches relevant for decision-making and mitigation activities. While ML applications have shown effectiveness for simulating daily streamflow5,19,20, similar achievements in large-scale hourly streamflow prediction or Prediction in Ungauged Basins (PUB) remain limited to isolated case studies21,22. This limitation is particularly critical as hourly forecasts play a crucial role in disaster risk mitigation, especially for short-lived flooding events23,24,25,26. At hourly timescales, streamflow patterns and physical characteristics of streamflow generation become increasingly complex and less predictable as compared to the daily scale27,28,29. Besides, the potential benefits of hourly data extend beyond model development, offering information to provide vital insights into short-term variability, extreme events, and diurnal patterns in streamflow, precipitation, and temperature27,30,31.
While hourly hydrometeorological datasets exist, they lack comprehensive coverage in terms of the number of watersheds and diversity of basin characteristics. Specifically, Demir, et al.22 provided the WaterBench-Iowa dataset, which focuses on a specific region in Iowa, US, covering 125 locations. In another study, Gauch, et al.21 presented an hourly dataset for 516 CAMELS basins spanning the period 1990–2018. However, one of the most well-known limitations of the CAMELS dataset is its lack of diversity, as it primarily contains data for near-natural basins with minimal human impacts10. This limitation was highlighted in Ouyang, et al.32, which demonstrated that ML models trained on this dataset cannot generalize to regulated basins. It is worth noting that two-thirds of the river basins globally are effected by human activities, resulting in non-natural flow conditions33.
Given the needs outlined above, in this study, we present the first comprehensive hourly dataset, named CAMELSH (Catchment Attributes and Hourly hydroMEteorology for Large-sample Studies), which provides data for 9,008 basins across the contiguous United States (CONUS) under diverse conditions. The dataset is structured according to FAIR principles (Findable, Accessible, Interoperable, and Re-usable), featuring standardized units, time formats, time zones, and data formats to ensure seamless integration and analysis. All data are organized in a readily accessible format to facilitate straightforward extraction and utilization, with planned ongoing updates to expand the spatial coverage through additional basins, the temporal extent of the data records, forcing data from various sources, as well as different measurable parameters that can be collected such as water level or stream temperature.
The CAMELSH dataset comprises three main components: hourly hydrometeorological time series, basin attributes from multiple sources, and watershed boundary shapefiles. The hydrometeorological data spans from 1980 to 2024, incorporating 11 meteorological variables derived from the NLDAS-2 (North American Land Data Assimilation System) dataset, while the streamflow observations are sourced from the United States Geological Survey (USGS). Operating at a higher temporal resolution, CAMELSH encompasses 9,008 basins (the first version of CAMELSH contains 3,166 basins with observed streamflow data), representing diverse watersheds across different climatic regions, geological characteristics, and varying degrees of human influence.
Our vision in publishing high temporal resolution (e.g., hourly) datasets is to promote the convergence of science efforts, in which researchers worldwide can work with common datasets to reach shared goals. Using standardized datasets particularly benefits model development studies while also facilitating easier reproduction of results for benchmarking models used in evaluation and comparison. More importantly, beyond developing models solely to improve their forecasting performance, developing datasets at finer timescales is crucial for exploring the contribution of various physical mechanisms in the land-surface phase of the hydrological cycle, especially useful for (mostly) process-agnostic machine learning applications. It is worth noting that recent studies have extensively addressed timescale-related questions in process-based modeling and downscaling applications34,35. With the current remarkable advancement of ML and AI, similar efforts need to be undertaken to understand the scalability of these novel tools across various hydrologic regimes. However, improving model prediction performance at hourly temporal resolution still remains the dominant research topics.
One of the key criteria for large-sample datasets is the ability to provide uncertainty assessments, particularly for data not directly obtained from observations, such as meteorological forcing and basin attributes. In this version of the dataset, we provide two versions of basin attributes. Additionally, alongside the completed dataset, we also provided a second forcing dataset derived from ERA5-Land36 to enhance the number of options for available data sources. This also opens future opportunities for integrating CAMELSH into a global dataset, as ERA5-Land offers full global coverage, ensuring consistency. We also are searching for sources of hourly streamflow data at a global scale, but availability of such data is extremely limited. Current global datasets such as GRDC (Global Runoff Data Centre) only contain data at a daily temporal resolution. Another key plan in our efforts to improve the dataset quality is to enhance basin attribute data to serve diverse research applications, not only hydrological studies but also those related to ecological studies37. Also, basin characteristics (land cover, soil, vegetation, or infrastructure) should ideally be considered dynamic rather than static variables. With advancements in AI and remote sensing, processing such data has become feasible; however, it requires substantial resources and time, particularly for large, continental, or global-scale applications.
Method
Basin selection
CAMELSH uses the basins from the Geospatial Attributes of Gages for Evaluating Streamflow (GAGES-II) database38, which contains geospatial information for over 9,322 stream gages across the USGS network, including CONUS, Alaska, and island territories. Our dataset prioritizes diversity in terms of basin size, climate, topographical, geographical, geology, and human interventions that affect stream flow generation mechanisms, without imposing additional selection criteria. The only constraint is the geographical limitation to CONUS (Contiguous United States) basins due to the spatial coverage of our meteorological forcing data. This results in 9,008 basins in CAMELSH. All catchment shapefiles have been standardized to the WGS84 coordinate system.
Streamflow data collection
The streamflow data are obtained automatically from the USGS Water Information System’s Instantaneous Values Service (https://waterservices.usgs.gov/), which provides historical measurements at sub-daily resolutions (typically 15 to 60-minute intervals) that we aggregated to hourly timesteps for consistency. To ensure standardization, all data, originally recorded in different time zones, were converted to + 0 UTC timestamps, and the discharge units were transformed from cubic feet per second (cfs) to cubic meters per second (cms). Note that the streamflow data has timezone flags at each time step, so Daylight Saving Time was automatically considered in the process of standardizing the timezone. The data collection period was established from 1980/1/1 00:00:00 to 2024/12/31 23:00:00 + 0 UTC, with missing hourly observations denoted as “NaN” in the time series.
Meteorological data collection
Only a limited number of datasets provide high-resolution hourly meteorological data, such as NLDAS-2 and ERA5 (or ERA5-Land). For the first version of CAMELSH, we utilize NLDAS-2 data (https://ldas.gsfc.nasa.gov/nldas/v2/forcing), which is specifically designed for the CONUS. We acknowledge that the spatial resolution of NLDAS-2 is relative coarse for small basins and the accuracy of this data is not optimal, particularly for extreme events39,40. However, NLDAS-2 is the only dataset that meets our expectations that is specifically designed for CONUS and provides meteorological forcing data at hourly scale with the necessary spatial and temporal coverage for our study.
NLDAS-2 data features a spatial resolution of 0.125 × 0.125 degrees and 1-hour temporal resolution, sourced directly from NASA (https://hydro1.gesdisc.eosdis.nasa.gov/data/NLDAS/NLDAS_FORA0125_H.002/) with a default timestamp of +0 UTC. All eleven available forcing variables from this dataset are incorporated (Table 1), representing commonly used inputs for both land surface models and machine learning applications. From the original gridded data, we generate a single time series for each variable by computing area-weighted spatial averages based on the catchment boundaries as inferred from the GAGES-II dataset, specifically by identifying NLDAS-2 grid cells within river basins and then calculating averages based on these grid cells. Among the variables, air temperature data, initially provided in Kelvin degrees, was converted to Celsius degrees by subtracting 273.15.
Catchment attributes
CAMELSH includes a series of static catchment attributes that are considered time-invariant and contain information about topography, hydroclimatic signatures, and catchment characteristics covering land cover, soil, hydrogeology, and human influences. Specifically, this dataset includes three sets of catchment attributes: (1) attributes used directly from GAGES-II; (2) attributes derived from the global HydroATLAS dataset41; and (3) climate attributes derived from NLDAS-2 data, which are similar to the climate attributes provided in CAMELS. HydroATLAS was chosen as the source for these attributes due to its global coverage and redistribution-friendly license. The provision of basin attributes from the global HydroATLAS dataset also creates opportunities for easier integration of datasets on a global scale, as similar datasets from other regions can derive basin attributes from the same source to ensure consistency.
First, regarding the set of basin attributes from GAGES-II, since this dataset was already compiled and provided by USGS, we did not modify the data but rather filtered and selected information matching the basin coverage in this dataset for the CONUS region. A total of 439 attributes are listed in this set, including climate, geology, hydrology, river/stream morphology, landscape, nutrient, soil, topography, and anthropogenic influences (such as dams and population). Although this attribute dataset provides comprehensive information about the watersheds, some specific information related to river attributes such as river bankfull width and depth or slope is not available. Detailed descriptions of these attributes are provided in Table S.1 in the Supplementary Information (SI) file.
Second, in order to extract the basin attributes from the HydroATLAS dataset, we employed HydroATLAS’s highest-resolution shape file (level 12). The HydroATLAS polygonal units are smaller than the catchment boundaries (from the GAGES-II dataset) associated with individual gauge, wherein a singular drainage area polygon corresponding to a specific gauge station typically encompasses multiple HydroATLAS polygonal units. Therefore, we first aggregated the spatial join of the HydroATLAS subbasins to form the catchment boundary corresponding to each stream gauge and then derived the catchment attributes. Catchment attributes from HydroATLAS can be grouped into the following categories: hydrology, physiography, climatology, soils and geology, land cover, and anthropogenic influences. A list of 195 catchment attributes is provided in Table S.2 in the SI file.
Lastly, based on the NLDAS-2 data collected for each basin, we calculated climate attributes (corresponding to those described in the CAMELS datasets) as detailed in Table 2. The specific calculation methods can be referenced in Addor, et al.10.
Data Records
The current versions (v2 and v3) of the CAMELSH dataset42,43, containing data for 9,008 basins, are available as Zenodo repositories (https://doi.org/10.5281/zenodo.15066778 and https://doi.org/10.5281/zenodo.15070091). Due to the total data volume in the repository being approximately 57 GB, which exceeds maximum allowable Zenodo’s size limit, we split it into two different repositories. The first link contains data on attributes, shapefiles, and time series data for basins with observed streamflow. The second link contains time series data for the ‘non-obs’ basins. All data is compressed in 7zip format. After extraction, the dataset is organized into the following subfolders and summarized in Table 3:
-
The attributes folder contains 28 CSV (comma-separated values) files that store basin attributes with all files beginning with “attributes_” and one excel file. Of these, the ‘attributes_nldas2_climate.csv’ file contains nine climate attributes (Table 2) derived from NLDAS-2 data. The ‘attributes_hydroATLAS.csv’ file includes 195 basin attributes derived from the HydroATLAS dataset. 26 files with names starting with ‘attributes_gageii_’ contain a total of 439 basin attributes extracted from the GAGES-II dataset. The name of each file represents a distinct group of attributes, as described in Table S.1. The remaining file, named ‘Var_description_gageii.xlsx’, provides explanatory details regarding the variable names included in the 26 CSV files, with information similar to that presented in Table S.1. The first column in all CSV files, labeled ‘STAID’, contains the identification (ID) names of the stream gauges. These IDs are assigned by the USGS and are sourced from the original GAGES-II dataset.
-
The shapefiles folder contains two sets of shapefiles for the catchment boundary. The first set, CAMELSH_shapefile.shp, is derived from the original GAGES-II dataset and is used to obtain the corresponding climate forcing data for each catchment. The second set, CAMELSH_shapefile_hydroATLAS.shp, includes catchment boundaries derived from the HydroATLAS dataset. Each polygon in both shapefiles contains a field named GAGE_ID, which represents the ID of the stream gauges.
-
The timeseries (7zip) file contains a compressed archive (7zip) that includes time series data for 3,166 basins with observed streamflow data. Within this 7zip file, there are a total of 3,166 NetCDF files, each corresponding to a specific basin. The name of each NetCDF file matches the stream gauge ID. Each file contains an hourly time series from 1980-01-01 00:00:00 to 2024-12-31 23:00:00 for streamflow (denoted as “Streamflow” in the NetCDF file) and 11 climate variables (see Table 1). The streamflow data series includes missing values, which are represented as “NaN”. All meteorological forcing data and streamflow records have been standardized to the + 0 UTC time zone.
-
The timeseries_nonobs (7zip) file contains time series data for 5,842 basins without observed streamflow data. The structure of each NetCDF file is similar to the one described above.
-
The info.csv file, located in the main directory of the dataset, contains basic information for 9,008 stream stations. This includes the stream gauge ID, the total number of observed hourly data points over 45 years (from 1980 to 2024), and the number of observed hourly data points for each year from 1980 to 2024. Stations with and without observed data are distinguished by the value in the second column, where stations without observed streamflow data have a corresponding value of 0.
Technical Validation
Streamflow observations
We selected 9,008 basins within the CONUS region to generate the CAMELSH dataset, with the locations of the corresponding stream gauge stations shown in Fig. 1. Based on the IDs of these stations, observed streamflow data were automatically retrieved from the USGS Water Services server over the period of 1979–2024.
The data for each station were downloaded individually, with varying temporal resolutions ranging from 15 minutes to 1 hour, and the default unit being cubic feet per second. The USGS provides streamflow data flags to indicate periods of estimated flow, which are included in this dataset. These flags are as follows: [A] Approved for publication—processing and review completed; [e] Value has been estimated; [R] Records for these data have been revised; and [P] Provisional data subject to revision. However, additional data quality information is unavailable without further investigation and is not included in this dataset. Therefore, we opted to use the original data in its entirety rather than removing any records. Notably, this dataset includes basins near coastal areas, where streamflow measurements may be influenced by tidal effects, making negative discharge values possible.
The data series were converted to cubic meters per second and standardized into uniform hourly time series, all adjusted to the UTC + 0 time zone. To ensure no missing data in the first few hours of 1980 due to the time zone conversion, data retrieval began in 1979 as mentioned above. After compiling the entire dataset, we found that 5,842 stations had no hourly data records available in the USGS Water Information System throughout the period from 1980 to 2024, while 3,166 stations contained streamflow records (Fig. 1). The stations with available data are distributed across most regions of CONUS, though with varying densities. Notably, the Central U.S. (West Plains) and Eastern Highlands have a higher concentration of stations with available data. Figure 2 presents the statistics on streamflow data availability, revealing that the length of the time series varies across stations, ranging from less than one year to nearly 40 years. The highest number of stations falls within the 15- to 35-year range, totaling 2,292 out of the 3,166 stations. Only 11 stations have more than 35 years of recorded hourly streamflow data.
The availability of streamflow records. Map (a) shows the total number of available hourly data records corresponding to each stream gauge from a total of 3,166 stream gauges. Histogram (b) shows the distribution of the number of stations corresponding to the total available data records (measured in years). Stations without available observed data are not shown in this figure.
Climate attributes
As described in the methodology, the meteorological forcing data were aggregated from the NLDAS-2 dataset, comprising a total of 11 variables (Table 1). Initially, these data were downloaded as gridded datasets, with each hourly time step stored in a separate file. The data were then processed for each catchment using the catchment boundaries provided by the GAGES-II dataset. All meteorological forcing data are available for every basin without any missing values. From this dataset, we computed nine climate attributes following Addor, et al.10. Since the original data are hourly, we processed key variables, including precipitation, temperature, and potential evapotranspiration, into daily time series for attribute calculations.
The results of the 9 climate attributes are presented in Fig. 3. For instance, the results show that the Northwestern and Southeastern regions of the U.S. experience the highest average daily precipitation, whereas the Central U.S., particularly the Great Plains and Rocky Mountains, receive significantly lower amounts. The potential evapotranspiration exhibits a clear latitudinal gradient, increasing toward the south of the U.S. Aridity is highest in the Southwest, High Plains, and Great Plains, while the Northwest, Northeast, and Appalachian regions are the most humid. The spatial variation in the annual precipitation cycle is strongest (with high absolute values of p_seasonality) along the Pacific coast, the northern Great Plains, and Florida, while the weakest seasonality is observed along the Atlantic coast. Additionally, the snow_fraction index (fraction of precipitation falling as snow) highlights areas with significant snowfall, with the highest values found in the high-altitude regions of the Central U.S.–West Mountains, followed by the Northeast region.
Climate Attributes for 9,008 Catchments in the CONUS. The maps display the values (indicated by the color of the dots) of various climate attributes for individual basins. The histograms show the frequency distribution of the corresponding climate attributes, using 10 bins. The maps, arranged from left to right and top to bottom, represent: mean daily precipitation (p_mean), mean daily potential evaporation (pet_mean), aridity index (aridity_index), seasonality and timing of precipitation (p_seasonality), fraction of precipitation falling as snow (frac_snow), frequency of high precipitation days (high_prec_freq), average duration of high precipitation events (high_prec_dur), frequency of low precipitation days (low_prec_freq), and average duration of low precipitation events (low_prec_dur).
Aggregating HydroAtlast basin boundary
The key challenge in extracting data from HydroAtlas is to define which HydroATLAS polygons are within a given gauge’s drainage area. The primary complication is that HydroATLAS basin polygons are not derived based on stream gauge locations. This means that catchment boundaries from the source datasets do not perfectly align with the polygons in HydroATLAS. An example of this is shown in Fig. 4, which compares HydroATLAS polygons with GAGES-II polygons. After manually inspecting a random sample of 100 basins, we identified four distinct cases:
-
The basin boundaries between GAGES-II and aggregated HydroATLAS coincide, with the stream gauge located at either the outlet or along the basin boundary edges (Fig. 4a).
-
The stream gauge is situated within the interior of a HydroATLAS polygon, rather than along its boundaries (Fig. 4b).
-
While the stream gauge is positioned at the outlet in both GAGES-II and HydroATLAS polygons, the area and boundary of the basin aggregated from HydroATLAS exceed those of the GAGES-II polygon.
-
While the stream gauge is positioned at the outlet in both GAGES-II and HydroATLAS polygons, the area and boundary of the basin aggregated from HydroATLAS are smaller than those of the GAGES-II polygon.
Examples of different cases encountered in identifying corresponding catchment for each stream gauge using HydroATLAS polygons. The polygon with black outline and gray fill color represents the basin area from GAGES-II data and is considered the correct polygon. Polygons with blue boundaries and white fill color represent sub-basins from HydroATLAS. The green dot indicates the stream gauge location. Each map shows data for station locations with the name shown in the plot title. Plot (a) shows the overlap between GAGES-II and HydroATLAS polygons. Plot (b) shows the discrepancy when the stream gauge location does not align with the outlet (or boundary) of the HydroATLAS polygon. Plot (c,d) shows the difference in the boundary of the HydroATLAS polygon compared to the GAGES-II polygon despite the stream gauge location being at the outlet.
For cases 1, 3, and 4, we directly used the aggregated basins from ATLAS, attributing the boundary mismatches at stream gauge control points between the two datasets to the use of different Digital Elevation Map sources. However, case 2, where the stream gauge is located within the polygon interior, required an additional adjustment, as including the entire polygon would incorporate downstream catchment areas. To address this, we calculated the overlap between this sub-basin and the GAGES-II polygon, selecting only the overlapping portion to merge with upstream polygons to form the final basin boundary. The non-overlapping portion was excluded. After completing the basin aggregation based on the HydroATLAS dataset, our analysis revealed that 5,887 basins demonstrated over 90% boundary overlap (Fig. 5). Additionally, 302 and 854 basins showed areas smaller and larger than the GAGES-II dataset by more than 50%, respectively. For the selected sub-basins used in aggregation, basin attributes were derived as described in the methods section. For the final sub-basin containing the stream gauge, in scenario 2 cases (as shown in Fig. 4b), the new overlapping area was used to derive basin attributes.
A comparison of catchment polygons between GAGES-II and HydroATLAS datasets. Plot (a) shows a 1:1 comparison of the catchment areas between GAGES-II and HydroATLAS dataset. Plot (b) shows a 1:1 comparison of overlap fraction, calculated as the ratio of intersecting area to the catchment area from GAGES-II and HydroATLAS datasets. Scatter plot values closer to 1 indicate that the catchment polygons from the two datasets have more similar boundaries.
Since the original attributes were independently calculated for each individual sub-basin, we recalculated them when analyzing the entire watershed area controlled by stream gauges. For the 195 attributes, we implemented three different approaches based on their characteristics. First, we used an area-weighted aggregate method for attributes expressed as averaged values of sub-basins to calculate their magnitudes for the whole basin. Second, for attributes categorized into classes (denoted by integer values), we determined the total area for each class and selected the class with the largest areal proportion as representative of the total catchment area. Last, for attributes that were pre-calculated in the original HydroATLAS dataset based on the total dranage area (such as inundation extent, percent lake area, and lake volume), we used these pre-calculated values from the sub-catchment containing the streamgage directly. In cases where a portion of the last sub-basin containing the stream gauge was excluded (as the case shown in Fig. 4b), we adjusted these attributes by multiplying them by the ratio of the catchment area derived from HydroATLAS to the original catchment area with the full area of the last sub-basin.
Basin attribute uncertainty
One of the key aspects of providing a large sample dataset is evaluating dataset uncertainty, particularly for basin attributes when a single attribute can be calculated based on different climate data or natural geographical conditions. For example, the comparison of basin areas above demonstrates the uncertainty in basin attributes. After filtering basin attribute data between GAGES-II and HydroATLAS, we identified 37 attributes with overlapping information including data on climate (air temperature, precipitation, potential evapotranspiration), population density, road density, urban extent, soil characteristics (clay, silt, and sand fractions), and topographic features (elevation and slope). Figure 6 presents a 1:1 comparison of 11 attributes, clearly showing differences between the two datasets. Only attributes such as mean annual precipitation (p-mean), mean annual air temperature (Ta-mean), and mean elevation mean show better correlation with R² values of 0.85, 0.98, and 0.98 respectively. Mean annual potential evapotranspiration (pet-mean), despite having a correlation as high as 0.8, shows values from HydroATLAS that are generally higher than values calculated from the GAGES-II dataset. The remaining attributes exhibit high uncertainty with R² values all below 0.7.
A comparison of 11 basin attributes from GAGES-II and HydroATLAS datasets. The attributes include mean annual precipitation (p-mean), mean annual air temperature (Ta-mean), mean annual potential evapotranspiration (pet-mean), urban extent, population density, road density, clay, silt, and sand fractions, mean elevation and mean slope.
Basin diversity
Figure 7 demonstrates the diversity of the dataset when selecting all CONUS basins from GAGES-II. Specifically, Fig. 7a shows the diversity of watersheds in terms of flow disturbance levels, as measured by the hydrological disturbance index. The inclusion of managed basins represents a key difference from the previous daily CAMELS dataset, which primarily used near-natural basins (labeled as “ref” basins in Fig. 7a). In our dataset, the proportion of “Non-ref” stations (i.e., those with significantly disturbed flow) is substantially larger than ref basins (7,080 compared to 1,928). When considering only basins with available streamflow data, the proportional difference remains nearly similar with 2,454 (disturbed) versus 712 (natural) catchments. Additionally, this dataset offers variability in basin characteristics such as catchment area, specifically ranging from 3.8 km2 to 5 × 104 km2 based on GAGES-II polygons (Fig. 7b). The basins were also collected from all 9 different ecoregions and represent 7 distinct geological characteristics (Fig. 7c,d).
The diversity of the CAMELSH dataset. Map (a) shows the hydrological disturbance index corresponding to each stream gauge. Subplot (b) shows the CDF (cummulative distribution function) of basin areas corresponding to stream gauge locations. Stacked bar plots (c,d) show the number of basins belonging to different groups of ecoregion and geology characteristics. All these basin classification attributes were collected from the GAGES-II dataset. A detailed explanation of ecoregions is provided in Table S.3. In subplot (d) at the location of the intermediate group, the number of ‘non-obs’ and ‘obs’ basins is both 1. Due to these very small quantities, they do not show up.
Code availability
The source code can be found at https://github.com/vinhngoctran/CAMELSH.
References
Gudmundsson, L. et al. Globally observed trends in mean and extreme river flow attributed to climate change. Science 371, 1159–1162 (2021).
Blöschl et al. Changing climate both increases and decreases European river floods. Nature 573, 108–111 (2019).
Hodgkins, G. A. et al. Climate Driven Trends in Historical Extreme Low Streamflows on Four Continents. Water Resources Research 60, e2022WR034326, https://doi.org/10.1029/2022WR034326 (2024).
Tran, V. N., Ivanov, V. Y. & Kim, J. Data reformation–A novel data processing technique enhancing machine learning applicability for predicting streamflow extremes. Advances in Water Resources 182, 104569 (2023).
Nearing, G. et al. Global prediction of extreme floods in ungauged watersheds. Nature 627, 559–563 (2024).
Feng, D., Liu, J., Lawson, K. & Shen, C. Differentiable, learnable, regionalized process‐based models with multiphysical outputs can approach state‐of‐the‐art hydrologic prediction accuracy. Water Resources Research 58, e2022WR032404 (2022).
Bierkens, M. F. Global hydrology 2015: State, trends, and directions. Water Resources Research 51, 4923–4947 (2015).
Döll, P., Douville, H., Güntner, A., Müller Schmied, H. & Wada, Y. Modelling freshwater resources at the global scale: challenges and prospects. Surveys in Geophysics 37, 195–221 (2016).
Schaake, J., Cong, S. & Duan, Q. US MOPEX data set. (Lawrence Livermore National Lab.(LLNL), Livermore, CA (United States), 2006).
Addor, N., Newman, A. J., Mizukami, N. & Clark, M. P. The CAMELS data set: catchment attributes and meteorology for large-sample studies. Hydrology and Earth System Sciences 21, 5293–5313 (2017).
Alvarez-Garreton, C. et al. The CAMELS-CL dataset: catchment attributes and meteorology for large sample studies–Chile dataset. Hydrology and Earth System Sciences 22, 5817–5846 (2018).
Chagas, V. B. et al. CAMELS-BR: hydrometeorological time series and landscape attributes for 897 catchments in Brazil. Earth System Science Data 12, 2075–2096 (2020).
Coxon, G. et al. CAMELS-GB: hydrometeorological time series and landscape attributes for 671 catchments in Great Britain. Earth System Science Data 12, 2459–2483 (2020).
Fowler, K. J., Acharya, S. C., Addor, N., Chou, C. & Peel, M. C. CAMELS-AUS: hydrometeorological time series and landscape attributes for 222 catchments in Australia. Earth System Science Data Discussions 2021, 1–30 (2021).
Loritz, R. et al. CAMELS-DE: hydro-meteorological time series and attributes for 1555 catchments in Germany. Earth System Science Data Discussions 2024, 1–30 (2024).
Mangukiya, N. K. et al. CAMELS-IND: hydrometeorological time series and catchment attributes for 228 catchments in Peninsular India. Earth System Science Data 17, 461–491 (2025).
Höge, M. et al. CAMELS-CH: hydro-meteorological time series and landscape attributes for 331 catchments in hydrologic Switzerland. Earth System Science Data Discussions 2023, 1–46 (2023).
Kratzert, F. et al. Caravan-A global community dataset for large-sample hydrology. Scientific Data 10, 61 (2023).
Tran, T. D., Tran, V. N. & Kim, J. Improving the accuracy of dam inflow predictions using a long short-term memory network coupled with wavelet transform and predictor selection. Mathematics 9, 551 (2021).
Tran, V. N. et al. A deep learning modeling framework with uncertainty quantification for inflow-outflow predictions for cascade reservoirs. Journal of Hydrology 629, 130608 (2024).
Gauch, M. et al. Rainfall–runoff prediction at multiple timescales with a single Long Short-Term Memory network. Hydrology and Earth System Sciences 25, 2045–2062 (2021).
Demir, I., Xiang, Z., Demiray, B. & Sit, M. Waterbench: a large-scale benchmark dataset for data-driven streamflow forecasting. Earth System Science Data Discussions 2022, 1–19 (2022).
Hapuarachchi, H., Wang, Q. & Pagano, T. A review of advances in flash flood forecasting. Hydrological processes 25, 2771–2784 (2011).
Tran, V. N., Dwelle, M. C., Sargsyan, K., Ivanov, V. Y. & Kim, J. A novel modeling framework for computationally efficient and accurate real‐time ensemble flood forecasting with uncertainty quantification. Water Resources Research 56, e2019WR025727 (2020).
Ivanov, V. Y. et al. Breaking down the computational barriers to real‐time urban flood forecasting. Geophysical Research Letters 48, e2021GL093585 (2021).
Tran, V. N., Ivanov, V. Y., Xu, D. & Kim, J. Closing in on Hydrologic Predictive Accuracy: Combining the Strengths of High‐Fidelity and Physics‐Agnostic Models. Geophysical Research Letters 50, e2023GL104464 (2023).
Beven, K. J. Rainfall-runoff modelling: the primer. (John Wiley & Sons, 2012).
Sivapalan, M. et al. IAHS Decade on Predictions in Ungauged Basins (PUB), 2003–2012: Shaping an exciting future for the hydrological sciences. Hydrological sciences journal 48, 857–880 (2003).
McDonnell, J. J. et al. Moving beyond heterogeneity and process complexity: A new vision for watershed hydrology. Water Resources Research 43, https://doi.org/10.1029/2006WR005467 (2007).
Hirsch, R. M. & Costa, J. E. U.S. stream flow measurement and data dissemination improve. Eos, Transactions American Geophysical Union 85, 197–203, https://doi.org/10.1029/2004EO200002 (2004).
Kirchner, J. W. Catchments as simple dynamical systems: Catchment characterization, rainfall-runoff modeling, and doing hydrology backward. Water Resources Research 45, https://doi.org/10.1029/2008WR006912 (2009).
Ouyang, W. et al. Continental-scale streamflow modeling of basins with reservoirs: Towards a coherent deep-learning-based strategy. Journal of Hydrology 599, 126455 (2021).
Grill, G. et al. Mapping the world’s free-flowing rivers. Nature 569, 215–221 (2019).
Blöschl, G. & Sivapalan, M. Scale issues in hydrological modelling: a review. Hydrological processes 9, 251–290 (1995).
Fatichi, S. et al. An overview of current applications, challenges, and future trends in distributed process-based models in hydrology. Journal of Hydrology 537, 45–60 (2016).
Muñoz-Sabater, J. et al. ERA5-Land: a state-of-the-art global reanalysis dataset for land applications. Earth Syst. Sci. Data 13, 4349–4383, https://doi.org/10.5194/essd-13-4349-2021 (2021).
Leathwick, J. et al. Freshwater ecosystems of New Zealand (FENZ) geodatabase. Department of conservation (2010).
Falcone, J. A. GAGES-II: Geospatial attributes of gages for evaluating streamflow. USGS Report, 41 (2011).
Lewis, C. S., Geli, H. M. & Neale, C. M. Comparison of the NLDAS weather forcing model to agrometeorological measurements in the western United States. Journal of Hydrology 510, 385–392 (2014).
Jung, J. et al. Evaluation of NLDAS-2 and downscaled air temperature data in Florida. Physical Geography 43, 562–588 (2022).
Linke, S. et al. Global hydro-environmental sub-basin and river reach characteristics at high spatial resolution. Scientific Data 6, 283, https://doi.org/10.1038/s41597-019-0300-6 (2019).
Tran, V. N. CAMELSH: A Large-Sample Hourly Hydrometeorological Dataset and Attributes at Watershed-Scale for Contiguous United States. Zenodo, v2, https://doi.org/10.5281/zenodo.15066778 (2025).
Tran, V. N. CAMELSH: A Large-Sample Hourly Hydrometeorological Dataset and Attributes at Watershed-Scale for Contiguous United States. Zenodo, v3, https://doi.org/10.5281/zenodo.15070091 (2025).
Acknowledgements
V. N. Tran and V. Y. Ivanov acknowledge the support of the U.S. National Science Foundation CMMI program award # 2053429 and the Department of Defense, Department of the Navy, the Office of Naval Research award # N00014-23-1-2735. Valeriy Y. Ivanov acknowledges the support of the U.S. National Science Foundation (NSF) Navigating the New Arctic Program Track-I grant 2126792. Donghui Xu was supported by the Scientific Discovery through Advanced Computing 5 (Capturing the Dynamics of Compound Flooding in E3SM), funded by the U.S. Department of Energy, Office of Science, Office of Biological and Environmental Research.
Author information
Authors and Affiliations
Contributions
Vinh Ngoc Tran: Conceptualization, Methodology, Software, Validation, Investigation, Data Acquisition, Data Curation, Visualization, Writing – Original Draft, Writing – Review & Editing. Donghui Xu: Investigation, Writing – Review & Editing. Tam Van Nguyen: Writing – Review & Editing. Taeho Kim: Data Acquisition, Writing – Review & Editing. Valeriy Y. Ivanov: Supervision, Writing – Review & Editing, Resources, Funding acquisition.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Tran, V.N., Xu, D., Van Nguyen, T. et al. CAMELSH: A Large-Sample Hourly Hydrometeorological Dataset and Attributes at Watershed-Scale for CONUS. Sci Data 12, 1307 (2025). https://doi.org/10.1038/s41597-025-05612-6
Received:
Accepted:
Published:
Version of record:
DOI: https://doi.org/10.1038/s41597-025-05612-6









