CAMELSH: A Large-Sample Hourly Hydrometeorological Dataset and Attributes at Watershed-Scale for CONUS

Tran, Vinh Ngoc; Xu, Donghui; Van Nguyen, Tam; Kim, Taeho; Ivanov, Valeriy Y.

doi:10.1038/s41597-025-05612-6

Download PDF

Data Descriptor
Open access
Published: 28 July 2025

CAMELSH: A Large-Sample Hourly Hydrometeorological Dataset and Attributes at Watershed-Scale for CONUS

Scientific Data volume 12, Article number: 1307 (2025) Cite this article

5788 Accesses
1 Citations
1 Altmetric
Metrics details

Subjects

Abstract

We present CAMELSH (Catchment Attributes and Hourly HydroMeteorology for Large-Sample Studies), the first large-sample hydrometeorological dataset at the hourly scale for the contiguous United States. CAMELSH intergrates hourly meteorological time series, catchment attributes and boundaries from GAGES-II and HydroATLAS for 9,008 catchments across diverse climatic, hydrological, and anthropogenic conditions. In addition, hourly streamflow time series is provided for 3,166 catchments. The dataset spans 45 years (1980–2024) with 11 meteorological variables from the NLDAS-2 forcing dataset, from which we compute nine climate indices related to precipitation, evapotranspiration, seasonality, and snow fraction. Additionally, CAMELSH includes two sets of catchment attributes: 439 from GAGES-II and 195 derived from HydroATLAS. These attributes include factors related to climate, geology, hydrology, river/stream morphology, landscape, nutrient, soil, topography, and anthropogenic influences. Developed in accordance with FAIR (Findability, Accessibility, Interoperability, and Reusability) principles, CAMELSH is the first large-sample dataset at an hourly timescale, supporting machine learning applications for short-term streamflow (flood) prediction and advancing data-driven hydrological research across multiple timescales.

Caravan - A global community dataset for large-sample hydrology

Article Open access 31 January 2023

Hierarchical climate-driven dynamics of the active channel length in temporary streams

Article Open access 02 November 2021

EStreams: An integrated dataset and catalogue of streamflow, hydro-climatic and landscape variables for Europe

Article Open access 13 August 2024

Background & Summary

Large-sample hydrometeorological datasets have become fundamental to advancing hydrological research by enabling researchers to analyze streamflow patterns across diverse catchments^1,2,3, develop robust predictive models^4,5,6, and support critical applications in water resource management and climate change assessment^2,7,8. The expansion of large-sample datasets, driven by the need for benchmark data in large-scale studies, increasing data availability, and improved data-sharing initiatives, began with the MOPEX (Model Parameter Estimation Experiment)⁹ dataset in the United States (US) and significantly expanded through the development of the first CAMELS (Catchment Attributes and MEteorology for Large-sample Studies)¹⁰ in 2017, which covered 671 U.S. catchments and integrated extensive hydrometeorological observations with catchment attributes. CAMELS served as a catalyst for similar datasets across multiple countries, including Chile¹¹, Brazil¹², Great Britain¹³, Australia¹⁴, Germany¹⁵, India¹⁶, and Switzerland¹⁷, ultimately leading to Caravan (a series of CAMELS)¹⁸, a comprehensive global dataset that synthesizes various CAMELS sources to provide harmonized, high-quality data for studying hydrological processes at regional and global scales.

Despite the widespread availability of shared datasets, it remains challenging to find large sample datasets at sub-daily or hourly temporal scales; most existing datasets (for continental or global scales) operate at a daily scale. This hinders large-scale research and model capability assessment at a sub-daily time scale, which is crucial for developing machine learning (ML) approaches relevant for decision-making and mitigation activities. While ML applications have shown effectiveness for simulating daily streamflow^5,19,20, similar achievements in large-scale hourly streamflow prediction or Prediction in Ungauged Basins (PUB) remain limited to isolated case studies^21,22. This limitation is particularly critical as hourly forecasts play a crucial role in disaster risk mitigation, especially for short-lived flooding events^23,24,25,26. At hourly timescales, streamflow patterns and physical characteristics of streamflow generation become increasingly complex and less predictable as compared to the daily scale^27,28,29. Besides, the potential benefits of hourly data extend beyond model development, offering information to provide vital insights into short-term variability, extreme events, and diurnal patterns in streamflow, precipitation, and temperature^27,30,31.

While hourly hydrometeorological datasets exist, they lack comprehensive coverage in terms of the number of watersheds and diversity of basin characteristics. Specifically, Demir, et al.²² provided the WaterBench-Iowa dataset, which focuses on a specific region in Iowa, US, covering 125 locations. In another study, Gauch, et al.²¹ presented an hourly dataset for 516 CAMELS basins spanning the period 1990–2018. However, one of the most well-known limitations of the CAMELS dataset is its lack of diversity, as it primarily contains data for near-natural basins with minimal human impacts¹⁰. This limitation was highlighted in Ouyang, et al.³², which demonstrated that ML models trained on this dataset cannot generalize to regulated basins. It is worth noting that two-thirds of the river basins globally are effected by human activities, resulting in non-natural flow conditions³³.

Given the needs outlined above, in this study, we present the first comprehensive hourly dataset, named CAMELSH (Catchment Attributes and Hourly hydroMEteorology for Large-sample Studies), which provides data for 9,008 basins across the contiguous United States (CONUS) under diverse conditions. The dataset is structured according to FAIR principles (Findable, Accessible, Interoperable, and Re-usable), featuring standardized units, time formats, time zones, and data formats to ensure seamless integration and analysis. All data are organized in a readily accessible format to facilitate straightforward extraction and utilization, with planned ongoing updates to expand the spatial coverage through additional basins, the temporal extent of the data records, forcing data from various sources, as well as different measurable parameters that can be collected such as water level or stream temperature.

The CAMELSH dataset comprises three main components: hourly hydrometeorological time series, basin attributes from multiple sources, and watershed boundary shapefiles. The hydrometeorological data spans from 1980 to 2024, incorporating 11 meteorological variables derived from the NLDAS-2 (North American Land Data Assimilation System) dataset, while the streamflow observations are sourced from the United States Geological Survey (USGS). Operating at a higher temporal resolution, CAMELSH encompasses 9,008 basins (the first version of CAMELSH contains 3,166 basins with observed streamflow data), representing diverse watersheds across different climatic regions, geological characteristics, and varying degrees of human influence.

Our vision in publishing high temporal resolution (e.g., hourly) datasets is to promote the convergence of science efforts, in which researchers worldwide can work with common datasets to reach shared goals. Using standardized datasets particularly benefits model development studies while also facilitating easier reproduction of results for benchmarking models used in evaluation and comparison. More importantly, beyond developing models solely to improve their forecasting performance, developing datasets at finer timescales is crucial for exploring the contribution of various physical mechanisms in the land-surface phase of the hydrological cycle, especially useful for (mostly) process-agnostic machine learning applications. It is worth noting that recent studies have extensively addressed timescale-related questions in process-based modeling and downscaling applications^34,35. With the current remarkable advancement of ML and AI, similar efforts need to be undertaken to understand the scalability of these novel tools across various hydrologic regimes. However, improving model prediction performance at hourly temporal resolution still remains the dominant research topics.

One of the key criteria for large-sample datasets is the ability to provide uncertainty assessments, particularly for data not directly obtained from observations, such as meteorological forcing and basin attributes. In this version of the dataset, we provide two versions of basin attributes. Additionally, alongside the completed dataset, we also provided a second forcing dataset derived from ERA5-Land³⁶ to enhance the number of options for available data sources. This also opens future opportunities for integrating CAMELSH into a global dataset, as ERA5-Land offers full global coverage, ensuring consistency. We also are searching for sources of hourly streamflow data at a global scale, but availability of such data is extremely limited. Current global datasets such as GRDC (Global Runoff Data Centre) only contain data at a daily temporal resolution. Another key plan in our efforts to improve the dataset quality is to enhance basin attribute data to serve diverse research applications, not only hydrological studies but also those related to ecological studies³⁷. Also, basin characteristics (land cover, soil, vegetation, or infrastructure) should ideally be considered dynamic rather than static variables. With advancements in AI and remote sensing, processing such data has become feasible; however, it requires substantial resources and time, particularly for large, continental, or global-scale applications.

Method

Basin selection

CAMELSH uses the basins from the Geospatial Attributes of Gages for Evaluating Streamflow (GAGES-II) database³⁸, which contains geospatial information for over 9,322 stream gages across the USGS network, including CONUS, Alaska, and island territories. Our dataset prioritizes diversity in terms of basin size, climate, topographical, geographical, geology, and human interventions that affect stream flow generation mechanisms, without imposing additional selection criteria. The only constraint is the geographical limitation to CONUS (Contiguous United States) basins due to the spatial coverage of our meteorological forcing data. This results in 9,008 basins in CAMELSH. All catchment shapefiles have been standardized to the WGS84 coordinate system.

Streamflow data collection

The streamflow data are obtained automatically from the USGS Water Information System’s Instantaneous Values Service (https://waterservices.usgs.gov/), which provides historical measurements at sub-daily resolutions (typically 15 to 60-minute intervals) that we aggregated to hourly timesteps for consistency. To ensure standardization, all data, originally recorded in different time zones, were converted to + 0 UTC timestamps, and the discharge units were transformed from cubic feet per second (cfs) to cubic meters per second (cms). Note that the streamflow data has timezone flags at each time step, so Daylight Saving Time was automatically considered in the process of standardizing the timezone. The data collection period was established from 1980/1/1 00:00:00 to 2024/12/31 23:00:00 + 0 UTC, with missing hourly observations denoted as “NaN” in the time series.

Meteorological data collection

Only a limited number of datasets provide high-resolution hourly meteorological data, such as NLDAS-2 and ERA5 (or ERA5-Land). For the first version of CAMELSH, we utilize NLDAS-2 data (https://ldas.gsfc.nasa.gov/nldas/v2/forcing), which is specifically designed for the CONUS. We acknowledge that the spatial resolution of NLDAS-2 is relative coarse for small basins and the accuracy of this data is not optimal, particularly for extreme events^39,40. However, NLDAS-2 is the only dataset that meets our expectations that is specifically designed for CONUS and provides meteorological forcing data at hourly scale with the necessary spatial and temporal coverage for our study.

NLDAS-2 data features a spatial resolution of 0.125 × 0.125 degrees and 1-hour temporal resolution, sourced directly from NASA (https://hydro1.gesdisc.eosdis.nasa.gov/data/NLDAS/NLDAS_FORA0125_H.002/) with a default timestamp of +0 UTC. All eleven available forcing variables from this dataset are incorporated (Table 1), representing commonly used inputs for both land surface models and machine learning applications. From the original gridded data, we generate a single time series for each variable by computing area-weighted spatial averages based on the catchment boundaries as inferred from the GAGES-II dataset, specifically by identifying NLDAS-2 grid cells within river basins and then calculating averages based on these grid cells. Among the variables, air temperature data, initially provided in Kelvin degrees, was converted to Celsius degrees by subtracting 273.15.

Table 1 The list of NLDAS-2 meteorological variables.

Full size table

Catchment attributes

CAMELSH includes a series of static catchment attributes that are considered time-invariant and contain information about topography, hydroclimatic signatures, and catchment characteristics covering land cover, soil, hydrogeology, and human influences. Specifically, this dataset includes three sets of catchment attributes: (1) attributes used directly from GAGES-II; (2) attributes derived from the global HydroATLAS dataset⁴¹; and (3) climate attributes derived from NLDAS-2 data, which are similar to the climate attributes provided in CAMELS. HydroATLAS was chosen as the source for these attributes due to its global coverage and redistribution-friendly license. The provision of basin attributes from the global HydroATLAS dataset also creates opportunities for easier integration of datasets on a global scale, as similar datasets from other regions can derive basin attributes from the same source to ensure consistency.

First, regarding the set of basin attributes from GAGES-II, since this dataset was already compiled and provided by USGS, we did not modify the data but rather filtered and selected information matching the basin coverage in this dataset for the CONUS region. A total of 439 attributes are listed in this set, including climate, geology, hydrology, river/stream morphology, landscape, nutrient, soil, topography, and anthropogenic influences (such as dams and population). Although this attribute dataset provides comprehensive information about the watersheds, some specific information related to river attributes such as river bankfull width and depth or slope is not available. Detailed descriptions of these attributes are provided in Table S.1 in the Supplementary Information (SI) file.

Second, in order to extract the basin attributes from the HydroATLAS dataset, we employed HydroATLAS’s highest-resolution shape file (level 12). The HydroATLAS polygonal units are smaller than the catchment boundaries (from the GAGES-II dataset) associated with individual gauge, wherein a singular drainage area polygon corresponding to a specific gauge station typically encompasses multiple HydroATLAS polygonal units. Therefore, we first aggregated the spatial join of the HydroATLAS subbasins to form the catchment boundary corresponding to each stream gauge and then derived the catchment attributes. Catchment attributes from HydroATLAS can be grouped into the following categories: hydrology, physiography, climatology, soils and geology, land cover, and anthropogenic influences. A list of 195 catchment attributes is provided in Table S.2 in the SI file.

Lastly, based on the NLDAS-2 data collected for each basin, we calculated climate attributes (corresponding to those described in the CAMELS datasets) as detailed in Table 2. The specific calculation methods can be referenced in Addor, et al.¹⁰.

Table 2 Climate attributes derived from NLDAS-2 time series.

Full size table

Data Records

The current versions (v2 and v3) of the CAMELSH dataset^42,43, containing data for 9,008 basins, are available as Zenodo repositories (https://doi.org/10.5281/zenodo.15066778 and https://doi.org/10.5281/zenodo.15070091). Due to the total data volume in the repository being approximately 57 GB, which exceeds maximum allowable Zenodo’s size limit, we split it into two different repositories. The first link contains data on attributes, shapefiles, and time series data for basins with observed streamflow. The second link contains time series data for the ‘non-obs’ basins. All data is compressed in 7zip format. After extraction, the dataset is organized into the following subfolders and summarized in Table 3:

The attributes folder contains 28 CSV (comma-separated values) files that store basin attributes with all files beginning with “attributes_” and one excel file. Of these, the ‘attributes_nldas2_climate.csv’ file contains nine climate attributes (Table 2) derived from NLDAS-2 data. The ‘attributes_hydroATLAS.csv’ file includes 195 basin attributes derived from the HydroATLAS dataset. 26 files with names starting with ‘attributes_gageii_’ contain a total of 439 basin attributes extracted from the GAGES-II dataset. The name of each file represents a distinct group of attributes, as described in Table S.1. The remaining file, named ‘Var_description_gageii.xlsx’, provides explanatory details regarding the variable names included in the 26 CSV files, with information similar to that presented in Table S.1. The first column in all CSV files, labeled ‘STAID’, contains the identification (ID) names of the stream gauges. These IDs are assigned by the USGS and are sourced from the original GAGES-II dataset.
The shapefiles folder contains two sets of shapefiles for the catchment boundary. The first set, CAMELSH_shapefile.shp, is derived from the original GAGES-II dataset and is used to obtain the corresponding climate forcing data for each catchment. The second set, CAMELSH_shapefile_hydroATLAS.shp, includes catchment boundaries derived from the HydroATLAS dataset. Each polygon in both shapefiles contains a field named GAGE_ID, which represents the ID of the stream gauges.
The timeseries (7zip) file contains a compressed archive (7zip) that includes time series data for 3,166 basins with observed streamflow data. Within this 7zip file, there are a total of 3,166 NetCDF files, each corresponding to a specific basin. The name of each NetCDF file matches the stream gauge ID. Each file contains an hourly time series from 1980-01-01 00:00:00 to 2024-12-31 23:00:00 for streamflow (denoted as “Streamflow” in the NetCDF file) and 11 climate variables (see Table 1). The streamflow data series includes missing values, which are represented as “NaN”. All meteorological forcing data and streamflow records have been standardized to the + 0 UTC time zone.
The timeseries_nonobs (7zip) file contains time series data for 5,842 basins without observed streamflow data. The structure of each NetCDF file is similar to the one described above.
The info.csv file, located in the main directory of the dataset, contains basic information for 9,008 stream stations. This includes the stream gauge ID, the total number of observed hourly data points over 45 years (from 1980 to 2024), and the number of observed hourly data points for each year from 1980 to 2024. Stations with and without observed data are distinguished by the value in the second column, where stations without observed streamflow data have a corresponding value of 0.

Table 3 Description of the CAMELSH data record.

Full size table

Technical Validation

Streamflow observations

We selected 9,008 basins within the CONUS region to generate the CAMELSH dataset, with the locations of the corresponding stream gauge stations shown in Fig. 1. Based on the IDs of these stations, observed streamflow data were automatically retrieved from the USGS Water Services server over the period of 1979–2024.

The data for each station were downloaded individually, with varying temporal resolutions ranging from 15 minutes to 1 hour, and the default unit being cubic feet per second. The USGS provides streamflow data flags to indicate periods of estimated flow, which are included in this dataset. These flags are as follows: [A] Approved for publication—processing and review completed; [e] Value has been estimated; [R] Records for these data have been revised; and [P] Provisional data subject to revision. However, additional data quality information is unavailable without further investigation and is not included in this dataset. Therefore, we opted to use the original data in its entirety rather than removing any records. Notably, this dataset includes basins near coastal areas, where streamflow measurements may be influenced by tidal effects, making negative discharge values possible.

The data series were converted to cubic meters per second and standardized into uniform hourly time series, all adjusted to the UTC + 0 time zone. To ensure no missing data in the first few hours of 1980 due to the time zone conversion, data retrieval began in 1979 as mentioned above. After compiling the entire dataset, we found that 5,842 stations had no hourly data records available in the USGS Water Information System throughout the period from 1980 to 2024, while 3,166 stations contained streamflow records (Fig. 1). The stations with available data are distributed across most regions of CONUS, though with varying densities. Notably, the Central U.S. (West Plains) and Eastern Highlands have a higher concentration of stations with available data. Figure 2 presents the statistics on streamflow data availability, revealing that the length of the time series varies across stations, ranging from less than one year to nearly 40 years. The highest number of stations falls within the 15- to 35-year range, totaling 2,292 out of the 3,166 stations. Only 11 stations have more than 35 years of recorded hourly streamflow data.

Climate attributes

As described in the methodology, the meteorological forcing data were aggregated from the NLDAS-2 dataset, comprising a total of 11 variables (Table 1). Initially, these data were downloaded as gridded datasets, with each hourly time step stored in a separate file. The data were then processed for each catchment using the catchment boundaries provided by the GAGES-II dataset. All meteorological forcing data are available for every basin without any missing values. From this dataset, we computed nine climate attributes following Addor, et al.¹⁰. Since the original data are hourly, we processed key variables, including precipitation, temperature, and potential evapotranspiration, into daily time series for attribute calculations.

The results of the 9 climate attributes are presented in Fig. 3. For instance, the results show that the Northwestern and Southeastern regions of the U.S. experience the highest average daily precipitation, whereas the Central U.S., particularly the Great Plains and Rocky Mountains, receive significantly lower amounts. The potential evapotranspiration exhibits a clear latitudinal gradient, increasing toward the south of the U.S. Aridity is highest in the Southwest, High Plains, and Great Plains, while the Northwest, Northeast, and Appalachian regions are the most humid. The spatial variation in the annual precipitation cycle is strongest (with high absolute values of p_seasonality) along the Pacific coast, the northern Great Plains, and Florida, while the weakest seasonality is observed along the Atlantic coast. Additionally, the snow_fraction index (fraction of precipitation falling as snow) highlights areas with significant snowfall, with the highest values found in the high-altitude regions of the Central U.S.–West Mountains, followed by the Northeast region.

Aggregating HydroAtlast basin boundary

The key challenge in extracting data from HydroAtlas is to define which HydroATLAS polygons are within a given gauge’s drainage area. The primary complication is that HydroATLAS basin polygons are not derived based on stream gauge locations. This means that catchment boundaries from the source datasets do not perfectly align with the polygons in HydroATLAS. An example of this is shown in Fig. 4, which compares HydroATLAS polygons with GAGES-II polygons. After manually inspecting a random sample of 100 basins, we identified four distinct cases:

The basin boundaries between GAGES-II and aggregated HydroATLAS coincide, with the stream gauge located at either the outlet or along the basin boundary edges (Fig. 4a).
The stream gauge is situated within the interior of a HydroATLAS polygon, rather than along its boundaries (Fig. 4b).
While the stream gauge is positioned at the outlet in both GAGES-II and HydroATLAS polygons, the area and boundary of the basin aggregated from HydroATLAS exceed those of the GAGES-II polygon.
While the stream gauge is positioned at the outlet in both GAGES-II and HydroATLAS polygons, the area and boundary of the basin aggregated from HydroATLAS are smaller than those of the GAGES-II polygon.

For cases 1, 3, and 4, we directly used the aggregated basins from ATLAS, attributing the boundary mismatches at stream gauge control points between the two datasets to the use of different Digital Elevation Map sources. However, case 2, where the stream gauge is located within the polygon interior, required an additional adjustment, as including the entire polygon would incorporate downstream catchment areas. To address this, we calculated the overlap between this sub-basin and the GAGES-II polygon, selecting only the overlapping portion to merge with upstream polygons to form the final basin boundary. The non-overlapping portion was excluded. After completing the basin aggregation based on the HydroATLAS dataset, our analysis revealed that 5,887 basins demonstrated over 90% boundary overlap (Fig. 5). Additionally, 302 and 854 basins showed areas smaller and larger than the GAGES-II dataset by more than 50%, respectively. For the selected sub-basins used in aggregation, basin attributes were derived as described in the methods section. For the final sub-basin containing the stream gauge, in scenario 2 cases (as shown in Fig. 4b), the new overlapping area was used to derive basin attributes.

Since the original attributes were independently calculated for each individual sub-basin, we recalculated them when analyzing the entire watershed area controlled by stream gauges. For the 195 attributes, we implemented three different approaches based on their characteristics. First, we used an area-weighted aggregate method for attributes expressed as averaged values of sub-basins to calculate their magnitudes for the whole basin. Second, for attributes categorized into classes (denoted by integer values), we determined the total area for each class and selected the class with the largest areal proportion as representative of the total catchment area. Last, for attributes that were pre-calculated in the original HydroATLAS dataset based on the total dranage area (such as inundation extent, percent lake area, and lake volume), we used these pre-calculated values from the sub-catchment containing the streamgage directly. In cases where a portion of the last sub-basin containing the stream gauge was excluded (as the case shown in Fig. 4b), we adjusted these attributes by multiplying them by the ratio of the catchment area derived from HydroATLAS to the original catchment area with the full area of the last sub-basin.

Basin attribute uncertainty

One of the key aspects of providing a large sample dataset is evaluating dataset uncertainty, particularly for basin attributes when a single attribute can be calculated based on different climate data or natural geographical conditions. For example, the comparison of basin areas above demonstrates the uncertainty in basin attributes. After filtering basin attribute data between GAGES-II and HydroATLAS, we identified 37 attributes with overlapping information including data on climate (air temperature, precipitation, potential evapotranspiration), population density, road density, urban extent, soil characteristics (clay, silt, and sand fractions), and topographic features (elevation and slope). Figure 6 presents a 1:1 comparison of 11 attributes, clearly showing differences between the two datasets. Only attributes such as mean annual precipitation (p-mean), mean annual air temperature (Ta-mean), and mean elevation mean show better correlation with R² values of 0.85, 0.98, and 0.98 respectively. Mean annual potential evapotranspiration (pet-mean), despite having a correlation as high as 0.8, shows values from HydroATLAS that are generally higher than values calculated from the GAGES-II dataset. The remaining attributes exhibit high uncertainty with R² values all below 0.7.

Basin diversity

Figure 7 demonstrates the diversity of the dataset when selecting all CONUS basins from GAGES-II. Specifically, Fig. 7a shows the diversity of watersheds in terms of flow disturbance levels, as measured by the hydrological disturbance index. The inclusion of managed basins represents a key difference from the previous daily CAMELS dataset, which primarily used near-natural basins (labeled as “ref” basins in Fig. 7a). In our dataset, the proportion of “Non-ref” stations (i.e., those with significantly disturbed flow) is substantially larger than ref basins (7,080 compared to 1,928). When considering only basins with available streamflow data, the proportional difference remains nearly similar with 2,454 (disturbed) versus 712 (natural) catchments. Additionally, this dataset offers variability in basin characteristics such as catchment area, specifically ranging from 3.8 km² to 5 × 10⁴ km² based on GAGES-II polygons (Fig. 7b). The basins were also collected from all 9 different ecoregions and represent 7 distinct geological characteristics (Fig. 7c,d).

Code availability

The source code can be found at https://github.com/vinhngoctran/CAMELSH.

References

Gudmundsson, L. et al. Globally observed trends in mean and extreme river flow attributed to climate change. Science 371, 1159–1162 (2021).
Article ADS CAS PubMed Google Scholar
Blöschl et al. Changing climate both increases and decreases European river floods. Nature 573, 108–111 (2019).
Article ADS PubMed Google Scholar
Hodgkins, G. A. et al. Climate Driven Trends in Historical Extreme Low Streamflows on Four Continents. Water Resources Research 60, e2022WR034326, https://doi.org/10.1029/2022WR034326 (2024).
Article Google Scholar
Tran, V. N., Ivanov, V. Y. & Kim, J. Data reformation–A novel data processing technique enhancing machine learning applicability for predicting streamflow extremes. Advances in Water Resources 182, 104569 (2023).
Article Google Scholar
Nearing, G. et al. Global prediction of extreme floods in ungauged watersheds. Nature 627, 559–563 (2024).
Article ADS CAS PubMed PubMed Central Google Scholar
Feng, D., Liu, J., Lawson, K. & Shen, C. Differentiable, learnable, regionalized process‐based models with multiphysical outputs can approach state‐of‐the‐art hydrologic prediction accuracy. Water Resources Research 58, e2022WR032404 (2022).
Article ADS Google Scholar
Bierkens, M. F. Global hydrology 2015: State, trends, and directions. Water Resources Research 51, 4923–4947 (2015).
Article ADS Google Scholar
Döll, P., Douville, H., Güntner, A., Müller Schmied, H. & Wada, Y. Modelling freshwater resources at the global scale: challenges and prospects. Surveys in Geophysics 37, 195–221 (2016).
Article ADS Google Scholar
Schaake, J., Cong, S. & Duan, Q. US MOPEX data set. (Lawrence Livermore National Lab.(LLNL), Livermore, CA (United States), 2006).
Addor, N., Newman, A. J., Mizukami, N. & Clark, M. P. The CAMELS data set: catchment attributes and meteorology for large-sample studies. Hydrology and Earth System Sciences 21, 5293–5313 (2017).
Article ADS Google Scholar
Alvarez-Garreton, C. et al. The CAMELS-CL dataset: catchment attributes and meteorology for large sample studies–Chile dataset. Hydrology and Earth System Sciences 22, 5817–5846 (2018).
Article ADS Google Scholar
Chagas, V. B. et al. CAMELS-BR: hydrometeorological time series and landscape attributes for 897 catchments in Brazil. Earth System Science Data 12, 2075–2096 (2020).
Article ADS Google Scholar
Coxon, G. et al. CAMELS-GB: hydrometeorological time series and landscape attributes for 671 catchments in Great Britain. Earth System Science Data 12, 2459–2483 (2020).
Article ADS Google Scholar
Fowler, K. J., Acharya, S. C., Addor, N., Chou, C. & Peel, M. C. CAMELS-AUS: hydrometeorological time series and landscape attributes for 222 catchments in Australia. Earth System Science Data Discussions 2021, 1–30 (2021).
Google Scholar
Loritz, R. et al. CAMELS-DE: hydro-meteorological time series and attributes for 1555 catchments in Germany. Earth System Science Data Discussions 2024, 1–30 (2024).
Google Scholar
Mangukiya, N. K. et al. CAMELS-IND: hydrometeorological time series and catchment attributes for 228 catchments in Peninsular India. Earth System Science Data 17, 461–491 (2025).
Article Google Scholar
Höge, M. et al. CAMELS-CH: hydro-meteorological time series and landscape attributes for 331 catchments in hydrologic Switzerland. Earth System Science Data Discussions 2023, 1–46 (2023).
Google Scholar
Kratzert, F. et al. Caravan-A global community dataset for large-sample hydrology. Scientific Data 10, 61 (2023).
Article PubMed PubMed Central Google Scholar
Tran, T. D., Tran, V. N. & Kim, J. Improving the accuracy of dam inflow predictions using a long short-term memory network coupled with wavelet transform and predictor selection. Mathematics 9, 551 (2021).
Article Google Scholar
Tran, V. N. et al. A deep learning modeling framework with uncertainty quantification for inflow-outflow predictions for cascade reservoirs. Journal of Hydrology 629, 130608 (2024).
Article Google Scholar
Gauch, M. et al. Rainfall–runoff prediction at multiple timescales with a single Long Short-Term Memory network. Hydrology and Earth System Sciences 25, 2045–2062 (2021).
Article ADS Google Scholar
Demir, I., Xiang, Z., Demiray, B. & Sit, M. Waterbench: a large-scale benchmark dataset for data-driven streamflow forecasting. Earth System Science Data Discussions 2022, 1–19 (2022).
Google Scholar
Hapuarachchi, H., Wang, Q. & Pagano, T. A review of advances in flash flood forecasting. Hydrological processes 25, 2771–2784 (2011).
Article ADS Google Scholar
Tran, V. N., Dwelle, M. C., Sargsyan, K., Ivanov, V. Y. & Kim, J. A novel modeling framework for computationally efficient and accurate real‐time ensemble flood forecasting with uncertainty quantification. Water Resources Research 56, e2019WR025727 (2020).
Article ADS Google Scholar
Ivanov, V. Y. et al. Breaking down the computational barriers to real‐time urban flood forecasting. Geophysical Research Letters 48, e2021GL093585 (2021).
Article ADS Google Scholar
Tran, V. N., Ivanov, V. Y., Xu, D. & Kim, J. Closing in on Hydrologic Predictive Accuracy: Combining the Strengths of High‐Fidelity and Physics‐Agnostic Models. Geophysical Research Letters 50, e2023GL104464 (2023).
Article ADS Google Scholar
Beven, K. J. Rainfall-runoff modelling: the primer. (John Wiley & Sons, 2012).
Sivapalan, M. et al. IAHS Decade on Predictions in Ungauged Basins (PUB), 2003–2012: Shaping an exciting future for the hydrological sciences. Hydrological sciences journal 48, 857–880 (2003).
Article Google Scholar
McDonnell, J. J. et al. Moving beyond heterogeneity and process complexity: A new vision for watershed hydrology. Water Resources Research 43, https://doi.org/10.1029/2006WR005467 (2007).
Hirsch, R. M. & Costa, J. E. U.S. stream flow measurement and data dissemination improve. Eos, Transactions American Geophysical Union 85, 197–203, https://doi.org/10.1029/2004EO200002 (2004).
Article ADS Google Scholar
Kirchner, J. W. Catchments as simple dynamical systems: Catchment characterization, rainfall-runoff modeling, and doing hydrology backward. Water Resources Research 45, https://doi.org/10.1029/2008WR006912 (2009).
Ouyang, W. et al. Continental-scale streamflow modeling of basins with reservoirs: Towards a coherent deep-learning-based strategy. Journal of Hydrology 599, 126455 (2021).
Article Google Scholar
Grill, G. et al. Mapping the world’s free-flowing rivers. Nature 569, 215–221 (2019).
Article ADS CAS PubMed Google Scholar
Blöschl, G. & Sivapalan, M. Scale issues in hydrological modelling: a review. Hydrological processes 9, 251–290 (1995).
Article ADS Google Scholar
Fatichi, S. et al. An overview of current applications, challenges, and future trends in distributed process-based models in hydrology. Journal of Hydrology 537, 45–60 (2016).
Article ADS Google Scholar
Muñoz-Sabater, J. et al. ERA5-Land: a state-of-the-art global reanalysis dataset for land applications. Earth Syst. Sci. Data 13, 4349–4383, https://doi.org/10.5194/essd-13-4349-2021 (2021).
Article ADS Google Scholar
Leathwick, J. et al. Freshwater ecosystems of New Zealand (FENZ) geodatabase. Department of conservation (2010).
Falcone, J. A. GAGES-II: Geospatial attributes of gages for evaluating streamflow. USGS Report, 41 (2011).
Lewis, C. S., Geli, H. M. & Neale, C. M. Comparison of the NLDAS weather forcing model to agrometeorological measurements in the western United States. Journal of Hydrology 510, 385–392 (2014).
Article ADS Google Scholar
Jung, J. et al. Evaluation of NLDAS-2 and downscaled air temperature data in Florida. Physical Geography 43, 562–588 (2022).
Article Google Scholar
Linke, S. et al. Global hydro-environmental sub-basin and river reach characteristics at high spatial resolution. Scientific Data 6, 283, https://doi.org/10.1038/s41597-019-0300-6 (2019).
Article PubMed PubMed Central Google Scholar
Tran, V. N. CAMELSH: A Large-Sample Hourly Hydrometeorological Dataset and Attributes at Watershed-Scale for Contiguous United States. Zenodo, v2, https://doi.org/10.5281/zenodo.15066778 (2025).
Tran, V. N. CAMELSH: A Large-Sample Hourly Hydrometeorological Dataset and Attributes at Watershed-Scale for Contiguous United States. Zenodo, v3, https://doi.org/10.5281/zenodo.15070091 (2025).

Download references

Acknowledgements

V. N. Tran and V. Y. Ivanov acknowledge the support of the U.S. National Science Foundation CMMI program award # 2053429 and the Department of Defense, Department of the Navy, the Office of Naval Research award # N00014-23-1-2735. Valeriy Y. Ivanov acknowledges the support of the U.S. National Science Foundation (NSF) Navigating the New Arctic Program Track-I grant 2126792. Donghui Xu was supported by the Scientific Discovery through Advanced Computing 5 (Capturing the Dynamics of Compound Flooding in E3SM), funded by the U.S. Department of Energy, Office of Science, Office of Biological and Environmental Research.

Author information

Authors and Affiliations

Department of Civil and Environmental Engineering, University of Michigan, Ann Arbor, MI, USA
Vinh Ngoc Tran, Taeho Kim & Valeriy Y. Ivanov
Atmospheric, Climate, & Earth Sciences Division, Pacific Northwest National Laboratory, Richland, WA, 99354, USA
Donghui Xu
Department of Hydrogeology, Helmholtz Centre for Environmental Research - UFZ, Leipzig, Germany
Tam Van Nguyen

Authors

Vinh Ngoc Tran
View author publications
Search author on:PubMed Google Scholar
Donghui Xu
View author publications
Search author on:PubMed Google Scholar
Tam Van Nguyen
View author publications
Search author on:PubMed Google Scholar
Taeho Kim
View author publications
Search author on:PubMed Google Scholar
Valeriy Y. Ivanov
View author publications
Search author on:PubMed Google Scholar

Contributions

Vinh Ngoc Tran: Conceptualization, Methodology, Software, Validation, Investigation, Data Acquisition, Data Curation, Visualization, Writing – Original Draft, Writing – Review & Editing. Donghui Xu: Investigation, Writing – Review & Editing. Tam Van Nguyen: Writing – Review & Editing. Taeho Kim: Data Acquisition, Writing – Review & Editing. Valeriy Y. Ivanov: Supervision, Writing – Review & Editing, Resources, Funding acquisition.

Corresponding authors

Correspondence to Vinh Ngoc Tran or Valeriy Y. Ivanov.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article

Cite this article

Tran, V.N., Xu, D., Van Nguyen, T. et al. CAMELSH: A Large-Sample Hourly Hydrometeorological Dataset and Attributes at Watershed-Scale for CONUS. Sci Data 12, 1307 (2025). https://doi.org/10.1038/s41597-025-05612-6

Download citation

Received: 08 April 2025
Accepted: 10 July 2025
Published: 28 July 2025
Version of record: 28 July 2025
DOI: https://doi.org/10.1038/s41597-025-05612-6