Introduction

Hydrological modelling has advanced the understanding of the water cycle by simulating the movement, dynamics and quality of water, allowing scientists and policymakers to monitor and predict complex hydrological processes and their interactions with climatic and environmental factors1,2,3. Large-scale hydrological models (LSHMs) are applied to provide valuable insights into complex transboundary river systems that are difficult to monitor directly, and to describe river system functions and responses to different inputs and environmental factors4,5. However, LSHMs, especially at national, continental or global levels, face considerable challenges when applied at local scales, that is, at locations within the river system that are critical for water management and decision-making6,7. One of the primary issues is the inherent uncertainty and error in model setup and parameter identification, leading to poor performance and an incomplete or even misinformed understanding of the fluxes8,9. Strong hydro-climatic gradients across the large domain, driven by varying climate conditions, topography, and anthropogenic influences like irrigation and reservoir regulation, introduce additional challenges. Moreover, the lack of sufficient gauging in river systems, particularly in remote areas, further complicates LSHM setups and parameterisations, which traditionally depend on long streamflow time series10,11. Additionally, the lack of a “perfect” meteorological dataset poses another barrier, with no global product accurately capturing the meteorological dynamics at the local scale, particularly for precipitation12,13,14. Other hydrometeorological fluxes, such as evapotranspiration, groundwater recharge and soil moisture, are critical for closing the water and energy balance of river systems, yet remain poorly quantified15,16.
These challenges collectively highlight the need for beyond state-of-the-art frameworks to enhance the regional applicability of LSHMs, especially in the context of varying environmental and climatic conditions.

Post-processing in hydrological and meteorological modelling has proven capable of enhancing local performance and process representation. Refining model outputs to better represent the observations improves model reliability and applicability for local decision-making17,18,19,20. Among the various techniques, statistical and machine learning (ML) methods have been increasingly recognized for their potential to tailor hydrological model outputs. Statistical methods (e.g., quantile mapping) are commonly employed to bias-adjust and downscale model outputs21,22. Meanwhile, ML-based methods (e.g., neural networks, decision trees, ensemble learning) are particularly adept at handling large and diverse datasets and extracting meaningful patterns, and have emerged as powerful tools for capturing complex, nonlinear relationships within data, allowing more advanced prediction capabilities17,23,24. Both statistical and ML-based approaches are capable of reducing uncertainties and increasing accuracy, thereby setting a pathway for more reliable local applications of hydrological models25.

The misuse of post-processing methods and their lack of explainability, for instance through overtrained parameterisation or black-box modelling, reduce interpretability and transparency. This obscures the underlying processes and can limit trust in the results in decision-making scenarios26,27,28. Conventional post-processing methods, which primarily depend on mathematical algorithms, frequently fail to account for the key influences of topography, soil type, vegetation, and regional climate patterns, all factors closely associated with the dynamics of river systems29,30,31. Understanding these physiographic characteristics in the post-processing context will not only ensure that post-processing techniques indeed improve model performance, but also provide insights into the underlying processes, indicating how the techniques bridge the gap between generic model outputs and the varied local conditions they aim to represent.

Here, we enhance the quality of streamflow simulations from an LSHM at the local scale across the pan-European domain, and improve the understanding of model enhancement to allow for more reliable applications for local decision-making, by answering the following scientific questions: (1) How do hybrid process-based and statistical/ML methods enhance local model performance across various streamflow characteristics? (2) How does the performance of different post-processing methods vary across Europe’s hydro-climatic gradient? and (3) What are the key drivers controlling the hybrid model performance enhancement across Europe? To address these questions, we establish a hybrid framework (Fig. 1a) for post-processing the outputs from the E-HYPE process-based LSHM across the entire European domain. This framework employs two statistical methods (Generalised Linear Model, GLM; Quantile Mapping, QM; Methods section) and two ML-based methods (Random Forest, RF; Long Short-Term Memory model, LSTM; Methods section), with comprehensive evaluation metrics and performance attribution, allowing a thorough assessment and process understanding. Our analysis, covering over 2000 gauging stations across a wide range of hydrological regimes (Fig. 1b), shows that the two ML methods yield higher improvements than the statistical methods, particularly in capturing extreme streamflow characteristics. However, no single method consistently outperforms the others across the entire domain; rather, a spatial complementarity emerges, which is primarily influenced by catchment characteristics, including hydrological regimes (represented by predefined hydrological clusters30,32, details in Methods, Supplementary Fig. 1 and Supplementary Table 1) and climate conditions, among the investigated potential drivers.
From an operational perspective linked to either early warning systems or climate services, this effort strongly enhances LSHM usability for streamflow predictions under local conditions and carries critical implications for water-dependent sectors (e.g., agriculture, hydropower and drinking water).

Fig. 1: Hybrid framework for post-processing process-based LSHM and data availability of the observations.

a Schematic of the hybrid framework, detailing the process-based model, post-processing, evaluation, and performance attribution steps. b Spatial distribution of the stations used in the study, annotated with the start year and duration of the observational data.

Results

Hybrid modelling improves representation of streamflow characteristics at local scale

We applied four post-processing methods, including both classical statistical and state-of-the-art ML methods, to correct streamflow simulations across the pan-European region. Our evaluation, focusing on total volume as well as high and low flow extremes, reveals that integrating any of these methods substantially improves the performance of the underlying process-based E-HYPE model (Fig. 2), while also revealing notable differences in their effectiveness. All methods perform almost equally across all performance groups with regard to total volume. The differences between the methods are more apparent for high and low extremes, where the ML methods outperform the statistical methods in all groups below 0.5 (below the fair performance group defined in Fig. 3, the same for other groups presented in italic), especially in the groups below 0 (very poor and unsatisfactory groups), where QM gives the lowest performance.

Fig. 2: Performance comparison of process-based (E-HYPE) and hybrid models (E-HYPE integrated with GLM, QM, RF and LSTM) in predicting streamflow total volume (SMAE), high extremes (NSE) and low extremes (logNSE).

The cumulative distribution of model performance is shown using the SMAE (a), NSE (b), and logNSE (c) metrics (see Methods). Perfect performance corresponds to 0 for SMAE and 1 for NSE and logNSE. The grey line represents E-HYPE, while colored lines with varying styles denote hybrid models with different post-processing methods. Performance improves as the lines approach the perfect value marker on the x-axis. The x-axis represents the metric values, and the y-axis indicates the proportion of stations with performance not exceeding the corresponding metric level. The inset plot provides a zoomed-in view of the most common range (highlighted on the x-axis) for clarity.

Fig. 3: Chord diagram showing performance transitions before and after post-processing.

The performance of the process-based (E-HYPE, left side of the chord) and hybrid models (E-HYPE integrated with GLM, QM, RF and LSTM, right side of the chord) in predicting streamflow total volume (SMAE; a), high extremes (NSE; b), and low extremes (logNSE; c) is presented. The diagram visualizes how stations transition across six performance groups. The width of each chord represents the proportion of stations shifting between performance groups, highlighting improvements or deteriorations due to post-processing. The proportions of stations that experienced performance jumps (i.e., from fair to very good, and from poor to good/very good) are displayed below each chord diagram.

We next investigate how stations move between performance groups before and after post-processing, and identify stations where the highest and lowest improvements are achieved (Fig. 3). Notably, our analysis reveals that the two ML methods not only increase the number of stations achieving very good and good performance but also yield larger improvement jumps across performance groups compared to statistical methods. At some stations, LSTM and RF are particularly effective, lifting performance from the initial fair group to the very good group, whereas the statistical approaches mainly enhance performance within the fair-to-good range. This indicates that ML methods can compensate for model structural errors (e.g., due to anthropogenic interventions) which are challenging to represent, while the statistical methods mainly account for uncertainties in forcing inputs and model parameters33,34.

Overall, RF performs similarly to LSTM, with only small differences in the very good and good groups with regard to high and low extremes. LSTM is designed to handle sequential data and complex, nonlinear relationships, making the model adept at capturing temporal dependencies. This is crucial in hydrological modelling, where past conditions significantly influence future events. RF, on the other hand, is more suited to capturing complex, nonlinear relationships between features without assuming temporal dependencies, which could explain the differences between the two ML methods. In contrast, GLM does not account for such sequential dependencies, and its linear assumptions do not capture nonlinear dynamics effectively. QM can overall improve the total volume, yet the method can lead to performance deterioration for extremes in certain catchments (Fig. 3). Notably, for the low streamflow extremes, the performance at approximately 2% of stations deteriorates and ends up in the unsatisfactory category, which leads to an “unexpected” expansion of the unsatisfactory group after QM post-processing. Although QM has been widely applied to hydro-meteorological time series, the method mainly adjusts the statistical distribution of the data and consequently the volume. QM is not sensitive to temporal dynamics, which explains the occasional deterioration in performance for extremes, which are time sensitive.

No single best hybrid method: spatial complementarity of post-processing potential

Building on the overall improvement achieved by the hybrid modelling framework, we now assess its spatial effectiveness, examining how different post-processing methods perform across regions and whether a universally applicable method exists. The added value (skill) achieved by the post-processing over the process-based LSHM is provided for each station (Fig. 4b). A consistent pattern emerges across all hybrid methods, with post-processing achieving higher skill in central, southern and eastern Europe, where raw E-HYPE performance for high streamflow extremes is considered at least poor, than in the other regions (Fig. 4a). Similar patterns with high skill values for both total volume and low extremes further confirm the overall capability of the hybrid modelling framework; however, spatial variations in skill across the post-processing methods are also evident. For instance, in the United Kingdom, only a small improvement is achieved by the two statistical post-processing methods (Fig. 4b), while the two ML methods result in considerable skill, especially LSTM. This can be attributed to LSTM’s superior capability in detecting complex and nonlinear relationships within the dataset (e.g., driven by the chalk streams and the river-aquifer interactions), which is less pronounced in the QM method.

Fig. 4: Spatial distribution of E-HYPE performance, post-processing skill, and best-performing models across different streamflow characteristics.

a Raw performance of the process-based E-HYPE model. b Skill improvement from each post-processing method, with colors ranging from yellow to blue indicating performance improvements, while grey represents no improvement after post-processing. c Best-performing post-processing method at each station and the proportion of stations where each method achieves the highest skill. Colors represent different models: process-based E-HYPE (grey), and hybrid using GLM (green), QM (purple), RF (orange) and LSTM (pink).

We next identify the best-performing model based on the highest skill achieved at each station (Fig. 4c), which reveals the spatial complementarity of the methods. LSTM is the best-performing method at over 50% of the stations in terms of high streamflow extremes, and at about 20% for total volume and low streamflow extremes, with these river systems mainly located in central and western Europe. RF outperforms the other methods at approximately 20% of the stations, making it the second most effective post-processing method overall, especially in northern Europe and along the Mediterranean coastlines. Both statistical methods (GLM and QM) are superior at various stations within the domain, with performance varying according to the evaluation metric. QM excels particularly in handling low extremes in the region west of the Ural Mountains in the Russian Federation, where the streamflow is mainly snow dominated. Despite these performance improvements, in a few river systems no hybrid method adds value. These stations are spread across the European domain, and their river systems are characterised by a small upstream area (mostly less than 500 km²) and a quick response to rainfall input, based on analysis of their hydrological regime (Supplementary Fig. 2 and Supplementary Table 1), even though baseflow is a strong contributor in some of them; hence, small changes in the streamflow dynamics can affect the model performance.

Overall, the analysis suggests that there is no universally superior hybrid model, with each incorporated post-processing method presenting varying degrees of skill across different spatial locations and under different streamflow properties (total volume and extremes). This could be addressed with a model averaging approach, for instance, based on Bayesian concepts35,36 and/or copula-based frameworks37,38, which allows multiple models to be integrated, drawing on the advantages of each model and compensating for their individual limitations. The spatial variability of the best-performing models also highlights the importance of diagnostically selecting appropriate post-processing methods, accounting for specific local characteristics and particular signatures of river system behaviour. This motivates explainable hybrid frameworks that investigate the key factors behind post-processing improvements.

Hydrological regime as a key driver to model performance enhancement

We next introduce different potentially key factors with regard to climatology, physiography, hydrological similarity and anthropogenic impact, and filter them by removing interdependency (Methods; Fig. 5a). Hydrological similarity has been widely considered for model parameterisation and regionalisation, yet its impact in hybrid modelling is not sufficiently explored. Here we use predefined clusters of hydrologically similar regimes (Supplementary Table 1 and Supplementary Fig. 1) across Europe based on a set of hydrological signatures30. The Classification and Regression Tree (CART) method provides the feature importance each of these factors has on model performance, including both the process-based and hybrid models. The overall importance is further summarised by the comprehensive ranking index (RI) across the models (Methods).

Fig. 5: Drivers influencing model performance enhancement.

a Potential drivers included and filtered by interdependency (see Table 2 for acronyms). b–d Feature importance from CART analysis for the process-based and hybrid models for the total volume (SMAE; b), high extremes (NSE; c), and low extremes (logNSE; d). An example of how skill changes as a function of its influencing drivers is also presented (c), showing the skill of E-HYPE-LSTM for high streamflow extremes.

The same dominant factors are identified for total volume (SMAE; Fig. 5b) and high streamflow extremes (NSE; Fig. 5c): hydrological similarity, represented by the hydrological clusters, and climatic conditions, represented by mean precipitation and mean temperature, as shown by their higher feature importance and ranking index (Cluster, Prec and Temp; Fig. 5). In addition, we show how skill changes as a function of its influencing drivers, using as an example the skill of LSTM for high streamflow extremes (Fig. 5c). LSTM enhances model performance, with higher skill in drier and warmer conditions; the skill increases with increasing mean temperature and decreases with increasing mean precipitation. Moreover, the degree of improvement varies across the hydrological clusters, as indicated by the differences in the distribution shape and median values of the skill. Similar patterns are observed for the other post-processing methods and evaluation metrics (Supplementary Fig. 3). For low streamflow extremes (logNSE; Fig. 5d), different dominant factors arise, including the hydrological cluster, elevation, and dryness index. In particular, the hydrological cluster ranks among the most influential drivers, reflecting that low streamflows are much less influenced by precipitation and are instead strongly correlated with the river systems’ memory, which is well represented by the hydrological signatures of the clusters30,39.

The observed patterns between model skill and key driving factors remain consistent across different post-processing methods (Fig. 5 and Supplementary Fig. 3), suggesting that the skill improvements are governed by these factors rather than by the choice of post-processing method alone. While different methods may vary in their capacity to enhance model performance, their responses to underlying hydrological and climatic controls are similar, highlighting the importance of considering these factors when selecting or explaining results from post-processing methods. This conclusion also offers insights into future frameworks focusing on optimising hybrid hydrological model performance in both gauged and ungauged conditions. Similar to parameter regionalisation in hydrological modelling40, the strong influence of local basin characteristics revealed here suggests that post-processing methods may also be effectively adapted across different river systems by considering their hydro-climatic similarity.

Discussion

Here we established a strong connection between hydrological regimes and the effectiveness of model enhancement through post-processing. This extends the current knowledge about the influence of hydrological similarity in streamflow simulation and regionalisation41,42 into a post-processing context31, where the correction of process-based LSHM outputs at the local scale, whether analysed in terms of total volume, high or low streamflow extremes, is influenced by the basins’ hydrological characteristics. Importantly, this connection persists regardless of the post-processing method employed. This new insight therefore provides support for the regionalisation of the post-processors, which is important for the broader application of hydrological models29,30,43. A way forward would be to establish a multi-basin post-processing approach that builds on the current method by integrating data from multiple river systems to train a single, regionalized model. While the current framework trains individual models at each gauged basin, the multi-basin approach can incorporate basin characteristics, such as climatic conditions, physiographic attributes, and hydrological regimes, as static input features, enabling the post-processing method to capture shared hydrological behaviours across basins and allowing it to generalize beyond the training locations. Consequently, basins under ungauged or data-sparse conditions can benefit from the learned patterns in hydrologically similar gauged systems. Furthermore, this approach lays a solid scientific foundation for expanding the insights gained from pilot studies to broader applications, particularly in vulnerable areas with limited resources. This aligns with the vision that using large-scale hydrological services to identify solutions at the local scale is essential for maximising global impact44,45.

Another key outcome of our study is the potential to produce more accurate forecasts in an operational setting. Hydrological forecast predictability relies on two primary components: the initialization of hydrological conditions at the onset of a forecast, and the hydrological model forcing with (bias-adjusted) meteorological forecasts30,43. However, the biases in these two components are inherited from the reference simulation and the quality of the meteorological forecasts which deteriorates as a function of lead time46,47. These limitations can both be addressed by training the post-processing models using reforecasts and the corresponding observations for each lead time48, and consequently providing lead time-specific correction factors. These trained models are then applied to new forecasts, improving accuracy across the entire forecast horizon. Overall, such investigations support global scientific and operational efforts to ensure equitable access to reliable hydrological data, information and services, including the HELPING scientific decade launched by the International Association of Hydrological Sciences49 and the Early Warnings for All (EW4ALL) initiative launched by the United Nations50. Both global community efforts aim to protect everyone from hydrological hazards (floods and droughts), and this achievement relies on accurate hydrological forecasts as a foundation for action, calling for the operationalization of model enhancement efforts.

Methods

Hybrid modelling framework and data

Our hybrid hydrological modelling framework (Fig. 1) combines the output from the process-based continental E-HYPE model with post-processing methods and adjusts the simulated streamflow to better align with local observations by capturing complex patterns or discrepancies that the process-based LSHM alone may not account for. This hybrid approach leverages the strengths of both process-based modelling (understanding natural processes) and data-driven techniques (capturing complex, site-specific patterns), with the aim to result in improved model performance.

The hybrid models are benchmarked against the E-HYPE hydrological model which is driven by meteorological forcing, i.e., temperature and precipitation, to produce streamflow simulations across the pan-European domain. E-HYPE is a semi-distributed process-based LSHM of water quantity and quality based on the HYPE (The HYdrological Predictions for the Environment) model structure. The pan-European setup simulates components of the water cycle at daily time steps, i.e., snow accumulation and melting, evapotranspiration, soil moisture, streamflow generation, groundwater recharge, and routing through rivers and lakes. The historical model performance in terms of streamflow reaches a median Nash-Sutcliffe Efficiency (NSE) of 0.53 over more than 500 streamflow stations across Europe51,52.

Simulated streamflow (m³ s⁻¹) was obtained for the period 1961–2023 by forcing E-HYPE with the HydroGFD v3.2 meteorological reanalysis data53. Streamflow observations were collected in the pan-European domain from various data sources, including Global Runoff Data Centre, European Water Archive, and national authorities, reaching 2072 stations51. To ensure the sufficiency of training samples, we selected only the stations with at least 10 years of observations (Fig. 1). The final dataset shows a comprehensive spatial coverage of the stations across the entire European domain, with a higher concentration in central Europe and relatively fewer stations in the southern (e.g., Spain) and the eastern part of the continent.

Post-processing method description

In total four methods were used to post-process the E-HYPE LSHM output to better align with local observations at each individual station; two statistical (Generalised Linear Model and Quantile Mapping) and two ML-based (Random Forest and Long Short-Term Memory), which are briefly described below. The models are implemented using the R packages randomForest and qmap, along with the Python package TensorFlow. Details on data processing and model training are provided in the code availability section, ensuring reproducibility and transparency.

Generalised Linear Model (GLM): a statistical technique that extends linear regression to allow for non-normal distributions of error terms54. It allows the inclusion of different types of predictor variables and the modelling of response variables that follow distributions from the exponential family, of which the Gaussian distribution of ordinary linear regression is a special case, to provide a flexible framework for understanding the relationships between variables.

Quantile Mapping (QM): a statistical technique used for bias correction by adjusting the distribution of the variable of interest (here simulated streamflow) to match the target variable (here observed streamflow) distribution, in order to correct systematic biases in model outputs55. The tricubic spline method is adopted here to allow for a smooth adjustment of the cumulative distribution functions and to reduce the biases in the tails of the distribution.
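
As an illustration of the principle only, the following Python sketch maps simulated values onto the observed distribution using empirical quantiles. It is not the implementation used in the study: it assumes linear interpolation between quantiles instead of a spline fit, and the function name `fit_quantile_map` and the number of quantiles are hypothetical choices.

```python
import numpy as np

def fit_quantile_map(sim_train, obs_train, n_quantiles=101):
    """Return a function that maps simulated flow onto the observed distribution.

    Assumption: empirical quantiles joined by linear interpolation
    (the study itself reports a spline-based adjustment).
    """
    probs = np.linspace(0.0, 1.0, n_quantiles)
    sim_q = np.quantile(sim_train, probs)  # quantiles of simulated flow
    obs_q = np.quantile(obs_train, probs)  # quantiles of observed flow

    def transform(sim_new):
        # Locate each value on the simulated quantile curve and read off the
        # observed value at the same non-exceedance probability.
        # np.interp clamps values outside the training range to the end points.
        return np.interp(sim_new, sim_q, obs_q)

    return transform
```

Because the mapping depends only on the value of each simulated flow and not on when it occurs, a sketch like this makes clear why the method adjusts the distribution without correcting timing errors.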

Random Forest (RF): a supervised, non-parametric method, where an ensemble of uncorrelated trees yields predictions for classification or regression. Multiple trees are built based on bootstrap samples from the training data. After all the trees are grown, the forest produces the final result by averaging the predictions from the trees56. The same model configuration, regarding the maximum number of nodes (10) and the minimum node size, is maintained across all stations to ensure comparability throughout the study domain, allowing the analysis of potential influencing factors.

Long Short-Term Memory (LSTM): a model for time series, which is capable of learning long-term dependencies57. For post-processing purposes, previous research has shown that the lookback length can be reduced, as model performance remains consistent across diverse temporal scales58. The lookback length for the LSTM in the hybrid framework is set to 3 days, as the seasonal dynamics are already represented in the process-based model. Initial experiments with lookback lengths between 1 and 215 days confirmed that this setting captures the temporal dependencies present in the streamflow data. The model is structured with three layers containing different numbers of cells (i.e., 100-50-20), allowing it to process and retain information effectively over extended periods.
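
To make the lookback concept concrete, the sketch below shows one way the input samples for a 3-day lookback could be assembled with NumPy. The function name `make_lookback_windows` is hypothetical, and the alignment (each window ends on the day being corrected) is an assumption, as the text does not specify it.

```python
import numpy as np

def make_lookback_windows(sim, obs, lookback=3):
    """Build inputs of shape (samples, lookback, 1) and aligned targets.

    Assumption: each window contains the simulated flow on the target day
    and on the (lookback - 1) preceding days.
    """
    sim = np.asarray(sim, dtype=float)
    obs = np.asarray(obs, dtype=float)
    n_samples = len(sim) - lookback + 1
    # Stack overlapping windows, then add a trailing feature axis.
    x = np.stack([sim[i:i + lookback] for i in range(n_samples)])[..., np.newaxis]
    # Target: the observation on the last day of each window.
    y = obs[lookback - 1:]
    return x, y
```

The resulting three-dimensional array follows the (samples, time steps, features) layout commonly expected by recurrent layers.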

To prevent overfitting, a portion of the training set (10%) is reserved for validation, and training is monitored so that an early stopping criterion is triggered if the validation loss does not decrease over 10 consecutive steps. Normalisation is applied to the input data to scale the range of data points, allowing a smoother training process and more stable convergence. To address data imbalances, particularly concerning extreme values critical for hydrological services, a sample weight technique is implemented. This method assigns weights to samples, emphasising the accurate prediction of extreme events, which are often underrepresented in the dataset but are important for hydrological analyses and applications. Weights are calculated based on percentiles in the observations, where the 10th, 33rd, 66th and 90th percentiles divide the samples into five groups, representing low extremes, lower than normal, normal, higher than normal, and high extremes. Samples within each group share a total weight of 0.2. The root mean square error is used as the loss function for the LSTM model during optimisation.
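
The weighting scheme described above can be sketched as follows. This is a minimal Python illustration: the function name `percentile_sample_weights` is hypothetical, and the assignment of values lying exactly on a percentile boundary to the upper group is an assumption.

```python
import numpy as np

def percentile_sample_weights(obs, percentiles=(10, 33, 66, 90)):
    """Assign per-sample weights so that each percentile group sums to 0.2."""
    obs = np.asarray(obs, dtype=float)
    edges = np.percentile(obs, percentiles)
    # Group index 0..4: low extremes, lower than normal, normal,
    # higher than normal, high extremes.
    group = np.digitize(obs, edges)
    n_groups = len(percentiles) + 1
    counts = np.bincount(group, minlength=n_groups)
    # Each group shares the same total weight (1 / 5 = 0.2), so samples in
    # sparsely populated groups (the extremes) receive larger weights.
    return (1.0 / n_groups) / counts[group]
```

With this scheme, the 10% of samples in each extreme group carry the same total weight as the roughly 33% of samples in the normal group, so an individual extreme sample weighs about three times as much as an individual normal sample.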

The post-processing models take simulated streamflow from the E-HYPE hydrological model as input, with the target variable representing either the observed streamflow (Eq. 1) or the relative residual between observed and simulated values (Eq. 2). Both observed streamflow and relative residuals were tested as target variables across the methods. An exception is the QM method, which exclusively uses observed streamflow as the target.

$${{target}}_{{obs}}={y}_{{obs}}$$
(1)
$${{target}}_{{residual}}=({y}_{{obs}}-{y}_{{sim}})/({y}_{{sim}}+\varepsilon )$$
(2)

where \(\varepsilon\) is a small constant value introduced to prevent division by zero, particularly in scenarios of low streamflow, ensuring the target variable remains within a reasonable range. By setting the target thresholds to be no smaller than −1, this approach also effectively mitigates the common issue of generating negative streamflow values when using residuals as the target variable.
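
A short Python sketch of the residual target in Eq. 2 and its inversion is given below. The value of `EPS` and the function names are assumptions made for illustration; the text only states that the constant is small.

```python
import numpy as np

EPS = 1e-6  # assumed value; the source only specifies "a small constant"

def relative_residual(y_obs, y_sim, eps=EPS):
    """Target of Eq. 2: residual relative to the simulated streamflow."""
    return (y_obs - y_sim) / (y_sim + eps)

def reconstruct_streamflow(pred_residual, y_sim, eps=EPS):
    """Invert Eq. 2 to recover corrected streamflow from a predicted residual."""
    # Bounding the predicted residual at -1 means the corrected flow
    # cannot fall below -eps, i.e. it is effectively non-negative.
    r = np.maximum(pred_residual, -1.0)
    return y_sim + r * (y_sim + eps)
```

For non-negative observations the relative residual is always greater than −1, so the lower bound only affects model predictions and never alters a valid target.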

In the Results section, the target variable yielding the highest performance for each method is selected and presented (Supplementary Table 2; the calculation of the metrics is given in Table 1), ensuring that the analysis highlights the most effective implementation of each approach.

Table 1 The evaluation metrics used to quantify the potential model performance improvements for different characteristics of the streamflow time series

Each station is corrected independently, with separate model calibration for each location. For model training, the dataset was subsequently divided into training and testing periods, by applying an 80–20% data split. The model evaluation is conducted on the testing periods.

Overall, the assumptions for this experiment include using the identical model structure for all stations, e.g., the same number of layers, cells, and hyperparameters, without individual optimization for each station. Nevertheless, this generalised approach enables us to compare the methods and provide an overall assessment across the domain, which aligns well with the objective of this study.

Model evaluation

To evaluate the added value from post-processing, three evaluation metrics were used to assess the potential improvements with regard to errors in total volume, high and low streamflow extremes (Table 1), as represented by the Mean Absolute Error59 (MAE), NSE60 and its logarithmic form61 (logNSE), respectively. In particular, the Scaled Mean Absolute Error (SMAE) is applied to adjust MAE in relation to the average streamflow observed at each station, thus allowing the comparison of MAE values across stations that have varying streamflow magnitudes.
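
For orientation, the three metrics can be sketched in Python as below. The formulas follow the common textbook definitions and the description above (MAE scaled by mean observed flow); the exact forms in Table 1 may differ, and the small constant added before the logarithm is an assumption.

```python
import numpy as np

def smae(obs, sim):
    """Scaled MAE: MAE divided by the mean observed flow (0 is perfect)."""
    obs, sim = np.asarray(obs, dtype=float), np.asarray(sim, dtype=float)
    return np.mean(np.abs(sim - obs)) / np.mean(obs)

def nse(obs, sim):
    """Nash-Sutcliffe Efficiency (1 is perfect, 0 equals the mean benchmark)."""
    obs, sim = np.asarray(obs, dtype=float), np.asarray(sim, dtype=float)
    return 1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - np.mean(obs)) ** 2)

def log_nse(obs, sim, eps=1e-6):
    """NSE on log-transformed flows; eps (assumed) guards against log(0)."""
    obs, sim = np.asarray(obs, dtype=float), np.asarray(sim, dtype=float)
    return nse(np.log(obs + eps), np.log(sim + eps))
```

Squared errors make NSE sensitive to high flows, whereas the logarithmic transform shifts the emphasis to low flows, which is why the two are used for the high and low extremes, respectively.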

Improvement at each station is further denoted by calculating the skill, which quantifies the efficacy of post-processing methods (pp, Eq. 3) relative to raw E-HYPE simulations (ref, Eq. 3), with positive (negative) skill values indicating improvements (deterioration). A skill value approaching 1 signifies a greater enhancement in predictive performance, highlighting the effectiveness of the post-processing techniques in refining hydrological simulations. The skill (over the historical simulation period) is expressed as:

$${Skill}=\frac{{{Score}}_{{pp}}-{{Score}}_{{ref}}}{{{Score}}_{{perfect}}-{{Score}}_{{ref}}}$$
(3)
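A worked example of Eq. 3 is sketched below. The score values are illustrative only; the perfect scores (1 for NSE and logNSE, 0 for error measures such as SMAE) follow from the definitions of those metrics, and the function name `skill` is a hypothetical choice.

```python
def skill(score_pp, score_ref, score_perfect):
    """Skill of a post-processed score relative to the reference (Eq. 3)."""
    if score_perfect == score_ref:
        raise ValueError("reference is already perfect; skill is undefined")
    return (score_pp - score_ref) / (score_perfect - score_ref)


# NSE-type score: perfect value is 1, higher is better.
nse_skill = skill(score_pp=0.75, score_ref=0.5, score_perfect=1.0)

# Error-type score such as SMAE: perfect value is 0, lower is better.
smae_skill = skill(score_pp=0.2, score_ref=0.4, score_perfect=0.0)

# Post-processing that degrades the reference yields a negative skill.
degraded = skill(score_pp=0.25, score_ref=0.5, score_perfect=1.0)
```

Because the denominator carries the sign of the distance to the perfect score, the same expression gives positive skill for improvements whether the metric is to be maximised or minimised.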

The cumulative distribution plot (Fig. 2) presents the proportion of stations that fall below a given performance threshold for the three evaluation metrics. This allows for an inter-comparison between the different post-processing methods.

The chord diagram (Fig. 3) provides a detailed comparison by tracking the transitions of stations between performance groups before and after post-processing. By depicting the “flow” of stations from one group to another, this visualisation helps clarify the extent to which post-processing methods improve, degrade, or maintain performance across different stations. The groups are determined subjectively but still driven by expert knowledge from previous analyses62. However, we note that these are not universally applicable63,64 and are determined specifically for this study.

To further evaluate model performance across stations, we analyse the spatial distribution of skill by identifying the best-performing method at each station (Fig. 4). The results are visualised using a color-coded map, where each station is assigned a color based on the method that yields the highest performance. Additionally, we calculate the proportion of stations where each method performs best (Fig. 4). These ratios are presented in a bar plot alongside the map, offering a comprehensive view of how different methods perform across the entire study area. This combined visualisation helps highlight spatial patterns in model performance and provides insights into the effectiveness of different post-processing methods.

Attributing hybrid model enhancement to hydrological processes

The CARTs method is used to identify the most important drivers of model performance and to explain the complex, non-linear relationships between them65. The algorithm splits the data into subsets based on the values of the input features that result in the largest reduction in heterogeneity of the target variable (i.e., model performance). This process continues until further splitting does not significantly improve the algorithm’s accuracy or until predefined stopping criteria are met, such as a minimum number of leaf nodes. To avoid overfitting, the technique of pruning is used by removing branches that have little to no contribution to the algorithm’s predictive power, aiming to find the optimal balance between the tree’s complexity and its accuracy.
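The splitting step can be illustrated with a single-node sketch. It assumes a regression setting in which heterogeneity is measured as the sum of squared deviations from the node mean; the study's actual impurity measure, stopping criteria and pruning settings are not specified here, and `best_split` is a hypothetical helper that searches one split only, without recursion or pruning.

```python
def variance_sum(values):
    """Sum of squared deviations from the mean (node heterogeneity)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(rows, target):
    """Return (feature_index, threshold, reduction) for the best single split.

    rows: list of feature vectors; target: list of performance scores.
    """
    parent = variance_sum(target)
    best = (None, None, 0.0)
    for j in range(len(rows[0])):
        levels = sorted({r[j] for r in rows})
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(levels, levels[1:]):
            thr = (lo + hi) / 2.0
            left = [t for r, t in zip(rows, target) if r[j] <= thr]
            right = [t for r, t in zip(rows, target) if r[j] > thr]
            reduction = parent - variance_sum(left) - variance_sum(right)
            if reduction > best[2]:
                best = (j, thr, reduction)
    return best


# Illustrative values only: two drivers, four stations.
rows = [[1.0, 10.0], [2.0, 30.0], [3.0, 20.0], [4.0, 40.0]]
performance = [0.1, 0.1, 0.9, 0.9]
feature, threshold, reduction = best_split(rows, performance)
```

Here the first driver separates low and high performance perfectly at a threshold of 2.5, so it is chosen over the second driver. A full tree would apply the same search recursively to each resulting subset.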

The drivers’ importance is calculated by summing changes in the probability of splitting on every driver and dividing the sum by the number of branch nodes30. This importance score is then standardised to span 0 to 100 for comparability. The association between hydrological model performance and potential drivers is investigated by calculating the feature importance of each potential driver (Table 2). Because some drivers are highly interdependent and could introduce uncertainty into the CART analysis, drivers with a Pearson correlation coefficient greater than 0.6 are removed, leaving 8 potential drivers for the CART analysis.
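The correlation screening can be sketched as follows. Several details are assumptions made for illustration: the threshold is applied to the absolute Pearson coefficient, drivers are screened greedily in the order given (the text does not say which member of a correlated pair is dropped), and the driver names and values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)  # undefined for constant series


def drop_correlated(drivers, threshold=0.6):
    """Keep drivers in order, skipping any that correlate with one already kept."""
    kept = {}
    for name, values in drivers.items():
        if all(abs(pearson(values, v)) <= threshold for v in kept.values()):
            kept[name] = values
    return list(kept)


# Hypothetical drivers at five stations, for illustration only.
drivers = {
    "elevation": [1.0, 2.0, 3.0, 4.0, 5.0],
    "slope": [2.0, 4.0, 6.0, 8.0, 11.0],          # nearly collinear with elevation
    "precipitation": [3.0, 1.0, 4.0, 1.0, 5.0],
}
retained = drop_correlated(drivers)
```

With a greedy screen the outcome depends on the order of the drivers, so in practice the order would need to reflect which variables are considered most interpretable.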

Table 2 Drivers considered to influence model performance, including topography, climate, anthropogenic impact and hydrological regimes

Following the concept of feature importance, a comprehensive ranking index66 is used to enable the evaluation and comparison of potential drivers’ influence across the process-based and hybrid models. The ranking index (RI) is mathematically expressed as:

$${RI}=1-\frac{1}{{nm}}{\sum }_{i=1}^{n}{{rank}}_{i}$$
(4)

where m represents the total number of potential drivers, which here is 8, and n denotes the number of models, here set at 5 (the process-based model and four hybrid models). \({{rank}}_{i}\) indicates the rank assigned to a given driver by model i, with 1 being the most critical and 8 the least. Thus, a higher RI value indicates a driver that is ranked as more influential across the models, with a maximum of \(1-1/m=0.875\) for a driver ranked first by every model.
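Eq. 4 can be checked with a short worked example. The rank lists below are illustrative and not results from the study; `ranking_index` is a hypothetical function name.

```python
def ranking_index(ranks, n_drivers):
    """Ranking index of one driver (Eq. 4).

    ranks: the driver's rank in each model (1 = most critical).
    n_drivers: total number of drivers, m.
    """
    n_models = len(ranks)
    return 1.0 - sum(ranks) / (n_models * n_drivers)


M = 8  # number of potential drivers

always_first = ranking_index([1, 1, 1, 1, 1], M)   # 1 - 5/40
always_last = ranking_index([8, 8, 8, 8, 8], M)    # 1 - 40/40
mixed = ranking_index([1, 2, 1, 3, 2], M)          # 1 - 9/40
```

With m = 8 and n = 5, RI therefore ranges from 0 for a driver ranked last by every model to 0.875 for a driver ranked first by every model.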

With RI, the analysis identifies the three most influential drivers across both the process-based model and the different hybrid models. This approach can reveal the underlying drivers of the model performance and provide information on where post-processing methods can significantly refine the model’s accuracy.

To assess feature importance, we present results as heatmaps (Fig. 5), where color intensity represents feature importance on a scale from 0 to 100%. The corresponding rankings are visualised as points in the marginal plots, providing a clear comparison of relative feature contributions across the models.

In analysing model skill as a function of key driving factors (Fig. 5 and Supplementary Fig. 3), we illustrate trends using a locally estimated scatterplot smoothing (LOESS) curve. This approach captures the general pattern of how model skill varies with numerical driving factors (e.g., temperature or precipitation), providing insight into the underlying relationships. For categorical driving factors, such as hydrological clusters, model skill is decomposed by cluster group and visualised using violin plots. These plots illustrate the distribution of skill within each cluster, highlighting variations across hydrological regimes and emphasising the role of river system characteristics in shaping model performance.
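The idea behind the LOESS curve can be sketched with a small local linear smoother. This is a simplified illustration that assumes tricube weights, a local degree of one and a span of 60% of the points; the span and degree used for the figures are not stated, and a statistical library implementation would normally be used in practice.

```python
def loess(x, y, frac=0.6):
    """Locally weighted linear fit evaluated at each x (tricube weights)."""
    n = len(x)
    k = max(2, int(round(frac * n)))  # number of neighbours per local fit
    fitted = []
    for x0 in x:
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[k - 1] or 1e-12  # bandwidth: distance to the k-th nearest point
        w = [(1 - min(abs(xi - x0) / h, 1.0) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


# Illustrative values only: skill that varies linearly with a driver.
driver = [float(i) for i in range(10)]
skill_values = [2.0 * d + 1.0 for d in driver]
smooth = loess(driver, skill_values)
```

Each fitted value comes from a weighted straight-line fit to nearby points only, so the curve follows gradual changes in skill along the driver without imposing a single global functional form.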

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.