Introduction

Heatwaves (HWs) are prolonged periods of extreme temperature, and lead to a wide range of impacts, including the collapse of agricultural yields1, drastic increases in energy usage2, impacts on human health, and increased mortality3,4. Europe has experienced devastating heatwaves in the past decades, including, but not limited to, deadly events in 2003 and 20105 and more recently in 20226,7. Climate projections suggest further intensification of HWs in the coming decades8, which will likely lead to an increase in deaths attributed to extreme heat unless mitigation measures are implemented9. Consequently, the ability to predict extreme summer heat several months in advance provides an opportunity for the agricultural industry and national health services to implement mitigation measures3,10.

Seasonal forecasting, the prediction of seasonal climate conditions several months in advance, has the potential to provide society with time and necessary information to take meaningful action prior to potentially damaging climate events. Such information already serves as the foundation for climate services in various sectors, such as the early warning of droughts for agriculture11,12 and snow cover for tourism13. The generation and maintenance of seasonal forecasts, however, is an enormous computational undertaking, with many centres around the world producing dozens of ensemble members of climate models, each of which couples several components of the climate system, at horizontal resolutions typically less than 1 geographical degree. Moreover, predicting heatwaves beyond the deterministic limit (roughly 10–15 days) is challenging14,15,16,17. The state-of-the-art operational seasonal forecasting systems from the Copernicus Climate Change Service (C3S) have demonstrated reliable forecast skill over large parts of Europe when predicting, up to 3 months in advance, seasonal heatwave indices18,19. However, skill gaps remain; for example, over northern Europe, which is less influenced by more predictable mid-latitude variability20.

The increasing use of Machine Learning (ML) in weather and climate science provides a means of reducing the resources required to make accurate forecasts without compromising on skill, as demonstrated in the field of weather forecasting21. Dynamical forecasting systems, named after their ability to solve dynamical and thermodynamic equations numerically, are now matched in skill or even outperformed by purely data-driven approaches using a range of machine or deep learning architectures22,23,24,25,26. These data-driven approaches leverage techniques designed to identify relationships between multiple variables from large datasets of observations or model simulations. The current frameworks are being adapted for subseasonal27 and seasonal timescales28, but any data-driven approach to seasonal forecasting requires considerably more data than is available in the observational record29.

Prior to the widespread use of ML, statistical seasonal forecasting showed that the input of known predictors, such as soil moisture for European heatwaves, into simple statistical models could provide skill for the prediction of certain climate variables on seasonal timescales30,31. Previously, such methods relied on the selection of known drivers and thus were limited by the current scientific understanding of heatwave dynamics. Nowadays, more sophisticated ML models are used and are able to select from a set of potential predictors, as widely demonstrated on the subseasonal timescale32,33,34,35,36. ML-based seasonal forecasting techniques have shown that spatially and temporally distant predictors of seasonal climate can be identified and used to make accurate predictions37,38,39,40,41. However, studies either do not include comparisons to operational systems or use a restricted number of pre-selected and known predictors in their feature selection. Moreover, there is currently no purely data-driven seasonal forecast approach for Europe, nor one that focuses explicitly on temperature extremes.

This study describes a data-driven seasonal forecast system that is computationally inexpensive, provides scientifically relevant information on HW predictors, and is shown to match, and in some instances outperform, the state-of-the-art of operational dynamical seasonal forecasting. This work merges efforts in previous statistical and ML-based approaches with training based on a multi-millennial paleo-simulation dataset. Crucially, it employs a feature selection method that is free to identify the optimal predictor variables and the time-lags over which they contribute to skill. This framework provides an index-specific forecast and driver detection for summer heatwave propensity at any location.

Results

Feature selection of heatwave predictors

This study begins with a feature selection framework, designed to identify the combination of predictors that provides the optimal seasonal forecast skill of European summer heatwave indicators (Supplementary Fig. 1). The chosen potential predictors describe atmospheric, land and ocean conditions which are known to influence the European climate or extremes. First, a range of dimension-reduced predictors is defined using an enhanced version of k-means clustering (which employs a weight of 5% for distances on the geoid) applied to variables known to impact European summer climate (e.g. soil moisture, sea ice concentration; Supplementary Fig. 2; Supplementary Table 1; “Methods” section). The target in this study is the number of days in which the temperature exceeds the climatological 90th percentile between May and July (MJJ NDQ90). To identify the most influential predictors, a multi-method ensemble optimisation algorithm42 is employed to select the variables and the corresponding range of time-lags that provide optimal forecast skill of the target. The multi-method ensemble43 tests various combinations of predictors, and aims to reduce the forecast error; the optimisation algorithm combines various subsets of variables and time-lags into a linear regression model to predict NDQ90 (Fig. 1). The framework benefits from a paleoclimate simulation of the years 0–1850 with a coupled atmosphere-ocean model (hereafter “past2k”), which provides long-term simulated data of predictors and HWs in a stationary climate. The optimisation-based feature selection is performed using a training period of years 0–1600, and a test period of 1601–1850. Finally, the optimal predictors are used to train ML-based prediction models which, when applied to the modern 1993–2016 period with ERA5 predictors, provide fully data-driven seasonal forecasts of heatwave occurrence. The optimisation is performed individually for each grid point; see Supplementary Figs. 3 and 4 for an example of the optimal predictors selected and the corresponding seasonal forecasts of the test period.

Fig. 1: Optimisation of seasonal forecast skill (N-RMSE) of seasonal European heatwave indicators (NDQ90) using data from a paleoclimate simulation.

The training period and test period are 0–1600 and 1601–1850, respectively. Two examples of optimisation are shown: 43.13° E, 58.76° N (poor, upper cluster of solutions) and 24.38° E, 41.97° N (good, lower cluster). Symbols and colours represent stages of optimisation (latest stages in red), with the “optimal” solution (black circle) corresponding to the solution with the lowest training N-RMSE. The diagonal (dashed grey) represents a perfect fit between training and test scores, while the vertical and horizontal lines at N-RMSE = 1 indicate where the error is equivalent to interannual variability (Supplementary Fig. 5).

The European-scale view of optimised forecast skill (root-mean-squared-error normalised by the interannual variability of the target, N-RMSE; “Methods” section; Fig. 2) demonstrates a zone of low skill (N-RMSE > 1) stretching across northern central Europe and Scandinavia, while the highest skill is found over central Europe, and the Mediterranean and Black Sea basins. Two grid points representative of either relatively “poor” or “good” regions of forecast skill (Fig. 1) show that the degree of possible improvement, relative to the initial first guess, depends on location. In the “poor” example, optimisation leads to an improvement of 0.18 in N-RMSE, while in the “good” example, the improvement is 0.29. Although in both examples the optimal training N-RMSE obtained is below 1 (0.94 and 0.78), indicating that the error is within the range of interannual variability, the same is not true for the test period in the “poor” example. The European pattern of data-driven skill in the model world (Fig. 2) resembles the skill of dynamical seasonal forecast systems in predicting temperature44,45 and its extremes18. By applying the data-driven approach first to the model world, we isolate where the use of reduced-dimensionality predictors provides insufficient predictability. If the framework cannot recreate the paleoclimate model training data, then it is highly unlikely to perform well on the test data or in real-world forecasting.
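The skill score discussed above can be illustrated with a minimal sketch. It assumes that "interannual variability" means the population standard deviation of the target series; the function name `n_rmse` and the toy values are hypothetical and not taken from the study.

```python
import math

def n_rmse(forecast, target):
    """RMSE of the forecast divided by the standard deviation of the target."""
    if len(forecast) != len(target) or len(target) < 2:
        raise ValueError("series must have equal length of at least 2")
    n = len(target)
    rmse = math.sqrt(sum((f - t) ** 2 for f, t in zip(forecast, target)) / n)
    mean_t = sum(target) / n
    # Assumption: population standard deviation as the interannual variability.
    std_t = math.sqrt(sum((t - mean_t) ** 2 for t in target) / n)
    return rmse / std_t

# Toy NDQ90 series (days per season); values are illustrative only.
target = [4.0, 10.0, 7.0, 13.0, 6.0]
climatology = [sum(target) / len(target)] * len(target)
print(n_rmse(climatology, target))  # a constant climatological forecast scores 1
print(n_rmse(target, target))       # a perfect forecast scores 0
```

Under this definition, a forecast that always issues the climatological mean scores 1, which is why values above 1 are read as errors larger than the interannual variability.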

Fig. 2: Seasonal forecasts of paleo-climate heatwaves across Europe.

Optimised seasonal forecast skill (N-RMSE) of seasonal European heatwave indicators (MJJ NDQ90) using data from a paleoclimate simulation for the training period (a, 0–1600) and optimisation test period (b, 1601–1850). Each point represents the optimal forecast skill based on the point-specific optimisation of predictors (e.g. Fig. 1). The locations corresponding to the good and poor skill examples from Fig. 1 are represented by a square and a circle, respectively.

Collecting the optimal predictors from all individual points across Europe (see Supplementary Fig. 3 for an example) provides an overview of the model-world HW drivers at a regional level (Fig. 3). The most commonly selected variables across the domain are the European soil moisture, temperature and geopotential height (z500) clusters. The identified key role of these local predictors agrees with studies of many HWs that have occurred in Europe15,30. Commonly selected predictors that represent more distant precursors include sea surface temperature (SST) over the equatorial Pacific and outgoing longwave radiation (OLR) over the tropical Atlantic. While the former represents the phase of the El Niño Southern Oscillation, which is known to play a role in European climate extremes46, the contribution of the OLR over the tropical Atlantic is not obvious. The feature selection also allows us to study the time-lags in which the variables play a role. The most frequently selected time-lags occur on average around six weeks prior to initialisation (i.e. mid-March; Fig. 3, Supplementary Fig. 6). However, the key temporal lag depends on the variables. Temperature and z500 clusters are selected more frequently in the few weeks prior to initialisation and decay gradually with longer lag, while soil moisture and sea ice selection peak between 7 and 8 weeks prior to initialisation. The number of predictors selected before February is negligible. The most commonly selected short-term European predictors are unsurprisingly selected within or near to the area they represent (Supplementary Fig. 7); TMXEur-1 at a 1-week time-lag is selected across central Europe, while SSTMed-3 at a 5-week time-lag is largely selected around the Black Sea. For the geographically distant predictors, the tropical Atlantic OLR (OLRTro-2 at a 4-week lag) is chosen as a predictor for areas across the Barents Sea and Scandinavia, as well as parts of the southern Mediterranean Sea, while the selection of the tropical Pacific SST is more sporadic. Studying these points in a SHAP (SHapley Additive exPlanations47; “Methods” section) analysis, used to quantify their relative contribution to forecasts, confirms that the local predictors carry more predictive value, while the more distant predictors contribute more weakly (Supplementary Fig. 7). While distant teleconnection-based predictors would be expected to have less direct impact on HW occurrence and therefore a relatively low predictive power according to the SHAP analysis, we cannot rule out that such drivers are present only in the model world. However, their widespread selection, in particular that of OLR, suggests they are not spurious results and merit further analysis in future studies.

Fig. 3: Feature selection of predictors for seasonal heatwave indicators across Europe.

The matrix displays the percentage of grid points with respect to the entire European domain in which the cluster and time-lag appear in the optimal solutions. Initialisation is on May 1st. Variable labels: mean sea level pressure (SLP), geopotential height at 500 hPa (z500), soil moisture (SM), daily maximum 2 m temperature (TMX), sea surface temperature (SST), outgoing longwave radiation (OLR), and sea ice concentration (SIC). Cluster maps are shown in Supplementary Fig. 2.

Data-driven seasonal forecasts of European heatwaves

Using the selected optimised predictors for each grid point, the data-driven forecast system is adjusted to be trained on the entire past2k simulation period (0–1850) and then tested for the period 1993–2016 using predictors from ERA5. Whereas the optimisation-based feature selection training and testing is performed with linear regression to avoid large computational cost, the real-world forecasts use ML models in an attempt to boost the skill achieved by the same predictors (e.g. Random Forest; Supplementary Fig. 8). In the “Methods” section, skill for each ML model used is reported; here, the skill graphs reflect the best performing model for each grid point. Data-driven re-forecasts display significant correlation skill scores over 56% of the European domain, including central Europe and the Mediterranean Basin. The skill patterns in the data-driven forecasts (Fig. 4) match those of the optimisation test period (Fig. 2), indicating the successful transfer of learning from the paleoclimate.

Fig. 4: Data-driven seasonal forecast skill of European summer heatwaves.

Correlation skill score of seasonal European heatwave indicators (MJJ NDQ90) in the data-driven (a) and C3S multi-model ensemble (b) forecasts over the forecast test period 1993–2016, validated against ERA5, and the difference between them (c). Black stippling represents statistically significant correlation (a & b) or correlation difference (c) at the 95% confidence level.

Existing operational systems from the C3S represent the state-of-the-art of dynamical seasonal forecasting and can also provide forecasts of summer heatwaves, but the skill has previously only been tested for ECMWF-5118. Individual systems (CMCC-35, MF-8, DWD-21 and ECMWF-51, Supplementary Table 2) display similar patterns of skill, such as the zone of low skill extending across Scandinavia and northern central Europe (Supplementary Fig. 9). A multi-model mean is often used in dynamical forecasting to smooth out errors in individual systems and boost forecast skill; this holds true for forecasts of NDQ90 for which the multi-model product provides statistically significant skill over 58% of Europe (Fig. 4). As a result, over the majority of the domain, there is no statistically significant difference between the data-driven and dynamical skills (with the exception of western Russia; Fig. 4c). Therefore, regions that are skilfully predicted by the dynamical system are also skilfully predicted by the data-driven system. Moreover, the skill in the zone extending over northern central Europe and Scandinavia, a well-known issue in dynamical systems, is also higher in the data-driven approach. When compared to the individual systems (Fig. 5), the data-driven approach is more skilful over certain areas, such as over Eastern Europe when compared to CMCC-35 and ECMWF-51, the previously mentioned Scandinavian zone in MF-8, and over France in DWD-21. However, the skill increase is rarely statistically significant. To predict summertime HWs over Europe, the data-driven approach is as capable as the state-of-the-art multi-model dynamical product and, in some places, more skilful than individual operational forecasting systems.
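The multi-model comparison above can be sketched as follows. This is an illustration only: it assumes that the multi-model product gives equal weight to each system's ensemble mean and that the correlation skill score is the Pearson correlation with the reference series; neither choice is confirmed by the text, and all names and values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equally long series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def ensemble_mean(members):
    """Average a list of member series (one value per forecast year)."""
    n_years = len(members[0])
    return [sum(m[i] for m in members) / len(members) for i in range(n_years)]

def multi_model_mean(systems):
    """Assumption: each system's ensemble mean receives equal weight."""
    return ensemble_mean([ensemble_mean(m) for m in systems.values()])

# Toy NDQ90 values for four years; not real forecast data.
reference = [5.0, 12.0, 7.0, 15.0]
systems = {
    "system_a": [[6.0, 10.0, 8.0, 12.0], [4.0, 11.0, 9.0, 13.0]],
    "system_b": [[7.0, 9.0, 6.0, 14.0]],
}
mmm = multi_model_mean(systems)
print(mmm, pearson(mmm, reference))
```

Pooling all members into one ensemble instead would weight systems by their ensemble size, which is an equally plausible design.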

Fig. 5: Differences in correlation skill score between the data-driven and individual C3S forecast systems.

Positive values indicate higher skill in the data-driven system for forecasts of seasonal heatwave indicators (MJJ NDQ90). Black stippling represents a statistically significant correlation difference at the 90% confidence level.

Newly proposed systems must demonstrate skill in forecasting the most exceptional events, and the data-driven forecasts display this capability in some cases (Fig. 6). In northern Italy, where both data-driven and dynamical systems generally display high skill, the forecasts from the top-performing data-driven models are remarkably close to the observed values for the two years with the greatest number of HW days (2003 and 2015). In this region, a simple linear regression model is as effective as ML-based models and outperforms the dynamical systems (Supplementary Fig. 8), while some ML models (e.g. Light Gradient Boost, DD-LGB) display stronger biases in NDQ90 than others. The extent of the HW in 2003 across western Europe and the Mediterranean basin is also well forecast by the data-driven approach (Supplementary Fig. 10), although the exceptionally deadly event of 2010 over western Russia was not predicted by either type of forecast20.

Fig. 6: Forecasts of summer heatwave indicators over northern Italy.

NDQ90 in ERA5 (black) is compared to the full ensemble (120 members) of the dynamical systems (C3S Ensemble) and the data-driven approach with three high-performing ML models (Linear Regression, DD-LR; AdaBoost, DD-AB; Light Gradient Boost, DD-LGB). Boxplots represent the medians, interquartile ranges and maxima and minima for each forecast year in the C3S Ensemble. The correlation skill scores over the 1993–2016 period are as follows: C3S ensemble median (0.63), DD-LR (0.74), DD-AB (0.78) and DD-LGB (0.76). The box used to define northern Italy is shown in Supplementary Fig. 10.

Discussion

Although dynamical forecasts of heat extremes display skill over much of Europe18,19, the zone of low skill over northern central Europe and Scandinavia is a problem that has persisted despite continued updates to dynamical systems44,45,48. Recent efforts have demonstrated that hybrid dynamical-ML approaches, in which only ensemble members that best represent the North Atlantic Oscillation are selected, can boost dynamical forecast skill of summer conditions in this region49. The purely data-driven approach described here also achieves the goal of improving upon dynamical systems (Figs. 4 & 5), with the benefit of doing so at a considerably lower cost by identifying region-specific predictors. Although Linear Regression displays very high skill in Central Europe and the Mediterranean basin, ML models such as Random Forest and Light Gradient Boosting have shown greater skill across Europe as a whole.

The computational expense of the data-driven approach is very low. For each grid cell, the optimisation of predictors requires roughly 1 CPU-hour (on the DKRZ Levante BullSequana XH2000 supercomputer with 3rd generation AMD EPYC CPUs), and the forecasts require only minutes; scaling to cover the 1° rectangular grid of C3S over Europe (1066 grid points) requires roughly 1000 CPU-hours in total. The optimisation-based feature selection is required only once per start date. Here, the data-driven system was initialised in May by choosing predictors prior to May 1st, and was applied to forecast an HW index of May–June–July. Unlike a dynamical system designed to output many variables at many start dates, our approach focuses on a specific task. In the future, the system can be easily re-optimised for other start dates, target dates and even other extreme events, and can include other potential predictors; for example, humidity for nighttime heatwaves50.

The predictors identified in the climate simulation (Fig. 3) are not necessarily equivalent to those in the real world, especially given that the training dataset past2k may present biases in both predictor and target variables. Moreover, drivers may change over time, and past2k may provide “outdated” predictors. For example, there has been a shift in the role of Arctic sea ice on European atmospheric circulation during recent decades51,52. However, it is clear that sufficient knowledge has been gained from the paleoclimate simulation to make accurate predictions. The feature selection identifies the principal role of predictors (e.g. soil moisture) at 4–8 weeks prior to initialisation, i.e. around March (Supplementary Fig. 6). This analysis provides indications for future studies on heatwave drivers, and can assist in describing the physical mechanisms behind their influence. Moreover, it serves as a means to study the recently highlighted differences in predictability between day and night extremes19,50.

Here, the pool of potential predictors used is wider than typically used in feature selection studies on S2S or seasonal forecasting; the number of cluster variables is 70, with each time-lag (up to 28 weeks prior to the target) counting as an individual predictor, thus leading to a total of roughly 2000 potential predictors. The framework does not rely only on expected or known predictors, but instead clusters predictor variables as a means of reducing the dimensionality of the problem without including human bias or relying on prior knowledge. A benefit of the optimisation algorithm used is its capacity to filter out unnecessary predictor information and identify the key predictors, thereby allowing the input of many potential predictors. Across the domain studied here, between 3.2 and 10.6% (6.5% on average) of the cluster-lag combinations are selected as predictors. Generally, including more predictors slows down ML forecasts, ruling out the possibility of including all 2000 in this case. To illustrate the benefits of the optimisation-based method, we compare it to an alternative and simpler method of feature selection based on linear correlation analysis, in which the selected predictors are those significantly (positively or negatively) correlated to the target data during the period 1993–2016. Although the correlation approach selects a similar number of predictors (7.8% on average across the domain), the resulting forecasts are considerably poorer in skill (Supplementary Fig. 11). This highlights the greater ability of our optimisation-based approach to capture physically plausible predictors compared to simpler statistical-based approaches.
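The size of the predictor pool quoted above follows from simple arithmetic, sketched here. It assumes one predictor per cluster variable and weekly time-lag, with 28 weekly lags; the helper name `selected_count` is hypothetical.

```python
N_CLUSTERS = 70  # cluster variables, as stated in the text
N_LAGS = 28      # assumption: one predictor per weekly lag, up to 28 weeks

POOL_SIZE = N_CLUSTERS * N_LAGS  # 1960, i.e. "roughly 2000"

def selected_count(share, pool_size=POOL_SIZE):
    """Number of cluster-lag combinations for a given selected share (0..1)."""
    return round(share * pool_size)

print(POOL_SIZE)
# Minimum, mean and maximum shares reported across the domain.
for share in (0.032, 0.065, 0.106):
    print(f"{share:.1%} of the pool is about {selected_count(share)} predictors")
```

Under these assumptions, the optimisation retains on the order of 60 to 210 cluster-lag combinations per grid point out of roughly 2000 candidates.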

By focusing on one climate simulation (past2k) for training, it is demonstrated that there is a successful transfer of learning between the model and the real world. Increasing the training period from 50 to 1850 years of paleo-simulation data has a noticeable impact on the forecast skill (Supplementary Fig. 12), although there is limited growth in skill between 1000 and 1850 years of training data. This plateau implies that increasing the data beyond what is available from a single source, for example, by extending the paleo-simulation further back in time, would contribute little to improving the skill. Thus, the next avenue of research should be to attempt a multi-model training approach, for example, using the dynamical forecasting systems themselves as training data. Recent advances in short-term forecasting have also demonstrated that an ML-based ensemble outperforms deterministic data-driven forecast models, as in dynamical systems53. Emulating the dynamical multi-model approach with the data-driven system has the potential to further increase skill.

Although already successful, future improvements can be made to this prediction system. Parameter tuning for the ML models used could identify potential improvements, but it is an enormous undertaking for several models covering a wide geographical domain. K-means can be replaced by clustering algorithms which provide more interpretable and physically meaningful output54. We find that local variables are important for accurate predictions (Fig. 3), meaning that each target region should have a diverse range of potential predictors located close to it. The setup used in this study is ideal for forecasting central European conditions. The edge of the domain (e.g. Western Russia) displays lower skill (Fig. 4), likely due to the use of fewer local variables centred around this part of the domain. A crucial difference to the dynamical systems is that the current data-driven approach is inherently deterministic; future efforts must either explore probabilistic alternatives53 or how to leverage the use of a single skilful member, such as for dynamical ensemble member selection49.

Methods

ERA5 reanalysis

The ERA5 reanalysis55,56 provides the target and predictor data for the modern period 1993–2016. Daily maximum 2 m temperature (TMX) is used to calculate the heatwaves. To allow for comparison with the dynamical seasonal forecasts from the Copernicus Climate Change Service (C3S), TMX is regridded from the 0.25° regular grid to a regular 1° grid. The following variables are used as predictors: mean sea level pressure (SLP), volumetric soil moisture content in the upper 7 cm (SM), sea ice concentration (SIC), sea surface temperature (SST), geopotential height at 500 hPa (z500), outgoing longwave radiation (OLR), and TMX.

Previous studies have shown that ERA5 accurately reproduces both mean and extreme temperatures57, confirming that it is a reliable source of climate information over Europe, in particular for heatwave indicators58.

MPI-ESM paleo-simulation “past2k”

The “past2k” simulation is a simulation of the climate system with a state-of-the-art Earth System Model over the past two millennia. It was performed with the MPI-ESM1.2-LR model, which couples the ECHAM6.3 as its atmospheric component (1.875° horizontal resolution with 47 vertical levels) and the MPIOM1.63 as its ocean component (1.5° horizontal resolution reaching 30–40 km in the subpolar North Atlantic, with 40 vertical levels). The spin-up time of the simulation was 1200 model years prior to the year 0. The model is forced by reconstructions of past atmospheric greenhouse gases, volcanic forcing, solar forcings (with an artificial 11-year cycle), derived from the analysis of polar ice-core data; land-use changes derived from historical and palynological data; and ozone concentrations resulting from an atmospheric photochemistry model forced by past solar irradiance59,60. No de-biasing or correction was made to the predictor or target data in past2k.

The soil moisture content in past2k is provided as the mass of water per m2 in the upper 10 cm, as opposed to the volumetric water content in the upper 7 cm provided by ERA5. To convert from m3 m−3 (ERA5) to kg m−2 (past2k), we scale by density and extrapolate the value to 10 cm by assuming an even distribution of water in the top 10 cm and a water density of 1000 kg m−3.
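The unit conversion described above amounts to multiplying the volumetric content by the water density and the layer depth. A minimal sketch, in which the function name and the example value are hypothetical:

```python
WATER_DENSITY = 1000.0  # kg m-3, as stated in the text
LAYER_DEPTH = 0.10      # m, depth of the past2k soil layer

def volumetric_to_mass(theta, depth=LAYER_DEPTH, density=WATER_DENSITY):
    """Convert volumetric soil moisture (m3 m-3) to water mass per area (kg m-2).

    Assumes the volumetric content is uniform over `depth`, as in the text.
    """
    if not 0.0 <= theta <= 1.0:
        raise ValueError("volumetric content must lie between 0 and 1")
    return theta * density * depth

# Example: 0.25 m3 m-3 corresponds to about 25 kg m-2 in the top 10 cm.
print(volumetric_to_mass(0.25))
```

Because the ERA5 value for the upper 7 cm is assumed to hold over the full 10 cm, no depth-dependent correction enters the conversion.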

The representation of summer temperature and heatwaves in past2k is found to agree with ERA5 (Supplementary Fig. 5). First, the patterns of interannual variability of NDQ90 are similar across Europe, although the magnitude is consistently higher in past2k. For example, peaks appear across the Mediterranean basin, the Caucasus and north-western Russia. In the two leading Empirical Orthogonal Functions (EOFs) of average summer TMX, past2k and ERA5 resemble each other in terms of patterns and magnitude; the leading EOF is dominated by variability over Russia, while the second clearly separates land and ocean variability.

Dynamical seasonal forecast systems

The C3S provides several operational dynamical seasonal forecast systems, each with a different number of ensemble members (i.e. individual realisations with perturbed initial conditions used to sample the uncertainty) and set-ups. Four systems from the C3S (Supplementary Table 2) were selected for this study: SPS3.5 from the Centro Euro-Mediterraneo sui Cambiamenti Climatici (CMCC-35, 40 ensemble members), System2.1 from the Deutscher Wetterdienst (DWD-21, 30 ensemble members), SEAS5.1 from the European Centre for Medium-Range Weather Forecasts (ECMWF-51, 25 ensemble members), and System8 from Météo-France (MF-8, 25 ensemble members). These systems are initialised in “burst-mode”, meaning the full set of ensemble members is run on the first day of the month. Other forecasting systems release ensemble members periodically throughout the month, in “lagged” mode. To remain consistent with the data-driven approach, which is also initialised on the first of each month, we use only burst-mode forecasts. The horizontal spatial resolution is 1°, and 6-hourly data are used to extract the daily maximum (TMX).

Definition of heatwave index

The target data used in this study is the number of days, from 1st May to 31st July, in which TMX exceeds the climatological 90th percentile (NDQ90). Each product uses its own respective climatology, thereby adjusting the mean biases inherent to the dynamical seasonal prediction systems. For ERA5, the 90th percentile is calculated by first averaging each calendar day over the 1993–2016 period and then applying an 11-day running mean to smooth the climatology61. For past2k, the period used is the last 30 years of the simulation (1821–1850), a choice which is justified by the stability of the climate throughout the simulation60. The NDQ90 index provides a measure of the propensity of a season to experience extreme temperatures, and displays similar variability and predictability to other indices based on intensity8,18.
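One possible reading of the index definition is sketched below. It assumes that the 90th percentile is taken across years for each calendar day, that the threshold is smoothed with a centred 11-day running mean truncated at the ends of the season, and that percentiles use linear interpolation. These details, like all function names, are assumptions rather than the authors' implementation.

```python
def percentile(values, q):
    """Percentile (q in 0..100) using linear interpolation between ranks."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def running_mean(series, window=11):
    """Centred running mean; the window is truncated at the series ends."""
    half = window // 2
    out = []
    for i in range(len(series)):
        chunk = series[max(0, i - half): i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out

def ndq90(tmx_by_year, window=11):
    """Count, for each year, the days on which TMX exceeds the smoothed threshold.

    tmx_by_year: list of years, each a list of daily TMX values covering the
    same calendar days (e.g. 1st May to 31st July).
    """
    n_days = len(tmx_by_year[0])
    q90 = [percentile([year[d] for year in tmx_by_year], 90) for d in range(n_days)]
    threshold = running_mean(q90, window)
    return [sum(1 for d in range(n_days) if year[d] > threshold[d])
            for year in tmx_by_year]

# Toy data: ten "years" of five days, with year k at a constant temperature k.
toy = [[float(k)] * 5 for k in range(10)]
print(ndq90(toy))
```

In practice, the climatology would be computed over a window extending beyond the season so that the running mean is not truncated at the first and last days.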

Optimisation-based feature selection and seasonal forecasts

Here, a data-driven seasonal forecast system is designed based on an optimisation-based feature selection framework62 (Supplementary Fig. 1). The framework is composed of the following steps: a dimensionality reduction of potential predictor variables; a feature selection that identifies the optimal combination of variables, spatial domains and time-lags; the use of selected features to train statistical or ML prediction models. Previous work has successfully tested the ability of the framework to detect HWs based on short-term drivers in a detection-mode context62; here, the framework is adapted to only use predictor information on seasonal timescales, thereby providing seasonal forecasts.

First, a pool of potential predictors is defined. Here, a predictor refers to a variable within a domain at a particular time-lag. The chosen variables describe atmospheric, land and oceanic conditions, some of which are known to influence the European climate or extremes, such as atmospheric circulation (e.g. z50063,64,65), ocean-atmospheric interactions (e.g. SST66,67), soil moisture68,69, and sea ice51,52. Among these, variables such as SST and OLR are used to capture modes of climate variability30,70. Using variables in a global domain or from outside Europe allows for the potential identification of teleconnections. An upgraded k-means clustering is applied to each variable to extract five clusters per domain (Supplementary Fig. 2; Supplementary Table 1). The innovation, compared to traditional k-means, is the use of weighted multi-dimensional distances, where the Euclidean distances between the time series are given 95% weight and the remaining 5% is assigned to spatial distances on the geoid between the grid point and the centroid of the cluster. The clustering is performed on variable anomalies in ERA5 with respect to the 1993–2016 climatology. The ERA5-derived cluster shapes are then used to calculate weekly area-averages in both ERA5 and past2k, covering the period from November 1st to April 30th. Given that our objective is to demonstrate the capability of such a framework, we use k = 5 as a compromise between choosing a large pool of predictors and performing a suitable reduction of the dimensionality of the problem. The numbered week of the year is also included as a dummy predictor variable.
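The weighted distance used in the clustering can be sketched as follows. The text does not state how the two distances are made comparable, so the normalisation scales `series_scale` and `space_scale` are hypothetical parameters, and the distance on the geoid is approximated here by a great-circle distance on a sphere.

```python
import math

EARTH_RADIUS_KM = 6371.0  # spherical approximation of the geoid (assumption)

def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in km between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))

def weighted_distance(series, centroid_series, point, centroid_point,
                      series_scale, space_scale, w_space=0.05):
    """Combine time-series distance (95%) and spatial distance (5%).

    point and centroid_point are (lat, lon) tuples in degrees. Both
    distances are divided by a scale so that they are dimensionless
    (assumption: the normalisation is not specified in the text).
    """
    d_series = math.sqrt(sum((a - b) ** 2 for a, b in zip(series, centroid_series)))
    d_space = great_circle_km(*point, *centroid_point)
    return (1 - w_space) * d_series / series_scale + w_space * d_space / space_scale

# A grid point 10 degrees east of the centroid with a similar time series.
d = weighted_distance([0.1, 0.3, -0.2], [0.0, 0.2, -0.1],
                      (45.0, 10.0), (45.0, 0.0),
                      series_scale=1.0, space_scale=1000.0)
print(d)
```

In a k-means loop, this function would replace the plain Euclidean distance in the assignment step, so that grid points tend to join clusters that are both similar in time and spatially close.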

In the second step, the identification of predictors of extreme summers is treated as an optimisation problem. The optimisation algorithm employed is the Probabilistic Coral Reef Optimization algorithm with Substrate Layers (PCRO-SL)43, which uses a multi-model method to combine different search procedures. In particular, it has recently been adapted to create a Spatio-Temporal Cluster-Optimized Feature Selection (STCO-FS) for heatwaves62. While previously the STCO-FS has been applied to the detection of HWs, here it is adapted to seasonal forecasting. We provide a description of the seasonal forecast setup of STCO-FS and refer readers to the aforementioned references for more complete technical descriptions of the optimisation algorithm PCRO-SL and the feature selection setup STCO-FS.

The aim of the second step is to select the combination of predictors which together provide the optimal skill for the target time series. Here, the problem to be optimised is the forecasting of NDQ90 using a multiple linear regression model. The optimisation is performed on past2k data, with training and test periods covering years 0–1600 and 1601–1850, respectively, and on each grid point individually. The skill score used is the root-mean-squared-error normalised by the standard deviation (interannual variability) of the target data (N-RMSE). The training score is calculated with a 5-fold cross-validation of the training period to reduce overfitting. Three parameters are simultaneously adjusted during the evolution of the optimisation: the variable cluster, the time-lag and the sequence length. The variable cluster is treated as a binary selection process (either selected or not). The time-lag, together with the sequence length, determines the times prior to May 1st at which the cluster is selected. Specifically, the sequence length represents the period during which the cluster is important. The value ranges for each parameter are as follows: variable cluster (0–1), time-lag (0–24 weeks prior to May 1st) and sequence length (0–8 weeks).
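
As an illustration of the skill score, the following minimal sketch evaluates one candidate selection of predictors with a 5-fold cross-validated N-RMSE. The function names and the synthetic data are hypothetical, the folds are taken as contiguous blocks of years without shuffling, and the population standard deviation is used for the normalisation; none of these choices is specified above, and the time-lag and sequence-length parameters are omitted for brevity.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold


def n_rmse(y_true, y_pred):
    """RMSE divided by the standard deviation of the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return rmse / np.std(y_true)


def cv_n_rmse(X, y, selected, n_splits=5):
    """Mean N-RMSE over the folds for the predictors flagged in `selected`."""
    X_sel = X[:, np.asarray(selected, dtype=bool)]
    scores = []
    for train_idx, val_idx in KFold(n_splits=n_splits).split(X_sel):
        model = LinearRegression().fit(X_sel[train_idx], y[train_idx])
        scores.append(n_rmse(y[val_idx], model.predict(X_sel[val_idx])))
    return float(np.mean(scores))


# Synthetic example: six candidate predictors, two of which carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(1600, 6))  # one row per training year
y = 2.0 * X[:, 0] - X[:, 3] + rng.normal(scale=0.5, size=1600)

good = cv_n_rmse(X, y, [1, 0, 0, 1, 0, 0])  # the two informative predictors
bad = cv_n_rmse(X, y, [0, 1, 1, 0, 1, 1])   # only uninformative predictors
```

An N-RMSE close to 1 means the forecast is no better than predicting the climatological mean, so the optimisation favours selections such as `good` over `bad`.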

The optimisation algorithm begins with a first guess, after which the evolution of solutions during the optimisation improves both the training and test scores until the algorithm converges on an optimal solution, which typically occurs after 10,000 to 15,000 evolutions (Fig. 1). The solution with the lowest N-RMSE of all evolutions is selected as the optimal solution (see an example of selected variables and lags for an optimal solution in Supplementary Fig. 3). In summary, the optimal solution is obtained by repeating seasonal forecasts in the model world and adjusting the input predictors. The forecast is considered seasonal because the target data correspond to May–June–July, whereas the predictor data are obtained from the months prior to May.

The final step is to apply the method to real-world data. This requires simply changing the test period to 1993–2016 and the test predictors to those of ERA5, and extending the training period to cover the full 1850-year period of past2k. Given that past2k and ERA5 have different grids, a nearest-neighbour mapping function is used to associate past2k grid cells with those of ERA5.
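
A nearest-neighbour association between two grids can be sketched as below. This is an illustrative example only: the function name is hypothetical, and the direction of the mapping (finding the closest source cell for each destination cell) is an assumption. Points are converted to unit vectors so that the comparison respects the wrap-around in longitude.

```python
import numpy as np


def nearest_cell_index(src_lats, src_lons, dst_lats, dst_lons):
    """For each destination point, return the index of the closest source point.

    All inputs are 1-D arrays of coordinates in degrees.
    """
    def to_xyz(lats, lons):
        lat, lon = np.radians(lats), np.radians(lons)
        return np.column_stack((np.cos(lat) * np.cos(lon),
                                np.cos(lat) * np.sin(lon),
                                np.sin(lat)))

    src = to_xyz(np.asarray(src_lats, dtype=float), np.asarray(src_lons, dtype=float))
    dst = to_xyz(np.asarray(dst_lats, dtype=float), np.asarray(dst_lons, dtype=float))
    # On the unit sphere, the largest dot product is the smallest
    # great-circle distance.
    return np.argmax(dst @ src.T, axis=1)
```

The dense dot product is adequate for a regional domain; for full global grids a spatial index such as a k-d tree would be more memory-efficient.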

While the optimisation is based on a multiple linear regression forecast for the sake of computational time, the final step of producing real-world forecasts can be performed with ML models in order to boost skill. Several candidates are tested (Supplementary Fig. 8), and in all cases, the default values provided in the Python modules (see Code Availability statement) are used. For instance, the Random Forest regressor uses n_estimators=100, criterion="squared_error", max_depth=None, and max_features=1.0. Random Forests provide the greatest area of significant correlation over Europe, corresponding to a 10% increase over Linear Regression. However, the most skilful model depends on the grid point (Supplementary Fig. 8). In the most skilfully predicted regions (e.g. the central Mediterranean), all models provide significant skill, except Decision Trees. In the low skill zone extending over northern central Europe and Scandinavia, significant skill is rare among models, but the best performing models are Random Forest, Light Gradient Boost and AdaBoost. Multi-Layer Perceptron is a deep learning (DL) neural network model, ideal for larger datasets in which there are more non-linear relationships between predictors and the target. In this study, it is outperformed by ML-based models such as Random Forest, which has been found to be more suited to similar tasks29,33. The data-driven approach displayed in this study (e.g. Fig. 4) is derived from the most skilful model, depending on the grid point. While all models provide similar patterns of skill, there is no coherent spatial pattern in which model provides the best skill (Supplementary Fig. 8).

The framework allows us to quantify the relative importance of each variable and cluster, and crucially, to identify the time-lag from short-term to seasonal timescales. By ensuring that potential predictors are restricted to certain time-lags, the system resembles a dynamical forecast system that receives climate information only before the initialisation date. The cut-off time for potential predictors determines the effective “initialisation” time; for example, using predictor data prior to May 1st to target summer HWs is equivalent to a May initialisation of the dynamical system.

SHapley Additive exPlanations - SHAP analysis

SHAP is a method used to interpret machine learning models by quantifying the contribution of each feature to individual predictions47. Here, we apply SHAP to Random Forest forecasts to explore the contribution of predictors selected in past2k to the forecasts using ERA5 predictors from 1993–2016. For each example predictor studied (Supplementary Fig. 7), and in each grid cell, we calculate the average of the SHAP value magnitudes for the target predictors.
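
The aggregation of SHAP values into a mean magnitude per predictor can be illustrated with a minimal sketch. Note that it deliberately swaps the Random Forest for a linear model, because for a linear model with independent features the SHAP value has the closed form coefficient × (feature value − mean feature value in the background data), which keeps the example free of additional dependencies. The data and variable names are synthetic and hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Synthetic stand-ins: a long training sample and 24 forecast years.
X_train = rng.normal(size=(1850, 3))
y_train = (1.5 * X_train[:, 0] - 0.2 * X_train[:, 1]
           + 0.1 * rng.normal(size=1850))
X_test = rng.normal(size=(24, 3))

model = LinearRegression().fit(X_train, y_train)

# Closed-form SHAP values of a linear model (independent-feature assumption):
# one value per forecast and per predictor, relative to the training mean.
background = X_train.mean(axis=0)
shap_values = model.coef_ * (X_test - background)  # shape (24, 3)

# Average magnitude over the forecasts gives one importance per predictor.
mean_abs_shap = np.abs(shap_values).mean(axis=0)
```

For each forecast, the SHAP values sum to the difference between that forecast and the mean prediction over the background data, so the mean magnitude measures how strongly a predictor moves forecasts away from the baseline.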

Statistics

Statistical significance of correlations (e.g. Fig. 4a) is calculated using the two-sided test included in the stats.pearsonr function from the Python module scipy. Statistical significance of the difference between correlations (e.g. Fig. 4c) is calculated using Fisher's Z-test, suitable for correlations with overlapping data (i.e. the ERA5 data used for validation). In both cases, a confidence level of 95% is used, and the sample size is 24 (the number of available re-forecast years).
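
Both tests can be sketched as follows. The first function gives the smallest correlation that is significant for a given sample size under the two-sided test. The second compares two correlations that share the same observations; the exact variant of Fisher's Z-test is not specified above, so the sketch assumes the formulation of Meng, Rosenthal and Rubin (1992) for overlapping dependent correlations. Function names are hypothetical.

```python
import numpy as np
from scipy import stats


def critical_r(n, alpha=0.05):
    """Smallest |r| that is significant in a two-sided test with n samples."""
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
    return t_crit / np.sqrt(t_crit ** 2 + n - 2)


def dependent_corr_test(obs, fc_a, fc_b):
    """Z-test for corr(obs, fc_a) versus corr(obs, fc_b).

    Assumed variant: Meng, Rosenthal and Rubin (1992).
    Returns the z statistic and the two-sided p-value.
    """
    n = len(obs)
    r_a, _ = stats.pearsonr(obs, fc_a)
    r_b, _ = stats.pearsonr(obs, fc_b)
    r_ab, _ = stats.pearsonr(fc_a, fc_b)  # correlation between the forecasts
    r2_bar = (r_a ** 2 + r_b ** 2) / 2
    f = min((1 - r_ab) / (2 * (1 - r2_bar)), 1.0)
    h = (1 - f * r2_bar) / (1 - r2_bar)
    # Difference of Fisher z-transformed correlations, scaled for dependence.
    z = (np.arctanh(r_a) - np.arctanh(r_b)) * np.sqrt(
        (n - 3) / (2 * (1 - r_ab) * h))
    p = 2 * stats.norm.sf(abs(z))
    return float(z), float(p)
```

With 24 re-forecast years, `critical_r(24)` is approximately 0.40, so correlations below this magnitude are not significant at the 95% level.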