Introduction

In recent years a revolution in weather prediction has occurred in which machine learning-based models can match or outperform physics-based models over a range of metrics1,2,3,4,5. By learning the 1–6-hour evolution of the atmospheric state, these models can produce skilful forecasts for several days by feeding their predictions back into themselves, as dynamical models do, an approach known as “autoregressive” forecasting6. Recent studies suggest skilful forecasts can be made covering several weeks5,7,8,9 and that very large ensembles can provide improved estimates of extreme events10. Beyond these timescales, instabilities can grow, or the predictions blur and smooth, restricting the application of such models to long-range climate predictions at monthly or seasonal time scales11. Some models are stable for long autoregressive rollouts and can capture the climatological state and aspects of interannual variability7,12,13,14,15; however, to date, their ability to make skilful seasonal predictions has not been established.

Machine learning predictions at seasonal timescales (1–3 month lead times) often take more direct approaches, learning relationships between predictors and specific predictands, or resort to using model data for training. For example, skilful predictions have been demonstrated for the El Niño-Southern Oscillation (ENSO) as well as some regional-scale climate variability16,17,18,19,20,21,22. Understanding the mechanisms underpinning such predictions can be difficult, and developing methods to provide explainability is a key topic of research23,24. A key limitation at longer forecast periods is the relatively small sample size available for training, as there is only one event per season. This restricts the ability to learn complex relationships while keeping a suitable number of years separate for testing, as needed for dynamical models25. One approach to overcome this is to utilise model data for training19,26,27, but the errors and biases found in physics-based models are then inevitably inherited.

In this study we assess the newly developed machine learning weather model ACE213 from a seasonal forecasting perspective. This model predicts the atmospheric evolution at 6-hourly time steps and remains stable over long autoregressive forecast periods, enabling it to provide seasonal simulations even though it was not explicitly trained to provide such predictions. It is trained only on historical conditions from the ERA5 dataset28. We initialise ACE2 during autumn each year from 1993 to 2015 and assess the seasonal skill of December-January-February (DJF) conditions, a lead time of 1–3 months. To provide boundary conditions, the SST and sea-ice anomalies at the time of initialisation are persisted throughout the forecast period each year. The influence of large-scale drivers such as ENSO is therefore preserved, but any coupled ocean-atmosphere processes are missing. We compare the ACE2 seasonal forecasts to those from GloSea, a leading physics-based coupled ocean-atmosphere ensemble prediction system29,30.

Results

Skilful data-driven seasonal forecasts

Over the 23-year assessment period the pattern of seasonal skill (1-3 month lead) demonstrated by ACE2 closely resembles that of the dynamical model for mean sea level pressure (MSLP, Fig. 1a, b). This is remarkable considering ACE2 was designed for stable climate simulations, with no deliberate attempt to capture seasonal predictability. While much of the tropical skill is due to the persistence of slowly evolving processes such as ENSO from the initialisation of the tropical oceans31,32, ACE2 also exhibits skill across the tropical land and the extratropics, including the North Atlantic and North Pacific. Interestingly, ACE2 also exhibits reduced skill over Eurasia, as seen in the physics-based model GloSea. In most regions the ACE2 correlation is weaker than that for GloSea. For example, the area-average correlation across the northern hemisphere extratropics (20°N to 90°N) is 0.39 in ACE2 and 0.44 in GloSea, while over the tropics (20°S to 20°N) the scores are 0.79 and 0.82, respectively. In comparison, a persistence forecast using October monthly mean conditions scores 0.17 across the northern hemisphere and 0.52 across the tropics. Subsampling predictions across years indicates no evidence that these results are biased by predictions based on initial conditions seen during the training of ACE2 (Supplementary Figs. 2 and 3).

Fig. 1: Skilful seasonal (DJF) predictions from the ACE2 machine learning and GloSea dynamical models with a lead time of 1-3 months.
figure 1

Correlation score of mean sea level pressure (a, b), surface temperature (c, d) and precipitation (e, f) for ACE2 (a, c, e) and GloSea (b, d, f), calculated across 1993/1994 to 2015/2016. Stippling indicates correlations significantly different from zero (23 years, 95% confidence level).

For temperature (Fig. 1c, d) we continue to see large regions of skill from ACE2, including South America, Africa, Australia and parts of North America. As seen for MSLP, GloSea outperforms ACE2 across many parts of the world, with the area-weighted mean correlation across the northern hemisphere extratropics at 0.41 in ACE2 and 0.45 in GloSea, and 0.68 and 0.77, respectively, across the tropics. The skill for both systems is lower for precipitation; however, the ACE2 skill pattern (Fig. 1e) once again closely resembles that of GloSea (Fig. 1f), particularly across the tropics, the Caribbean and east Asia.

These results demonstrate that the ACE2 model can skilfully predict seasonal variability across many parts of the world with a lead time of 1-3 months.

Predictability of the North Atlantic Oscillation

The NAO is the primary mode of seasonal variability across the North Atlantic33 and is a key focus for extratropical seasonal prediction34,35,36. ACE2 can predict the DJF-mean NAO37 with a correlation score of r = 0.47 (Fig. 2a) at a lead time of 1–3 months. This is statistically significant at the 95% level (p = 0.023) and is highly competitive with a range of dynamical models. For example, over a shorter 19-year analysis period (1993–2011) ACE2 exhibits higher NAO skill (r = 0.42) than four operational ensemble prediction systems36.

Fig. 2: Skilful predictions of the DJF-mean North Atlantic Oscillation (NAO).
figure 2

a DJF-mean NAO index, standardised to unit variance, from ERA5 (black), GloSea (red) and ACE2 (blue). b Ensemble-mean RMSE (solid) and spread (dashed) for the NAO (hPa) as a function of lead time during November each year, averaged over all years. c Relationship between NAO correlation score and ensemble size (solid lines) and skill in predicting individual withheld ensemble members (dashed lines), based on 1000 random samples without replacement. The dashed lines are thickened where they fall significantly below the corresponding solid line (outside the 95% sampling range). The horizontal dashed grey line indicates the 95% significance level for a sample size of 23 years.

It is important to note that only the 9 winters between 2002 and 2010 are fully independent of the ACE2 training period13. Over this shorter period the NAO correlation remains high (r = 0.6), although with reduced significance due to the smaller sample size (p = 0.07). Skill is also high across an extended 1981–2022 period (r = 0.52) and a subsampling analysis suggests that these NAO results are not biased by predictions from years within the ACE2 training period (Supplementary Figs. 1 and 3).

Interestingly, ACE2 gives a poor prediction of the extreme winter in 2009/2010 (see Section “The extreme winter of 2009/2010” below). Nevertheless, given the long autoregressive forecasts, the lack of a well resolved stratosphere, and the use of non-interacting, persisted SSTs, the ACE2 model skilfully predicts the NAO. This is surprising as both stratospheric variability and interactive ocean processes underpin dynamical model skill38,39.

We also find that the ACE2 and GloSea NAO predictions are not strongly correlated (r = 0.34, p = 0.11) and so there may be additional value in combining them. Indeed, an ensemble mean constructed from both models results in an NAO correlation score of r = 0.65 (p < 0.01), matching that estimated by GloSea with an extended ensemble size of 127 members. Furthermore, after removing the climatological mean, the ACE2 and GloSea NAO predictions appear to be drawn from the same underlying distribution (two-sample KS-test, 95% confidence). This indicates that ACE2 could also be utilised to enhance dynamical model ensembles.
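As an illustration of the combination and distribution comparison described above, a minimal sketch (not the authors' code) might look as follows; the array names and synthetic values are placeholders standing in for the real DJF-mean NAO series.

```python
import numpy as np
from scipy.stats import ks_2samp, pearsonr

# Placeholder data standing in for DJF-mean NAO values (year, member)
rng = np.random.default_rng(0)
era5_nao = rng.standard_normal(23)            # observed index per year
ace2_nao = rng.standard_normal((23, 64))      # ACE2 ensemble
glosea_nao = rng.standard_normal((23, 63))    # GloSea ensemble

# Multi-model ensemble mean: average the two single-model ensemble means
combined = 0.5 * (ace2_nao.mean(axis=1) + glosea_nao.mean(axis=1))
r, p = pearsonr(combined, era5_nao)

# Two-sample KS test on anomalies pooled across years and members
stat, p_ks = ks_2samp((ace2_nao - ace2_nao.mean()).ravel(),
                      (glosea_nao - glosea_nao.mean()).ravel())
```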

In addition to skilful seasonal predictions, the ACE2 ensemble closely matches the dynamical model in terms of NAO variability. Following initialisation, we find that the ACE2 ensemble-mean error and ensemble spread increase in line with GloSea (Fig. 2b, Equations (1) and (2)). Furthermore, the DJF-mean total standard deviation across all years and members is 4.3 hPa in ERA5, 3.6 hPa in ACE2 and 3.8 hPa in GloSea. The standard deviation of the ensemble-mean NAO is 1.11 hPa in ACE2 and 1.21 hPa in GloSea. The lagged-ensemble methodology used here therefore enables sufficient ensemble member spread to develop, but other methods for ensemble generation are key topics for future research.

In line with dynamical models34,40,41, ACE2 NAO skill also increases strongly with ensemble size (solid line, Fig. 2c). This is encouraging as it is much cheaper and quicker, in computational terms, to increase the ensemble size of data-driven models compared to dynamical models. However, it can also be seen that when the ACE2 ensemble mean is used to predict one of its own individual members (so-called ‘perfect model’ skill), the skill is markedly lower (r = 0.25, dashed lines in Fig. 2c) than the ACE2 skill in predicting the observed NAO (thick solid lines, Fig. 2c). The ratio of predictable components (Equation (3)) provides a measure of observed and modelled predictability and variance. For ACE2 this quantity is found to be 1.6, only slightly less than the 1.8 for GloSea, but still greater than 1 (90% confidence). This indicates that for ACE2, the ensemble mean variance is small compared to the total ensemble variance given its skill in predicting the observed NAO42.
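The skill-versus-ensemble-size analysis (solid lines in Fig. 2c) can be sketched as below; this is a hypothetical illustration with placeholder arrays, whereas the real analysis uses the hindcast NAO values and 1000 draws without replacement per ensemble size.

```python
import numpy as np

rng = np.random.default_rng(0)
nao = rng.standard_normal((23, 64))   # placeholder (year, member) NAO values
obs = rng.standard_normal(23)         # placeholder observed NAO per year

skill = []
for m in range(1, 65):                # ensemble sizes 1..64
    rs = [np.corrcoef(nao[:, rng.choice(64, m, replace=False)].mean(axis=1),
                      obs)[0, 1]
          for _ in range(1000)]       # 1000 random member subsets
    skill.append(np.mean(rs))         # mean correlation at this ensemble size
```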

Therefore, despite having been trained only on reanalysis data, the ACE2 predictions also exhibit a signal-to-noise error which resembles that found in dynamical models34,40,42,43,44. This is somewhat surprising, as it suggests that the signal-to-noise error is not restricted to physics-based model error and may instead arise from some other damping effect on the predictable signal. For example, weak eddy forcing and feedback are one hypothesised cause of the error45; however, these characteristics are not weak within the reanalysis used to train ACE2. Further investigation of ACE2 characteristics is needed, but we note that machine learning predictions can also exhibit damping and smoothing of the kinetic energy spectrum11,46, potentially leading to similar errors in forecast anomaly amplitude. It is possible that the same qualitative behaviour occurs for different reasons in the ACE2 and GloSea models, but further research is needed to understand if this is the case.

ENSO as a driver of seasonal skill

ENSO is the primary mode of interannual climate variability and is a key driver of seasonal skill across many parts of the world47,48. In this section we investigate whether ACE2 is correctly capturing ENSO teleconnections.

Composite differences between El Niño and La Niña years (Fig. 3) reveal that ACE2 exhibits very similar teleconnection patterns to those seen in ERA5 and GloSea for both MSLP and surface temperature. In particular, we find that El Niño deepens the Aleutian low and influences the North Atlantic jet, extending it eastward from the Caribbean. This suggests that ACE2 is capturing the influence of ENSO on the subtropical jet, an important mechanism underpinning the global influence of ENSO47,49. In terms of the surface temperature response, ACE2 also exhibits very similar ENSO teleconnections to ERA5 and GloSea, particularly over North America, South America, southern Africa and Australia. These composites indicate that ACE2 is correctly capturing the regional interannual variability associated with ENSO across many parts of the world despite being trained only on the 6-hourly evolution of the atmosphere.

Fig. 3: Influence of ENSO on DJF surface conditions.
figure 3

Composite maps of El Niño years (n = 8) minus La Niña years (n = 9) for mean sea level pressure (hPa) and surface temperature (K) anomalies for ERA5 (a, b), ACE2 (c, d) and GloSea (e, f). Shaded contours show the DJF mean anomaly. Stippling indicates significant differences (two-tailed t-test, 95% confidence level).
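The composite-difference and significance computation behind Fig. 3 can be written compactly; the sketch below uses placeholder anomaly fields and illustrative year indices rather than the actual ENSO years.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
anoms = rng.standard_normal((23, 180, 360))   # DJF-mean anomalies per year
nino_idx = [1, 4, 9, 11, 13, 16, 21, 22]      # illustrative indices (n = 8)
nina_idx = [2, 5, 6, 7, 12, 14, 15, 17, 18]   # illustrative indices (n = 9)

# El Niño minus La Niña composite and two-tailed t-test per grid cell
composite = anoms[nino_idx].mean(axis=0) - anoms[nina_idx].mean(axis=0)
_, pval = ttest_ind(anoms[nino_idx], anoms[nina_idx], axis=0)
stipple = pval < 0.05                         # significant at the 95% level
```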

The extreme winter of 2009/2010

As a final part of our assessment we focus on predictions for the extreme northern hemisphere winter of 2009/2010, which is part of the independent dataset withheld during the training of ACE2. This winter was characterised by a record negative NAO, well beyond the anomalies seen in other years. It was also subject to a minor and a major sudden stratospheric warming (SSW), a strong El Niño and an easterly Quasi-Biennial Oscillation (QBO)50. The winter-mean MSLP anomaly (Fig. 4a) exhibits a very zonal negative NAO which is well captured by GloSea (Fig. 4c). However, the ACE2 ensemble-mean prediction does not appear to capture this signal, with only slightly above average pressure across the Arctic (Fig. 4b). This is surprising given the strong tropical forcing and potentially indicates a limitation of ACE2 in predicting extreme, out-of-sample conditions. Exploring this further, we find that both ERA5 and GloSea exhibit a weakened stratospheric polar vortex (Fig. 4d, f), while ACE2 exhibits near-normal vortex strength (Fig. 4e).

Fig. 4: Surface and stratospheric anomalies associated with the extreme winter of 2009/2010.
figure 4

Anomalies from the 1994–2016 climatology of MSLP (hPa) and zonal wind at 10 hPa (m s−1) for ERA5 (a, d), ACE2 (b, e) and GloSea (c, f). ACE2 stratospheric conditions are model layer 0 (above 50 hPa).

In terms of SSWs, the winter comprised a minor warming in December 2009 and a major warming in January 2010, reflecting the increased SSW probability due to the El Niño and easterly QBO50,51,52,53. GloSea appears to capture this increase, with 81% of members (51 out of 63) experiencing easterly zonal winds at 10 hPa and 60°N within the winter. This is significantly higher than GloSea’s climatological probability of 62% (two-proportion Z-test, 95% confidence level). In comparison, only 39% of ACE2 members (25 out of 64) exhibit easterly stratospheric winds in the uppermost model layer (above 50 hPa), which is not significantly different from the climatological rate of 40%. This indicates that the ACE2 model is not correctly capturing the disruption to the stratospheric polar vortex during winter 2009/2010.
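A test of this kind can be run along the following lines; the climatological count below is an illustrative reconstruction implied by the quoted 62% rate over 23 hindcast years, not the authors' actual sample.

```python
from statsmodels.stats.proportion import proportions_ztest

# GloSea winter 2009/2010: 51 of 63 members with easterly 10 hPa winds,
# versus an illustrative climatological count implied by the quoted 62%
# rate across 23 hindcast years of 63 members each.
count = [51, round(0.62 * 63 * 23)]
nobs = [63, 63 * 23]
zstat, pval = proportions_ztest(count, nobs, alternative='larger')
```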

Furthermore, the SSW probability within ACE2 is relatively consistent across El Niño (45%) and La Niña (36%) years, neither of which is significantly different from neutral years (41%, one-tailed two-proportion Z-test, 95% confidence level). GloSea and ERA5, however, exhibit significant differences between active and neutral ENSO years, with a higher chance of an SSW during El Niño54,55,56,57. This suggests that while ACE2 can exhibit sub-seasonal stratospheric variability13, it is not fully capturing the ENSO teleconnection to the stratosphere despite realistic tropospheric teleconnections.

Discussion

This study demonstrates skilful seasonal predictions from a machine learning weather model. Despite being trained only on the 6-hourly observed evolution of the atmosphere, when assessed from a seasonal prediction perspective (i.e. lead time 1–3 months), the ACE2 model exhibits significant skill and is competitive with current dynamical systems. A lagged-ensemble approach is found to generate ensemble spread which closely matches observations and a physics-based ensemble prediction system, a characteristic it is not specifically trained on. The model produces realistic ENSO teleconnections in the troposphere, but the stratospheric pathway is not in line with observations. This may be due to a relatively small sample of observed events (e.g. slower time scales in the stratosphere and a limited number of SSWs), the training methodology (e.g. loss weightings applied to different levels or parameters), or the model architecture. If the latter, this could potentially be addressed through enhanced vertical resolution in the stratosphere, a characteristic found to be important in dynamical models54,58,59,60, providing an opportunity for improved skill in the future.

Dataset independence is an important part of understanding the generalisation of machine learning models, and our results are based on predictions initialised with conditions both within and independent of the ACE2 training period. However, we find no evidence of bias within our predictions at the global or regional scale. This is potentially due to the use of long (4-month) rollouts and persisted boundary conditions, which differ from the 6-hour loss minimisation and time-evolving conditions within the ACE2 training. Understanding the sensitivity of seasonal predictions to different training and test years, particularly over the satellite period, is a key topic for moving towards real-time predictions, which occur within a climate outside of the training period.

A significant benefit of machine learning models is their relatively cheap computational cost. At seasonal forecasting timescales, a dynamical model can take hours on a supercomputer for each simulation. In comparison, the ACE2 model can complete a 4-month forecast simulation in under 2 minutes on an Nvidia A100 GPU. Opportunities arising from this include the ability to generate very large ensemble sizes (e.g. over 7000 members10), much longer assessment periods, rapid testing of new experimental setups and better exploration of sources of predictability and the signal-to-noise error44. Machine learning models are therefore highly applicable to seasonal and climate timescales where large ensembles are needed. Further research is needed on optimal ensemble generation approaches as well as on coupling to data-driven ocean models61 or ocean-atmosphere-coupled dynamical models. However, it is clear from this work that machine learning models can supplement and support current seasonal forecasting methods.

Overall, these results show that the machine learning revolution is not limited to short-range weather forecasts and can provide several new opportunities for advancing near-term climate predictions.

Methods

Datasets

Historical atmospheric conditions are taken from the ERA5 reanalysis28. To persist SST and sea-ice conditions throughout a forecast we create a seasonally varying climatology from the 6-hourly state of each grid cell, using a rolling Gaussian filter with a width (standard deviation) of 10 days. Observed monthly rainfall totals are taken from the Global Precipitation Climatology Project version 2.3 (GPCP)62.
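A minimal sketch of this climatology construction, under the assumption that the 6-hourly SST fields sit in an array of shape (year, time-of-year, lat, lon); the 10-day filter width becomes 40 six-hourly steps, wrapped around the annual cycle. The placeholder data here are random.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

rng = np.random.default_rng(0)
sst = rng.random((23, 365 * 4, 18, 36))  # placeholder 6-hourly SST fields

clim = sst.mean(axis=0)                  # multi-year mean per 6-hourly step
# Gaussian smoothing along the time-of-year axis; sigma = 10 days * 4 steps/day,
# with circular ('wrap') boundaries so the filter crosses the year boundary
clim_smooth = gaussian_filter1d(clim, sigma=10 * 4, axis=0, mode='wrap')
```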

For comparison with dynamical models, hindcasts (retrospective forecasts) initialised from 1993 to 2015 are taken from the GloSea operational ensemble prediction system in its GC3.2 configuration29,30,63. A 63-member ensemble is constructed from 21 members initialised on 25th October, 1st November and 9th November each year, with the ensemble spread generated through a stochastic physics scheme64. GloSea simulations cover a forecast period of 6 months with an atmospheric resolution of approximately 0.5 degrees and an ocean resolution of 0.25 degrees. The system has 85 vertical levels in the atmosphere, covering the entire stratosphere and extending up to 85 km (0.01 hPa), as well as 75 levels in the ocean. The GloSea prediction system is one of the top-performing dynamical models across sub-seasonal and seasonal timescales for both the tropics and mid-latitudes32,36,65,66.

For this study we use the machine learning atmospheric model ACE213. The model is trained solely on ERA5 reanalysis atmospheric fields and predicts the evolution of the atmospheric state at 6-hour time steps on a 1° grid. Importantly, ACE2 autoregressive forecasts are stable over multiple years, hypothesised to be due to its Spherical Fourier Neural Operator architecture67, its use of user-prescribed ocean and sea-ice boundary conditions, and physical constraints on mass conservation, moisture, precipitation rate and radiative fluxes13.

Of relevance to this study, the 10 years from 2001 to 2010, which lie within our 23-year hindcast period, are withheld during training of ACE213 and form an independent test period for the model. The remaining years are used to train the model. However, our experiments (see below) are initialised one month prior to the periods of interest and utilise persisted boundary conditions, while time-evolving boundary data were used for training ACE2. These specific atmospheric and ocean states will therefore be new to the model, although the large-scale patterns will have been seen previously. In addition, each forecast involves over 500 autoregressive steps, over which errors will grow and result in distinct individual trajectories. This is demonstrated through the realistic ensemble spread within ACE2 at seasonal timescales. Quantitative testing of the ensemble (Supplementary Figs. 2 and 3) at global and regional scales found no evidence of bias within the ACE2 predictions between training and independent years.

All ERA5 and GloSea data are bilinearly interpolated to the native 1° × 1° ACE2 grid, except for precipitation, for which ACE2 and GloSea are interpolated to the 2.5° × 2.5° GPCP grid.
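One plausible way to perform such bilinear regridding (the software used is not specified here) is with the xESMF package; the coarse source field below is a random placeholder.

```python
import numpy as np
import xarray as xr
import xesmf as xe  # assumes xESMF (and its ESMF dependency) is installed

# Placeholder coarse field standing in for an ERA5 variable
src = xr.Dataset(
    {"msl": (("lat", "lon"), np.random.rand(73, 144))},
    coords={"lat": np.linspace(-90, 90, 73), "lon": np.arange(0, 360, 2.5)},
)
# Target 1-degree grid matching the native ACE2 resolution
target = xr.Dataset(
    coords={"lat": np.arange(-89.5, 90, 1.0), "lon": np.arange(0, 360, 1.0)}
)
regridder = xe.Regridder(src, target, method="bilinear")
msl_1deg = regridder(src["msl"])
```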

Indices and metrics

We define ENSO years based on the DJF Oceanic Niño Index68 with a threshold of ± 0.5 K. El Niño winters are 1995, 1998, 2003, 2005, 2007, 2010, 2015, and 2016. La Niña winters are 1996, 1999, 2000, 2001, 2006, 2008, 2009, 2011, and 2012.

We define the NAO index37 as the difference in mean sea level pressure between a southern box (90°W-60°E, 20°N-55°N) and a northern box (90°W-60°E, 55°N-90°N). The results are consistent when applying a smaller regional definition40 (r = 0.42, p = 0.048) and a point-based estimate34 (r = 0.41, p = 0.053).
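The box-difference NAO index might be computed as in the following sketch, with cosine-latitude area weighting; the MSLP field is a random placeholder.

```python
import numpy as np
import xarray as xr

lat = np.arange(-89.5, 90, 1.0)
lon = np.arange(-180, 180, 1.0)
mslp = xr.DataArray(1000 + 20 * np.random.rand(90, lat.size, lon.size),
                    coords={"time": np.arange(90), "lat": lat, "lon": lon},
                    dims=("time", "lat", "lon"))  # placeholder MSLP (hPa)

def box_mean(da, lat0, lat1, lon0, lon1):
    # Area-weighted mean over a lat/lon box (lon in [-180, 180))
    sub = da.sel(lat=slice(lat0, lat1), lon=slice(lon0, lon1))
    return sub.weighted(np.cos(np.deg2rad(sub.lat))).mean(("lat", "lon"))

# NAO: southern box (90W-60E, 20-55N) minus northern box (90W-60E, 55-90N)
nao = box_mean(mslp, 20, 55, -90, 60) - box_mean(mslp, 55, 90, -90, 60)
```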

To calculate the ensemble-mean error and spread as a function of lead time we utilise only ACE2 members initialised between 00:00z on 28th October and 00:00z on 1st November (n = 20) each year and GloSea members initialised at 00:00z on 1st November (n = 21). Forecast daily NAO values are aggregated into 5-day means (pentads) and the climatological mean removed. The ACE2 values are therefore somewhat larger than GloSea’s, partly due to the inclusion of longer lead time forecasts. The ensemble-mean error for a given 5-day average, RMSEp, is defined as:

$${RMSE}_{p}=\sqrt{\frac{1}{23}\mathop{\sum }_{i = 1994}^{2016}{\left({model}_{i,p}-{ERA5}_{i,p}\right)}^{2}}$$
(1)

The corresponding average ensemble spread is defined as:

$${\sigma }_{p}=\sqrt{\frac{1}{23}\mathop{\sum }_{i = 1994}^{2016}{\sigma }_{ip}^{2}}$$
(2)

where σip is the standard deviation of the model NAO across members for year i and pentad p.
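Equations (1) and (2) translate directly into code; the arrays below are random placeholders with the shapes used here (year, member, pentad).

```python
import numpy as np

rng = np.random.default_rng(0)
nao = rng.standard_normal((23, 20, 24))  # model NAO (year, member, pentad), hPa
obs = rng.standard_normal((23, 24))      # ERA5 NAO (year, pentad), hPa

ens_mean = nao.mean(axis=1)
rmse_p = np.sqrt(((ens_mean - obs) ** 2).mean(axis=0))          # Eq. (1)
sigma_p = np.sqrt((nao.std(axis=1, ddof=1) ** 2).mean(axis=0))  # Eq. (2)
```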

To assess ACE2 and GloSea predictions in terms of signal and noise we compute the ratio of predictable components (RPC)43 as

$$RPC=\frac{r}{{\sigma }_{sig}/{\sigma }_{tot}}$$
(3)

where r is the ensemble mean correlation with ERA5, σsig is the ensemble mean standard deviation and σtot is the standard deviation across all members and years. A random resampling procedure is used for significance testing43.
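In code, Equation (3) amounts to the following (placeholder arrays; the resampling-based significance test is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
nao = rng.standard_normal((23, 63))    # DJF-mean NAO (year, member)
obs = rng.standard_normal(23)          # ERA5 DJF-mean NAO

ens_mean = nao.mean(axis=1)
r = np.corrcoef(ens_mean, obs)[0, 1]   # ensemble-mean correlation with ERA5
sigma_sig = ens_mean.std(ddof=1)       # signal: ensemble-mean std dev
sigma_tot = nao.std(ddof=1)            # total: std dev over members and years
rpc = r / (sigma_sig / sigma_tot)      # Eq. (3)
```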

ACE2 experimental setup

ACE2 seasonal predictions are generated using a lagged-ensemble approach. An ensemble member is initialised every 6 hours between 25th October and 9th November each year from 1993 to 2015, creating a total of 64 members per year. The forecast period extends from initialisation through to mid-March the following year, providing a lead time of 1–3 months. For example, a forecast member initialised in November 2001 is stepped forward through more than 500 autoregressive steps until March 2002. Initial conditions for each member are taken from the ERA5 reanalysis dataset28. Boundary SST and sea-ice conditions are provided throughout each forecast by calculating the instantaneous anomaly at initialisation for each grid cell and persisting this throughout the forecast using the derived ERA5 6-hourly climatology. This differs from the ACE2 training, in which time-evolving boundary conditions are used.
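The schedule implies one member every 6 hours from 00z on 25 October to 18z on 9 November; a quick check of the member count (assuming the first and last initialisation times above):

```python
import pandas as pd

year = 2001  # example hindcast year
init_times = pd.date_range(f"{year}-10-25 00:00", f"{year}-11-09 18:00",
                           freq="6h")
assert len(init_times) == 64  # 16 days x 4 initialisations per day
```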

The 6-hourly climatology is calculated using a Gaussian filter with a width (standard deviation) of 10 days, averaged across the 1994–2016 period (23 years). For each initialisation the instantaneous initial-condition anomaly is persisted using this climatology, e.g. for a given grid cell at time (t) the SST boundary condition is

$$SST(t)=SST(0)-climatology(0)+climatology(t)$$
(4)

where t = 0 indicates the value at initialisation. The same method is used to persist sea-ice concentrations, with all values limited to between 0 and 1.
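Equation (4) can be applied per grid cell as in this sketch, where clim is the smoothed 6-hourly climatology described above and the arrays are random placeholders.

```python
import numpy as np

steps = 365 * 4                        # 6-hourly steps per year
rng = np.random.default_rng(0)
clim = rng.random((steps, 18, 36))     # smoothed 6-hourly climatology
t0 = 1200                              # time-of-year index at initialisation
sst0 = clim[t0] + 0.1 * rng.standard_normal((18, 36))  # initial SST state

anomaly = sst0 - clim[t0]              # fixed anomaly at initialisation

def sst_boundary(t):
    # SST(t) = SST(0) - climatology(0) + climatology(t), Eq. (4)
    return clim[t % steps] + anomaly

# Sea-ice concentration is persisted the same way, then clipped to [0, 1],
# e.g. np.clip(values, 0.0, 1.0).
```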

Historical downward shortwave radiative flux at the top of the atmosphere and global-mean atmospheric carbon dioxide inputs are prescribed throughout the hindcast period13, as performed for the GloSea simulations. However, understanding the sensitivity of ACE2 predictions to these boundary conditions is a key topic for further research. We find that repeating the hindcast experiment using a climatology derived from 1988–2022 (excluding 1994–2016) produces consistent results (NAO r = 0.54), as does utilising the previous year’s TOA shortwave flux (NAO r = 0.43) and the previous year’s CO2 (NAO r = 0.38). These additional results are in line with a natural variability test (NAO r = 0.42) in which the initial-condition times were manually shifted by 6 hours, suggesting limited sensitivity to these boundary conditions for this application.