End-to-end data-driven weather prediction

Allen, Anna; Markou, Stratis; Tebbutt, Will; Requeima, James; Bruinsma, Wessel P.; Andersson, Tom R.; Herzog, Michael; Lane, Nicholas D.; Chantry, Matthew; Hosking, J. Scott; Turner, Richard E.

doi:10.1038/s41586-025-08897-0

Article
Open access
Published: 20 March 2025

Nature volume 641, pages 1172–1179 (2025)

This article has been updated

Abstract

Weather prediction is critical for a range of human activities, including transportation, agriculture and industry, as well as for the safety of the general public. Machine learning transforms numerical weather prediction (NWP) by replacing the numerical solver with neural networks, improving the speed and accuracy of the forecasting component of the prediction pipeline^1,2,3,4,5,6. However, current models rely on numerical systems at initialization and to produce local forecasts, thereby limiting their achievable gains. Here we show that a single machine learning model can replace the entire NWP pipeline. Aardvark Weather, an end-to-end data-driven weather prediction system, ingests observations and produces global gridded forecasts and local station forecasts. The global forecasts outperform an operational NWP baseline for several variables and lead times. The local station forecasts are skilful for up to ten days of lead time, competing with a post-processed global NWP baseline and a state-of-the-art end-to-end forecasting system with input from human forecasters. End-to-end tuning further improves the accuracy of local forecasts. Our results show that skilful forecasting is possible without relying on NWP at deployment time, which will enable the realization of the full speed and accuracy benefits of data-driven models. We believe that Aardvark Weather will be the starting point for a new generation of end-to-end models that will reduce computational costs by orders of magnitude and enable the rapid, affordable creation of customized models for a range of end users.

Main

Numerical weather prediction (NWP) systems are vital for creating weather forecasts required by emergency agencies, transport providers, agriculture, energy providers and the general public. Since the first numerical forecasts were produced in the 1950s, which required 24 h to compute a single-day single-variable forecast on a 700-km grid⁷, NWP systems have undergone a remarkable transformation. Modern systems predict a wide range of variables at lead times of up to 15 days, which is the theoretical limit of medium-range weather forecasting predictability⁸. These systems consist of an intricate series of models of different components of Earth’s atmosphere, building on decades of research in Earth observation, data assimilation, fluid dynamics and statistical post-processing and requiring purpose-built supercomputers to run.

Generating a modern weather forecast begins with the acquisition of observations from a multitude of sources, including remote sensing instruments, in situ observations, radar systems, radiosondes and aircraft data⁹. Some of these data are processed to generate derived products, such as atmospheric motion vectors and surface winds. Raw data and the resulting processed products are fed into a data assimilation system, which combines these with an initial guess from the previous forecast to generate a global approximation of the current state of the atmosphere. This approximation is then used as an initial state for a forecasting system that integrates the equations of fluid mechanics and thermodynamics to output predictions at future lead times. Finally, the resulting predictions from the forecasting system are used for downstream tasks, for example, to generate local forecasts. This step may consist of statistical post-processing and running higher-resolution regional NWP models. Each stage of this pipeline consists of several numerical models chained together, resulting in an intricate workflow¹⁰ that is challenging to iterate on and improve and requires purpose-built supercomputers to run. This motivates the development of fast, lightweight and customizable alternatives.

With end-to-end machine learning revolutionizing several fields by replacing complex human-designed workflows, it has been suggested that a data-driven model may one day replace the entire NWP pipeline¹¹. This will be transformational for weather prediction, reducing computational costs, removing bias from inflexible aspects of NWP systems and enabling fast prototyping and optimization for specific tasks. However, this has not been attempted so far, with studies focusing on applying machine learning to the easiest components of the pipeline. For example, machine learning models have been shown to outperform their operational state-of-the-art counterparts in replacing the numerical solver in the forecasting component^1,2,4,5,6,12, deriving variables from raw satellite data in pre-processing^13,14,15 and post-processing forecast data in the downstream stages^16,17. Work on replacing the most challenging component, the assimilation system, remains at the stage of developing initial prototypes^3,18,19,20,21,22,23. Therefore, the vision of an end-to-end data-driven solution remains aspirational, with conventional NWP systems being essential for all forms of operational forecasting.

In a recent article assessing the prospect of end-to-end deep learning weather prediction, the verdict was that “a number of fundamental breakthroughs are needed before this goal comes into reach”¹¹. Here we report that these breakthroughs are happening earlier than expected. We present Aardvark Weather, an end-to-end data-driven weather forecasting system capable of generating predictions with no input from conventional NWP by instead learning a mapping from raw input observations to output forecasts. This allows Aardvark to tackle the complete weather prediction pipeline while being entirely independent from NWP products at prediction time, relying solely on observation data to generate forecasts. We demonstrate that, using an order of magnitude fewer observations than those available to operational baselines and orders of magnitude fewer computational resources, Aardvark is capable of producing forecasts on a global 1.50° grid that achieve lower root mean square error (RMSE) than operational NWP systems across several variables and lead times. Furthermore, we demonstrate that this system provides local forecasts that achieve lower errors than post-processed NWP and a full end-to-end operational forecasting system for several lead times and can be optimized end-to-end to maximize performance over variables and regions of interest.

Aardvark Weather

Aardvark Weather is a deep learning model that provides forecasts of eastward wind, northward wind, specific humidity, geopotential and temperature (at 200, 500, 700 and 850 hPa pressure levels), 10-m eastward wind, 10-m northward wind, 2-m temperature and mean sea level pressure on a dense global grid, and station forecasts for 2-m temperature and 10-m wind speed. Aardvark consists of three modules and is designed to leverage high-quality reanalysis data during training while being entirely independent from NWP products at deployment time. Figure 1 (bottom) illustrates the operation of Aardvark, outlining the function of each of its three modules.

First, an encoder module obtains observational data from several sources, both on the grid and off the grid, and produces a gridded initial state. On the grid observations are data modalities on a regular grid, whereas off the grid modalities are available at a set of longitude–latitude locations. To achieve this, we leveraged recent advances from deep learning²⁴ in handling off the grid and missing data. This approach to state estimation differs from data assimilation systems used in conventional NWP pipelines. Conventional data assimilation systems use a recurrent update in which the previous forecast is adjusted in light of new observations, similar to Kalman filter recursions in a Markov model. In principle, data assimilation accumulates information from observations across all past time steps. However, in practice, it has been estimated that the effective window size is as short as 4 days (ref. ²⁵). Owing to the complexities of training recurrent neural networks, including the need for a spin-up period and gradient instabilities²⁶, we opted for a non-recurrent approach.

Once the initial atmospheric state has been estimated, it is used as an input to a processor module, which produces a gridded forecast at a lead time of 24 h. Forecasts at subsequent lead times are produced by autoregressively feeding the predictions of the processor module back to it as an input, similar to the existing approaches in data-driven weather forecasting^1,6. Finally, task-specific decoder modules ingest these forecasts and produce local predictions. In this study, we considered a decoder designed for a single downstream task, producing local station forecasts. However, this system is suitable for use with several separate decoders for different tasks. Together, the encoder, processor and decoder modules form a neural process²⁴, a machine learning system that naturally handles off the grid and missing data. A vision transformer²⁷ forms the backbone of the encoder and processor modules, whereas the decoder modules are implemented as a lightweight convolutional architecture. The full set of inputs and outputs for the modules is detailed in Extended Data Table 1.
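
The autoregressive rollout described above can be illustrated with a minimal sketch. The function `rollout` and the stand-in `toy_processor` are hypothetical names introduced here for illustration; the actual processor is a learned vision transformer, which is not reproduced. The grid shape assumes a 1.5° longitude–latitude grid that includes both poles.

```python
import numpy as np

def rollout(initial_state, processor, n_days):
    """Apply a 24-h processor autoregressively; returns one state per lead day."""
    states = []
    state = initial_state
    for _ in range(n_days):
        state = processor(state)  # each call advances the state by 24 h
        states.append(state)
    return states

def toy_processor(state):
    # Hypothetical stand-in for the learned processor: damps the field by 10%.
    return 0.9 * state

# 1.5 degree grid (assumption): 121 latitudes x 240 longitudes
initial = np.ones((121, 240))
forecasts = rollout(initial, toy_processor, n_days=10)
```

Here `forecasts[k]` holds the prediction at a lead time of `k + 1` days; in the system described above, each of these gridded states would then be passed to a task-specific decoder.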

A key challenge in designing machine learning systems for observational atmospheric data is that the records for many instruments are relatively short, limiting the data available for training. The modular design of Aardvark (Fig. 1) addresses this issue by enabling pretraining using high-fidelity historical reanalysis data before fine-tuning on scarcer observational data. Specifically, we trained the system in a way that mimics how it will be deployed. We started by pretraining the encoder module using raw observations as input and reanalysis data as targets. An advantage of this machine learning approach is that the model can learn to correct for biases in the input observations during training; therefore, no bias correction step was performed on the input data. We also pretrained the processor using reanalysis data for both inputs and targets and then fine-tuned it on the output of the state-estimation module (the encoder). In the processor module, the inputs and outputs were both on a regular 1.50° grid to match the reanalysis training data. Next, we trained the decoder using the output of the processor as the input and raw data as targets. This procedure ensures that there is no mismatch between the training and deployment of the system. Finally, we fine-tuned the encoder, processor and decoder modules jointly to optimize the entire model for a specific variable and region. For all modules, we trained on data before 2018 and held out 2018 and 2019 as the test and validation years, respectively.

Input variables

Accurately estimating the state of the atmosphere requires inputs from various observation sources. Input variables are selected to capture the dynamics both at Earth’s surface and at several levels through the atmosphere. In situ observations are taken from weather stations and ships at surface level and radiosondes at upper levels. As coverage from these instruments is largely confined to the surface, as well as geographically skewed and sparse, remote sensing instruments provide a crucial complementary global data source. Motivated by gains observed in operational NWP systems^28,29,30, we selected four primary sources of satellite data: scatterometer data to provide information about surface wind over the ocean, multispectral (approximately ten channels) microwave and infrared sounders, hyperspectral (approximately 10⁵ channels) infrared sounders to provide information on upper-atmosphere temperature and humidity profiles, and geostationary infrared sounder data to provide an instantaneous snapshot of the state of the atmosphere. These observations were made with different time windows ranging from 1 to 24 h before lead time 0. In contrast to operational medium-range NWP systems, observations are only included in the input if they are taken before lead time 0 (ref. ³¹). Figure 1 (top) shows an example of a single time slice of input data to Aardvark for in situ and remote sensing sources, with full details in Extended Data Table 2. These atmospheric observations were augmented by several temporal and orographic variables. Aardvark only ingests approximately 8% of the observations available to conventional NWP systems³², more than an order of magnitude less input data.

Evaluation of global forecasting

For global gridded forecasts, we compared Aardvark with four baselines. The simplest of these, persistence and hourly climatology, assess whether a forecasting system is skilful. A more challenging comparison is to the two most widely used deterministic operational global NWP systems: the Integrated Forecasting System (IFS) in its high-resolution (HRES) configuration from the European Centre for Medium-Range Weather Forecasts (ECMWF) and the Global Forecast System (GFS) from the National Centers for Environmental Prediction. Although HRES typically outperforms GFS on global metrics, operational centres often use a selection of different models, including GFS, to create their local forecasts; therefore, we included it in our comparison. For each variable, pressure level and lead time, we report the latitude-weighted RMSE, a common metric for assessing the performance of deterministic forecasting systems³³. For all baselines, we used the ECMWF Reanalysis v5 (ERA5) dataset as the ground truth. This choice was made because this is a standard practice for the evaluation of machine learning NWP models. At present, HRES analysis is of higher quality than ERA5 reanalysis, because ERA5 was developed using cycle Cy41r2 (ref. ³⁴), which remained operational until 2017. However, the discrepancies between the two were limited for the test year of 2018.
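
As an illustration of the metric, the sketch below computes a latitude-weighted RMSE for a single gridded field. It assumes the commonly used convention in which each grid row is weighted by the cosine of its latitude, normalized so that the weights average to one; the exact averaging order over time and space used in the paper is not restated here, and the function name `lat_weighted_rmse` is introduced only for this example.

```python
import numpy as np

def lat_weighted_rmse(forecast, truth, lats_deg):
    """Latitude-weighted RMSE for fields of shape (n_lat, n_lon)."""
    # cos(latitude) weights compensate for grid cells shrinking towards the poles
    weights = np.cos(np.deg2rad(lats_deg))
    weights = weights / weights.mean()
    sq_err = (forecast - truth) ** 2
    return float(np.sqrt((weights[:, None] * sq_err).mean()))

# Synthetic example on a 1.5 degree grid (assumption): constant error of 2 units
lats = np.linspace(-90.0, 90.0, 121)
truth = np.zeros((121, 240))
forecast = np.full((121, 240), 2.0)
score = lat_weighted_rmse(forecast, truth, lats)
```

Because the normalized weights average to one, a spatially constant error of 2 units yields a score of 2, which is a quick sanity check on the weighting.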

Figure 2 shows the latitude-weighted RMSE performance compared with the baselines for eight headline variables. Here Aardvark matched or outperformed GFS across most lead times, with the only exception being the geopotential at 500 hPa. In addition, for most variables, Aardvark approached the performance of HRES. Overall, Aardvark’s errors were larger at higher atmospheric levels and shorter lead times than those of the operational baselines. This was possibly caused by the higher concentration of observations close to the surface. For longer lead times, a by-product of fine-tuning to minimize errors at future lead times (Methods) is that forecasts tend to become spectrally blurred. This phenomenon is commonly observed in data-driven weather forecasting systems^1,6,35. A full display of the latitude-weighted RMSE of Aardvark across all variables and levels can be found in Supplementary Fig. 1. Further insights can be drawn from inspecting the power spectra, anomaly correlation coefficients and activities of Aardvark’s forecasts, as shown in Supplementary Figs. 2–4. This analysis suggests that although forecast blurring plays a role, Aardvark produces skilful forecasts and maintains meaningful signals, even at longer lead times.

**Fig. 2: Gridded global forecast performance for selected variables.**

Figure 3 shows an example of gridded global predictions at lead times of 0, 1, 2 and 4 days for 10-m eastward wind. Aardvark successfully captured the large-scale features of the atmospheric state, both in the mid-latitudes and the tropics. Many details are well represented; for example, the formation of a tropical cyclone in the Southern Indian Ocean closely matched that in the ERA5 reanalysis data. This example hints at the potential of Aardvark for forecasting mesoscale high-impact weather events. Although some spectral blurring of the higher spatial frequencies is evident, these results are of remarkably high fidelity given the limited resolution and range of observations provided to the model. A comprehensive set of spatial plots across all variables is provided in Supplementary Figs. 6–29.

**Fig. 3: Example of Aardvark’s gridded forecasts for the U10 wind component.**

Encoder module ablation

A central innovation of the Aardvark Weather system is the estimation of an initial state from disparate data sources using the encoder module. With the volume and diversity of available observational modalities, two important questions arise. Which observational sources are most important for estimating each atmospheric variable, and how does each affect predictive performance? To investigate this, we conducted an ablation experiment to quantify the significance of each observational source in our encoder module. We removed different observational sources from the set of encoder inputs, retrained the encoder with this reduced set and evaluated it on the same test set as our original configuration, marked ‘ALL’ (Fig. 4). For example, the rows ‘no in situ’ and ‘no satellites’ correspond to removing in situ data and all satellite data, respectively, from the ‘ALL’ configuration. We report a fractional increase in the latitude-weighted RMSE relative to the ‘ALL’ configuration across all atmospheric variables for the initial condition generated at t = 0.
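
The reported quantity can be written as a one-line computation. The sketch below uses made-up RMSE values purely to show the arithmetic; they are not results from the paper, and the helper name `fractional_rmse_increase` is hypothetical.

```python
def fractional_rmse_increase(rmse_ablated, rmse_all):
    """Fractional increase in RMSE relative to the full-input configuration."""
    return (rmse_ablated - rmse_all) / rmse_all

# Made-up RMSE values for one variable, for illustration only (not paper results)
rmse_by_config = {"ALL": 2.0, "no in situ": 2.5, "no satellites": 4.0}
increase = {
    name: fractional_rmse_increase(value, rmse_by_config["ALL"])
    for name, value in rmse_by_config.items()
}
```

With these placeholder numbers, removing in situ data corresponds to a fractional increase of 0.25 and removing satellites to 1.0, whereas the ‘ALL’ row is zero by construction.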

**Fig. 4: Encoder ablation experiments quantifying the impact of each data modality.**

These results demonstrate that remote sensing data are of crucial importance in constraining the initial atmospheric state. Removing these data (no satellite in Fig. 4) and training with in situ observations only leads to large skill reductions across all variables. Among different satellite modalities, low-Earth-orbit (LEO) sounder data are the most important. Removing these sounder modalities (no LEO) resulted in larger skill deterioration than removing scatterometer data (no Advanced Scatterometer (ASCAT)) or geostationary satellite data (no GEO). In situ observations are most important for surface variables. However, they also play a surprisingly large role in predicting geopotential, particularly at lower levels. These results indicate that for the future improvement of this system and the development of other end-to-end data-driven systems, LEO sounder data are the most important source to include, with in situ data providing an important complementary source to improve surface variables and geopotential forecasts. We provide full details of this experiment in Supplementary Information A.

Evaluation of station forecasting

In the next stage of the weather prediction pipeline, global gridded forecasts are used as inputs to downstream models to produce a variety of products for end users. One such category of products is producing local forecasts. We focused on applying Aardvark Weather to predict 2-m atmospheric temperature and 10-m wind speed at off the grid station locations. Accurate local predictions of temperature are vital for the protection of public health during heatwaves and cold waves, in addition to agriculture and other use cases. Similarly, wind speed forecasts have a variety of end users, such as in wind energy, marine forecasting and fire weather forecasting. The modules for any desired downstream task can be substituted for this station forecasting module.

There are significant differences in how agencies in different countries produce forecasts for end users. In well-resourced countries, station forecasts are produced using global models followed by higher-resolution regional models out to a few days of lead time and statistical post-processing³⁶. By contrast, in less well-resourced areas, although agencies have access to global products, they often do not have access to comparable infrastructure to run HRES, local NWP or post-process forecasts to a comparable degree³⁷. With these considerations in mind, we report Aardvark’s performance across all stations globally but also break it down over four regions of particular interest: the contiguous United States (CONUS), Europe, West Africa and the Pacific (Fig. 5k). The USA and most European countries run both local NWP for shorter lead times, as well as sophisticated post-processing of both global and local products. By contrast, West Africa and the Pacific are regions in which many centres are less well equipped. Although some agencies in these regions run sophisticated NWP pipelines, others solely use raw HRES forecasts and issue operational forecasts for very short lead times³⁷. We compared Aardvark against per-station persistence and climatology, as well as against two challenging baselines: station-corrected HRES and a full operational end-to-end baseline, the National Digital Forecast Database (NDFD) from the National Weather Service³⁶. For a detailed description of the baselines, see Methods.

**Fig. 5: Station forecast performance and end-to-end fine-tuning improvements.**

Figure 5 shows the mean absolute error (MAE) performance of Aardvark reported by variable and region. Globally, Aardvark generated skilful forecasts for both temperature and wind speed up to a lead time of 10 days, performing competitively with station-corrected HRES. For temperature, Aardvark was competitive with the station-corrected HRES over both CONUS and Europe. In addition, Aardvark matched the performance of the full operational NDFD baseline over CONUS. For lower-resource areas in West Africa and the Pacific, Aardvark outperformed the station-corrected HRES at all lead times. For 10-m wind speed, Aardvark had higher errors than the station-corrected HRES over CONUS but significantly outperformed the NDFD baseline. Over Europe, Aardvark had errors similar to those of the station-corrected HRES up to 4 days of lead time and outperformed it thereafter. Finally, Aardvark generally outperformed the station-corrected HRES over West Africa while performing slightly worse over the Pacific. In addition to these results, we compared Aardvark’s forecasts with a version of HRES that we post-processed using a separate scale and bias term for each station and NDFD for CONUS, demonstrating competitive performance on both variables (Supplementary Fig. 5).
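
To make the scale-and-bias post-processing and the MAE metric concrete, the following sketch fits one scale and one bias term for a single station and compares the MAE before and after correction. It assumes an ordinary least-squares fit, which the text above does not specify, and uses synthetic, noise-free data; `fit_scale_bias` and `mae` are names introduced for this example.

```python
import numpy as np

def fit_scale_bias(forecast, observed):
    """Least-squares fit of observed ~ scale * forecast + bias for one station."""
    scale, bias = np.polyfit(forecast, observed, deg=1)
    return scale, bias

def mae(pred, obs):
    """Mean absolute error."""
    return float(np.mean(np.abs(pred - obs)))

# Synthetic station record (assumption): the raw forecast is off by a scale and a bias
rng = np.random.default_rng(0)
raw = rng.normal(15.0, 5.0, size=200)
obs = 1.1 * raw - 2.0

scale, bias = fit_scale_bias(raw, obs)
corrected = scale * raw + bias
mae_raw, mae_corrected = mae(raw, obs), mae(corrected, obs)
```

In practice the two coefficients would be fitted on a training period and applied to held-out forecasts, repeating the fit independently for each station.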

End-to-end tuning

End users of NWP products typically have a particular region and set of applications that are of interest. A powerful capability of Aardvark is its ability to tune the entire pipeline end to end to directly optimize for any desired quantity and region of interest. Optimizing the performance for a particular end-user product is challenging and expensive in a conventional NWP system. To explore this capability, we fine-tuned Aardvark to optimize predictions of 2-m temperature and 10-m wind speed at 1-day lead time globally and for each of the four regions. Although we focused only on these two variables, this is a powerful paradigm that can be applied anywhere there is uncertainty in the reanalysis training data, such as clouds and precipitation.

We observed that fine-tuning Aardvark yielded improvements both globally and in the specific regions of CONUS, Europe, West Africa and the Pacific (Fig. 5; bottom). For temperature, fine-tuning Aardvark resulted in large reductions in MAE of 6% over Europe, West Africa, the Pacific and globally, and an improvement of 3% over CONUS. For 10-m wind speed, small but statistically significant improvements of 1–2% were observed for all regions except the Pacific. To put these improvements into context, the last cycle update of IFS improved the surface variable scores in the range of 2–6% and took more than a year of development by a large team of scientists.

Discussion

We have introduced Aardvark Weather, an end-to-end data-driven weather forecasting system that tackles the entire NWP pipeline. Aardvark provides accurate forecasts that are orders of magnitude quicker to generate than existing systems without any reliance on NWP products at deployment time. Generating a full forecast from observational data takes approximately 1 s on four NVIDIA A100 GPUs compared to the approximately 1,000 node hours required by HRES to perform data assimilation and forecasting³⁸ alone, before accounting for downstream local models and processing. In downstream tasks generating station forecasts of 2-m temperature and 10-m wind speed, Aardvark shows strong performance against operational NWP systems. Learning an end-to-end model offers the extra capability of optimizing the system to maximize performance over an arbitrary variable or region of interest, opening the door for the creation of inexpensive, individually tailored models for any region globally, in an automated and streamlined fashion.

End-to-end forecasting has significant potential for real-world effect. Compared with conventional NWP systems, machine learning systems are not only faster and computationally cheaper but are also significantly easier to improve and maintain. In conventional NWP, a new module, such as for a new parameterization or microphysics scheme, may take a team considerable time to build and integrate into the model. End-to-end data-driven systems, such as Aardvark, elegantly bypass this issue using a single model in place of this complex pipeline. The simplicity of this system makes it easier to deploy and maintain for users already running NWP and also opens the potential for wider access to running bespoke forecasts in areas of the developing world where agencies often lack the resources and expertise to run conventional systems. There is also significant potential in the demonstrated ability to fine-tune bespoke models to maximize predictive skill for specific regions and variables. This capability is of interest to many end users in areas as diverse as agriculture, renewable energy, insurance and finance.

To envisage how an end-to-end data-driven model such as Aardvark could be deployed operationally, it is necessary to consider the limitations of the current model and the concrete set of steps required to turn it into a fully fledged system. As with all current artificial intelligence weather prediction (AIWP) systems^1,6, Aardvark does not yet run at the resolution of IFS. Further studies are required to increase the grid resolution and produce forecast ensembles through, for example, diffusion². Other limitations centre around the use of observations. Further observational modalities will probably increase forecast skill. It is also important to consider how data from new instruments for which there are no training data available can be usefully integrated into the system. This can be accomplished by, for example, training on simulated data³⁹. A further consideration is dealing with observation drift and other changes in data over time, which can be mitigated by regularly fine-tuning all modules with the most recent few months of data to adapt to changes in instrument characteristics.

The results presented in this study only scratched the surface of the potential of Aardvark Weather and end-to-end data-driven weather forecasting systems more broadly. Further capabilities can also be added by extending Aardvark to support several other forecast variables, both in its gridded forecasts and through its decoder module. For example, Aardvark can support a diverse range of decoder modules to provide different types of end user forecasts, such as hurricanes, floods, severe convection, fire weather and other extreme weather warnings. A further exciting avenue for future research is to use end-to-end systems at longer lead times to generate seasonal forecast products. More observational modalities would allow for the modelling of other components of the Earth system, such as atmospheric chemistry for air quality forecasts and ocean parameters for marine forecasts. We envision that Aardvark Weather will be a pioneer of a new generation of end-to-end weather forecasting systems to tackle these diverse tasks.

Methods

State estimation inputs

We selected several remote sensing and in situ observations as inputs to the atmospheric state estimation module. To ensure that no NWP system is required for the operational deployment of Aardvark, we selected only data that were available at either level 1B or 1C processing level⁵⁰. Level 1B satellite data are calibrated and geolocated data, which means that the raw sensor measurements have been processed to correct for sensor and instrument biases but are still in the form of physical measurements, whereas level 1C satellite data are further processed to include radiometric and geometric corrections, making them ready for analysis with accurate geolocation and radiance values⁵⁰. Other requirements for the inclusion of datasets are that they are available from 2007 to 2020 and in near real time to facilitate anticipated operational deployment. Where available for remote sensing products, we used fundamental climate data records, in which data from earlier generation sensors were homogenized to match the characteristics of current sensors, creating a consistent data record for training. Extended Data Table 1 provides a summary of all datasets that were used as inputs to the encoder module, including the type of instrument, orbit and platform (if applicable), as well as the data provider and data selection window that we used. For satellite instruments in LEO, it was necessary to include a longer window of observations to attain full global coverage. By contrast, station observations for all locations were available at t = 0 h. Therefore, adding data would be useful but is not necessary to achieve global coverage. As the data record was relatively short and overfitting is a concern, we decided to limit the data to the shortest window possible while retaining global coverage.

In situ observations from land stations, marine platforms and radiosondes were included. In situ land station observations measuring surface temperature (8,719 stations), pressure (8,016 stations), wind (8,721 stations) and dew point temperature (8,617 stations) at six hourly intervals were taken from the HadISD dataset^51,52, provided by the UK Met Office. Marine in situ observations were taken from the International Comprehensive Ocean-Atmosphere Data Set⁵³ provided by the National Oceanic and Atmospheric Administration. This dataset consists of observations from ships and buoys globally, from which five variables were included, namely 2-m air temperature, 10-m northward and eastward winds, sea surface temperature and mean sea level pressure. As observations were not taken precisely on the hour, all observations from t = −1 h to t = 0 h were included in the input. Upper-atmosphere observations of humidity, wind, geopotential and temperature were obtained from the Integrated Global Radiosonde Archive⁵⁴, provided by the National Centers for Environmental Information. This dataset consists of radiosonde observations from 1,375 sites globally. Each record contained observations at several levels, of which we selected observations at the surface and at 200-, 500-, 700- and 850-hPa pressure levels. All profiles retrieved within the past 6 h, from t = −6 h to t = 0 h, were included in the input.
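
The time windows described above amount to a simple mask over observation timestamps. The sketch below is one way to express this, assuming both window endpoints are inclusive (the text does not say); `select_window` and the example timestamps are hypothetical.

```python
import numpy as np

def select_window(obs_times, t0, hours_back):
    """Boolean mask of observations taken in [t0 - hours_back, t0]."""
    obs_times = np.asarray(obs_times, dtype="datetime64[m]")
    t0 = np.datetime64(t0, "m")
    start = t0 - np.timedelta64(hours_back, "h")
    return (obs_times >= start) & (obs_times <= t0)

# Example timestamps (made up) around an initialization time of 12:00
times = ["2018-01-01T05:30", "2018-01-01T06:00", "2018-01-01T11:40", "2018-01-01T12:05"]
radiosonde_mask = select_window(times, "2018-01-01T12:00", hours_back=6)
marine_mask = select_window(times, "2018-01-01T12:00", hours_back=1)
```

The observation at 12:05 is excluded in both cases because it falls after lead time 0, in line with the requirement that only past observations enter the input.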

Because in situ observations were limited in geographic coverage, remote sensing observations from scatterometers and microwave and infrared sounders were included. Input data from satellites were ingested in the form of level 1 granules, each containing a 6-min slice of observations or orbits. Although, in principle, the Aardvark Weather system can handle these data in their raw form, for simplicity, data were first transferred to a regular 1° grid by nearest-neighbour interpolation, in which the most recent observation is maintained in cases where several observations are available for the same grid point.
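
The gridding step can be sketched as follows. This is one plausible reading of the procedure, in which each observation is assigned to its nearest grid point and later observations overwrite earlier ones; the function `grid_latest`, the grid layout and the sample values are assumptions made for illustration.

```python
import numpy as np

def grid_latest(lats, lons, times, values, res=1.0):
    """Place observations on a regular grid, keeping the most recent per cell."""
    n_lat = int(round(180 / res)) + 1   # rows from -90 to +90 inclusive
    n_lon = int(round(360 / res))       # columns from 0 up to (but excluding) 360
    grid = np.full((n_lat, n_lon), np.nan)
    i = np.rint((np.asarray(lats) + 90.0) / res).astype(int)
    j = np.rint((np.asarray(lons) % 360.0) / res).astype(int) % n_lon
    # Write in chronological order so the latest observation wins in each cell
    for k in np.argsort(times, kind="stable"):
        grid[i[k], j[k]] = values[k]
    return grid

# Three made-up observations; the first two fall in the same grid cell
lats = [10.2, 10.4, -45.0]
lons = [20.1, 19.9, 359.6]
times = [5, 3, 1]            # larger means more recent
values = [1.0, 2.0, 3.0]
grid = grid_latest(lats, lons, times, values)
```

Cells with no observation remain `NaN`, which marks missing data for the downstream encoder in this sketch.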

Several scatterometers are currently operational worldwide, of which we used the ASCAT⁵⁵ instrument aboard Metop-A, -B and -C. Data for this instrument are provided by the European Organisation for the Exploitation of Meteorological Satellites. ASCAT provides a triplet of backscatter measurements (σ₀) from which operational centres retrieve the wind speed and direction, using a geophysical model function that solves for the two unknowns as a function of the σ₀ triplet together with satellite metadata⁵⁶. In contrast to this approach, we opted to simply include the raw σ₀ values together with the metadata as channels to the encoder module, eliminating the complexity of the retrieval process. As all Metop satellites are in LEO, with a revisit time of approximately 24 h, the input to the state estimation module comprises the latest ASCAT observations available within the grid box from any of the three platforms on a regular 1.50° longitude–latitude grid from t = −1 day to t = 0 days.

In operational NWP, temperature and humidity profiles in the upper atmosphere are retrieved using infrared and microwave sounder instruments⁵⁷. For this purpose, we included the Advanced Microwave Sounding Units A and B and the Microwave Humidity Sounder for microwave observations and the High-Resolution Infrared Radiation Sounder (HIRS)/4 for infrared observations. Together, these instruments comprise the Advanced TIROS Operational Vertical Sounder system that is used operationally to retrieve temperature and moisture profiles⁵⁸. Data for these instruments are provided by the National Centers for Environmental Information. Observations for Advanced Microwave Sounding Units A and B, Microwave Humidity Sounder and HIRS are taken from the National Oceanic and Atmospheric Administration 15–19, Aqua and Metop-A satellites. In operational NWP systems, both the retrieved profiles and raw radiances are assimilated. As with ASCAT, profiles of the target variable are retrieved operationally using a geophysical model function, which takes in the raw radiances and satellite metadata and solves for the desired observational profiles. Again, we opted to input the raw radiances together with the satellite metadata directly into the state estimation module without relying on higher-level retrievals. As for ASCAT, the dataset consisted of the latest observations from t = −1 day to t = 0 days, taken within a grid box of a regular 1.50° longitude–latitude grid.

We augmented the Advanced TIROS Operational Vertical Sounder observations with data from the Infrared Atmospheric Sounding Interferometer (IASI)⁵⁹, a hyperspectral infrared sounder. Data for this instrument were provided by the National Centers for Environmental Information. IASI captured data at a much higher spectral resolution than HIRS/4, with a total of 8,461 channels across three bands. To limit the input data volume, we took the leading 15 principal components across these channels, a technique demonstrated to lead to limited performance degradation in operational NWP systems. Data from IASI were available from October 2007 as opposed to January 2007 for the rest of the training set.
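To illustrate the kind of dimensionality reduction described here, the following Python sketch projects synthetic spectra onto their leading principal components using a singular value decomposition. It is not the processing chain used in this study: the function name `leading_pcs`, the toy data (500 channels instead of 8,461) and the choice to estimate the components from the data itself are assumptions made for illustration.

```python
import numpy as np

def leading_pcs(spectra, k):
    """Project spectra of shape (n_samples, n_channels) onto their k leading
    principal components."""
    mean = spectra.mean(axis=0)
    centred = spectra - mean
    # Rows of vt are orthonormal principal directions, ordered by singular value.
    _, s, vt = np.linalg.svd(centred, full_matrices=False)
    components = vt[:k]
    scores = centred @ components.T  # (n_samples, k) compressed representation
    explained = (s[:k] ** 2).sum() / (s ** 2).sum()  # fraction of variance retained
    return scores, components, mean, explained

rng = np.random.default_rng(0)
# Toy stand-in for hyperspectral data: three latent factors plus small noise.
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 500))
spectra = latent @ mixing + 0.01 * rng.normal(size=(200, 500))
scores, components, mean, explained = leading_pcs(spectra, k=15)
```

Because neighbouring channels of a hyperspectral sounder are strongly correlated, a small number of components can retain most of the variance, which is why the compressed scores are a compact model input.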

Although platforms carrying scatterometers and passive microwave sounder instruments in LEO provide high-resolution observations, they have the disadvantage of a lower temporal resolution. By contrast, geostationary satellites provide a very high temporal resolution, although with more limited instrumentation. As the available channels on geostationary satellites vary geographically and with time, we opted to use a composite product, the Gridded Satellite dataset⁶⁰, which provides homogenized retrievals of infrared and vapour window channels over standard geostationary platforms. Data for this product were provided by the National Climatic Data Center. For this data source, we included the image taken at t = 0 h.

To account for diurnal, seasonal and longer-term variations in the data, we included temporal information as input to both the encoder and forecasting modules. These channels consisted of $\sin \left(\frac{2{\rm{\pi }}d}{366}\right),\cos \left(\frac{2{\rm{\pi }}d}{366}\right),\sin \left(\frac{2{\rm{\pi }}h}{24}\right)$ and $\cos \left(\frac{2{\rm{\pi }}h}{24}\right)$, where d is the day of the year and h is the hour of the day. The absolute year was also included to account for any changes in data characteristics over the training record. To account for the effects of orography on the weather system, we included several sources of orographic information taken from the ERA5 dataset³⁴ as static fields. The data were provided by the ECMWF. These were the geopotential at surface level, angle of sub-grid-scale orography, anisotropy of sub-grid-scale orography, slope of sub-grid-scale orography and standard deviation of orography.
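The four cyclic temporal channels can be computed as in the Python sketch below. The function name `temporal_channels` is hypothetical, and two conventions are assumed because the text does not specify them: d is counted from 1 on 1 January, and h is allowed to take fractional values. The absolute-year channel is omitted.

```python
import math
from datetime import datetime, timezone

def temporal_channels(when):
    """Return (sin, cos) of the day-of-year phase and (sin, cos) of the
    hour-of-day phase for a datetime."""
    d = when.timetuple().tm_yday          # day of the year, 1..366 (assumed convention)
    h = when.hour + when.minute / 60.0    # hour of the day, fractional (assumption)
    return (
        math.sin(2 * math.pi * d / 366),
        math.cos(2 * math.pi * d / 366),
        math.sin(2 * math.pi * h / 24),
        math.cos(2 * math.pi * h / 24),
    )

features = temporal_channels(datetime(2018, 7, 1, 6, 0, tzinfo=timezone.utc))
```

Encoding each phase as a sine and cosine pair keeps the representation continuous across midnight and across the turn of the year, which a raw hour or day number would not.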

Pretraining

The modular structure of Aardvark leveraged ERA5 reanalysis data during the training phase to increase the length of the available data record. ERA5, or the fifth generation of the ECMWF reanalysis³⁴, is a state-of-the-art global atmospheric reanalysis dataset. It provides comprehensive information on various meteorological parameters, such as temperature, humidity, wind and geopotential, covering the period from 1940 to the present. These data are provided by the ECMWF. From this, we elected to train on data from 1979 onwards, coinciding with the beginning of widely available remote sensing observations, which significantly improve the quality of the atmospheric reanalysis product.

Baselines

For the global gridded forecast experiments, we compared the performance of Aardvark with four baselines: persistence, climatology, HRES and GFS. Persistence and climatology provide simple baselines for assessing whether a forecasting system is skilful. In persistence forecasting, it was assumed that the weather remains unchanged from t = 0 at all future lead times. For the climatology baseline, we used the climatology product from WeatherBench 2 (ref. ³³). The predicted state was obtained by taking the mean value of all ERA5 observations from 1990 to 2017 for a given day of the year and hour using a sliding window length of 61 days.

The IFS and GFS are the two most widely used global operational NWP systems. As the focus of this study was on deterministic forecasting, we chose to compare our results with the HRES and GFS, deterministic runs at resolutions of 0.10° and 0.25°, respectively. These constitute challenging baselines for comparison with Aardvark Weather, which operates at a 1.50° resolution with just five vertical levels. For comparison with Aardvark, the HRES and GFS outputs were conservatively re-gridded to 1.50° resolution. In particular, we used HRES forecast data and ERA5 target data as provided by WeatherBench 2 (ref. ³³), in which both datasets were coarsened to 1.50° resolution using first-order conservative re-gridding⁶¹. This procedure reduces the effects of aliasing, ensuring that Aardvark does not get an unfair competitive advantage because of distortions in the power spectrum that would occur from naive subsampling. To ensure that the GFS forecasts are compared fairly against Aardvark and HRES, we also applied conservative re-gridding to GFS. See Supplementary Information for further details on aliasing and its effects on signal spectra.
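When every coarse cell is an exact union of fine cells, first-order conservative re-gridding reduces to an area-weighted block average. The Python sketch below illustrates this special case and contrasts it with naive subsampling. It is not the WeatherBench 2 implementation: the function names are hypothetical, and a cell-centred grid with rows running from south to north is assumed, which differs from grids that place points on the poles.

```python
import numpy as np

def conservative_coarsen(field, factor):
    """Area-weighted block average of a (n_lat, n_lon) field on a regular
    cell-centred lat-lon grid (assumed layout, rows from south to north)."""
    n_lat, n_lon = field.shape
    if n_lat % factor or n_lon % factor:
        raise ValueError("grid dimensions must be divisible by factor")
    edges = np.deg2rad(np.linspace(-90.0, 90.0, n_lat + 1))
    # Cell area on the sphere is proportional to the difference in sin(latitude)
    # between the northern and southern cell edges; longitude width is constant.
    area = np.sin(edges[1:]) - np.sin(edges[:-1])
    w = area[:, None] * np.ones(n_lon)
    blocks = (n_lat // factor, factor, n_lon // factor, factor)
    num = (field * w).reshape(blocks).sum(axis=(1, 3))
    den = w.reshape(blocks).sum(axis=(1, 3))
    return num / den

def subsample(field, factor):
    """Naive alternative: keep every factor-th point, which is prone to aliasing."""
    return field[::factor, ::factor]

rng = np.random.default_rng(0)
fine = rng.normal(size=(12, 24))
coarse = conservative_coarsen(fine, 3)
```

The block average preserves the area-weighted global mean of the field, whereas subsampling simply discards fine-scale values and folds their variance into the retained points.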

We considered four baselines for station forecasts. Persistence and climatology were calculated on the basis of station observations. For 2-m temperature, we calculated a daily climatology, and for 10-m wind speed, a monthly climatology. We further considered two more challenging baselines: station-corrected HRES and NDFD over the CONUS. As HRES is a gridded product, sub-grid-scale processes are not resolved. Therefore, we learned a bias correction individually for each station on the 2007–2017 training set and used this to correct the station forecasts on the 2018 test set. NDFD is produced by the National Weather Service in the USA and is a state-of-the-art local forecasting system⁶². Forecasts in the NDFD are created from an ensemble of more than 30 models⁶³, including the IFS and GFS, together with high-resolution regional models at shorter lead times. The data from these systems are shown to human forecasters at different National Weather Service offices, who create the final forecast. Our station forecasts were taken as the nearest-grid-box forecast from the final NDFD forecast, which was at approximately 2-km resolution. Therefore, NDFD constitutes an extremely challenging baseline, capturing the full complexity of an operational forecasting pipeline.

Evaluation metrics

For the global gridded forecasting experiments, we compared models using the latitude-weighted root mean squared error (LW-RMSE). Given arrays of gridded targets y and gridded predictions $\hat{y}$, the LW-RMSE of variable $v$ is calculated as

$$\text{LW-RMSE}(\,y,\hat{y},v)=\frac{1}{B}\mathop{\sum }\limits_{b=1}^{B}\sqrt{\frac{1}{HW}{\sum }_{h=1}^{H}{\sum }_{w=1}^{W}{\alpha }_{h}{({y}_{bhwv}-{\hat{y}}_{bhwv})}^{2}}$$

(1)

where b indexes over B batch elements, $v$ indexes over V atmospheric variables, h and w index the latitude and longitude coordinates over a grid with H points latitude-wise and W points longitude-wise, and ${\alpha }_{h}$ are the latitude weights, defined as

$${\alpha }_{h}=\frac{\cos {\theta }_{h}}{\frac{1}{H}{\sum }_{h=1}^{H}\cos {\theta }_{h}}$$

(2)

where ${\theta }_{h}$ is the latitude at latitude index h. The weights are normalized so that their average is equal to 1. In machine learning, a (mini-)batch refers to a subset of the training dataset, typically used to compute a stochastic estimate of a model’s parameter gradients when performing gradient-based optimization. For the station forecasting experiments, we compared methods on MAE. Given arrays of station target temperatures y and predictions $\hat{y}$, MAE is calculated as

$${\rm{MAE}}(\,y,\hat{y})=\frac{1}{{BN}}\mathop{\sum }\limits_{b=1}^{B}\mathop{\sum }\limits_{n=1}^{N}|{{y}}_{{bn}}-{\hat{y}}_{{bn}}|$$

(3)

where $b$ indexes batch elements and n indexes the N stations in the forecast.
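Equations (1)–(3) translate directly into array operations. The following NumPy sketch is an illustration, not the evaluation code used in this study; the function names and the array layouts, (B, H, W, V) for gridded fields and (B, N) for station data, are assumptions.

```python
import numpy as np

def latitude_weights(lat_deg):
    """Equation (2): cos(latitude), normalized to have mean 1."""
    w = np.cos(np.deg2rad(lat_deg))
    return w / w.mean()

def lw_rmse(y, y_hat, lat_deg, v):
    """Equation (1) for variable index v; y and y_hat have shape (B, H, W, V)."""
    alpha = latitude_weights(lat_deg)[None, :, None]           # (1, H, 1)
    sq_err = (y[..., v] - y_hat[..., v]) ** 2                  # (B, H, W)
    per_sample = np.sqrt((alpha * sq_err).mean(axis=(1, 2)))   # one RMSE per batch element
    return per_sample.mean()

def mae(y, y_hat):
    """Equation (3); y and y_hat have shape (B, N)."""
    return np.abs(y - y_hat).mean()

lat = np.linspace(-90.0, 90.0, 121)   # hypothetical 1.5° latitude axis
y = np.zeros((2, 121, 240, 3))
y_hat = y + 2.0                        # constant error of 2 everywhere
score = lw_rmse(y, y_hat, lat, v=0)
```

Because the weights average to 1, a spatially constant error of magnitude 2 yields an LW-RMSE of 2, matching the unweighted RMSE in that special case.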

Training objectives

Separate training objectives were used for each of the three modules. For all three modules, we normalized the targets by calculating the mean and standard deviation for each variable and level, aggregating across all grid points. In the encoder and processor modules, which involve several target variables, this normalization had an effect of implicitly weighting the variables, owing to the scaling applied during normalization. For the encoder module, we determined an extra weighting by first training the model using an LW-RMSE objective of the form

$$\text{SUM-LW-RMSE}(\,y,\hat{y})=\frac{1}{V}\mathop{\sum }\limits_{v=1}^{V}\text{LW-RMSE}(\,y,\hat{y},v)$$

(4)

Therefore, in the initial run, all variables were weighted equally. Next, weights ${\beta }_{v}$ were produced for each variable by taking the reciprocal of the LW-RMSE for each variable multiplied by a factor of 3 to generate weights within the range of approximately 0 to 1. The training objective for the encoder used these weights, giving the variable- and latitude-weighted RMSE (VLW-RMSE)

$$\text{VLW-RMSE}(\,y,\hat{y})=\frac{1}{V}\mathop{\sum }\limits_{v=1}^{V}{\beta }_{v}\text{LW-RMSE}(\,y,\hat{y},v)$$

(5)

For the processor module, the training objective was SUM-LW-RMSE (equation (4)). However, the processor module was trained to predict residuals (see ‘Processor module’ below). We found that the implicit weighting that was applied through normalization worked well, and we did not further weight the variables individually. Finally, for the decoder module, the training objective was the same as for evaluation, that is, equation (3).
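The two-stage weighting for the encoder can be sketched as below. The text does not state precisely where the factor of 3 enters, so the reading used here, ${\beta }_{v}=1/(3\,\text{LW-RMSE}_{v})$, is an assumption, as are the function names and the example error values.

```python
import numpy as np

def sum_lw_rmse(per_variable_rmse):
    """Equation (4): unweighted mean of the per-variable LW-RMSE values."""
    return float(np.mean(per_variable_rmse))

def reciprocal_weights(initial_rmse, factor=3.0):
    """Assumed reading of the weighting step: beta_v = 1 / (factor * LW-RMSE_v)."""
    return 1.0 / (factor * np.asarray(initial_rmse, dtype=float))

def vlw_rmse(per_variable_rmse, beta):
    """Equation (5): mean of the per-variable LW-RMSE values weighted by beta_v."""
    return float(np.mean(np.asarray(beta) * np.asarray(per_variable_rmse)))

# Hypothetical normalized LW-RMSE values from an initial, equally weighted run.
initial = np.array([0.4, 0.8, 1.6])
beta = reciprocal_weights(initial)
weighted = vlw_rmse(initial, beta)
```

Under this reading, variables that were harder to predict in the initial run receive smaller weights, so that each variable contributes a comparable amount to the weighted objective at the start of the second run.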

Model architecture

Aardvark Weather is a neural process model⁶⁴. Neural processes are a family of deep learning models that provide a flexible framework capable of learning with off the grid data, as well as missing and sparse data, and providing probabilistic predictions at arbitrary locations at test time. These characteristics are ideally suited to working with complex environmental data, such as in climate downscaling and sensor placement^{65,66,67,68,69}.

Our specific architecture is a new member of the neural process family combining SetConv layers developed for the convolutional conditional neural process²⁴, which handles off the grid and sparse data modalities and produces off the grid predictions, together with a vision transformer backbone that is currently used in state-of-the-art AIWP forecasting systems⁷⁰. This provides scalability not currently attainable with standard transformer neural process models with attention-based encoders⁷¹ while still retaining the flexibility to handle diverse data modalities. Here we give details on the architectures of these modules, how they are trained and fine-tuned and how they are deployed. In the discussion that follows, note that the encoder, processor and decoder modules all receive auxiliary channels, such as temporal embeddings and orographic information, as input. For simplicity, we suppressed these channels in our exposition, but it should be understood that all three modules received them as inputs. We provide a complete list of all inputs and outputs to our models in Extended Data Table 2.

Encoder module

The encoder module E takes raw observations as input, and outputs a gridded estimate of the initial state of each variable for the processor module. Let ${o}_{\tau }=\{{o}_{\tau ,1},\ldots ,{o}_{\tau ,N}\}$ be the set of observations corresponding to time τ, where each ${o}_{\tau ,n}$ corresponds to the observations and the corresponding metadata (such as viewing angle, solar elevation angle and observation time) of a single data modality. Each ${o}_{\tau ,n}=({x}_{\tau ,n},\,{y}_{\tau ,n})$ consists of a set of observations ${y}_{\tau ,n}$ and their corresponding longitude and latitude coordinates ${x}_{\tau ,n}$. Each data modality is either on the grid or off the grid and has a corresponding function ${\psi }_{n}$ to transform ${o}_{\tau ,n}$ into a gridded representation of fixed dimensionality. For gridded observations, ${\psi }_{n}$ consists of the addition of a masking channel to distinguish the missing data from the observed data in the grid. For off the grid observations, each ${\psi }_{n}$ consists of a SetConv layer²⁴ with a learnable length scale. The SetConv layer produces a gridded representation of the data, as well as an accompanying density channel that carries information about the presence or absence of data, to handle irregularly sampled observations. The regular gridded representations of the modalities are concatenated to give a single gridded representation of dimension C × H × W, where C is the number of resulting channels, H is the number of latitude points and W is the number of longitude points. This representation of the input data is fed into the backbone of the module, consisting of a vision transformer ${V}_{{\rm{e}}}$ with a patch size of three, eight transformer blocks and a latent dimension of 512. Embeddings for each patch use a multi-layer perceptron following a previous study²⁷. The encoder outputs the initial state estimate ${\hat{s}}_{\tau ,0}$ at time τ with dimensions of 24 × H × W, where 24 is the number of variables modelled in the forecasting module. Putting this together, we have

$${\widehat{s}}_{\tau ,0}=E({o}_{\tau })={V}_{{\rm{e}}}({\odot }_{n=1}^{N}{\psi }_{n}({o}_{\tau ,n}))$$

(6)

where ${\hat{s}}_{\tau ,0}$ is the estimated initial state corresponding to time τ, and ⊙ denotes concatenation. The encoder module is trained to predict ERA5 reanalysis targets using the VLW-RMSE (equation (5)) as its loss function. We trained the module for 150 epochs using AdamW with early stopping and a cosine learning rate scheduler starting at an initial learning rate of 5 × 10⁻⁴ and decaying to zero at the final epoch.
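The role of the SetConv layer for off the grid data can be illustrated with a small NumPy sketch. This is a simplified, non-learned version for intuition only: the function name `set_conv` is hypothetical, the length scale is fixed rather than learned, and distances are computed as plain Euclidean distances in longitude–latitude space, ignoring spherical geometry and longitude wrap-around.

```python
import numpy as np

def set_conv(x_obs, y_obs, x_grid, length_scale):
    """Map off-the-grid observations onto grid points with a Gaussian kernel.
    x_obs: (N, 2) coordinates, y_obs: (N,) values, x_grid: (M, 2) coordinates.
    Returns a density channel and a signal channel, each of shape (M,)."""
    # Squared distance between every grid point and every observation, shape (M, N).
    d2 = ((x_grid[:, None, :] - x_obs[None, :, :]) ** 2).sum(axis=-1)
    k = np.exp(-0.5 * d2 / length_scale**2)
    density = k.sum(axis=1)                    # how much data lies near each grid point
    signal = (k @ y_obs) / (density + 1e-8)    # kernel-weighted average of nearby values
    return density, signal

x_obs = np.array([[0.0, 0.0]])                 # one observation at the origin
y_obs = np.array([5.0])
x_grid = np.array([[0.0, 0.0], [100.0, 0.0]])  # one nearby and one distant grid point
density, signal = set_conv(x_obs, y_obs, x_grid, length_scale=1.5)
```

The density channel is close to zero far from any observation, which lets the downstream network distinguish a genuinely small value from the absence of data.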

Processor module

The processor module P takes the initial state estimate ${\hat{s}}_{\tau ,0}$ as input and outputs forecasts for lead times of 1–10 days. This module consists of ten separate vision transformers, ${V}_{{\rm{p}}}^{(1)},\ldots ,{V}_{{\rm{p}}}^{(10)}$, which were composed to produce gridded global forecasts at each of the ten lead times we considered. Here each ${V}_{{\rm{p}}}^{(i)}$ was designed to provide a 1-day forecast conditioned on the forecast of ${V}_{{\rm{p}}}^{(i-1)}$. This 24-h time step is a common configuration in AIWP models^71,72 and was used here to avoid inconsistencies in assimilation procedures at the 06:00 and 18:00 UTC runs of IFS, which may disadvantage this baseline in the comparison², and for computational tractability. All vision transformers have a patch size of 5, a latent dimension of 512 and 16 transformer blocks. To improve the modelling of interactions between variables, we added cross-attention between variables at the start of the network, as suggested in a previous study⁷³. The processor is trained using a pretraining phase followed by a fine-tuning phase. Let ${s}_{\tau ,t}$ be the ERA5 state corresponding to time τ and lead time t. During pretraining, the first vision transformer, ${V}_{{\rm{p}}}^{(1)}$, is trained to ingest ${s}_{\tau ,0}$ as input and predict the residual ${s}_{\tau ,1}\,-\,{s}_{\tau ,0}$ using the SUM-LW-RMSE loss (equation (4)). We pretrained ${V}_{{\rm{p}}}^{(1)}$ for 100 epochs using AdamW with a cosine learning rate scheduler starting at an initial learning rate of 5 × 10⁻⁴ and decaying to zero at 100 epochs. During the fine-tuning phase, we trained each vision transformer ${V}_{{\rm{p}}}^{(i)}$ to work with the estimated state produced by the previous transformer ${V}_{{\rm{p}}}^{(i-1)}$ as follows. Recall that ${\hat{s}}_{\tau ,0}$ is the estimated state produced by the encoder module. We started by training ${V}_{{\rm{p}}}^{(1)}$ to predict ${s}_{\tau ,1}\,-\,{\hat{s}}_{\tau ,0}$ using the initial state ${\hat{s}}_{\tau ,0}$ as input. Once ${V}_{{\rm{p}}}^{(1)}$ had been fine-tuned, we computed ${\hat{s}}_{\tau ,1}={\hat{s}}_{\tau ,0}\,+\,{V}_{{\rm{p}}}^{(1)}({\hat{s}}_{\tau ,0})$ and initialized the network ${V}_{{\rm{p}}}^{(2)}$ using the weights of ${V}_{{\rm{p}}}^{(1)}$. We then fine-tuned ${V}_{{\rm{p}}}^{(2)}$ to predict ${s}_{\tau ,2}\,-\,{\hat{s}}_{\tau ,1}$ using the previously estimated state ${\hat{s}}_{\tau ,1}$ as input. We proceeded sequentially in this fashion until all networks had been initialized and fine-tuned. This procedure can be regarded as an instance of the pushforward trick⁷⁴. At deployment time, we composed the transformers to obtain a forecast for the desired lead time, that is

$${s}_{\tau ,t}=P({s}_{\tau ,0},t)={\widetilde{V}}_{{\rm{p}}}^{(t)}\circ \ldots \circ {\widetilde{V}}_{{\rm{p}}}^{(1)}({s}_{\tau ,0})$$

(7)

where ${\widetilde{V}}_{{\rm{p}}}^{(t)}(\cdot )=\,\cdot \,+{V}_{{\rm{p}}}^{(t)}(\cdot )$, and ${s}_{\tau ,0}=E({o}_{\tau })$ is the initial state produced by the encoder.
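The composition in equation (7) amounts to a loop in which each network adds its predicted residual to the current state. The Python sketch below shows this structure with toy stand-ins for the ten vision transformers; the function name `rollout`, the damping functions and the 121 × 240 grid shape are hypothetical.

```python
import numpy as np

def rollout(s0, residual_nets, t):
    """Equation (7): apply the first t residual steps, state <- state + V_i(state)."""
    state = s0
    for net in residual_nets[:t]:
        state = state + net(state)
    return state

# Toy stand-ins for the ten networks: each predicts a residual that damps the state by 10%.
residual_nets = [lambda s: -0.1 * s] * 10
s0 = np.ones((24, 121, 240))           # 24 variables on an assumed 1.5° grid
s3 = rollout(s0, residual_nets, t=3)   # state after three 1-day steps
```

Using a separate network per step, rather than one network applied repeatedly, allows each step to adapt to the error characteristics of the state it receives from the previous step.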

Decoder module

The final step in the forecasting pipeline is the decoder module. For each lead time t, we trained a lightweight convolutional station forecasting module ${D}_{t}$, which takes the gridded estimated state ${s}_{\tau ,t}$, the target’s longitude–latitude coordinates x and auxiliary orographic information as inputs and produces predictions for the corresponding station temperature measurements ${y}_{\tau ,t}$. Each ${D}_{t}$ consists of a U-Net architecture⁷⁵, followed by a SetConv layer that maps on-grid predictions to predictions at arbitrary station locations, followed by a multi-layer perceptron that incorporates the auxiliary orographic information, to produce local forecasts ${\hat{y}}_{\tau ,{t}}$. The U-Net consists of four encoder blocks (which consist of two-dimensional convolutions, BatchNorm layers, ReLU activations and MaxPool operations), followed by four decoder blocks (which consist of transposed two-dimensional convolutions, BatchNorm layers, ReLU activations and MaxPool operations). The encoder and decoder blocks have skip connections and channel dimensions (16, 32, 64, 128, 64, 32, 16, 1). We trained each ${D}_{t}$ for 10 epochs using AdamW, with a learning rate of 1 × 10⁻³ and MAE loss (equation (3)). To produce local forecasts at coordinates x, we computed

$${\widehat{y}}_{\tau ,t}={D}_{t}({s}_{\tau ,t},x)$$

(8)

where ${s}_{\tau ,t}$ is the global forecast defined in equation (7).

End-to-end deployment

At deployment time, no ERA5 input is required to run the system. To obtain global forecasts, we composed the encoder and processor together and computed

$${\widehat{s}}_{\tau ,t}={P}_{t}\circ E({o}_{\tau })$$

(9)

where ${P}_{t}(\cdot )=P(\cdot ,t)$. If we want to produce local station forecasts, we compose the encoder, processor and decoder modules and compute

$${\widehat{y}}_{\tau ,t}={D}_{t}({P}_{t}\circ E({o}_{\tau }),x)$$

(10)

Station forecasting baselines

We compared Aardvark against per-station persistence and climatology, as well as against two challenging baselines. The first of these is a station-corrected version of HRES: for each station, we selected the nearest grid point from the HRES 0.25° forecast and learned an affine correction (a scale and a constant bias) on a per-station basis to correct for systematic biases, which is a common and highly effective downscaling method⁷⁶. Further, region-specific downscaling refinements are possible, for example, using a local nested NWP. These could potentially further improve the performance of NWP systems, so the station-corrected HRES results we presented should not necessarily be interpreted as the state-of-the-art in downscaling performance, but rather as a strong and globally applicable baseline. Second, over CONUS, we also compared against a full operational end-to-end baseline, the NDFD from the National Weather Service. NDFD forecasts are an archive of data from the National Weather Service offices produced by combining the output of several global and regional forecasting models, post-processing these and incorporating input from human forecasters³⁶.
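A per-station affine correction of this kind can be fitted by ordinary least squares, as in the sketch below. The text does not state which fitting procedure was used, so least squares is an assumption here, as are the function name `fit_affine` and the synthetic forecast and observation pairs.

```python
import numpy as np

def fit_affine(forecast, observed):
    """Least-squares scale and bias such that scale * forecast + bias ~ observed."""
    A = np.column_stack([forecast, np.ones_like(forecast)])
    (scale, bias), *_ = np.linalg.lstsq(A, observed, rcond=None)
    return scale, bias

# Synthetic training pairs for one station: nearest-grid-point forecast and observation (K).
rng = np.random.default_rng(0)
train_fc = rng.normal(285.0, 8.0, size=500)
train_obs = 0.9 * train_fc + 27.0 + rng.normal(0.0, 0.5, size=500)
scale, bias = fit_affine(train_fc, train_obs)

# Apply the learned correction to new forecasts for the same station.
corrected = scale * np.array([280.0, 290.0]) + bias
```

Fitting the two parameters separately for each station lets the correction absorb local effects, such as elevation differences between the station and the grid box, that a global model cannot resolve.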

End-to-end fine-tuning

To perform end-to-end fine-tuning, we composed the encoder together with the lead time t = 1 day processor and decoder modules, producing local station forecasts for lead time t = 1 day given by

$${\hat{y}}_{\tau ,1}={D}_{1}({P}_{1}\,\circ \,E({o}_{\tau }),x)$$

(11)

This composition produces a single machine learning model with inputs that consist of all raw observational sources of the encoder module and outputs that consist of the predictions of the decoder module. We then fine-tuned this composite model, that is, all three networks, jointly with either 2-m temperature or 10-m wind speed station observations ${y}_{\tau ,1}$ as the only targets, using the RMSE loss. Specifically, the fine-tuning procedure consists of loading the pretrained weights of the encoder, processor and decoder modules and performing stochastic gradient descent on the parameters of the three modules E, P₁ and D₁ to minimize the RMSE loss between the station forecast ${\hat{y}}_{\tau ,1}$ and its corresponding target ${y}_{\tau ,1}$. We used AdamW and optimized all the parameters of the modules for 25,000 gradient steps with a constant learning rate of 5 × 10⁻⁵ and early stopping, as described by the following procedure.

During training, we stored checkpoints of our models to perform region-based model selection during evaluation. Specifically, every 1,000 fine-tuning gradient steps, we stored a copy of the model weights at that point in training, commonly referred to as a checkpoint. We then used the checkpoints to perform model selection on the basis of performance on a held-out validation set: we evaluated each checkpoint generated during fine-tuning on the validation data from each of the regions we considered, namely global, CONUS, Europe, West Africa and the Pacific. For each region, we then selected the best checkpoint, as measured by performance on the validation set for that region, and evaluated this on the test data corresponding to the given region.
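The region-based selection step can be expressed compactly. In the Python sketch below, the function name `select_checkpoints` and all validation errors are hypothetical and serve only to show the selection logic.

```python
def select_checkpoints(val_scores):
    """val_scores maps region -> {gradient_step: validation error}, lower is better.
    Returns region -> gradient step of the best checkpoint for that region."""
    return {region: min(scores, key=scores.get) for region, scores in val_scores.items()}

# Hypothetical validation errors for checkpoints stored every 1,000 gradient steps.
val_scores = {
    "global": {1000: 1.30, 2000: 1.22, 3000: 1.25},
    "CONUS": {1000: 1.10, 2000: 1.12, 3000: 1.05},
    "Europe": {1000: 0.98, 2000: 0.97, 3000: 0.99},
}
best = select_checkpoints(val_scores)
```

Each region's selected checkpoint is then evaluated once on that region's test data, so the test set plays no part in the selection.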

Model size and training costs

All model training in this study was performed on a single virtual machine with four NVIDIA A100 GPUs. The encoder module contains approximately 31 million parameters and requires 13 h to train. The processor module contains approximately 54 million parameters and requires 8 h to train on ERA5 and 3 h to fine-tune using the output of the encoder module as the input. Each of the 11 decoder modules contains approximately 2 million parameters and takes approximately 30 min to train. End-to-end fine-tuning of the encoder, processor and decoder modules takes 2 h. Therefore, the total time to train the model is approximately 100 GPU hours.

Further details

Further details on several aspects of this study, including supplementary figures and further discussion, are available in the Supplementary Information and rely on supplementary references^{77,78,79,80,81,82}.

Data availability

The dataset to run Aardvark Weather will be made available at https://huggingface.co/datasets/av555/aardvark-weather. All figures have been generated using a combination of the LaTeX TikZ package and the Matplotlib Python package⁸³. All coastlines and borders drawn in the spatial plots in the main text (Figs. 1a, 3 and 5k) and Supplementary Information use the border and coastline functionality of the Matplotlib package.

Code availability

The code used for training the models, the trained models, example test data and notebook examples for how to apply the models to make predictions will be made available on GitHub (https://github.com/annavaughan/aardvark-weather-public)⁸⁴.

Change history

17 June 2025
The version of Supplementary information, and reference citations within, were an earlier, incorrect version and the Supplementary information is now updated in the online version of the article.

References

Lam, R. et al. Learning skillful medium-range global weather forecasting. Science 382, 1416–1421 (2023).
Price, I. et al. Probabilistic weather forecasting with machine learning. Nature 637, 84–90 (2025).
Xu, X. et al. FuXi-DA: a generalized deep learning data assimilation framework for assimilating satellite observations. npj Clim. Atmos. Sci. 8, 156 (2025).
Chen, K. et al. Fengwu: pushing the skillful global medium-range weather forecast beyond 10 days lead. Preprint at https://arxiv.org/abs/2304.02948 (2023).
Keisler, R. Forecasting global weather with graph neural networks. Preprint at https://arxiv.org/abs/2202.07575 (2022).
Bi, K. et al. Accurate medium-range global weather forecasting with 3D neural networks. Nature 619, 533–538 (2023).
Lynch, P. The origins of computer weather prediction and climate modeling. J. Comput. Phys. 227, 3431–3444 (2008).
Zhang, F. et al. What is the predictability limit of midlatitude weather? J. Atmos. Sci. 76, 1077–1091 (2019).
European Centre for Medium-Range Weather Forecasts. IFS Documentation CY48R1, Part I: Observations (2023).
Dueben, P. D. & Bauer, P. Challenges and design choices for global weather and climate models based on machine learning. Geosci. Model Dev. 11, 3999–4009 (2018).
Schultz, M. G. et al. Can deep learning beat numerical weather prediction? Philos. Trans. R. Soc. A 379, 20200097 (2021).
Chen, L. et al. FuXi: a cascade machine learning forecasting system for 15-day global weather forecast. npj Clim. Atmos. Sci. 6, 190 (2023).
Shao, W., Zhou, Y., Zhang, Q. & Jiang, X. Machine learning-based wind direction retrieval from quad-polarized Gaofen-3 SAR images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 17, 808–816 (2023).
Yan, X. et al. A deep learning approach to improve the retrieval of temperature and humidity profiles from a ground-based microwave radiometer. IEEE Trans. Geosci. Remote Sens. 58, 8427–8437 (2020).
Zhang, Z., Dong, X., Liu, L. & He, J. Retrieval of barometric pressure from satellite passive microwave observations over the oceans. J. Geophys. Res.: Oceans 123, 4360–4372 (2018).
Kirkwood, C., Economou, T., Odbert, H. & Pugeault, N. A framework for probabilistic weather forecast post-processing across models and lead times using machine learning. Philos. Trans. R. Soc. A 379, 20200099 (2021).
Grönquist, P. et al. Deep learning for post-processing ensemble weather forecasts. Philos. Trans. R. Soc. A 379, 20200092 (2021).
Chen, K. et al. Towards an end-to-end artificial intelligence driven global weather forecasting system. Preprint at https://arxiv.org/abs/2312.12462 (2023).
Huang, L., Gianinazzi, L., Yu, Y., Dueben, P. D. & Hoefler, T. DiffDA: a diffusion model for weather-scale data assimilation. In Proceedings of the 41st International Conference on Machine Learning (eds Salakhutdinov, R. et al.) 19798–19815 (PMLR, 2024).
McNally, A. et al. Data driven weather forecasts trained and initialised directly from observations. Preprint at https://arxiv.org/abs/2407.15586 (2024).
Manshausen, P. et al. Generative data assimilation of sparse weather station observations at kilometer scales. Preprint at https://arxiv.org/abs/2406.16947 (2024).
Keller, J. D. & Potthast, R. AI-based data assimilation: learning the functional of analysis estimation. Preprint at https://arxiv.org/abs/2406.00390 (2024).
Cheng, S., Min, J., Liu, C. & Arcucci, R. TorchDA: a Python package for performing data assimilation with deep learning forward and transformation functions. Comput. Phys. Commun. 306, 109359 (2025).
Gordon, J. et al. Convolutional conditional neural processes. In 8th International Conference on Learning Representations, 12534–12565 (ICLR, 2020).
Berre, L. Simulation and diagnosis of observation, model and background error contributions in data assimilation cycling. Q. J. R. Meteorolog. Soc. 145, 597–608 (2019).
Pascanu, R., Mikolov, T. & Bengio, Y. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning (eds Dasgupta, S. & McAllester, D.) 1310–1318 (PMLR, 2013).
Dosovitskiy, A. et al. An image is worth 16×16 words: Transformers for image recognition at scale. In 9th International Conference on Learning Representations (ICLR, 2021).
Laloyaux, P., Thépaut, J.-N. & Dee, D. Impact of scatterometer surface wind data in the ECMWF coupled assimilation system. Mon. Weather Rev. 144, 1203–1217 (2016).
Isaksen, L. & Janssen, P. A. Impact of ERS scatterometer winds in ECMWF’s assimilation system. Q. J. R. Meteorolog. Soc. 130, 1793–1814 (2004).
Eyre, J. et al. Assimilation of satellite data in numerical weather prediction. Part II: recent years. Q. J. R. Meteorolog. Soc. 148, 521–556 (2022).
Continuous long-window data assimilation. ECMWF Newsletter www.ecmwf.int/en/newsletter/163/news/continuous-long-window-data-assimilation (2020).
Healy, S. et al. Methods for Assessing the Impact of Current and Future Components of the Global Observing System. Memorandum No. 916 (European Centre for Medium-Range Weather Forecasts, 2024).
Rasp, S. et al. WeatherBench 2: A benchmark for the next generation of data-driven global weather models. J. Adv. Model. Earth Syst. 16, e2023MS004019 (2024).
Hersbach, H. et al. The ERA5 global reanalysis. Q. J. R. Meteorolog. Soc. 146, 1999–2049 (2020).
A new ML model in the ECMWF web charts. ECMWF www.ecmwf.int/en/about/media-centre/aifs-blog/2023/new-ml-model-ecmwf-web-charts (2023).
Glahn, H. R. & Ruth, D. P. The new digital forecast database of the National Weather Service. Bull. Am. Meteorol. Soc. 84, 195–202 (2003).
WMO Integrated Processing and Prediction System (WIPPS) Dashboard (World Meteorological Organization, accessed 5 July 2024); https://community.wmo.int/en/activity-areas/wmo-integrated-processing-and-prediction-system-wipps.
Buizza, R. et al. The Development and Evaluation Process Followed at ECMWF to Upgrade the Integrated Forecasting System (IFS). Memorandum No. 829 (European Centre for Medium-Range Weather Forecasts, 2018).
Kaspar, M., Osorio, J. D. M. & Bock, J. Sim2real transfer for reinforcement learning without dynamics randomization. In Proc. 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 4383–4388 (IEEE, 2020).
Metop ASCAT Level 1B SZF Product (EUMETSAT, accessed 20 October 2024); https://navigator.eumetsat.int/product/EO:EUM:DAT:METOP:ASCSZF1B.
Zou, C.-Z., Wang, W. & NOAA CDR Program. NOAA Fundamental Climate Data Record (FCDR) of AMSU-A Level 1c Brightness Temperature, Version 1.0. NOAA National Climatic Data Center https://doi.org/10.7289/V5X63JT2 (accessed 22 October 2024).
Ferraro, R. R., Meng, H. & NOAA CDR Program. NOAA Climate Data Record (CDR) of Advanced Microwave Sounding Unit (AMSU)-B, version 1.0. NOAA National Climatic Data Center https://doi.org/10.7289/V500004W (2016).
HIRS level 1C Fundamental Data Record release 1—multimission—global. EUMETSAT https://doi.org/10.15770/EUM_SEC_CLM_0026 (2022).
IASI Principal Components Scores Fundamental Data Record release 1—Metop-A and -B. EUMETSAT https://doi.org/10.15770/EUM_SEC_CLM_0084 (2022).
Gridded Geostationary Brightness Temperature Data. NOAA NCEI www.ncei.noaa.gov/products/gridded-geostationary-brightness-temperature (accessed 20 October 2024).
HadISD: Met Office Hadley Centre integrated surface dataset. Met Office www.metoffice.gov.uk/hadobs/hadisd/ (2024).
International Comprehensive Ocean-Atmosphere Data Set (ICOADS). NOAA NCEI https://icoads.noaa.gov (accessed 20 October 2024).
Integrated Global Radiosonde Archive (IGRA). NOAA NCEI www.ncei.noaa.gov/products/weather-balloon/integrated-global-radiosonde-archive (accessed 20 October 2024).
National Centers for Environmental Prediction, National Weather Service, NOAA & U.S. Department of Commerce. NCEP GFS 0.25 degree global forecast grids historical archive. NSF https://rda.ucar.edu/datasets/d084001/ (2015).
Data Processing Levels. NASA EARTHDATA www.earthdata.nasa.gov/learn/earth-observation-data-basics/data-processing-levels (accessed 26 October 2024).
Dunn, R. J. et al. HadISD: a quality-controlled global synoptic report database for selected variables at long-term stations from 1973–2011. Clim. Past 8, 1649–1679 (2012).
Dunn, R. J., Willett, K. M., Parker, D. E. & Mitchell, L. Expanding HadISD: quality-controlled, sub-daily station data from 1931. Geosci. Instrum. Methods Data Syst. 5, 473–491 (2016).
Freeman, E. et al. ICOADS release 3.0: a major update to the historical marine climate record. Int. J. Climatol. 37, 2211–2232 (2017).
Durre, I., Vose, R. S. & Wuertz, D. B. Overview of the integrated global radiosonde archive. J. Clim. 19, 53–68 (2006).
Gelsthorpe, R., Schied, E. & Wilson, J. ASCAT-Metop’s advanced scatterometer. ESA Bull. 102, 19–27 (2000).
Stoffelen, A., Verspeek, J. A., Vogelzang, J. & Verhoef, A. The CMOD7 geophysical model function for ASCAT and ERS wind retrievals. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 10, 2123–2134 (2017).
Rosenkranz, P. W. Retrieval of temperature and moisture profiles from AMSU-A and AMSU-B measurements. IEEE Trans. Geosci. Remote Sens. 39, 2429–2435 (2001).
Li, J. et al. Global soundings of the atmosphere from ATOVS measurements: the algorithm and validation. J. Appl. Meteorol. Climatol. 39, 1248–1268 (2000).
Blumstein, D. et al. IASI instrument: technical overview and measured performances. In Proc. SPIE 5543, Infrared Spaceborne Remote Sensing XII (ed. Strojnik, M.) https://doi.org/10.1117/12.560907 (SPIE, 2004).
Knapp, K. R. & Wilkins, S. L. Gridded satellite (GridSat) GOES and CONUS data. Earth Syst. Sci. Data 10, 1417–1425 (2018).
Jones, P. W. First- and second-order conservative remapping schemes for grids in spherical coordinates. Mon. Weather Rev. 127, 2204–2210 (1999).
National Digital Forecast Database: Short Range Guidance for TAF Sites (National Weather Service, 2024); www.weather.gov/media/mdl/ndfd/pd01002001curr.pdf.
How do we use models in our forecasting? National Weather Service www.weather.gov/ilx/about_models (2024).
Garnelo, M. et al. Conditional neural processes. In Proceedings of the 35th International Conference on Machine Learning (eds Dy, J. & Krause, A.) 1704–1713 (PMLR, 2018).
Andersson, T. R. et al. Environmental sensor placement with convolutional Gaussian neural processes. Environ. Data Sci. 2, e32 (2023).
Markou, S., Requeima, J., Bruinsma, W., Vaughan, A. & Turner, R. E. Practical conditional neural process via tractable dependent predictions. In 10th International Conference on Learning Representations (ICLR, 2022).
Vaughan, A., Tebbutt, W., Hosking, J. S. & Turner, R. E. Convolutional conditional neural processes for local climate downscaling. Geosci. Model Dev. 15, 251–268 (2022).
Vaughan, A., Lane, N. D. & Herzog, M. Multivariate climate downscaling with latent neural processes. In Tackling Climate Change with Machine Learning ICML Workshop (2021).
Bruinsma, W. et al. Autoregressive conditional neural processes. In 11th International Conference on Learning Representations (ICLR, 2023).
Bodnar, C. et al. A foundation model for the Earth system. Nature https://doi.org/10.1038/s41586-025-09005-y (2025).
Nguyen, T. et al. Scaling transformer neural networks for skillful and reliable medium-range weather forecasting. In The Thirty-eighth Annual Conference on Neural Information Processing Systems (NeurIPS, 2024).
Couairon, G., Lessig, C., Charantonis, A. & Monteleoni, C. ArchesWeather: an efficient AI weather forecasting model at 1.5 degree resolution. Preprint at https://arxiv.org/abs/2405.14527 (2024).
Nguyen, T., Brandstetter, J., Kapoor, A., Gupta, J. K. & Grover, A. ClimaX: a foundation model for weather and climate. In Proceedings of the 40th International Conference on Machine Learning, 25904–25938 (PMLR, 2023).
Brandstetter, J., Worrall, D. E. & Welling, M. Message passing neural PDE solvers. In International Conference on Learning Representations (ICLR, 2022).
Ronneberger, O., Fischer, P. & Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015 (eds Navab, N. et al.) 234–241 (Springer, 2015).
Bouallègue, Z. B. et al. Statistical modeling of 2-m temperature and 10-m wind speed forecast errors. Mon. Weather Rev. 151, 897–911 (2023).
Scholz, J., Andersson, T. R., Vaughan, A., Requeima, J. & Turner, R. E. Sim2Real for environmental neural processes. In NeurIPS 2023 Workshop on Tackling Climate Change with Machine Learning: Blending New and Existing Knowledge Systems (NeurIPS, 2023).
Chai, J., Zeng, H., Li, A. & Ngai, E. W. Deep learning in computer vision: a critical review of emerging techniques and application scenarios. Mach. Learn. Appl. 6, 100134 (2021).
Deshmukh, A. M. Comparison of hidden Markov model and recurrent neural network in automatic speech recognition. Eur. J. Eng. Technol. Res. 5, 958–965 (2020).
Gordon, J., Bronskill, J., Bauer, M., Nowozin, S. & Turner, R. Meta-learning probabilistic inference for prediction. In 7th International Conference on Learning Representations, 7205–7225 (ICLR, 2019).
Cao, Y. et al. Towards understanding the spectral bias of deep learning. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21 (ed. Zhou, Z.-H.) 2205–2211 (IJCAI Organization, 2021).
Rahaman, N. et al. On the spectral bias of neural networks. In Proc. 36th International Conference on Machine Learning (eds Chaudhuri, K. & Salakhutdinov, R.) 5301–5310 (PMLR, 2019).
Hunter, J. D. Matplotlib: a 2D graphics environment. Comput. Sci. Eng. 9, 90–95 (2007).
Vaughan, A. et al. End-to-end data-driven weather forecasting (source code, sample data and trained models). GitHub https://github.com/annavaughan/aardvark-weather-public (2024).


Acknowledgements

We acknowledge the agencies whose efforts in collecting, curating and distributing datasets made this study possible. This study stands on the foundation of decades of contributions from the meteorological community and their commitment to sharing data. Specifically, we thank the European Organisation for the Exploitation of Meteorological Satellites, the UK Met Office, the National Environmental Satellite, Data, and Information Service, the National Centers for Environmental Information, the National Oceanic and Atmospheric Administration, the National Climatic Data Center, the NSF National Center for Atmospheric Research and ECMWF. The JASMIN Environmental Data Service and WeatherBench 2 project provided invaluable access to pre-processed data sources. This study was generously supported by The Alan Turing Institute, with funding and access to computational resources. A.A. acknowledges the UKRI Centre for Doctoral Training in the Application of Artificial Intelligence to the study of Environmental Risks (AI4ER), led by the University of Cambridge (EP/S022961/1), and studentship funding from Google DeepMind. S.M. acknowledges funding from the Vice Chancellor’s and George and Marie Vergottis scholarship of the Cambridge Trust and the Qualcomm Innovation Fellowship. W.T. acknowledges funding from Huawei and EPSRC grant EP/W002965/1. J.R. acknowledges funding from the Data Sciences Institute at the University of Toronto. J.S.H. is supported by The Alan Turing Institute’s Turing Research and Innovation Cluster in Digital Twins, the Environment and Sustainability Grand Challenge and EPSRC grant EP/Y028880/1. R.E.T. is supported by an EPSRC Prosperity Partnership grant EP/T005386/1 between the University of Cambridge and Microsoft. We would like to thank T. Lazauskas for cloud engineering support in setting up the compute platform, J. Bronskill for technical advice on both compute and machine learning techniques, P. Dueben for advice on baselines and P. Lean for advice on counting the number of observations input to the IFS.

Author information

Will Tebbutt
Present address: The Alan Turing Institute, London, UK
Tom R. Andersson
Present address: Google DeepMind, London, UK
These authors contributed equally: Anna Allen, Stratis Markou

Authors and Affiliations

Department of Computer Science and Technology, University of Cambridge, Cambridge, UK
Anna Allen & Nicholas D. Lane
Department of Engineering, University of Cambridge, Cambridge, UK
Stratis Markou, Will Tebbutt & Richard E. Turner
Vector Institute, University of Toronto, Toronto, Ontario, Canada
James Requeima
Microsoft Research AI for Science, Cambridge, UK
Wessel P. Bruinsma
British Antarctic Survey, Cambridge, UK
Tom R. Andersson & J. Scott Hosking
Department of Geography, University of Cambridge, Cambridge, UK
Michael Herzog
European Centre for Medium-Range Weather Forecasts, Reading, UK
Matthew Chantry
The Alan Turing Institute, London, UK
J. Scott Hosking & Richard E. Turner

Authors

Anna Allen
Stratis Markou
Will Tebbutt
James Requeima
Wessel P. Bruinsma
Tom R. Andersson
Michael Herzog
Nicholas D. Lane
Matthew Chantry
J. Scott Hosking
Richard E. Turner

Contributions

A.A. and R.E.T. conceptualized the project. A.A., S.M., W.T., J.R., W.P.B. and R.E.T. designed the experiments. A.A. selected and collected all data and designed the end-to-end system. A.A., S.M. and W.T. implemented the codebase. A.A., S.M., W.T. and R.E.T. wrote the initial draft of the paper. S.M. produced all figures. All authors provided feedback on the results at various stages of the study and contributed to the final version of the paper.

Corresponding authors

Correspondence to Anna Allen, Stratis Markou or Richard E. Turner.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature thanks David John Gagne, Jan Keller and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Table 1 Listing of the inputs and outputs of each module


Extended Data Table 2 Summary of the observational datasets used to train Aardvark


Supplementary information

Supplementary Information

Supplementary Sections A–E, including Supplementary Figs. 1–29.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article

Cite this article

Allen, A., Markou, S., Tebbutt, W. et al. End-to-end data-driven weather prediction. Nature 641, 1172–1179 (2025). https://doi.org/10.1038/s41586-025-08897-0


Received: 10 July 2024
Accepted: 12 March 2025
Published: 20 March 2025
Issue date: 29 May 2025
DOI: https://doi.org/10.1038/s41586-025-08897-0

This article is cited by

A foundation model for the Earth system
- Cristian Bodnar
- Wessel P. Bruinsma
- Paris Perdikaris
Nature (2025)
Evaluating the Shanghai Typhoon Model against State-of-the-Art Machine-Learning Weather Prediction Models: A Case Study for Typhoon Danas (2025)
- Zeyi Niu
- Wei Huang
- Xu Zhang
Advances in Atmospheric Sciences (2025)
