Introduction

The importance of accurate predictions of the future Earth system, ranging from minutes to centuries and from local to global scales, has significantly influenced humanity’s continuous efforts to observe and comprehend our planet1. In recent decades, advancements in observation, theoretical understanding, and high-performance computing have led to the development of numerical weather prediction (NWP) models. These models provide warnings for local extreme events with lead times ranging from hours to days and can predict global weather up to 2 weeks in advance1. However, the complexity of these physical models necessitates significant investments in supercomputing clusters to develop and operate NWP systems. In recent years, deep learning (DL)-based approaches for weather forecasting have emerged, demonstrating performance comparable to that of state-of-the-art (SOTA) NWP models2,3,4,5,6. Significantly, these methods are orders of magnitude more efficient at forecasting than traditional NWP models. High-quality, accurate initial fields are the foundation of skillful forecasts: combined with the forecasting model, they govern the evolution of the state trajectory in both space and time. The preparation of accurate initial fields has therefore garnered increased attention in Earth science7, and significant efforts have been devoted to developing data assimilation (DA) techniques that integrate existing atmospheric knowledge, often in the form of forecasting models, with observations to achieve the best estimate of the initial fields8,9,10,11.

With notable advancements in DA methods and the proliferation of observations, substantial improvements have been achieved in the accuracy of initial fields12,13,14,15,16. In particular, the four-dimensional variational (4DVar) method is widely acknowledged as one of the most advanced DA techniques, producing dynamically and physically constrained analysis fields17. It implicitly implements the flow-dependent background error covariance matrix and propagates information within assimilation windows using the tangent linear model (TLM)18 and the adjoint model (ADM)19. This method has been extensively adopted in most leading operational NWP centers. Furthermore, the current generation of initial fields for DL-based weather forecasting models also frequently depends on the analysis field produced by the conventional 4DVar approach2,3,6. However, the background error covariance statistics in 4DVar, along with the development of the ADM, require highly trained experts to innovate better programs and algorithms. This process can be time-consuming and expensive. Additionally, the iterative minimization of the 4DVar cost function entails computational costs typically two to three orders of magnitude higher than those of a single model forecast11. Thus, relying on initial fields derived from the traditional 4DVar method constrains the real-time applicability of current DL-based weather forecasting models for multivariate 3D weather systems.
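For reference, the cost function discussed above is commonly written in the strong-constraint form below. The notation is the textbook one and is an assumption here rather than taken from this article: x_0 is the initial state, x_b the background, B and R_i the background and observation error covariances, H_i the observation operator, M_{0→i} the forecast model from the initial time to observation time i, and y_i the observations inside the assimilation window.

```latex
J(\mathbf{x}_0) = \frac{1}{2}(\mathbf{x}_0-\mathbf{x}_b)^{\mathrm{T}}\mathbf{B}^{-1}(\mathbf{x}_0-\mathbf{x}_b)
 + \frac{1}{2}\sum_{i=0}^{N}\bigl(\mathcal{H}_i(\mathcal{M}_{0\to i}(\mathbf{x}_0))-\mathbf{y}_i\bigr)^{\mathrm{T}}\mathbf{R}_i^{-1}\bigl(\mathcal{H}_i(\mathcal{M}_{0\to i}(\mathbf{x}_0))-\mathbf{y}_i\bigr)

\nabla_{\mathbf{x}_0} J = \mathbf{B}^{-1}(\mathbf{x}_0-\mathbf{x}_b)
 + \sum_{i=0}^{N}\mathbf{M}_{0\to i}^{\mathrm{T}}\mathbf{H}_i^{\mathrm{T}}\mathbf{R}_i^{-1}\bigl(\mathcal{H}_i(\mathcal{M}_{0\to i}(\mathbf{x}_0))-\mathbf{y}_i\bigr)
```

In the gradient, M_{0→i} and H_i in bold denote the tangent linear operators, and their transposes are the adjoints. Each iteration of the minimization needs a forward model run plus an adjoint run, which is why the iterative solve costs far more than a single forecast.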

Fortunately, the conceptual similarity between DA and DL20,21, as well as the increasing use of DL techniques in the atmospheric and oceanic domains3,22,23,24,25, has prompted exploration of DL methods for DA26. Unlike conventional DA approaches, “pure” data-driven DL-based DA models use neural networks to automatically establish implicit mappings from input background fields and observations to target analysis or reanalysis fields. This accelerates the DA process significantly27,28,29. For instance, Pawar et al.27 proposed a long short-term memory (LSTM) model that estimates the nudging term, which not only guides the state trajectories toward the observations but also acts as a stabilizer, and tested it on the Lorenz96 physical model. Additionally, Wu et al.28 introduced a fast DA (FDA) method that replaces the DA process with a neural network trained on 4DVar results as target outputs. When tested on the Lorenz63 physical model, FDA demonstrated superior computational performance compared to 4DVar while maintaining similar quality. However, these methods do not incorporate the underlying physics27,28,29 and fail to adequately learn how observational information propagates across variables. Therefore, when applied to complex multivariate three-dimensional (3D) weather systems, these methods are theoretically unable to efficiently correct unobserved variables using the observed ones.

On the other hand, another approach worth mentioning uses DL-based surrogate models to replace the forward model or the ADM traditionally used in DA11,23,30,31,32,33,34,35. One prominent illustration of this methodology is the 4DVarNet model23, which constitutes a neural and differentiable implementation of the 4DVar cost function as well as a gradient-based neural solver of that cost function. The trained model has demonstrated the capacity to expedite the 4DVar method with fewer gradient-based iterations and has performed effectively in reconstructing the dynamics of the Lorenz96 system as well as the sea surface height (SSH)23,34,35. However, because the neural network extracts only the gradient information, integrating physical constraints with background field information is challenging. Additionally, the ConvLSTM model used in 4DVarNet has difficulty learning long-range dependencies in data owing to non-parallelizability and vanishing gradients25. Moreover, the autoregressive strategy leads to high GPU memory costs during training and inference. These inherent limitations, namely the excessive computing resource requirements and the limited representation of intrinsic relationships within the multivariate 3D weather system, hamper the effectiveness of recent methods in real-time estimation of multivariate 3D weather states and their applicability to real-world scenarios. Therefore, substantial advancements in multivariate 3D weather state estimation are necessary to enable the use of DL-based DA methods in real-world initial field estimation. These advancements should aim to comprehensively capture correlations within weather systems while reducing the demand for computational resources.

In response to the above requirements, we present 4DVarFormer, a variational-constrained neural network designed to estimate accurate multivariate 3D initial fields in real time for the pre-trained medium-range forecasting model FourCastNet2. FourCastNet is the first DL-based weather forecasting model that can achieve comparable performance to the high-resolution operational NWP model. If 4DVarFormer proves effective for FourCastNet, this suggests that it may generalize to other DL-based weather forecasting models. Specifically, 4DVarFormer serves as a non-iterative solver of the 4DVar cost function by combining an attention-based neural network with the 4DVar physical constraint. This architecture enables capturing interactions between the kinetic information within the assimilation window and the background field, thereby enhancing its capacity to utilize observed variables to correct other variables. Notably, 4DVarFormer eliminates the necessity for background error covariance statistics and the development of complex adjoint models. Our observing system simulation experiments (OSSEs)1,35,36,37 demonstrate the superior performance of 4DVarFormer over 4DVarNet and ViT in terms of initial field estimation. 4DVarFormer can estimate multivariate 3D initial fields in the East China region within 0.37 s, effectively driving FourCastNet to generate medium-range forecasts comparable to those using ERA5 as initial fields. Furthermore, our model skillfully propagates information from 10-m wind field observations to upper-air geopotential fields and from upper-air temperature observations to upper-air relative humidity, indicating its capacity to accurately depict the intrinsic correlation within the multivariate 3D weather system. Finally, the sensitivity experiments illustrate 4DVarFormer’s effectiveness in adapting to variations in background fields, thereby highlighting its potential for real-world weather DA.

Overall, our contributions in this work can be summarized as follows:

  • We propose a 4DVar-constrained model, 4DVarFormer, for accurate real-time estimation of multivariate 3D weather states.

  • The medium-range forecast initialized by 4DVarFormer demonstrates comparable performance to the forecast driven by ERA5 while outperforming those initialized by 4DVarNet and ViT.

  • We assess the model’s ability to capture correlations between variables in the atmospheric system through a series of sensitivity experiments, thus enhancing the model’s interpretability.

Results

The performance of 4DVarFormer was evaluated using data from the East China region spanning 2021 and 2022. Our experiments focused on two aspects of 4DVarFormer’s performance. The first is the accuracy and stability of the 30-day DA cycle, and the second is the accuracy of the 7-day medium-range forecasts initialized using 4DVarFormer. These experiments were conducted at 5-day intervals, yielding a total of 139 sets. For details of the experimental methodology, please see Fig. S4. To determine how 4DVarFormer’s performance compares to other DL-based DA methods, we focus on 4DVarNet23,34,35 and ViT38. 4DVarNet is a strong DL-based baseline that has achieved superior performance when reconstructing the SSH. Further, ViT and its variants2,3,6,39 have demonstrated remarkable performance in various computer vision tasks and complex dynamic modeling. Comparison with these two baseline models helps demonstrate the effectiveness of 4DVarFormer.

Assessments of assimilation metrics for the 4DVarFormer model

We conducted a series of 30-day DA cycle experiments, where we utilized 3-day forecast background fields initialized by ERA5. These background fields were combined with noisy observations of surface wind and upper-air temperatures at the 50, 500, 850, and 1000 hPa pressure levels as the models’ inputs. In recent studies, ViT-based models such as FourCastNet2, PanGu3, and FuXi6 have demonstrated remarkable effectiveness in capturing the intricate dynamics of the high-resolution weather system. This indicates the potential of ViT-based models for modeling the multivariate correlations within the weather system. Consequently, we compared our 4DVarFormer with the original ViT model38 and the SOTA DL-based DA method 4DVarNet. Both 4DVarFormer and 4DVarNet assimilate 6-hourly observations within a 24-h DA window, while ViT assimilates only the observations at the initial moment.

Figure 1 illustrates the time series of the averaged latitude-weighted root mean square error (RMSE) and anomaly correlation coefficient (ACC) for the 4DVarFormer, 4DVarNet, and ViT models, considering 4 surface variables and 20 upper-air variables (the details of the RMSE and ACC metrics are given in the Supplementary Text). Notably, 4DVarFormer consistently outperforms both 4DVarNet and ViT for all tested variables. In terms of RMSE (lower values indicate better performance), 4DVarFormer exhibits an overall 34.6% improvement compared to 4DVarNet and a 43.2% improvement compared to ViT. In terms of ACC (higher values indicate better performance), 4DVarFormer exhibits an overall 10.9% improvement compared to 4DVarNet and a 20.2% improvement compared to ViT. The advantage of 4DVarFormer is even more pronounced for variables like Z500 and R500.
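The two metrics can be sketched as follows. This is a minimal illustration that assumes the cosine-latitude weighting widely used in DL weather forecasting benchmarks; the exact definitions used in this work are in the Supplementary Text and may differ. The function names and array shapes are hypothetical.

```python
import numpy as np

def lat_weights(lat_deg):
    """Cosine-latitude weights normalized to mean 1 (assumed convention)."""
    w = np.cos(np.deg2rad(lat_deg))
    return w / w.mean()

def lat_weighted_rmse(pred, truth, lat_deg):
    """RMSE over a (n_lat, n_lon) field, weighting each row by its latitude."""
    w = lat_weights(lat_deg)[:, None]  # broadcast weights across longitude
    return float(np.sqrt(np.mean(w * (pred - truth) ** 2)))

def lat_weighted_acc(pred, truth, clim, lat_deg):
    """Anomaly correlation: correlate departures from a climatology field."""
    w = lat_weights(lat_deg)[:, None]
    pred_anom = pred - clim
    truth_anom = truth - clim
    num = np.sum(w * pred_anom * truth_anom)
    den = np.sqrt(np.sum(w * pred_anom ** 2) * np.sum(w * truth_anom ** 2))
    return float(num / den)
```

A perfect analysis gives an RMSE of 0 and an ACC of 1. Because the weights average to 1, a uniform error of one unit gives an RMSE of exactly one unit.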

Fig. 1: Comparison of the average latitude-weighted RMSE (first and second rows) and ACC (third and fourth rows) of the 139 sets of 30-day assimilation cycles using 4DVarNet (blue lines), ViT (yellow lines), and 4DVarFormer (red lines) for four surface variables (MSLP, T2M, U10, and V10) and six upper-air variables (Z500, T850, T500, R500, U500, and V500), using testing data from 2021 to 2022.

All models are evaluated against the ERA5 reanalysis dataset.

To better understand what underlies 4DVarFormer’s enhanced assimilation performance, we present the assimilation outcomes of all neural networks in Fig. 2. The analysis focuses on two upper-air variables, Z500 and T850, and two surface variables, T2M and 10-m wind speed, comparing the outcomes with the ERA5 ground truth. The depicted instances are all at 00:00 UTC on 19 July 2021, when there is a tropical cyclone in the East China region. The results obtained from 4DVarFormer closely approximate the ground truth and demonstrate the best reconstruction of the tropical cyclone’s position and shape. All DL-based models exhibit varying degrees of mesh artifacts, indicating that both ConvLSTM- and attention-based models tend to predict average values within the kernel or patch. This convergence toward average values is a common trait of DL-based vision algorithms3. In contrast, ERA5 appears less smooth due to its calculation of an ensemble average estimated value at each grid cell by solving a 4DVar system composed of a complex partial differential equation. Notably, solving the numerical 4DVar cost function for a single member on a high-performance computer typically takes about 25 min40. 4DVarFormer requires only 0.37 s, which is approximately 4000 times faster than the traditional 4DVar.
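The quoted speedup follows directly from the two timings stated above. A short check, using only those two numbers (the variable names are illustrative):

```python
# Timings quoted in the text: about 25 min for numerical 4DVar, 0.37 s for 4DVarFormer.
numerical_4dvar_seconds = 25 * 60
model_seconds = 0.37

speedup = numerical_4dvar_seconds / model_seconds
print(f"{speedup:.0f}")  # prints 4054, i.e. roughly 4000 times faster
```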

Fig. 2: Visualization of assimilation results.

The estimated states of two upper-air variables (Z500 and T850) and two surface variables (2-m temperature and 10-m wind speed). For each case, the background field (first column), 4DVarNet (second column), ViT (third column), 4DVarFormer (fourth column), and the ERA5 ground truth (fifth column) are shown. For all illustrated cases, the time is 00:00 UTC on 19 July 2021, when there is a tropical cyclone in the East China region.

In Figs. S5 and S6, we provide further evaluations of 4DVarFormer with the error profiles and the power spectral densities.

Assessments of 7-day forecast metrics driven by the 4DVarFormer model

Our research focuses on leveraging neural networks to acquire accurate multivariate 3D initial fields, thereby enabling accurate medium-range weather forecasts. To evaluate this ability, the analysis fields estimated by each model on every fifth day at 00:00 UTC are used as the initial fields for the pre-trained FourCastNet to generate forecasts. We then compare the resulting 6-hourly forecasts spanning 7 days against forecasts generated using ERA5 as the initial field.
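The forecast procedure described above amounts to an autoregressive rollout: the model is applied repeatedly to its own output, 28 times for a 6-hourly forecast out to 168 h. The sketch below illustrates only this loop. `toy_step` is a hypothetical placeholder standing in for the pre-trained forecast model, and the array shape is illustrative.

```python
import numpy as np

def rollout(step_fn, x0, lead_hours=168, step_hours=6):
    """Apply step_fn autoregressively; return states at 6, 12, ..., lead_hours."""
    states = []
    x = x0
    for _ in range(lead_hours // step_hours):
        x = step_fn(x)  # each step advances the state by step_hours
        states.append(x)
    return np.stack(states)

def toy_step(x):
    """Placeholder dynamics (assumption): NOT the real forecast model."""
    return 0.9 * x

analysis = np.ones((4, 8, 8))  # (variables, lat, lon), illustrative shape only
forecast = rollout(toy_step, analysis)
print(forecast.shape)  # (28, 4, 8, 8): one state per 6-h lead time
```

Because each step consumes the previous output, any error in the initial field is carried through the whole forecast, which is why the quality of the analysis matters at every lead time.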

The overall forecast results for 2021 and 2022 are depicted in Fig. 3. Across all tested variables, encompassing upper-air and surface variables, 4DVarFormer demonstrates consistently superior accuracy compared to both 4DVarNet and ViT. Specifically, in terms of RMSE, 4DVarFormer exhibits overall 10.7% lower values than 4DVarNet and 13.6% lower values than ViT. In terms of ACC, 4DVarFormer exhibits overall 8.5% higher values than 4DVarNet and 11.8% higher values than ViT. This advantage holds across all lead times, ranging from 6 to 168 h (i.e., 7 days), with some variables such as T2M and R500 showing particularly significant advantages. Importantly, the accuracy of the forecasts initialized by 4DVarFormer is comparable to that of the forecasts driven by ERA5 for all variables. This heightened accuracy and alignment can be attributed to the capability of 4DVarFormer to utilize the physical constraints inherent in 4DVar.

Fig. 3: Comparison of the average latitude-weighted RMSE (first and second rows) and ACC (third and fourth rows) of the 139 sets of 7-day medium-range forecasts using ERA5 (black lines) and outputs of 4DVarNet (blue lines), ViT (yellow lines), and 4DVarFormer (red lines) as initial fields.

Four surface variables (MSLP, T2M, U10, and V10) and six upper-air variables (Z500, T850, T500, R500, U500, and V500) are illustrated, using testing data from 2021 to 2022. All models are evaluated against the ERA5 reanalysis dataset.

Sensitivity experiments performed for interpretability

The skillful initial field estimation achieved by the 4DVarFormer model underscores the pivotal role played by the multivariate coupling dynamics embodied within the model’s architecture. One of the advantages of the 4DVarFormer model is its ability to capture the relationship between temperature and relative humidity. This capability is facilitated by two key factors: the gradient of the 4DVar cost function, constructed by the pre-trained FourCastNet, and the robust representation of long-range correlations through the attention mechanism. In this subsection, we conduct sensitivity experiments to elucidate the fundamental physical constraints in 4DVarFormer. The control experiment discussed in the “Assessments of assimilation metrics for the 4DVarFormer model” and “Assessments of 7-day forecast metrics driven by the 4DVarFormer model” sections assimilated observations of 10-m wind and upper-air temperature. Here, we undertake several sensitivity experiments using the testing set to assess the model’s proficiency in assimilating different observational variables and their impacts. Specifically, one experiment assimilates only the 10-m wind observations, while another assimilates only the upper-air temperature observations. The wind assimilation sensitivity experiment explores the effects on the upper-air geopotential fields, whereas the temperature sensitivity experiment examines the interplay between temperature and relative humidity. All sensitivity experiments use the aforementioned trained model without fine-tuning. We also examine the influence of background fields with different forecast lead times on the assimilation outcomes. The findings demonstrate the stability of our model when utilizing the background fields of 3–5 days’ forecasts as inputs, thereby ensuring consistent forecasting proficiency. To further investigate the impact of the 4DVar physical constraints on the assimilation performance of the 4DVarFormer model and the accuracy of the initialized forecasts, we perform a set of experiments varying the length of the data assimilation window (DAW). Furthermore, we compare the tropical cyclone track forecasts of two super typhoons in 2021 from FourCastNet initialized with 4DVarFormer and with ERA5, in order to explore the flow dependence implicit in the 4DVarFormer model.

Assimilating the surface wind field effects on the upper-air geopotential

Surface wind significantly influences the atmospheric system, particularly through its relationship with pressure, which is a well-established feature of the atmosphere41,42,43. To investigate the influence of surface wind observations on 4DVarFormer’s estimation of upper-air geopotential states, we carried out an experiment in which only 10-m wind observations were assimilated. This was done to discern the specific impact of surface wind observations on the upper-air pressure, considering the high correlation between pressure and geopotential. Figure 4a presents the average RMSEs and ACCs of Z500, Z850, U10, and V10 for 7-day forecasts initiated from the estimated initial fields. While the forecast skill is somewhat diminished compared to the control experiment (section “Assessments of 7-day forecast metrics driven by the 4DVarFormer model”) when only 10-m wind observations are assimilated, the overall performance remains notably superior to that of 4DVarNet and ViT.

Fig. 4: Visualization of examples when only assimilating 10-m wind observations.

a Results of the 7-day medium-range forecast using ERA5 and state estimations only assimilating 10-m wind observations as initial fields, separately. The illustrated results are the forecasting RMSEs and ACCs of two upper-air variables (Z500 for the first column and Z850 for the second column) and two surface variables (U10 for the third column and V10 for the fourth column). b A typical example of background error (xb − xt). Two levels of geopotential variables (Z500 for the first column and Z850 for the second column) and two components of 10-m wind (U10 for the third column and V10 for the fourth column) are illustrated. c The visualization of the analysis increment (xa − xb) of the compared models. For each case, 4DVarNet (first row), ViT (second row), and 4DVarFormer (third row) are shown. For all illustrated cases, the time is 00:00 UTC on 19 July 2021, when there is a tropical cyclone in the East China region.

Figure 4b displays the background error (xb − xt) in the 10-m wind field and the upper-air geopotential fields at 00:00 UTC on 19 July 2021, when there is a tropical cyclone in the East China region. The FourCastNet forecasts reveal a negative error in the north and a positive error in the south for both Z500 and Z850, exhibiting a symmetrical structure on the north and south sides of the tropical cyclone. Furthermore, U10 and V10 exhibit negative errors in the tropical cyclone eye and positive errors around it. This suggests that FourCastNet underestimates the strength of the tropical cyclone. Figure 4c depicts the analysis increments (xa − xb) resulting from the assimilation of 10-m wind observations. The 10-m wind field increments from all three models show clear spatial correspondence with the background error and an accurate correction amount. In contrast, the increments of Z500 and Z850 show that the corrections made by 4DVarNet and ViT are limited. Specifically, the analysis increments from 4DVarNet lack the symmetric structure on the northern and southern sides of the tropical cyclone center and are relatively small. This limitation hinders the complete correction of errors in the background field. ViT, on the other hand, exhibits a significant positive bias throughout the entire East China region, with increments that poorly correspond to the background error. This result suggests that ViT fails to accurately propagate information from the 10-m wind field to the upper-air geopotential fields and does not adequately capture the wind-pressure relationship. In contrast, the analysis increments obtained by 4DVarFormer exhibit a strong correlation with the 10-m wind field’s increments and closely correspond to the background errors, indicating that 4DVarFormer is capable of reasonably correcting the upper-air geopotential fields by assimilating the 10-m wind observations. This ability showcases the model’s capacity to capture the essential wind-pressure relationships within the atmospheric system.
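The comparison between increments and background errors can be made quantitative with a pattern correlation. The sketch below uses synthetic fields and a hypothetical analysis that removes 80% of the background error; it is an illustration of the diagnostic, not the evaluation code used in this work. An ideal increment (xa − xb) equals the negative of the background error (xb − xt), so the correlation is taken against the sign-flipped error.

```python
import numpy as np

def pattern_correlation(a, b):
    """Centered spatial correlation between two 2D fields."""
    a = a - a.mean()
    b = b - b.mean()
    return float(np.sum(a * b) / np.sqrt(np.sum(a ** 2) * np.sum(b ** 2)))

rng = np.random.default_rng(0)
x_t = rng.normal(size=(16, 16))                    # synthetic "truth"
x_b = x_t + rng.normal(scale=0.5, size=(16, 16))   # synthetic background
x_a = x_b - 0.8 * (x_b - x_t)                      # hypothetical analysis (assumption)

background_error = x_b - x_t
increment = x_a - x_b

# An increment that cancels the background error correlates at +1 with its negative.
corr = pattern_correlation(increment, -background_error)
```

Increments that are too small still correlate well but leave most of the error in place, so the correlation is best read together with the remaining analysis error (xa − xt).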

In addition to the results already reported above, we also visualize the results for the initial fields assimilating only the 10-m wind observation in Fig. 5. It can be seen that both 4DVarFormer and ViT can correctly capture the location of the tropical cyclone, while 4DVarNet exhibits two tropical cyclone centers. This may be caused by the excessive weighting of the background field in the 4DVar cost function optimized by 4DVarNet. For the upper-air geopotential fields and mean sea level pressure, only 4DVarFormer can describe the spatial structures well. This indicates that 4DVarFormer can effectively extract the information from the observed data of the 10-m wind field to correct the geopotential fields and pressure fields. This result corresponds to the analysis increments in Fig. 4c and again shows that 4DVarFormer learned the wind-pressure relationship.

Fig. 5: Visualization of assimilation results.

The state estimation of two upper-air geopotential fields (Z500 and Z850) and two surface variables (MSLP and 10-m wind speed). For each case, the background field (first column), states estimated by 4DVarNet (second column), ViT (third column), 4DVarFormer (fourth column), and the ERA5 ground truth (fifth column) are shown. The results are generated by only assimilating 10-m wind observations. For all illustrated cases, the time is 00:00 UTC on 19 July 2021, when there is a tropical cyclone in the East China region.

In Fig. S7, we provide further evaluations of 4DVarFormer with the error profiles when only the 10-m wind observations are assimilated.

Assimilating the upper-air temperature effects on the relative humidity

The relative humidity is known to be dependent on air temperature44,45. Therefore, the assimilation of upper-air temperature observations is expected to correct relative humidity errors. Here, we further examine the effects of assimilating upper-air temperature observations on the estimation and forecast of upper-air relative humidity. Figure 6a presents the RMSEs and ACCs of R500, R850, T500, and T850. While the forecast skill of the temperature and relative humidity fields is somewhat diminished in the forecasts driven solely by the assimilation of upper-air temperature observations in 4DVarFormer compared to the control experiment (section “Assessments of 7-day forecast metrics driven by the 4DVarFormer model”), the overall performance remains notably superior to that of 4DVarNet and ViT.
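The temperature dependence mentioned above can be illustrated with the Magnus approximation for saturation vapor pressure over water. The coefficients (6.112 hPa, 17.62, 243.12 °C) are the commonly quoted WMO values and are an assumption here, as are the sample temperature and vapor pressure; the formula is not part of the model described in this article.

```python
import math

def saturation_vapor_pressure_hpa(t_celsius):
    """Magnus approximation over liquid water (assumed coefficients)."""
    return 6.112 * math.exp(17.62 * t_celsius / (243.12 + t_celsius))

def relative_humidity_percent(vapor_pressure_hpa, t_celsius):
    """Relative humidity as actual over saturation vapor pressure."""
    return 100.0 * vapor_pressure_hpa / saturation_vapor_pressure_hpa(t_celsius)

# Hold the moisture content fixed and warm the air by 1 degree.
vapor_pressure = 10.0  # hPa, illustrative value
rh_cool = relative_humidity_percent(vapor_pressure, 15.0)
rh_warm = relative_humidity_percent(vapor_pressure, 16.0)
print(rh_cool > rh_warm)  # True: warming lowers relative humidity
```

At fixed moisture content, a positive temperature correction implies a negative relative humidity correction, which is the sign relationship a multivariate analysis would be expected to reproduce.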

Fig. 6: Visualization of examples when only assimilating upper-air temperature observations.

a Results of the 7-day medium-range forecast using ERA5 and state estimations only assimilating upper-air temperature observations as initial fields, separately. The RMSEs and ACCs of the selected variables (R500 for the first column, R850 for the second column, T500 for the third column, and T850 for the fourth column) are illustrated. b A typical example of background error (xb − xt). Two levels of relative humidity (R500 for the first column and R850 for the third column) and two levels of temperature (T500 for the second column and T850 for the fourth column) are illustrated. c The visualization of the analysis increment (xa − xb) of the compared models. For each case, 4DVarNet (first row), ViT (second row), and 4DVarFormer (third row) are shown. For all illustrated cases, the time is 00:00 UTC on 19 July 2021, when there is a tropical cyclone in the East China region.

Figure 6b depicts the background error (xb − xt) in the upper-air temperature and relative humidity fields at a specific moment in time, while Fig. 6c illustrates the analysis increments (xa − xb) resulting from assimilating the upper-air temperature observations by each neural network. The temperature and relative humidity increments from the three models differ markedly. In particular, the increments of the ViT model exhibit significant anomalies in the northwest direction over land, suggesting that the model does not effectively learn the information present in temperature observations during its training phase. Thus, there is an underestimation of the influence of temperature features and a lack of separation of information from the wind and temperature observations. While 4DVarNet shows some improvement, its increments are still clearly underestimated compared to those of 4DVarFormer. In contrast, the analysis increments obtained by 4DVarFormer present a more reasonable distribution, with the relative humidity increments showing a somewhat negative correlation with the temperature increments. This is more consistent with the relationship between the two error distributions in the background field, suggesting that 4DVarFormer can effectively capture the fundamental temperature-humidity relationship in the atmospheric system and correct the corresponding relative humidity through the assimilation of upper-air temperature observations.

In addition to the findings presented above, we visualize the results of assimilating solely the upper-air temperature observations in Fig. 7. Only 4DVarFormer accurately captures the spatial structure of the relative humidity, whereas neither 4DVarNet nor ViT significantly corrects the background relative humidity. This result suggests that only 4DVarFormer possesses the ability to transfer information from temperature observations to humidity. These results again highlight that 4DVarFormer effectively learns the temperature-humidity relationship.

Fig. 7: Visualization of assimilation results.

The state estimation of two upper-air relative humidity variables (R500 and R850) and two upper-air temperature variables (T500 and T850). For each case, the background field (first column), states estimated by 4DVarNet (second column), ViT (third column), 4DVarFormer (fourth column), and the ERA5 ground truth (fifth column) are shown. The results are generated by only assimilating upper-air temperature observations. For all illustrated cases, the time is 00:00 UTC on 19 July 2021, when there is a tropical cyclone in the East China region.

In Fig. S8, we provide further evaluations of 4DVarFormer with the error profiles when only the upper-air temperature observations are assimilated.

Sensitivity experiments using different forecast timing background fields

We additionally tested the performance of 4DVarFormer with background fields of different forecast lead times. Figure 8 shows the performance of the 7-day forecasts starting from assimilation results using background fields with 3-day (red line), 4-day (blue line), and 5-day (yellow line) lead times, respectively. 4DVarFormer shows similar performance for the three background fields, with both RMSEs and ACCs very close to those of forecasts started from ERA5. Only the ACC of T2M shows some fluctuations, but it remains close to the ACC obtained with ERA5, which again indicates good model performance.

Fig. 8: Comparison of the average latitude-weighted RMSE (first and second rows) and ACC (third and fourth rows) of the 139 sets of 7-day medium-range forecasts using ERA5 and outputs of 4DVarFormer (red lines for using the 3-day forecasting background field as 4DVarFormer’s input, blue lines for the 4-day forecasting background field as the input, and yellow lines for the 5-day forecasting background field as the input).

Four surface variables (MSLP, T2M, U10, and V10) and six upper-air variables (Z500, T850, T500, R500, U500, and V500) are illustrated, using testing data from 2021 to 2022. All models are evaluated against the ERA5 reanalysis dataset.

In Fig. S9, we provide further evaluations of 4DVarFormer with the error profiles when background fields with different lead times are used. Interestingly, as illustrated in Figs. 8 and S9, the forecast bias resulting from the assimilation of the 5-day background field is the largest, yet its metrics show the smallest oscillation amplitudes throughout the forecast horizon. This phenomenon may be attributed to the discontinuities between patches in the analysis increments produced by 4DVarFormer, whereas forecasts generated by FourCastNet tend to become smoother over extended forecast durations. Consequently, using a 5-day forecast as the background field mitigates the discontinuities in the analysis field to some extent, thereby diminishing the impact of instability on subsequent forecasts.

Our experiments aimed to assess the performance of the 4DVarFormer model when assimilating once using background fields of varying forecast timing quality, to evaluate the model’s stability. As shown in Fig. S1, we observed an increasing trend in forecast errors as the lead time extended. Thus, this leads to larger biases when using background fields of different forecast timing for assimilation (see Fig. S9). This indicates that our model not only considers the impact of observations on assimilation but also accounts for the influence of background field quality. Hence, the error level of the background field in the training set influences the performance of our model. In future research, we will address the challenge of the deep dependence of neural networks on the background field distribution of the training set, thus leading to the development of more robust assimilation models.

Sensitivity experiments using different DAWs

In this section, we aim to further explore the role of 4DVar physical constraints by examining the impact of the analysis fields generated by 4DVarFormer, using different DAWs, on forecast accuracy. In our experiments, the assimilation window is left-closed and right-open, i.e., the 6-h DAW indicates that only observations at the initial moment are assimilated. The 12-h DAW indicates that only observations at the initial moment and 6 h after the initial moment are assimilated. We conducted experiments using DAWs of 6 h (without the 4DVar constraint), 12 h, 18 h, and 24 h to assess the effect of incorporating physical constraints of varying time lengths through the gradient of the 4DVar cost function. To begin with, we conducted a set of 30-day DA cycle experiments, and the results are presented in Fig. 9. As the length of the DAW increases, the analysis error decreases, indicating that including more temporal observations has a positive impact on the performance of the 4DVarFormer. Furthermore, we carried out a 7-day medium-range forecast experiment, and the results are shown in Fig. 10. The forecasts generated from analysis fields using longer DAWs exhibit lower RMSEs and higher ACCs. This demonstrates that incorporating information through the 4DVar constraints contributes positively to forecast accuracy.
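The left-closed, right-open window convention described above can be sketched in a few lines of Python. This is only an illustration: the function name `observation_offsets` is hypothetical, and the 6-h observation spacing is assumed from the temporal resolution of the dataset used in this study.

```python
def observation_offsets(daw_hours, step_hours=6):
    """Observation times (hours after t0) inside the window [0, daw_hours)."""
    # Left-closed, right-open: t0 itself is included, the window end is excluded.
    return list(range(0, daw_hours, step_hours))

for daw in (6, 12, 18, 24):
    print(daw, observation_offsets(daw))
# 6  -> [0]             (initial time only, no 4DVar constraint)
# 12 -> [0, 6]
# 18 -> [0, 6, 12]
# 24 -> [0, 6, 12, 18]  (four observation times)
```

Under this reading, the 24-h DAW uses four observation times, which matches observations indexed from t0 to t3.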

Fig. 9: Comparison of the average latitude-weighted RMSE (first and second rows) and ACC (third and fourth rows) of the 139 sets of 30-day assimilation cycles using DAWs of 6 h (blue lines), 12 h (yellow lines), 18 h (green lines), and 24 h (red lines) for four surface variables (MSLP, T2M, U10, and V10) and six upper-air variables (Z500, T850, T500, R500, U500, and V500), using testing data from 2021 to 2022.
figure 9

All models are evaluated against the ERA5 reanalysis dataset.

Fig. 10: Comparison of the average latitude-weighted RMSE (first and second rows) and ACC (third and fourth rows) of the 139 sets of 7-day medium-range forecasts initialized with ERA5 and with the outputs of 4DVarFormer (red lines for the 6-h DAW, blue lines for the 12-h DAW, green lines for the 18-h DAW, and red lines for the 24-h DAW).
figure 10

Four surface variables (MSLP, T2M, U10, and V10) and six upper-air variables (Z500, T850, T500, R500, U500, and V500) are shown, using testing data from 2021 to 2022. All models are evaluated against the ERA5 reanalysis dataset.

Tropical cyclone forecasting experiments

This section demonstrates tropical cyclone track forecasting using initial fields generated by 4DVarFormer. The purpose is to evaluate the adaptability of 4DVarFormer to extreme weather conditions and its ability to capture flow-dependent information. We located the cyclone eye at the position of the minimum MSLP. As depicted in Fig. 11, the FourCastNet model employed in this study has limited skill and struggles to predict tropical cyclone tracks accurately even when ERA5 is used as input. However, for strong tropical cyclones, forecasts started from the initial fields generated by 4DVarFormer are closer to the observed tracks. For Super Typhoon Surigae, forecasts initialized by ViT and 4DVarNet accurately predict the track within a 4-day period. However, only the forecast initialized by 4DVarFormer captures the cyclone’s change in direction beyond 4 days. This distinctive predictive capability is further confirmed in the case of Super Typhoon Chanthu. These successes can be attributed, at least in part, to the implicit flow-dependent characteristics of 4DVarFormer, which enhance its ability to initialize forecasts of extreme weather events.
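A minimal sketch of the eye-localization step is given below, assuming the MSLP field is stored as a 2D array on the 0.25° regional grid. The function name `locate_cyclone_eye`, the grid orientation, and the synthetic pressure field are illustrative assumptions. Taking the minimum over the whole domain is also an assumption; a search restricted to the neighborhood of the previous position would be an equally plausible implementation.

```python
import numpy as np

def locate_cyclone_eye(mslp, lats, lons):
    """Return (lat, lon) of the minimum of a 2D MSLP field of shape (n_lat, n_lon)."""
    i, j = np.unravel_index(np.argmin(mslp), mslp.shape)
    return float(lats[i]), float(lons[j])

# Synthetic test field on a 160 x 160 grid (assumed orientation: north to south).
lats = np.arange(50.0, 10.0, -0.25)
lons = np.arange(100.0, 140.0, 0.25)
lon_grid, lat_grid = np.meshgrid(lons, lats)

# Idealized low-pressure system centered at 20N, 130E (values in Pa).
dist2 = (lat_grid - 20.0) ** 2 + (lon_grid - 130.0) ** 2
mslp = 101000.0 - 5000.0 * np.exp(-dist2 / 8.0)

print(locate_cyclone_eye(mslp, lats, lons))  # (20.0, 130.0)
```

Applying this function to each 6-hourly forecast field gives one track point per forecast step.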

Fig. 11: Comparison of the tropical cyclone forecasting initialized by 4DVarFormer and ERA5.
figure 11

Tracking results for two strong tropical cyclones in 2021: (a) Super Typhoon Surigae (2021-2) and (b) Super Typhoon Chanthu (2021-14). The initial time is shown below each panel. The time gap between neighboring dots is 6 h. We used FourCastNet, trained on ERA5, to forecast for 5 days starting from the initial time. The black line represents the ground truth extracted from the IBTrACS dataset, while the blue line represents the forecast obtained using ERA5 as the initial field. The blue-gray, yellow, and red lines represent the forecasts obtained using initial fields generated by 4DVarNet, ViT, and 4DVarFormer, respectively.

Visualization of the attention weights

For the Transformer model, the diversity of features plays a crucial role in its ability to capture attention effectively46. We therefore investigated whether the model learns a wide range of diverse features by examining the rank of its attention maps. As an example, we extracted the attention weights from the last block of 4DVarFormer and visualized the resulting attention maps in Figs. 12 and 13. The visualization reveals that both the cross-attention and the self-attention of 4DVarFormer have attention maps of nearly full rank, which indicates that the model learns a diverse set of features. Additionally, the attention maps obtained from the different heads of the multi-head attention are all distinct, suggesting that each head captures different features. For example, in the cross-attention, the fifth head’s attention map displays localized features. This implies that the fifth head focuses more on the relationship between each embedding of the background field and the embeddings of the gradient at spatially adjacent locations.
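To make the rank criterion concrete, the sketch below computes the numerical rank of a 400 × 400 attention map with NumPy. It does not use the trained model: the "diverse" map is built from random queries and keys, and the "collapsed" map is a uniform matrix, both purely for illustration. The helper names `attention_rank` and `softmax` and the feature length of 64 are assumptions.

```python
import numpy as np

def attention_rank(attn):
    """Numerical rank of a 2D attention map (rows: queries, columns: keys)."""
    return int(np.linalg.matrix_rank(attn))

def softmax(x):
    """Row-wise softmax with the usual max subtraction for stability."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

n_tokens, dim = 400, 64  # 400 tokens as in Figs. 12 and 13; dim is illustrative

# A collapsed map: every query attends uniformly to all keys, so all rows coincide.
collapsed = np.full((n_tokens, n_tokens), 1.0 / n_tokens)

# A diverse map: rows differ because queries and keys differ.
rng = np.random.default_rng(0)
q = rng.standard_normal((n_tokens, dim))
k = rng.standard_normal((n_tokens, dim))
diverse = softmax(q @ k.T / np.sqrt(dim))

print("collapsed rank:", attention_rank(collapsed))  # 1
print("diverse rank:", attention_rank(diverse))
```

A rank close to the number of tokens means the rows of the map are close to linearly independent, which is the sense in which a nearly full-rank map reflects diverse features.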

Fig. 12: Attention map of the cross-attention module (400 × 400) from the last block of 4DVarFormer.
figure 12

The cross-attention in each head exhibits a nearly full-rank attention map, indicating that 4DVarFormer can learn a wide range of features and capture diverse relationships from background fields and gradients.

Fig. 13: Attention map of the self-attention module (400 × 400) from the last block of 4DVarFormer.
figure 13

The self-attention in each head exhibits a nearly full-rank attention map, indicating that 4DVarFormer can learn a wide range of features.

Discussion

DL-based weather forecasting models have made substantial progress in medium-range weather forecasting. Nonetheless, these models still depend on the outputs of the conventional NWP-based 4DVar method for initialization, which entails significant computational costs and impedes the real-time operationalization of DL-based weather forecasting. Recent advancements in DL-based DA models present a promising approach for nonlinear modeling and rapid assimilation. Nevertheless, current DL-based DA models encounter challenges in accurately estimating multivariate 3D initial fields due to the absence of physical constraints and the inability to capture multivariate kinetic relationships.

Recently, a new direction has emerged that enables real-time 4DVar by using DL-based forecasting models in place of the forward model and the ADM. In this context, 4DVarNet has been successfully applied to reconstruct Lorenz63/96 systems and SSH. However, the non-parallelizability and vanishing gradients of 4DVarNet, together with its expensive autoregressive strategy, make it hard to extend directly to real-world initial field estimation. This study introduces the 4DVarFormer model, which is built upon the 4DVar constraint and incorporates the strengths of the ViT model in extracting global correlations. Unlike traditional 4DVar, 4DVarFormer requires neither estimating the complex background error covariance matrix nor developing a complex ADM, which greatly reduces the difficulty of developing assimilation systems. The 4DVar-based physical constraint in 4DVarFormer improves its capability to capture multivariate relationships by establishing connections between observed and unobserved variables using the auto-differentiation of the forecast model and the attention mechanism. The effectiveness and superiority of the 4DVarFormer model in estimating multivariate 3D states are demonstrated in this work, highlighting its ability to represent the interconnected physical relationships within the atmospheric system.

In addition to integrating DL-based models with 4DVar, several studies combine DL-based forecasting models with the Ensemble Kalman Filter (EnKF) for ensemble assimilation47,48,49. These studies leverage the rapid inference capability of DL-based models to generate a large number of ensemble members. However, few related studies have applied these methods to the assimilation of 3D multivariate fields. This may be because research on ensemble forecasting techniques for DL-based forecasting is still in its early stages. Furthermore, generative adversarial networks (GANs) have also been employed to develop accurate DA models50,51. Nevertheless, owing to the instability of GAN training, further exploration is required to train a stable and reliable assimilation model for multivariate 3D complex weather systems. The 4DVarFormer proposed in this paper has a simple structure and training process. It is nevertheless promising to explore combining ensemble methods and generative models with 4DVarFormer, which would leverage the advantages of these methods and further enhance the practical utility of DL-based assimilation.

Furthermore, the 4DVarFormer model demonstrates the capability to estimate multivariate 3D initial fields in the East China region, effectively driving the pre-trained FourCastNet to generate medium-range forecasts comparable to those using ERA5 as initial fields. In addition, the black-box nature of DL-based models presents a notable challenge to their interpretability in real-time DA. While prior studies have mostly reported the results of DL-based DA, there has been limited analysis of the mechanisms underlying multivariate 3D weather state estimation in a DL context. To address this gap, we designed and executed sensitivity experiments that demonstrate the ability of 4DVarFormer to capture the wind-pressure and temperature-humidity relationships. Such sensitivity experiments are critical for advancing the understanding of the kinetic soundness of DL-based DA models. We also conducted a sensitivity experiment to evaluate the impact of different DAWs on the assimilation and forecasting performance of 4DVarFormer. The results demonstrate the positive influence of the 4DVar physical constraints. Furthermore, we assessed the performance of 4DVarFormer in the tropical cyclone track forecasting task. The results of two super typhoon tracking tests show that the tropical cyclone tracks from forecasts initialized with 4DVarFormer improve on those initialized with ERA5. This suggests that 4DVarFormer is capable of learning flow-dependent characteristics. Additionally, we visualized the attention maps in 4DVarFormer and observed that they capture a wide range of diverse features. It is important to note that 4DVarFormer is a model framework that primarily uses the Transformer model to extract information from the background field and the gradient of the 4DVar cost function. Moreover, this framework remains theoretically independent of the architecture of the forecasting model.

Further enhancements to the configuration of 4DVarFormer, as presented in this study, are warranted. Firstly, the initial fields produced by the DL-based model demonstrate some smoothing and mesh checkerboard issues, likely stemming from the model’s limited identifiable resolution compared to the data resolution. To mitigate this concern, exploring the use of GANs52,53 or diffusion models54,55 for post-processing the output results to enhance the model’s identifiable resolution is advisable. For example, the introduction of a diffusion model in the FuXi-Extreme model39 has exhibited potential in enhancing forecast accuracy for extreme weather events, potentially attributed to the improved model identifiable resolution achieved through diffusion models.

Furthermore, the training of the 4DVarFormer model requires the use of a pre-trained forecast model to calculate the gradient as an input, leading to a substantial consumption of GPU memory and hindering efficient model training and scalability. A potential solution to this challenge involves developing lightweight forecast models, similar to the use of a low-resolution TLM in the inner loop of traditional 4DVar56,57. This approach aims to optimize GPU memory utilization while ensuring the accuracy of the lightweight forecast model, thereby enabling the extension of the 4DVarFormer model to encompass additional variables, larger regions, or even global state estimation.

The effective implementation and robust performance of 3D weather state estimation using 4DVarFormer indicate its potential for practical real-world assimilation. Consequently, our findings contribute to future advancements in integrated end-to-end DL weather forecasting systems. Furthermore, the associated variational-constrained networks can be readily extended to other geoscience domains, including oceanography, demonstrating their promise as a valuable complement to the 4DVar technique.

Materials and methods

Datasets

ERA5, the fifth generation of the ECMWF reanalysis dataset, provides invaluable hourly data on surface and upper-air variables with a horizontal resolution of approximately 31 km. Covering the period from January 1940 to the present day, this extensive dataset is a comprehensive and enduring source of weather information58. Produced by an ensemble 4DVar assimilation system that integrates high-quality global observations with Integrated Forecast System results, it is widely recognized as the most comprehensive and accurate reanalysis archive available. Due to its extensive coverage and accuracy, we have chosen to utilize the ERA5 reanalysis dataset as the ground truth for our work.

We have extracted a subset of the ERA5 dataset, specifically focusing on the East China region over a 13-year period. The spatial coverage spans from 100°E to 140°E in longitude and 10°N to 50°N in latitude, with a spatial resolution of 0.25° (160 × 160 latitude-longitude grid points) and a temporal resolution of 6 h. Our efforts concentrate on assimilating observations of upper-air temperature at four levels (50, 500, 850, and 1000 hPa) and of the 10-m wind to accurately estimate five upper-air atmospheric variables at four levels and four surface variables. The upper-air atmospheric variables include geopotential (Z), temperature (T), u component of wind (U), v component of wind (V), and relative humidity (R), while the surface variables encompass 2-m temperature (T2M), 10-m u wind component (U10), 10-m v wind component (V10), and mean sea level pressure (MSLP). Consequently, our work involves the estimation, forecasting, and evaluation of a total of 24 variables (5 × 4 upper-air variables plus 4 surface variables).

We partitioned the data into training, validation, and testing sets. The training set comprises 14,600 samples, with 10 years’ worth of data from 2010 to 2019 (14,600 = 365 × 4 × 10). Similarly, the validation set consists of 1460 samples representing the year 2020 (1460 = 365 × 4 × 1), and the out-of-sample testing involves 2920 samples covering the years 2021 to 2022 (2920 = 365 × 4 × 2).

The experiments use the OSSE setting1,35,36,37, and the FourCastNet forecast model was trained with the experimental data (for details, please see Supplementary Text, Table S1 and Fig. S1). To simulate wind observations, we first converted the U10 and V10 variables from the ERA5 reanalysis to wind speed and direction. Gaussian random perturbations were then applied to the wind speed, with a standard deviation of 2 m/s for wind speeds below 20 m/s and of 10% of the wind speed for wind speeds above 20 m/s59. Subsequently, the perturbed wind field was transformed back to observations of U10 and V10. To simulate temperature observations, we added Gaussian random perturbations to the ERA5 temperature, with error standard deviations of 2 K, 0.5 K, 1.4 K, and 2.2 K for the 50, 500, 850, and 1000 hPa levels, respectively. These error levels are based on the retrieval error distribution described by Wang et al.60, rounded to one decimal place.
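The observation simulation described above might be sketched as follows with NumPy. The function names, the angle convention used for wind direction, and the handling of perturbed speeds are assumptions made for illustration. In particular, the sketch does not clip negative perturbed speeds, since the text does not say how such cases are treated.

```python
import numpy as np

# Temperature error standard deviations (K) per pressure level (hPa), as stated in the text.
TEMP_SIGMA_K = {50: 2.0, 500: 0.5, 850: 1.4, 1000: 2.2}

def perturb_wind(u10, v10, rng):
    """Perturb wind speed (not direction) and convert back to U10/V10 observations."""
    speed = np.hypot(u10, v10)
    direction = np.arctan2(v10, u10)  # mathematical angle; the convention is an assumption
    # 2 m/s below 20 m/s, 10% of the speed otherwise (continuous at 20 m/s).
    sigma = np.where(speed < 20.0, 2.0, 0.1 * speed)
    noisy_speed = speed + sigma * rng.standard_normal(speed.shape)
    return noisy_speed * np.cos(direction), noisy_speed * np.sin(direction)

def perturb_temperature(temp, level_hpa, rng):
    """Add level-dependent Gaussian noise to an ERA5 temperature field."""
    return temp + rng.normal(scale=TEMP_SIGMA_K[level_hpa], size=temp.shape)

rng = np.random.default_rng(0)
u10 = np.full((160, 160), 3.0)   # placeholder fields, not real ERA5 data
v10 = np.full((160, 160), 4.0)
u_obs, v_obs = perturb_wind(u10, v10, rng)
t500_obs = perturb_temperature(np.full((160, 160), 250.0), 500, rng)
print(u_obs.shape, v_obs.shape, t500_obs.shape)
```

Because only the speed is perturbed, the simulated wind vectors stay parallel to the true ones.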

4DVarFormer: the multivariate 3D weather state estimation model

Inspired by dynamical modeling applications of ViT variants2,3,6 and considering the physical consistency of the weather system, we established 4DVarFormer, a computationally efficient, 4DVar-constrained, attention-based neural network model38, for multivariate 3D weather initial field estimation. It takes the multivariate 3D background \({{\bf{x}}}^{b}({t}_{0})\) at time \({t}_{0}\) and observations of 10-m wind and temperature at four pressure levels (\({\bf{y}}({t}_{0}),\ldots ,{\bf{y}}({t}_{{N}_{f}})\), where \({N}_{f}=3\) is the index of the last observation time, giving four observation times in the 24-h assimilation window) as inputs to generate accurate initial fields \({{\bf{x}}}^{a}({t}_{0})\). The theoretical foundation of 4DVarFormer follows the cost function of the 4DVar method:

$${{\mathcal{J}}}_{4DVar}=\frac{1}{2}\parallel {{\bf{x}}}^{b}({t}_{0})-{\bf{x}}({t}_{0}){\parallel }_{{{\bf{B}}}^{-1}}^{2}+\frac{1}{2}\mathop{\sum }\limits_{k=0}^{{N}_{f}}\parallel {\bf{y}}({t}_{k})-{{\mathcal{H}}}_{k}({{\mathcal{M}}}_{{t}_{0}\to {t}_{k}}({\bf{x}}({t}_{0}))){\parallel }_{{{\bf{R}}}^{-1}}^{2},$$
(1)

where \({{\mathcal{M}}}_{{t}_{0}\to {t}_{k}}\) is the pre-trained FourCastNet model that predicts the state \({\bf{x}}({t}_{k})\) at time \({t}_{k}\) from the state \({\bf{x}}({t}_{0})\) at time \({t}_{0}\) (for details on training FourCastNet, please see Supplementary Text and Table S1). \({{\mathcal{H}}}_{k}\) represents the observation operator at time \({t}_{k}\). B and R are the background and observation error covariance matrices, respectively, which are assumed to be constant.

In the case of propagating the initial perturbation using a TLM, the expression for the analysis increment can be derived as follows:

$${{\Delta }}{\bf{x}}({t}_{0})=-2{\left({\bf{I}}+{\bf{B}}\mathop{\sum }\limits_{k=0}^{{N}_{f}}{{\bf{M}}}_{{t}_{0}\to {t}_{k}}^{T}{{\bf{H}}}_{k}^{T}{{\bf{R}}}^{-1}{{\bf{H}}}_{k}{{\bf{M}}}_{{t}_{0}\to {t}_{k}}\right)}^{-1}{\bf{B}}{\left({\nabla }_{{{\bf{x}}}^{b}({t}_{0})}{{\mathcal{J}}}_{4DVar}^{O}({{\bf{x}}}^{b}({t}_{0}))\right)}^{T},$$
(2)

where \({{\bf{M}}}_{{t}_{0}\to {t}_{k}}\) is the tangent linear version of \({{\mathcal{M}}}_{{t}_{0}\to {t}_{k}}\) and \({{\bf{H}}}_{k}\) is the tangent linear version of \({{\mathcal{H}}}_{k}\). \({{\mathcal{J}}}_{4DVar}^{O}=\frac{1}{2}\mathop{\sum }\nolimits_{k=0}^{{N}_{f}}\parallel {\bf{y}}({t}_{k})-{{\mathcal{H}}}_{k}({{\mathcal{M}}}_{{t}_{0}\to {t}_{k}}({\bf{x}}({t}_{0}))){\parallel }_{{{\bf{R}}}^{-1}}^{2}\) denotes the observation term of the 4DVar cost function (for detailed derivations, please see Supplementary Text).

4DVarFormer is constructed based on this theory. Unlike traditional 4DVar systems, 4DVarFormer acts as a non-iterative 4DVar solver: a neural network learns the operation \(-2{\left({\bf{I}}+{\bf{B}}\mathop{\sum }\nolimits_{k=0}^{{N}_{f}}{{\bf{M}}}_{{t}_{0}\to {t}_{k}}^{T}{{\bf{H}}}_{k}^{T}{{\bf{R}}}^{-1}{{\bf{H}}}_{k}{{\bf{M}}}_{{t}_{0}\to {t}_{k}}\right)}^{-1}{\bf{B}}\). This operation involves features of both the background field and the ADM. Since the features of the ADM can be derived from the gradient, the 4DVarFormer model takes the background field and the gradient of \({{\mathcal{J}}}_{4DVar}^{O}\) as inputs. This process yields initial fields that are consistent with the trajectory information of the weather states (see Fig. 14a). Notably, 4DVarFormer does not need to estimate the covariance matrix B, while the covariance matrix R is set as a diagonal matrix:

$${\bf{R}}=\left[\begin{array}{llllll}{\sigma }_{T50}^{2}{{\bf{I}}}_{T50}&&&&&\\ &{\sigma }_{T500}^{2}{{\bf{I}}}_{T500}&&&&\\ &&{\sigma }_{T850}^{2}{{\bf{I}}}_{T850}&&&\\ &&&{\sigma }_{T1000}^{2}{{\bf{I}}}_{T1000}&&\\ &&&&{\sigma }_{U10}^{2}{{\bf{I}}}_{U10}&\\ &&&&&{\sigma }_{V10}^{2}{{\bf{I}}}_{V10}\\ \end{array}\right]\,,$$
(3)

where \({\sigma }_{variable}\) (\(variable\in \{T50,T500,T850,T1000,U10,V10\}\)) represents the standard deviation of the variable’s observation error and \({{\bf{I}}}_{variable}\) denotes the identity matrix whose dimension equals the number of observation points of that variable (25,600 in this study).

Fig. 14: 4DVarFormer for 3D-multivariate weather state estimation.
figure 14

a Architecture of 4DVarFormer, a 4DVar-constrained Transformer model. The model takes the background field (\({{\bf{x}}}^{b}({t}_{0})\)) and observations (\({\bf{y}}({t}_{0}),{\bf{y}}({t}_{1}),\ldots ,{\bf{y}}({t}_{3})\)) as input and outputs the analysis increment (\({{\Delta }}{\bf{x}}({t}_{0})\)). Then, the increment is added to the background field to get the final analysis field. b 4DVar Gradient Block, a neural implementation of the gradient of the 4DVar cost function based on a pre-trained FourCastNet. c Assimilation Network, a neural network that directly provides the analysis increment \({{\Delta }}{\bf{x}}({t}_{0})\).

The model architecture of the 4DVarFormer consists of two main components: the 4DVar gradient block and the assimilation network. First, in the 4DVar gradient block, the background field \({{\bf{x}}}^{b}({t}_{0})\) is fed into the pre-trained FourCastNet to predict the next \({N}_{f}\) states of the weather on the grid (\({{\bf{x}}}^{f}({t}_{1}),\cdots \,,{{\bf{x}}}^{f}({t}_{{N}_{f}})\); in our study \({N}_{f}=3\)). Then, as shown in Fig. 14b, the observation term \({{\mathcal{J}}}_{4DVar}^{O}({{\bf{x}}}^{b}({t}_{0}))\) of the 4DVar cost function is computed from the prior states (\({{\bf{x}}}^{b}({t}_{0}),{{\bf{x}}}^{f}({t}_{1}),\cdots \,,{{\bf{x}}}^{f}({t}_{{N}_{f}})\)) and the observations (\({\bf{y}}({t}_{0}),\cdots \,,{\bf{y}}({t}_{{N}_{f}})\)). Using PyTorch’s automatic differentiation, we directly obtain the gradient \({\nabla }_{{{\bf{x}}}^{b}({t}_{0})}{{\mathcal{J}}}_{4DVar}^{O}({{\bf{x}}}^{b}({t}_{0}))\) of this cost function.
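The following toy example may help illustrate what the gradient block computes. It is not the 4DVarFormer implementation: instead of differentiating through FourCastNet with PyTorch, it assumes a small linear forecast model M and a linear observation operator H (both hypothetical), for which the gradient of the observation term can be written by hand with the transposed (adjoint) operators and checked against finite differences. All names and sizes are illustrative.

```python
import numpy as np

def obs_cost(x0, ys, M, H, r_inv):
    """Observation term: 1/2 * sum_k (y_k - H M^k x0)^T R^-1 (y_k - H M^k x0), diagonal R."""
    cost, x = 0.0, x0.copy()
    for y in ys:
        d = y - H @ x                 # departure at time t_k
        cost += 0.5 * d @ (r_inv * d)
        x = M @ x                     # advance the toy model by one step
    return cost

def obs_cost_grad(x0, ys, M, H, r_inv):
    """Gradient with respect to x0: -sum_k (M^k)^T H^T R^-1 (y_k - H M^k x0)."""
    x, departures = x0.copy(), []
    for y in ys:                      # forward sweep: store the departures
        departures.append(y - H @ x)
        x = M @ x
    adj = np.zeros_like(x0)
    for d in reversed(departures):    # backward sweep with the adjoint M^T
        adj = M.T @ adj - H.T @ (r_inv * d)
    return adj

rng = np.random.default_rng(0)
n, m, n_times = 6, 3, 4                       # state size, obs size, t0..t3
M = np.eye(n) + 0.1 * rng.standard_normal((n, n))
H = np.eye(n)[:m]                             # observe the first m components
r_inv = 1.0 / np.array([2.0, 0.5, 1.4]) ** 2  # inverse observation error variances

# Noise-free observations generated from a hidden "true" initial state.
x_true = rng.standard_normal(n)
ys, x = [], x_true.copy()
for _ in range(n_times):
    ys.append(H @ x)
    x = M @ x

x_b = x_true + 0.5 * rng.standard_normal(n)   # imperfect background
grad = obs_cost_grad(x_b, ys, M, H, r_inv)
print(grad)
```

With a nonlinear neural forecast model, the hand-written backward sweep is what reverse-mode automatic differentiation provides, which is why no separate ADM has to be developed.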

Subsequently, the gradient \({\nabla }_{{{\bf{x}}}^{b}({t}_{0})}{{\mathcal{J}}}_{4DVar}^{O}({{\bf{x}}}^{b}({t}_{0}))\) and the background field \({{\bf{x}}}^{b}({t}_{0})\) are used as the inputs of the assimilation network (Fig. 14c) to generate the analysis increment \({{\Delta }}{\bf{x}}({t}_{0})\). Recently, the ViT38 and its variants, which learn the correlations in the input data through the attention mechanism, have demonstrated remarkable performance in various computer vision and complex dynamical systems modeling tasks. The attention mechanism is calculated as follows:

$$\,\text{Attention}\,({\bf{Q}},{\bf{K}},{\bf{V}})=\text{softmax}\left(\frac{{\bf{Q}}{{\bf{K}}}^{T}}{\sqrt{D}}\right){\bf{V}},$$
(4)

where Q, K, and V represent the query, key, and value features, respectively, and D denotes the feature length. To reduce the spatial dimensions of the inputs and accelerate training, patch embedding layers divide the gradient and the background field separately into 8 × 8 patches, with each patch projected to a feature vector. The attention-based assimilation network then processes the embedded data, and a simple linear projection layer produces the prediction. The assimilation network comprises L stacked blocks (in our work, L = 4), each formed sequentially by a multi-head cross-attention (MHCA), a feedforward block, a multi-head self-attention (MHSA), and another feedforward block. Each MHCA takes the linear projection of the background feature as the query feature, while the linear projections of the output from the previous block serve as the key and value features. The final analysis field is computed as the sum of the background field and the analysis increment.
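As a rough NumPy illustration of the patch embedding and a single cross-attention head, consider the sketch below. It is not the trained network. The random fields and weights, the feature length of 32, the use of a single head, and the choice of gradient tokens as keys and values are all assumptions. The softmax normalization follows the standard ViT formulation.

```python
import numpy as np

def patchify(field, patch=8):
    """Split a (C, H, W) field into (H/patch * W/patch) flattened patches."""
    c, h, w = field.shape
    x = field.reshape(c, h // patch, patch, w // patch, patch)
    x = x.transpose(1, 3, 0, 2, 4)  # (n_h, n_w, C, patch, patch)
    return x.reshape((h // patch) * (w // patch), c * patch * patch)

def cross_attention(query_tokens, context_tokens, wq, wk, wv):
    """Single-head scaled dot-product attention; returns the output and the weights."""
    q, k, v = query_tokens @ wq, context_tokens @ wk, context_tokens @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
background = rng.standard_normal((24, 160, 160))   # placeholder for the background field
gradient = rng.standard_normal((24, 160, 160))     # placeholder for the cost-function gradient

bg_tokens = patchify(background)                   # (400, 24 * 8 * 8)
grad_tokens = patchify(gradient)

d_in, d_model = bg_tokens.shape[1], 32             # d_model is illustrative
wq, wk, wv = (rng.standard_normal((d_in, d_model)) / np.sqrt(d_in) for _ in range(3))

# Background tokens act as queries; gradient tokens provide keys and values.
out, weights = cross_attention(bg_tokens, grad_tokens, wq, wk, wv)
print(bg_tokens.shape, weights.shape, out.shape)   # (400, 1536) (400, 400) (400, 32)
```

The 160 × 160 grid with 8 × 8 patches yields 20 × 20 = 400 tokens, which matches the 400 × 400 attention maps shown in Figs. 12 and 13.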

Model training strategy

This section outlines the training process for DL-based DA models (4DVarFormer, 4DVarNet, and ViT). The training step involves supervised training and optimizing the DL-based DA models to estimate the initial fields using the training dataset. The loss function used is \({{\mathcal{L}}}_{1}\) loss, which is defined as follows:

$${{\mathcal{L}}}_{1}=\frac{1}{C\times H\times W}\mathop{\sum }\limits_{c=1}^{C}\mathop{\sum }\limits_{i=1}^{H}\mathop{\sum }\limits_{j=1}^{W}\vert {\hat{{\bf{X}}}}_{c,i,j}-{{\bf{X}}}_{c,i,j}\vert ,$$
(5)

where C, H, and W are the number of channels and the numbers of grid points in the latitude and longitude directions, respectively. c, i, and j are the indices of the channel, latitude, and longitude coordinates, respectively. \({\hat{{\bf{X}}}}_{c,i,j}\) and \({{\bf{X}}}_{c,i,j}\) are the model estimate and the ground truth for the same variable and location (latitude and longitude coordinates) at the initial time. The \({{\mathcal{L}}}_{1}\) loss is averaged over all grid points and variables.
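For concreteness, Eq. (5) amounts to a mean absolute error over channels and grid points. A minimal NumPy sketch is given below, with the hypothetical helper name `l1_loss` and placeholder arrays of shape (C, H, W) = (24, 160, 160).

```python
import numpy as np

def l1_loss(estimate, truth):
    """Mean absolute error over all channels and grid points, as in Eq. (5)."""
    return float(np.abs(estimate - truth).mean())

truth = np.zeros((24, 160, 160))   # placeholder ground truth (C, H, W)
estimate = truth + 0.5             # placeholder estimate with a constant offset
print(l1_loss(estimate, truth))    # 0.5
```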

All the DL-based DA models in the study were developed using the PyTorch framework. Specifically, the 4DVarFormer and ViT models were trained on a single NVIDIA V100 GPU for 100 epochs using the AdamW optimizer with β1 = 0.9 and β2 = 0.95 and an initial learning rate of 5e−4. The 4DVarNet model was trained on two NVIDIA V100 GPUs with the Adam optimizer, as described in the original paper35, and its number of autoregressive iteration steps was set to 10. All of these models were trained with a batch size of 64 and cosine learning rate decay. After each training epoch, we evaluated the \({{\mathcal{L}}}_{1}\) loss of the model estimate on the validation set and saved the model parameters with the minimum loss (for more details of the training hyperparameters, the evaluation metrics, the configurations of all neural networks, and the flow chart of the evaluation framework, please see Supplementary Text, Tables S2–S4 and Figs. S2–S4). All models were trained using only background fields with a 3-day forecast lead time together with 10-m wind and upper-air temperature observations, and all relevant tests were performed on the trained models.
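The cosine learning rate decay mentioned above can be illustrated with a short, dependency-free Python function. The per-epoch update, the final learning rate of zero, and the absence of warm-up are assumptions, since the text gives only the initial learning rate (5e−4) and the number of epochs (100).

```python
import math

def cosine_lr(epoch, total_epochs=100, base_lr=5e-4, min_lr=0.0):
    """Cosine decay from base_lr at epoch 0 to min_lr at epoch total_epochs."""
    progress = epoch / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for epoch in (0, 25, 50, 75, 100):
    print(epoch, cosine_lr(epoch))
```

Under these assumptions the learning rate falls to half its initial value (2.5e−4) midway through training.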