Introduction

Climate profoundly impacts social and economic activities, making high-resolution climate prediction a crucial research area. Radar reflectivity data serve as fundamental information for analyzing and forecasting catastrophic weather events: higher radar resolution yields more detailed image structures, enabling earlier detection and warning of severe weather. Downscaling refers to making estimations at a finer spatial scale than that of the original datasets, with the aim of enhancing information and detail. Within the climate sciences, downscaling methods generally fall into two categories: dynamical climate downscaling models and statistical downscaling methods. Recent studies have applied Super-Resolution (SR) techniques to the climate downscaling problem. SR1,2,3 has long been a prominent topic in computer vision, aiming to generate high-resolution images from low-resolution inputs. Various SR methods have been introduced for reconstructing high-resolution images, with applications spanning diverse fields such as medical diagnostics4, transportation5 and the enhancement of precipitation data in climatology. SR models are mainly built from Convolutional Neural Networks (CNNs), which reconstruct High-Resolution (HR) images from Low-Resolution (LR) inputs and constitute a significant improvement over traditional interpolation methods. Equipped with a large number of learnable weights, CNNs can effectively model and capture the complex spatial patterns embedded in LR images. However, these algorithms primarily optimize the mean squared error, so the generated images tend to lack high-frequency content. Moreover, convolution is a local operation, which hinders the establishment of global dependencies and limits overall model performance. Very recently, Transformer-based models have been proposed for the SR problem. The central component of the Transformer is the Self-Attention (SA) mechanism6, which facilitates the establishment of global dependencies and thus alleviates the limitations of CNN-based algorithms. Several studies have further shown how the high complexity of global self-attention in Transformers can be mitigated.

Moreover, utilizing multiple types of information in deep learning, known as multimodal learning, is drawing growing interest7. With the rapid growth in the types of indicators available in the meteorological field, deep learning-based climate downscaling should also address the problem of multimodal learning. Radar data are more complex than standard images owing to their multiscale, multidimensional nature and susceptibility to factors such as illumination, cloud cover and sensor noise. Dual polarization radar provides rich polarimetric information, including the horizontal reflectivity factor \(\text {Z}_\text {H}\) indicating precipitation intensity, \(\text {Z}_\text {DR}\) representing the difference between horizontal and vertical echo intensity, and \(\text {K}_\text {DP}\) measuring the liquid water content of the atmosphere8,9. The patterns of \(\text {Z}_\text {DR}\), \(\text {K}_\text {DP}\) and \(\text {Z}_\text {H}\) reveal distinct characteristics of raindrop size and distribution, which may change dramatically during different evolution stages of storms and thus provide information about storm evolution. Inspired by this behaviour of polarimetric radar, we propose to integrate dual polarization radar data into a unified climate image to enhance model training. In this paper, we therefore propose a new model, the Climate Downscaling Dual Aggregation Transformer (CDDAT), which exploits multiple input variables to achieve more precise climate downscaling. Specifically, we first construct a multimodal fusion structure that uses CNNs to merge multiple radar variables (\(\text {Z}_\text {DR}\), \(\text {K}_\text {DP}\) and \(\text {Z}_\text {H}\)) into climate images. Shallow feature extraction is then implemented by a convolutional layer that maps the low-resolution input to latent vectors. Furthermore, we design a “CNN+Transformer” pattern for deep feature extraction, which is divided into two parts: a Lightweight CNN Backbone (LCB) and a DATB. On the one hand, the LCB reduces the size of intermediate feature maps and dynamically adjusts the feature-map size to extract high-frequency features while maintaining network depth. Its basic feature extraction unit is the Adaptive Residual Feature Block (ARFB)10, which adaptively adjusts the residual weights; the LCB thereby significantly reduces computational resource consumption and the number of parameters while enabling efficient visual feature extraction. On the other hand, we design the DATB to aggregate spatial and channel features by applying spatial self-attention and channel self-attention alternately in successive Transformer blocks. Because the DATB's self-attention uses shifted-window operations, it not only efficiently learns the relationships between similar local blocks but also captures additional spatial and channel information, providing more references for the super-resolved region. Finally, we use a pixel shuffle (pixel rearrangement) layer for image upscaling to achieve image reconstruction. Figure 1 gives the flowchart of our method. 
First, the multi-radar data are extracted from the source dataset, outliers are processed, and samples are selected according to the experimental standards. The multi-radar data are then preprocessed, including normalization, fused using a CNN, and fed into the proposed model as input. The LCB and DATB of our model transform the input from shallow features into deep features, and the HR image is reconstructed by the pixel shuffle layer. The main contributions are as follows:

  • We propose a CDDAT model based on dual aggregation Transformer to generate high-resolution climate predictions. To our knowledge, it is the first attempt to adopt the Transformer model in the SR domain to realize climate downscaling.

  • We propose a multimodal fusion method based on CNN modules that fuses data from multiple radar metrics. The dual polarization radar variables (\(\text {K}_\text {DP}\) and \(\text {Z}_\text {DR}\)) are fused as multimodal model input, which can mine dynamic structural information about convective precipitation beyond what a single variable provides.

  • We conduct extensive comparative experiments on the NJU-CPOL dataset to verify the effectiveness and superiority of CDDAT.

Fig. 1
figure 1

The flowchart of the proposed method.

Related work

Statistical Climate Downscaling (SD) is a classic tool for obtaining small-scale climate information11. The method relies on statistical relationships between General Circulation Model (GCM) outputs and regional historical observations, assumes that these relationships remain constant, and defines the predictand as a function of the predictors. The approach consists of two main components. First, empirical relationships between large-scale climate factors (predictors) and local climate factors (predictands) must be discovered and established12,13. Second, these empirical relationships are applied to outputs from global or regional models, including precipitation, temperature or barometric pressure. The methods for implementing SD tasks have gradually been enriched: regression models have been widely used to fit probability distributions14,15, and the non-homogeneous Hidden Markov Model (NHMM)16 was introduced to address the rainfall problem17,18. However, SD methods rely on statistical analysis of long historical records and obtain finer spatial information at the cost of ignoring spatial and temporal inconsistencies; their application is also limited in areas lacking measured data. In recent years, more and more climate models have adopted super-resolution techniques for further analysis of climate datasets. Deep learning algorithms for precipitation forecasting in the meteorological field mainly utilize CNNs and Generative Adversarial Networks (GANs). For example, DeepSD19 was first applied to the spatial downscaling task, integrating complex precipitation data into a single image using computer vision methods and leveraging the latent information of precipitation variables; it also uses topographic data as an input channel to account for geographic influences. Kumar et al.20 then obtained the optimal spatial distribution of rainfall magnitude over India based on DeepSD downscaling results. Models in the super-resolution domain use series of residual blocks and attention mechanisms to progressively alleviate the local receptive-field limitations of CNNs, and climate downscaling has closely followed these advances. Cheng et al.21 introduced a residual dense block into the LapSRN network, collecting high-resolution results at multiple scales from the corresponding levels, and conducted a detailed study of checkerboard-artifact elimination in the parameters of the deconvolutional layers. Sharma et al.22 introduced ResDeepD with a series of skip connections across residual blocks, yielding faster output and better results. Chiang et al.23 proposed networks with skip connections, attention blocks and auxiliary-data cascades for bias-corrected heterogeneous precipitation simulation data.

Recent climate downscaling methods include SRGAN-based models such as DeepDT24, PhIRE GAN25 and ProGAN26, in which a perceptual loss is designed to improve the subjective quality of reconstructed images; these have been shown to outperform other types of models, such as Augmented Convolutional Long Short-Term Memory (ConvLSTM) and U-Net27. Li et al.28 proposed an unsupervised model-guided coarse-to-fine fusion model for the hyperspectral image super-resolution task. By fusing deep image priors with degradation-model information, Li et al.29 presented a model-informed unsupervised method for the hyperspectral image super-resolution problem. Yao et al.30 built a multi-graph neural network for hyperspectral image classification. Ding et al.31 constructed a multi-scale receptive field GAT to classify hyperspectral images. Chen et al.32 designed a neural network to detect hyperspectral image changes. Recently, SCNet33 has emerged as a remarkable model, adept at extracting intricate structural information from data, which significantly enhances performance in tasks such as feature extraction and pattern recognition. Concurrently, OSEDiff34, a cutting-edge generative model, proficiently integrates multiple types of information to generate realistic, high-quality outputs, demonstrating great potential for complex data generation challenges.

However, the above models all select individual rainfall data or add terrain, temperature and other climate factors to assist downscaling. Although relevant data are added during training, these models lack the ability to reassign the importance of multiple variables. Radar indicators are subject to the inherent defects of radar systems, and the data fed into the model sometimes contain noise or unexpected deviations; the presence of low-quality variables reduces prediction accuracy. In contrast, our model can adaptively reassign the importance of multiple variables; that is, it decreases the weights of features that negatively affect the prediction and increases the weights of features that contribute positively.

Materials

Data

This study uses data from a C-band dual polarization weather radar operated by Nanjing University. The dataset (termed NJU-CPOL) was collected during 2014–2019 and covers 268 precipitation events. The original radar base data (\(\text {Z}_\text {H}\), \(\text {Z}_\text {DR}\) and \(\text {K}_\text {DP}\)) were obtained from NJU-CPOL for training, validation and testing. During model training, we use a vertical flip as data augmentation for the radar data. As shown in Fig. 2, bilinear interpolation is used to downsample each HR image to obtain the corresponding LR image: for the 2\(\times\) super-resolution experiment, the height and width of the downsampled LR image are 1/2 of those of the original HR image, while for the 4\(\times\) experiment they are 1/4 of the original. In addition, to increase the quality and diversity of the dataset and to verify the algorithm under rainfall conditions, samples in which part of the radar reflectivity information is invalid are discarded when constructing the training set. Note that each pixel of an image records the climate level at a certain time.
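
As a minimal illustration of the downsampling in Fig. 2, the LR input can be produced from an HR field with bilinear interpolation in PyTorch; the tensor shapes and the helper name make_lr are illustrative rather than part of the released code.

```python
import torch
import torch.nn.functional as F

def make_lr(hr: torch.Tensor, scale: int) -> torch.Tensor:
    """Downsample an HR radar field (B, C, H, W) by `scale` using bilinear interpolation."""
    return F.interpolate(hr, scale_factor=1.0 / scale, mode="bilinear", align_corners=False)

hr = torch.rand(1, 3, 256, 256)   # hypothetical HR tile with Z_H, Z_DR and K_DP channels
lr_x2 = make_lr(hr, 2)            # 2x experiment: height and width become 1/2 -> (1, 3, 128, 128)
lr_x4 = make_lr(hr, 4)            # 4x experiment: height and width become 1/4 -> (1, 3, 64, 64)
```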

Fig. 2
figure 2

Schematic diagram of downsampling process of rainfall image.

Data preprocessing

In this paper, a randomization strategy is adopted so that the data order differs after each shuffle, ensuring that the combination of training samples is sufficiently random. We apply 0–1 normalization, which maps the raw data into [0,1] through a linear transformation. The calculation is given by:

$$\begin{aligned} {{\textbf {X}}}^{*} = \frac{{\textbf {X}}-{{\textbf {X}}}_{min}}{{\textbf {X}}_{max}-{\textbf {X}}_{min}} \end{aligned}$$
(1)

where \(\textbf{X}_{min}\) and \({\textbf {X}}_{max}\) are the minimum and maximum values of the input data, respectively. Normalization ensures that the input data have a consistent distribution, which improves the training efficiency and performance of the model. In addition, some of the original radar data are treated as outliers: part of the radar reflectivity information is invalid, contributes little to the radar-image downscaling task and aggravates the imbalance of the dataset. For example, if the radar echoes of a single sample are all below 5 dBZ, these clutter data are discarded when forming the training set. In practice, especially in meteorology, the amount of data collected by radar observation stations is limited. Data augmentation transforms images or other types of data algorithmically, thereby increasing the diversity and quantity of the data. In this paper, a vertical flip is adopted to increase the number of valid samples, which both helps avoid overfitting and enhances the model's ability to predict and adapt to unseen data.
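
The two preprocessing operations above can be sketched in a few lines of PyTorch; the epsilon guard and the tensor shapes are assumptions added for this example, not part of the original pipeline.

```python
import torch

def min_max_normalize(x: torch.Tensor) -> torch.Tensor:
    """Eq. (1): map raw radar values linearly into [0, 1]."""
    x_min, x_max = x.min(), x.max()
    return (x - x_min) / (x_max - x_min + 1e-8)   # epsilon avoids division by zero on flat fields

def vertical_flip(x: torch.Tensor) -> torch.Tensor:
    """Data augmentation: flip a (C, H, W) sample along the height axis."""
    return torch.flip(x, dims=[-2])

sample = torch.rand(3, 256, 256) * 60.0           # hypothetical raw reflectivity-like values
sample = min_max_normalize(sample)
augmented = vertical_flip(sample)
```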

Model implementation

We implemented our model in the PyTorch framework. There are a total of 7310 precipitation data samples, randomly split into 80% for training, 10% for validation and 10% for testing. Training is optimized by the Adam optimizer35 with default settings and a learning rate of \(10^{-4}\), and was performed on an NVIDIA GeForce RTX 2080Ti. CDDAT can be optimized with the loss functions commonly used for SR, such as the \(L_2\)1,36,37, \(L_1\)38,39,40 and perceptual losses41,42. For simplicity, given N ground-truth HR images \(\{{\textbf {X}}_{t, i}\}_{i=1}^{N}\), we optimize the parameters of CDDAT by minimizing the pixel-wise \(L_1\) loss between the reconstructed images \({\textbf {X}}_{h,i}\) and the ground truth:

$$\begin{aligned} L_1 =\frac{1}{N} \sum _{i=1}^{N} \Vert {\textbf {X}}_{h,i}-{\textbf {X}}_{t,i} \Vert _{1} \end{aligned}$$
(2)
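
A minimal sketch of this training setup is given below; the placeholder network and synthetic batch only stand in for CDDAT and the NJU-CPOL loader, and are not the released implementation.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1))       # placeholder standing in for CDDAT
criterion = nn.L1Loss()                                     # pixel-wise L1 loss of Eq. (2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # Adam with default settings, lr = 1e-4

lr_batch = torch.rand(4, 3, 64, 64)                         # synthetic LR batch (B, C, H, W)
hr_batch = torch.rand(4, 3, 64, 64)                         # matching ground-truth batch X_t

sr_batch = model(lr_batch)                                  # reconstructed images X_h
loss = criterion(sr_batch, hr_batch)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```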

Methods

This section describes the proposed CDDAT. We first introduce the image preprocessing and downsampling procedures, and then elaborate on the core components of CDDAT: the Lightweight CNN Backbone (LCB) and the Dual Aggregation Transformer Backbone (DATB). These parts are introduced in the following subsections.

Fig. 3
figure 3

The fusion process of the multi-mode input image.

CNN for image preprocessing

As mentioned above, dual-polarization radar provides several measurement dimensions for weather forecasting. In this paper, the reflectivity factor (\(\text {Z}_\text {H}\)), differential reflectivity factor (\(\text {Z}_\text {DR}\)) and differential phase factor (\(\text {K}_\text {DP}\)) are used. A standard color image comprises three channels (red, green and blue), each representing one feature of the picture. Inspired by this structure, we populate each image channel with one radar variable to obtain a multi-channel climate image as model input. The generation of the model input proceeds as follows. First, we obtain the \(\text {Z}_\text {H}\), \(\text {Z}_\text {DR}\) and \(\text {K}_\text {DP}\) data from the NJU-CPOL dataset; after outlier processing of the raw radar data, a standard rainfall dataset is generated, with the precipitation of each sample recorded over the region. Then, we treat each radar indicator as a channel to obtain the three-channel climate image Ic. As shown in Fig. 3, \(\text {Z}_\text {H}\), \(\text {Z}_\text {DR}\) and \(\text {K}_\text {DP}\) are transformed from the radar measurements and fed into a 1\(\times\)1 convolution layer to produce the three-channel input of the proposed model. Transforming the radar measurements into multiple channels in this way reduces the complexity of the radar image input to the model and captures a richer set of non-linear features in the climate image.
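
A minimal sketch of this channel-stacking and 1\(\times\)1 fusion step is shown below; the module name and tensor sizes are illustrative.

```python
import torch
import torch.nn as nn

class RadarFusion(nn.Module):
    """Stack Z_H, Z_DR and K_DP as channels and mix them with a 1x1 convolution."""
    def __init__(self, out_channels: int = 3):
        super().__init__()
        self.mix = nn.Conv2d(3, out_channels, kernel_size=1)

    def forward(self, z_h, z_dr, k_dp):
        ic = torch.stack([z_h, z_dr, k_dp], dim=1)   # three-channel climate image Ic: (B, 3, H, W)
        return self.mix(ic)                          # channel mixing before the SR network

z_h, z_dr, k_dp = (torch.rand(2, 128, 128) for _ in range(3))
fused = RadarFusion()(z_h, z_dr, k_dp)               # -> (2, 3, 128, 128)
```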

Fig. 4
figure 4

The network architecture of our method.

Network structure

As shown in Fig. 4, the aim of this work is to obtain a high-resolution result \({{\textbf {I}}}_{SR}\) from a given low-resolution input \({{\textbf {I}}}_{LR}\). This can be written as \({{\textbf {I}}}_{SR}=Model({{\textbf {I}}}_{LR};\varvec{\theta })\), where \(\varvec{\theta }\) denotes the model parameters. The proposed CDDAT comprises three modules: shallow feature extraction, deep feature extraction and image reconstruction. Shallow feature extraction is performed by a convolution layer. Deep feature extraction is accomplished by the LCB and the DATB, where the LCB consists of a group of HPBs and the DATB is mainly composed of a Dual Spatial Transformer Backbone (DSTB) and a Dual Channel Transformer Backbone (DCTB) with global residual learning. Image reconstruction is achieved with a PixelShuffle layer. Initially, we employ a convolution layer on the LR input to generate the shallow feature \({{\textbf {F}}}_{0}\).

$$\begin{aligned} {{\textbf {F}}}_{0} = {f}_{s}({{\textbf {I}}}_{LR}) \end{aligned}$$
(3)

where \({f}_{s}\) denotes the shallow feature extraction layer and \({{\textbf {F}}}_{0}\) is the extracted shallow feature, which is then used as the input of the LCB with several High Preserving Blocks (HPBs).

$$\begin{aligned} {{\textbf {F}}}_{n} = {f}^{n}_{HPB} ({f}^{n-1}_{HPB}(...({f}^{1}_{HPB}({{\textbf {F}}}_{0})))) \end{aligned}$$
(4)

where \({f}^{n}_{HPB}\) denotes the mapping of the \(n\)-th HPB and \({{\textbf {F}}}_{n}\) represents the output of the \(n\)-th HPB. The outputs of all HPBs are then aggregated by the DATB:

$$\begin{aligned} {{\textbf {F}}}_{D} = {f}^{n}_{D} ({f}^{n-1}_{D}(...({f}^{1}_{D}[{{\textbf {F}}}_{1},{{\textbf {F}}}_{2},...,{{\textbf {F}}}_{n}]))) \end{aligned}$$
(5)

where \({{\textbf {F}}}_{D}\) is the output of DATB and \({f}_{D}\) stands for the operation of DATB. Finally, \({{\textbf {F}}}_{D}\) and \({{\textbf {F}}}_{0}\) are simultaneously fed into the reconstruction module to get the SR image \({{\textbf {I}}}_{SR}\).

$$\begin{aligned} {{\textbf {I}}}_{SR} = f({f}_{P} (f({{\textbf {F}}}_{D}+{{\textbf {F}}}_{0}))) \end{aligned}$$
(6)

where f and \({f}_{P}\) stand for the convolutional layer and PixelShuffle layer, respectively.
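
The overall pipeline of Eqs. (3)–(6) can be sketched structurally as follows; the HPB and DATB bodies are replaced by simple convolutional stand-ins, and the widths, block counts and class name are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CDDATSkeleton(nn.Module):
    """Structural sketch: shallow conv (Eq. 3) -> HPB chain (Eq. 4) -> DATB (Eq. 5) -> reconstruction (Eq. 6)."""
    def __init__(self, channels=3, width=64, scale=2, n_hpb=3):
        super().__init__()
        self.shallow = nn.Conv2d(channels, width, 3, padding=1)     # f_s
        self.hpbs = nn.ModuleList(
            [nn.Sequential(nn.Conv2d(width, width, 3, padding=1), nn.ReLU()) for _ in range(n_hpb)]
        )                                                           # stand-ins for the HPBs
        self.datb = nn.Conv2d(width * n_hpb, width, 1)              # stand-in for the DATB on [F_1, ..., F_n]
        self.pre = nn.Conv2d(width, width * scale * scale, 3, padding=1)   # inner conv f of Eq. (6)
        self.shuffle = nn.PixelShuffle(scale)                       # f_P
        self.out = nn.Conv2d(width, channels, 3, padding=1)         # outer conv f of Eq. (6)

    def forward(self, i_lr):
        f0 = self.shallow(i_lr)
        feats, f = [], f0
        for hpb in self.hpbs:
            f = hpb(f)
            feats.append(f)                                         # F_1 ... F_n
        fd = self.datb(torch.cat(feats, dim=1))                     # F_D
        return self.out(self.shuffle(self.pre(fd + f0)))            # I_SR

sr = CDDATSkeleton()(torch.rand(1, 3, 64, 64))                      # -> (1, 3, 128, 128)
```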

Skip connections

To avoid gradient vanishing when training a deep model, skip connections are widely used, both for residual learning and for reducing the number of training parameters. As shown in Fig. 5, we combine a non-linear mapping \(f({{\varvec{x}}})\) with its input \({{\varvec{x}}}\) to form residual learning, i.e., the original mapping is rewritten as \(f({{\varvec{x}}})+{{\varvec{x}}}\). The residual is generally easier to optimize than the original mapping; in the extreme case where the desired mapping is the identity, learning a zero residual is easier than fitting the identity with stacked non-linear layers43. We also apply several skip connections in the subsequent Transformer modules, for example in the High Preserving Block (HPB) and its core component, the Adaptive Residual Feature Block (ARFB). The HPB is designed to maximize the utilization of feature information under a low computational budget, thereby avoiding the sharp performance drop caused by reducing network depth, and the adaptive mechanism of the ARFB enables the model to adjust to different input contents. In addition, inspired by VDSR1, this paper uses skip connections for global residual learning. Different from1,43, our skip connections are combined with Transformer modules to facilitate global modeling in Transformer architectures.

Fig. 5
figure 5

Illustration of skip connections.

Lightweight CNN backbone (LCB)

The constructed radar image dataset is multidimensional data with spatial correlation, and dual-polarization radar images exhibit high local spatial correlation and rich texture features. To avoid occupying a large amount of GPU memory, a lightweight CNN backbone is used to extract shallow features from the input image \({\textbf {I}}_{LR}\), giving the model an initial super-resolution capability. As shown in Fig. 4, the LCB is composed of a series of HPBs, whose main building block is the ARFB. As displayed in Fig. 6b, the ARFB consists of two residual units (RUs) and two convolutional layers. Assuming that \({{\textbf {x}}}_{ru}\) is the input of an RU, the RU can be formulated as:

$$\begin{aligned} {\textbf{y}}_{ru} = \lambda _{res} \cdot {f}_{ex}({f}_{re}({{\textbf {x}}}_{ru})) + \lambda _{x}\cdot {{\textbf {x}}}_{ru} \end{aligned}$$
(7)
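
A minimal sketch of the residual unit of Eq. (7) and of the ARFB in Fig. 6b follows; treating \(f_{re}\) and \(f_{ex}\) as channel-reducing and channel-expanding convolutions, the 1\(\times\)1/3\(\times\)3 layer choices and the concatenation of the two RU outputs are assumptions of this example, while \(\lambda _{res}\) and \(\lambda _{x}\) are the learnable (adaptive) residual weights mentioned above.

```python
import torch
import torch.nn as nn

class ResidualUnit(nn.Module):
    """Eq. (7): y = lambda_res * f_ex(f_re(x)) + lambda_x * x, with learnable scalar weights."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.reduce = nn.Conv2d(channels, channels // 2, 3, padding=1)   # assumed f_re
        self.expand = nn.Conv2d(channels // 2, channels, 3, padding=1)   # assumed f_ex
        self.lambda_res = nn.Parameter(torch.ones(1))                    # adaptive residual weight
        self.lambda_x = nn.Parameter(torch.ones(1))                      # adaptive identity weight

    def forward(self, x):
        return self.lambda_res * self.expand(self.reduce(x)) + self.lambda_x * x

class ARFB(nn.Module):
    """Two residual units followed by two convolutions, as in Fig. 6b."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.ru1, self.ru2 = ResidualUnit(channels), ResidualUnit(channels)
        self.conv1 = nn.Conv2d(2 * channels, 2 * channels, 1)            # assumed 1x1 mixing conv
        self.conv2 = nn.Conv2d(2 * channels, channels, 3, padding=1)     # assumed 3x3 output conv

    def forward(self, x):
        y1 = self.ru1(x)
        y2 = self.ru2(y1)
        y = torch.cat([y1, y2], dim=1)
        return self.conv2(self.conv1(y)) + x                             # block-level skip connection

out = ARFB()(torch.rand(1, 64, 32, 32))                                  # -> (1, 64, 32, 32)
```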

The HPB, built mainly from ARFBs, is used to extract the high-frequency features of rainfall maps. As shown in Fig. 6a, the input feature map \({\textbf {Q}}_{s}\) is average-pooled and then upsampled to obtain \({\textbf {Q}}_{u}\), which represents the average (low-frequency) information of \({\textbf{Q}}_{s}\) and is used to compute the high-frequency information \(\textbf{P}_{h}\). The purpose of this operation is to preserve the details and edges of the feature map before the downsampled features are processed further. This process is formulated as:

$$\begin{aligned} {\textbf{P}}_{h} ={\textbf{Q}}_{s} - {\textbf{Q}}_{u} \end{aligned}$$
(8)

The downsampled feature maps are denoted as \({{\textbf {F}}}^{'}_{n-1}\). Several ARFBs are used to explore the latent information needed to complete the SR image; note that these ARFBs share weights to reduce the number of parameters. Meanwhile, a single ARFB processes \({{\textbf{P}}}_{h}\) to align its feature space with \({{\textbf{F}}}^{'}_{n-1}\), yielding \({{\textbf{P}}}^{'}_{h}\). After feature extraction, \({{\textbf{F}}}^{'}_{n-1}\) is upsampled to the original size by bilinear interpolation and fused with \({{\textbf{P}}}^{'}_{h}\) to preserve the initial details, producing the feature \({{\textbf{F}}}^{''}_{n-1}\). This operation can be expressed as

$$\begin{aligned} {\textbf{F}}^{''}_{n-1} =[f_{a}({{\textbf{P}}}_{h}), u(f^{4}_{a}(d({{\textbf{F}}}^{'}_{n-1})))] \end{aligned}$$
(9)

where u and d denote the upsampling and downsampling operations, respectively, and \(f_{a}\) denotes the ARFB operation. A 1\(\times\)1 convolution layer is then used to reduce the number of channels, and a channel attention module is employed to highlight channels with high activation values. Finally, an ARFB extracts the final features, and a global residual connection adds the original features \({{\textbf{F}}}^{'}_{n-1}\) to obtain \({{\textbf{F}}}_{n}\).
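
The high-frequency separation of Eq. (8) can be sketched as follows; the pooling window and bilinear upsampling are assumptions made for this example.

```python
import torch
import torch.nn.functional as F

def high_frequency_split(q_s: torch.Tensor, pool_size: int = 2):
    """Eq. (8): P_h = Q_s - Q_u, where Q_u is the average-pooled and re-upsampled component of Q_s."""
    q_u = F.interpolate(F.avg_pool2d(q_s, pool_size), size=q_s.shape[-2:],
                        mode="bilinear", align_corners=False)
    p_h = q_s - q_u            # high-frequency residual: edges and fine rainfall structure
    return p_h, q_u

q_s = torch.rand(1, 64, 64, 64)        # (B, C, H, W) feature map entering the HPB
p_h, q_u = high_frequency_split(q_s)
```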

Fig. 6
figure 6

(a) The architecture of the High Preserving Block (HPB). (b) The Adaptive Residual Feature Block (ARFB).

Dual aggregation transformer block based on BSConv

Blueprint Separable Convolution (BSConv) decomposes a standard convolution into a pointwise convolution followed by a depthwise convolution, i.e., the reverse ordering of the depthwise separable convolution (DSConv). A study44 shows that BSConv separates the standard convolution more effectively. Compared with DSConv, BSConv applies the pointwise convolution first to achieve interaction between channels before spatial filtering.
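
A BSConv layer in this ordering can be written directly with grouped convolutions; the kernel size and channel counts below are illustrative.

```python
import torch
import torch.nn as nn

class BSConv(nn.Module):
    """Blueprint separable convolution: 1x1 pointwise conv followed by a depthwise conv."""
    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 3):
        super().__init__()
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1)                         # cross-channel mixing
        self.depthwise = nn.Conv2d(out_ch, out_ch, kernel_size,
                                   padding=kernel_size // 2, groups=out_ch)  # per-channel spatial filtering

    def forward(self, x):
        return self.depthwise(self.pointwise(x))

y = BSConv(64, 64)(torch.rand(1, 64, 32, 32))   # same spatial size, 64 output channels
```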

The Dual Aggregation Transformer Block (DATB) based on blueprint convolution is equipped with Adaptive Self-Attention (ASA) and a Blueprint Spatial Gate Feed-forward Network (BSGFN)45, which enhances the modeling of spatial relationships and feature interactions within the data. As shown in Fig. 7, when an input feature is fed into the n-th DATB block, the computation is defined as follows:

$$\begin{aligned} {\textbf{X}}^{'}_{n} = \text {ASA}(\text {LN}({\textbf{X}}_{n-1})) + {\textbf{X}}_{n-1} \end{aligned}$$
(10)
$$\begin{aligned} {\textbf{X}}_{n} = \text {BSGFN}(\text {LN}({\textbf{X}}^{'}_{n})) + {\textbf{X}}^{'}_{n} \end{aligned}$$
(11)

where \({\textbf{X}}_{n}\) is the output feature and LN(\(\cdot\)) is the LayerNorm layer. ASA denotes adaptive self-attention (see Fig. 8), which includes Adaptive Spatial Self-Attention (AS-SA) and Adaptive Channel Self-Attention (AC-SA). Spatial Window Self-Attention (SW-SA): as shown in Fig. 9a, the query, key and value matrices (denoted \(\textbf{Q}\), \(\textbf{K}\) and \(\textbf{V}\), respectively) are generated by linear projection, where all matrices lie in \(\textbf{R}^{H \times W \times C}\). SW-SA is defined as:

$$\begin{aligned} \text {SW}-\text {SA}(\textbf{X})={\textbf{Y}}_{s}{\textbf{W}}_{p} \end{aligned}$$
(12)

where \({\textbf{W}}_{p} \in {{\textbf{R}}^{C \times C}}\) is the linear projection that fuses all features. The feature \({\textbf{Y}}_{s} \in {{\textbf{R}}^{H \times W \times C}}\) is obtained by reshaping and concatenating all \({\textbf{Y}}_{s}^{i}\), where \({\textbf{Y}}_{s}^{i}\) is the spatial self-attention output of the i-th head.
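
The window mechanics behind SW-SA can be sketched as below, assuming the window size divides the feature-map height and width; a plain multi-head attention stands in for the per-window attention (its output projection plays the role of \({\textbf{W}}_{p}\) in Eq. (12)), and the shifted-window variant additionally offsets the window grid between successive blocks.

```python
import torch
import torch.nn as nn

def window_partition(x: torch.Tensor, ws: int) -> torch.Tensor:
    """Split a (B, H, W, C) map into non-overlapping ws x ws windows: (B * nW, ws*ws, C)."""
    b, h, w, c = x.shape
    x = x.view(b, h // ws, ws, w // ws, ws, c)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, c)

def window_reverse(windows: torch.Tensor, ws: int, h: int, w: int) -> torch.Tensor:
    """Inverse of window_partition: (B * nW, ws*ws, C) -> (B, H, W, C)."""
    b = windows.shape[0] // ((h // ws) * (w // ws))
    x = windows.view(b, h // ws, w // ws, ws, ws, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(b, h, w, -1)

attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)  # stand-in for SW-SA
feat = torch.rand(2, 32, 32, 64)                  # (B, H, W, C)
wins = window_partition(feat, ws=8)               # (2 * 16, 64, 64)
y, _ = attn(wins, wins, wins)                     # window-local self-attention
out = window_reverse(y, ws=8, h=32, w=32)         # back to (2, 32, 32, 64)
```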

Channel-Wise Self-Attention (CW-SA): as shown in Fig. 9b, the self-attention in CW-SA is performed along the channel dimension. Following the principle of SW-SA, the channels are split into heads and attention is applied to each head individually. Finally, the attention features \({\textbf{Y}}_{c} \in {{\textbf{R}}^{H \times W \times C}}\) are obtained by concatenating and reshaping all \({\textbf{Y}}_{c}^{i}\).
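
A sketch of the channel-wise branch is given below; it computes attention maps of size \(C_h \times C_h\) per head instead of \(N \times N\), and the scaling factor and head count are choices made for this example.

```python
import torch
import torch.nn as nn

class ChannelWiseSelfAttention(nn.Module):
    """CW-SA sketch: attention across channels, so the cost stays linear in the number of pixels."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.heads = heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)            # output projection, analogous to W_p

    def forward(self, x):                          # x: (B, N, C) tokens from an H x W map
        b, n, c = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape to (B, heads, C_h, N) so channels attend to channels
        q = q.transpose(1, 2).reshape(b, self.heads, c // self.heads, n)
        k = k.transpose(1, 2).reshape(b, self.heads, c // self.heads, n)
        v = v.transpose(1, 2).reshape(b, self.heads, c // self.heads, n)
        attn = (q @ k.transpose(-2, -1)) * (n ** -0.5)    # (B, heads, C_h, C_h)
        y = attn.softmax(dim=-1) @ v                      # (B, heads, C_h, N)
        y = y.reshape(b, c, n).transpose(1, 2)            # back to (B, N, C)
        return self.proj(y)

y = ChannelWiseSelfAttention(64)(torch.rand(2, 32 * 32, 64))
```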

As depicted in Fig. 8a,b, the self-attention mechanisms AS-SA and AC-SA are built on SW-SA and CW-SA, respectively. AS-SA enhances the model's ability to capture contextually relevant spatial dependencies while maintaining computational efficiency, and AC-SA enables context-aware feature refinement across channels. The process is defined as

$$\begin{aligned} \text {AS} - \text {SA}(\textbf{X}) = \left( \text {C} - \text {I}({\textbf{Y}}_{s},{\textbf{Y}}_{w}) + \text {S} - \text {I}({\textbf{Y}}_{w},{\textbf{Y}}_{s}) \right) {\textbf{W}}_{p} \end{aligned}$$
(13)
$$\begin{aligned} \text {AC} - \text {SA}(\textbf{X}) = \left( \text {S} - \text {I}({\textbf{Y}}_{c},{\textbf{Y}}_{w}) + \text {C} - \text {I}({\textbf{Y}}_{w},{\textbf{Y}}_{c}) \right) {\textbf{W}}_{p} \end{aligned}$$
(14)

where \({\textbf{Y}}_{s}\), \({\textbf{Y}}_{c}\) and \({\textbf{Y}}_{w}\) are the outputs of SW-SA, CW-SA and BSConv defined above, \({\textbf{W}}_{p}\) is the same projection matrix as in Eq. (12), and S-I and C-I are shown in Fig. 10. Compared with serial or parallel designs of attention mechanisms, AS-SA and AC-SA have their own advantages. First, they effectively combine local convolutional information with global self-attention information, improving the quality of feature fusion; a simple additive combination can misalign features, whereas the adaptive interaction makes the outputs of the two branches adapt to each other and thereby optimizes the fusion. Second, AS-SA uses complementary cues from channel interaction to improve channel modeling, while AC-SA enhances representation capability through spatial interaction. Importantly, with the adaptive interactions the convolutional branch can capture global information in the same way as self-attention, improving the quality of the convolutional output. Finally, we use the pixel shuffle method from the efficient sub-pixel convolutional neural network (ESPCN) and add another convolution layer to refine the result, as shown in Fig. 11, so that the detailed spatial layout of rainfall in radar images can be further restored.
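
Putting these pieces together, the wiring of a single DATB block (Eqs. (10)–(11)) can be sketched as follows; plain multi-head attention and a two-layer MLP merely stand in for ASA and BSGFN, so only the residual/LayerNorm structure is taken from the text.

```python
import torch
import torch.nn as nn

class SelfAttentionStandIn(nn.Module):
    """Plain multi-head self-attention, used here only as a stand-in for the ASA module."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        out, _ = self.attn(x, x, x)
        return out

class DATBlock(nn.Module):
    """Eqs. (10)-(11): X' = ASA(LN(X)) + X, then X_out = BSGFN(LN(X')) + X'."""
    def __init__(self, dim: int):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.asa = SelfAttentionStandIn(dim)                 # the real ASA alternates AS-SA and AC-SA
        self.bsgfn = nn.Sequential(nn.Linear(dim, 2 * dim),  # stand-in for BSGFN
                                   nn.GELU(),
                                   nn.Linear(2 * dim, dim))

    def forward(self, x):                                    # x: (B, N, C) tokens
        x = self.asa(self.norm1(x)) + x                      # Eq. (10)
        return self.bsgfn(self.norm2(x)) + x                 # Eq. (11)

out = DATBlock(64)(torch.rand(2, 16 * 16, 64))               # -> (2, 256, 64)
```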

Fig. 7
figure 7

DATB based on BSConv. (a) Dual Spatial Transformer Backbone (DSTB) based on BSConv. (b) Dual Channel Transformer Backbone (DCTB) based on BSConv.

Fig. 8
figure 8

Diagram of attention blocks. (a) Adaptive Spatial Self-Attention (AS-SA). (b) Adaptive Channel Self-Attention (AC-SA).

Fig. 9
figure 9

The complete architecture of self-attention. (a) Spatial Window Self-Attention (SW-SA). (b) Channel-Wise Self-Attention (CW-SA).

Fig. 10
figure 10

Illustration of adaptive self-attention. (a) S-I. (b) C-I. (c) BSGFN.

Fig. 11
figure 11

Pixel shuffle from ESPCN. This one-step upscaling layer rearranges the pixels along the channel axis into a larger image.

Data availability

The core code of our method is available on GitHub (https://github.com/MM-miao11/CDDAT) and archived via Zenodo (https://doi.org/10.5281/zenodo.16784015).

Results

The performance of the generated climate images

The Peak Signal-to-Noise Ratio (PSNR) and the Structural Similarity Index (SSIM) were selected to assess the reconstruction performance on the weather radar data and the similarity to the original high-resolution images. The higher the PSNR value, the better the reconstruction performance of the algorithm. SSIM ranges between 0 and 1, and the higher the SSIM value, the closer the reconstructed climate image is to the original in structure.
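
For reference, PSNR can be computed directly from the mean squared error of normalized images, as sketched below; SSIM is typically evaluated with an off-the-shelf implementation such as skimage.metrics.structural_similarity.

```python
import torch

def psnr(sr: torch.Tensor, hr: torch.Tensor, data_range: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for images with values in [0, data_range]."""
    mse = torch.mean((sr - hr) ** 2)
    return float(10.0 * torch.log10(data_range ** 2 / mse))

print(psnr(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)))
```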

Comparison of subjective effects of radar precipitation data reconstruction

To assess the performance of CDDAT on radar images, the effectiveness of the proposed method can first be evaluated by directly inspecting the output images. In the overall subjective comparison at a magnification factor of 2, shown in Fig. 12, significant differences between the super-resolution models can be observed. The images reconstructed by the Bicubic model are the most blurred, with unclear edges, overall dim brightness and the largest amount of accompanying noise. In comparison, the reconstructions produced by the SRCNN algorithm show improvements in clarity and brightness recovery, with some enhancement of contour details. The high-resolution precipitation distribution maps generated by the CDDAT model are visually similar to those of DAT, SRGAN, SCNet and OSEDiff. As shown in Fig. 13, the images reconstructed by the Bicubic model remain the worst at a magnification factor of 4. Unlike the \(\times\)2 case, the SRGAN results at a factor of 4 are fuzzy, with severe artifacts and unclear contour details, and the SCNet results are slightly fuzzy. The reconstructions of the DAT, OSEDiff and CDDAT models are close to the high-resolution rainfall map (HR), indicating that these models give the best rainfall prediction; the objective indicators of the CDDAT and DAT models are compared in the following section. To evaluate the models more clearly, this paper focuses on selected rainfall images generated during severe storm weather. In these images the color depth directly reflects rainfall, and most models tend to produce overly smooth edge boundaries and exaggerated dark patterns when reconstructing dark areas, that is, the parts of the image where red and orange regions merge. We therefore conduct a detailed comparative analysis of this problem to comprehensively evaluate the performance of each model in rainfall-map reconstruction. Faced with complex precipitation distributions, such as localized heavy rainfall, some small echo information may be neglected, as highlighted in Fig. 14 with a black box. Within the black box, during the downscaling process, SRCNN, SRGAN, SCNet and OSEDiff lose small amounts of rainfall information, whereas the Bicubic algorithm, despite producing blurry images, does not lose radar information. Here, the DAT and CDDAT models perform well in heavy rainfall scenes, preserving the small-rainfall information; the rainfall images generated by CDDAT have the highest overall brightness and clarity and effectively suppress unclear texture noise. During downscaling, the generated rainfall images contain more dark areas, which indicate regions of high rainfall; although noise is reduced during generation, noise points are still present in these dark areas. A higher magnification factor better highlights the differences between models, so we selected strong-precipitation images downscaled by a factor of 4, i.e., images with extensive dark rainfall areas, to compare the outputs of the SRGAN, DAT, SRCNN, SCNet, OSEDiff and CDDAT models. As shown in Fig. 15, SRGAN, SCNet and OSEDiff produce ghosting, which affects the representation of dark areas.
The proposed CDDAT model, with its high-frequency extraction module added on top of the basic DAT model, generates more detailed high-frequency information, which benefits the representation of dark areas and yields the best results.

Fig. 12
figure 12

Comparison of subjective visual effects of rainfall data super-resolution reconstruction results by different methods. The downscaling scale is ×2.

Fig. 13
figure 13

Visual comparison of precipitation prediction for rainfall data super-resolution reconstruction results by different methods. The downscaling scale is ×4.

Fig. 14
figure 14

Comparison of subjective visual effects of rainfall data super-resolution reconstruction results by different methods. The downscaling scale is ×2.

Fig. 15
figure 15

Comparison of subjective visual local effects of different methods on super-resolution reconstruction of rainfall data. The downscaling scale is ×4.

Table 1 Average peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) results of the reconstructed echo of the NJU-CPOL radar.
Fig. 16
figure 16

Comparison of super-resolution results obtained for dew point.

Comparison of objective indexes of radar precipitation data reconstruction

As shown in Table 1, the results indicate that Transformer-based models outperform traditional super-resolution methods at all scales and perform the best overall. Specifically, CDDAT, which incorporates the high-frequency extraction module and the dual-aggregation Transformer blocks proposed in this paper, achieves the best performance. This suggests that, compared with CNN-based models, Transformer-based models are also effective and show further improvement over models based on adversarial networks. For a scaling factor of 2, the proposed CDDAT network achieves the highest PSNR: the PSNR of DAT is 0.16 dB higher than that of SRGAN, while CDDAT is 0.27 dB higher than SRGAN (an increase of 0.82%). Compared with the SRCNN model, the PSNR improves significantly from 35.52 to 38.02 when using CDDAT. In comparison to the baseline model DAT46, CDDAT achieves a 0.11 dB increase in PSNR (a 0.29% improvement), together with a minor increase in SSIM, further confirming the performance gain of CDDAT. For a scaling factor of 4, CDDAT's PSNR is 0.15 dB higher than SRGAN and 0.09 dB higher than DAT. In conclusion, the CDDAT super-resolution network demonstrates clear advantages for meteorological rainfall data, performing strongly on the rainfall datasets and confirming the effectiveness of the algorithm for the rainfall downscaling task.

Ablation experiments

Next, we study the effectiveness of the downscaling scheme proposed in this paper.

  • To test the effectiveness of the multi-radar indicator fusion scheme proposed in this paper for downscaling methods, different fusion schemes are selected for comparison for the three indicators \(Z_{H}\), \(Z_{DR}\), and \(K_{DP}\).

  • To test the effectiveness of the high-frequency extraction module, the DAT and MC-DAT models are selected for the comparative experiments. Since the subjective demonstrations are too similar, the comparison mainly focuses on the performance of objective indicators.

The results of the ablation experiments are shown in Table 2. In both the DAT and MC-DAT models, the scheme that fuses all three indicators demonstrates the best overall performance, with the highest PSNR and SSIM values. However, for down-sampling by a factor of 4, training the MC-DAT model with the fused \(Z_{H}\) and \(Z_{DR}\) yields the highest PSNR value, which reaches 32.93. The comparison shows that adding \(K_{DP}\) to the input data has little impact on the overall distribution predicted by the model. For both the DAT and MC-DAT models, the PSNR values of the (\(Z_{H}\), \(Z_{DR}\)) combination are higher than those of the (\(Z_{H}\), \(K_{DP}\)) combination. For the DAT model, the PSNR results of the (\(Z_{H}\), \(Z_{DR}\)) and (\(Z_{H}\), \(Z_{DR}\), \(K_{DP}\)) fusions are identical, at 32.83.

However, we find that \(K_{DP}\) has a certain ability to enhance structural details. When down-sampling by a factor of 4, the SSIM value obtained by training the MC-DAT model with the fused dataset (\(Z_{H}\), \(Z_{DR}\), \(K_{DP}\)) is the highest, at 0.8728. For both the DAT and MC-DAT models, at down-sampling factors of 2 and 4, the SSIM results of the models trained with the three-indicator fusion are higher than those trained with the single \(Z_{H}\) indicator or with the fused datasets (\(Z_{H}\), \(Z_{DR}\)) and (\(Z_{H}\), \(K_{DP}\)). The results show that the \(K_{DP}\) indicator contributes little to the prediction of the overall rainfall distribution but helps to increase the detailed information in rainfall images. On balance, the \(K_{DP}\) indicator can therefore be retained as one of the model inputs to enhance the detailed edge information of refined rainfall.

In conclusion, on the one hand, the above results demonstrate the effectiveness of the multi-radar-indicator fusion scheme proposed in this paper: the inclusion of the \(Z_{DR}\) and \(K_{DP}\) indicators adds micro-physical information about raindrop particles, which is beneficial for generating the detailed edges of rainfall images. On the other hand, the results also show that MC-DAT improves performance thanks to the added high-frequency extraction module, which enhances the model's ability to capture and extract high-frequency rainfall information.

Table 2 Comparison results of different indicators fusion.

Comparative study on additional dataset

To further verify the generalization of our model, we carried out additional experimental evaluations on a different dataset47, which contains urban climate information at 2.5 km (LR) and 250 m (HR) resolutions obtained by physical modelling paradigms. The Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) are used to evaluate the reconstruction of dew point data for the different models:

$$\begin{aligned} MAE = \frac{1}{N}\sum _{n = 1}^{N} \left| G_{n}^{HR} - y_{n}^{HR} \right| \end{aligned}$$
(15)
$$\begin{aligned} RMSE = \sqrt{\frac{1}{N} \sum _{n = 1}^{N} \left( G_{n}^{HR} - y_{n}^{HR} \right) ^2} \end{aligned}$$
(16)

where \(G_{n}^{HR}\) denotes the n-th ground truth, \(y_{n}^{HR}\) denotes the n-th SR model output and N is the number of samples.
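
Both metrics are straightforward to compute over the reconstructed fields, for example:

```python
import torch

def mae(pred: torch.Tensor, target: torch.Tensor) -> float:
    """Eq. (15): mean absolute error over all samples and pixels."""
    return float(torch.mean(torch.abs(pred - target)))

def rmse(pred: torch.Tensor, target: torch.Tensor) -> float:
    """Eq. (16): root mean squared error over all samples and pixels."""
    return float(torch.sqrt(torch.mean((pred - target) ** 2)))
```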

From Fig. 16 it is clear that the SR results from CDDAT contain more refined small-scale features than those of the other models. Table 3 shows that CDDAT outperforms the other methods with respect to MAE and RMSE. The better performance of CDDAT arises because its multimodal fusion architecture provides more physically consistent information.

Table 3 MAE and RMSE for the reconstruction performance of dew point.
Table 4 Comparison of model complexity.

Complexity analysis

From Tables 1, 2, 3 and 4, it is clear that the proposed model (33.9M parameters, 100.76G FLOPs) performs better than DAT (33.7M parameters, 100.12G FLOPs) while being only slightly larger in parameters and FLOPs, which shows the efficiency of our proposed model. Compared with SRCNN (0.068M parameters, 0.055G FLOPs) and SRGAN (1.55M parameters, 53.67G FLOPs), our method shows a good trade-off between performance and model complexity.
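
For reference, parameter counts such as those in Table 4 can be read directly from a PyTorch module as sketched below; FLOPs are usually measured with an external profiler (for example thop or fvcore) on a fixed input size, which is an assumption of this note rather than a detail from the paper.

```python
import torch.nn as nn

def count_parameters_m(model: nn.Module) -> float:
    """Number of trainable parameters, in millions."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6
```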

Conclusions

In this paper, we propose a novel Climate Downscaling Dual Aggregation Transformer (CDDAT) for dual polarization radar precipitation data from convective storm weather. We first adopt a Lightweight CNN Backbone (LCB) that extracts high-frequency features from rainfall images and then utilize a Dual Aggregation Transformer Backbone (DATB) that models the spatial and channel dimensions of rainfall images by alternately applying self-attention for spatial and channel feature aggregation. Finally, we chose NJU-CPOL dual-polarization weather radar data to evaluate the proposed method using PSNR and SSIM. Extensive experimental results show that our model recovers the echo edges and details of severe convective weather more accurately than existing methods, which is of great significance for the early warning of severe convective weather. The proposed model can enhance disaster preparedness by providing detailed precipitation information. A potential limitation is that the model requires high-resolution training data, which may be sparse in developing regions.