Introduction

Land Surface Temperature (LST) is an indispensable parameter in studying the exchange of matter and energy between the atmosphere and the surface and in studying aspects of climate change. It is one of the important indicators of global climate change and of the water–heat balance of the Earth system1,2. Ground-based observations provide temporally continuous LST data but are spatially discrete. The thermal infrared bands of remote sensing satellites can be used to retrieve LST3, but their temporal and spatial resolutions constrain each other, and imaging time and cloud cover also affect data validity. ERA5-Land reanalysis data can provide continuous global LST information every hour4, but at a spatial resolution of only 0.1°. To meet the demand for LST with high spatio-temporal resolution and to serve the diverse requirements of fine-scale earth science applications, downscaling of LST data is required.

In recent years, scholars have proposed many downscaling algorithms, including image fusion5, statistical regression models6, and algorithms based on machine learning7. Among them, image fusion includes the Spatio-Temporal Adaptive Reflectance Fusion Model (STARFM)8, Enhanced STARFM (ESTARFM)9, the Spatio-Temporal Integrated Temperature Fusion Model10, and the Spatio-Temporal Adaptive Data Fusion Algorithm for Temperature Mapping5. These models obtain continuous LST information by fusing the LST data of different time frequencies, while maintaining the good spatial texture of the downscaled LST data11. However, they have problems such as unsatisfactory inversion accuracy of heterogeneous surfaces, high model training costs, the smoothing of details, and unclear physical mechanisms.

Compared with the above methods, the statistical regression downscaling model based on the “scale invariance” assumption fully considers the physical mechanism of the surface thermal radiation energy balance before and after LST downscaling and retains the detail information of the downscaled LST12. This approach is simple and efficient, and it currently dominates the field of LST downscaling research. For example, the DisTrad algorithm13 realizes LST downscaling by establishing a linear regression relationship between LST and the Normalized Difference Vegetation Index (NDVI), and the TsHARP algorithm14 does so by building a linear regression model between vegetation coverage and LST. In addition, the Geographically Weighted Regression (GWR) method15 is widely used in LST downscaling. However, traditional statistical regression models cannot fully characterize the complex features of high-dimensional data, and downscaling methods based on machine learning have therefore emerged16; owing to their superior nonlinear fitting capabilities, they are widely used in downscaling work and have improved its performance. For example, Random Forest (RF), Artificial Neural Networks (ANN), and Support Vector Machines (SVM) use NDVI, surface reflectance, geographical/terrain factors, and other remote sensing indicators as downscaling factors17,18,19. However, machine learning methods are limited in extracting higher-dimensional information and cannot deeply characterize the intrinsic relationships between multiple environmental factors and the target quantity.

In order to explore the potential relationships among multiple environmental factors and to capture the complex multilevel information in high-dimensional data, deep learning has emerged as the research breakthrough following machine learning. Convolutional Neural Networks (CNNs) can deeply mine multiscale features of high-dimensional images20,21,22,23; among them, the Super-Resolution Convolutional Neural Network (SRCNN) directly learns an end-to-end mapping between coarse-scale and fine-scale images and has achieved significant results in image super-resolution reconstruction. However, LST data at different scales do not differ merely in resolution. Traditional super-resolution methods cannot capture the surface details needed for LST downscaling, so a large amount of auxiliary data must be introduced to supplement those details. The attention mechanism therefore becomes an entry point: it helps the model identify, among the many auxiliary data, the features most relevant to LST and integrates them into the LST prediction, thereby achieving more accurate downscaling. However, attention-based approaches remain underexplored in LST downscaling research.

The establishment of a downscaling model based on deep learning requires the support of a large amount of data. At the same time, in order to adapt to the model input and standardize the geographic references, a huge amount of data preprocessing work needs to be conducted. Traditional remote sensing methods cannot meet the needs of the model construction due to their high computational complexity and high processing costs. With the emergence of cloud computing platforms such as Google Earth Engine (GEE), this problem has been solved. GEE combines Google’s cloud computing capabilities and the Earth observation data of major institutions to solve important global social problems. It provides a large amount of geospatial data, including satellite remote sensing images and basic geographic information24,25,26,27. The ERA5-Land LST data and the seven auxiliary factors needed for the downscaling model research in this study are all retrieved through GEE and preprocessed through cloud computing.

In order to achieve more accurate LST downscaling and to explore the application of the attention mechanisms of deep learning methods to downscaling, this study proposes an Attention Mechanism U-Net (AMUN) method to downscale the hourly and monthly average LST reanalysis data of ERA5-Land across China. This method transfers the U-Net network used for image semantic segmentation28,29,30 to the field of LST downscaling and calibrates the feature maps through the Global Multi-Factor Cross-Attention (GMFCA) module. In addition, the Feature Fusion Residual Dense Block (FFRDB) connection module is introduced to deeply extract the features of coarse-scale LST data. Finally, the calibrated feature maps, deeply extracted feature maps, and shallowly extracted feature maps are fused as the input of U-Net to obtain fine LST data. At the same time, the Bayesian global optimization algorithm31,32 is used to search for the optimal combination of the hyperparameters of the model to enable the best performance of the model and to improve the model’s prediction accuracy.

The main innovations and contributions of this study are as follows:

  1. This study proposes a novel downscaling network that combines image super-resolution reconstruction with the concept of scale-invariant effects to achieve downscaling.

  2. The study introduces an attention mechanism, which is integrated with U-Net, to deeply extract the complex relationships between auxiliary factors and LST, significantly improving downscaling accuracy.

  3. The proposed method enhances the spatial resolution of ERA5-Land LST data, providing a new reference for acquiring high spatiotemporal LST. This is of great importance for refined surface temperature applications, such as agricultural planning, urban heat management, and climate change research.

The rest of this paper is arranged as follows. Section "Materials" introduces the acquisition and preprocessing of the ERA5-Land LST data and auxiliary factor data based on GEE. Section "Methods" introduces the AMUN downscaling method proposed in this study. Section "Results and discussion" evaluates and discusses the effectiveness of the AMUN method. Finally, Section "Conclusion" summarizes the contributions of this study.

Materials

Data acquisition based on GEE

In this study, the territory of China is taken as the research area, as it has rich terrain features and significant spatial variability in LST (Fig. 1). The LST data come from the ERA5-Land monthly reanalysis averaged by hour of day, with a spatial resolution of 0.1°; this dataset provides LST for every hour of the day on a monthly basis. It is obtained from the skin_temperature band of the ECMWF/ERA5_LAND/MONTHLY_BY_HOUR dataset in GEE. For the auxiliary remote sensing data, the NDVI, land cover type, and surface reflectance are obtained directly from the Moderate Resolution Imaging Spectroradiometer (MODIS) MOD13Q1, MCD12Q1, and MOD09Q1 products. The Normalized Difference Built-up Index (NDBI) and the Modified Normalized Difference Water Index (MNDWI) are calculated from the MOD09A1 product. The slope and aspect are calculated from the Digital Elevation Model (DEM) of the Shuttle Radar Topography Mission (SRTM) (Table 1). Finally, accuracy is evaluated using temperature data measured at meteorological stations.

Fig. 1

Topographic map of China (using DEM data from SRTM [https://lpdaac.usgs.gov/products/srtmgl1v003/] and produced with ArcGIS 10.8 software).

Table 1 Basic information of the data used in this study.

Data processing based on GEE

In order to adapt to the model input and to unify data from different sources, data preprocessing is performed. The preprocessing in this study is completed within the GEE platform and includes mean compositing, ROI clipping, resampling, and reprojection. All the auxiliary data are resampled to 0.01° and 0.1° under the WGS84 coordinate system using bilinear interpolation, and the LST data are resampled from 0.1° to 1°33. Some datasets require additional processing: the slope and aspect are extracted from the DEM data of the USGS/SRTMGL1_003 dataset in GEE through the slope and aspect terrain functions, and the sur_refl_b02, sur_refl_b04, and sur_refl_b06 bands of the MODIS/061/MOD09A1 dataset are used to compute NDBI and MNDWI as normalized differences, implemented in GEE through the normalizedDifference function. These can be expressed as:

$$\begin{array}{*{20}c} {NDBI = Normalized\;Difference\left( {sur\_refl\_b06,sur\_refl\_b02} \right) = \frac{sur\_refl\_b06 - sur\_refl\_b02}{{sur\_refl\_b06 + sur\_refl\_b02}}} \\ \end{array}$$
(1)
$$\begin{array}{*{20}c} {MNDWI = Normalized\;Difference\left( {sur\_refl\_b04,sur\_refl\_b06} \right) = \frac{sur\_refl\_b04 - sur\_refl\_b06}{{sur\_refl\_b04 + sur\_refl\_b06}}} \\ \end{array}$$
(2)

where \(sur\_refl\_b02\), \(sur\_refl\_b04\), and \(sur\_refl\_b06\) represent the near-infrared, green, and shortwave infrared bands of MODIS surface reflectance, respectively. It should be noted that the mean reducer in GEE can composite data to a monthly time scale. In addition, for data that do not change on a monthly basis, the mean operation also implicitly converts the ImageCollection type to the Image type, which allows the image data to be cropped (GEE only supports the clip function for the Image type). The corresponding datasets and processing flow of this study are shown in Fig. 2. Among them, the 1° LST and 0.1° auxiliary data are used for model training and the simulated data experiments, while the original 0.1° LST and 0.01° auxiliary data are used for the real data experiments to achieve downscaling.
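Outside GEE, the two normalized differences in Eqs. (1) and (2) can be reproduced with NumPy as a minimal sketch (the band values below are hypothetical):

```python
import numpy as np

def normalized_difference(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """(a - b) / (a + b), guarding against division by zero."""
    denom = a + b
    return np.where(denom != 0, (a - b) / np.where(denom == 0, 1, denom), 0.0)

# Hypothetical MODIS surface-reflectance bands (scaled to [0, 1])
sur_refl_b02 = np.array([[0.30, 0.25]])  # near-infrared
sur_refl_b04 = np.array([[0.10, 0.20]])  # green
sur_refl_b06 = np.array([[0.20, 0.15]])  # shortwave infrared

ndbi = normalized_difference(sur_refl_b06, sur_refl_b02)   # Eq. (1)
mndwi = normalized_difference(sur_refl_b04, sur_refl_b06)  # Eq. (2)
```

In GEE itself, the equivalent call would be `image.normalizedDifference(['sur_refl_b06', 'sur_refl_b02'])` on an `Image` object.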

Fig. 2

Dataset and processing procedure.

Methods

LST downscaling framework

In this study, a total of seven environmental and geographical factors are used as auxiliary factors for downscaling LST. The environmental factors are NDVI, land cover type, surface reflectance, MNDWI, and NDBI34, which characterize vegetation cover, land cover, surface physical properties, inland water, and built-up conditions. The geographical factors are slope and aspect. The framework for LST downscaling in this study can be expressed as:

$$\begin{array}{*{20}c} {LST_{H} = f\left( {LST_{L} ,\;NDVI,\;Land\;Cover\;Type,\;Surface\;Reflectance,\;Slope,\;Aspect,\;MNDWI,\;NDBI} \right)} \\ \end{array}$$
(3)

where \(LST_{H}\) and \(LST_{L}\) represent high-resolution and low-resolution LST data, respectively, and \(f\) represents the downscaling model proposed in this study.

AMUN Method

In order to delve into the intricate characteristics of LST and to thoroughly examine the association between LST and auxiliary factors, this study introduces the AMUN method (Fig. 3). In the network, the low-resolution LST data and seven high-resolution auxiliary datasets are used as input. To exploit the accurate magnitude information of the coarse LST and the rich detail of the diverse fine-scale auxiliary factors, this study proposes the GMFCA module, which mutually calibrates the feature maps. Furthermore, the FFRDB connection module is employed to extract multi-level features from the input LST map. Finally, through global residual learning, the outputs of GMFCA, FFRDB, and shallow feature extraction are fused as the input of U-Net, which extracts texture information from the fused feature map and retains its high-frequency details when mapping to LST.

Fig. 3

Overall structure of AMUN method. (a) Residual Dense Block (RDB), (b) Channel Weight Extraction (CWE), (c) Encoder-Bridge-Decoder (E-B-D). (To highlight the characteristics of U-Net, the number of convolution kernels in each layer of the E-B-D module is marked.)

Detailed information about each module is provided in the following subsections.

GMFCA Module

In order to account for multi-feature input and its complementary information, this study proposes the GMFCA module based on the global cross-attention mechanism and the multi-factor cross-attention mechanism33. While retaining the strengths of both mechanisms, this design reduces the model’s trainable parameters and improves training efficiency.

The GMFCA module has eight inputs, from which the CWE block extracts one LST channel weight and seven auxiliary-data spatial weights. For the channel weight, a convolutional layer extracts a feature map containing the amplitude information of the sampled LST data, and a sigmoid layer normalizes this feature map to derive the channel weight. This weight adjusts the feature maps extracted from the auxiliary data via an element-wise weighting operation, and the seven adjusted auxiliary feature maps are then merged by element-wise addition. For the spatial weights, each of the seven auxiliary-data spatial weights is multiplied with the LST feature map to amplify the detailed features of the LST data. Ultimately, the channel-weighted auxiliary feature map and the spatially weighted LST feature map, combined through addition, are concatenated via the cascade function, and a 1 × 1 convolutional layer projects the high-dimensional feature space onto a low-dimensional one, yielding a recalibrated feature map rich in both detail and magnitude. Taking one of the auxiliary data channels as an example, its interaction with the LST channel can be expressed as:

$$\begin{array}{*{20}c} {F_{W} = \left( {W_{1} \otimes F_{A} } \right) \times \sigma \left( {W_{2} \otimes F_{T} } \right)} \\ \end{array}$$
(4)

where \(F_{W}\) represents the feature map after weighted calibration; \(F_{A}\) and \(F_{T}\) represent the feature maps extracted from the input auxiliary data and LST data, respectively; \(W\) represents the convolution kernel in the CWE block; \(\times\) and \(\otimes\) represent the element-wise multiplication operation and convolution operation, respectively; and \(\sigma\) represents the sigmoid function. The effectiveness of this module is discussed in Section "Reliability experiment of three network modules".
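As a minimal, hypothetical sketch of Eq. (4), the convolutions \(W_{1}\) and \(W_{2}\) can be reduced to 1 × 1 kernels, which act as per-pixel scalar weights (the weights `w1`, `w2` and the feature maps below are illustrative stand-ins, not learned values):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cwe_weighting(f_a, f_t, w1=0.8, w2=1.2):
    """Eq. (4) with 1x1 convolutions, which reduce to scalar weights:
    F_W = (W1 * F_A) x sigmoid(W2 * F_T)."""
    channel_weight = sigmoid(w2 * f_t)   # normalized LST weight in (0, 1)
    return (w1 * f_a) * channel_weight   # element-wise calibration

f_a = np.ones((4, 4))           # hypothetical auxiliary-data feature map
f_t = np.zeros((4, 4))          # hypothetical LST feature map
f_w = cwe_weighting(f_a, f_t)   # sigmoid(0) = 0.5, so every element is 0.4
```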

FFRDB connection module

The objective of the FFRDB connection module is to extract hierarchical features from the LST input that has undergone preliminary (shallow) feature extraction. It first concatenates three RDBs35 through global feature fusion; then a 1 × 1 convolutional layer adaptively fuses local features at different levels; finally, features are extracted through convolution. Each RDB consists of three dense-block layers, local feature fusion, and local residual learning, forming a contiguous memory mechanism. Specifically, the cascade function densely connects the output of the previous RDB with the output of each layer of the current RDB, and through element-wise addition the output of the previous RDB is fused directly into the output of the 1 × 1 convolution, further enhancing the flow of feature information. After these operations, an LST feature map rich in local features is obtained.
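A toy sketch of the RDB structure described above, assuming (channels, height, width) feature maps and substituting random 1 × 1 mixing weights for the learned convolution kernels (all sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x, w):
    """1x1 convolution: per-pixel linear mix of channels, (C_in,H,W)->(C_out,H,W)."""
    return np.tensordot(w, x, axes=([1], [0]))

def rdb(x, growth=4, layers=3):
    """Residual Dense Block sketch: dense concatenation of layer outputs,
    local feature fusion via a 1x1 conv, then local residual learning."""
    c = x.shape[0]
    feats = [x]
    for _ in range(layers):                            # dense-block layers
        inp = np.concatenate(feats, axis=0)
        w = rng.standard_normal((growth, inp.shape[0])) * 0.1
        feats.append(np.maximum(conv1x1(inp, w), 0))   # ReLU activation
    fused = np.concatenate(feats, axis=0)              # local feature fusion input
    w_lff = rng.standard_normal((c, fused.shape[0])) * 0.1
    return x + conv1x1(fused, w_lff)                   # local residual learning

x = rng.standard_normal((8, 16, 16))  # hypothetical (channels, H, W) feature map
y = rdb(x)                            # same shape as the input, as required for residuals
```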

U-Net module

The primary function of U-Net in AMUN is to conduct advanced feature extraction and fusion on the combined outputs of GMFCA, FFRDB, and shallow feature extraction, so that the network can fully utilize the features of each level at the input end and enhance its expressive power. Its structure consists of a set of convolutional layers for extracting the output features of global residual learning, a transposed convolutional layer for upsampling to the original input size, and an E-B-D module. The E-B-D module has a total of 5 layers: 4 symmetrical encoder and decoder layers and 1 bridge layer. The corresponding encoders and decoders are skip-connected to fuse shallow and deep information.

Bayesian optimization of hyperparameters

Bayesian optimization is a global optimization method that builds a probabilistic model of the objective function and uses this model to choose the next hyperparameter combination to evaluate, balancing exploration of untested regions against exploitation of promising ones. The Bayesian optimization rule32 can be expressed as:

$$\begin{array}{*{20}c} {p\left( {w{|}D} \right) = \frac{{p\left( {D{|}w} \right)p\left( w \right)}}{{p\left( D \right)}}} \\ \end{array}$$
(5)

where \(p\left( {w{|}D} \right)\) represents the posterior probability, \(p\left( {D{|}w} \right)\) represents the likelihood, \(p\left( w \right)\) represents the prior probability of the parameter values, \(p\left( D \right)\) represents the evidence (the marginal probability of the observed data), \(w\) represents the value of the parameter to be optimized, and \(D\) represents the data that have been observed.
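A toy numerical illustration of Eq. (5), with hypothetical prior and likelihood values for two candidate learning rates (the numbers are invented purely to show the mechanics):

```python
# Posterior over two candidate hyperparameter values w given observed data D.
p_w = {"lr=1e-4": 0.5, "lr=1e-2": 0.5}          # prior p(w), uniform
p_d_given_w = {"lr=1e-4": 0.8, "lr=1e-2": 0.2}  # likelihood p(D|w)

# Evidence p(D) = sum over w of p(D|w) * p(w)
p_d = sum(p_d_given_w[w] * p_w[w] for w in p_w)

# Bayes' rule, Eq. (5): p(w|D) = p(D|w) * p(w) / p(D)
posterior = {w: p_d_given_w[w] * p_w[w] / p_d for w in p_w}
best = max(posterior, key=posterior.get)  # most probable candidate
```

In practice, Bayesian optimization repeats this update with a continuous surrogate model (commonly a Gaussian process) and an acquisition function rather than a discrete table.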

This study uses this method to optimize the selection of five parameters: solver, learning rate, number of iterations, batch size, and regularization strength. In addition, the learning rate decay factor is set to 0.5, and the learning rate decay period is set to half of the number of iterations. The detailed procedure of the experiment and its outcomes are presented in section "Bayesian optimization experiment".

Model training and data preparation

  1. Training and validation data: The network is trained using 24 monthly datasets covering 2020–2021 (Table 2). The original LST data serve as the label dataset. The slice sizes of the inputs and labels are 32 × 32 × 8 and 32 × 32 × 1, respectively. After processing the 24 sets of data, a total of 36,387 patches are generated (all no-data areas are excluded). Of these, 85% serve as the training set, while the remaining 15% constitute the validation set used to assess the model’s performance during training.

  2. Simulated and real data test sets: The efficacy of the AMUN method is assessed via experiments with both simulated and real data. In the simulated data experiment, image data from August 2019 and February 2022 are used to test the downscaling model and to evaluate its performance in the summer and winter of different years. In the real data experiment, image data from four distinct seasons, namely February 2019, May 2020, August 2021, and November 2022, are employed to test the downscaling model, with the objective of assessing its performance across seasons both within and beyond the temporal scope of the training set. The image resolution is 10 times that of the simulated test set (Table 2).

  3. Parameter settings: After Bayesian hyperparameter optimization, the Rmsprop solver is used as the gradient descent optimization algorithm, with a total of 99 rounds of training, a batch size of 34, a regularization strength of 0.0001, an initial learning rate of 0.0001, and the learning rate halved every 50 rounds.
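The patch preparation in item 1 above can be sketched as follows (array sizes are illustrative, and NaN stands in for no-data values):

```python
import numpy as np

def extract_patches(stack, patch=32):
    """Slice a (H, W, C) input stack into non-overlapping patch x patch tiles,
    discarding any tile that contains no-data values (NaN)."""
    h, w, _ = stack.shape
    patches = []
    for i in range(0, h - patch + 1, patch):
        for j in range(0, w - patch + 1, patch):
            tile = stack[i:i + patch, j:j + patch, :]
            if not np.isnan(tile).any():   # exclude no-data areas
                patches.append(tile)
    return patches

stack = np.zeros((64, 64, 8))     # hypothetical LST + 7 auxiliary channels
stack[0, 0, 0] = np.nan           # mark one tile as no-data
patches = extract_patches(stack)  # 3 of the 4 tiles survive
# an 85/15 split of the surviving patches would then separate training and validation
```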

Table 2 Detailed information about the experimental data. T and A represent LST and auxiliary data, respectively.

Results and discussion

LST downscaling evaluation

To ascertain the efficacy of the proposed AMUN approach for the downscaling estimation of LST, this study takes the territory of China as the research area and carries out downscaling experiments on two fronts, simulated data and real data, which are described in this section. In both experiments, the proposed method is compared against RF. Three quantitative indicators, Mean Absolute Error (MAE), Coefficient of Determination (R2), and Root Mean Square Error (RMSE), are used to evaluate the accuracy of the LST downscaling results.
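The three indicators follow their standard definitions, which can be sketched directly (the sample values below are hypothetical LST pairs, not results from this study):

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error."""
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    """Root Mean Square Error."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = np.array([280.0, 290.0, 300.0, 310.0])  # hypothetical reference LST (K)
y_pred = np.array([281.0, 289.0, 301.0, 309.0])  # hypothetical downscaled LST (K)
```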

Simulated data experiment

In the simulated data experiment, the original ERA5-Land data (0.1°) are downsampled to a spatial resolution of 1° as the simulated LST data to be downscaled. Subsequently, with this simulated LST and seven auxiliary datasets of 0.1° spatial resolution, the trained model is used to downscale the 1° simulated LST data to 0.1°. Consequently, the original ERA5-Land data can serve as a benchmark to contrast with the LST data derived from downscaling, allowing the model’s performance to be assessed in terms of visual impact and quantitative metrics.
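The coarsening step of the simulation setup (0.1° to 1°, a factor of 10) can be sketched as a block average; the grid size and values below are illustrative:

```python
import numpy as np

def coarsen(lst, factor=10):
    """Aggregate a fine grid to a coarser one by block averaging,
    e.g. 0.1 deg -> 1 deg with factor=10."""
    h, w = lst.shape
    return lst[:h - h % factor, :w - w % factor] \
        .reshape(h // factor, factor, w // factor, factor) \
        .mean(axis=(1, 3))

fine = np.arange(400, dtype=float).reshape(20, 20)  # hypothetical 0.1-deg LST grid
coarse = coarsen(fine)                              # 2 x 2 grid at 1 deg
```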

The downscaling results and quantitative indicator graphs of the AMUN and RF methods for August 2019 and February 2022 are shown in Figs. 4 and 5, respectively. It can be observed that, due to the higher terrain in the southwest, the LST of the Qinghai-Tibet Plateau is relatively low compared to other regions. The downscaling results of both AMUN and RF exhibit strong spatial coherence with the reference data. However, the fitting line of the scatter plot associated with AMUN is noticeably closer to the reference line, indicating that AMUN’s downscaling result has superior spatial correlation compared to RF (Figs. 4b2, c2 and 5b2, c2). The bias of the AMUN downscaling results for August 2019 is generally low, with only a very small number of areas exceeding 2 K, and the distribution is relatively random, which is visually significantly better than the RF downscaling results (Figs. 4b3, c3 and 5b3, c3). Although the AMUN downscaling results for February 2022 show more areas with bias greater than 2 K, they are still visually superior to the RF downscaling results overall.

Fig. 4

LST visual effects and quantitative indicator data for August 2019. (a1,a2) are the original (reference) image and input LST data, respectively. (b1b3) represent the AMUN downscaling results, scatter plot, and bias level stereogram, respectively. (c1c3) represent the RF downscaling results, scatter plot, and bias level stereogram, respectively. The elliptical red circles in the figures indicate areas of significant contrast between the reference image and the prediction results from the two methods, highlighting that the AMUN method’s predictions are closer to the reference image in terms of spatial details.

Fig. 5

LST visual effects and quantitative indicator data for February 2022. (a1,a2) are the original (reference) image and input LST data, respectively. (b1b3) represent the AMUN downscaling results, scatter plot, and bias level stereogram, respectively. (c1c3) represent the RF downscaling results, scatter plot, and bias level stereogram, respectively. The elliptical red circles in the figures indicate areas of significant contrast between the reference image and the prediction results from the two methods, highlighting that the AMUN method’s predictions are closer to the reference image in terms of spatial details.

For a more robust validation of the AMUN method’s effectiveness, the LST downscaling outcomes were assessed quantitatively (Table 3). As inferred from the coefficient of determination, both the AMUN and RF downscaling outcomes demonstrate commendable performance, with R2 values of 0.986 and 0.936, respectively. However, when considering the MAE and RMSE outcomes, the AMUN method’s downscaling outcomes significantly outperform those of RF, with MAE values of 0.731 K and 1.637 K, respectively, and RMSE values of 1.112 K and 2.292 K. This suggests that the AMUN method exhibits superior performance in terms of expressing spatial details.

Table 3 Quantitative evaluation of the simulation experiment.

To sum up, in the simulation experiment, both the AMUN and RF methods are employed to downscale the 1° LST data to 0.1°. When juxtaposed with the reference data, the spatial coherence and details of the AMUN downscaling outcomes outshine those of the RF downscaling outcomes.

Real data experiment

In the real data experiment, the original ERA5-Land data (0.1°) are downscaled to a spatial resolution of 0.01°. Subsequently, the downscaling results are validated in two ways: one is based on the area scale, where the downscaling results are downsampled to match the original LST data for validation; the other is based on the point scale, where the downscaling results are validated with actual meteorological station data.

In the area-scale validation, the performances of the RF and AMUN methods are compared using data from four periods, February 2019, May 2020, August 2021, and November 2022, with the original LST data as the benchmark (Fig. 6). Visually, the downscaling results of the AMUN method accurately capture the spatial distribution of LST. In small areas with abrupt temperature changes, the AMUN method can exploit the auxiliary data to capture the variation well, whereas the RF method handles these areas poorly. This reflects the fact that the AMUN method can better restore spatial details.

Fig. 6

Visual comparison of the downscaling results of the AMUN and RF methods with the original LST data in February 2019, May 2020, August 2021, and November 2022. (a1a4) represent the original (reference) images. (b1b4) represent the downscaling results of the AMUN method. (c1c4) represent the downscaling results of the RF method. (d) is the schematic diagram of the area for February 2019 and May 2020. (e) is the schematic diagram of the area for August 2021 and November 2022. The elliptical red circles in the figures indicate areas of significant contrast between the reference image and the prediction results from the two methods, highlighting that the AMUN method’s predictions are closer to the reference image in terms of spatial details.

The spatial correlation and bias of the AMUN method are superior to those of the RF method (Figs. 7 and 8). The downscaling results of both methods perform well in flat areas, with errors mainly concentrated in mountainous areas, and the AMUN method is markedly more resistant to these errors. Therefore, the AMUN method proposed in this study is effective for the downscaling of the original LST data.

Fig. 7

Scatter plots of the original LST data compared separately with the AMUN and RF methods in February 2019, May 2020, August 2021, and November 2022. (a1a4) represent the scatter plots corresponding to the AMUN method. (b1b4) represent the scatter plots corresponding to the RF method.

Fig. 8

Bias level diagrams of the original LST data compared separately with the AMUN and RF methods in February 2019, May 2020, August 2021, and November 2022. (a1a4) represent the bias level diagrams corresponding to the AMUN method. (b1b4) represent the bias level diagrams corresponding to the RF method.

In the point-scale validation, data from the same four periods, February 2019, May 2020, August 2021, and November 2022, are used for accuracy validation. The measured station temperature data for the two regions across the four periods are compared with the original data, the AMUN downscaling results, and the RF downscaling results (Fig. 9). To reduce the difference between air temperature and LST, the monthly average maximum temperature of the stations is used in area (a) and the monthly average temperature is used in area (b). The results show that all three datasets have good spatial correlation with the station data, and the trends of the line charts are highly consistent with the station data. The AMUN results even outperform the original data, indicating that the AMUN method both retains the accuracy of the original data and effectively supplements spatial details.

Fig. 9

Schematic diagram of the station validation area and line chart. (a,a1,a2) correspond to the validation area and line chart for February 2019 and May 2020. (b,b1,b2) correspond to the validation area and line chart for August 2021 and November 2022.

Apart from the visual assessment of the downscaling outcomes, this experiment also carried out a quantitative evaluation (Table 4). Both the RF and AMUN methods exhibit a strong correlation with the original image, with R2 values exceeding 0.94. In terms of MAE and RMSE, the AMUN method outperforms the RF method, with MAE values of 0.613 K and 0.956 K and RMSE values of 0.989 K and 1.444 K, respectively. Judging from the numerical values alone, the AMUN method does not hold a decisive advantage over the RF method. The visual comparison in Fig. 6 explains this: the AMUN method introduces more spatial details, which makes its downsampled results slightly rougher than the smooth original data. Moreover, the bilinear interpolation used for downsampling itself introduces a certain systematic error, which narrows the gap between the two methods.

Table 4 Quantitative evaluation of the real experiment. Compared with the LST, there is an inherent error of 2–3 K in the air temperature at the station.

In the station-based evaluation, the temperature data of the meteorological stations are compared with the AMUN downscaling results, the RF downscaling results, and the original data. Because the air and the ground absorb and release solar heat at different rates, among other factors, there is a systematic error of 2–3 K between air temperature and LST, but this does not affect the evaluation of the spatial correlation of the downscaling results. As shown in Table 4, the R2 values of AMUN, RF, and the original data are 0.894, 0.861, and 0.878, respectively; the MAE values are 2.848 K, 2.927 K, and 2.776 K, respectively; and the RMSE values are 4.307 K, 4.933 K, and 4.625 K, respectively. All three indicators of the AMUN results are superior to those of the RF method, and the R2 and RMSE are also superior to those of the original data. This again demonstrates that the downscaling results of the AMUN method not only maintain high consistency with the original data but also enrich the real LST details to a certain extent.

Discussion

Reliability experiment of three network modules

In order to clarify the effectiveness of the GMFCA module and of the full three-module AMUN combination, four module combinations are evaluated: (1) the GMFCA module alone; (2) the GMFCA module plus the FFRDB connection module; (3) the GMFCA module plus the U-Net module; and (4) AMUN, the combination of all three modules. The performance of the four combinations is evaluated using the RMSE of the validation set (the input data are normalized) during training, with the same hyperparameters used for all four. At the end of the iterations, the validation RMSEs of all four combinations are below 0.5. The GMFCA module alone finishes training with a validation RMSE of 0.44. After introducing the FFRDB connection module, the validation RMSE slightly increases; after introducing the U-Net module, it significantly decreases. When the FFRDB connection module and the U-Net module are introduced together to form AMUN, the final validation RMSE is 0.28, the best of the four combinations and significantly lower than the other three (Fig. 10). This result can be explained as follows: the GMFCA and FFRDB modules are parallel in structure, so introducing the FFRDB module lets the model capture more feature information, but the model cannot fully utilize it on its own. Introducing the U-Net module allows the model to effectively extract and integrate the combined features, enabling more efficient use of the extracted information. Overall, the GMFCA module and the AMUN method proposed in this study are clearly effective. After training, the size of the final AMUN model is 84.1 MB.

Fig. 10

RMSEs of the validation set during the training process of the four combinations. Due to the normalization of the data, RMSE has no units.

Bayesian optimization experiment

To further improve the performance of the downscaling model and enhance the robustness of the AMUN method, this experiment uses Bayesian global optimization to select five hyperparameters: the solver, learning rate, number of iterations, batch size, and regularization strength, exploring the best hyperparameter combination for the downscaling model over a total of 30 trials. Well-performing combinations tend toward lower learning rates, regularization strengths, and batch sizes, and toward higher numbers of iterations. Among the solvers, Rmsprop gives the best training performance for the AMUN method, Adam is second, and Sgdm suffered gradient explosion twice, causing model training to fail (Fig. 11). After optimization, the Rmsprop solver was adopted as the gradient descent optimization algorithm, with 99 training epochs, a batch size of 34, a regularization strength of 0.0001, and an initial learning rate of 0.0001. Overall, Bayesian optimization selects the best hyperparameters for the model, reducing the validation RMSE from 5.6863 K to 1.4892 K. Compared with grid search and random search, Bayesian optimization finds the optimal hyperparameter combination more quickly, reducing computational cost31.
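The study's actual tuner is not shown, so the following standalone NumPy sketch only illustrates the mechanics of Bayesian optimization (a Gaussian-process surrogate plus an expected-improvement acquisition function) on a hypothetical one-dimensional search over log10 of the learning rate. The objective function, search grid, and iteration counts are all illustrative assumptions.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

def objective(log_lr):
    """Stand-in for validation RMSE as a function of log10(learning rate);
    its minimum sits near log10(lr) = -4, i.e. lr = 1e-4."""
    return (log_lr + 4.0) ** 2 + 0.1 * np.sin(3.0 * log_lr)

def rbf(a, b, length_scale=1.0):
    """Squared-exponential kernel between two 1-D point sets."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(x_obs, y_obs, x_new, noise=1e-4):
    """Gaussian-process posterior mean and std at x_new."""
    y_mean = y_obs.mean()
    K = rbf(x_obs, x_obs) + noise * np.eye(len(x_obs))
    Ks = rbf(x_obs, x_new)
    alpha = np.linalg.solve(K, y_obs - y_mean)
    mu = Ks.T @ alpha + y_mean
    v = np.linalg.solve(K, Ks)
    # diag of the prior covariance is 1 for the RBF kernel
    var = np.clip(1.0 - np.sum(Ks * v, axis=0), 1e-12, None)
    return mu, np.sqrt(var)

# Search grid over log10(learning rate) and three random initial trials.
grid = np.linspace(-6.0, -1.0, 200)
x_obs = rng.choice(grid, size=3, replace=False)
y_obs = objective(x_obs)

norm_cdf = np.vectorize(lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0))))

for _ in range(10):  # Bayesian-optimization iterations
    mu, sigma = gp_posterior(x_obs, y_obs, grid)
    best = y_obs.min()
    z = (best - mu) / sigma
    # Expected improvement for minimization
    ei = (best - mu) * norm_cdf(z) + sigma * np.exp(-0.5 * z**2) / math.sqrt(2 * math.pi)
    x_next = grid[np.argmax(ei)]  # evaluate where improvement is most promising
    x_obs = np.append(x_obs, x_next)
    y_obs = np.append(y_obs, objective(x_next))

best_log_lr = x_obs[np.argmin(y_obs)]
```

The surrogate trades off exploiting regions with a low predicted RMSE against exploring regions of high uncertainty, which is why it typically needs far fewer trials than grid or random search.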

Fig. 11

Bayesian hyperparameter optimization process. Units of RMSE: K.

Adaptability of AMUN to different terrains

To further evaluate the effectiveness of the AMUN method, experiments were conducted under three terrain conditions: plateau, plain, and mountainous regions (Fig. 12), and AMUN’s performance in each was quantified (Table 5). The AMUN method achieves the highest spatial correlation with the original map in the mountainous region, with an R2 of 0.976, followed by the plateau region with an R2 of 0.958. In terms of the error between predicted and actual values, however, the plain region performs best, with an MAE of 0.279 K and an RMSE of 0.357 K, followed by the plateau region, with an MAE of 0.486 K and an RMSE of 0.620 K. In summary, the AMUN method performs well across all three terrain types; although R2 in the plain region is less pronounced owing to the weaker spatial variability of LST there, the method still shows strong results in terms of MAE and RMSE.

Fig. 12

Comparison between the experimental results of AMUN and the original maps in different terrains (a1–a6 represent the original maps; b1–b6 represent the AMUN results).

Table 5 Quantitative analysis of AMUN method accuracy across different terrains.

Adaptability of AMUN to different seasons

To investigate the performance of the AMUN method across the seasons, this study evaluated the model’s downscaling results for four seasons (Fig. 13). The seasons are divided as follows: spring (March–May), summer (June–August), autumn (September–November), and winter (December–February). The experimental results show that the model maintains strong spatial correlation with the original values in all four seasons, with R2 values consistently around 0.98 and no significant differences between seasons. In terms of error, the model performs best in summer, with an MAE of around 0.5 K and an RMSE of around 0.8 K. The largest errors occur in spring and winter, where the MAE is approximately 0.35 K higher and the RMSE about 0.65 K higher than in summer. In summary, the AMUN method performs best in summer, followed by autumn, with spring and winter showing comparable performance slightly behind summer and autumn.
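The seasonal grouping can be made explicit with a small helper; this is an illustrative sketch, not code from the study.

```python
def season_of(month):
    """Map a calendar month (1-12) to the season used in this study:
    spring (Mar-May), summer (Jun-Aug), autumn (Sep-Nov), winter (Dec-Feb)."""
    if month in (3, 4, 5):
        return "spring"
    if month in (6, 7, 8):
        return "summer"
    if month in (9, 10, 11):
        return "autumn"
    if month in (12, 1, 2):
        return "winter"
    raise ValueError(f"invalid month: {month}")
```

Grouping the monthly downscaling results with this mapping yields the per-season metrics plotted in Fig. 13.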

Fig. 13

Performance of the AMUN method across three evaluation metrics in different seasons.

Conclusion

In this study, deep learning methods are used to explore LST downscaling. We propose AMUN, a U-Net method based on an attention mechanism, which consists of the GMFCA, FFRDB, and U-Net modules. The GMFCA module is an attention mechanism that fully extracts the complementary information between the LST data and the seven auxiliary datasets. The FFRDB module is then introduced to capture rich features from the input LST data, and the U-Net module further extracts multilevel feature information from the input and mines more latent connections in the data. The effectiveness of the AMUN method is demonstrated by comparing combinations of the three modules. The GEE platform is used to obtain and preprocess the ERA5-Land hourly and monthly average reanalysis LST dataset and the seven auxiliary datasets, improving the efficiency of the downscaling work. In addition, the training hyperparameters of the AMUN method are optimized and selected with a Bayesian optimization algorithm, improving the accuracy and robustness of the downscaling model. In both the simulated-data and real-data experiments, the AMUN method is compared with the RF method: AMUN has better spatial correlation and accuracy and matches the original LST data and station measurements more closely, whereas the RF method cannot fully describe the spatial variability of LST or accurately capture areas of abrupt temperature change. In summary, the method and the series of optimization measures proposed in this study can effectively handle LST downscaling work.

Beyond ERA5-Land reanalysis LST data, with a reasonable selection of auxiliary factors and by adjusting the number of input heads, the AMUN method can accommodate similar downscaling tasks based on the scale-invariance assumption. However, the method still has some limitations. For example, the stability of its predictions under different terrain and seasonal conditions requires further improvement, and because the model is trained on sliced (tiled) inputs, stitching artifacts appear in a few areas. In future research, we plan to further improve the model structure and training mechanism to enhance prediction stability, and to apply downscaling ideas inspired by image super-resolution reconstruction to existing machine learning models (such as SVM and ANN) to investigate their predictive performance.