Abstract
Evapotranspiration (ET) is a critical component of the water and energy cycles in desert grassland ecosystems. However, modeling ET in arid grasslands faces significant challenges due to data scarcity, high spatiotemporal heterogeneity, and complex interactions among climatic drivers. To address these challenges, this study developed a Random Forest Regression (RF-R) model that integrates high-resolution PML-V2 ET data with CRU meteorological datasets (2001–2020) to simulate ET in China’s desert grasslands. The RF-R model performed well, with R² values of 0.953 (training) and 0.931 (testing), RMSE values of 3.421 and 4.182 mm/month, and an average prediction deviation of 11.815%. A comparative analysis against BPNN and SVR models confirmed the robustness of the RF-R estimates. Key climate factors were identified through multi-scale importance assessments: precipitation and wet-day frequency were the primary drivers, followed by cloud cover and diurnal temperature range. This study provides a reliable framework for ET simulation in data-scarce arid regions and supports targeted water management strategies for desert grassland restoration.
Introduction
Grassland ecosystems, as one of the major ecosystems on the Earth’s surface, account for about one-third of the total land area and play an essential role in maintaining the functions and processes of terrestrial ecosystems1,2. Desert steppes lie in the transition region between steppes and deserts in Eurasia and form an essential ecological barrier in northwest China3. In recent years, the degradation of desert grasslands has intensified, and ecosystem stability has gradually declined4,5. Evapotranspiration is a critical environmental driver of desert grassland degradation: it acts as a key hydrological regulator that governs fundamental ecological mechanisms, including soil water redistribution patterns, vegetation community dynamics, and ecological succession trajectories6,7. It is therefore necessary to model desert grassland evapotranspiration and to explore its climatic drivers. Numerous researchers have successfully estimated evapotranspiration using the Penman-Monteith model8,9,10, the energy balance model11,12, and remotely sensed product data13,14,15, with satisfactory results. However, physically based models require multiple parameters that are often region-specific and difficult to acquire in arid environments, although some remote sensing ET products have simplified parameterization by leveraging satellite data and empirical relationships. Acquiring these parameters often involves complex procedures, and measurement accuracy critically determines model reliability and applicability. Data collection difficulties and accuracy issues are thus the main obstacles to using process-based models over large areas. Recent computational advances have accelerated the adoption of machine learning approaches to address these constraints. Nagappan, et al.16 used a machine learning method to predict reference evapotranspiration and showed it to be effective.
Hascoet, et al.17 used a machine learning approach to calibrate a global ET dataset from a water cycle perspective and showed that the corrected dataset could be generalized to unseen catchments. Shaloo, et al.18 developed four machine learning approaches based on climate data to estimate ET in arid regions where data are scarce. The Random Forest algorithm (RF), proposed by Breiman (2001), integrates multiple decision trees to handle nonlinear relationships and reduce overfitting risks. Unlike physics-based models requiring complex parameters, RF minimizes simulation uncertainties and demonstrates strong regional adaptability19,20. Its simplicity and efficiency have driven widespread adoption in hydrological studies. Xu, et al.21 evaluated the accuracy of five machine learning algorithms for estimating evapotranspiration at 36 flux tower sites in the Black River Basin, with the Random Forest algorithm showing superior performance. Azzam, et al.22 estimated wheat evapotranspiration using four machine learning algorithms in their study watershed and found that the Random Forest algorithm performed best. Liu, et al.23 constructed an evapotranspiration estimation model based on meteorological and evapotranspiration data from the Qinghai Lake Basin using Random Forest (RF), Extreme Gradient Boosting (XGB), Support Vector Regression (SVR), and Artificial Neural Network (ANN) algorithms; the Random Forest model gave the best simulation results. These studies show that, compared with remote sensing models that require many parameters, the machine learning approach to estimating evapotranspiration can reduce the effect of uncertainty in the simulation process to a certain extent and has strong regional applicability. Moreover, the random forest model is more interpretable than other complex machine learning models.
This study addresses the complex parameterization of physical remote sensing models by employing machine learning to develop an evapotranspiration (ET) simulation model for desert steppes, thereby revealing the key driving mechanisms of ET in these ecosystems. ET in arid grasslands, especially under the unique climatic and ecological constraints of northwest China, has not yet been explored in depth. Moreover, existing models often ignore the interactions among multi-scale factors during the ET process in arid grasslands. The proposed technical framework, which integrates machine learning with remote sensing products, not only establishes a methodology for investigating water cycle evolution in arid regions but also provides scientific support for constructing ecological security patterns and implementing restoration practices in degraded desert steppe ecosystems.
The paper is structured as follows. The section “Study area and data” presents the study area and data sources. The section “Research methods” describes the machine learning models and the importance evaluation model. The section “Results and discussion” presents the simulations and analyses the evaluation results, with the discussion focusing on model accuracy and the correlation analysis of the main climate factors of evapotranspiration. The section “Conclusion” summarizes the conclusions obtained.
Study area and data
Study area
This study was conducted in the desert grassland region of China (35°43′−47°52′N, 73°51′−113°58′E), where the altitude ranges from −162 to 7249 m. The northwest China desert grassland region has unique geomorphological features and includes the western region of Inner Mongolia, the northern and southern regions of Xinjiang, the Hexi Corridor region of Gansu, and the Qinghai Qaidam Basin and Qilian Mountains region. The terrain of the northwest is undulating, with vast deserts interspersed with vast grasslands. The region has a typical continental climate, with hot, dry summers, cold, arid winters, and significant temperature differences between day and night. Groundwater recharge in the desert grassland area mainly comes from precipitation infiltration, and groundwater discharge is dominated by evaporation. The annual precipitation in this region is generally low, and its spatial distribution is markedly uneven. Precipitation is mainly concentrated in summer, whereas the boxplot median of annual precipitation is close to that of spring: it is significantly higher than the median of winter precipitation but still significantly lower than that of summer precipitation (Fig. 2). This rainfall pattern limits agricultural development and ecological improvement in the region. Based on geographic location, the study area was divided into the Xinjiang Desert Grassland Zone (I), Gansu Desert Grassland Zone (II), Qinghai Desert Grassland Zone (III), and Inner Mongolia Desert Grassland Zone (IV), as shown in Figs. 1 and 2. The map was drawn using GIS tools (map no.: GS (2023) 2767).
Data sources
The meteorological and hydrological impact factor data used in this study were obtained from the Global Land Surface Monthly Mean Temperature dataset published by the Climatic Research Unit (CRU) at the University of East Anglia, UK (https://doi.org/10.5285/58a8802721c94c66ae45c3baa4d814d0), with a spatial resolution of 0.5° × 0.5°; the time series from January 2001 to December 2020 was selected. The impact factors include percent cloud cover (cld), diurnal temperature range (dtr), frost day frequency (frs), precipitation (pre), monthly average daily mean temperature (tmp), vapor pressure (vap), and wet day frequency (wet). Although the evapotranspiration process is also affected by factors such as wind speed and solar radiation, the core driving factors in this study were selected on the basis of both data reliability and the regionally dominant mechanisms. CRU data have been widely used worldwide, and their accuracy and applicability have been validated in several studies24,25. The PML-V2 (China) dataset was developed by Zhang, et al.26. Provided by the Institute of Geology and Geophysics, Chinese Academy of Sciences, it comprises five key components: Gross Primary Productivity (GPP), Vegetation Transpiration (Ec), Soil Evaporation (Es), Canopy Interception Evaporation (Ei), and Evaporation from Water Bodies and Snow/Ice (ET_water). In this study, the sum of Ec, Es, Ei, and ET_water from the PML-V2 dataset is used as the actual evapotranspiration (ETa), which serves as the target value. He, et al.27 showed that PML-V2 (China) outperforms its global version and other typical products (MOD16A2, SEBAL, GLEAM, MOD17A2H, VPM, and EC-LUE) in estimating ET and GPP.
The DEM data are from the Geospatial Data Cloud (http://www.gscloud.cn/) with a spatial resolution of 90 m.
Research methods
Random forest regression model
To simulate the complex natural phenomenon of desert grassland ET accurately, this study takes advantage of the data processing capability and flexibility of random forests to integrate multi-source environmental data and automatically identify the most critical environmental variables. By constructing multiple decision trees and averaging the prediction results of the individual trees, Random Forest effectively reduces the risk of model overfitting and, because it handles nonlinear relationships and interaction effects, provides a reliable machine-learning framework for estimating evapotranspiration.
The Random Forest model (Fig. 3) combines several independent decision trees. Two “random” strategies are adopted when the decision trees are generated: random selection of samples and random selection of features. These two random selections effectively reduce the correlation between the individual decision trees, further improving the model’s accuracy. In the modeling process, 80% of the data were selected as training data and 20% as test data, with the test data used to assess the generalization ability of the random forest model.
The specific steps are as follows:
Step 1
The ET in the PML-V2 dataset was used as the output variable (target value) of the model, and the corresponding seven meteorological factors were used as the input variables (attribute values). The output variable and the input variable matrix together formed the training dataset S.
Step 2
Generate training subsets and regression trees. Bootstrap sampling creates multiple training subsets, each paired with a randomly selected feature subspace to train unpruned regression trees. Iterations explore diverse feature combinations to optimize inputs, and the process repeats across all trees, with final outputs aggregated for analysis.
Step 3
Generate a random forest model. Each regression tree grows recursively, branching from top to bottom. A regression tree stops growing once the split termination condition is met, and eventually all the single regression trees are combined to form a random forest.
Step 4
The final ET value is the arithmetic mean of the regression results of all trees (the ‘simple average method’).
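The four steps above can be sketched in plain Python. This is only a minimal illustration under stated assumptions, not the implementation used in the study: the data are synthetic, the trees are small hand-written regression trees, and every name and hyperparameter (`build_tree`, `fit_forest`, `predict_forest`, 10 trees, 3 features per split, depth 3) is hypothetical.

```python
import random

def avg(values):
    values = list(values)
    return sum(values) / len(values)

def build_tree(rows, targets, n_features, depth, rng, min_size=5):
    """Step 3: grow one regression tree; each split sees a random feature subset."""
    if depth == 0 or len(rows) < 2 * min_size:
        return avg(targets)  # leaf value: mean target of the samples in the node
    best = None
    for f in rng.sample(range(len(rows[0])), n_features):
        for threshold in sorted({r[f] for r in rows}):
            left = [t for r, t in zip(rows, targets) if r[f] <= threshold]
            right = [t for r, t in zip(rows, targets) if r[f] > threshold]
            if len(left) < min_size or len(right) < min_size:
                continue
            lm, rm = avg(left), avg(right)
            sse = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
            if best is None or sse < best[0]:
                best = (sse, f, threshold)
    if best is None:  # split termination: no admissible split was found
        return avg(targets)
    _, f, threshold = best
    lo = [(r, t) for r, t in zip(rows, targets) if r[f] <= threshold]
    hi = [(r, t) for r, t in zip(rows, targets) if r[f] > threshold]
    return (f, threshold,
            build_tree([r for r, _ in lo], [t for _, t in lo], n_features, depth - 1, rng, min_size),
            build_tree([r for r, _ in hi], [t for _, t in hi], n_features, depth - 1, rng, min_size))

def predict_tree(node, row):
    while isinstance(node, tuple):  # internal nodes are tuples, leaves are floats
        f, threshold, left, right = node
        node = left if row[f] <= threshold else right
    return node

def fit_forest(rows, targets, n_trees=10, n_features=3, depth=3, seed=0):
    """Step 2: one bootstrap sample (drawn with replacement) per tree."""
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(build_tree([rows[i] for i in idx], [targets[i] for i in idx],
                                 n_features, depth, rng))
    return forest

def predict_forest(forest, row):
    """Step 4: simple average of the individual tree predictions."""
    return avg(predict_tree(tree, row) for tree in forest)

# Step 1: synthetic stand-in for dataset S (7 inputs, 1 target); values are made up.
rng = random.Random(42)
X = [[rng.random() for _ in range(7)] for _ in range(150)]
y = [30 * r[3] + 10 * r[6] + rng.gauss(0, 1) for r in X]

# 80% / 20% split into training and test data.
order = list(range(len(X)))
rng.shuffle(order)
cut = len(order) * 8 // 10
train_idx, test_idx = order[:cut], order[cut:]

forest = fit_forest([X[i] for i in train_idx], [y[i] for i in train_idx])
preds = [predict_forest(forest, X[i]) for i in test_idx]
```

Because every leaf stores a mean of training targets and the forest averages the leaves, each prediction necessarily lies within the range of the training targets; this is one reason tree ensembles interpolate well but do not extrapolate.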
Factor importance evaluation model based on Out-of-Bag (OOB) data error rate calculation
The conventional variable importance measures (VIM) for random forest models are calculated in two ways: from the Gini index or from the out-of-bag (OOB) data error rate. The OOB-based score, VIM(OOB), has a broader range of applications because it evaluates not only the categorization contribution of each variable individually but also the effect of interactions between variables on categorization performance. If randomly permuting the observations of a variable does not significantly change the OOB error rate, the expected value of the VIM for this variable is zero, indicating that the variable has no significant categorical power. In this study, the OOB error rate-based factor importance evaluation model was used to assess the importance of the seven factor categories for the constructed RF-R. Each random forest tree was built with a randomly selected bootstrap training sample, and its OOB prediction error rate was calculated. After the observations of variable \(\:{X}_{j}\) were randomly permuted, the tree was constructed again and the OOB prediction error rate was recalculated. Finally, the difference between the two OOB error rates was averaged over all trees and normalized to give the permutation importance of variable \(\:{X}_{j}\), \(\:VI{M}_{j}\). The importance values of all features involved in the modeling sum to 1; each value lies between 0 and 1, and the closer it is to 1, the more important the feature is in the modeling process. The specific formula is as follows:
\(\:VI{M}_{j}^{\left(OOB\right)}=\frac{1}{n}\sum_{i=1}^{n}\left(\frac{\sum_{p=1}^{{n}_{o}^{i}}I\left({Y}_{p}={Y}_{p}^{i}\right)}{{n}_{o}^{i}}-\frac{\sum_{p=1}^{{n}_{o}^{i}}I\left({Y}_{p}={Y}_{p,{\pi\:}_{j}}^{i}\right)}{{n}_{o}^{i}}\right)\)
where \(\:{n}_{o}^{i}\) is the number of OOB observations for the \(\:i\)-th tree; \(\:I\left(g\right)\) is an indicator function that takes the value 1 when the two values are equal and 0 when they are not; \(\:{Y}_{p}\in\:\left\{\text{0,1}\right\}\) is the true result for the \(\:p\)-th observation; \(\:{Y}_{p}^{i}\in\:\left\{\text{0,1}\right\}\) is the prediction of the \(\:i\)-th tree for the \(\:p\)-th OOB observation before random permutation; \(\:{Y}_{p,{\pi\:}_{j}}^{i}\in\:\left\{\text{0,1}\right\}\) is the prediction of the \(\:i\)-th tree for the \(\:p\)-th OOB observation after random permutation of \(\:{X}_{j}\); and \(\:n\) is the number of regression trees in the random forest.
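The OOB permutation procedure can be illustrated with a short pure-Python sketch. Several choices here are assumptions made only for illustration: the target is continuous, so the OOB mean squared error replaces the 0/1 indicator; each "tree" is a one-split stump; the fitted stump is reused on the permuted OOB data rather than being rebuilt; negative differences are clipped to zero before normalization; and the data and names (`fit_stump`, `oob_importance`) are hypothetical.

```python
import random

def fit_stump(rows, targets, feats):
    """Fit a one-split regression tree restricted to the features in `feats`."""
    best = None
    for f in feats:
        # Dropping the largest value guarantees a non-empty right branch.
        for threshold in sorted({r[f] for r in rows})[:-1]:
            left = [t for r, t in zip(rows, targets) if r[f] <= threshold]
            right = [t for r, t in zip(rows, targets) if r[f] > threshold]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
            if best is None or sse < best[0]:
                best = (sse, f, threshold, lm, rm)
    _, f, threshold, lm, rm = best
    return lambda row: lm if row[f] <= threshold else rm

def mse(model, rows, targets):
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)

def oob_importance(X, y, n_trees=30, n_sub=2, seed=0):
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    totals = [0.0] * d
    for _ in range(n_trees):
        boot = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
        in_bag = set(boot)
        oob = [i for i in range(n) if i not in in_bag]  # out-of-bag observations
        if not oob:
            continue
        feats = rng.sample(range(d), n_sub)            # random feature subspace
        model = fit_stump([X[i] for i in boot], [y[i] for i in boot], feats)
        oob_X, oob_y = [X[i] for i in oob], [y[i] for i in oob]
        base = mse(model, oob_X, oob_y)                # OOB error before permutation
        for j in range(d):
            column = [r[j] for r in oob_X]
            rng.shuffle(column)                        # permute variable X_j only
            permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(oob_X, column)]
            totals[j] += mse(model, permuted, oob_y) - base
    raw = [max(t / n_trees, 0.0) for t in totals]      # average over trees, clip at 0
    total = sum(raw)
    return [v / total for v in raw] if total > 0 else raw

# Synthetic data: the target depends strongly on x0, weakly on x1, not on x2 or x3.
rng = random.Random(1)
X = [[rng.random() for _ in range(4)] for _ in range(80)]
y = [20 * r[0] + 5 * r[1] + rng.gauss(0, 1) for r in X]
importance = oob_importance(X, y)
```

After normalization the scores sum to 1, matching the convention described above, and the dominant driver in the synthetic data (`x0`) receives the largest share.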
Backpropagation double hidden layer neuron network model
The backpropagation double-hidden-layer neural network (BPNN) is an artificial neural network model, inspired by the human nervous system, for processing and learning complex patterns and relationships. It consists of an input layer, hidden layers, and an output layer, with the neurons of each layer connected to those of the previous layer. The network computes an error function of the connection weights and updates the weights by propagating this error backward, thus gradually approximating the final result. The structure of a typical BPNN is shown in Fig. 4.
Support vector regression model
Support vector regression (SVR) is the regression form of the support vector machine (SVM). SVM was originally a binary classification model whose core idea is to map the samples into a space in which a separating boundary classifies them well and continues to classify new, unknown samples well. The basic process of the algorithm is as follows: after the input variables of the model are selected, the training function and its parameters are adjusted until the output reaches the desired accuracy. A suitable training function thus provides the theoretical basis for achieving a good simulation.
Indicators for evaluation of simulation results
To evaluate the performance of the random forest regression model in predicting evapotranspiration, this study employs three key metrics: Root Mean Square Error (RMSE), the coefficient of determination (R²), and Mean Bias Error (MBE). RMSE measures the deviation between actual and predicted values, reflecting the accuracy of predictions. R² indicates the strength of the association between simulated and observed values, while MBE assesses the average bias of the predicted values relative to the observed values. Together, these metrics provide a robust evaluation of the model’s effectiveness in simulating evapotranspiration processes. The formulas for the three indicators are shown below:
\(\:RMSE=\sqrt{\frac{1}{N}\sum_{t=1}^{N}{\left({E}_{sim,t}-{E}_{obs,t}\right)}^{2}}\)
\(\:{R}^{2}=1-\frac{\sum_{t=1}^{N}{\left({E}_{sim,t}-{E}_{obs,t}\right)}^{2}}{\sum_{t=1}^{N}{\left({E}_{obs,t}-\stackrel{-}{E}\right)}^{2}}\)
\(\:MBE=\frac{1}{N}\sum_{t=1}^{N}\left({E}_{sim,t}-{E}_{obs,t}\right)\)
where \(\:{E}_{sim,t}\) and \(\:{E}_{obs,t}\) are the simulated and measured ET at time step \(\:t\), respectively, \(\:\stackrel{-}{E}\) is the average of the measured ET, and \(\:N\) is the number of samples.
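As a worked illustration of the three indicators, the sketch below computes them in plain Python. It assumes that R² is the coefficient of determination taken about the mean of the observed series; the function names and the four sample values are hypothetical and are not data from the study.

```python
import math

def rmse(sim, obs):
    """Root mean square error between simulated and observed values."""
    return math.sqrt(sum((s - o) ** 2 for s, o in zip(sim, obs)) / len(obs))

def mbe(sim, obs):
    """Mean bias error; positive values indicate overestimation."""
    return sum(s - o for s, o in zip(sim, obs)) / len(obs)

def r_squared(sim, obs):
    """Coefficient of determination about the mean of the observations."""
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - s) ** 2 for s, o in zip(sim, obs))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot

# Hypothetical monthly ET values in mm/month.
obs = [10.0, 20.0, 30.0, 40.0]
sim = [12.0, 18.0, 33.0, 39.0]

scores = {"RMSE": rmse(sim, obs), "MBE": mbe(sim, obs), "R2": r_squared(sim, obs)}
```

For these values the squared errors sum to 18, so RMSE = √4.5 ≈ 2.12 mm/month, MBE = 0.5 mm/month, and R² = 1 − 18/500 = 0.964. Positive and negative errors partly cancel in MBE, which is why it is reported alongside RMSE rather than instead of it.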
Results and discussion
Simulation of average evapotranspiration in the study area
As shown in Fig. 5, the MBE of the RF-R model is −0.005 mm/month in the training set and −0.048 mm/month in the test set, and the RMSE is 3.421 and 4.182 mm/month, respectively. In the comparison of monthly average ET from 2001 to 2020, the simulation results closely overlap the product dataset in both the training set and the test set. The R² values between the RF-R simulation results and the PML-V2 (China) dataset are both above 0.90 (0.953 for training and 0.931 for testing), indicating that the model is reliable and accurate.
Table 1 shows that seasonal variations in ET significantly affect the prediction error indicators MBE and RMSE in the desert grassland region. Specifically, actual ET is lower in the fall and winter months as ET processes weaken, which usually results in smaller RMSE values. In contrast, during the spring and summer months, when ET increases as evapotranspiration processes intensify, MBE and RMSE values tend to be larger. Although the seasonal variations in ET change the error indicators, the RF-R model performed well in the simulations for all months, with R² above 0.95, indicating a strong correlation between the simulated and observed values. Thus, although the magnitude of the prediction error is affected by seasonal variations in ET, the RF-R model effectively captures the trend of ET and provides reliable simulation results in all months of the year. Considering both the model’s performance and ET’s seasonal characteristics, the RF-R model is adequate and accurate for predicting ET in desert grassland regions.
As can be seen from Fig. 6, the simulation results of the RF-R model are consistent with the trend of the validation data in each month and show significant seasonal variation, with relatively large evapotranspiration values in spring and summer and relatively small values in fall and winter. Relative to the validation product, the RF-R model both underestimates and overestimates ET in individual months: the deviation of the model prediction ranges from 0 to 50%, with a maximum of 49.453%, a minimum of 0.197%, and an average of 11.815%. Using the August 2020 data as an example for spatial comparison, both datasets exhibit a ‘low in the center, high around’ ET spatial pattern (Fig. 7), indicating that the RF-R model has successfully captured the dominant distribution characteristics of regional ET. The RF-R model slightly overestimates ET across the entire region, with an average deviation of 1 mm. The overestimation is primarily observed in eastern Inner Mongolia, where the maximum overestimation reaches 6 mm. Overall, the simulation performs well.
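The minimum, maximum, and average deviations quoted above can be summarized with a few lines of Python. The text does not state how the deviation is defined, so this sketch assumes the absolute relative difference |simulated − reference| / reference × 100; the function name and the three sample values are hypothetical.

```python
def percent_deviation(sim, ref):
    """Absolute relative deviation of each simulated value from its reference, in %."""
    return [abs(s - r) / r * 100.0 for s, r in zip(sim, ref)]

# Hypothetical monthly ET values (mm/month): reference product vs. model output.
ref = [5.0, 20.0, 40.0]
sim = [6.0, 19.0, 42.0]

dev = percent_deviation(sim, ref)
summary = {"min": min(dev), "max": max(dev), "mean": sum(dev) / len(dev)}
```

In this toy case the largest percentage deviation (20%) comes from the smallest reference value, even though its absolute error is only 1 mm/month. A relative measure is therefore most sensitive in low-ET months.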
As shown in Fig. 8, in terms of seasonal distribution, ET in spring increases from the interior outward, and northern Xinjiang and the Qilian Mountains are the regions with the largest spring ET. In summer, ET is smallest in the plain and basin areas and largest in the surrounding mountainous areas. In autumn, low ET values occur mainly in Xinjiang and high ET values mainly in Qinghai. In winter, ET is small overall, and only the Kunlun Mountains of Xinjiang have relatively large ET values.
Comparison of ET simulations with other machine learning models
To validate the advantages of the RF-R evapotranspiration prediction model in desert grasslands, this study compared RF-R with BPNN and SVR using identical training/test datasets. Predictions from all three models were evaluated against the PML-V2 (China) benchmark, with RF-R demonstrating superior simulation accuracy and generalization (Fig. 9).
As can be seen in Fig. 9, the RF-R model outperformed BPNN (R² = 0.906, 0.890) and SVR (R² = 0.900, 0.889) with higher R² values (0.953, 0.931), which indicates that the ET values simulated by the RF-R model are more strongly correlated with the ET values in the dataset. The MBE values of the BPNN model in the training and test sets were −0.046 and −0.352 mm/month, and those of the SVR model were 0.02 and −0.333 mm/month; all are larger in magnitude than the MBE values of the RF-R model (−0.005 and −0.048 mm/month), indicating that the ET values simulated by the RF-R model deviate less from the ET values in the dataset. The RMSE values of the BPNN and SVR models in the training and test sets are 4.768 and 5.155 mm/month and 4.912 and 5.259 mm/month, respectively, all higher than those of the RF-R model (3.421 and 4.182 mm/month). An independent-samples t-test was used to compare the means of the two pairs of models (RF-R vs. BPNN and RF-R vs. SVR). The p values were < 0.05, indicating that the differences are significant. The RF-R model therefore simulates regional evapotranspiration with higher accuracy than the BPNN and SVR models in both the training and test sets. This is consistent with previous studies21,22,23, which have illustrated the higher modelling ability of random forests, although the study areas are different.
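For reference, the statistic behind an independent-samples t-test can be sketched as follows. The text does not say which quantities were compared, so the two samples here are hypothetical per-sample absolute errors of two models, and the sketch assumes the equal-variance (pooled) form of the test; `pooled_t` is an illustrative name.

```python
import math
from statistics import mean, variance

def pooled_t(a, b):
    """Student's two-sample t statistic with pooled variance, plus degrees of freedom."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / math.sqrt(pooled_var * (1 / na + 1 / nb))
    return t, df

# Hypothetical absolute errors (mm/month) of two models on the same test samples.
errors_model_a = [1.0, 2.0, 3.0, 4.0, 5.0]
errors_model_b = [2.0, 4.0, 6.0, 8.0, 10.0]

t_stat, dof = pooled_t(errors_model_a, errors_model_b)
```

The p value follows from the t distribution with `dof` degrees of freedom. In practice a library routine such as SciPy's `scipy.stats.ttest_ind` returns the statistic and the p value together.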
Importance analysis of ET impact factors in the study area
This study evaluated ET data and the corresponding seven categories of environmental impact factors from 2001 to 2020, using changes in OOB error rates to determine the importance of individual impact factors for ET changes in the desert steppe region. Cloud cover constrains surface available energy by directly modulating solar radiation intensity, while the diurnal temperature range (DTR) reflects strong radiation under clear skies and low wind speed conditions during stable atmospheric periods. Together, they indirectly characterize wind speed and radiation, which are not model inputs. In addition, by differentiating spatial and temporal scales, the contribution of each influencing factor to ET changes under different spatial and temporal conditions is analyzed to reveal how these scales affect the dynamics of ET in desert grasslands.
As can be seen from Fig. 10, the importance of the main factors affecting evapotranspiration differs among seasons and regions. In spring, pre and wet have a relatively greater influence on evapotranspiration in all regions except region I. In region I, the effect of cld on ET is the largest at 0.195, and the effects of vap and tmp are the smallest at 0.105 and 0.097, respectively. In region II, vap and cld have the same importance of 0.111. The top three importance rankings in regions III and IV are pre > wet > cld.
In summer, pre and wet have a more significant influence on evapotranspiration. In region I, dtr ranks second in importance at 0.185, and tmp has the smallest influence at 0.088. In region II, vap, tmp, and cld have relatively little effect on ET. In region III, dtr and cld are second only to pre and wet in importance, at 0.154 and 0.146, respectively, and vap has the smallest effect on ET at 0.094. In region IV, cld has the most significant effect on ET at 0.163, and frs has the smallest effect at 0.023.
In fall, the most important factors in the four regions are cld, pre, wet, and pre, with importance values of 0.188, 0.225, 0.194, and 0.205, respectively. In regions I and II, vap and tmp have relatively little effect on ET. In region III, pre ranks second in importance at 0.173, with the remaining impact factors ranging from 0.11 to 0.14. In region IV, wet has the second most significant effect on ET after pre, at 0.204, and tmp has the smallest effect at 0.094.
In winter, the most important factors in the four regions are cld, wet, dtr, and cld, with importance values of 0.259, 0.245, 0.175, and 0.214, respectively. In region I, vap and pre have similar importance (0.163 and 0.162, respectively), and wet and frs have relatively small effects on ET (0.09 and 0.098, respectively). In region II, the importance of vap and frs is relatively small, at 0.075 and 0.074, respectively. In regions III and IV, cld is the least important factor (0.099) and the most important factor (0.214), respectively, and the importance of the other climate factors ranges from 0.11 to 0.18.
As can be seen in Fig. 11, pre and dtr have a large effect on ET at the monthly scale. In region I, pre has the greatest importance, followed by wet, and dtr the least, at 0.231, 0.145, and 0.111, respectively. The effect of dtr on evapotranspiration is greatest in regions II and III, at 0.173 and 0.216, respectively. In region IV, the top three importance rankings are pre > wet > vap. At the annual scale, pre has the most significant effect on ET, with an importance of 0.26–0.35, followed by wet, with an importance of 0.18–0.20.
In desert grassland areas, changes in evapotranspiration (ET) are caused by various factors whose degree of influence differs across spatial and temporal scales. By analyzing seasonal, monthly and annual scale data, this study revealed the main factors affecting ET changes and their regional differences. In terms of seasonal variation, precipitation (pre), wet day frequency (wet), cloud cover (cld) and diurnal temperature range (dtr) are the main factors influencing ET change. Specifically, in area I, summer precipitation had the greatest effect on ET, while cloud cover was most significant in spring, autumn and winter. This suggests that in area I, summer precipitation events significantly increase soil moisture and promote plant growth and transpiration, while in other seasons, cloud cover reduces solar radiation and thus reduces evaporation rates. In some regions, wet day frequency is a crucial factor influencing ET in spring, summer, and winter, and summer precipitation is also an important driver. This suggests that in regions, maintaining adequate wet day frequency throughout the year is crucial to maintaining ET levels, especially in seasons with less precipitation. Regional spring is mainly affected by precipitation, summer and autumn by wet day frequency, while winter is mainly driven by the temperature difference between day and night. This pattern suggests that in regions, precipitation initiates the plant growing season in spring, wet day frequency maintains high ET levels during warm months, and diurnal temperature differences mediate ET changes during cold months. In regions, ET in spring, summer and autumn is driven by precipitation and controlled by clouds in winter. This indicates that in regions, precipitation continuously provides necessary water resources during warm seasons to support plant activity and transpiration, while in winter, the presence of clouds reduces solar radiation and affects ET.
On the monthly scale, ET in areas I and IV is mainly influenced by precipitation. From an annual perspective, the dominant factor in all four regions was precipitation. This underscores the importance of precipitation as a significant driver of long-term ET trends and highlights the value of integrated precipitation data and predictive models in supporting water resources planning and management strategies.
Correlation analysis of climate drivers impacts on ET
Changes in ET are influenced by a variety of factors, and different studies have revealed the main drivers of changes in ET in different regions. The present study shows that changes in ET in the desert grassland region result from the combined effects of many climate factors, whose influence differs across spatial and temporal scales. In the importance ranking of the seven categories of factors calculated from the OOB data error rate, and combining different spatial and temporal scales, the important factors for ET changes in the study area were pre, wet, cld, and dtr, with the influence of precipitation being the most prominent. This is similar to the findings of Pang, et al.28 and Fu, et al.29. To further understand the relationship between these four factors and ET, this paper analyzes, season by season, the correlation between pixel-scale ET and each spatially interpolated climate factor.
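A pixel-scale correlation analysis of this kind can be illustrated with the Pearson correlation coefficient computed per pixel over time. The text does not name the correlation measure, so Pearson's r is an assumption here, and the two-pixel grid, the series values, and the names are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical seasonal series at two pixels, keyed by (row, col):
# ET together with one climate factor (pre at the first pixel, dtr at the second).
et = {(0, 0): [2.0, 4.0, 6.0, 8.0], (0, 1): [2.0, 4.0, 6.0, 8.0]}
factor = {(0, 0): [1.0, 2.0, 3.0, 4.0], (0, 1): [4.0, 3.0, 2.0, 1.0]}

# One coefficient per pixel gives the spatial correlation map.
corr_map = {pixel: pearson_r(et[pixel], factor[pixel]) for pixel in et}
```

Averaging `corr_map` over all pixels would give a spatially averaged coefficient of the kind reported for each season.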
As can be seen from Fig. 12, the correlation coefficients between ET and pre ranged from −0.60 to 0.95. The correlation was positive and significantly stronger in spring and summer than in fall and winter. In spring and summer, the temperature is high and the solar radiation is strong, which is conducive to evapotranspiration; at the same time, precipitation in these two seasons is usually large, providing sufficient water for evapotranspiration. In autumn and winter, as the temperature gradually decreases, surface evaporation and plant transpiration weaken, leading to a decrease in ET. If precipitation increases at this time, ET and precipitation may become negatively correlated because of the low temperature and low evaporation. Meanwhile, air temperature, precipitation and vegetation cover may vary with topographic conditions, and these differences may cause the relationship between ET and precipitation to differ among regions. The correlation coefficients between ET and wet ranged from −0.60 to 0.91, and the spatial distribution of the spring-summer ET-wet correlations was similar to that of ET and pre, with the vast majority of the study area showing positive correlations. The correlation coefficients between ET and cld ranged from −0.60 to 0.80; the spatially averaged coefficient in the fall was 0.19, an overall positive correlation. The spatially averaged correlation coefficients between seasonal ET and dtr were negative overall and decreased in magnitude in the order spring (−0.40), summer (−0.27), fall (−0.18), winter (−0.17). Overall, seasonal ET is more strongly correlated with pre and wet than with cld and dtr; it is positively correlated with pre, wet and cld and negatively correlated with dtr.
Owing to the desert steppe’s unique climatic conditions and vegetation types, precipitation is mainly exchanged between the atmosphere and the subsurface as water vapor through evapotranspiration30,31. Cloud cover exhibits a positive correlation with ET in our study (Fig. 10), which may initially appear counterintuitive given that clouds reduce solar radiation, a primary driver of evaporation. However, in desert grasslands, cloud cover is often associated with precipitation events, which increase soil moisture and vegetation activity. While reduced solar radiation under cloud cover may suppress evaporation temporarily, the dominant effect in arid regions is the enhanced moisture availability from rainfall32,33. This moisture sustains plant transpiration and soil evaporation over subsequent days, leading to a net increase in ET. A negative correlation was observed between ET and dtr, which was more pronounced during the spring and summer seasons. This may be because smaller daily temperature differences in spring and summer are usually associated with higher cloudiness or higher humidity, conditions that limit heat uptake and water evaporation from the surface and thus reduce the efficiency of evapotranspiration. In summary, the relationships between desert grassland evapotranspiration and its important climate factors vary markedly across spatial and temporal scales. These differences are important for understanding the water cycle process in desert grasslands and can help to formulate more effective water resource management and ecological protection strategies33,34,35.
Limitations and perspectives of the study
The current research still has some limitations. First, the scarcity of evapotranspiration measurements limits the training and validation of the models. The PML-V2 (China) dataset has passed multiscale tests against eddy covariance observations and water-balance estimates, and it is more accurate than the global PML-V2 product and other typical models. Nevertheless, problems and uncertainties may remain in its calibration, such as mismatches in the spatial range of the eddy covariance observations, parameter uncertainty arising from the limited number of stations, and the accuracy of the input data. Second, the Random Forest algorithm handles multidimensional data well and captures the complex relationships between multiple influencing factors and evapotranspiration, and the Random Forest regression model performs well in simulation by virtue of its high accuracy and fast training. However, machine learning models of this kind still require adequate physical explanation to support their predictions. Third, this study did not cover all factors that may affect ET. Future models therefore need to incorporate more variables related to desertification and respond more sensitively to changes in these variables.
Conclusion
This study developed a Random Forest Regression (RF-R) model to simulate ET in China’s desert grasslands, addressing the challenges of parameter complexity in traditional physical models and data scarcity in arid regions. The main conclusions are as follows:
(1) The RF-R model demonstrated superior accuracy (training R² = 0.953, testing R² = 0.931) compared to BPNN and SVR models, with lower RMSE (3.421–4.182 mm/month) and minimal bias (MBE < 0.048 mm/month). Unlike physically based models requiring extensive parameterization, the RF-R model leverages machine learning to bypass intermediate variables (e.g., aerodynamic resistance), making it a practical tool for regions lacking high-resolution observational data. This approach bridges the gap between data availability and hydrological modeling needs, particularly in understudied arid ecosystems.
(2) ET in desert grasslands exhibits pronounced seasonal and spatial heterogeneity, with peak values in summer and minimal activity in winter. Spatially, mountainous regions showed higher ET due to localized precipitation and vegetation, while basins and plains had lower values. These patterns align with the ecological vulnerability of desert grasslands, where water availability governs ecosystem stability. The model’s ability to capture these trends supports its use in monitoring degradation hotspots and guiding targeted restoration efforts.
(3) Precipitation (pre) and wet-day frequency (wet) emerged as the primary drivers of ET, followed by cloud cover (cld) and diurnal temperature range (dtr). The positive correlation between cloud cover and ET reflects the indirect role of clouds in the desert steppe, where cloudiness accompanies the precipitation that raises moisture availability. These findings indicate that ET is controlled by water supply rather than by radiation, which is critical for climate adaptation strategies.
(4) Building on the current findings and limitations, future studies will integrate high-resolution satellite data and additional variables to resolve fine-scale ET variability. Direct validation against eddy covariance observations in hyper-arid regions will address uncertainties in existing datasets, while coupling machine learning with process-based models could bridge the gap between predictive accuracy and mechanistic interpretability. Extending the temporal scope to include climate change projections will further strengthen the model’s utility for long-term water resource planning, drought resilience strategies, and ecological restoration efforts in these fragile ecosystems.
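The accuracy measures quoted in conclusion (1) can be computed from paired reference and simulated monthly ET values. The sketch below is illustrative only. It assumes that R² is the coefficient of determination, that RMSE is the root-mean-square error, and that MBE is the mean of simulated minus reference values; the exact definitions used in the study may differ. The function name `regression_metrics` and the sample numbers are hypothetical, not data from the study.

```python
from math import sqrt


def regression_metrics(observed, predicted):
    """Return R2, RMSE and MBE for paired observed and predicted values.

    Assumed definitions (not confirmed by the source):
      R2   = 1 - SS_res / SS_tot
      RMSE = sqrt(mean((predicted - observed)^2))
      MBE  = mean(predicted - observed)
    """
    n = len(observed)
    if n == 0 or n != len(predicted):
        raise ValueError("need two non-empty sequences of equal length")
    residuals = [p - o for o, p in zip(observed, predicted)]
    mean_obs = sum(observed) / n
    ss_res = sum(r * r for r in residuals)
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when all observed values are equal")
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": sqrt(ss_res / n),
        "mbe": sum(residuals) / n,
    }


# Hypothetical monthly ET values in mm/month, for illustration only.
reference_et = [10.0, 20.0, 30.0, 40.0]
simulated_et = [12.0, 18.0, 33.0, 39.0]
metrics = regression_metrics(reference_et, simulated_et)
```

Reporting RMSE and MBE together is useful because a small MBE can hide large errors of opposite sign, whereas RMSE penalizes them.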
Data availability
All authors made sure that all data and materials support our published claims and comply with field standards. Data citations: (1) http://dx.doi.org/10.5285/58a8802721c94c66ae45c3baa4d814d0; (2) Zhang Y, He S (2022) PML-V2(China): evapotranspiration and gross primary production dataset (2000.02.26-2020.12.31). National Tibetan Plateau/Third Pole Environment Data Center. https://doi.org/10.11888/Terre.tpdc.272389. Data sources are provided in the manuscript; requests for the data used in this study can be sent to nishi_chu@163.com.
References
Li, X-T. et al. Extreme drought does not alter the stability of aboveground net primary productivity but decreases the stability of belowground net primary productivity in a desert steppe of Northern China. Environ. Sci. Pollut Res. Int. 30 (9), 24319–24328. https://doi.org/10.1007/s11356-022-23938-1 (2022).
Yu, D-S. et al. Spatiotemporal variation of net primary productivity and its response to drought in inner Mongolian desert steppe. Global Ecol. Conserv. 33, e01991. https://doi.org/10.1016/j.gecco.2021.e01991 (2021).
Wu, Q. et al. Long-Term warming and nitrogen addition have contrasting effects on ecosystem carbon exchange in a desert steppe. Environ. Sci. Technol. 55 (11). https://doi.org/10.1021/acs.est.0c06526 (2021).
Fang, J-Z. et al. Research advancement in grassland ecosystem vulnerability and ecological resilience and its inspiration for improving grassland ecosystem services in the karst desertification control. Plants 11 (10), 1290. https://doi.org/10.3390/plants11101290 (2022).
Wu, N. et al. Spatial-Temporal characteristics and driving forces of aboveground biomass in desert steppes of inner mongolia, China in the past 20 years. Remote Sens. 15 (12). https://doi.org/10.3390/rs15123097 (2023).
Han, X. et al. The effect of time scales on the distribution of evapotranspiration and driving factors in desert grasslands. Agric. Water Manage. 284 https://doi.org/10.1016/j.agwat.2023.108348 (2023).
Tang, R., Li, Z-L. & Sun, X. Temporal upscaling of instantaneous evapotranspiration: an intercomparison of four methods using eddy covariance measurements and MODIS data. Remote Sens. Environ. 138, 102–118. https://doi.org/10.1016/j.rse.2013.07.001 (2013).
Bao, Y. et al. Comparison of an improved Penman-Monteith model and SWH model for estimating evapotranspiration in a meadow wetland in a semiarid region. Sci. Total Environ. 795, 148736. https://doi.org/10.1016/j.scitotenv.2021.148736 (2021).
Sur, C., Nam, W-H., Zhang, X., Tadesse, T. & Wardlow, B. D. Assessment of an evapotranspiration algorithm accounting for land cover types and photosynthetic perspectives using remote sensing images. GIScience Remote Sens. 60 (1). https://doi.org/10.1080/15481603.2023.2279802 (2023).
Zhang, C. et al. Calibration and validation of three evapotranspiration models in a tea field in the humid region of south-east China. Irrig. Drain. 71 (5), 1254–1267. https://doi.org/10.1002/IRD.2728 (2022).
Yin, J-Q. et al. Estimation of evapotranspiration through an improved daily global solar radiation in SEBAL model: a case study of the middle Heihe river basin. Theoret. Appl. Climatol. 155 (4). https://doi.org/10.1007/S00704-023-04796-4 (2023).
Zahid, M. N. et al. Evapotranspiration Estimation using a satellite-based surface energy balance: a case study of upper Bari doab, Pakistan. Environ. Earth Sci. 82 (24). https://doi.org/10.1007/S12665-023-11284-5 (2023).
Martens, B. et al. GLEAM v3: satellite-based land evaporation and root-zone soil moisture. Geosci. Model Dev. 10 (5), 1903–1925. https://doi.org/10.5194/gmd-10-1903-2017 (2017).
Cheng, L-Z., Yang, M-X., Wang, X-J. & Wan, G-N. Spatial and Temporal variations of terrestrial evapotranspiration in the upper Taohe river basin from 2001 to 2018 based on MOD16 ET data. Adv. Meteorol. https://doi.org/10.1155/2020/3721414 (2020).
Li, X-L., Xu, X-F., Tian, W., Tian, J. & He, C-S. Contribution of climate change and vegetation restoration to interannual variability of evapotranspiration in the agro-pastoral ecotone in Northern China. Ecol. Ind. 154 https://doi.org/10.1016/j.ecolind.2023.110485 (2023).
Nagappan, M., Gopalakrishnan, V. & Alagappan, M. Prediction of reference evapotranspiration for irrigation scheduling using machine learning. Hydrol. Sci. J. 65 (16), 2669–2677. https://doi.org/10.1080/02626667.2020.1830996 (2020).
Hascoet, T., Pellet, V., Aires, F. & Takiguchi, T. Learning global evapotranspiration dataset corrections from a water cycle closure supervision. Remote Sens. 16 (1). https://doi.org/10.3390/rs16010170 (2023).
Shaloo, Kumar, B., Bisht, H., Rajput, J., Mishra, A. K. & TM K K, Brahmanand, P. S. Reference evapotranspiration prediction using machine learning models: an empirical study from minimal climate data. Agron. J. 116 (3), 956–972. https://doi.org/10.1002/AGJ2.21504 (2023).
Breiman, L. Random forests. Mach. Learn. 45 (1), 5–32. https://doi.org/10.1023/A:1010933404324 (2001).
Wang, Z-D., Hu, M-H. & Zhai, G-T. Application of Deep Learning Architectures for Accurate and Rapid Detection of Internal Mechanical Damage of Blueberry Using Hyperspectral Transmittance Data. Sensors 18 (4), 1126. https://doi.org/10.3390/s18041126 (2018).
Xu, T. et al. Evaluating different machine learning methods for upscaling evapotranspiration from flux towers to the regional scale. J. Geophys. Research: Atmos. 123 (16), 8674–8690. https://doi.org/10.1029/2018JD028447 (2018).
Azzam, A., Zhang, W-C., Akhtar, F., Shaheen, Z. & Elbeltagi, A. Estimation of green and blue water evapotranspiration using machine learning algorithms with limited meteorological data: A case study in Amu Darya river basin, central Asia. Comput. Electron. Agric. 202 https://doi.org/10.1016/J.COMPAG.2022.107403 (2022).
Liu, X-C., Gao, L-M., Zhang, L-L. & Chen, K-L. Simulation of evapotranspiration of marsh meadow in Qinghai lake basin based on machine learning model. Int. Conf. Remote Sens. Mapp. Geographic Syst. (RSMG 2023). 12815 https://doi.org/10.1117/12.3010219 (2023).
Harris, I., Jones, P. D., Osborn, T. J. & Lister, D. H. Updated high-resolution grids of monthly Climatic observations – the CRU TS3.10 dataset. Int. J. Climatol. 34 (3), 623–642. https://doi.org/10.1002/joc.3711 (2014).
Zhang, D., Song, X-F., Zhang, Y-H., Yang, L-H. & Yang, S-T. Variation characteristics of precipitation over Weihe river basin based on CRU grid dataset during last 100 years. J. Arid Land. Resour. Environ. 32, 142–148. https://doi.org/10.13448/j.cnki.jalre.2018.061 (2018).
Zhang, Y. & He, S. PML-V2(China): evapotranspiration and gross primary production dataset(2000.02.26-2020.12.31). Natl. Tibetan Plateau/Third Pole Environ. Data Cent. https://doi.org/10.11888/Terre.tpdc.272389 (2022).
He, S-Y. et al. A daily and 500 m coupled evapotranspiration and gross primary production product across China during 2000–2020. ESSD 14 (12). https://doi.org/10.5194/essd-14-5463-2022 (2022).
Pang, X. et al. Long term variation of evapotranspiration and water balance based on upscaling eddy covariance observations over the temperate semi-arid grassland of China. Agric. For. Meteorol. 308–309. https://doi.org/10.1016/J.AGRFORMET.2021.108566 (2021).
Fu, J. et al. Spatial-temporal variations of terrestrial evapotranspiration across China from 2000 to 2019. Sci. Total Environ. 825, 153951. https://doi.org/10.1016/J.SCITOTENV.2022.153951 (2022).
Li, H., Wang, J., Liu, H., Miao, H. & Liu, J. Responses of vegetation yield to precipitation and reference evapotranspiration in a desert steppe in inner mongolia, China. J. Arid Land. 15 (04), 477–490. https://doi.org/10.1007/s40333-023-0051-2 (2023).
Wang, J., Zhang, R-Q., Li, H-P., Lu, H-Y., Cao, X-S. & Liu, R-C. Relationship between subsurface water depletion and meteorological vegetation factors at different time scales. Desert Grasslands Arid Zone Agricultural Res. 38 (4), 8. https://doi.org/10.7606/j.issn.1000-7601.2020.04.19 (2020).
Wang, Z-P. et al. Temporal and Spatial variation of water use efficiency of vegetation in mu Us sandy land and its influencing factors. J. Appl. Ecol. 12 https://doi.org/10.13287/j.1001-9332.202506.007 (2025).
Xun, T-Y. et al. Temporal and Spatial changes of evapotranspiration in the Beiluo river basin and its response to climate and vegetation evolution. Acta Ecol. Sin. (12), 1–12. https://doi.org/10.20103/j.stxb.202407241749 (2025).
Wang, Z-C., Wang, Y-P., Xu, Z-Y. & Xun, W-C. Changes characteristics and dominant factors of potential evapotranspiration in different dry and wet zones of inner Mongolia. Arid Land. Geogr. 48 (04), 612–622. https://doi.org/10.12118/j.issn.1000-6060.2024.480 (2025).
Lu, H., Zeng, Y-N. & Wang, P-C. Spatiotemporal variation of actual evapotranspiration and its influencing factors in the Northeast Qinghai-Xizang plateau. Arid Land. Geogr. 48 (5), 753–764. https://doi.org/10.12118/j.issn.1000-6060.2024.395 (2025).
Funding
Seminar 3 of the National Key R&D Program “Prevention and Control of Major Natural Disasters and Public Safety” (No. 2023YFC3006503).
Author information
Contributions
All the authors participated in the experiment and reviewed the manuscript. Chen H.T. revised the manuscript and conducted model design. Kang A.Q. was responsible for the initial concept. Chu N.S. processed the data, ran the models, and wrote the initial draft of the manuscript. Wang W.C. put forward suggestions to improve the quality of learning. He J. shared many comments. All authors have read and agreed to the published version of the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Ethics approval
All authors complied with the ‘Ethical Responsibilities of Authors’.
Consent to participate
All authors gave explicit consent to participate in this work.
Consent to publish
All authors gave explicit consent to publish this manuscript.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Chen, H., Chu, N., Kang, A. et al. Simulation and analysis of evapotranspiration from desert grasslands based on a random forest regression model. Sci Rep 15, 25760 (2025). https://doi.org/10.1038/s41598-025-11056-0
DOI: https://doi.org/10.1038/s41598-025-11056-0