Background & Summary

Multi-layer soil moisture serves as a vital water source for root water uptake and crop transpiration, playing a crucial role in crop growth and yield formation1,2,3,4,5. High-resolution and high-accuracy multi-layer soil moisture data provide essential support for predicting agricultural droughts and waterlogging, as well as for guiding farmland management6,7,8. Northeast China is a major agricultural region for the cultivation of soybeans, maize, and rice9. Notably, the black soil region, one of the world’s four major black soil areas10, is a key commercial grain production base. However, in recent years, climate change has markedly altered the spatial and temporal distribution of precipitation across the region, with a notable rise in the frequency of extreme weather events. These changes have amplified the temporal variability and spatial heterogeneity of soil moisture, posing significant challenges to crop stability and agricultural productivity. Concurrently, intensified agricultural activities have heightened the sector’s reliance on water resources, thereby increasing the region’s vulnerability to both drought and waterlogging events11. To effectively confront these challenges, there is an urgent need to harness high-resolution soil moisture data to strengthen agricultural risk early warning systems and resource management. Multi-layered soil moisture datasets have broad applicability in real-world contexts, including drought monitoring, water stress detection, irrigation scheduling, and crop planning. Accurately characterizing the spatiotemporal dynamics of soil moisture can support timely adjustments in irrigation and sowing practices by producers, while also providing a robust scientific foundation for regional food security assessments and the formulation of agricultural policy. Understanding soil moisture dynamics in this region is therefore crucial for ensuring stable food production and optimizing agricultural resource management.

In recent years, the development of soil moisture datasets has garnered significant attention. Theoretical methods for soil moisture retrieval using visible, infrared, and microwave bands have become well established12,13,14. Common approaches include the moisture deficit index, soil thermal inertia, and the temperature-vegetation drought index15,16,17,18. Microwave-based retrieval methods typically derive soil moisture by establishing relationships between backscatter coefficients and soil moisture content13,19,20. However, while visible and infrared-based retrieval methods offer high spatial resolution, they are severely limited by cloud cover, which hinders soil moisture estimation under cloudy conditions. Conversely, microwave-based retrieval can penetrate cloud cover but suffers from low spatial resolution, typically in the range of tens of kilometers21,22,23. To overcome these limitations, researchers have increasingly adopted land surface models and data assimilation techniques to generate soil moisture products at both regional and global scales24,25,26. Notable examples include the ERA5-Land reanalysis dataset and the Global Land Data Assimilation System (GLDAS). Additionally, several widely used global soil moisture datasets, such as SMOS, SMAP, and the latest ESA CCI global soil moisture dataset, provide valuable resources for large-scale soil moisture monitoring7,27,28,29. Despite advancements in integrating multi-source remote sensing data for soil moisture estimation, existing datasets still face considerable limitations. Most struggle to simultaneously achieve long-term continuity, high spatial and temporal resolution, and multi-layer soil moisture information27,30,31. Furthermore, due to data constraints, our understanding of multi-layer soil moisture dynamics in certain regions remains insufficient, introducing uncertainties and challenges in soil moisture prediction. 
Therefore, further research is essential to develop large-scale soil moisture monitoring technologies that can provide more comprehensive, continuous, and high-resolution multi-layer soil moisture data.

The Community Land Model (CLM) allows direct integration of remote sensing observations of land surface conditions to generate accurate and spatiotemporally consistent land surface state fields. By employing ensemble-based methods, CLM effectively reduces uncertainties in land surface simulations, making it possible to incorporate high-resolution satellite-derived soil moisture data to update soil moisture estimates within the model26. In this study, we enhanced the CLM3.5 (Community Land Model version 3.5) framework by replacing its default global datasets with high-resolution regional datasets tailored for Northeast China. Specifically, the original FAO-based soil texture dataset was replaced with data from the Second National Soil Survey (SNSS), and the MODIS-derived land cover dataset was substituted with the China Regional Land Cover Dataset (CLCV). These modifications provide a more accurate representation of regional soil and land cover characteristics.

To further optimize soil moisture simulation, we adopted the scheme proposed by Song et al., which employs atmospheric forcing data from the China Meteorological Administration Land Data Assimilation System (CMA-LDAS) to drive the improved CLM3.5 model. This scheme was applied to simulate multi-layer soil moisture in the Inner Mongolia Autonomous Region of China, where it yielded varying degrees of improvement in soil moisture simulation. Specifically, soil moisture simulated under this scheme best captured the spatiotemporal variation of observed soil moisture in Inner Mongolia, leading to a significant enhancement in simulation performance. Building upon this scheme, we developed a high-resolution, multi-layer daily soil moisture dataset spanning 2008–2023, with a spatial resolution of 2 km and covering four depth layers (0–10 cm, 10–20 cm, 20–50 cm, and 50–100 cm). This dataset improves both the accuracy and the spatiotemporal resolution of soil moisture estimates, thereby capturing the dynamic changes in multi-layer soil moisture more precisely.

The black soil region in northeast China is one of the major black soil belts in the world (Fig. 1), distinguished by a variety of soil types, including black soil, meadow soil, and chestnut soil. These soils are rich in organic matter and essential nutrients, making the region highly suitable for agricultural production. The primary crops cultivated in this area include rice, maize, soybeans, and sorghum. This region encompasses Heilongjiang province, Jilin province, Liaoning province, and parts of eastern Inner Mongolia, covering a total area of approximately 1.24 million square kilometers. Of this, the area of typical black soil cultivated land spans 18.53 million hectares. The Northeast Black Soil Region serves as China’s largest grain production base, contributing around one-quarter of the nation’s total grain output. Between 2020 and 2022, the implementation of conservation tillage in the black soil region expanded from 4.6 million hectares to 8.3 million hectares. Despite these efforts, soil degradation remains a persistent issue, with varying degrees of severity across the region.

Fig. 1

Distribution map of the Northeast Black Soil Region in China and the locations of the in situ soil moisture stations. The stations shown in the figure are: 1, Suolun; 2, Dashizhai; 3, Guiliuhe; 4, Barigasitai; 5, Alideer; 6, Eti; 7, Julihen.

Methods

Introduction to CLM3.5

The Community Land Model (CLM) is one of the most advanced and well-developed land surface models globally. It serves as the land surface component of the Community Climate System Model (CCSM) and has been developed through collaboration among various research institutions, building on the foundations of the Common Land Model (CoLM) and the NCAR Land Surface Model (NCAR LSM), among others. CLM incorporates insights from the Biosphere-Atmosphere Transfer Scheme (BATS)32, the IAP94 land surface model developed at the Institute of Atmospheric Physics, Chinese Academy of Sciences33, and the NCAR LSM, evolving into a third-generation land surface process model. Significant improvements have been made to the model’s land surface parameters and hydrological processes, and updates have been implemented to integrate MODIS-based surface datasets and to enhance the canopy interception scheme34. These advancements culminated in the development of CLM3.5 (available at http://www.cgd.ucar.edu/tss/clm/distribution/clm3.5/index.html)35. Numerous offline simulations, evaluated against observed runoff, soil moisture, and total water storage, have demonstrated that CLM3.5 significantly improves the distribution of global evapotranspiration36. The model simulates wetter soil moisture conditions, lower vegetation water stress, enhanced transpiration, and increased photosynthesis. Additionally, improvements were observed in the interannual variability of total land water storage, as well as in the phase and amplitude of runoff interannual variations36. In this study, CLM3.5 was employed to simulate multi-layer soil water content using updated land use and soil texture data. The simulations cover four soil depth layers (0–10 cm, 10–20 cm, 20–50 cm, and 50–100 cm), as shown in Fig. 2.

Fig. 2

Technical flowchart.

The driving data for CLM3.5 encompass fundamental datasets such as soil texture, DEM37, land cover (https://doi.org/10.5067/MODIS/MCD12Q1.061), and leaf area index (LAI) (https://doi.org/10.5067/MODIS/MOD15A2H.061) derived from MODIS product data, alongside meteorological variables including precipitation and temperature. Daily precipitation and temperature data are derived from the ERA5-Land reanalysis dataset provided by the European Centre for Medium-Range Weather Forecasts (ECMWF; available at https://cds.climate.copernicus.eu/). These ERA5-derived data have been corrected using observational datasets from the National Qinghai-Tibet Plateau Science Data Center, which cover China with a 1 km spatial resolution and monthly temporal resolution spanning 1961–2014 (accessible at https://data.tpdc.ac.cn/home). In the data processing workflow, factors such as longitude, latitude, and altitude are explicitly incorporated. Compared to analogous products, this dataset exhibits higher resolution and lower uncertainty38.

Delta downscaling method

The Delta downscaling method is a commonly used statistical downscaling technique that has been widely applied to the output of General Circulation Models (GCMs)39. In this study, the Delta method was applied to temperature and precipitation data to enhance the spatial accuracy of the climate driving data. Specifically, the original low-resolution data were first interpolated onto the target high-resolution grid. Based on monthly climatological means from a historical reference period, systematic biases between the observation-based data and the historical data were then quantified at each grid point, and spatial correction factors were computed (Eqs. 1 and 2). These factors were then applied to bias-correct the downscaled climate data (Eqs. 3 and 4). This downscaling approach was used to preprocess the driving data for the CLM3.5 land surface model, providing high-resolution input data that support accurate simulation and validation of soil moisture. Although the Delta downscaling method can generate input data at higher resolution, two considerations led us to adopt a 2 km resolution for the soil moisture dataset. First, the CLM3.5 model requires extensive inputs, including meteorological variables and underlying surface parameters, and some of these datasets are unavailable at resolutions finer than 2 km, which makes it difficult to bring all input data uniformly to a finer resolution. Second, the primary objective of this dataset is to support future analyses of drought and flood impacts on agriculture, for which a 2 km resolution is sufficient.

$$\text{Delta}(\text{P})=\frac{{\text{P}}_{\text{obs}}}{{\text{P}}_{\text{his}}}$$
(1)
$$\text{Delta}(\text{T})=\frac{{\text{T}}_{\text{obs}}}{{\text{T}}_{\text{his}}}$$
(2)
$${\text{P}}_{{\rm{rcp}}}^{{\prime} }=\text{Delta}(\text{P})\times {\text{P}}_{\text{rcp}}$$
(3)
$${\text{T}}_{{\rm{rcp}}}^{{\prime} }=\text{Delta}(\text{T})\times {\text{T}}_{\text{rcp}}$$
(4)

where $\text{P}_{\text{obs}}$ is the precipitation derived from observations, $\text{P}_{\text{his}}$ is the historical precipitation, $\text{Delta}(\text{P})$ is the precipitation difference coefficient, $\text{P}_{\text{rcp}}$ is the uncorrected precipitation, and $\text{P}_{\text{rcp}}^{\prime}$ is the bias-corrected precipitation. Likewise, $\text{T}_{\text{obs}}$ is the air temperature derived from observations, $\text{T}_{\text{his}}$ is the historical air temperature, $\text{Delta}(\text{T})$ is the corresponding coefficient for temperature, $\text{T}_{\text{rcp}}$ is the uncorrected temperature, and $\text{T}_{\text{rcp}}^{\prime}$ is the bias-corrected temperature.
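
As a minimal sketch of Eqs. (1)–(4), the following Python snippet computes the Delta factor per grid cell and applies it to an uncorrected field. It assumes the coarse data have already been interpolated onto the target grid and flattened into equal-length lists. The function names, the sample values, and the guard against near-zero historical values are illustrative assumptions, not part of the authors' processing chain.

```python
def delta_factor(obs_clim, his_clim, eps=1e-6):
    """Eqs. (1)-(2): ratio of observed to historical climatology per grid cell."""
    # Assumption: where the historical value is ~0, leave the field unchanged (factor 1.0)
    return [o / h if abs(h) > eps else 1.0 for o, h in zip(obs_clim, his_clim)]


def apply_delta(raw, factor):
    """Eqs. (3)-(4): scale the uncorrected field by the Delta factor."""
    return [f * x for f, x in zip(factor, raw)]


# Hypothetical monthly precipitation climatologies (mm) on three fine-grid cells
p_obs = [60.0, 45.0, 30.0]
p_his = [50.0, 45.0, 0.0]
# Hypothetical uncorrected precipitation (mm) after interpolation to the fine grid
p_raw = [10.0, 8.0, 2.0]

p_corrected = apply_delta(p_raw, delta_factor(p_obs, p_his))
print(p_corrected)
```

The same two functions can be applied to temperature, following Eqs. (2) and (4) as written.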

Error analysis method

The error analysis of the DMSM (Data Management and Simulation Model) dataset aims to quantify systematic and random errors by comparing simulated values with field observation data. To comprehensively assess the performance of DMSM, four metrics were used: the correlation coefficient (R), bias, mean absolute error (MAE), and root mean square error (RMSE)7.

$$R=\frac{\sum ({S}_{i}-\overline{S})({M}_{i}-\overline{M})}{\sqrt{\sum {({S}_{i}-\overline{S})}^{2}\sum {({M}_{i}-\overline{M})}^{2}}}$$
(5)
$$MAE=\frac{1}{m}\mathop{\sum }\limits_{i=1}^{m}|{S}_{i}-{M}_{i}|$$
(6)
$$RMSE=\sqrt{\frac{{\sum }_{i=1}^{m}{({S}_{i}-{M}_{i})}^{2}}{m}}$$
(7)
$$bias=\frac{1}{m}\mathop{\sum }\limits_{i=1}^{m}({S}_{i}-{M}_{i})$$
(8)
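
The four metrics can be sketched in a few lines of Python. This is an illustrative implementation, not the authors' code: it assumes that S denotes the simulated series, M the measured series, and m the number of paired samples, and it computes R as the standard Pearson correlation coefficient. The sample values are hypothetical.

```python
import math


def evaluate(sim, obs):
    """Return R, MAE, RMSE and bias for paired simulated/observed values."""
    m = len(sim)
    s_mean = sum(sim) / m
    o_mean = sum(obs) / m
    # Pearson correlation: covariance over the product of standard deviations
    cov = sum((s - s_mean) * (o - o_mean) for s, o in zip(sim, obs))
    var_s = sum((s - s_mean) ** 2 for s in sim)
    var_o = sum((o - o_mean) ** 2 for o in obs)
    r = cov / math.sqrt(var_s * var_o)
    mae = sum(abs(s - o) for s, o in zip(sim, obs)) / m
    rmse = math.sqrt(sum((s - o) ** 2 for s, o in zip(sim, obs)) / m)
    bias = sum(s - o for s, o in zip(sim, obs)) / m
    return {"R": r, "MAE": mae, "RMSE": rmse, "bias": bias}


# Hypothetical volumetric soil moisture values (m3/m3)
simulated = [0.22, 0.25, 0.30, 0.27]
measured = [0.20, 0.26, 0.28, 0.29]
metrics = evaluate(simulated, measured)
print(metrics)
```

A positive bias under this sign convention means the simulation is wetter than the observations on average.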

Data Records

This dataset includes soil moisture data for four vertical layers within the 0–100 cm depth range. Each layer is archived as a compressed file in .7z format, which requires decompression software for extraction, with names of the form “DMSM 1.7z”. The data files are freely and publicly available at figshare (https://doi.org/10.6084/m9.figshare.28523087)40. All files are stored in GeoTIFF format, referenced to the World Geodetic System 1984 (WGS84) coordinate system, with a spatial resolution of 0.019°. The file naming convention follows the example “NELDAS_V1_Land_2km._YYYYMMDD_10.tif”, where “YYYYMMDD” denotes the date of the data. The suffixes in the filenames correspond to specific depth layers: “_10” for the first layer (0–10 cm), “_20” for the second layer (10–20 cm), “_50” for the third layer (20–50 cm), and “_100” for the fourth layer (50–100 cm). Files are stored in separate subfolders according to their layer.
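
For users who need to index the files programmatically, the following Python sketch parses the date and depth layer from a filename. The regular expression is inferred from the single example filename above and may need adjusting against the actual archive. The second layer is taken as 10–20 cm, consistent with the layer definitions used elsewhere in this paper. The function name and the sample filename are hypothetical.

```python
import re
from datetime import datetime

# Depth range (cm) associated with each filename suffix
LAYER_DEPTHS = {"10": (0, 10), "20": (10, 20), "50": (20, 50), "100": (50, 100)}

# Pattern inferred from the example filename; an assumption, not a specification
NAME_RE = re.compile(r"^NELDAS_V1_Land_2km\._(\d{8})_(10|20|50|100)\.tif$")


def parse_name(filename):
    """Return (date, (top_cm, bottom_cm)) for a DMSM GeoTIFF filename."""
    match = NAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected filename: {filename}")
    day = datetime.strptime(match.group(1), "%Y%m%d").date()
    return day, LAYER_DEPTHS[match.group(2)]


day, depth = parse_name("NELDAS_V1_Land_2km._20210715_50.tif")
print(day, depth)
```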

Technical Validation

We validated the accuracy of the newly developed DMSM dataset at different spatial and temporal scales in four steps. First, accuracy analysis was performed using the R, RMSE, MAE, and bias metrics, comparing the DMSM dataset with data from seven field observation stations. Second, DMSM was systematically compared with soil moisture data from ERA5, produced by the European Centre for Medium-Range Weather Forecasts (ECMWF; https://cds.climate.copernicus.eu), and from GLDAS, the Global Land Data Assimilation System (https://ldas.gsfc.nasa.gov/gldas/GLDASpublications.php), as well as with precipitation data, to verify the applicability and accuracy of CLM3.5 in reproducing temporal variation trends. Third, at the regional scale, we analyzed the correlation between DMSM and both ERA5 and GLDAS on a gridded scale to assess the spatial consistency between DMSM and existing soil moisture data. Fourth, we analyzed the spatial variation trends of DMSM in the northeastern black soil region to understand its spatiotemporal variability at this regional scale.

Validation using field observation station data

In this study, soil moisture data from 8 in-situ measurement sites within the Northeast Black Soil Region were obtained for the period 2008–2023 (Fig. 1). The data were collected at 10-day intervals from May to September throughout this period. A comprehensive verification analysis of the DMSM was performed (Fig. 3). Overall, the coefficient of determination (R²) between DMSM and the in-situ data was 0.65, indicating that DMSM generally exhibited good consistency with the measured soil moisture data. Additionally, we collected data from seven field observation sites, encompassing soil layers of 0–10 cm, 10–20 cm, 20–50 cm, and 50–100 cm. For each of these seven sites, precision verification and comparative analysis of soil moisture across different depths were conducted.

Fig. 3

Comparison of in situ soil moisture and soil moisture from CLM3.5.

To assess the accuracy of the CLM3.5-generated soil moisture data at depths of 0–10 cm, 10–20 cm, 20–50 cm, and 50–100 cm, the observed soil moisture data from the seven field observation sites were compared with the simulated soil moisture at the corresponding depths. Figures 4 and 5 present the R², bias, MAE, and RMSE. Among the seven observation sites, three (Barigasitai, Dashizhai, Julihen) provided data for all four layers (0–10 cm, 10–20 cm, 20–50 cm, and 50–100 cm), while four (Alideer, Eti, Guiliuhe, Suolun) provided data for only three layers (0–10 cm, 10–20 cm, and 20–50 cm). All observation stations are located in relatively homogeneous areas within their respective 2 km × 2 km grid cells. Specifically, the surroundings of each station are characterized by flat terrain with no significant topographic relief, and the dominant land cover type is cropland, consistent with the main surface type of the grid cell (Fig. 1). Accuracy validation was performed separately for the four-layer and three-layer sites. The results showed that, in general, R for DMSM was no lower than 0.7, with RMSE values between 0.035 and 0.07, MAE values between 0.03 and 0.06, and bias values between −0.02 and 0.02 (Fig. 6). Furthermore, we analyzed the correlation of GLDAS-SM and ERA5-SM with the observed soil moisture data and found that their R values were all lower than that of DMSM. Based on the four metrics, the soil moisture simulations for 0–10 cm, 10–20 cm, 20–50 cm, and 50–100 cm in the northeastern black soil region generated by CLM3.5 are highly reliable in terms of accuracy.

Fig. 4

Comparison of in situ multi-layer soil moisture and multi-layer soil moisture from CLM3.5.

Fig. 5

Evaluation of GLDAS-SM and ERA5-SM against ground-based soil moisture measurements.

Fig. 6

Evaluation of multi-layer soil moisture from CLM3.5 against in situ soil moisture during the vegetation growing seasons of 2016–2022.

Temporal variation of the DMSM

To further analyze whether DMSM can effectively reflect the temporal variation of soil moisture, its temporal performance was evaluated. We selected the growing seasons (May to September) of 2021 and 2022 and compared DMSM with ERA-SM and GLDAS-SM at the seven observation sites to analyze their temporal consistency. Within the 0–100 cm range, ERA-SM and GLDAS-SM are each divided into three layers (ERA-SM: Layer 1, 0–7 cm; Layer 2, 7–28 cm; Layer 3, 28–100 cm; GLDAS-SM: Layer 1, 0–10 cm; Layer 2, 10–40 cm; Layer 3, 40–100 cm), whereas DMSM consists of four layers (Layer 1, 0–10 cm; Layer 2, 10–20 cm; Layer 3, 20–50 cm; Layer 4, 50–100 cm), so soil moisture values had to be aligned across depths before comparison. To this end, the average soil moisture of the 20–50 cm and 50–100 cm layers of DMSM was calculated to match the three ERA-SM layers, while the average soil moisture of the 10–20 cm and 20–50 cm layers of DMSM was computed to align with the GLDAS-SM depths. The corresponding precipitation data were also compared. Figure 7 shows that DMSM, ERA-SM, and GLDAS-SM exhibit good consistency at the daily time scale across all sites. DMSM captures both the daily and seasonal variations of soil moisture and reflects its response to precipitation events. Furthermore, the relatively higher DMSM values are mainly attributed to major precipitation events.
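
The layer matching described above can be sketched as follows. The snippet takes the simple arithmetic mean of two DMSM layers, as stated in the text. Assigning the 20–50 cm and 50–100 cm mean to ERA-SM Layer 3 and the 10–20 cm and 20–50 cm mean to GLDAS-SM Layer 2 is our reading of that description, and all names and sample values are hypothetical. A thickness-weighted mean would be a possible alternative that is not used here.

```python
def mean_of_layers(*layers):
    """Arithmetic mean of co-located soil moisture series, one value per day."""
    return [sum(values) / len(values) for values in zip(*layers)]


# Hypothetical daily DMSM values (m3/m3) at one site for the four layers
dmsm = {
    "0-10": [0.24, 0.26],
    "10-20": [0.25, 0.27],
    "20-50": [0.28, 0.30],
    "50-100": [0.30, 0.32],
}

# Deep comparison layer for ERA-SM: mean of DMSM 20-50 cm and 50-100 cm
dmsm_for_era_l3 = mean_of_layers(dmsm["20-50"], dmsm["50-100"])
# Middle comparison layer for GLDAS-SM: mean of DMSM 10-20 cm and 20-50 cm
dmsm_for_gldas_l2 = mean_of_layers(dmsm["10-20"], dmsm["20-50"])
print(dmsm_for_era_l3, dmsm_for_gldas_l2)
```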

Fig. 7

Comparison of Temporal Variations in Soil Moisture from DMSM, ERA5, and GLDAS with Precipitation during the 2021–2022 Growing Season at Seven Soil Moisture Observation Stations.

To further explore the ability of the CLM3.5-simulated soil moisture data (DMSM) to represent spatial soil moisture variation, we mapped the pixel-scale correlation coefficients between DMSM and ERA5 soil moisture data (ERA-SM) and between DMSM and GLDAS soil moisture data (GLDAS-SM), as shown in Fig. 8. DMSM is divided into 0–10 cm, 10–20 cm, 20–50 cm, and 50–100 cm layers, whereas ERA-SM uses 0–7 cm, 7–28 cm, and 28–100 cm layers and GLDAS-SM uses 0–10 cm, 10–40 cm, and 40–100 cm layers. We therefore matched DMSM to ERA-SM and to GLDAS-SM by depth, obtaining corresponding three-layer soil moisture data for each pairing, and then conducted a correlation analysis for the three layers to demonstrate the consistency and differences between the data sources at the spatial scale. Bilinear interpolation determines the new value of a pixel as the weighted average of the four nearest input pixel centers, with weights assigned according to their respective distances. This method yields higher image quality after scaling, avoids discontinuities, and is thus well suited to continuous datasets without distinct boundaries41. For the grid-scale correlation analysis, we therefore applied bilinear interpolation to the higher-resolution DMSM data to align its spatial resolution with that of the ERA-SM and GLDAS-SM datasets. The correlation coefficient R is positive across most of the northeastern black soil region and exceeds 0.7 in most areas, indicating that DMSM is highly consistent with ERA-SM and GLDAS-SM in representing the spatial variation of soil moisture (Fig. 8).
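
The bilinear weighting described above can be illustrated with a short Python function. This is a sketch under simplifying assumptions: the grid is a list of rows with at least two rows and two columns, positions are given in fractional index units rather than geographic coordinates, and the edge handling is our own choice. It is not the resampling routine used to produce the figures.

```python
def bilinear(grid, row, col):
    """Interpolate grid at a fractional (row, col) from the four nearest cells."""
    # Clamp the upper-left index so the 2x2 neighbourhood stays inside the grid
    r0 = min(int(row), len(grid) - 2)
    c0 = min(int(col), len(grid[0]) - 2)
    fr, fc = row - r0, col - c0
    # Interpolate along columns first, then along rows
    top = grid[r0][c0] * (1 - fc) + grid[r0][c0 + 1] * fc
    bottom = grid[r0 + 1][c0] * (1 - fc) + grid[r0 + 1][c0 + 1] * fc
    return top * (1 - fr) + bottom * fr


# Hypothetical 2 x 2 patch of soil moisture values (m3/m3)
patch = [[0.20, 0.30],
         [0.40, 0.50]]
centre = bilinear(patch, 0.5, 0.5)
print(centre)
```

At the centre of the patch all four cells receive equal weight, so the result is their mean.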

Fig. 8

Spatial distribution of correlation coefficient between DMSM, ERA-SM, and GLDAS-SM. a, b, and c refer to coefficients between DMSM and ERA-SM at Layer 1, Layer 2, Layer 3; d, e, and f refer to coefficients between DMSM and GLDAS-SM at Layer 1, Layer 2, Layer 3.

The trend of soil moisture

The trend of soil moisture in the northeastern black soil region from 2008 to 2023 was analyzed using the least squares method, as shown in Fig. 9. The pixel values in the figure range from −0.01 to 0.01. Pixels with values greater than 0 indicate an increasing trend, whereas those with values less than 0 indicate a decreasing trend. A larger absolute value corresponds to a more pronounced trend. The trend of DMSM at the 0–10 cm depth indicates that most areas in the northeastern black soil region show increasing soil moisture (Fig. 9), with the most noticeable increase occurring in the central and eastern parts of the study area, suggesting a shift towards wetter conditions. In comparison, the increasing trend is slightly weaker for the other three soil moisture layers (10–20 cm, 20–50 cm, and 50–100 cm). The areas with decreasing soil moisture are mainly located in the southwestern part of the northeastern black soil region, which is mainly a sandy soil area.
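
A per-pixel least-squares trend can be sketched as below. The function returns the slope of an ordinary least-squares line fitted to one pixel's time series. The use of annual means, the time unit (years), and the sample values are assumptions made for illustration, since the text does not specify the temporal aggregation.

```python
def ols_slope(values, times=None):
    """Least-squares slope of values against times (defaults to 0, 1, 2, ...)."""
    n = len(values)
    if times is None:
        times = list(range(n))
    t_mean = sum(times) / n
    v_mean = sum(values) / n
    num = sum((t - t_mean) * (v - v_mean) for t, v in zip(times, values))
    den = sum((t - t_mean) ** 2 for t in times)
    return num / den


# Hypothetical annual-mean soil moisture (m3/m3) for one pixel, 2008-2011
series = [0.250, 0.252, 0.255, 0.256]
slope = ols_slope(series, times=[2008, 2009, 2010, 2011])
print(slope)
```

A positive slope corresponds to a wetting pixel and a negative slope to a drying pixel, matching the sign convention used in the trend map.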

Fig. 9

Trend of Soil Moisture Changes in the Northeastern Black Soil Region from 2008 to 2023.