Background & Summary

Food security is strongly influenced by rapid population growth, ongoing social and economic development, and climate change1,2. The Loess Plateau in western China has abundant arable land and rich light and heat resources3, giving it considerable potential for agricultural production, and it is one of China’s key grain production hubs. As a typical dryland agricultural region, the Loess Plateau relies primarily on precipitation for agricultural water supply, but limited rainfall and high evaporation rates constrain agricultural development. Optimizing the cropping structure, improving water use efficiency, and boosting grain production capacity in dryland regions are essential for safeguarding national food security4. Accurately mapping the planting patterns of the Loess Plateau is therefore vital for promoting agricultural sustainability and ensuring food security5.

In recent years, freely available remote sensing data with global coverage and adequate spatial resolution have greatly strengthened the data basis for identifying crop types at national and global scales6,7,8. The main data sources currently used are the Moderate Resolution Imaging Spectroradiometer (MODIS; 250 m, 8-day), Landsat (30 m, 16-day), and Sentinel-2 (10 m, 5-day). China’s household contract responsibility system has resulted in highly fragmented arable land on which farmers choose their own crops, leading to a marked diversity in crop types. The spatial resolution of MODIS data is too coarse to identify crop types in such fragmented areas9. Furthermore, given the short crop growing season, the 16-day revisit interval of Landsat cannot provide consistent and stable crop phenology characteristics10. Sentinel-2 provides higher spatial and temporal resolution as well as more spectral bands, making it the most suitable free satellite data source for large-scale crop mapping. For example, You et al.8 used Sentinel-2 data to produce 10-m crop type maps of Northeast China for 2017–2019, while Li et al.11 used Sentinel-2 data to map annual 10-m maize cropland changes in China during 2017–2021.

Remote sensing-based crop mapping methods can be categorized as phenology-based and machine learning (ML)-based approaches. Phenology-based methods capture the distinct phenology of different crop species through the characteristics of their time-series growth curves12. This approach requires representative crop sample data and sufficient satellite observations to characterize crop phenology. However, satellite data with high spatiotemporal resolution are susceptible to weather conditions, and the intraclass variability and interclass similarity of spectral and temporal features across large regions and multiple years reduce crop classification accuracy. ML-based methods use machine learning models to classify crops from remote sensing data. Popular classifiers include Support Vector Machine (SVM)13, Random Forest (RF)8, and Neural Network (NN)14. The advantage of machine learning lies in its ability to jointly consider diverse features such as spectral reflectance, vegetation indices, and temporal and textural features, enabling rapid and automatic discovery of crop-specific characteristics. However, machine learning methods also have limitations for crop mapping over large areas. Firstly, reliable sample data are scarce over large areas, and the quantity and quality of sample data determine the accuracy of crop identification. Secondly, the high dimensionality of the feature variables used in classification increases computational complexity and reduces classifier efficiency. Several studies have integrated phenology with machine learning to address these challenges. Belgiu et al.15 used Dynamic Time Warping, a phenology-based time-series similarity measure, to create crop sample data for machine learning algorithms. Zhao et al.16 and Yin et al.17 selected features based on phenological information to establish an optimal feature subset, maintaining classification accuracy while reducing computational costs. Combining phenology with machine learning is therefore a promising way to map crop types over large areas.

Accurate and up-to-date data on cropping intensity and crop rotation are essential for advancing sustainable agricultural intensification, mitigating negative agricultural impacts, and ensuring food security. However, reliable, up-to-date maps with detailed information on cropping intensity and crop rotations are lacking for the Loess Plateau. Therefore, the objective of this study is to produce a 10 m resolution crop planting pattern dataset for the Loess Plateau for 2018–2022. Our contributions encompass four main aspects: (1) enhancement of the sample dataset based on phenological indices and the DTW algorithm; (2) discrimination of cropping intensity based on crop phenological curves; (3) development of independent random forest classifiers for the individual agricultural climate zones; (4) construction of an optimal feature subset for crop classification. The results of this study can provide important information for the management of agricultural ecosystems on the Loess Plateau.

Methods

Study area

The Loess Plateau is situated in the middle and upper reaches of the Yellow River Basin (https://www.geodata.cn/), covering an area of 6.22 × 10⁵ km² and spanning 33.69° to 41.28°N latitude and 100.86° to 114.56°E longitude (as shown in Fig. 1). It is characterized by a transitional climate from semi-humid to semi-arid, representing a typical continental monsoon climate with uneven spatial and temporal distribution of precipitation. Rainfall in the southeastern region exceeds 600 mm/year, while the northwestern arid region receives only 150–250 mm/year (ref. 18). The region experiences high interannual variability in rainfall, with dry winters and springs and wet autumns. Water scarcity and soil erosion are significant constraints on the agricultural development and ecological restoration of the area. The Loess Plateau has historically been among China’s vital agricultural zones19, with cultivated land covering more than 20% of the total area, predominantly reliant on rain-fed agriculture20. The major crops are wheat, maize, rapeseed, potatoes, and soybean.

Fig. 1 The location of the Loess Plateau (a), municipal boundary (b), county boundary (c).

The Loess Plateau encompasses 7 provinces, 44 cities and 332 counties (https://www.ngcc.cn/). Because rainfall, terrain, and temperature cause spatial differences in planting phenology, crop planting patterns were mapped separately for each agricultural region21,22. According to the “2019 National Cultivated Land Quality Grading Report” (https://www.gov.cn/), the study area spans six agricultural regions (as shown in Fig. 1): the Jindong-Yuxi Hilly Mountain Agriculture, Forestry and Pastoral Zone (JAFP), Fenwei Valley Agricultural Zone (FVA), Jin-Shaan-Gan Loess Hills Gullies Pastoral and Forest Zone (LPF), Longzhong-Qingdong Hilly Agriculture and Pastoral Zone (LAP), Agriculture and Pastoral Zone along the Great Wall (GAP), and Meng-Ning-Gan Agriculture and Pastoral Zone (MAP). In general, crop planting patterns are similar within each agricultural region. The phenological characteristics of the 8 main crops of the Loess Plateau were obtained from fieldwork and from phenology calendar data3, as shown in Fig. 2.

Fig. 2 Crop calendar of the 8 major crops in the Loess Plateau.

Overview of the crop classification method

Figure 3 illustrates the workflow of this paper. Firstly, we pre-processed the Sentinel-2 data from 2018 to 2022 and extracted cropland based on the FROM_GLC10 product. Secondly, we collected sample points through field surveys and visual interpretation, and used the DTW algorithm to expand and enhance the crop sample points. Thirdly, we derived the cropping intensity using the crop phenological curve method. Fourthly, we used the optimal crop features as inputs to train the crop classifier based on the RF algorithm, and then produced crop pattern maps of the Loess Plateau at 10 m resolution from 2018 to 2022.

Fig. 3 Integrated workflow for crop classification on the Loess Plateau (2018–2022), encompassing: (a) Image pre-processing; (b) Sample collecting; (c) Cropping intensity; (d) Feature construction; (e) Random Forest classification on GEE.

Sentinel-2 images and pre-processing

Sentinel-2 data were obtained from the European Space Agency’s Copernicus Open Access Hub (https://scihub.copernicus.eu/). The Sentinel-2 Earth observation mission currently comprises two satellites, Sentinel-2A and Sentinel-2B, which were launched in 2015 and 2017, respectively. Together, the two satellites provide global coverage of the Earth’s surface every five days. Each satellite is equipped with a multispectral instrument capable of acquiring imagery across 13 spectral bands, spanning the visible, near-infrared, and shortwave infrared regions of the electromagnetic spectrum, with spatial resolutions of up to 10 meters.

Due to the unavailability of Level-2A surface reflectance (SR) products on the Google Earth Engine (GEE) platform for the Loess Plateau region prior to 2019 (ref. 8), Level-1C top-of-atmosphere (TOA) reflectance data were used for the year 2018. From 2019 to 2022, we used the more reliable Level-2A SR data as they became accessible on the platform.

In this study, five key spectral bands were selected, including Red-edge 1 (RE1), Red-edge 2 (RE2), Red-edge 3 (RE3), Short-Wave Infrared 1 (SWIR1), and Short-Wave Infrared 2 (SWIR2). The red-edge bands are important indicators of plant pigment concentration and vegetation health status, while SWIR bands are sensitive to water content and other biochemical components in crop leaves. Among these, RE2, SWIR1, and SWIR2 demonstrated significant discriminatory power for distinguishing between maize and soybean (You et al., 2021).

In addition, four complementary vegetation indices were derived from combinations of Sentinel-2 spectral bands: the Normalized Difference Vegetation Index (NDVI)23,24, Enhanced Vegetation Index (EVI)25, Normalized Difference Senescent Vegetation Index (NDSVI)26, and Green Chlorophyll Vegetation Index (GCVI)27. These indices were selected to better capture the phenological characteristics of diverse crops under complex cropping structures. NDVI and EVI have been widely used to extract crop phenological indicators, with numerous studies identifying NDVI as a dominant index for vegetation monitoring28. EVI is more effective in minimizing background soil noise, making it particularly suitable for arid and semi-arid regions where vegetation cover is sparse and bare soil is prevalent. It also helps mitigate NDVI’s saturation in areas with high biomass. NDSVI is useful for monitoring crop growth and identifying senescence stages such as leaf yellowing and wilting, which supports improved analysis of crop phenological cycles29. GCVI, on the other hand, reflects crop photosynthetic capacity and nutrient status, and is particularly effective in identifying soybean cultivation30. The formulas for NDVI, EVI, NDSVI and GCVI are given in Eqs. (1)–(4) as follows:

$${\rm{NDVI}}=\frac{{p}_{{nir}}-{p}_{{red}}}{{p}_{{nir}}+{p}_{{red}}}$$
(1)
$${\rm{EVI}}=2.5\frac{{p}_{{nir}}-{p}_{{red}}}{{p}_{{nir}}+6{p}_{{red}}-7.5{p}_{{blue}}+1}$$
(2)
$${\rm{NDSVI}}=\frac{{p}_{{swir}1}-{p}_{{red}}}{{p}_{{swir}1}+{p}_{{red}}}$$
(3)
$${\rm{GCVI}}=\frac{{p}_{{nir}}}{{p}_{{green}}}-1$$
(4)

where \({p}_{{nir}}\), \({p}_{{red}}\), \({p}_{{blue}}\), \({p}_{{green}}\), and \({p}_{{swir}1}\) are the reflectances of the near-infrared, red, blue, green, and shortwave infrared 1 bands of the Sentinel-2 images, respectively.
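As an illustration of Eqs. (1)–(4), the following minimal Python sketch evaluates the four indices for a single pixel. It is not the processing code used in this study, which ran on GEE; the function names and the reflectance values are hypothetical, and reflectance is assumed to be expressed on a 0–1 scale (the constant term in EVI presumes this scaling).

```python
def ndvi(nir, red):
    """Eq. (1): Normalized Difference Vegetation Index."""
    return (nir - red) / (nir + red)


def evi(nir, red, blue):
    """Eq. (2): Enhanced Vegetation Index (reflectance on a 0-1 scale)."""
    return 2.5 * (nir - red) / (nir + 6.0 * red - 7.5 * blue + 1.0)


def ndsvi(swir1, red):
    """Eq. (3): Normalized Difference Senescent Vegetation Index."""
    return (swir1 - red) / (swir1 + red)


def gcvi(nir, green):
    """Eq. (4): Green Chlorophyll Vegetation Index."""
    return nir / green - 1.0


# Hypothetical reflectance values for one densely vegetated pixel.
pixel = {"blue": 0.04, "green": 0.08, "red": 0.05, "nir": 0.45, "swir1": 0.20}

indices = {
    "NDVI": ndvi(pixel["nir"], pixel["red"]),
    "EVI": evi(pixel["nir"], pixel["red"], pixel["blue"]),
    "NDSVI": ndsvi(pixel["swir1"], pixel["red"]),
    "GCVI": gcvi(pixel["nir"], pixel["green"]),
}
```

Because GCVI is an unbounded ratio while the other three indices are normalized, its values can be several times larger than NDVI for the same pixel.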

Cropland mask

Wang et al.31 conducted a comparative analysis of six global high-resolution land use/land cover products (WorldCover10, FROM_GLC10, ESRI GLC10, FROM_GLC30, GLC_FCS30, and GlobeLand30), and found that FROM_GLC10 exhibited relatively high accuracy over China, with an overall accuracy of 65.57%, particularly in the classification of croplands. Similarly, Bie et al.32 evaluated three major global land cover products (FROM_GLC10, ESA World Cover, and ESRI Land Cover) over northwestern China and reported that FROM_GLC10 achieved the highest overall accuracy of 77.83%.

Considering that the Loess Plateau is one of China’s primary agricultural production regions, and that cropland areas did not undergo significant large-scale expansion or reduction between 2018 and 2022, this study focused on the classification of specific crop types—including maize, wheat, soybean, rapeseed, and potatoes—rather than distinguishing between cropland and non-cropland. To exclude non-cropland areas, we employed the FROM_GLC10 dataset, which has been shown to perform well in northwestern China.

The FROM_GLC10 product is a global land cover dataset at 10 m resolution for 2017, obtained from the Department of Earth System Science, Tsinghua University, China (https://data-starcloud.pcl.ac.cn/zh).

Sample data collection and augmentation

The sample data used in this study were derived from both field surveys and augmentation based on the Dynamic Time Warping (DTW) algorithm.

(i) Samples collected from field surveys. From April to October 2021, extensive field surveys were conducted across the main agricultural areas of the Loess Plateau, as shown in Fig. 1. A UniStrong G138BD handheld GNSS receiver with a horizontal accuracy of 2–5 meters was used to record the geographic coordinates of various crop types. After the field investigation was completed, all samples were visually inspected using high-resolution images from Google Earth to ensure quality and accuracy. The obtained sample points include: winter wheat, spring wheat, summer maize, spring maize, soybean, potatoes, winter rapeseed, spring rapeseed, and other cropland.

(ii) Sample preparation based on the phenology method. Based on the field-validated sample points, standard NDVI phenological curves were extracted for eight major crops: winter wheat, spring wheat, summer maize, spring maize, winter rapeseed, spring rapeseed, soybean, and potatoes. On the Google Earth Engine (GEE) platform, pixel-wise NDVI time series were compared with these reference phenological curves using the Dynamic Time Warping (DTW) algorithm to calculate temporal similarity. The resulting DTW distance rasters were then used to generate spatially distributed sample points through stratified random sampling within agricultural zones, thereby augmenting the training dataset with high-quality, geographically representative samples.

Sample point processing and Dynamic time warping

Before the sample points are augmented with the DTW algorithm, the collected sample points must be subdivided, because wheat in the study area comprises winter wheat and spring wheat, maize comprises summer maize and spring maize, and rapeseed comprises winter rape and spring rape, each with different planting and harvest times. Using Sentinel-2 images on the GEE platform, we calculated the NDVI of the collected wheat, maize, and rapeseed sample points during their growth periods and composited it into a 10-day NDVI time series. From these series, we obtained the NDVI value ranges of the different crop types in the study area and drew their standard NDVI curves. As shown in Fig. 4, the NDVI characteristics of the samples at different times were then used to separate the seasonal variants of each crop, yielding sample points of winter wheat, spring wheat, summer maize, spring maize, winter rape, and spring rape.

Fig. 4 Standard phenological curves of major crops on the Loess Plateau.

The Dynamic Time Warping (DTW) algorithm measures the similarity between two time series by computing a distance after aligning them non-linearly. Let the time series X = {x1, x2, …, xn} be the curve of an unknown pixel and the time series Y = {y1, y2, …, ym} be the standard curve of a known crop pixel (for example, maize), with lengths n and m, respectively. DTW flexibly warps and stretches X to align it with Y and measures their similarity using the Euclidean distance. Let dbase(i, j) denote the distance matrix obtained by calculating the Euclidean distance between every point xi in sequence X and every point yj in sequence Y. The DTW distance is then calculated as follows:

$${\rm{DTW}}({\rm{X}},{\rm{Y}})={\rm{Min}}\sqrt{\mathop{\sum }\limits_{{\rm{k}}=1}^{{\rm{K}}}{{\rm{D}}}_{{\rm{k}}}}$$
(5)

where Dk is the k-th element of the warping path, taken from dbase, K is the total number of elements in the warping path, and the minimum is taken over all admissible warping paths. The expanded crop sample distribution map based on the DTW algorithm is shown in Fig. 5.
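The DTW distance can be sketched with the classical dynamic-programming recursion. The example below is a minimal illustration rather than the GEE implementation used in this study: it assumes the squared difference between two NDVI values as the local cost, so that the square root of the accumulated cost behaves like a Euclidean distance, and the reference curves and the helper `closest_reference` are hypothetical.

```python
import math


def dtw_distance(x, y):
    """DTW distance between two 1-D series via dynamic programming."""
    n, m = len(x), len(y)
    inf = float("inf")
    # cost[i][j]: minimal accumulated cost of aligning x[:i] with y[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = (x[i - 1] - y[j - 1]) ** 2  # assumed local cost
            cost[i][j] = local + min(
                cost[i - 1][j],      # x[i-1] also matched to y[j-1] (repeat y)
                cost[i][j - 1],      # y[j-1] also matched to x[i-1] (repeat x)
                cost[i - 1][j - 1],  # advance both series
            )
    return math.sqrt(cost[n][m])


def closest_reference(series, references):
    """Return the label of the reference curve with the smallest DTW distance."""
    return min(references, key=lambda name: dtw_distance(series, references[name]))


# Hypothetical, strongly simplified reference NDVI curves.
references = {
    "winter_wheat": [0.3, 0.5, 0.8, 0.6, 0.3, 0.2],
    "spring_maize": [0.2, 0.2, 0.3, 0.6, 0.8, 0.5],
}
# A pixel whose curve is a time-stretched copy of the winter wheat curve.
pixel_series = [0.3, 0.5, 0.5, 0.8, 0.6, 0.3, 0.2]
label = closest_reference(pixel_series, references)
```

In this sketch the stretched pixel curve aligns perfectly with the winter wheat reference, which is the property that makes DTW tolerant of shifts in sowing and harvest dates between fields.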

Fig. 5 Crop sample distribution map based on DTW algorithm.

Cropping intensity mapping method

This study employed the framework proposed by Liu et al.33 to generate a 10 m resolution crop planting intensity dataset for the Loess Plateau, consisting of two main steps: (1) phenological phase identification and (2) cropping cycle detection.

In the first step, the Savitzky–Golay (SG) filter was applied to smooth the NDVI time series, from which the annual NDVI minima and maxima were derived. Transition points were then identified where NDVI crossed 50% of the seasonal amplitude34: an upward crossing (increasing NDVI) marks the mid-greenup stage and a downward crossing (decreasing NDVI) marks the mid-greendown stage. The period from mid-greenup to mid-greendown was defined as a growing phase, and the period from mid-greendown to the next mid-greenup as a non-growing phase.

For cropping cycle detection, the number of potential cropping cycles (\({N}_{{pc}}\)) was defined as the minimum of the number of mid-greenup points (\({N}_{{up}}\)) and the number of mid-greendown points (\({N}_{{down}}\)):

$${N}_{{pc}}=\min \left\{{N}_{{up}},{N}_{{down}}\right\}$$

To reduce false detections caused by NDVI anomalies, we followed Sakti and Takeuchi35 in setting a minimum crop growth duration of 48 days. Growth periods shorter than this threshold were classified as false cycles (\({N}_{{fc}}\)), and the actual cropping intensity (\({CI}\)) was calculated as:

$${CI}={N}_{{pc}}-{N}_{{fc}}$$

In this study, cropping intensity data for each year (2018–2022) were produced based on this method.
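The two steps can be sketched as follows for a single pixel. This is a simplified illustration under stated assumptions, not the production workflow: the NDVI series is assumed to be already smoothed (the SG filtering step is omitted), the threshold is placed at the annual minimum plus 50% of the amplitude, and the function name, the day-of-year sampling, and the NDVI values are hypothetical.

```python
def cropping_intensity(days, ndvi, min_duration=48):
    """Count cropping cycles in one smoothed annual NDVI series."""
    low, high = min(ndvi), max(ndvi)
    threshold = low + 0.5 * (high - low)

    # Chronological list of threshold crossings.
    events = []
    for k in range(1, len(ndvi)):
        prev, cur = ndvi[k - 1], ndvi[k]
        if prev < threshold <= cur:
            events.append((days[k], "up"))    # mid-greenup
        elif prev >= threshold > cur:
            events.append((days[k], "down"))  # mid-greendown

    n_up = sum(1 for _, kind in events if kind == "up")
    n_down = len(events) - n_up
    n_pc = min(n_up, n_down)  # potential cycles

    # A growing phase shorter than min_duration days is a false cycle.
    n_fc = 0
    last_up = None
    for day, kind in events:
        if kind == "up":
            last_up = day
        elif last_up is not None:
            if day - last_up < min_duration:
                n_fc += 1
            last_up = None
    return n_pc - n_fc


# Hypothetical 10-day NDVI series with two crops in one year.
days = list(range(5, 360, 10))
double_crop = [
    0.25, 0.25, 0.25, 0.25, 0.28, 0.30, 0.35, 0.45, 0.55, 0.65, 0.75, 0.80,
    0.80, 0.70, 0.55, 0.35, 0.25, 0.25, 0.35, 0.50, 0.65, 0.78, 0.82, 0.80,
    0.72, 0.60, 0.40, 0.28, 0.22, 0.25, 0.28, 0.30, 0.30, 0.28, 0.26, 0.25,
]
ci = cropping_intensity(days, double_crop)
```

Pairing each mid-greenup with the next mid-greendown is one possible reading of the growth-duration rule; other pairings, for example across the year boundary for winter crops, would need additional handling.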

Feature selection

Crops were distinguished according to their growth periods using spectral bands, vegetation indices, and texture features. First, we composited the remote sensing images acquired during the crop growth period into a 10-day time series and selected five spectral reflectance bands: RE1 (B5), RE2 (B6), RE3 (B7), SWIR1 (B11), and SWIR2 (B12). Secondly, the maximum, minimum, and mean values of the vegetation indices (NDVI, EVI, NDSVI, GCVI) over each time series were calculated as a further basis for classification. Finally, we used the gray-level co-occurrence matrix function to add image texture features, including the angular second moment (ASM), contrast, correlation, and entropy (ENT). These three groups of features together served as the input features for crop classification (as shown in Table 1).

Table 1 Features used for identifying the crop planting structure.
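To make the four texture measures concrete, the sketch below computes them from a gray-level co-occurrence matrix of a small quantized image patch. It is an illustrative pure-Python version, not the GEE function used in this study; the single horizontal pixel offset, the symmetric counting, the natural logarithm in the entropy, and the convention of returning a correlation of 0 for a constant patch are all assumptions, and the patch values are hypothetical.

```python
import math


def glcm(image, levels, dx=1, dy=0):
    """Symmetric, normalized co-occurrence matrix for one pixel offset."""
    counts = [[0.0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                a, b = image[r][c], image[r2][c2]
                counts[a][b] += 1.0
                counts[b][a] += 1.0  # count each pair in both directions
    total = sum(sum(row) for row in counts)
    return [[v / total for v in row] for row in counts]


def texture_features(p):
    """ASM, contrast, correlation and entropy of a normalized GLCM."""
    n = len(p)
    cells = [(i, j, p[i][j]) for i in range(n) for j in range(n)]
    asm = sum(v * v for _, _, v in cells)
    contrast = sum((i - j) ** 2 * v for i, j, v in cells)
    entropy = -sum(v * math.log(v) for _, _, v in cells if v > 0)
    # The matrix is symmetric, so row and column means/variances coincide.
    mean = sum(i * v for i, _, v in cells)
    var = sum((i - mean) ** 2 * v for i, _, v in cells)
    if var > 0:
        corr = sum((i - mean) * (j - mean) * v for i, j, v in cells) / var
    else:
        corr = 0.0  # assumed convention for a constant patch
    return {"ASM": asm, "Contrast": contrast, "Correlation": corr, "Entropy": entropy}


# Hypothetical 4 x 4 patch quantized to four gray levels.
patch = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]
features = texture_features(glcm(patch, levels=4))
```

Homogeneous fields give a high ASM and low entropy, whereas fragmented or mixed plots raise contrast and entropy, which is why these measures complement the spectral and temporal features.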

Random Forest algorithm

The random forest (RF) algorithm is an ensemble method that builds multiple randomized, weakly correlated decision trees (DT) to obtain an integrated classifier with better predictive performance36. A decision tree is a data structure in which each path from the root (the attribute that contributes the most to the final classification result) to a leaf node (the final classification result) represents a decision rule37. RF is robust to noise and outliers, and it mitigates the overfitting problem that affects single DTs. The GEE platform provides an RF implementation, which has been used to classify crop types8, detect wetland changes38, and detect forest fires39.

In this study, the RF algorithm was used to train the crop classifiers, and two parameters were set: (1) “numberOfTrees” (the number of binary CART trees used to construct the RF model) was set to 100; (2) “minLeafPopulation” (the minimum sample size required for leaf nodes) was set to 10. The remaining parameters were left at their GEE defaults.

Data Records

This dataset provides crop distribution maps of the Loess Plateau at 10 m resolution from 2018 to 2022 (Fig. 6), available in the figshare repository in GeoTIFF format40. The dataset is provided in the EPSG:4326 (WGS_1984) spatial reference system and covers 33.69° to 41.28°N latitude and 100.86° to 114.56°E longitude. The pixel values of the crop types are: 1: winter wheat; 2: spring wheat; 3: winter rape; 4: spring rape; 5: spring maize; 6: summer maize; 7: soybean; 8: potatoes; 9: single others; 10: winter wheat-summer maize; 11: winter rape-summer maize; 12: winter wheat-others; 13: winter rape-others; 14: others-summer maize; 15: other double-cropping patterns. These data can be visualized and analyzed in ArcGIS, QGIS or other similar software.
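For users who process the rasters programmatically, the class codes listed above can be held in a simple lookup table. The sketch below is only a convenience example: the helper names are hypothetical, the pixel values are made up, and treating any value outside 1–15 (such as 0) as "not a listed class" is an assumption, since the no-data value is not specified here.

```python
from collections import Counter

# Pixel value -> planting pattern, as listed in the dataset description.
CROP_CLASSES = {
    1: "winter wheat",
    2: "spring wheat",
    3: "winter rape",
    4: "spring rape",
    5: "spring maize",
    6: "summer maize",
    7: "soybean",
    8: "potatoes",
    9: "single others",
    10: "winter wheat-summer maize",
    11: "winter rape-summer maize",
    12: "winter wheat-others",
    13: "winter rape-others",
    14: "others-summer maize",
    15: "other double-cropping patterns",
}


def is_double_cropping(code):
    """Codes 10-15 describe two crops grown in the same year."""
    return 10 <= code <= 15


def class_histogram(pixel_values):
    """Count pixels per named class, skipping values outside the legend."""
    counts = Counter(v for v in pixel_values if v in CROP_CLASSES)
    return {CROP_CLASSES[code]: n for code, n in sorted(counts.items())}


# Hypothetical pixel values read from one of the GeoTIFF maps.
sample_pixels = [1, 1, 10, 6, 10, 0, 15]
summary = class_histogram(sample_pixels)
```

Because the maps are stored in geographic coordinates (EPSG:4326), pixel counts should be converted to areas with a latitude-dependent pixel size or after reprojection to an equal-area system.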

Fig. 6 Map of 10 m crops in the Loess Plateau from 2018 to 2022.

Technical Validation

First, we compared the classification results with ground-truth field data. We reserved 30% of the sample points as validation samples to generate a confusion matrix and evaluate the classification accuracy of the eight crops. The overall accuracy ranged from 81.08% to 84.54%, and the Kappa coefficient ranged from 78.35% to 82.26%. Secondly, we compared our crop data with the data from the municipal agricultural census. As can be seen from Fig. 7, the coefficients of determination (R2) between our results for the main crops of the Loess Plateau, maize and wheat, and the data in the statistical yearbook are relatively high, greater than 0.83. The R2 values of soybean, potatoes, and rapeseed are greater than 0.61. These findings show that our crop products are consistent with the records in the statistical yearbook. However, the cropland mask we used (the FROM_GLC10 product) is itself a source of error, with an overall accuracy of 72.76%. Although its overall accuracy and Kappa coefficient are higher than those of other products, its errors still reduce the accuracy of the crop maps. In the future, we plan to use all crop sample points to create a set of up-to-date, detailed cropland maps of China based on current cropland data.
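The two accuracy measures reported above follow directly from the confusion matrix. The sketch below shows the standard definitions of overall accuracy and Cohen's Kappa; the two-class matrix is a made-up example, not data from this study, and rows are assumed to hold reference labels and columns mapped labels.

```python
def overall_accuracy(matrix):
    """Share of validation samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


def cohen_kappa(matrix):
    """Agreement beyond chance: (po - pe) / (1 - pe)."""
    size = len(matrix)
    total = sum(sum(row) for row in matrix)
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(matrix[i][j] for i in range(size)) for j in range(size)]
    po = overall_accuracy(matrix)
    pe = sum(row_sums[k] * col_sums[k] for k in range(size)) / (total * total)
    return (po - pe) / (1.0 - pe)


# Hypothetical confusion matrix: rows = reference class, columns = mapped class.
confusion = [
    [45, 5],
    [10, 40],
]
oa = overall_accuracy(confusion)
kappa = cohen_kappa(confusion)
```

Kappa is lower than the overall accuracy whenever some agreement would be expected by chance, which is why both values are usually reported together.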

Fig. 7 Verification of the mapped results against the municipal statistical yearbook.

Usage Notes

The Loess Plateau is one of China’s important grain production areas, and information on its crop planting area is of great significance to regional food security and agricultural development. In our study, we expanded field survey samples with the DTW algorithm, proposed a large-scale multi-category crop recognition model based on random forests, and provided 10 m resolution crop distribution maps of the Loess Plateau from 2018 to 2022. These high-resolution crop maps can be used, for example, to predict agricultural productivity and to evaluate the water suitability of dryland farmland. Timely, high-resolution crop data are of great significance to the study of regional food balance and food security.