Background & Summary

As one of the largest food producers in the world, China occupies only 9% of global cropland but produces 25% of the world’s food to feed approximately 20% of the global population1. However, China’s crop cultivation has experienced a significant transformation since the initiation of rural reforms2. The cropland area has expanded from 169 million hectares to 172 million hectares from 1986 to 20213. In particular, change in the annual crop mix has also occurred. The maize cultivation area has increased by 105% over the past thirty years, while the area of rice and wheat has decreased by 11% and 19%, respectively4. Moreover, the increase in cropland and fertilizer application has resulted in severe environmental challenges in China, such as soil degradation, biodiversity loss, and non-point source pollution5,6,7. Therefore, revealing the spatially explicit evolution of crop patterns in China is critical for supporting food security strategies and assessing the impact of agricultural practices on the environment.

The geographical distribution of cropland in China has changed substantially over the past three decades, leading to significant transformations in crop patterns. The area of cropland in Southeastern China has declined by 12% since 1990, particularly in the Yangtze River Delta and Pearl River Delta regions, while cropland reclamation in Northeastern China has increased by 15%8. Meanwhile, the proportion of grain production in Northern China (56%) has surpassed that in Southern China, indicating a shift in the center of grain production from south to north9,10. In particular, grain production in Southeastern coastal provinces, such as Guangdong, Jiangsu, and Zhejiang, has declined dramatically11. In contrast, the grain output in the Northeastern provinces, including Heilongjiang and Jilin, has increased from 12% to 25% of national grain production12. However, these changes are primarily analyzed using administrative-level statistics, which overlook spatial heterogeneity within statistical units13,14. Spatially explicit maps with high resolution that illustrate the evolution of crop patterns in China over an extended period are yet to be developed.

Several studies have attempted to depict crop patterns in China using high-resolution Earth observation imagery. The distribution of the three main crops (maize, rice, and wheat) of China in the past two decades has been mapped at the resolutions of 1 km, 500 m, and 30 m in existing works based on satellite images from the MODIS (Moderate Resolution Imaging Spectroradiometer) and Landsat project15,16,17. While crop maps with higher spatial resolutions (10 m) are now available, the high-resolution imagery from the Sentinel-2 mission only offers coverage for periods starting from 201618,19,20. Although there have been ongoing advancements in the spatial and temporal resolutions of remote sensing imagery, the acquisition of free or low-cost remote sensing data remains challenging21,22. Furthermore, since crop recognition and classification rely on substantial training samples23,24, most studies have focused on the spatial distribution of individual grain crops, while harvest area maps for a comprehensive set of crops currently remain absent.

To map the distribution of multiple crops, data fusion techniques have been applied to integrate statistical data with remote sensing observations. For instance, the M3 and MIRCA (Monthly IRrigated and rainfed Crop Area) datasets depict the distribution of over 100 crop types by fusing national crop statistics with the global cropland extent in 2000, which were updated to the CROPGRIDS dataset for 202025,26,27. The GAEZ (Global Agro-Ecological Zones) programs illustrate the patterns of crop natural growth suitability for 2000 and 2010, and have been extended to include data for 2015 in the GAEZ 2015 + dataset28. Detailed information about these datasets is provided in Table S1 in the Supplement. However, the spatial resolution of global crop datasets is generally coarse (5 arcminutes, ~9 km), and their temporal coverage is limited to discrete years after 2000, as illustrated in Fig. 1. The SPAM (Spatial Production Allocation Model) datasets provide crop grids in China from 2000 to 2020 at 5-year intervals, nevertheless, the input data, crop types and methodologies are different for each period, which limits its application in time-series analysis applications29,30. Therefore, it remains a key task to develop high spatial resolution maps for multiple crops in China that are consistent with long-term statistical data.

Fig. 1
figure 1

Crop spatial distribution datasets. The dashed border rectangles indicate datasets from data fusion and the solid border rectangles indicate datasets from remote sensing observations. The x-axis represents the annual periods, the left y-axis represents the crop types included in the dataset, and the right y-axis represents the spatial resolution of the dataset.

The objective of this study is to map spatiotemporal changes in multi-crop harvested areas across China at a 1-km resolution from 1990 to 2020 at five-year intervals. We apply the maximum-score optimization approach to downscale prefecture-level statistics to grid cells, which is an extension of the minimum cross-entropy approach originally used in the SPAM datasets31,32. The cross-entropy framework has demonstrated high accuracy in mapping crop distributions at 5 arcminutes resolution globally, as well as for Sub-Saharan Africa and China33,34,35,36,37. Nevertheless, since this framework was originally developed for 5 arcminute grid allocation, it exhibits difficulty in converging and reduced allocation accuracy when applied to finer spatial resolutions38,39. To address these limitations, the maximum-score optimization approach has proven capable for generating 30 arcseconds maps and avoids the fragmented allocations by the cross-entropy method40.

The development of the dataset involves the following steps: firstly, this study involves the calculation and compilation of statistical data for multiple annual crops at the prefecture level. Subsequently, the spatial distribution of these crops is created by allocating the statistical records to corresponding grids. To enhance the accuracy and rationality of crop distribution over time, multi-source land use data, such as cropland and urban boundaries, temporally aligned with the statistical records, are integrated. Finally, the reliability of the dataset is validated through comparisons with county-level statistical data and existing gridded datasets, utilizing both quantitative metrics and spatial pattern analyses. The built dataset could provide critical input to biogeochemical models and dynamic agricultural models to support land use management and the sustainable transformation of agricultural production.

Methods

Data collection

Statistical data

This study collected agricultural statistical data at the national, provincial, prefectural, and county levels in China, covering the period from 1988 to 2022. The data sources include the statistical yearbooks from the National Bureau of Statistics of China (https://www.stats.gov.cn) and the Big Data Research Platform for China’s Economy and Society (https://data.cnki.net). Moreover, historical crop price records were obtained from the FAO (Food and Agriculture Organization of the United Nations) (https://www.fao.org/faostat)41. To evaluate irrigation and agricultural input levels, crop fertilizer application ratios were derived from the inventories of “Compilation of the National Agricultural Costs and Returns”42. Recommended fertilizer application ratios for different crops were obtained from He et al.43. The proportion of irrigated land was obtained from the third simulation round of the Inter-Sectoral Impact Model Intercomparison Project (ISIMIP3a) dataset44, which provides data on rainfed and irrigated cropland for crops from 1901 to 2021.

Spatial data

Several spatial datasets relevant to crop distribution, including land use, crop natural growth suitability, urban boundaries, and population density, have been used to facilitate the mapping of crop area. The land use data were employed to delineate the extent of cropland and irrigated cropland, including the China Land Cover Dataset (CLCD)45, the China Land Use and Land Cover Remote Sensing Monitoring Dataset (CNLUCC)46, the Global Land Cover Fine-Grained Classification Product (GLC_FCSD)47, the China Arable Land Dataset (CACD)3, and the China Irrigated Cropland Dataset48. The crop natural growth suitability and potential yield distribution were derived from the GAEZ v4 dataset49. The population density data originated from the China Population Spatial Distribution Dataset50. The urban extent data were obtained from the China Urban Boundary Dataset (GAIA)51. Additionally, the transport accessibility data were obtained from Weiss et al.52. All spatial datasets were rasterized and resampled to a 1-km resolution grid. The details of these datasets are provided in Table 1.

Table 1 Input data utilized in this study.

Modeling framework

The research framework is illustrated in Fig. 2, which consists of four key steps: pre-processing statistical and spatial data, allocating statistical harvest area data, validating the accuracy of the results, and analyzing spatiotemporal changes of crop patterns. First, national, provincial, and prefectural statistical data were harmonized and downscaled to grids, while county-level data were used to validate the accuracy of the mapping results. Then, crop cultivation was categorized into three different production systems, i.e., irrigated, high-input rainfed, and low-input rainfed based on crop irrigation and fertilizer application ratios. Multiple land use datasets were then integrated to develop a synthetic map of cropland and irrigated cropland. Meanwhile, pre-processing was performed on spatial datasets such as city boundaries, population distribution, and transportation accessibility (Figs. S1S3). Then, a spatial allocation framework was used to downscale the prefecture-level crop area statistics into 1-km grids. Moreover, the accuracy of the dataset was evaluated in terms of both validation metrics and spatial patterns. Finally, changes in the spatiotemporal heterogeneity of crop distribution were analyzed.

Fig. 2
figure 2

Research framework.

Harmonization of statistical data

This study collected harvest area statistics for 16 different crop types across four different political administration levels. The detailed list of crop names is provided in Table 2. The time coverage of this built dataset spans from 1990 to 2020. We endeavored to collect crop statistics from publicly accessible sources, generally ensuring data completeness for major production regions (Fig. S4). However, due to the extended temporal coverage and reliance solely on public data, many regions inevitably have incomplete time series. To ensure data availability and methodological consistency, we followed previous studies by representing each time point with a 5-year average, using data from the two years before and after each time point26,29,53,54. For instance, statistical values from 1988 to 1992 were utilized to map the crop patterns in 1990, while those from 2018 to 2022 were employed to map the crop patterns in 2020. Besides, the linear interpolation was used to fill the missing values. Due to the lack of prefecture-level statistical data for Xinjiang, Qinghai, and Xizang, spatial allocations for these provinces were based on provincial-level statistical data.

Table 2 Crop names of this study in different datasets.

In order to ensure the consistency of statistical data across different administrative levels, this study used national-level statistical data as a benchmark and successively corrected for each province and prefecture. First, the crop harvest area for each province \(({{HA}}_{j,t,p}^{{stat}})\) was adjusted according to Eq. (1), based on the national total harvest area \(({{HA}}_{j,t,{national}}^{{stat}})\), resulting in the revised provincial harvest area denoted as \({{HA}}_{j,t,p}^{{revise}}\):

$${{HA}}_{j,t,p}^{{revise}}={{HA}}_{j,t,p}^{{stat}}\times \frac{{{HA}}_{j,t,{national}}^{{stat}}}{\sum _{p}{{HA}}_{j,t,p}^{{stat}}},$$
(1)

where the j denotes the crop type, t represents the time node, and p indicates the provincial-level administrations. The sum of the revised area at the provincial level is highly in agreement with the national statistics, as shown in Fig. S5.

Based on the revised provincial-level data, a similar approach is applied to adjust the harvest area for the prefecture-level city \(({{HA}}_{j,c}^{{stat}})\) as Eq. (2):

$${{HA}}_{j,t,c}^{{revise}}={{HA}}_{j,t,c}^{{stat}}\times \frac{{{HA}}_{j,t,p}^{{revise}}}{\sum _{c}{{HA}}_{j,t,c}^{{stat}}},$$
(2)

where the \(c\) represents the prefecture-level city with province \(p\). The sum of the harmonized data at the prefecture level is equal to the revised provincial values, as shown in Fig. S6. Therefore, the total harmonized harvest area of all provinces and prefectures equals the national statistics.

Classification of crop production systems

Based on the ratios of irrigation and fertilizer application, each crop cultivation was categorized into three production systems: irrigated, high-input rainfed, and low-input rainfed. The proportion of irrigated crop for each 5-year node was extracted from the ISIMIP3a dataset44. The proportions of high-input and low-input systems are determined based on actual crop fertilizer application statistics compared with the recommended amounts. The shares of irrigated system for crops \(({H{A}_{{Ratio}}}_{j,t,p}^{{irrigat}{ion}})\) are determined by averaging the gridded values at the prefecture level. Owing to the lack of crop fertilizer application data at the prefecture level, this study assumed that all prefectures within the same province share the same average fertilizer application ratios. The details of the crop production system classification are as follows.

First, the crop fertilizer application ratios of each province \(({{Fert}}_{j,t,p}^{{unit}})\) were obtained from the “Compilation of the National Agricultural Costs and Returns” as presented in Table S2. Then, the sum of fertilizer application for each crop \(({{Fert}}_{j,t,p}^{{Sum}})\) was calculated using Eq. (3). The average crop fertilizer applications for all provinces are shown in Table S3.

$${{Fert}}_{j,t,p}^{{Sum}}={{HA}}_{j,t,p}^{{revise}}\times {{Fert}}_{j,t,p}^{{unit}}$$
(3)

Then, the high-input production area of the crop \(({{HA}}_{j,t,p}^{{high}})\) is obtained from dividing the total fertilizer application by the recommended fertilizer application ratios for the crop \(({{Fert}}_{j,p}^{{recommend}})\), as shown in Eq. (4):

$${{HA}}_{j,t,p}^{{high}}=\frac{{{Fert}}_{j,t,p}^{{Sum}}}{{{Fert}}_{j,p}^{{recommend}}}.$$
(4)

The ratios of high-input system in the total area \(({H{A}_{{Ratio}}}_{j,t,p}^{{high}})\) are calculated following Eq. (5):

$${{HA}{\rm{\_}}{Ratio}}_{j,t,p}^{{high}}=\frac{{{HA}}_{j,t,p}^{{high}}}{{{HA}}_{j,t,p}^{{revise}}}.$$
(5)

Finally, the ratios of the low-input system \(({H{A}_{{Ratio}}}_{j,t,p}^{{low}})\) are calculated by subtracting the proportion of the irrigated and high-input area following Eq. (6). Tables S4S6 show the detailed proportions for each production system.

$${{HA}{\rm{\_}}{Ratio}}_{j,t,p}^{{low}}=1-{{HA}{\rm{\_}}{Ratio}}_{j,t,p}^{{irrigat}{ion}}-{{HA}{\rm{\_}}{Ratio}}_{j,t,p}^{{high}}$$
(6)

Synthesis of cropland

The cropland distribution from different sources was synthesized using the multi-source data integration approach proposed by Lu et al.55. The cropland extents were obtained from CACD, CLCD, GLC_FCSD, and CNLUCC datasets. The spatial resolutions of these datasets were resampled to 1 km, with temporal periods aligned to the time nodes of the statistics.

First, the average and maximum cropland areas for each grid were derived by spatially overlaying multiple datasets. Then, a ranking scoring table was developed to assess the credibility of the cropland distribution. As shown in Table S7, the agreement level reflects the degree of consensus among the datasets in classifying a grid cell as cropland. The rank value indicates the order of reliability across different data combinations. For example, an agreement level of 4 indicates that all datasets classify the grid cell as cropland; conversely, a value of 1 indicates that only one dataset identifies the cell as cropland. The ranking of the datasets was determined through existing data quality comparison studies and expert judgment. Given that the CACD was specifically developed for cropland mapping, it is considered to have the highest level of credibility56. Additionally, comparative studies on data accuracy indicate that the CLCD demonstrates higher overall accuracy when compared to both the GLC_FCSD and the CNLUCC datasets57,58. Finally, the synthesized average cropland area maps are illustrated in Fig. S7.

The approach for irrigated cropland synthesis is similar to cropland. The irrigated cropland maps were obtained from the China Irrigated Cropland Dataset, GLC_FCSD, and CNLUCC datasets. A scoring table for these irrigated cropland datasets is provided in Table S8. The synthesized maps of average irrigated cropland area for the period from 1990 to 2020 are shown in Fig. S8.

Spatial allocation of crop statistics

Prefecture-level statistics were allocated to 1-km grids using the maximum-score optimization approach developed by van Dijk et al.40. This method builds upon the minimum cross-entropy framework originally proposed by You and Wood (2005, 2006)31,32. The cross-entropy framework calculates the prior probability of crop distribution based on biophysical and socioeconomic factors, and then determines the actual probability under statistical constraints by minimizing cross-information entropy33. To enable multi-temporal, high-resolution mapping, the maximum plant suitability score was used in place of the minimum cross-entropy metric, resulting in improved spatial resolution and faster convergence. The main steps of the model are described below.

First, the crop harvest area of the prefecture \(({{HA}}_{j,t,c}^{{revise}})\) is converted into the physical area \(({{PA}}_{j,t,c})\) based on crop planting intensity \(({{CropIntensity}}_{j,t,c})\) as expressed in Eq. (7). The assumption is that there is no difference in cropping intensity between prefectures within the same province. The average crop planting intensities for different crops are shown in Table S9.

$${{PA}}_{j,t,c}=\frac{{{HA}}_{j,t,c}^{{revise}}}{{{CropIntensity}}_{j}}$$
(7)

Then, the crop plant suitability scores for different crop production systems \(l\) within the grid \(i\) are calculated. The irrigated and rainfed high-input systems are mainly located in large farms, and their score is influenced by both biophysical characteristics and socio-economic factors, including transport accessibility and crop prices. Therefore, their crop plant suitability score is presented as Eq. (8):

$${{score}}_{i,j,l}=\sqrt{{{access}}_{i}\times {{rev}}_{i,j,l}},l\in \left\{{irrigation},{rained\; high\; input}\right\},$$
(8)

where the \({{access}}_{i}\) represents the accessibility from the grid to cities, and where \({{rev}}_{{ijl}}\) represents the economic benefit of the crop, which is calculated via Eq. (9):

$${{rev}}_{i,j,l}={{pot}{\rm{\_}}{yield}}_{i,j,l}\times {{price}}_{j},$$
(9)

where the \({{pot\_yield}}_{i,j,l}\) is the potential yield of crop \(j\) in grid \(i\) under \(l\) system, which is obtained from the GAEZ v4 dataset. The changes in crop prices \(({{price}}_{j})\) are shown in Table S10.

For rained low-input system, its plant suitability score is constrained mainly by the natural environment, which is calculated following Eq. (10):

$${{score}}_{i,j,l}={s{uit}{\rm{\_}}{index}}_{i,j,l},l\in \left\{{rained\; low\; input}\right\}.$$
(10)

The \({s{uit}{\_index}}_{{ijl}}\) denotes the plant suitability of crop \(j\) in grid \(i\) under \(l\) system, which is derived from the GAEZ v4 dataset as well.

Finally, based on the plant suitability scores of crops, the objective function designed to maximize the plant suitability score is formulated as Eq. (11):

$$\max {\sum }_{i}{\sum }_{j}{\sum }_{l}{s}_{i,j,l}\times {{score}}_{i,j,l}.$$
(11)

The \({s}_{i,j,l}\) represents the ratio of the area allocated to crop \(j\) under system \(l\) to the total area of the grid, which is the target solution value of the model. Meanwhile, the subsequent constraints should be abided by during the model-solving process as indicated in Eqs. (1216).

  1. 1.

    The area allocated to crops within the grid must not exceed the total area of the grid.

    $$0\le {s}_{i,j,l}\le 1,\forall \,i\forall \,j\forall \,l$$
    (12)
  2. 2.

    The cumulative ratios of different production systems within the grid must sum to 1.

    $$\sum _{l}{s}_{i,j,l}=1,\forall \,i\forall \,j$$
    (13)
  3. 3.

    The crop area is equal to the statistical data.

    $$\sum _{l}{{PA}}_{j,c}\times {s}_{i,j,l}={{PA}}_{j,c},\forall \,i$$
    (14)
  4. 4.

    The crop area does not exceed the total cropland area within the prefecture (\({{cropland}{\rm{\_}}{area}}_{c}\)).

    $$\sum _{l}{{PA}}_{j,l,c}\times {s}_{i,j,l}\le {{cropland}{\rm{\_}}{area}}_{c},\forall \,i\in c$$
    (15)
  5. 5.

    The irrigated crop area does not exceed the total irrigated cropland area within the prefecture (\({{irrigation\_area}}_{c}\)).

$$\sum _{l={irrigation}}{{PA}}_{j,l,c}\times {s}_{i,j,l}\le {{irrigation}{\rm{\_}}{area}}_{c},\forall \,i$$
(16)

To address data inconsistencies between different types of constraints, the slack variables are incorporated into the objective function. Higher weights (106) are assigned to constraints relating to cropland availability and irrigated area to limit their slack, while lower weights (105) are assigned to constraints relating to subnational statistics. This configuration ensures solution feasibility and quantitatively clarifies trade-offs in constraint prioritization40,59. Furthermore, all weights, parameters, and implementation details have been made publicly available to enable rigorous validation and to ensure the robustness and reproducibility of the results.

The optimization model was implemented in the GAMS (General Algebraic Modeling System) using a linear programming framework, with the CPLEX (C-Simplex algorithm) solver employed to efficiently address large-scale spatial allocation problems. Data exchange between model components was managed by GDX (GAMS Data eXchange) format files. All data preprocessing and visualization procedures were performed in R60. To improve allocation accuracy for China-specific crop patterns, we enhanced the original mapspamc package38, thereby maintaining methodological consistency with global applications.

Validation metrics

The county-level statistics are used to evaluate the accuracy of the allocation results at the prefecture level. The scatter plot illustrates the relationship between the statistically derived harvest area (\({{HA}}^{{stat}}\)) and the allocated harvest area (\({{HA}}^{{alloc}}\)). Meanwhile, the Pearson correlation coefficient \((C{or})\), the coefficient of determination \(\left({R}^{2}\right)\), and the root mean square error \(({RMSE})\) is used to quantitatively evaluate the accuracy of the results. The calculation methods for the different accuracy verification metrics are presented in Eqs. (1719):

$${Cor}({{HA}}^{{alloc}},{{HA}}^{{stat}})=\frac{\mathop{\sum }\limits_{k=1}^{n}({{HA}}_{k}^{{stat}}-\bar{{{HA}}^{{stat}}})({{HA}}_{k}^{{alloc}}-\bar{{{HA}}^{{alloc}}})}{\sqrt{\mathop{\sum }\limits_{k=1}^{n}{({{HA}}_{k}^{{alloc}}-\bar{{{HA}}^{{alloc}}})}^{2}}\sqrt{\mathop{\sum }\limits_{k=1}^{n}{({{HA}}_{k}^{{stat}}-\bar{{{HA}}^{{stat}}})}^{2}}},$$
(17)
$${R}^{2}=1-\frac{\mathop{\sum }\limits_{k=1}^{n}{\left({{HA}}_{k}^{{alloc}}-{{HA}}_{k}^{{stat}}\right)}^{2}}{\mathop{\sum }\limits_{k=1}^{n}{\left({{HA}}_{k}^{{alloc}}-\bar{{{HA}}^{{alloc}}}\right)}^{2}},$$
(18)
$${RMSE}=\sqrt{\frac{1}{N}\mathop{\sum }\limits_{k=1}^{n}{\left({{HA}}_{k}^{{stat}}-{{HA}}_{k}^{{alloc}}\right)}^{2}}.$$
(19)

Analysis of crop distribution change

The Standard Deviation Ellipse (SDE) model was used to analyse the spatiotemporal evolution of crop area at 1-km resolution. The SDE effectively captures the directional trends of crop distribution and includes key elements such as the ellipse center, rotation angle, and semi-axes61. First, the center coordinates of the ellipse \(\left(\bar{x},\bar{y}\right)\) are calculated, weighted by the crop area in grid \(i\) across different years, as expressed in Eq. (20):

$$\left(\bar{x},\bar{y}\right)=\left[\left(\frac{\mathop{\sum }\limits_{i=1}^{n}{\omega }_{i}{x}_{i}}{\mathop{\sum }\limits_{i=1}^{n}{\omega }_{i}}\right),\left(\frac{\mathop{\sum }\limits_{i=1}^{n}{\omega }_{i}{y}_{i}}{\mathop{\sum }\limits_{i=1}^{n}{\omega }_{i}}\right)\right],$$
(20)

where the coordinates \(({x}_{i},{y}_{i})\) represent the location of grid \(i\) and the \({\omega }_{i}\) denotes the crop harvest area within that grid. Then, the rotation angle depicts the trend of crop distribution and is calculated following Eq. (21):

$$\tan \theta =\frac{\left(\mathop{\sum }\limits_{i=1}^{n}{\omega }_{i}^{2}{\widetilde{x}}_{i}^{2}-\mathop{\sum }\limits_{i=1}^{n}{\omega }_{i}^{2}{\widetilde{y}}_{i}^{2}\right)+\sqrt{{\left(\mathop{\sum }\limits_{i=1}^{n}{\omega }_{i}^{2}{\widetilde{x}}_{i}^{2}-\mathop{\sum }\limits_{i=1}^{n}{\omega }_{i}^{2}{\widetilde{y}}_{i}^{2}\right)}^{2}+4\,\mathop{\sum }\limits_{i=1}^{n}{\omega }_{i}^{2}{\widetilde{x}}_{i}^{2}{\widetilde{y}}_{i}^{2}}}{2\,\mathop{\sum }\limits_{i=1}^{n}{\omega }_{i}^{2}\widetilde{x}\widetilde{y}},$$
(21)

where \(\widetilde{x}\) and \(\widetilde{y}\) are the differences between the grid and center, indicating the displacements of the grid relative to the overall distribution. Finally, the standard deviations along the x-axis and y-axis are defined as the long semi-axis \(({\sigma }_{x})\) and short semi-axis \(({\sigma }_{y})\) as shown in Eqs. (22, 23):

$${\sigma }_{x}=\sqrt{\frac{\mathop{\sum }\limits_{i=1}^{n}{\left({\omega }_{i}{\widetilde{x}}_{i}\cos \theta -{\omega }_{i}{\widetilde{y}}_{i}\sin \theta \right)}^{2}}{\mathop{\sum }\limits_{i=1}^{n}{\omega }_{i}^{2}}},$$
(22)
$${\sigma }_{y}=\sqrt{\frac{\mathop{\sum }\limits_{i=1}^{n}{\left({\omega }_{i}{\widetilde{x}}_{i}\sin \theta -{\omega }_{i}{\widetilde{y}}_{i}\cos \theta \right)}^{2}}{\mathop{\sum }\limits_{i=1}^{n}{\omega }_{i}^{2}}}.$$
(23)

Spatiotemporal changes of crop harvest area in China

China’s crop structure from 1990 to 2020 is shown in Fig. 3. It is clear that maize, rice, and wheat are the three dominant crops, together accounting for more than 65% of the total harvest area of crops. Their area shares decreased from 70.6% in 1990 to 67.3% in 2005 and then gradually increased to 72.1%. Meanwhile, the area of maize surpassed wheat and rice in 2005 and 2010 to become the largest crop in China. In contrast, the shares of rice and wheat decreased by 5.0% and 7.7%, respectively. Furthermore, the shares of non-staple crops such as soybeans, rapeseed, and groundnuts remained relatively stable during this period. In contrast, the harvest area of millet and sorghum decreased significantly. Among economic crops, the cotton area decreased from 4.7% in 1990 to 2.2% in 2020, while the tea area increased steadily from 0.7% to 2.1%. In addition, local cultivation crops, including tobacco, sugarbeet, sesame, sunflower, and sugarcane, have remained relatively marginal and stable over time.

Fig. 3
figure 3

The temporal variation trends of crop harvest area in China. The height of the nodes represents the harvest area of crops, while the labelled text denotes the shares relative to the total harvest area for a certain year.

Figure 4 shows the spatiotemporal changes in harvest area for four crops (maize, rice, wheat, and soybean) from 1990 to 2020. The center of crop harvest is gradually shifting northward, with the Northeastern China emerging as a critical region for agricultural production. The maize cultivation is concentrated in the Northern and Northeastern China (Fig. 4a). There has been a steady increase in the area under maize cultivation since 1900, with a particularly accelerated pace after 2005. In particular, the most significant maize expansion occurred in the Northeastern China, resulting in a shift of the center by approximately 214 kilometers to the northeast. Rice cultivation is concentrated mainly in the middle and lower reaches of the Yangtze River, followed by the Northeastern China (Fig. 4b). Between 1990 and 2000, a significant decline was observed in the Zhejiang and Jiangsu Provinces. In contrast, there was a notable expansion in the Northeastern China as well, resulting in a shift of about 344 kilometers to the northeast. The wheat cultivation is clustered in northern China, including Henan, Hebei and Shandong Provinces (Fig. 4c). As wheat production has become increasingly concentrated in the Huang-Huai-Hai Plain, the center of wheat production has shifted only 44 kilometers. Soybean cultivation is mainly concentrated in the Northeastern and Northern China (Fig. 4d). Heilongjiang and Inner Mongolia Provinces have been constantly extending their soybean area, resulting in a shift of the center by about 354 km to the northeast. The spatial distribution of non-staple crops exhibits significant heterogeneity, yet their primary cultivation centers are spatiotemporally stable (Fig. S9). For instance, rapeseed is predominantly concentrated in Central China (e.g., Hubei and Hunan provinces), whereas peanuts, millet, sunflower seeds, sorghum, and sesame are mainly cultivated in northern regions (e.g., Henan, Hebei, and Inner Mongolia). Furthermore, the primary harvest of potatoes, tea, sugarcane, and tobacco occurs in the southwestern provinces, including Yunnan, Guizhou, and Guangxi. Moreover, the primary production regions for cotton and sugar beet have shifted from North China to Xinjiang province.

Fig. 4
figure 4

Spatiotemporal change of the crop harvest area in China. The distinct colors represent different years. The ellipses denote the standard deviation ellipses of crop harvest area distribution, while the dots indicate the centers of these ellipses. Profile graphs illustrating harvest area along both longitudinal and latitudinal directions are provided at the top and right side of the map. The background map is derived from the free open-source map dataset provided by Natural Earth (https://www.naturalearthdata.com).

Data Records

The constructed dataset offers comprehensive crop harvest area maps for 16 crop types in China at a 1-km spatial resolution from 1990 to 2020 with 5-year intervals, and it is accessible at Science Databank62. Unlike crop maps with pure pixels, the pixel value represents the crop harvest area (unit: hectare) at the sub-pixel level, which effectively mitigates the problem of mixed pixels and showcases the gradient variations in urban-rural transition zones. Additionally, the crop proportions of different agricultural production systems were generated based on the ratios of crop irrigation and fertilization obtained from statistical data.

The crop harvest area maps are stored in GeoTIFF format at a 1-km spatial resolution with the EPSG:4326 spatial reference. The various folders within the dataset contain raster data for different agricultural production systems, such as the Irrigated system (I), the Rainfed High-input system (H), the Rainfed Low-input system (L), and the total amount of the All system (A = I + H + L). The file name adheres to the format “ha_{*}_{**}_30 sec_{***}_CHN.tif”. Here, “ha” indicates the harvest area, “30sec” represents a spatial resolution of 30 arcseconds (approximately 1 km), “*” stands for the crop name, “**” denotes the production system, and “***” represents the time of the data.

Technical Validation

Accuracy assessment with county-level statistics

Figure 5 presents the validation of the spatial allocation results with county-level statistics. The three main food crops—maize, rice, and wheat—exhibit high accuracy, with correlation coefficients \((r)\) of 0.89, 0.87, and 0.90, and coefficient of determination \(({R}^{2})\) values of 0.80, 0.75, and 0.81, respectively. Among the industrial crops, the sugarcane, cotton, and sunflower show relatively considerable allocation accuracy with \(r\) values of 0.91, 0.78, and 0.76 and \({R}^{2}\) values are 0.84, 0.61, and 0.57, respectively. The allocation maps for oilseed crops such as soybeans, rapeseed, and groundnuts also show accuracy with \(r\) values exceeding 0.69 and \({R}^{2}\) values above 0.48. The \(r\) values of non-staple crops, such as sorghum, millet, potato, sesame, and sugarbeet, are approximately within the range of 0.41 to 0.60. Moreover, the \(r\) values for tobacco and tea are less than 0.40. Additionally, the results are highly consistent with the prefecture-level statistics, proving the reliability of the downscaling model, as shown in Fig. S10. Therefore, the constructed crop maps adequately depict the spatial pattern of crops, especially for primary grains and oilseed crops.

Fig. 5
figure 5

Validation at the county level. The subgraphs are arranged in descending order based on crop harvest area. The X-axis represents the county-level aggregation of allocated results, while the Y-axis represents the actual values from county-level statistical data. To enhance the clarity of the validation, a logarithmic transformation is performed for both axes. The validation metrics are displayed in the upper left corner of each subgraph, including the sample size \(\left(N\right)\), correlation coefficient \((r)\), coefficient of determination (\({R}^{2}\)), and root mean square error \(({RMSE})\). The count of legend represents the number of validation samples in one square block in the figure. The unit of legend (Count) represents the number of validation samples within each square block in the figure.

Comparison of spatial patterns with existing datasets

The constructed dataset was compared with four representative datasets for three main grain crops (maize, rice, and wheat) for the year 2010. These datasets include the Spatial Production Allocation Model dataset (SPAM, ~10 km), the China Crop Area dataset (CCA, 1 km), the China Cropping Patterns dataset (CCP, 500 m), and the China Crop Dataset (CCD, 30 m)16,18,55,63. The CCP and CCD data pixels were aggregated into a 1-km grid for comparison.

As shown in Fig. 6, the 1-km dataset from this study significantly upgrades the spatial detail of crop distribution compared to the 10 km resolution SPAM data. The CCA and CCP datasets provide homogeneous crop pixels, resulting in an overestimation of crop area in mixed pixels that contain both impervious surfaces and cropland (Fig. 6c1–c3,d1–d3). In contrast, the maps from this study comprise sub-pixel crop area attributes, which effectively reflect the gradient changes from urban to rural zones64. Therefore, the dataset built provides much more explicit details in estimating the area of crops in mixed pixels situated at urban fringes (Fig. 6a1–a3). Furthermore, the crop patterns of the constructed maps are similar to the CCD data (Fig. 6a3,e3), demonstrating that the integration of urban boundaries is necessary to improve the rationality of crop maps65.

Fig. 6
figure 6

Comparison of datasets at the grid level. The prefectures under consideration are located in Shijiazhuang for maize (38.08°N,114.82°E), Changsha for rice (28.28°N,112.51°E), Zhumadian for wheat (33.04°N,113.98°E), and the reference year is 2010. The first column is the high-resolution imaginary layers from the GEOVIS Earth (https://geovisearth.com/en). The second column shows the crop maps developed in this study. The third column presents the crop grids derived from the SPAM (Spatial Production Allocation Model) dataset. The fourth column illustrates the crop area pixels obtained from the CCA (China Crop Area) dataset. The fifth column displays the crop pixels from the CCP (China Cropping Patterns) dataset. The final column represents the crop distribution data sourced from the CCD (China Crop Dataset).

Comparison with SPAM datasets

The county-level validation metrics of the built dataset and the 10-km SPAM were compared as shown in Fig. 7. The spatial allocation accuracy of the three primary crops (maize, rice, and wheat) in this study was closely in line with the SPAM dataset, as manifested by the \(r\) and \({R}^{2}\) values. For tea, cotton, sugarcane, millet, and sorghum, the \(r\) and \({R}^{2}\) values of this study’s dataset surpass those of SPAM. However, the accuracies for soybean, rapeseed, groundnut, potato, tobacco, sunflower, sesame, and sugarbeet are lower than those recorded in SPAM. Owing to the existence of more fragmented pixels in the 1-km maps, the \({RMSE}\) values in this study are higher than those of SPAM. Therefore, the dataset generated in this study exhibits high accuracy in mapping the distribution of major grain crops. Nevertheless, it is also noted that the increase in resolution brings about a certain degree of uncertainty regarding the spatial distribution of non-major food crops66,67.

Fig. 7
figure 7

Accuracy comparison with SPAM datasets at the county level. The comparison samples include four temporal nodes: 2000, 2005, 2010, and 2020, given that SPAM is available in these years.

Limitations and prospects

The dataset developed in this study shows high accuracy in spatial allocation for domain food crops. However, its accuracy is comparatively lower for certain industrial and economic crops. This inconsistency is primarily due to the coarse resolution (10 km) of the GAEZ crop natural growth suitability data, which is not consistent with the high-resolution (30 m) input land use datasets. The resolution mismatch causes significant uncertainty in the spatial allocation of crops40. As the GAEZ dataset is designed for global agricultural assessments49, it may underestimate crop plant suitability in some regions, potentially causing uncertainty in crop mapping. Furthermore, the accuracy is also limited by the lack of complete prefecture-level statistical data. For instance, linear interpolation was used to estimate missing values in prefecture-level statistics, which oversimplified the variation trends in the harvest area22,68. Moreover, there is inherent uncertainty in the county-level statistical validation data27.

Therefore, the spatial distribution of crop plant suitability is the foundation for accurate crop spatial downscaling. Future research should focus on improving methods for assessing the plant suitability and potential yield of crops under changing climatic and anthropogenic conditions69,70. For instance, integrating the crop plant suitability assessment framework AEZ (Agro-Ecological Zoning) into the crop spatial allocation model could help improve the simulation accuracy71,72. Moreover, machine learning technologies provide a promising approach for assessing crop plant suitability. For example, the maximum entropy model has proven effective in evaluating the plant suitability of crops such as rice, wheat, and soybean39,69,73. Furthermore, incorporating deep learning-based crop recognition through satellite imagery into the data fusion framework will enhance the accuracy of the allocated maps and provide more detailed and reliable data support for agricultural monitoring and analysis74. Moreover, the advent of large language models introduces robust tools capable of extracting valuable agricultural insights from vast collections of historical and contemporary inventories75. Leveraging these tools will significantly advance the construction of multi-scale and long-term agricultural statistical databases (covering crop yields, fertilizer use, field management, etc.)76. This progress will facilitate data downscaling at finer statistical units, expand the variables of the crop maps, and promote their application in climate change adaptation and sustainable agricultural development.

Usage Notes

This study integrated multiscale agricultural statistical data and multisource spatial datasets to map the harvest area of 16 crops in China at a 1-km resolution from 1990 to 2020 with 5-year intervals. The prefecture-level crop statistics were downscaled to grids based on the integration of natural and economic factors affecting agricultural practices. The built dataset addresses the data gap in spatially explicit maps for multiple crops over a long time span in China. It can offer valuable insights for analyzing changes in China’s crop cultivation patterns and their impacts on environmental pollution and ecosystem services. Additionally, the dataset could be used as crucial inputs for land-surface ecosystem models (e.g., Lund-Potsdam-Jena managed Land model) and integrated assessment models (e.g., Global Change Analysis Model). Since the dataset aligns well with national, provincial, and prefectural statistics, it may help support further analysis in socioeconomic research.

In future research, a key challenge is to assess crop plant suitability scores at a high spatial resolution to improve the accuracy of crop mapping. Moreover, it is also crucial to integrate crop-specific maps derived from remote sensing products into the crop statistical data allocation framework. To further improve the usability and accuracy of these datasets, it is essential to collect statistical data over longer periods with continuous, complete time series and at finer spatial resolutions. With these potential improvements in the future, this dataset will become a dynamic and open-source program that can be continually updated to reflect changing crop patterns. This ongoing development will provide a solid foundation for future applications in agricultural modelling and land use management.