Background & Summary

Accurate estimation of crop yields is crucial for maintaining agricultural stability and ensuring national food security, especially for staple crops such as maize, rice and soybean1. These crops play a critical role in meeting the basic needs of billions of people worldwide. However, global climate change has brought unprecedented challenges to food security, including extreme weather events, water scarcity, and land degradation. Concurrently, external factors like geopolitical tensions, the COVID-19 pandemic, and economic volatility further complicate the global agricultural and food security landscape. Given these dynamic challenges, monitoring crop growth and estimating crop yields are vital for making informed import and export decisions, efficiently allocating agricultural resources, and formulating robust national food security strategies2. Crop yield estimates not only play a pivotal role in accurately forecasting food production and assessing essential supply need, but also foster economic stability and drive agricultural progress3.

In recent years, significant progress has been made in crop yield estimation, especially through remote sensing technology4. This has enabled the development of crop yield datasets at various spatial resolutions (Table 1), such as 550 km5, 125 km6, 43 km7, 10 km8,9, 4 km10,11, 1 km12, 500 m13 and 30 m14, which together constitute an important research foundation in this field. However, although these existing global and regional scale crop yield datasets provide valuable scientific basis for effectively monitoring and analyzing the dynamic changes of agricultural productivity at the provincial, river basin and even county levels, these datasets are mainly generated by spatial allocation or downscaling of statistical yield data of national or provincial administrative units14. This downscaling approach, which relies on top-level aggregated statistics of crop production or yield, may fail to accurately capture the true fluctuations and driving factors of field-level crop yields in regions with high spatial heterogeneity. This challenge is particularly acute in China, where a decentralized farming model, fragmented landholdings, diverse cropping systems, and varying management practices are the norm10,11,12,13,14. The main reason is that the analyses with coarse resolution data contain a mixed pixel effect, blurring and interfering with the unique yield signal characteristics of small-scale plots.

Table 1 Summary of studies on crop yield datasets.

Methodologies for crop yield estimation are broadly categorized into four main approaches: traditional methods, process-based models, spatial statistical and artificial intelligence (AI) models, and satellite-based crop photosynthesis-yield models. Traditional methods, based on resource-intensive field surveys, provide high accuracy at the cost of significant labor and financial investment, limiting their scalability15. Process-based models, such as Decision Support System for Agrotechnology Transfer (DSSAT), use climate and soil data as inputs to simulate crop growth and estimate grain yields16. By modeling crop development in relation to environmental factors, they provide a comprehensive approach to yield prediction17. However, these models often require detailed, localized input data, which can be difficult to obtain for large-scale applications18. In contrast, spatial statistical and AI models, which include those utilizing remote sensing data (such as vegetation indices) and crop yield data for training, focus on identifying spatial patterns to predict yields19. Machine learning models have been widely used to enhance prediction accuracy20. Nevertheless, they depend heavily on large training datasets and high computational demands, which may limit their application in real-time field settings20. Additionally, these models face challenges related to generalizability and environmental heterogeneity, especially when downscaling estimates to smaller areas21. Satellite-based crop photosynthesis-yield models, merging the strengths of the aforementioned approaches, enhance prediction reliability by simulating dry matter dynamics22,23.

Currently, the satellite-based crop photosynthesis-yield models such as the Carnegie Ames Stanford Approach (CASA)24, Photosynthesis (PSN)25, and Vegetation Photosynthesis Model (VPM)26 have played pivotal roles in estimating terrestrial ecosystem Gross Primary Productivity (GPP) and Net Primary Productivity (NPP) through remote sensing methods27. Among these, the VPM is widely used worldwide for its integration of the Enhanced Vegetation Index (EVI) and Land Surface Water Index (LSWI), developed using satellite remote sensing and flux observation data28, especially in estimating C3 and C4 cropland yields29,30,31. However, to estimate crop yield, the yield estimated by NPP from VPM combined with other constant parameters requires field measurements of parameters like maximum light energy utilization, harvest index, and plant carbon content. The acquisition of this data not only requires calculations from multiple sample sources but is also susceptible to various factors, including theories related to crop growth conditions32, which hinder low-cost and rapid estimation of crop yields33.

To address this challenge, we developed a method (named yield_\({{\rm{A}}{\rm{P}}{\rm{A}}{\rm{R}}}_{{{\rm{\varepsilon }}}_{g}}\,\& \,a\)) of yield estimated by the dynamic observation index according to the model principle of VPM (\({{\rm{APAR}}}_{{\varepsilon }_{g}}\)) combined with transition coefficient (a) to estimate the spatial distribution of crop yields with 10 m spatial resolution in Northeast China from 2016 to 2021. As a major grain-producing region, Northeast China contributes more than 20% of China’s total grain production annually. Maize, rice, and soybean are the primary crops in this region34. Specifically, combining meteorological information and remotely sensed images generated by the Sentinel-2 satellite, we improve the previous method (combining NPP from VPM with some empirical parameters, yield_NPP&EP) by calculating and linearly regressing them against the yield statistics, developing a yield estimation method (yield_\({{\rm{A}}{\rm{P}}{\rm{A}}{\rm{R}}}_{{{\varepsilon }}_{g}}\,\& \,a\)) that does not require direct field measurements, while ensuring high accuracy and cost-effectiveness. Finally, the accuracy of the staple crop yield dataset of Northeast China was assessed by using both field observation data and official crop yield statistics. Our dataset provides a detailed description of yield patterns for maize, rice, and soybean in Northeast China, thereby supporting strategic agricultural planning and precise management. A key advantage is the ability to reveal subtle field-management differences that regional analyses miss, providing vital support for precision services such as agricultural insurance underwriting and yield-efficiency evaluations for smallholders.

Methods

Study area

The study area, located in Northeast China, encompasses Heilongjiang (HLJ), Jilin (JL), and Liaoning (LN) provinces (38°71′-53°44′N; 118°52′-134°17′E) (Fig. 1). This region features a cold-temperate continental monsoon climate, characterized by an average annual temperature ranging from −3.8 °C to 11.3 °C and annual precipitation between 298 mm and 880 mm. Northeast China experiences four distinct seasons, with dry, cold winters and rainy, warm summers. Benefits from well-balanced light, temperature, and water availability, as well as fertile black soil, this area has become a major grain-producing region. Maize, rice, and soybean are the primary crops, and the planting period is from May to September. In 2022, the production of maize, rice and soybean in Northeast China accounted for 33.39%, 18.34%, and 51.79% of their national production respectively, and the planting areas of these crops accounted for 30.64%, 16.81%, and 52.29% of their national areas, respectively35.

Fig. 1
Fig. 1
Full size image

The geographical location of the study area and the spatial distribution of key parameters. (a) the location of the Northeast China; (b) the average temperature from 2016 to 2021; (c) the average Photosynthetically Active Radiation (PAR) from 2016 to 2021; the study area and distribution of maize (d), rice (e) and soybean (f), respectively.

Data collection, preprocessing and parameter calculation

The data used for the model in this study are listed in Table 2. Full technical details of the model input and output, data preprocessing, quality control, and mechanism for handling missing data are provided in Supplementary information.

Table 2 Details on the datasets used in this study.

Sentinel-2 Multi-Spectral Instrument (MSI) Level-2A imagery, provided by the European Space Agency (ESA) at a 10 m spatial resolution, was acquired for the growing seasons of maize, rice, and soybean in the study area from 2016 to 2021. The data acquisition and all subsequent preprocessing steps were performed on the Google Earth Engine (GEE) platform. To ensure data quality, a series of preprocessing procedures were implemented. Initially, cloud and cloud-shadow masking were applied using the QA60 quality assessment band. The resulting clear-sky observations were then subjected to spatio-temporal filtering to reduce noise. Our preprocessing workflow also included the standardization of auxiliary meteorological data and the calculation of a custom moisture index, culminating in a multi-source fused dataset. Finally, using the processed dataset, we computed two remote sensing indices including Enhanced Vegetation Index (EVI) and Land Surface Water Index (LSWI)36.

The Photosynthetically Active Radiation (PAR) data are derived from the Global Land Surface Characteristic Parameter Data Product dataset GLASS (Global Land Surface Satellite), with a spatial resolution of 5 km and a temporal resolution of 1 day37. Since the PAR data in this dataset are only available up to 2020, the average value of the PAR from 2016 to 2020 was adopted for the data in 2021 in this study.

The monthly mean temperature, maximum and minimum temperature data from May to September were obtained from the National Tibetan Plateau Data Centre (https://data.tpdc.ac.cn/)38, with a spatial resolution of 1 km.

The spatial distribution dataset of maize, sourced from the National Ecological Science Data Centre (http://www.nesdc.org.cn/), has a spatial resolution of 30 m and a temporal resolution of one year39. The spatial distribution dataset of rice is from the National Ecological Science Data Centre (http://www.nesdc.org.cn/), with corresponding a spatial resolution of 20 m and a temporal resolution of one year40. In addition, the spatial distribution data of rice for 2017 were used for 2016. The spatial distribution data of soybean were obtained from the National Earth System Science Data Centre (http://www.geodata.cn), with a spatial resolution of 30 m and a temporal resolution of one year41. Similarly, the distribution for soybean for 2017 were used in place of the data for 2016.

Site-based yield data of different crops in the study area were obtained from the National Field Scientific Observation Research Stations in Hailun City and Shenyang City, covering the period from 2016 to 2021 (https://www.cnern.ac.cn/). These data were used to verify the accuracy of the estimated output. The statistical data at the city and county scales are derived from statistical yearbooks for the same period (http://www.stats.gov.cn)35. It is worth noting that due to the reform of the agricultural reclamation system in Heilongjiang Province in 2019, the statistical yearbooks for the period of 2016–2018 lack production measurements for the state farms in reclamation districts. Consequently, the yield data in Heilongjiang Province used in this study were derived by combining data from the relevant reclamation areas with county-level statistics42 (Table S1).

VPM for crop yield estimation

The schematic of data collection and preprocessing, model construction and validation are shown in Fig. 2.

Fig. 2
Fig. 2
Full size image

Schematic of data preprocessing, model construction and dataset generation for estimating yields of jade, rice and soybean using yield_\({{\rm{APAR}}}_{{\varepsilon }_{g}}\)& \(a\).

The formula for crop yield converting from NPP is as follows43,44:

$$Y=\frac{{NPP}\ast \left(1-b\right)\ast {HI}}{c\ast \left(1-\omega \right)}$$
(1)

where Y is the crop yield; b is the proportion of biomass in the underground part of the crop compared to the whole plant; HI is the harvest index; c is the plant carbon content; ω is the water content coefficient of the crop in the post-harvest storage period. According to the previous study, b, HI, c, and ω are taken as 0.1, 0.49, 0.45, and 14%, respectively44.

NPP can be estimated using VPM. The absorption of PAR by the vegetation canopy can be classified into chlorophyll absorbing fraction (\({{\rm{FPAR}}}_{{\rm{c}}{\rm{h}}{\rm{l}}}\)) and non-photosynthetic vegetation absorbing fraction (\({{\rm{FPAR}}}_{{\rm{NPV}}}\)) in VPM. Among them, the chlorophyll absorption part (\({{\rm{FPAR}}}_{{\rm{c}}{\rm{h}}{\rm{l}}}\)) is involved in photosynthesis, while the non-photosynthetic vegetation absorption part (\({{\rm{FPAR}}}_{{\rm{NPV}}}\)) is not involved in photosynthesis25,45. NPP can be estimated according to the following equations:

$${\rm{NPP}}={\rm{GPP}}\ast {\rm{CUE}}$$
(2)
$${\rm{GPP}}={\rm{LUE}}\ast {{\rm{APAR}}}_{{\rm{chl}}}$$
(3)
$${{\rm{APAR}}}_{{\rm{chl}}}={{\rm{FPAR}}}_{{\rm{chl}}}\ast {\rm{PAR}}$$
(4)
$${\rm{LUE}}={{\rm{LUE}}}_{0}\ast {{\rm{T}}}_{{\rm{scalar}}}\ast {{\rm{P}}}_{{\rm{scalar}}}\ast {{\rm{W}}}_{{\rm{scalar}}}$$
(5)

where NPP stands for net primary productivity; GPP stands for gross primary productivity; CUE stands for carbon use efficiency, which is the conversion coefficient of GPP to NPP46,47,48; LUE stands for light use efficiency (in gC/MJ); PAR stands for Photosynthetically Active Radiation (in MJ/m²); \({{APAR}}_{chl}\) stands for PAR absorbed by chlorophyll in the canopy; \({{FPAR}}_{chl}\) stands for the proportion of PAR absorbed by the vegetation canopy accounted for by the absorbing fraction of chlorophyll; \({{\rm{LUE}}}_{0}\) represents the maximum light energy use efficiency (in gC/MJ)13,49; \({{\rm{T}}}_{{\rm{scalar}}},{{\rm{P}}}_{{\rm{scalar}}}\) and \({{\rm{W}}}_{{\rm{scalar}}}\) represent the stress coefficients of temperature, phenology and water on the maximum light energy utilization at the canopy scale50.

\({T}_{{scalar}}\) characterizes the effect of temperature on photosynthesis and can be calculated using the following equation51:

$${T}_{{scalar}}=\frac{\left(T-{T}_{\min }\right)\left(T-{T}_{\max }\right)}{\left(T-{T}_{\min }\right)\left(T-{T}_{\max }\right)-{\left(T-{T}_{{opt}}\right)}^{2}}$$
(6)

where \({T}_{\min },{T}_{\max },{T}_{{opt}}\) are the minimum, maximum and optimum temperatures required for vegetation photosynthesis, respectively. The look-up table method was used to determine the triple base temperatures of different vegetation types51. We distinguished between \({T}_{\min },{T}_{\max }\) and \({T}_{{opt}}\) for C3 and C4 crops, which can be obtained from the relevant literature52. \({T}_{{scalar}}\) is set to 0 when the air temperature is less than the minimum photosynthetic temperature.

\({P}_{{scalar}}\) characterizes the effect of changes in leaf phenology on photosynthesis at the crop canopy scale and is calculated depending on the type of crop. Crops like rice, soybean, and maize undergo distinct phases-from leaf emergence to full spreading. During the emergence to spreading phase, the calculation is defined by formula (7). Once leaves reach full spreading (denoted by \({{LSWI}}_{\max }\)), \({P}_{{scalar}}\) reaches its peak value of 1 for the remainder of the growing season.

$${P}_{{scalar}}=\frac{1+{LSWI}}{2}$$
(7)

\({W}_{{scalar}}\) represents the vegetation moisture factor, which quantifies the influence of moisture on photosynthesis, typically determined using LSWI53. The calculation of the vegetation moisture factor is expressed as:

$${W}_{{scalar}}=\frac{1+{LSWI}}{1+{{LSWI}}_{\max }}$$
(8)

The \({{\rm{FPAR}}}_{{\rm{chl}}}\) is approximated by a linear function of EVI:

$${{\rm{FPAR}}}_{{\rm{chl}}}=\alpha \ast {EVI}$$
(9)

where \(\alpha \) is an empirical coefficient, which takes the value of 153.

Yield_\({{\bf{APAR}}}_{{{\boldsymbol{\varepsilon }}}_{{\bf{g}}}}\)& \({\boldsymbol{a}}\) method for crop yield estimation

In the VPM, relatively fixed regionalized parameters such as \({{\rm{LUE}}}_{0}\), HI, CUE, c, b and ω are mainly affected by factors such as crop growth, environment, climatic conditions, crop varieties, and agricultural management conditions in the study area. In this study, we proposed a dynamic observation index \({{\rm{APAR}}}_{{\varepsilon }_{g}}\) and a composite yield-conversion coefficient \(a\) to replace multiple dynamic observation variables and multiple regionalized fixed parameters to reduce and integrate the key parameters of the original equation. According to equations ((1)–(9)), the crop yield can be estimated as follows:

$$Y=\frac{{T}_{{scalar}}\ast {P}_{{scalar}}\ast {W}_{{scalar}}\ast {{\rm{FPAR}}}_{{\rm{chl}}}\ast {PAR}\ast {CUE}\ast \left(1-b\right)\ast {HI}\ast {\varepsilon }_{0}}{c\ast \left(1-\omega \right)}$$
(10)

We propose a dynamic observational index termed \({{\rm{APAR}}}_{{\varepsilon }_{g}}\). From a physical and biological perspective, \({{\rm{APAR}}}_{{\varepsilon }_{g}}\) represents the Photosynthetically Active Radiation that is jointly constrained by temperature (\({T}_{{scalar}}\)), phenology (\({P}_{{scalar}}\)), water availability (\({W}_{{scalar}}\)), and canopy light absorption capacity (\({{\rm{FPAR}}}_{{\rm{chl}}}\)), quantifying the effective photosynthetic energy input under real-world environmental constraints (in MJ/m²). Its mathematical expression is:

$${{\rm{APAR}}}_{{\varepsilon }_{g}}={T}_{{scalar}}\ast {P}_{{scalar}}\ast {W}_{{scalar}}\ast \left({{\rm{FPAR}}}_{{\rm{chl}}}\ast {PAR}\right)$$
(11)

Correspondingly, we propose a yield conversion coefficient \(\,a\) that consolidates all the empirical parameters \({{\rm{LUE}}}_{0}\), HI, CUE, c, b and \(\omega \) in the comprehensive yield equation. They collectively define the overall efficiency of converting the captured effective energy (\({{\rm{APAR}}}_{{\varepsilon }_{g}}\)) into final yield. Therefore, we consolidate these empirical factors into a single yield conversion coefficient \(a\):

$$a=\frac{{CUE}\ast \left(1-b\right)\ast {HI}\ast {{LUE}}_{0}}{c\ast \left(1-\omega \right)}$$
(12)

The \({{\rm{APAR}}}_{{\varepsilon }_{g}}\) can be calculated by the parameters we have mentioned in the section of “Data collection, preprocessing and parameter calculation”. The coefficient \(a\) is affected by a variety of factors such as crop type, environmental conditions of the study area, management practices, and crop varieties. Rather than prescribing fixed values for each component of \(a\), we calibrate it empirically. Specifically, \(a\) for each city was derived using a zero-intercept linear regression model that relates the average \({{\rm{APAR}}}_{{\varepsilon }_{g}}\) in the specific crop distribution area to the statistical yields of the counties within the city for the period 2016–2021 (Fig. S1). Finally, the crop yield in each grid can be calculated as follows:

$$Y=a\ast \mathop{\sum }\limits_{{t}_{0}}^{{t}_{1}}\left({{\rm{APAR}}}_{{\varepsilon }_{g}{\rm{t}}}\ast {D}_{t}\right)$$
(13)

where \({{\rm{t}}}_{0}\) and \({{\rm{t}}}_{1}\) are the starting and ending months of the crop growing season, \({{\rm{APAR}}}_{{\varepsilon }_{g}t}\) is the monthly average \({{\rm{APAR}}}_{{\varepsilon }_{g}}\) for month \(t\) and \({D}_{t}\) is the number of days in month \(t\).

Validation

The validation of our simulation results was evaluated through three types of data including statistical data, field data, and existing data products. For statistical data, we implemented a 10-fold cross-validation procedure to prevent overfitting and obtain an unbiased performance estimate because our model relies on statistical datasets for calibrating key parameters (specifically, parameter \(a\)). Specifically, based on the county-level statistical yield data and \({{\rm{APAR}}}_{{\varepsilon }_{g}}\) values for each crop across Northeast China from 2016 to 2021, we divided the dataset into 10 subsets. During each iteration, 90% of the data is used to train the linear regression model, and the remaining 10% is retained as an independent validation set to calculate the prediction error. For field data and existing data products, we conducted ground-truth validation by comparing the field data from Hailun and Shenyang stations during 2016–2021 with our results, the 10 km Global Gridded Crop Production (GGCP) dataset8, and the Spatial Production Allocation Model (SPAM) yield dataset54. The above verification processes were all evaluated in this study using correlation coefficient (r), mean relative error (MRE), and relative root mean square error (rRMSE)2 for evaluation.

Data Records

The dataset is available on ref. 55. We provide the maize, rice and soybean yield distribution maps and datasets for Northeast China from 2016 to 2021. The crop yield distribution maps and datasets for Northeast China from 2016 to 2021 are named according to the respective years and are provided in Geo TIFF format with a spatial resolution of 10 m. Pixel values in these maps and datasets represent crop yield values, expressed as crop yield per unit area, in tonne per hectare (t/ha). Specific datasets detailing the yield for each crop are available in the provided link (https://doi.org/10.6084/m9.figshare.27717624.v3)55.

Technical Validation

Comparing the performance of different methods using statistical data

The result shows that the yield estimation accuracy of the yield_\({{\rm{APAR}}}_{{\varepsilon }_{g}}\)& \(a\) method is superior to that the yield_NPP&EP. Notably, the yield_\({{\rm{APAR}}}_{{\varepsilon }_{g}}\)& \(a\) method has achieved significant enhancement in r, MRE and rRMSE indexes, especially for rice, which r has improved by 0.41, MRE by 9.15% and rRMSE by 10.04%. Overall, crop yield estimation in the improved model shows better consistence with statical data at county level (Fig. 3). The advantage of the yield_\({{\rm{APAR}}}_{{\varepsilon }_{g}}\)& \(a\) method lies in that it uses of \({{\rm{APAR}}}_{{\varepsilon }_{g}}\) and a to replace numerous regionally fixed parameters and dynamic observation variables in the original model, enabling the capture of key parameters without the need for direct measurement. This approach effectively captures interaction effects among diverse geographic factors, thereby optimizing and adjusting many of the discrepancies inherent in the yield_NPP&EP.

Fig. 3
Fig. 3
Full size image

The prediction performance of maize yield from 2016 to 2021 simulated by the yield_NPP&EP method (a), the yield_\({{\rm{APAR}}}_{{\varepsilon }_{g}}\)& \(a\) method (b) and the yield_\({{\rm{APAR}}}_{{\varepsilon }_{g}}\)& \(a\) method (c) without outliers; the prediction performance of rice yield from 2016 to 2021 simulated by the yield_NPP&EP method (d), the yield_\({{\rm{APAR}}}_{{\varepsilon }_{g}}\)& \(a\) method (e) and the yield_\({{\rm{APAR}}}_{{\varepsilon }_{g}}\)& \(a\) method (f) without outliers; the prediction performance of soybean yield from 2016 to 2021 simulated by the yield_NPP&EP method (g), the yield_\({{\rm{APAR}}}_{{\varepsilon }_{g}}\)& \(a\) method (h) and the yield_\({{\rm{APAR}}}_{{\varepsilon }_{g}}\)& \(a\) method (i) without outliers.

Comparing the performance of our model for estimating yields across different yield levels using statistical data

By comparing the estimated yields of maize, rice, and soybean in Northeast China from 2016 to 2021 with statistical data, we observed large deviations from estimated norms in several counties, especially in areas with low statistical anomalies. To improve the accuracy of yield predictions, we screened the average statistical yields of different crops separately by crop type: for maize, we excluded yields below 4 t/ha, 5 t/ha, and 6 t/ha; for rice, yields below 5 t/ha, 6 t/ha, and 7 t/ha; and for soybean, yields lower than 1.5 t/ha or greater than 3.5 t/ha. Some counties exhibited significant deviations from the expected low-value statistical anomalies (Fig. 3a,d,g). Meanwhile, maize, rice and soybean performed well under the yield_\({{\rm{APAR}}}_{{\varepsilon }_{g}}\)& \(a\) method algorithm in districts with average yields of more than 6 t/ha, 7 t/ha and 3.5 t/ha, resulting in MRE of 11.78%, 11.46% and 18.75% and rRMSE of 16.73%, 15.39% and 23.33%, respectively. However, for maize, rice and soybean using the dataset within the optimal accuracy range mentioned above led to significant disruptions in temporal continuity across some counties, resulting in poor performance at the time-series scale. To tackle this issue, we chose datasets that closely aligned with the optimal accuracy criteria, focusing on districts with average statistical yields surpassing 5 t/ha for maize, 6 t/ha for rice and between 1.5 t/ha and 3.5 t/ha for soybean. This approach resulted in MRE of 12.14%, 11.93% and 14.06% and rRMSE of 16.97%, 15.97% and 17.26% for maize, rice and soybean, respectively. Such meticulous selection not only ensures the accuracy of the model predictions and maintains temporal consistency across counties, but also illustrates the particular applicability of the yield_\({{\rm{APAR}}}_{{\varepsilon }_{g}}\)& \(a\) method to yield estimation in areas with high crop yields. Post-screening, prediction accuracy was assessed separately for each crop (Fig. S2).

In areas where maize yields averaged over 5 t/ha, LN exhibited poor cross-validation results with MRE and rRMSE values of 15.76% and 21.41%, respectively. Conversely, other regions demonstrated high consistency in measured and validated data, with JL achieving the best cross-validation results (MRE = 9.67%, rRMSE = 12.99%) followed by HLJ (MRE = 11.74%, rRMSE = 15.89%). Similarly, for rice, LN’s cross-validation results were subpar, with MRE and rRMSE values of 13.30% and 17.09%, respectively. JL led with superior performance (MRE = 10.93%, rRMSE = 14.61%), followed closely by HLJ (MRE = 11.96%, rRMSE = 16.12%). Regarding soybean, LN exhibited poorer cross-validation results (MRE = 16.29%, rRMSE = 20.47%), whereas HLJ demonstrated the best performance (MRE = 13.97%, rRMSE = 18.32%), and JL also showed strong results (MRE = 14.06%, rRMSE = 17.26%).

Validation using field data and comparison with other datasets in Northeast China

To verify the accuracy and advancement of the dataset in this study, we compared the results of this study which aggregated to the 10 km spatial resolution (Fig. 4a), the GGCP dataset (Fig. 4b), this study at 10 m resolution (Fig. 4c) and the SPAM dataset (Fig. 4d) with the site-based yield data from Hailun and Shenyang for the 2016–2021. Specifically, the 10 m yield estimates from this study first aggregated to the 10 km spatial resolution prior to quantitative comparison. Compared with the GGCP dataset, our results show lower MRE and rRMSE (Fig. 4a,b). Furthermore, a comparison with maize and soybean from the 2020 SPAM dataset (with no matching sample points for rice) (Fig. 4c,d) confirmed the superior accuracy of our dataset in yield estimation. Specifically, our dataset exhibited a higher r value (0.09 higher), lower MRE (6.6% lower), and lower rRMSE (7.0% lower) compared to the SPAM dataset, with tighter clustering around the 1:1 line, further indicating improved accuracy in both r value and error metrics.

Fig. 4
Fig. 4
Full size image

Comparison of simulated yields from this study which aggregated to the 10 km spatial resolution (a), the GGCP dataset (b), this study at 10 m resolution (c) and the SPAM dataset (d) with the measured yields.

Patterns of crop yields in Northeast China from 2016 to 2021

We evaluated the annual yield maps by comparing their spatiotemporal dynamics against government-reported county-scale yields for maize, rice, and soybean. Yield distributions aligned closely across sources: maize predominantly ranged between 5–10 t/ha, rice between 5–10 t/ha, and soybean between 1–3 t/ha (Figs. S35). Spatial patterns from 2016 to 2021 consistently reflected statistical yield trends at this scale (Fig. 5): maize yields decreased east to west, rice declined west to east, and soybean diminished south-to-north (Figs. S68). However, these spatial relationships exhibited inconsistencies in select years, likely attributable to anomalous events such as localized disasters.

Fig. 5
Fig. 5
Full size image

The yield distribution of maize (af), rice (gl) and soybean (mr) with 10 m spatial resolution in the Northeast China from 2016 to 2021.

Uncertainties and limitations

The input-dependency analysis of the maize yield model in Jilin Province indicates that uncertainties associated with spatial representation constitute the dominant and explicitly quantified factor influencing dataset reliability. A comparison between simulations driven by 30 m and 1 km resolution inputs (Fig. S9) demonstrates improved performance when higher-resolution spatial data are used, highlighting the critical role of spatial scale in resolving yield variability. The spatially heterogeneous model performance observed across regions can therefore be partly attributed to these scale-related effects13. Even with limited observation points, this method can effectively assess spatial heterogeneity. It remains capable of supporting a range of applications, including rapid and reliable regional yield assessment, analysis of yield spatial patterns, comparison of inter-county differences, and study of long-term yield trends. At the county level, the dataset reliably captures overall yield levels and inter-county differences, and cross-validation among counties supports its regional applicability. However, within individual counties, yield variability is strongly influenced by fine-scale heterogeneity in management practices, environmental conditions, and stress factors that are not fully represented in the model inputs. In high-yield areas, relatively homogeneous cropping systems and favorable growing conditions reduce the sensitivity of yield estimates to input uncertainties, resulting in more stable model performance. In contrast, low-yield or anomalous regions often exhibit greater variability in crop management, pest and disease impacts, and episodic stress events, which can amplify spatial and temporal uncertainties and lead to reduced stability of yield estimates.

Beyond spatial representation, several sources of uncertainty further constrain model performance13. Temporal gaps and cloud contamination in optical remote sensing data primarily introduce noise into vegetation signal retrieval during periods critical for yield formation, affecting interannual variability rather than systematic yield levels28. Uncertainties in phenology detection may propagate into seasonal productivity estimates, particularly in regions characterized by heterogeneous cropping practices. In addition, the reliance on monthly temperature data limits the representation of short-term thermal stress and daytime photosynthetic conditions, which may influence yield estimates in anomalous years. Finally, the absence of long-term, plot-level yield observations does not directly contribute to model error, but constrains independent validation at sub-county or field scales, thereby limiting the direct applicability of the dataset for field-scale or management-level applications56.

Future improvements to the model could address these limitations. These limitations reflect common constraints in large-scale yield datasets derived from remote sensing and reanalysis products. Future extensions of the dataset may reduce some of these uncertainties through multi-sensor data integration to mitigate cloud effects, the incorporation of pixel-level uncertainty characterization57, and the use of refined temperature metrics to better approximate daytime crop growth conditions.

Usage Notes

The 10 m annual crop yield maps produced in this study represent a crucial regional-scale dataset for agricultural research. This dataset is most suitable for crop yield monitoring and trend analysis in medium or high yield, large-scale cultivation areas in Northeast China and similar agro-ecological zones, and is particularly well suited for regional-scale analyses, spatial pattern assessment, inter-county comparisons, and long-term yield variability studies. The yield_\({{\rm{APAR}}}_{{\varepsilon }_{g}}\)& \(a\) method is suitable for using in the estimation of relatively high-yield areas for maize, rice and soybean. This dataset not only holds direct significance for improving agricultural yield efficiency and ensuring food security but also has far-reaching impacts on environmental protection, disaster mitigation, and decision-making.