Introduction

The growing demand for monitoring land cover changes and terrestrial ecosystems has driven the development of spatiotemporal data fusion techniques to address critical challenges in phenology18, natural disaster response1, anthropogenic disturbances7,11, and agricultural management12. Traditional satellite systems face inherent limitations, including frequent cloud contamination6 and lengthy revisit cycles17, which restrict their effectiveness in capturing high-resolution spatiotemporal dynamics23. To overcome these constraints, fusion methods have emerged to integrate complementary data sources—combining fine-spatial-resolution imagery (e.g., Landsat) with coarse-spatial-but-high-temporal-resolution data (e.g., MODIS)—to generate synthetic datasets with dual high resolution19.

The pioneering STARFM framework6 established the foundation for temporal interpolation-based approaches, leveraging neighboring pixels to predict intermediate states. Subsequent variants like ESTARFM22 and STARRFM10 enhanced temporal sensitivity but remained constrained by mixed-pixel challenges in heterogeneous landscapes23. Parallel developments in unmixing-based methods, initiated by Zhukov et al.25, addressed spectral heterogeneity through endmember proportion analysis, assuming temporal consistency of land cover components26. While effective in stable environments, these approaches struggle with dynamic systems where spectral signatures vary nonlinearly across bands and seasons (Fig. 1).

Spectral variability is a significant factor affecting the performance of image fusion techniques. Recent studies have highlighted the challenges posed by spectral variations, particularly in heterogeneous landscapes, where the reflectance values across different spectral bands can change drastically over time due to seasonal dynamics3. For example, in regions with diverse land cover types, such as agricultural or forested areas, the spectral reflectance can vary significantly between seasons, making it difficult for traditional fusion methods to accurately model these changes. A recent study demonstrated that spectral variability often leads to suboptimal fusion results, as the temporal changes in spectral characteristics are not always linearly predictable across different bands9.

Recent spatiotemporal data fusion methods aim to synthesize high-resolution spatial and temporal data by integrating sparse fine-resolution images with frequent coarse-resolution images from different satellite sensors. These approaches are designed based on various principles, each with distinct strengths and limitations, making it challenging for users to select the most suitable method for their specific applications2,24. Recent studies have introduced novel machine learning techniques such as the Extreme Learning Machine (ELM), which directly learns mapping functions from difference images rather than relying on complex feature representations13. In addition, deep learning methods have gained increasing attention for their ability to handle complex spatiotemporal fusion tasks, especially in the presence of cloud contamination and large land cover changes. For example, the Multi-scene Spatiotemporal Fusion Network (MUSTFN), based on convolutional neural networks (CNNs), integrates multi-level features from different resolutions and sensors to improve fusion accuracy, even in areas with large registration errors and rapid land cover changes16. Similarly, conditional generative adversarial networks have been applied to fuse optical and microwave data, effectively filling gaps in cloud-contaminated imagery and improving data availability14. Other deep learning approaches, such as the Dual-Branch Subpixel-Guided Network (DSNet), have been proposed for hyperspectral image classification, combining subpixel information with spectral features to enhance classification accuracy and demonstrating the growing potential of deep learning models for handling mixed pixels and improving decision boundaries8. These advancements highlight the increasing shift toward data-driven fusion techniques, offering more accurate, efficient, and adaptable solutions for spatiotemporal data fusion in remote sensing applications.

Despite these advancements, two critical gaps remain in current spatiotemporal fusion methods. First, while deep learning approaches have shown promising results, they often require large amounts of training data, which limits their applicability in regions where historical imagery is sparse or unavailable. This issue is particularly challenging in remote or under-monitored areas, where obtaining sufficient labeled data for training models is not feasible. Second, traditional unmixing frameworks often struggle to accurately capture seasonal spectral variations, especially in heterogeneous landscapes with dynamic land cover changes. While effective in stable environments, these methods fail to account for the complex, nonlinear changes in reflectance values across seasons and land cover types, leading to suboptimal fusion results in regions with high variability in land use or climate conditions (Fig. 1).

Fig. 1

The spectrum of land-cover types during different seasons. The reflectance value of each band was extracted using Landsat 8 OLI.

In temperate zones, the phenology of various landscapes is distinctly reflected in the changing reflectance values across different spectral bands, as influenced by seasonal variations. This phenomenon is noticeable in a variety of land covers such as farmlands, grasslands, wetlands, and riversides. Specifically, Fig. 1a depicts the spectral data from built-up areas, where the change in band values between the two seasons is almost negligible. In contrast, for vegetation-covered areas such as farmland (Fig. 1c) and forests (Fig. 1b), there is a significant shift in Band 5 (the NIR band). This band is particularly sensitive to vegetation growth and thus shows the greatest variation. Figure 1d highlights the transformation at the riverside from water to bare ground depending on the water flow, resulting in a completely altered spectrum between the seasons. The degree of change in band values varies significantly across different types of land cover, which introduces uncertainties when these land-cover types coexist within a single coarse pixel.

To address these limitations, we propose the Residual-Distribution-based Spatiotemporal Data-Fusion Method (RDSFM). This method aims to distribute residuals more accurately across the subpixels of each band and is specifically designed to effectively reflect the degree of change in heterogeneous areas and in regions where land-cover types have altered. RDSFM requires only one high-spatial-resolution image and two high-temporal-resolution images to function optimally. The significance of this study lies in its potential to improve the precision of remote sensing data analysis by refining the spatial and temporal details captured in heterogeneous landscapes. This is particularly valuable for ecological monitoring, urban planning, and agricultural management. Additionally, the RDSFM method could substantially contribute to the advancement of remote sensing technologies and methodologies and to more sustainable management of natural resources.

Methods

To more accurately distribute residuals across each subpixel, this study introduces a weighting system based on the multivariate relationships between the images captured at times t1 and t2. This weighting is determined through the Iteratively Regularized Multivariate Alteration Detection (IR-MAD) method, which is designed to detect changes in multivariate data collected at two temporal points from the same geographical area15. The custom code used in this study for the RDSFM is available on the personal webpage (https://homepage.ybu.edu.cn/jinyihua).

Estimate the residuals

The real changed values of coarse images between t1 and t2 can be written as follows:

$$\:\varDelta\:C\left({x}_{i},\:{y}_{i},\:b\right)={C}_{2}\left({x}_{i},\:{y}_{i},\:b\right)-{C}_{1}\left({x}_{i},\:{y}_{i},\:b\right)$$
(1)

where \(\:{C}_{1}({x}_{i},\:{y}_{i},\:b)\) is the band b value of the coarse pixel at location \(\:({x}_{i},\:{y}_{i})\) at t1, and \(\:{C}_{2}({x}_{i},\:{y}_{i},\:b)\) is the corresponding band b value at t2.

In the context of a homogeneous landscape, the errors stemming from temporal changes can be characterized by the discrepancies between the actual values and \(\:{F}_{2}^{TP}\), the temporally predicted fine-resolution image at t2. Therefore, the predicted changes between \(\:{F}_{2}^{TP}\) and the actual fine-resolution image at t1 are as follows:

$$\:\varDelta\:F\left({x}_{i},\:{y}_{i},\:b\right)=\frac{1}{m}\sum\:_{j=1}^{m}{F}_{2}^{TP}({x}_{ij},\:{y}_{ij},\:b)-\frac{1}{m}\sum\:_{j=1}^{m}{F}_{1}({x}_{ij},\:{y}_{ij},\:b)$$
(2)

where \(\:m\) is the number of fine pixels within one coarse pixel, and \(\:{F}_{2}^{TP}\) is the temporally predicted value of the fine pixels at t2 obtained using the unmixing-based method.

Between the real changed values and the temporally predicted changed values there are residuals. These residuals are mainly caused by spatial and temporal changes, such as land-cover type changes, within-class variability across the image, and seasonal changes in pixel values23. Distributing the residuals \(\:R\left({x}_{i},\:{y}_{i},\:b\right)\) to the fine pixels within a coarse pixel is important for improving the accuracy of the predicted fine-pixel values at t2. The residual R of each band between the true changed values and the predicted changed values can be derived as follows:

$$\:R\left({x}_{i},\:{y}_{i},\:b\right)=\varDelta\:C\left({x}_{i},\:{y}_{i},\:b\right)-\varDelta\:F\left({x}_{i},\:{y}_{i},\:b\right)$$
(3)

Because different landscape types can have similar reflectance at a particular time, pixels may be misclassified, generating errors in some cases. An important aspect of residual distribution is understanding the variation of each band. To distribute the residuals to the subpixels more precisely, we introduce MAD-based weights estimated by IR-MAD.
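Purely for illustration, the following NumPy sketch computes the band-wise residual of Eqs. (1)–(3); it assumes the coarse and fine images are already coregistered single-band 2-D arrays, that an integer number of fine pixels (scale per side, so m = scale × scale) falls within each coarse pixel, and that the temporal prediction F2_TP has been produced beforehand by the unmixing step. The function and variable names are our own and are not part of the released code.

```python
import numpy as np

def band_residuals(C1, C2, F1, F2_TP, scale):
    """Residual R of Eq. (3) for a single band.

    C1, C2    : coarse images at t1 and t2, shape (Hc, Wc)
    F1, F2_TP : fine image at t1 and its temporal prediction at t2,
                shape (Hc*scale, Wc*scale)
    scale     : number of fine pixels per coarse pixel along one axis
    """
    # Eq. (1): real change observed in each coarse pixel
    dC = C2 - C1

    # Eq. (2): mean predicted change of the m = scale*scale fine pixels
    # that fall inside each coarse pixel (block average)
    Hc, Wc = C1.shape
    dF = (F2_TP - F1).reshape(Hc, scale, Wc, scale).mean(axis=(1, 3))

    # Eq. (3): residual between the real and the predicted coarse change
    return dC - dF
```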

Estimate weights for residual distribution using IR-MAD

The IR-MAD method is adopted for the multivariate calculation owing to its effectiveness, speed, and fully automatic processing. To distribute the residuals to each subpixel more appropriately, this study introduces a weight based on the multivariate relationship between the images at t1 and t2, estimated by iteratively regularized multivariate alteration detection (IR-MAD), which detects changes in multivariate data acquired at two points in time covering the same geographical region15.

The main idea of IR-MAD is a simple iterative scheme to place high weights on observations that exhibit little change over time. The IR-MAD algorithm is an established change detection technique based on canonical correlation analysis. Mathematically, the IR-MAD tries to identify linear combinations of two variables, \(\:{a}^{T}X\) and \(\:{b}^{T}Y\), to maximize the objective function \(\:{max}_{a,b}\:var({a}^{T}X-{b}^{T}Y)\) with \(\:V\left\{{a}^{T}X\right\}=V\left\{{b}^{T}Y\right\}=1\). The dispersion matrix of the MAD variates is as follows:

$$\:D=V\left\{{a}^{T}X-{b}^{T}Y\right\}=V\left\{{a}^{T}X\right\}+V\left\{{b}^{T}Y\right\}-2cov\left\{{a}^{T}X,\:{b}^{T}Y\right\}=2\left(1-corr\left\{{a}^{T}X,\:{b}^{T}Y\right\}\right)$$
(4)

A MAD variate is the difference between the highest order canonical variates and it can be expressed as follows:

$$\:\left[\begin{array}{c}X\\\:Y\end{array}\right]\to\:\left[\begin{array}{c}{a}_{p}^{T}X-{b}_{p}^{T}Y\\\: \vdots \\\:{a}_{1}^{T}X-{b}_{1}^{T}Y\end{array}\right]$$
(5)

where \(\:{a}_{i}\) and \(\:{b}_{i}\) are the defining coefficients from a standard canonical correlation analysis.

Using a brief derivation, the objective function can be reformulated to minimize the canonical correlation of the two variables, that is \(\:min\lambda\:=corr({a}^{T}X,\:{b}^{T}Y)\), where corr represents a correlation function. If we let the variance-covariance matrices of X and Y be \(\:{{\Sigma\:}}_{XX}\) and \(\:{{\Sigma\:}}_{YY}\), respectively, and their covariance be \(\:{{\Sigma\:}}_{XY}\), the correlation can be formulated as Rayleigh quotients as follows:

$$\:\text{min}{\lambda\:}^{2}=\frac{{a}^{T}{{\Sigma\:}}_{XY}{{\Sigma\:}}_{YY}^{-1}{{\Sigma\:}}_{YX}a}{{a}^{T}{{\Sigma\:}}_{XX}a}=\frac{{b}^{T}{{\Sigma\:}}_{XY}{{\Sigma\:}}_{XX}^{-1}{{\Sigma\:}}_{YX}b}{{b}^{T}{{\Sigma\:}}_{YY}b}$$
(6)

Equation (6) is just an eigenvalue problem, that is \(\:{{\Sigma\:}}_{XY}{{\Sigma\:}}_{YY}^{-1}{{\Sigma\:}}_{YX}a={\lambda\:}^{2}{{\Sigma\:}}_{XX}a\), and hence the solutions are the eigenvectors \(\:{a}_{1},\:\dots\:,\:{a}_{n}\) corresponding to the eigenvalues \(\:{\lambda\:}_{1}^{2}\ge\:\dots\:\ge\:{\lambda\:}_{n}^{2}\ge\:0\) of \(\:{{\Sigma\:}}_{XY}{{{\Sigma\:}}_{YY}^{-1}{\Sigma\:}}_{YX}\) with respect to \(\:{{\Sigma\:}}_{XX}\). Now assuming that the MODIS image is preprocessed to be zero mean, we denote \(\:MAD={a}^{T}X-{b}^{T}Y\) as the MAD components of the combined bi-temporal image. In this process, the ENVI extension for IR-MAD developed by Canty4 was used for running the IR-MAD and obtaining the MAD variates.

MAD variates are determined by the correlation between the two images; larger values indicate substantial spectral change (changed areas), whereas lower values indicate no spectral change, i.e., areas that remain the same as in the previous image4,15. Here, the input data for the multivariate calculation are the fine-resolution image at t1 and the coarse-resolution image at t2, which was resampled to the fine resolution through bilinear interpolation.
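The MAD variates in this study were obtained with Canty's ENVI extension. Purely as an illustration of the iterative scheme sketched above, a compact NumPy/SciPy version might look as follows; the inputs are assumed to be the two coregistered images flattened to (pixels × bands) arrays (here, the fine image at t1 and the bilinearly resampled coarse image at t2), and the fixed iteration count, sign handling, and weighting details are our own simplifications rather than features of the published implementation.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.stats import chi2

def irmad(X, Y, n_iter=30):
    """Illustrative IR-MAD sketch: X, Y are (n_pixels, n_bands) arrays
    for times t1 and t2; returns the MAD variates of the final iteration."""
    n, p = X.shape
    w = np.ones(n)                                   # initial no-change weights
    for _ in range(n_iter):
        wsum = w.sum()
        Xc = X - (w[:, None] * X).sum(0) / wsum      # weighted centering
        Yc = Y - (w[:, None] * Y).sum(0) / wsum
        Sxx = (w[:, None] * Xc).T @ Xc / wsum        # weighted covariance blocks
        Syy = (w[:, None] * Yc).T @ Yc / wsum
        Sxy = (w[:, None] * Xc).T @ Yc / wsum
        # coupled generalized eigenproblems of Eq. (6)
        lam2, A = eigh(Sxy @ np.linalg.solve(Syy, Sxy.T), Sxx)
        _, B = eigh(Sxy.T @ np.linalg.solve(Sxx, Sxy), Syy)
        order = np.argsort(lam2)[::-1]               # sort by decreasing correlation
        lam2, A, B = lam2[order], A[:, order], B[:, order]
        B = B * np.sign(np.diag(A.T @ Sxy @ B))      # make corr(a^T X, b^T Y) positive
        mad = Xc @ A - Yc @ B                        # MAD variates
        # chi-square no-change probabilities become the next iteration's weights
        sigma2 = np.maximum(2.0 * (1.0 - np.sqrt(np.clip(lam2, 0.0, 1.0))), 1e-12)
        z = (mad ** 2 / sigma2).sum(axis=1)
        w = 1.0 - chi2.cdf(z, df=p)
    return mad
```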

Distribute the residuals to the fine pixel

Errors in temporal prediction are mainly caused by land-cover type change and within-class variability across the image. Therefore, this study proposes a new weighting function to distribute the residuals to the subpixels, considering the variation in each band and the degree of heterogeneity. A multivariate-based weight (\(\:{W}_{MAD}\)) is introduced as follows:

$$\:{W}_{MAD}({x}_{ij},\:{y}_{ij},\:b)=D({x}_{ij},\:{y}_{ij},\:b)/\sum\:_{j=1}^{m}D({x}_{ij},\:{y}_{ij},\:b)$$
(7)

To account for heterogeneity, we assign higher residual weights to fine pixels in more heterogeneous surroundings by utilizing the Homogeneous Index (HI):

$$\:HI\left({x}_{ij},\:{y}_{ij}\right)=\left(\sum\:_{k=1}^{m}{I}_{k}\right)/m$$
(8)

where \(\:{I}_{k}\) equals 1 when the kth fine pixel within the moving window has the same land-cover type as the central fine pixel \(\:({x}_{ij},\:{y}_{ij})\) being considered, and 0 otherwise. HI ranges from 0 to 1; larger values indicate a more homogeneous landscape and smaller values a more heterogeneous one. The weight for combining the two cases is:

$$\:W\left({x}_{ij},\:{y}_{ij},\:b\right)=R\left({x}_{ij},\:{y}_{ij},\:b\right)*\left(1-HI\left({x}_{ij},\:{y}_{ij}\right)\right)+R({x}_{ij},\:{y}_{ij},b)*{W}_{MAD}({x}_{ij},\:{y}_{ij},\:b)$$
(9)

The weight is then normalized as follows:

$$\:{W}_{Normalized}\left({x}_{ij},\:{y}_{ij},\:b\right)=W({x}_{ij},\:{y}_{ij},\:b)/\sum\:_{j=1}^{m}W({x}_{ij},\:{y}_{ij},\:b)$$
(10)

Then, the residual distributed to jth fine pixel is as follows:

$$\:r({x}_{ij},\:{y}_{ij},\:b)={W}_{Normalized}\left({x}_{ij},\:{y}_{ij},\:b\right)*R({x}_{i},\:{y}_{i},b)$$
(11)

Summing the distributed residual and the temporal change, we can obtain the prediction of the total change of a fine pixel between t1 and t2 as follows:

$$\:\varDelta\:F\left({x}_{ij},\:{y}_{ij},\:b\right)=r\left({x}_{ij},\:{y}_{ij},\:b\right)+[{F}_{2}^{TP}\left({x}_{ij},\:{y}_{ij},b\right)-{F}_{1}\left({x}_{ij},\:{y}_{ij},b\right)].$$
(12)
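To make the distribution step concrete, the sketch below chains Eqs. (7)–(12) for one band stack. The array layout, the per-fine-pixel, per-band quantity D derived from the MAD output (its exact construction from the MAD variates is not restated here), and the function name are illustrative assumptions rather than the released code; HI is assumed to be precomputed from a fine-resolution class map as in Eq. (8).

```python
import numpy as np

def distribute_residuals(R, D, HI, F1, F2_TP, scale):
    """Distribute the coarse residuals to fine pixels following Eqs. (7)-(12).

    R     : residual per coarse pixel and band, shape (Hc, Wc, B)   -- Eq. (3)
    D     : per-band MAD-derived dispersion per fine pixel, (Hf, Wf, B)
    HI    : Homogeneous Index per fine pixel, shape (Hf, Wf)        -- Eq. (8)
    F1    : fine image at t1, shape (Hf, Wf, B)
    F2_TP : temporally predicted (unmixing-based) fine image at t2, same shape
    scale : fine pixels per coarse pixel along one axis (m = scale*scale)
    """
    Hc, Wc, B = R.shape
    dF = np.empty_like(F1, dtype=float)

    for ic in range(Hc):
        for jc in range(Wc):
            sl = (slice(ic * scale, (ic + 1) * scale),
                  slice(jc * scale, (jc + 1) * scale))
            hi = HI[sl]                                  # (scale, scale) block
            for b in range(B):
                d = D[sl][..., b]
                w_mad = d / d.sum()                      # Eq. (7)
                Rb = R[ic, jc, b]
                W = Rb * (1.0 - hi) + Rb * w_mad         # Eq. (9)
                total = W.sum()
                Wn = W / total if total != 0 else np.full_like(W, 1.0 / W.size)  # Eq. (10)
                r = Wn * Rb                              # Eq. (11)
                # Eq. (12): distributed residual plus the temporal change
                dF[sl][..., b] = r + (F2_TP[sl][..., b] - F1[sl][..., b])
    return dF
```

Under these assumptions, the fine-resolution prediction at t2 would follow as F1 plus the returned change dF.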

Testing experiment

For the experiments, this study utilized Landsat images covering two distinct study areas, each exhibiting unique spatial patterns and dynamics. The first area features a complex and heterogeneous landscape, while the second undergoes significant land-cover-type changes. These sites have been pivotal in evaluating various spatiotemporal data-fusion methods, such as STARFM, ESTARFM, and FSDAF, serving as benchmarks for comparing and testing these techniques. The satellite images for both sites are cloud-free Landsat 7 ETM+ images (https://earthexplorer.usgs.gov), which were atmospherically corrected using the FLAASH (Fast Line-of-sight Atmospheric Analysis of Spectral Hypercubes) algorithm.

The first site (Fig. 2), located in the southern region of New South Wales, Australia (145.0675°E, 34.0034°S), is characterized by complex and heterogeneous terrain. The Landsat images for this site (Path/Row 93/84) were acquired on November 25, 2001, and January 12, 2002. The predominant land types within this locale include irrigated rice cropland, dryland agriculture, and woodlands. Notably, rice croplands are typically irrigated during October and November5,23, and these seasonal irrigation practices define the phenological changes in the area, particularly in the rice cropland, marking a distinct shift in vegetation dynamics. This seasonal change is crucial for understanding the land-cover transitions in this region.

Fig. 2

Test data in a complex and heterogeneous landscape: Landsat images acquired on (a) November 25, 2001 and (b) January 12, 2002, (c) and (d) are the corresponding MODIS images with a 500-m spatial resolution to (a) and (b). All images use red-green-blue as RGB. The Landsat 7 ETM+ images were obtained from the U.S. Geological Survey (USGS) (https://www.usgs.gov/), while the MODIS MOD09GA were provided by NASA’s Earth Observing System Data and Information System (EOSDIS) (https://earthdata.nasa.gov/).

The second site (Fig. 3), which experienced land-cover type changes, is located in northern New South Wales, Australia (149.2815°E, 29.0855°S). This area is relatively homogeneous, featuring extensive croplands and natural vegetation. Two Landsat images were captured, one on November 26, 2004, and the other on December 12, 2004 (Path/Row 91/80). A flood event occurred in December 2004, which is evident from the December 12 Landsat image, which displays a vast inundated region. This flood caused a distinct and rapid change in the land cover, transforming some of the cropland and vegetation areas into water-covered pixels. The shift from terrestrial vegetation to water in these regions illustrates the dynamic land-cover type changes that occur during extreme weather events. Such changes are crucial for monitoring and understanding the temporal variations in land cover in this region.

Fig. 3

Test data in an area with land-cover type change: Landsat images acquired on (a) November 26, 2004 and (b) December 12, 2004, (c) and (d) are the corresponding MODIS images with a 500-m spatial resolution to (a) and (b). All images use red-green-blue as RGB. The Landsat 7 ETM+ images were obtained from the U.S. Geological Survey (USGS) (https://www.usgs.gov/), while the MODIS MOD09GA were provided by NASA’s Earth Observing System Data and Information System (EOSDIS) (https://earthdata.nasa.gov/).

MODIS images were also incorporated for data fusion in this study. The MODIS data utilized comes from the Moderate Resolution Imaging Spectroradiometer (MODIS) onboard the Terra and Aqua satellites, with a spatial resolution of 500 m. The specific product used in this study is the MOD09GA (Surface Reflectance, 8-day L3 Global 500 m), which provides corrected surface reflectance values for bands 1 to 7. To ensure proper alignment between the MODIS and Landsat data, a coregistration process was performed. This process involved resampling the MODIS data to match the spatial resolution of Landsat imagery (30 m) and aligning the georeferencing to ensure pixel-wise accuracy. Furthermore, atmospheric correction was applied to both the MODIS and Landsat datasets to mitigate atmospheric scattering and absorption effects. The MODIS data were corrected using the MOD09GA product’s built-in atmospheric correction algorithm, while Landsat data underwent standard atmospheric correction through the FLAASH (Fast Line-of-sight Atmospheric Analysis of Spectral Hypercubes) method. These preprocessing steps ensured that the data fusion process was based on accurately coregistered and atmospherically corrected datasets, allowing for reliable change detection and fusion results.
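As a minimal sketch of the resampling step only (map projection handling, band matching, and subpixel co-registration are omitted, and the zoom factor of 500/30 and the function name are our simplifying assumptions), a MODIS band could be brought to the Landsat pixel size with bilinear interpolation as follows.

```python
import numpy as np
from scipy.ndimage import zoom

def resample_modis_to_landsat(modis_band, factor=500.0 / 30.0):
    """Bilinearly resample one MODIS 500-m band toward the 30-m Landsat grid.

    modis_band : 2-D array of MOD09GA surface reflectance
    factor     : ratio of MODIS to Landsat pixel size (~16.7 here)
    """
    # order=1 gives bilinear interpolation; a real workflow would reproject
    # with full georeferencing (e.g., GDAL) so that pixel footprints, not
    # just array shapes, line up between the two sensors.
    return zoom(modis_band.astype(float), factor, order=1)
```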

The performance of the RDSFM was evaluated in comparison with the unmixing-based data-fusion algorithm (UBDF) developed by Zhukov et al. in 1999 and the FSDAF algorithm introduced by Zhu et al. in 2016. This comparative analysis is pertinent since RDSFM is similarly based on an unmixing methodology. The predicted fine-resolution images derived from these algorithms were assessed against the actual images through qualitative and quantitative means. To represent the different facets of accuracy, several indices were employed. The root mean square error (RMSE) measured the deviation between the predicted and actual reflectance values, and the correlation coefficient (r) was used to quantify the linear relationship between the predicted and true reflectance. The mathematical definitions of RMSE and r are as follows:

$$\:\text{R}\text{M}\text{S}\text{E}=\sqrt{\frac{\sum\:_{i=1}^{n}{\left({P}_{i}-{O}_{i}\right)}^{2}}{n}}$$
(13)

where n is the number of samples, \(\:{P}_{i}\) is the predicted value of pixel i, and \(\:{O}_{i}\) is the observed value in pixel i:

$$\:\text{r}=\frac{\sum\:_{i=1}^{n}\left({x}_{i}-\stackrel{-}{x}\right)\left({y}_{i}-\stackrel{-}{y}\right)}{\sqrt{\sum\:_{i=1}^{n}{\left({x}_{i}-\stackrel{-}{x}\right)}^{2}}\sqrt{\sum\:_{i=1}^{n}{\left({y}_{i}-\stackrel{-}{y}\right)}^{2}}}$$
(14)

where \(\:{x}_{i}\) and \(\:{y}_{i}\) are the individual samples indexed by i, \(\:\stackrel{-}{x}=\frac{1}{n}\sum\:_{i=1}^{n}{x}_{i}\) is the sample mean, and \(\:\stackrel{-}{y}\) is defined analogously.

The average difference (AD) between the predicted and true images was employed to quantify the overall bias in the predictions. A positive AD suggests that the fused image tends to overestimate the actual values, whereas a negative AD indicates an underestimation.
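For clarity, the three scalar indices can be computed directly from paired predicted and observed reflectance arrays; the short sketch below (function name ours) follows Eqs. (13) and (14) together with the AD definition above.

```python
import numpy as np

def accuracy_metrics(pred, obs):
    """RMSE (Eq. 13), correlation coefficient r (Eq. 14) and average difference AD."""
    p = np.asarray(pred, dtype=float).ravel()
    o = np.asarray(obs, dtype=float).ravel()
    rmse = np.sqrt(np.mean((p - o) ** 2))
    r = np.corrcoef(p, o)[0, 1]
    ad = np.mean(p - o)   # positive AD: overestimation; negative AD: underestimation
    return rmse, r, ad
```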

In addition to the previously mentioned quantitative evaluations, a visual assessment index, the Structural Similarity Index (SSIM)20,23, was also employed to measure the similarity in overall structure between the true and predicted images as follows:

$$\:\text{S}\text{S}\text{I}\text{M}=\frac{(2{\mu\:}_{X}{\mu\:}_{Y}+{C}_{1})(2{\sigma\:}_{XY}+{C}_{2})}{({\mu\:}_{X}^{2}+{\mu\:}_{Y}^{2}+{C}_{1})({\sigma\:}_{X}+{\sigma\:}_{Y}+{C}_{2})}$$
(15)

where \(\:{\mu\:}_{X}\) and \(\:{\mu\:}_{Y}\) are the means and \(\:{\sigma\:}_{X}\) and \(\:{\sigma\:}_{Y}\) the variances of the true and predicted images, respectively; \(\:{\sigma\:}_{XY}\) is the covariance of the two images; and \(\:{C}_{1}\) and \(\:{C}_{2}\) are two small constants that avoid unstable results when the denominator of Eq. 15 is very close to zero. An SSIM value closer to 1 indicates greater similarity between the two images.
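A global-image form of Eq. (15) can be written as follows; the stabilizing constants C1 and C2 here are arbitrary illustrative values (windowed SSIM implementations commonly tie them to the data range), and the means, variances, and covariance are taken over the whole band.

```python
import numpy as np

def ssim_global(true_img, pred_img, c1=1e-4, c2=9e-4):
    """Global SSIM of Eq. (15) between the true and predicted images of one band."""
    x = np.asarray(true_img, dtype=float)
    y = np.asarray(pred_img, dtype=float)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()                  # sigma_X, sigma_Y (variances)
    cov = np.mean((x - mx) * (y - my))         # sigma_XY (covariance)
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx**2 + my**2 + c1) * (vx + vy + c2))
```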

Results

Test in heterogeneous landscape

Visually comparing the predictive results generated by the two methods, as depicted in Fig. 4, it is evident that both methods successfully preserve spatial details. Figures 4b and c showcase a Landsat-like image dated January 11, 2002, predicted using the FSDAF and RDSFM methods, respectively. Figures 5 and 6 depict the predicted red band and NIR band, which are crucial for vegetation analysis. The visual comparison indicates that the images predicted by both methods bear a general resemblance to the original Landsat image shown in Fig. 4a. Given that pixels exhibiting significant changes are assigned relatively large change values, the predictions made using the RDSFM method appear to replicate the actual image more closely in terms of detail and color. This observation suggests that while both methods are capable of capturing the general phenological changes in complex and heterogeneous landscapes, the results from the RDSFM method are more consistent with the actual image.

Fig. 4

Zoom-in scenes of a complex and heterogeneous site: Original Landsat image of January 11, 2002 (a) and its predicted images using FSDAF (b) and RDSFM (c).

Fig. 5

Zoom-in scenes of a complex and heterogeneous site: Original Landsat red band of January 11, 2002 (a) and its predicted red band using FSDAF (b) and RDSFM (c).

Fig. 6

Zoom-in scenes of a complex and heterogeneous site: Original Landsat NIR band of January 11, 2002 (a) and its predicted NIR band using FSDAF (b) and RDSFM (c).

In an analysis of the quantitative indices derived from the original Landsat image dated January 11, 2002, it is evident that all three methodologies effectively incorporated temporal change information into the Landsat imagery to predict outcomes for the same date. Among the three methods, the fusion results from RDSFM exhibit a lower root mean square error (RMSE) and a higher correlation coefficient (R) than those obtained through UBDF and FSDAF (Table 1). This indicates the superior accuracy of the predictions made by RDSFM over the alternative methods. Notably, the near-infrared (NIR) band shows the most pronounced disparity in accuracy between RDSFM and the other methods. Given that the Landsat images were captured during the early growing season, the NIR and red bands displayed significant changes in reflectance compared to the other bands. RDSFM’s substantial improvement in predicting the NIR and red bands suggests a more robust capability for capturing the significant temporal changes between the input and prediction dates.

Table 1 Accuracy assessment of three data fusion methods applied to a complex and heterogeneous study area (Fig. 4). The units are reflectance (band1: blue band; band2: green band; band3: red band; band4: NIR band; band5: SWIR1 band; band6: SWIR2 band; R: correlation coefficient; RMSE: root mean square error; AD: average difference from the true reflectance; SSIM: structural similarity).

Test in land cover type change

A zoom-in section was utilized to underscore the distinctions between the predicted and actual images (Fig. 7). The image predicted by RDSFM aligns more closely with the original in terms of spatial details compared to those predicted by FSDAF, as evident in the enhanced views of the red and NIR bands shown in Figs. 8 and 9. Notably, when comparing the zoomed-in sections of the two original Landsat images, a small area of a lake is observed transitioning from non-water to water. Despite FSDAF accurately predicting the pixel values for this small area, RDSFM demonstrates superior capability in preserving such minor changes.

Fig. 7

Zoom-in scenes of land-cover change: original Landsat image of November 26, 2004 (a), and predicted images using FSDAF (b) and RDSFM (c).

Fig. 8

Zoom-in scenes of land-cover change: original Landsat red band of November 26, 2004 (a), and predicted images using FSDAF (b) and RDSFM (c).

Fig. 9

Zoom-in scenes of land-cover change: original Landsat NIR band of November 26, 2004 (a), and predicted images using FSDAF (b) and RDSFM (c).

The quantitative indices derived from the fused outcomes compared with the original Landsat image at the prediction date show that all data-fusion techniques successfully captured the essential temporal change information between the input and prediction images. Among the three methods, RDSFM delivered the most precise forecasts, evidenced by the lowest RMSE and the highest correlation coefficient (R) and structural similarity index (SSIM) values. Regarding the overall predictive bias, the minimal average difference (AD) values indicate that all three methods achieved nearly unbiased results (Table 2).

Table 2 Accuracy assessment of three data-fusion methods applied to the study area with land-cover type change (Fig. 7). The units are reflectance (band1: blue band; band2: green band; band3: red band; band4: NIR band; band5: SWIR1 band; band6: SWIR2 band; R: correlation coefficient; RMSE: root mean square error; AD: average difference from the true reflectance; SSIM: structural similarity).

Discussion

To track the rapid dynamics of land surfaces with temporal changes, spatiotemporal data-fusion techniques have been developed to integrate satellite imagery with varying spatial and temporal resolutions. Traditional methods, however, have struggled to accurately predict pixel values in areas of land-cover change during the interval between the input and prediction dates. Addressing this challenge, this study introduces a novel spatiotemporal data-fusion approach, the Residual-Distribution-based Spatiotemporal Data-Fusion Method (RDSFM), which blends temporally sparse, high-spatial-resolution images with temporally dense, low-spatial-resolution images. RDSFM incorporates principles from IR-MAD, which enables unsupervised detection of spectrally changed pixels; from the unmixing-based method; and from the Homogeneous Index (HI) concept previously implemented in FSDAF, which has proven highly accurate in handling mixed coarse pixels. The findings confirm that RDSFM delivers superior accuracy and more effectively predicts areas with significant spectral changes.

The spectral variation of each pixel is handled more robustly in RDSFM than in other methodologies owing to the use of Multivariate Alteration Detection (MAD) for weighting. This adaptation leverages MAD to detect significant changes across multiple bands and bi-temporal datasets through a statistical approach. The data utilized in the IR-MAD method comprise a high-resolution image at time t1 and a lower-resolution image at time t2, facilitating the detection of temporal transformations. Similarly, using a high-resolution image and a lower-resolution image both acquired at t1 in the IR-MAD procedure allows changes caused by the different sensors to be identified.

Weights are derived by combining the MAD values, as shown in Figs. 10b and 11b, which present the distribution of the MAD variates of the red and NIR bands between the initial input time and the prediction time. The similarity between the change distribution observed in the original Landsat images at t1 and t2 (Figs. 10a and 11a) and the estimated MAD distribution between these times (Figs. 10b and 11b) highlights the effectiveness of this method. The MAD-based weights applied in RDSFM help capture temporal variations in these plant-sensitive bands more accurately, leading to more reliable NDVI calculations. The analysis of the results across all bands indicates that RDSFM performs well not only in the red and NIR bands, which were initially emphasized, but also in the other spectral bands (i.e., blue, green, SWIR1, and SWIR2). Thus, the MAD-based weights efficiently account for the reflectance variation of each individual band, rather than applying uniform weights across bands despite heterogeneous reflectance changes. This MAD-based weighting recognizes that reflectance changes are non-uniform over time: as evidenced in Figs. 10 and 11, band 5 (SWIR1) shows a notable decrease in reflectance, while other bands, particularly the NIR and SWIR2 bands, exhibit increases. By adopting a MAD-based weighting system, this study offers a more realistic distribution of temporal changes, demonstrating a significant improvement over conventional methods.

Fig. 10

Change detection of a land-cover-change site from the original Landsat red band from November 26, 2004 to December 12, 2004 (a), and the estimated MAD variate of the red band (b).

Fig. 11

Change detection of a land-cover-change site from the original Landsat NIR band from November 26, 2004 to December 12, 2004 (a), and the estimated MAD variate of the NIR band (b).

The results of RDSFM combine two residual-distribution components: one weighted by the Homogeneous Index and one weighted by the MAD-based weights. In heterogeneous landscapes, if a pixel shows little variation between t1 and t2, the result is likely to be dominated by the HI term. Conversely, if a pixel shows a large variation between t1 and t2, the result tends to depend on the MAD-based weight. Therefore, for bands that change little over time, such as Band 1 (blue) and Band 2 (green), the blending result depends mainly on HI, and the result is similar to that of FSDAF, which also uses an HI index for prediction in heterogeneous landscapes. In contrast, for bands such as Band 4 (NIR), Band 5 (SWIR1), and Band 6 (SWIR2), the MAD-based weights are more effective in capturing temporal changes, resulting in higher accuracy than methods relying solely on unmixing or HI. The MAD-based weights are most effective in the NIR band, whose reflectance differs greatly between seasons, and show higher accuracy there than the unmixing-based method. This is because MAD can detect temporal differences in each band and can therefore more accurately predict pixels that undergo a spectral change.

Recent studies on spatiotemporal fusion, such as the work of Qin et al.16, have introduced deep learning-based fusion techniques and hybrid statistical models to enhance prediction accuracy. While these methods show promise, they often require large-scale training datasets and significant computational resources, limiting their applicability in certain remote sensing scenarios. In contrast, RDSFM provides a computationally efficient alternative by leveraging statistical change detection and unmixing principles, achieving comparable or superior accuracy in heterogeneous landscapes without the need for extensive training datasets. Furthermore, RDSFM integrates MAD-based weighting to better handle temporal variations, particularly in plant-sensitive bands such as NIR and SWIR2. This enhancement underscores the significance of this study in advancing spatiotemporal fusion methodologies by balancing computational efficiency and predictive accuracy.

Conclusion

This study introduces the Residual-Distribution-based Spatiotemporal Data-Fusion Method (RDSFM), a novel approach designed to enhance the accuracy and applicability of spatiotemporal data fusion in satellite imagery. Through the strategic use of the IR-MAD algorithm to estimate subpixel distribution weights, RDSFM effectively addresses the challenges posed by spatial and temporal variability in heterogeneous landscapes.

Our findings demonstrate several key advantages of RDSFM over existing methods. Firstly, RDSFM excels in predicting bands that vary seasonally, particularly the red and NIR bands, which are crucial for vegetation-related analyses. Secondly, it demonstrates robust performance in accommodating landscapes with dynamic land cover changes, surpassing traditional methods in accuracy and detail preservation. Thirdly, RDSFM requires minimal input data, making it practical for applications where comprehensive satellite imagery may be limited.

Although the RDSFM is capable of predicting both heterogeneous landscapes and spectral changes between input and prediction dates, it struggles to detect minute spectral variations due to land-cover changes. This limitation becomes apparent when only a few fine pixels undergo land-cover changes that are not visible in interpolated coarse-resolution images. Moreover, since RDSFM is designed based on the IR-MAD framework, it does not directly support other products like NDVI and LST. The RDSFM requires at least three bands for effective blending15,21. While machine learning approaches do not suffer from this limitation, RDSFM’s ability to focus on changes in individual reflectance bands through the MAD weighting gives it an edge in vegetation-focused analyses where NDVI and similar indices are crucial. RDSFM improves accuracy by adjusting the residual distribution, particularly in scenarios involving spatial and temporal pixel variations. RDSFM enhances prediction accuracy across different seasonal bands, and adapts well to scenarios with mixed pixels and variations in land cover types.

In conclusion, RDSFM represents a significant advancement in spatio-temporal data fusion methodologies, offering practical solutions for monitoring dynamic landscapes and supporting a wide range of environmental and agricultural applications.