Background & Summary

Over the past few decades, remote sensing has experienced significant growth, with an escalating demand for updated high-resolution (HR) imagery. This demand arises from various real-world applications, such as crop delineation1 and urban planning2, where precise spatial information is crucial for decision-making. However, due to technical and economic constraints, open-access Earth observation missions often face challenges in delivering imagery at the desired high spatial resolution. Consequently, there is a pressing need for sophisticated super-resolution (SR) models to enhance the ground sampling distance of publicly available low-resolution (LR) imagery.

Creating training datasets for super-resolution in remote sensing is particularly challenging due to the limited availability of HR multispectral imagery. Although platforms such as PlanetScope, SPOT-6/7, and WorldView-3 provide resolutions more than twice that of freely available data like Sentinel-2 (S2), their use is limited by proprietary licenses that restrict accessibility for open research and public applications. Additionally, it is crucial to select an LR image within an appropriate time frame to ensure that it accurately corresponds to the HR image representation. Given these constraints, researchers primarily adopt two methodologies: the synthetic and cross-sensor approaches. The synthetic method generates LR images by applying a degradation model to HR images3. Conversely, the cross-sensor approach utilizes real LR satellite images from a different sensor3.

Each approach has its advantages and limitations. The synthetic approach excels in providing “controlled environments”, where the LR image is an exact downscaled version of the HR image, making it ideal for testing new SR networks. However, relying solely on synthetic datasets can introduce significant distribution shifts during inference when applied to real-world LR imagery4. Prominent examples of synthetic datasets include WHU-RS195, RSSCN76, and AID7. In contrast, cross-sensor datasets face challenges in scaling the number of samples, as they require the collection of collocated LR-HR image pairs within a short time frame to minimize surface and atmospheric discrepancies. Even with simultaneous image acquisition, reflectance values may differ due to variations in sensor calibration and viewing angles. A notable example of a cross-sensor dataset is SEN2VENµS3, which harmonizes Sentinel-2 (LR) and Venµs (HR) imagery before applying SR methods.

While current state-of-the-art SR networks8,9,10,11 excel in enhancing high-frequency details, they often struggle to maintain coherence, particularly in the spectral domain12 (Fig. 1). This misalignment arises from inadequate harmonization in cross-sensor approaches12. In remote sensing, pixel values represent more than mere color intensity; they provide radiance observations crucial for retrieving biophysical variables and ensuring accurate environmental interpretation13,14,15. Therefore, it is imperative to utilize transparent datasets that not only facilitate effective model training but also ensure accurate spectral and spatial consistency.

Fig. 1
figure 1

Comparison of an S2 image with its corresponding super-resolved counterpart highlights the spectral inconsistency of state-of-the-art SR models. The third column displays the pixel-level ratio between the S2 image and the super-resolved results downsampled to the original S2 resolution of 10 m. The models evaluated include Worlstrat8 (first row), AllenAISR9 (second row), Razzak10 (third row), and SR4RS11 (fourth row).

In this paper, we present an innovative hybrid dataset that combines the advantages of both synthetic and cross-sensor approaches, effectively addressing the issue highlighted in Fig. 1. First, a cross-sensor dataset comprising 2,851 image pairs of Sentinel-2 (\(S2\)) at 10 m and aerial images from the National Agriculture Imagery Program (\(NAIP\)) at 2.5 m is crafted to improve our understanding of the complex relationship between \(S2\)-\(NAIP\) image pairs (Fig. 2 cross-sensor dataset). With this dataset, we focus on learning an optimal point spread function (PSF), spectral alignment function, and the noise pattern between NAIP and S2. By accurately modeling these components, we can synthetically generate LR counterparts, removing the need to capture real HR and LR images simultaneously. This significantly enhances dataset scalability. Using the learned degradation model, a second dataset is created, consisting of 17,657 locations (Fig. 2 synthetic dataset), where two NAIP images (oldest and most recent) are selected per location, and their LR counterparts are synthetically generated (\(S{2}_{like}\)). NAIP was chosen over other HR products due to its extensive coverage, public availability, and public domain license, making it ideal for creating large-scale training datasets.

Fig. 2
figure 2

The locations of cross-sensor (A) and synthetic (B) regions of interest (ROIs) within the SEN2NAIP dataset. Unlike the cross-sensor subset, each ROI within the synthetic subset comprises two NAIP images.

Methods

This section introduces the data products and details all the preprocessing steps for generating the cross-sensor and synthetic datasets. Figure 3A presents the workflow for crafting the cross-sensor dataset, while Fig. 3B depicts the workflow utilized for generating the synthetic dataset.

Fig. 3
figure 3

A high-level summary of our workflow to generate SEN2NAIP datasets. The numbers represent the ROIs retained after each processing block. (A) The workflow for the cross-sensor SEN2NAIP: we assess one spatial quality metric (\(Q{A}_{1}\)) and one spectral quality metric (\(Q{A}_{2}\)) for each LR-HR pair. (B) The workflow for the synthetic SEN2NAIP: purple blocks denote the degradation model trained in the cross-sensor dataset.

Data products

Sentinel-2

The S2 mission comprises two nearly identical satellites, Sentinel-2A and Sentinel-2B, launched in June 2015 and March 2017, respectively. Their Level-2A (L2A) products offer estimates of surface reflectance values across 13 spectral channels, covering the entire globe every five days. The S2 imagery is freely distributed under an open data policy. In the SEN2NAIP dataset, we focus on the S2 10-meter bands, which include the red, green, blue, and near-infrared bands (RGBNIR).

National Agriculture Imagery Program (NAIP)

The National Agriculture Imagery Program (NAIP) captures aerial imagery of the contiguous United States and provides updates every three years (five years before 2009). Image acquisition mainly occurs from June to August, during the peak growing season. While images are available from 2002, only those from 2011 onward are considered for SEN2NAIP, as they always have complete RGBNIR bands and a stable ground sampling distance of 1.0 meter or finer. In SEN2NAIP, we aim for 4 \(\times \) super-resolution, so we download NAIP images from Google Earth Engine (GEE), setting the scale to 2.5 meters. GEE stores images using a pyramiding policy, which aggregates 2 \(\times \) 2 blocks of pixels. Therefore, instead of downloading images at their native resolution (1 m), we retrieve data from a higher level of the pyramid. This approach reduces the volume of data downloaded while preserving the necessary details for our 4 \(\times \) super-resolution task.
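As a hedged illustration of this retrieval step, the snippet below requests a NAIP mosaic at a 2.5 m scale through the Google Earth Engine Python API; the asset ID USDA/NAIP/DOQQ and band names are the standard public ones, while the region, date range, and output handling are placeholders (the dataset itself was retrieved with the R client, as noted later).

```python
# Hedged sketch: download a NAIP RGBNIR patch at 2.5 m from GEE.
import ee

ee.Initialize()
roi = ee.Geometry.Rectangle([-120.01, 38.00, -119.99, 38.02])   # hypothetical ROI
naip = (ee.ImageCollection("USDA/NAIP/DOQQ")
        .filterBounds(roi)
        .filterDate("2018-01-01", "2019-01-01")
        .mosaic()
        .select(["R", "G", "B", "N"]))                          # RGBNIR bands
# Setting scale=2.5 retrieves data from a higher pyramid level instead of the native 1 m.
url = naip.getDownloadURL({"region": roi, "scale": 2.5, "format": "GEO_TIFF"})
print(url)
```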

Cross-sensor dataset

The cross-sensor dataset begins by generating 18,482 random regions of interest (ROIs) across the contiguous United States using a hard-core point model16. This model ensures that each point is at least 5 km apart from any other point, effectively preventing spatial overlap between images. Subsequently, we remove \(S2\)-\(NAIP\) pairs captured more than one day apart to maintain similar landscape conditions. In addition, we use the CloudSEN12 UNetMobV217 cloud detection algorithm to automatically exclude pairs that contain clouds in the \(S2\) image. The time and cloud filters reduced the original number of potential images by 61%. \(S2\) and \(NAIP\) images that pass these initial screenings are retrieved using the R client for Google Earth Engine18.
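A minimal sketch of the hard-core sampling is given below: candidate ROI centres are drawn uniformly over a metric bounding box and accepted only if they lie at least 5 km from every previously accepted centre. The rejection-sampling strategy, bounding box, and point count are illustrative assumptions, not the exact implementation of the cited point-process model.

```python
# Minimal hard-core point sampling sketch: accept a candidate only if it lies at least
# `min_dist` metres away from all previously accepted points.
import numpy as np

def hardcore_sample(n_points, bbox, min_dist=5_000.0, seed=42, max_tries=1_000_000):
    """bbox = (xmin, ymin, xmax, ymax) in a metric projection (e.g., EPSG:5070)."""
    rng = np.random.default_rng(seed)
    xmin, ymin, xmax, ymax = bbox
    accepted = np.empty((0, 2))
    tries = 0
    while len(accepted) < n_points and tries < max_tries:
        tries += 1
        cand = rng.uniform([xmin, ymin], [xmax, ymax])
        if accepted.size == 0 or np.linalg.norm(accepted - cand, axis=1).min() >= min_dist:
            accepted = np.vstack([accepted, cand])
    return accepted

rois = hardcore_sample(1_000, bbox=(-2.3e6, 2.6e5, 2.3e6, 3.2e6))  # hypothetical CONUS extent
```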

To harmonize the reflectance values between \(NAIP\) and \(S2\) images, we apply histogram matching to each \(NAIP\) band using \(S2\) as a reference. This results in \(NAI{P}_{h}\). It is well-known that histogram matching can be effective for adjusting the overall reflectance intensity distribution of images, but it struggles to account for local variations, failing to preserve finer details like local contrast and texture. However, in our dataset, we work with small patches, 484 \(\times \) 484 pixels for HR and 121 \(\times \) 121 pixels for LR images (see Record section). This patch-based approach helps mitigate the issue of local variance, as smaller regions are more homogeneous, allowing histogram matching to perform more accurately and retain local details.
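The per-band matching can be sketched as follows; the function below is a minimal illustration using scikit-image and assumes co-registered reflectance arrays with the patch sizes described above.

```python
# Per-band histogram matching of a NAIP patch to its paired S2 patch (NAIP_h).
import numpy as np
from skimage.exposure import match_histograms

def harmonize_histogram(naip: np.ndarray, s2: np.ndarray) -> np.ndarray:
    """naip: (4, 484, 484) HR reflectance; s2: (4, 121, 121) LR reference."""
    naip_h = np.empty_like(naip, dtype=np.float64)
    for band in range(naip.shape[0]):
        naip_h[band] = match_histograms(naip[band], s2[band])
    return naip_h
```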

After reflectance correction, we employ the opensr-test framework12 to assess the spatial (\(Q{A}_{1}\)) and spectral (\(Q{A}_{2}\)) alignment of the \(S2\)-\(NAI{P}_{h}\) image pairs. The \(Q{A}_{1}\) spatial quality metric is estimated by calculating the mean absolute error (MAE) between the ground control points identified by the LightGlue19 and DISK20 algorithms. The \(Q{A}_{2}\) spectral quality metric is estimated by the average spectral angle distance of each band. We discard image pairs with \(Q{A}_{1}\,\mathrm{ > }\,1\) pixel or \(Q{A}_{2}\,\mathrm{ > }\,2\) deg to ensure minimal distortion due to external factors affecting super-resolution. Finally, to guarantee the dataset’s highest quality, we meticulously examine the remaining \(S2\)-\(NAIP\) pairs via a visual inspection review process. Any \(S2\)-\(NAIP\) pairs with saturated or defective pixels, or with noticeable inconsistencies observed during the visual inspection, were excluded.
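The spectral screening can be sketched as below: the mean spectral angle between paired pixel vectors is computed on the common 10 m grid and thresholded at 2 degrees. The keypoint-based \(Q{A}_{1}\) check (LightGlue/DISK) is not reproduced here, and the exact opensr-test implementation may differ.

```python
# Hedged sketch of the QA2 spectral-consistency filter (spectral angle in degrees).
import numpy as np

def mean_spectral_angle(s2: np.ndarray, naip_h_lr: np.ndarray, eps: float = 1e-8) -> float:
    """Inputs: (4, H, W) reflectance arrays on the same 10 m grid."""
    a = s2.reshape(s2.shape[0], -1)
    b = naip_h_lr.reshape(naip_h_lr.shape[0], -1)
    cos = (a * b).sum(0) / (np.linalg.norm(a, axis=0) * np.linalg.norm(b, axis=0) + eps)
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))).mean())

# keep_pair = (qa1_pixels <= 1.0) and (mean_spectral_angle(s2, naip_h_lr) <= 2.0)
```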

A realistic degradation model

SR addresses an inverse problem, requiring the reconstruction of the original HR image from its degraded LR counterpart. This task is naturally ill-posed, as several potential HR images can correspond to a single LR image. The relationship between the LR (i.e., \(S2\)) and HR (i.e., \(NAIP\)) images is conventionally defined through the function:

$$S{2}_{like}=\varPhi (NAIP;\theta ){\downarrow }_{\mathrm{bi}}+n,$$
(1)

where \(\varPhi \) is the blurring function, \({\downarrow }_{\mathrm{bi}}\) is the bilinear downsampling operator, and \(n\) is the additive noise pattern. In a cross-sensor setting, the reflectance estimation can vary significantly between sensors. Therefore, the degradation model must consider the spectral alignment between the two sensors to obtain realistic LR counterparts. We propose a slight variation of Eq. (1):

$$NAI{P}_{\hat{h}}=\varUpsilon (NAIP)$$
(2)
$$S{2}_{like}=\varPhi (NAI{P}_{\hat{h}};\theta ){\downarrow }_{\mathrm{bi}}+n$$
(3)

where \(\varUpsilon \) is a harmonization model that transforms \(NAIP\) into \(NAI{P}_{\hat{h}}\), which closely matches the reflectance values of \(S2\). Utilizing the cross-sensor dataset, we obtain the \(\varUpsilon \), \(\varPhi \), and \(n\) models that emulate \(S2\) imagery (i.e., \(S{2}_{like}\)) from \(NAIP\). We randomly split the cross-sensor dataset into train and test sets, with 80% of the data allocated for training and the remaining 20% for testing. To learn the noise pattern \(n\), we include cloud-free \(S2\) images from the CloudSEN1217 dataset. As noise is an inherent attribute of \(S2\) images, the \(NAIP\) counterpart is unnecessary.
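Composing Eqs. (2) and (3) gives a short pipeline, sketched below with PyTorch: harmonize, blur with a per-band Gaussian PSF, bilinearly downsample by 4\(\times \), and add noise. The callables harmonize, sigma_per_band, and sample_noise stand in for the models described in the following subsections; the kernel size and the 4\(\times \) factor are illustrative fixed choices.

```python
# Hedged composition of Eqs. (2)-(3): S2_like = Phi(Upsilon(NAIP)) downsampled + noise.
import torch
import torch.nn.functional as F
from torchvision.transforms.functional import gaussian_blur

def degrade(naip: torch.Tensor, harmonize, sigma_per_band, sample_noise) -> torch.Tensor:
    """naip: (4, H, W) reflectance at 2.5 m; returns an S2-like tensor at 10 m."""
    naip_h = harmonize(naip)                                        # Eq. (2)
    blurred = torch.stack([
        gaussian_blur(naip_h[b:b + 1], kernel_size=13, sigma=float(s))[0]
        for b, s in enumerate(sigma_per_band)                       # per-band Gaussian PSF
    ])
    lr = F.interpolate(blurred[None], scale_factor=0.25,
                       mode="bilinear", antialias=False)[0]         # bilinear 4x downsampling
    return lr + sample_noise(lr)                                    # Eq. (3)
```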

Harmonization model

One of the key challenges in scaling synthetic HR-LR datasets for super-resolution is the need for an LR reference at the same acquisition time. To overcome this, we propose three methods (i.e., statistical, deterministic, and variational) for correcting \(NAIP\) reflectance values without requiring an \(S2\) reference. Converting \(NAIP\) into \(S2\) can be seen as an “image color transfer” task. While a perfect match is not required, the harmonization must be accurate enough to “trick” the SR models into interpreting NAIP-degraded images as real \(S2\) data. This preprocessing step helps create HR-LR pairs without reflectance shifts, avoiding the problems shown in Fig. 1.

In the statistical approach, we determine the best gamma correction for each RGBNIR band within each pair of \(NAIP\) and \(S2\) images, yielding a four-dimensional gamma correction vector. Next, we use a multivariate normal distribution to model the distribution of gamma values (see Fig. 4). During inference, \(NAI{P}_{\hat{h}}\) is obtained by applying the 50th-percentile gamma correction values to the original \(NAIP\) image bands.
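A hedged sketch of this procedure is shown below: the best-fitting exponent per band is found by a coarse grid search against the paired \(S2\) band, the resulting gamma vectors are collected, and the 50th-percentile vector is applied at inference. The grid, the direct-exponent parameterization, and the use of sample percentiles in place of the fitted multivariate normal are illustrative assumptions.

```python
# Hedged sketch of the statistical (gamma) harmonization.
import numpy as np

def best_gamma(naip_lr_band: np.ndarray, s2_band: np.ndarray,
               grid=np.linspace(0.2, 5.0, 121)) -> float:
    """Both bands are co-registered 10 m reflectance arrays with values in [0, 1]."""
    errors = [np.abs(naip_lr_band ** g - s2_band).mean() for g in grid]
    return float(grid[int(np.argmin(errors))])

def fit_gamma_vectors(pairs):
    """pairs: iterable of (naip_lr, s2) arrays, both (4, H, W)."""
    return np.array([[best_gamma(n[b], s[b]) for b in range(4)] for n, s in pairs])

def apply_statistical(naip: np.ndarray, gammas: np.ndarray) -> np.ndarray:
    """Apply the 50th-percentile gamma vector to a (4, H, W) NAIP image."""
    g50 = np.percentile(gammas, 50, axis=0)
    return np.stack([naip[b] ** g50[b] for b in range(4)])
```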

Fig. 4
figure 4

Probability density function of the optimal gamma factor for each spectral band. The mean (Mean) and standard deviation (Std) are indicated for each band, along with the Kolmogorov-Smirnov (KS) statistic and p-value.

Figure 4 displays the density distributions of the optimal gamma factor for each spectral band. The Red, Green, and Blue panels show similar mean values, around 0.33 to 0.35, with standard deviations ranging from 0.07 to 0.08. The NIR band, however, has a notably higher mean value of 0.46 and a larger standard deviation of 0.12, indicating a higher degree of variability. The Kolmogorov-Smirnov (KS) statistic values are small, suggesting a good fit with the expected distribution.

For the deterministic model (Fig. 5A), we train a U-Net21 architecture with EfficientNet-B022 as its backbone. The input data consists of \(NAIP\) imagery degraded to a 10-meter resolution (\(NAI{P}_{10m}\)) using simple bilinear interpolation with an anti-aliasing filter (i.e., a triangular blurring function). The target data comprises \(S2\) imagery. During inference, \(NAI{P}_{\hat{h}}\) is obtained through a three-step process: (1) degrade NAIP to 10 m; (2) use the U-Net to estimate a harmonized version with reflectance closer to S2; and (3) apply histogram matching to correct the original \(NAIP\) reflectance values based on the U-Net output. Deep learning models for image-to-image tasks often introduce blurriness or lose fine details such as textures23, so we use the U-Net predictions only to fine-tune the original \(NAIP\) reflectance values (Fig. 5A).
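A hedged sketch of this inference path is given below, using segmentation_models_pytorch for a U-Net with an EfficientNet-B0 encoder and scikit-image for the histogram matching step; the checkpoint path is hypothetical.

```python
# Hedged sketch of deterministic harmonization inference (three steps from the text).
import numpy as np
import torch
import torch.nn.functional as F
import segmentation_models_pytorch as smp
from skimage.exposure import match_histograms

unet = smp.Unet(encoder_name="efficientnet-b0", encoder_weights=None,
                in_channels=4, classes=4)
# unet.load_state_dict(torch.load("harmonization_unet.pt"))  # hypothetical checkpoint
unet.eval()

def deterministic_harmonize(naip: torch.Tensor) -> np.ndarray:
    """naip: (4, H, W) reflectance at 2.5 m; returns harmonized NAIP at 2.5 m.
    Note: U-Net encoders typically need H and W divisible by 32; pad if required."""
    naip10 = F.interpolate(naip[None], scale_factor=0.25,
                           mode="bilinear", antialias=True)         # (1) degrade to 10 m
    with torch.no_grad():
        s2_pred = unet(naip10)[0].clamp(0, 1).numpy()               # (2) U-Net estimate
    naip_np = naip.numpy()
    return np.stack([match_histograms(naip_np[b], s2_pred[b])       # (3) histogram match
                     for b in range(4)])
```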

Fig. 5
figure 5

Flowchart diagrams for the deterministic (A) and variational (B) harmonization approaches.

For the variational model (Fig. 5B), we disaggregate each band of \(S2\) and \(NAI{P}_{10m}\) into a 1D tensor containing the number of pixel values falling inside each histogram bin. Each histogram is structured into 120 bins from 0 to 1, transforming an image pair into two tensors with dimensions of (4, 120). We then use this transformed version of the dataset to train a variational autoencoder24 (VAE) that learns to transform the histogram of \(NAIP\) into the histogram of \(S2\). For inference, \(NAI{P}_{\hat{h}}\) is obtained in a four-step process. First, we obtain \(NAI{P}_{10m}\). Second, we compute the histograms for each band. Third, we utilize the trained VAE to obtain the \(S{2}_{like}\) histogram. Fourth, we adjust the original \(NAIP\) reflectance values using the \(S{2}_{like}\) histogram.
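The data transformation feeding the VAE can be sketched as below; the VAE architecture itself is omitted, and only the (4, 120) histogram representation described above is shown.

```python
# Minimal sketch: convert a (4, H, W) reflectance image into (4, 120) per-band histograms.
import torch

def image_to_histograms(img: torch.Tensor, n_bins: int = 120) -> torch.Tensor:
    """img: (4, H, W) reflectance tensor with values in [0, 1]."""
    return torch.stack([
        torch.histc(img[b], bins=n_bins, min=0.0, max=1.0)
        for b in range(img.shape[0])
    ])
```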

Blurring model

For the blurring model \(\varPhi \), we fine-tune the width of the Gaussian degradation kernel by comparing \(S2\) imagery and \(NAI{P}_{h}\) using the MAE metric. We apply a specific Gaussian kernel to blur the \(NAI{P}_{h}\) image, then downsample it using bilinear interpolation without anti-aliasing filtering. In Fig. 6, we present the error curves (MAE) for the RGBNIR bands. The best sigma values are 3.0, 2.9, 2.9, and 3.4 for the RGBNIR bands, respectively.
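The calibration can be sketched per band as a grid search over sigma, as below; the search range and kernel size are illustrative assumptions.

```python
# Hedged sketch of the Gaussian-blur calibration against paired S2 data (one band).
import numpy as np
import torch
import torch.nn.functional as F
from torchvision.transforms.functional import gaussian_blur

def calibrate_sigma(naip_h_band: torch.Tensor, s2_band: torch.Tensor,
                    sigmas=np.arange(1.0, 6.05, 0.1)) -> float:
    """naip_h_band: (H, W) at 2.5 m; s2_band: (H // 4, W // 4) at 10 m."""
    errors = []
    for sigma in sigmas:
        blurred = gaussian_blur(naip_h_band[None], kernel_size=21, sigma=float(sigma))
        lr = F.interpolate(blurred[None], scale_factor=0.25, mode="bilinear",
                           antialias=False)[0, 0]
        errors.append(torch.mean(torch.abs(lr - s2_band)).item())
    return float(sigmas[int(np.argmin(errors))])
```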

Fig. 6
figure 6

Error curves (MAE) for the RGBNIR bands: associated error (y-axis) for each Gaussian kernel’s sigma value (x-axis). The optimal sigma value is highlighted by a dashed red vertical line.

Noise model

To accurately capture the noise complexity of \(S2\) images, we propose using the PD-Denoising model25 to predict the noise distribution (Supplementary Figure S1A). We applied the PD-Denoising model to all the RGB cloud-free Sentinel-2 images in the CloudSEN12 dataset17, which allowed us to create a reflectance-to-noise matrix (Supplementary Figure S1B). This matrix correlates the reflectance values to empirical noise distributions, providing insight into varying noise characteristics. During the inference phase, the noise pattern, denoted as \(n\), is generated by sampling from the reflectance-to-noise matrix for each reflectance value in the LR image. However, due to the computational cost of this method, we also propose a simpler model. This alternative utilizes a Gaussian noise pattern with a zero mean and a standard deviation of 0.012, which represents the average standard deviation across reflectance ranges (bins of 0.05) derived from the reflectance-to-noise matrix shown in Supplementary Figure S1B. The final noise is then scaled proportionally to the square root of the mean squared reflectance value of each pixel, ensuring that the noise level adapts to the intensity of the signal.
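A minimal sketch of the simpler noise model is given below; it assumes the per-pixel signal level is the root mean square of the reflectance across bands, which is our reading of the scaling described above.

```python
# Hedged sketch of the simple Gaussian noise model (base std 0.012, signal-scaled).
import torch

def simple_noise(lr: torch.Tensor, base_std: float = 0.012) -> torch.Tensor:
    """lr: (4, H, W) S2-like reflectance tensor; returns additive noise of the same shape."""
    signal = torch.sqrt((lr ** 2).mean(dim=0, keepdim=True))   # per-pixel RMS reflectance
    return torch.randn_like(lr) * base_std * signal
```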

Synthetic dataset

The synthetic dataset does not require a pre-existing LR pair; instead, it is generated using the model trained in the previous section. As in the cross-sensor dataset (Fig. 3B), we use hard-core point modeling to produce 101,123 randomly generated ROIs across the contiguous United States. By removing the need for NAIP images to be paired with simultaneous images from S2, we are able to generate a much larger dataset. Each ROI contains an early (oldest) and late (most recent) NAIP image to track land cover changes.

While performing visual inspections on the cross-sensor dataset, we observed that many NAIP images had blank values. This issue occurred because the sampling was too close to the \(NAIP\) scene borders. To tackle this problem, we designed a straightforward yet effective blank identification system based on a \(19\times 19\) kernel to calculate image variance. Any early or late NAIP image with a zero variance was identified as containing blank values, and subsequently, the ROI was removed. This filtering resulted in a 22.7% reduction in our dataset.
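One way to implement this check is sketched below: local variance is computed in a 19 \(\times \) 19 moving window, and the image is flagged when any window has (near-)zero variance. The use of scipy's uniform filter and the tolerance value are illustrative choices.

```python
# Hedged sketch of the blank-region filter based on 19x19 local variance.
import numpy as np
from scipy.ndimage import uniform_filter

def has_blank_region(band: np.ndarray, size: int = 19, tol: float = 1e-12) -> bool:
    """band: (H, W) array of one NAIP band; True if any window has ~zero variance."""
    x = band.astype(np.float64)
    local_var = np.maximum(uniform_filter(x ** 2, size=size) - uniform_filter(x, size=size) ** 2, 0.0)
    return bool((local_var <= tol).any())
```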

The NAIP program focuses on capturing images of agricultural areas during the growing seasons, potentially leading to a biased sampling distribution. To enhance the diversity of the dataset, we utilize a semantic clustering filter that integrates Contrastive Language-Image Pre-training26 (CLIP) and MiniBatchKMeans27. Recent studies have demonstrated the efficacy of CLIP in the semantic understanding of super-resolved remote sensing imagery9. Our approach uses CLIP with MiniBatchKMeans, a KMeans variant optimized for large datasets, to cluster images based on semantic similarities. We establish 18,000 clusters and randomly select one image per cluster, resulting in a 77.5% reduction in the dataset (Fig. 7). While the reduction percentage may seem excessive, the dataset remains spatially well distributed (Fig. 2) and extensive, with 17,567 ROIs and 35,134 images (approximately 170 billion pixels).
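The filter can be sketched as below, using the open_clip package for image embeddings (an assumption on tooling; the CLIP variant and preprocessing are not specified in the text) and scikit-learn's MiniBatchKMeans with the 18,000-cluster setting.

```python
# Hedged sketch of the CLIP + MiniBatchKMeans semantic clustering filter.
import numpy as np
import torch
import open_clip
from sklearn.cluster import MiniBatchKMeans

model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
model.eval()

def embed(images):
    """images: list of PIL.Image RGB thumbnails; returns (N, D) CLIP embeddings."""
    with torch.no_grad():
        batch = torch.stack([preprocess(im) for im in images])
        return model.encode_image(batch).cpu().numpy()

def select_one_per_cluster(embeddings: np.ndarray, n_clusters: int = 18_000, seed: int = 0):
    """Cluster the embeddings and randomly keep one ROI index per non-empty cluster."""
    labels = MiniBatchKMeans(n_clusters=n_clusters, random_state=seed).fit_predict(embeddings)
    rng = np.random.default_rng(seed)
    keep = []
    for c in range(n_clusters):
        members = np.flatnonzero(labels == c)
        if members.size:
            keep.append(int(rng.choice(members)))
    return keep
```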

Fig. 7
figure 7

Examples of the clusters formed. The highlighted image in red indicates the one chosen for the SEN2NAIP dataset.

Finally, for each early and late NAIP image, we generate three \(NAI{P}_{\hat{h}}\)-\(S{2}_{like}\) image pairs, one per harmonization method, using the degradation model presented in Eqs. (2) and (3).

Data Records

The dataset is available online via Science Data Bank28. It is divided into two main sets: cross-sensor and synthetic (Fig. 8). These sets contain a total of 20,508 regions of interest (ROIs), with the cross-sensor set having 2,851 ROIs and the synthetic set having 17,657 ROIs.

Fig. 8
figure 8

The SEN2NAIP dataset is organized into a hierarchical folder structure. The dataset is divided into cross-sensor and synthetic images at the top level (gray folders). The subsequent level (denoted by yellow folders) arranges the images based on their geographic location (ROI). Within the synthetic division, an additional split encompasses early and late images, each characterized by the time acquisition. Finally, each folder contains an LR-HR pair with a metadata.json file that details the specifics of the data.

Within the cross-sensor directory, every ROI subfolder contains two images: LR.tif and HR.tif. These images are stored in GeoTIFF format and contain the RGBNIR spectral bands. The LR image corresponds to the S2 imagery at 10 meters, with an image shape of \(4\times 121\times 121\) pixels. The HR.tif is a downsampled version of NAIP imagery with a resolution of 2.5 meters, setting a 4 \(\times \) scaling factor w.r.t. Sentinel-2. The synthetic directory’s structure is slightly different: each ROI folder contains two subfolders, early and late, corresponding to different acquisition times. The NAIP images maintain the RGBNIR spectral bands; however, unlike the cross-sensor dataset, their dimensions are set at \(4\times 1100\times 1100\). To keep the dataset size manageable, only histograms with 120 bins per spectral band were stored for each harmonization model (statistical, deterministic, and variational). Finally, Table 1 provides a detailed overview of the metadata for the acquired images, which is stored in JSON format as metadata.json.
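Reading one cross-sensor ROI can be done as in the sketch below with rasterio; the folder name is a placeholder and the array shapes follow the description above.

```python
# Minimal sketch: load an LR-HR pair and its metadata from one cross-sensor ROI.
import json
import rasterio

roi = "cross-sensor/ROI_0001"            # hypothetical ROI folder name
with rasterio.open(f"{roi}/LR.tif") as src:
    lr = src.read()                      # (4, 121, 121) S2 RGBNIR at 10 m
with rasterio.open(f"{roi}/HR.tif") as src:
    hr = src.read()                      # (4, 484, 484) NAIP RGBNIR at 2.5 m
with open(f"{roi}/metadata.json") as f:
    meta = json.load(f)
print(lr.shape, hr.shape, sorted(meta)[:5])
```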

Table 1 Metadata associated with each LR-HR image pair.

Technical Validation

Validation of synthesized Sentinel-2-like imagery

We evaluate the precision of degradation models (statistical, deterministic, and variational) in converting \(NAIP\) into \(S2\) images. We consider different percentile values (10, 25, 50, 75, 90) for the statistical approach to assess the model sensitivity. The efficacy of the transformation is quantitatively summarized in Table 2, which presents a comparative analysis between \(S{2}_{like}\) and \(S2\). This analysis is conducted on the test subset derived from the cross-sensor dataset. To evaluate the performance of the degradation models comprehensively, we employ five metrics: Pearson correlation, which measures the linear correlation; Spectral angle distance, which assesses the similarity in the spectral signatures by computing the angle between their vectors; Percentage Bias (PBIAS), which quantifies the average tendency of the degraded images to overestimate or underestimate; and mean absolute and root mean squared errors (MAE and RMSE), which measure the average absolute and squared differences, respectively. We estimated these metrics per image and then averaged the values to report a global value.
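For reference, a hedged per-image implementation of these five metrics is sketched below; the exact formulations used in the paper may differ in detail (e.g., per-band versus per-pixel averaging).

```python
# Hedged sketch of the per-image validation metrics between S2 and S2_like patches.
import numpy as np

def validation_metrics(s2: np.ndarray, s2_like: np.ndarray, eps: float = 1e-8) -> dict:
    """Both inputs: (4, H, W) reflectance arrays on the same 10 m grid."""
    a, b = s2.reshape(4, -1), s2_like.reshape(4, -1)
    cos = (a * b).sum(0) / (np.linalg.norm(a, axis=0) * np.linalg.norm(b, axis=0) + eps)
    return {
        "pearson": float(np.corrcoef(a.ravel(), b.ravel())[0, 1]),
        "sad_deg": float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))).mean()),
        "pbias_pct": float(100.0 * (b - a).sum() / (a.sum() + eps)),
        "mae": float(np.abs(b - a).mean()),
        "rmse": float(np.sqrt(((b - a) ** 2).mean())),
    }
```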

Table 2 Comparative Analysis between \(S2\) and \(S{2}_{like}\) generated by the three proposed methods: statistical, deterministic, and variational.

On average, the raw \(NAIP\) images exhibit reflectance values approximately three times greater than those of \(S2\) images, as evidenced by the PBIAS metric in Table 2. This difference is due to the different preprocessing steps each sensor uses. \(NAIP\) images undergo a Bidirectional Reflectance Distribution Function (BRDF) correction, which adjusts reflectance based on the statistical properties of the image before mosaicking. In contrast, \(S2\) images skip BRDF correction because automatic methods can be unreliable in certain locations and atmospheric conditions29.

Based on the evaluation of the various metrics, the deterministic approach, which combines U-Net adaptation with histogram matching, proved to be the most accurate method. It accurately simulates reflectance intensity (PBIAS) and successfully preserves the spectral signatures of \(S2\) images (SAD). In contrast, the statistical method could only produce stable results at the 50th percentile; it was able to represent reflectance intensity but failed to preserve the inter-band relationships, as shown by the SAD, RMSE, and MAE metrics. Meanwhile, the variational approach yields results nearly as accurate as the deterministic method. However, it has the potential to include stochastic variations, which could be useful for enhancing the generalization capacity of SR networks. Supplementary Figure S1B complements the analysis by illustrating the reflectance distribution for each band across the different proposed degradation methods.

Figures 9 and 10 depict two common scenarios in the SEN2NAIP dataset. The deterministic approach demonstrates good performance in both images. However, other methods may also serve as appropriate degradation models under specific scenarios. For instance, in environments with minimal spatial autocorrelation, the statistical method at the 50th percentile often produces results comparable to the deterministic approach, as shown in Fig. 9. On the other hand, the variational method typically excels in high-contrast scenes, as illustrated in Fig. 10. These findings suggest that combining all degradation methods simultaneously may be beneficial when training SR networks to increase the dataset diversity.

Fig. 9
figure 9

Comparative visualization of three image degradation methods applied to NAIP imagery against Sentinel-2 data, with the final column showing the histograms of pixel values.

Fig. 10
figure 10

Comparative visualization of three image degradation models applied to NAIP imagery against Sentinel-2 data (high-contrast example), with the final column showing the histograms of pixel values.

Real-world SR network training

In this section, we explore the impact of various degradation models on the super-resolution of \(S2\) imagery. To ensure a fair comparison, we maintained the same lightweight SR network and identical training setup, altering only the degradation model used. The SR network is a simplified version of the VGG model with the max-pooling layers removed. We utilize the Adam optimizer, setting \({\beta }_{1}\) to 0.9, to optimize a composite MAE plus adversarial loss function. To differentiate between real and super-resolved images, we utilize the PatchGAN discriminator, which assesses \(16\times 16\) patches and aggregates their scores to determine whether the image is real or fake. We use the SEN2NAIP synthetic dataset for training, where the LR inputs and HR targets are represented by \(S{2}_{like}\) and \(NAI{P}_{h}\), respectively. We established five distinct degradation models (dataloader configurations):

  • Raw: This configuration does not use a harmonization model to create \(NAI{P}_{h}\). The \(S{2}_{like}\) pair is generated by applying an anti-aliasing filter (the PyTorch default) followed by bilinear downsampling. This configuration serves as the baseline without harmonization.

  • Statistical: The \(NAI{P}_{h}\) is derived by applying the statistical model set at the 50th percentile. This configuration offers insight into the impact of adjusting solely the reflectance intensity.

  • Variational: This configuration utilizes the vanilla VAE model to generate \(NAI{P}_{h}\) (Fig. 5B).

  • Deterministic: Here, \(NAI{P}_{h}\) is produced by applying the trained U-Net model (Fig. 5A).

  • All: In this approach, all degradation models are employed together, with their application randomized within the dataloader.

In all dataloaders except Raw, \(S{2}_{like}\) is produced by applying the pre-trained Gaussian blurring model and bilinear downsampling (with no anti-aliasing filter) to \(NAI{P}_{h}\). Additionally, simple Gaussian random noise, as described in the noise modeling section, is incorporated into the process. We repeat the training for each degradation model five times and report the best result. For a thorough evaluation, we use two independent cross-sensor datasets that were not involved in training the degradation models. These datasets consist of \(S2\)-\(SPOT\) and \(S2\)-\(NAIP\) image pairs, as described in the opensr-test paper12.
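The randomized “All” configuration can be sketched as a thin wrapper over the degradation sketch shown earlier: one harmonization model is drawn per sample before the shared blur, downsampling, and noise steps. The callables are placeholders for the trained models described above.

```python
# Hedged sketch of the "All" dataloader configuration (randomized harmonization).
import random
from typing import Callable, Dict

def make_all_sampler(harmonizers: Dict[str, Callable], degrade_to_10m: Callable) -> Callable:
    """harmonizers maps a name (e.g. 'statistical', 'variational', 'deterministic') to a
    callable producing NAIP_h; degrade_to_10m applies Gaussian blur, bilinear 4x
    downsampling without anti-aliasing, and Gaussian noise."""
    def sample_pair(naip_hr):
        name = random.choice(list(harmonizers))        # randomize the degradation per sample
        naip_h = harmonizers[name](naip_hr)
        return degrade_to_10m(naip_h), naip_h          # (LR input, HR target)
    return sample_pair
```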

To assess the impact of the degradation model in super-resolution, we focus on the omission metric within the opensr-test framework. The omission denotes the missing high-frequency pattern that the \(SR\) model failed to incorporate12. The higher the omission score, the more conservative the SR network is.

The findings are summarized in Fig. 11, revealing that although all the SR networks perform similarly well on synthetic test data (see the first panel in Fig. 11), their effectiveness varies considerably when tested on cross-sensor datasets. The SR network trained with the raw dataloader struggles to introduce high-frequency details in real \(S2\) imagery, resulting in outputs that closely correspond to a simple bilinear interpolation (Fig. 12). After visually testing the networks on several \(S2\) images, we found that the three proposed harmonization methods performed well in most cases, yielding similar results. However, the statistical method encountered some issues with certain images. For example, as shown in Fig. 12, the deterministic model introduced high-frequency details with greater intensity in \(S2\) images compared to the statistical approach. The variational method and “all” approaches produced results comparable to the deterministic model.

Fig. 11
figure 11

Super-Resolution performance on SEN2NAIP (synthetic), OpenSR-NAIP (cross-sensor), and OpenSR-SPOT (cross-sensor) datasets. The red boxplot is the baseline showing the score value if no harmonization model is applied.

Fig. 12
figure 12

Comparison of degradation models on SR networks: The first column shows the S2 image, followed by the results of the SR network trained with the raw NAIP images. The subsequent columns display results using the deterministic degradation model and the statistical degradation approach, respectively.

This experiment is a proof of concept, focusing not on the quality of the high-frequency details added but on how the models respond to \(S2\) imagery. These findings emphasize the importance of aligning the degradation model with the characteristics of real-world sensors to ensure reliable SR outputs during inference. Readers should be aware that these outcomes are based on a simple SR network and may change with more advanced models. We plan to test these degradation models with larger SR networks and alternative architectures, such as transformers30 or recurrent linear units31. This study underscores the need for further research on degradation models that accurately mimic the characteristics of one sensor using another. Doing so could unlock the full potential of SR networks for real-world imagery beyond just NAIP data.

Usage Notes

This study introduces SEN2NAIP, a comprehensive dataset designed for realistic S2 super-resolution. It comprises images from NAIP and S2, covering the entire contiguous United States. The dataset’s total volume is 167.21 GB. To facilitate easy access, we offer a Python script that streamlines the process of batch downloading the dataset32. Users should be aware that the majority of NAIP data is collected from June to August and only over the United States, which may influence model performance on imagery from other seasons and regions, especially in areas with snow cover.