Introduction

Precipitation–when, where, and how much water falls from the sky to the Earth's surface–governs freshwater availability, agricultural productivity, flood hazards, and ecosystem health across the globe1. Despite its significance, precipitation remains one of the most challenging climate variables to observe and predict accurately. This challenge stems from precipitation’s fundamental nature: unlike most climate variables that vary smoothly across space and time, precipitation manifests as discrete, intermittent pulses with striking discontinuities2,3. This complex spatiotemporal organization depends crucially on small-scale cloud microphysical processes4 that remain poorly understood and poorly simulated. Moreover, these processes are highly sensitive to environmental conditions: small perturbations in temperature, humidity, or aerosol concentrations can determine whether clouds produce no rain, light drizzle, or torrential downpours5,6. Furthermore, the triggering and organization of convection–the primary mechanism for intense precipitation–depend on complex interactions between boundary-layer turbulence7, atmospheric stability8, and mesoscale circulations9,10 that remain computationally prohibitive to simulate explicitly. These complexities create fundamental observational and predictive challenges.

Currently, we rely on three sources of precipitation information: in situ gauge observations, remote sensing, and numerical simulations that often assimilate in situ and remotely sensed data11. Each of these sources comes with inherent limitations in accuracy, coverage, and resolution. Ground-based rain gauges provide the most direct and accurate measurements at point locations. However, gauge networks exhibit severe spatial limitations: even 2.5° × 2.5° grid cells contain fewer than two gauges on average12, let alone oceanic and remote regions. Satellite remote sensing offers near-global coverage but measures precipitation indirectly. Passive microwave sensors on polar-orbiting satellites detect emission and scattering signatures from hydrometeors, providing relatively direct estimates but with limited temporal sampling13. Infrared sensors on geostationary satellites offer frequent observations (every 10–30 minutes) but only measure cloud-top temperatures, requiring empirical relationships to infer surface precipitation–a particularly poor assumption for the shallow, warm clouds that produce significant precipitation in tropical and maritime regions14. Numerical weather prediction and reanalysis products provide physically consistent, complete spatiotemporal coverage by assimilating available observations into dynamical models15. However, precipitation in these systems emerges as the end result of a complex chain of parameterized processes—radiation, convection, cloud microphysics, and boundary-layer turbulence—each contributing its own errors16 that compound multiplicatively. The consequence of these observational and simulational limitations is profound: current precipitation datasets often disagree by as much as the signal itself11,16.
In tropical regions, the spread in mean precipitation among different products can exceed 300 mm/yr11, fundamentally limiting our ability to close the global water budget, validate climate models, or provide reliable information for water resource management.

A promising solution to these challenges lies in data fusion–leveraging the complementary strengths of multiple data sources to produce precipitation estimates that surpass any individual source in accuracy, resolution, and coverage17,18,19,20,21,22,23,24,25,26,27. Among data-fusion approaches, Bayesian methods offer a coherent and probabilistically grounded solution. The key insight is elegant: by deriving an informative prior from all available sources, we can encode existing knowledge in a statistically coherent form. Once established, this prior can be updated via Bayes’ theorem with any new observation–accounting for each source’s unique error characteristics and observational modalities through tailored likelihood functions28,29,30,31. The framework naturally weights observations by their reliability and propagates uncertainties to yield full posterior distributions32, essential for risk assessment.
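The reliability weighting described above is easiest to see in the simplest conjugate case. The sketch below is a one-dimensional Gaussian illustration of Bayes' theorem, not part of the framework itself; all variable names and numbers are invented for exposition:

```python
def gaussian_update(prior_mean, prior_var, obs, obs_var):
    """Posterior of a Gaussian prior after one Gaussian-error observation."""
    w = prior_var / (prior_var + obs_var)   # weight given to the observation
    post_mean = prior_mean + w * (obs - prior_mean)
    post_var = (1.0 - w) * prior_var        # posterior uncertainty shrinks
    return post_mean, post_var

# A precise gauge (small error variance) pulls the posterior strongly toward
# its reading, while a noisy satellite retrieval barely moves it.
m_gauge, v_gauge = gaussian_update(2.0, 1.0, obs=5.0, obs_var=0.1)
m_sat, v_sat = gaussian_update(2.0, 1.0, obs=5.0, obs_var=10.0)
```

The same prior, updated through two different likelihoods, yields two different posteriors: this is the source-specific treatment that tailored likelihood functions provide.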

Recent advances in deep generative models, particularly probabilistic diffusion models33,34, offer a transformative opportunity for implementing the above Bayesian framework. Diffusion models approximate target distributions by learning to reverse a gradual noising process. The forward process progressively perturbs data with Gaussian noise, while the reverse process, parameterized by a neural network, learns to invert this corruption and recover samples. This iterative denoising procedure enables diffusion models to generate high-quality and diverse samples, spanning domains from natural images35 to protein structures36, while also serving as priors for Bayesian inference, making them particularly well-suited for capturing the intricate patterns of precipitation. Once trained, they function as “plug-and-play” priors37,38,39,40,41: the same learned distribution can be applied to diverse inference tasks–bias correction, downscaling, or gap-filling–by simply changing the likelihood function without retraining. Despite this promise, implementing the framework for precipitation faces three fundamental challenges. First, precipitation’s extreme spatiotemporal variability—from localized convective cells to continental-scale fronts—makes it extraordinarily difficult to capture in a single prior distribution. Second, constructing an informative prior becomes paradoxical when no individual data source is trustworthy or comprehensive. Each source captures different aspects of precipitation across mismatched scales, creating a circular dependency: we need accurate data to build a prior, yet need a prior to evaluate data accuracy. Third, even with a reasonable prior, posterior sampling remains challenging because, from a machine learning perspective, a precipitation field is high-dimensional and the associated observation likelihood is complex.
These barriers define the frontier for deploying generative AI in Earth-system science, demanding innovations that transcend conventional approaches.
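The forward and reverse processes described above can be demonstrated end-to-end in one dimension. In the toy sketch below (illustrative only, with invented constants), the "data" distribution is Gaussian, so the score of every noised marginal is known in closed form and the reverse denoising loop can be run exactly, without training a network:

```python
import numpy as np

rng = np.random.default_rng(0)
T, mu, sig = 500, 3.0, 0.5               # toy 1-D "data" distribution N(mu, sig^2)
betas = np.linspace(1e-4, 0.02, T)       # variance schedule of the forward process
abar = np.cumprod(1.0 - betas)           # cumulative signal retention

def score(x, t):
    # Score of the noised marginal q(x_t) = N(sqrt(abar_t)*mu, abar_t*sig^2 + 1 - abar_t),
    # available analytically only because the data distribution is Gaussian.
    var_t = abar[t] * sig ** 2 + (1.0 - abar[t])
    return -(x - np.sqrt(abar[t]) * mu) / var_t

x = rng.standard_normal(20000)           # start from pure noise x_T ~ N(0, 1)
for t in range(T - 1, -1, -1):           # ancestral (denoising) reverse steps
    z = rng.standard_normal(x.shape) if t > 0 else 0.0
    x = (x + betas[t] * score(x, t)) / np.sqrt(1.0 - betas[t]) + np.sqrt(betas[t]) * z
# x now approximates samples from the data distribution N(3, 0.25)
```

In realistic settings the analytic score is replaced by a trained network, and conditioning on observations modifies the same sampling loop, which is the sense in which a learned prior can be reused across likelihoods.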

To address these challenges, we introduce PRIMER (Precipitation Record Infinite MERging), a general framework that reconceptualizes how diffusion models can learn from imperfect, heterogeneous precipitation records. Our key insight is that probabilistic diffusion models need not be trained on perfect samples—instead, they can be viewed as spectral regression models that progressively learn from low-frequency structures to high-frequency details as we gradually corrupt the target distribution using Gaussian noise42. This property enables us to construct an informative prior by learning conditional distributions of precipitation patterns for each data source, where the conditioning explicitly captures each dataset’s characteristic biases.
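The coarse-to-fine property invoked here has a simple spectral reading: white Gaussian noise adds equal power at every wavenumber, so the fine scales of a red-spectrum signal such as precipitation sink below the noise floor first. A toy numerical check (idealized spectrum and illustrative noise levels, not taken from the model):

```python
import numpy as np

n = 1024
freqs = np.fft.rfftfreq(n)[1:]          # positive wavenumbers only
signal_power = freqs ** -2.0            # idealized "red" spectrum

fracs = []                              # share of scales still above the noise
for sigma in (1.0, 10.0, 100.0):        # increasing diffusion noise levels
    snr = signal_power / sigma ** 2     # white noise has flat power sigma^2
    fracs.append(float(np.mean(snr > 1.0)))
# fracs shrinks as sigma grows: under heavy noise only the coarsest scales
# remain distinguishable, so denoising recovers them first.
```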

As emphasized by ref. 43, integrating data with varying degrees of sparsity—from sparse grids to dense fields—poses a major machine learning challenge. We acknowledge this issue and propose an approach to better handle such heterogeneity (see SI Section 2.8 for a comparison with ref. 43). Conventionally, diffusion models operate on samples residing on fixed-resolution grids44, forcing us to interpolate heterogeneous observations to common resolutions. This interpolation is particularly destructive for precipitation: it smooths sharp gradients at convective boundaries, introduces artificial correlations between sparse gauge points, and—most critically—destroys the very precision that makes gauges valuable. For sparse gauge networks covering less than 1% of the domain, interpolation essentially fabricates information that does not exist. We therefore require an architecture that can learn priors directly from each source’s native sampling structure. This necessity drives our adoption of coordinate-based diffusion models, which represent precipitation as spatial fields \(x:{{\mathbb{R}}}^{2}\to {\mathbb{R}}\) rather than tensors. In this formulation, both dense grids and sparse gauge observations are simply different sampling patterns of the same underlying field. PRIMER directly learns from arbitrarily and sparsely distributed points—each defined by its location and precipitation intensity—without relying on spatial interpolation (see Fig. 1b): gauge observations influence the function locally, while gridded data constrain the large-scale structure. Our two-stage training strategy is thus a natural choice: we first learn the baseline priors PERA5(x) and PIMERG(x), which represent the climatological distributions of precipitation fields x derived from climate reanalysis, i.e., the fifth-generation ECMWF atmospheric reanalysis (ERA5), and a satellite-based retrieval dataset, i.e., Integrated Multi-satellitE Retrievals for GPM (IMERG).
We then fine-tune the model using gauge observational information at sparse grid locations (hereafter, we refer to these densely observed grid cells as “gauge observations”; see Method 4.6 for data sources and detailed descriptions), so as to incorporate local accuracy, yielding the updated prior \({P}_{\star }(x)\) (Fig. 1b; the star subscript denotes the refined prior). The coordinate-based representation ensures that gauge information enhances rather than corrupts the prior, as each source contributes at its natural scale. Once trained, PRIMER supports diverse applications through principled posterior sampling: given observations \({{\mathcal{O}}}\)—whether from biased satellites, sparse gauges, or coarse forecasts—we can sample from the posterior \({P}_{\star }(x| {{\mathcal{O}}})\) to produce improved ensemble estimates (Fig. 1a). Empirical evaluations demonstrate the effectiveness of our approach: it achieves statistically significant error reductions for grid cells that are densely observed by gauges, supplements high-frequency details through downscaling, and further reduces errors by merging gauge observations with the background in a manner similar to optimal interpolation, underscoring its potential for operational use. It also generalizes to unseen operational forecasts without retraining and extends to downscaling future-scenario precipitation fields in CMIP6. By transforming the challenge of heterogeneous, imperfect data from a limitation into a strength, PRIMER establishes a paradigm for precipitation data fusion that extends naturally to other Earth-system variables plagued by observational trade-offs.
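The coordinate-based view can be made concrete with a small sketch. The helper names below are illustrative, not PRIMER's API; the point is that a dense gridded product and a handful of gauges reduce to the same primitive, a set of (longitude, latitude, intensity) samples of one underlying field, so no interpolation is needed:

```python
import numpy as np

def grid_to_points(field, lons, lats):
    """Flatten a dense gridded product into (N, 3) point samples."""
    lon2d, lat2d = np.meshgrid(lons, lats)
    return np.column_stack([lon2d.ravel(), lat2d.ravel(), field.ravel()])

def gauges_to_points(lon, lat, rain):
    """Sparse gauges are already point samples; just stack them."""
    return np.column_stack([lon, lat, rain])

# A 1-degree tile at 0.1-degree spacing and a single gauge become
# interchangeable batches of samples of the same field.
lats = np.arange(30.0, 31.0, 0.1)
lons = np.arange(110.0, 111.0, 0.1)
field = np.random.default_rng(1).gamma(0.5, 2.0, (10, 10))  # synthetic rain
grid_pts = grid_to_points(field, lons, lats)                # dense samples
gauge_pts = gauges_to_points([110.23], [30.57], [4.2])      # one gauge
```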

Fig. 1: Overview of PRIMER.

a Inference. PRIMER functions as a learned prior over the target precipitation field. Given a condition \({{\mathcal{O}}}\), PRIMER draws samples from the posterior \({P}_{\star }(x\,| \,{{\mathcal{O}}})\). By changing \({{\mathcal{O}}}\), PRIMER samples from \({P}_{\star }(x| {{\mathcal{O}}})\) under a shared prior, thereby unifying three applications in a single Bayesian generative framework. b Prior construction via principled data fusion. Because no single precipitation dataset is uniformly reliable across scales, PRIMER integrates heterogeneous records—reanalysis (e.g., ERA5), satellite retrievals (e.g., IMERG), and sparse gauge observations—to obtain a more accurate prior. In Stage 1, the model is pretrained on gridded products to learn baseline priors PERA5(x) and PIMERG(x). In Stage 2, it is fine-tuned with gauge observations under shared weights (section “Model training”) to produce a refined prior \({P}_{\star }(x)\) that retains large-scale structure from gridded data while incorporating localized constraints from gauges. In the following experiments, we demonstrate that \({P}_{\star }(x)\) yields superior accuracy compared to the baseline priors.

Results

Reproducing climatological distributions

The gist of PRIMER is to learn a trustworthy prior over precipitation fields and then apply it to a broad range of probabilistic inference tasks. Before verifying the probabilistic inference results, we should ensure the accuracy of the learned prior distribution. As directly evaluating such high-dimensional priors is intractable, we instead assess their statistical properties as proxies45,46,47. We compare unconditionally generated samples from PIMERG(x), PERA5(x), and the updated prior \({P}_{\star }(x)\) against their respective reference datasets. In particular, we focus on the climatological mean and standard deviation of precipitation (Fig. 2). At the grid-point level, the agreement is clear. For mean precipitation (Fig. 2a–f), both PIMERG(x) and PERA5(x) exhibit strong spatial correspondence with IMERG and ERA5, achieving Pearson correlation coefficients (PCCs) of 0.85 and 0.97, respectively. The standard deviation fields (Fig. 2g–l) are likewise well reproduced (PCC = 0.75 and 0.86), highlighting PRIMER’s capacity to represent not just the average precipitation spatial structure but also its variance. Notably, we also introduce \({P}_{\star }(x)\), constructed by fine-tuning PRIMER with sparse but reliable gauge observational information. Despite the limited spatial coverage of gauge observations, this calibration yields a climatologically consistent prior that preserves the spatial structures learned from the gridded products while injecting localized realism. This “climatological jailbreak” illustrates how PRIMER can adapt to sparse gauge records without compromising coherence across scales. To further evaluate spatial structure, we perform a radially averaged power spectral density (RAPSD) analysis (Fig. 2m), which confirms that the learned priors accurately recover the multiscale spectral characteristics of the reference datasets, especially across mesoscale wavelengths, which are crucial for convective processes (see also Supplementary Information (SI) Fig. 9).
Additional statistical evaluations—including precipitation frequency, extremes, skewness, and empirical orthogonal function (EOF) modes—are provided in the SI Fig. 10.
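For reference, a minimal RAPSD implementation is sketched below. Binning conventions differ between codebases; this is one common choice and not necessarily the exact implementation used here:

```python
import numpy as np

def rapsd(field):
    """Radially averaged power spectral density of a square 2-D field."""
    n = field.shape[0]
    psd = np.abs(np.fft.fftshift(np.fft.fft2(field))) ** 2
    ky, kx = np.indices(psd.shape) - n // 2      # wavenumber offsets from DC
    r = np.hypot(kx, ky).astype(int)             # integer radial wavenumber
    sums = np.bincount(r.ravel(), weights=psd.ravel())
    counts = np.bincount(r.ravel())
    return sums[: n // 2] / counts[: n // 2]     # mean power per radial bin

# White noise has a flat spectrum, a useful sanity check for the routine.
white = rapsd(np.random.default_rng(0).standard_normal((64, 64)))
```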

Fig. 2: Climatological consistency between learned priors and reference datasets.

a–f Spatial distributions of mean precipitation from IMERG (a), ERA5 (b), gauge observations at sparse grid cells, shown as dots (c), PIMERG(x) (d), PERA5(x) (e), and \({P}_{\star }(x)\) (f). g–l Standard deviation fields analogous to (a–f). m Radially averaged power spectral density (RAPSD) as a function of spatial wavelength (in degrees). The learned priors PIMERG(x) and PERA5(x) closely follow their references, and \({P}_{\star }(x)\) captures consistent multiscale characteristics. All statistics are computed from 1000 randomly sampled realizations. Maps were generated using Cartopy (https://cartopy.readthedocs.io/stable/).

Case study on high-impact events

The previous section evaluated PRIMER’s ability to match climatology. After Stage 2 fine-tuning, the updated prior P(x) is expected to align more closely with gauge observations; however, its actual skill remains to be validated through posterior sampling experiments. To this end, we perform posterior sampling using different priors while conditioning on the same observations \({{\mathcal{O}}}\). By comparing the posterior samples against the held-out gauge observations, we directly assess the impact of the prior on posterior accuracy, thereby quantifying how much fine-tuning improves alignment with real-world observations. We examine three representative high-impact events. These events were selected to span a wide range of precipitation regimes, including prolonged precipitation associated with the Meiyu front, heavy precipitation driven by landfalling typhoons, and localized convective extremes. The primary case, which occurred over Hubei Province, China, during the East Asian summer monsoon on 2 July 2016, is shown in Fig. 3; additional examples are provided in SI Figs. 12, 13.

Fig. 3: Case study of a Meiyu precipitation event on 2 July 2016 at 05 UTC.

a IMERG at the target time, with gauge locations shown as red dots (used as ground truth for evaluation). b Posterior mean (shaded contours) and standard deviation (labeled contour lines) from \({P}_{\star }(x| {{{\mathcal{O}}}}_{{{\rm{IMERG}}}})\) inferred by PRIMER, based on 100 ensemble samples. c Probability density functions (PDFs) of relative changes in mean absolute error (ΔMAE). Here, ΔMAE is defined as the MAE of the original IMERG data minus that of each posterior sample, both evaluated against gauge observations. Positive values of ΔMAE indicate the amount by which PRIMER reduces the original IMERG errors at gauge locations. d–f Analogous to (a–c) but for ERA5. In c, f, different curves represent posterior distributions as labeled, with ensemble means indicated by stars. The label “+GaugeFusion” refers to an experiment in which gauge observational information at sparse grid cells (20% of locations are incorporated during sampling, with errors evaluated on the remaining 80%) is combined with the background (raw IMERG or ERA5) in a manner analogous to optimal interpolation. This additional observational constraint markedly improves accuracy. Maps were generated using Cartopy.

To evaluate the effectiveness of PRIMER, we employ two standard performance metrics: the mean absolute error (MAE) and the continuous ranked probability score (CRPS), with the latter providing a probabilistic measure of an ensemble system’s accuracy (see Method “Evaluation metrics”). For each metric, we define a relative skill score, \(\Delta {{\mathcal{M}}}\), as the non-negative error measure of the original precipitation dataset (ERA5 or IMERG) minus that of the posterior sample, so that positive values indicate reduced error and thus enhanced skill. All evaluations are conducted at a spatial resolution of 0.1°, where ERA5, IMERG, and posterior samples are compared against gauge observations treated as ground truth.
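In ensemble form, the two metrics and the skill score reduce to a few lines. The sketch below uses the standard ensemble CRPS estimator and invented toy numbers, not values from the study:

```python
import numpy as np

def mae(pred, obs):
    return float(np.mean(np.abs(pred - obs)))

def crps_ensemble(ens, obs):
    """CRPS of a 1-D ensemble against a scalar observation."""
    term1 = np.mean(np.abs(ens - obs))                          # accuracy
    term2 = 0.5 * np.mean(np.abs(ens[:, None] - ens[None, :]))  # spread
    return float(term1 - term2)

def delta_skill(err_original, err_posterior):
    return err_original - err_posterior      # positive => error reduced

obs = 5.0                                    # gauge "truth" at one location
raw = 2.0                                    # biased single-valued product
post = np.array([4.5, 5.2, 4.8, 5.6])       # posterior ensemble
d_mae = delta_skill(mae(raw, obs), mae(post.mean(), obs))
```

A perfectly sharp, perfectly accurate ensemble has CRPS zero, and CRPS collapses to MAE for a one-member ensemble, which is why the pair gives complementary deterministic and probabilistic views.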

As shown in Fig. 3c, f, the updated prior \({P}_{\star }(x)\) substantially outperforms the baseline priors derived from ERA5 and IMERG. The ensemble-mean ΔMAE is 0.46 mm/hr for \({P}_{\star }(x| {{{\mathcal{O}}}}_{{{\rm{ERA5}}}})\), compared with only 0.14 mm/hr for \({P}_{{{\rm{ERA5}}}}(x| {{{\mathcal{O}}}}_{{{\rm{ERA5}}}})\); a similar contrast is observed in the IMERG case, where the ΔMAE is 0.29 mm/hr versus 0.14 mm/hr. These gains extend beyond ensemble means: across individual samples, ΔMAE values for \({P}_{{{\rm{ERA5}}}}(x| {{{\mathcal{O}}}}_{{{\rm{ERA5}}}})\) are consistently lower than those for \({P}_{\star }(x| {{{\mathcal{O}}}}_{{{\rm{ERA5}}}})\).

An important feature of PRIMER is its ability to incorporate additional gauge observations into the posterior sampling process, rather than relying solely on background fields. This capability mirrors the setting of operational analysis systems, where integrating sparse gauges can substantially improve analysis quality. To evaluate this property, we design an experiment that mimics real-world conditions by including a subset (20%) of gauge observations during sampling, while errors are evaluated on the remaining 80% (hereafter denoted “+GaugeFusion”). The inclusion of these observational constraints yields a marked improvement in accuracy, with the ensemble-mean ΔMAE increasing to 1.11 and 0.97 mm/hr for the ERA5 and IMERG cases, respectively. Spectral analysis further highlights distinctions among posterior samples (see SI Fig. 11). While \({P}_{{{\rm{ERA5}}}}(x| {{{\mathcal{O}}}}_{{{\rm{ERA5}}}})\) retains low-frequency biases, both \({P}_{\star }(x| {{{\mathcal{O}}}}_{{{\rm{ERA5}}}})\) and its GaugeFusion variant enhance high-frequency components.
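A scalar optimal-interpolation analogue conveys the flavor of this gauge fusion. It is an illustrative stand-in only: PRIMER performs the merge through posterior sampling rather than an explicit gain, and all numbers below are synthetic:

```python
import numpy as np

def oi_merge(background, gauge, var_b, var_o):
    """Variance-weighted blend of background and gauge, as in optimal interpolation."""
    k = var_b / (var_b + var_o)              # Kalman-style gain
    return background + k * (gauge - background)

rng = np.random.default_rng(0)
truth = rng.gamma(0.6, 2.0, 100)                 # synthetic "true" rain at 100 cells
background = truth + rng.normal(0.0, 1.0, 100)   # noisy gridded product
gauges = truth + rng.normal(0.0, 0.1, 100)       # precise gauge readings

assim = rng.permutation(100)[:20]                # 20% of gauges constrain the analysis
analysis = background.copy()
analysis[assim] = oi_merge(background[assim], gauges[assim], var_b=1.0, var_o=0.01)
```

At the assimilated cells the analysis error collapses toward the gauge error; a learned spatial prior can additionally spread such information to nearby unobserved cells, which the scalar sketch cannot.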

Statistical verifications

We applied PRIMER to a curated test set of 150 precipitation events from 2016, selected according to the criteria detailed in SI Section 3. For each event, 50 posterior samples were drawn from \({P}_{\star }(x| {{\mathcal{O}}})\), where \({{\mathcal{O}}}\) corresponds to raw data from either ERA5 or IMERG. In this process, PRIMER downscales ERA5 data to 0.1° resolution and performs bias correction, while directly correcting biases in IMERG. At each gauge location, we computed the MAE and CRPS of the posterior distributions. MAE was calculated from the ensemble mean of each posterior distribution compared against the corresponding gauge observation, while CRPS assessed the full probabilistic accuracy. We then calculated differences in both metrics between the original datasets and \({P}_{\star }(x| {{\mathcal{O}}})\). In simple terms, a positive value at a gauge location means that PRIMER reduces the error of the original dataset there.

Figures 4a, b reveal widespread reductions in MAE, highlighting PRIMER’s ability to systematically correct biases inherent in the original datasets. Figure 4c, d shows pronounced reductions in CRPS, with deeper blue tones indicating substantial gains in probabilistic estimates. These results demonstrate that PRIMER captures the posterior distribution accurately, with the improvements confirmed as statistically significant by t-tests. In addition to PRIMER, we also evaluated the baseline priors (PERA5(x) and PIMERG(x)) as well as two baseline methods, BCSD-EQM (bias correction and spatial disaggregation with empirical quantile mapping)48 and RM (random mixing)49 (for notes on the two methods, see SI Section 4). These baselines were evaluated on the same task. As detailed in SI Figs. 6–8, PRIMER generally outperforms these baselines. Notably, the largest improvements are observed in the Sichuan Basin and the Pearl River Delta—regions with dense populations and strong economic activity. We further analyzed the correlation between gauge density and performance improvement (SI Section 5). Although a positive trend is apparent, the correlation is not statistically significant, indicating that PRIMER delivers spatially consistent improvements irrespective of local gauge density.

Fig. 4: Bias correction of existing precipitation datasets.

This figure shows the reduction in MAE (top row) and CRPS (bottom row) after bias correction (for ERA5, also downscaling) using PRIMER, applied separately to ERA5 (panels a, c) and IMERG (panels b, d). Each dot denotes a grid cell where gauge observational information is available. Blue points (positive values) indicate a reduction in error relative to the original IMERG or ERA5, while red points indicate deterioration. The violin plot on the right displays the distribution of the relative error change across all locations, with the central black line representing the median. The evaluation is based on 150 precipitation events that occurred in 2016. For the results of the three baseline methods, see SI Figs. 6–8. Maps were generated using Cartopy.

Beyond reducing pointwise error, PRIMER also enhances the physical realism of existing precipitation datasets. To comprehensively evaluate the performance of PRIMER, we adopt two complementary perspectives: the member view and the envelope view. The member view analyzes statistics from a single sample, representing one physically plausible realization. In contrast, the envelope view is constructed by selecting, at each gauge location and for a given event, the maximum precipitation value across 50 posterior samples. As illustrated in Fig. 5a, both \({P}_{\star }(x| {{{\mathcal{O}}}}_{{{\rm{ERA5}}}})\) and \({P}_{\star }(x| {{{\mathcal{O}}}}_{{{\rm{IMERG}}}})\) more accurately reproduce the frequency distribution of precipitation, particularly at higher intensities. Both perspectives reveal improvements in the representation of heavy-precipitation tails compared to the existing datasets, underscoring PRIMER’s capacity to detect high-impact precipitation events that are often underrepresented in the original products. Improvements in spatial structure are further quantified using PCCs with respect to gauge observations (Fig. 5b). \({P}_{\star }(x| {{{\mathcal{O}}}}_{{{\rm{ERA5}}}})\) and \({P}_{\star }(x| {{{\mathcal{O}}}}_{{{\rm{IMERG}}}})\) show markedly enhanced structural agreement relative to the existing datasets, suggesting that PRIMER not only reduces local biases but also restores spatial coherence. While various methods have been proposed to assess spatial organization and feature propagation50,51, we employ a simplified yet informative diagnostic based on the two-dimensional spatial lagged correlation coefficient (Method “Evaluation tool”, Fig. 5c). Physically, this correlation characterizes how anomalies at a reference point are spatially linked to those at surrounding locations, thereby revealing key features of precipitation-system organization.
We approximate the 0.6 correlation contour with an ellipse and extract two geometric descriptors: the focal distance (F), indicative of spatial extent, and the orientation (O), which captures the dominant directional alignment. Results show that both \({P}_{\star }(x| {{{\mathcal{O}}}}_{{{\rm{ERA5}}}})\) and \({P}_{\star }(x| {{{\mathcal{O}}}}_{{{\rm{IMERG}}}})\) produce orientations more consistent with the reference orientations derived from gauge observations, indicating improved spatial alignment. In terms of focal distance, \({P}_{\star }(x| {{{\mathcal{O}}}}_{{{\rm{ERA5}}}})\) exhibits a clear reduction, while \({P}_{\star }(x| {{{\mathcal{O}}}}_{{{\rm{IMERG}}}})\) shows no substantial improvement. These results demonstrate PRIMER’s effectiveness in correcting the spatial anisotropy of precipitation systems.
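The diagnostic can be sketched as follows (a simplified reading of the Method, applied to a synthetic east-west rain band; the ellipse fit of the 0.6 contour is omitted for brevity):

```python
import numpy as np

def lag_correlation(field, max_lag):
    """2-D spatial lagged autocorrelation over a square window of lags."""
    f = (field - field.mean()) / field.std()
    out = np.zeros((2 * max_lag + 1, 2 * max_lag + 1))
    for dy in range(-max_lag, max_lag + 1):
        for dx in range(-max_lag, max_lag + 1):
            shifted = np.roll(np.roll(f, dy, axis=0), dx, axis=1)
            out[dy + max_lag, dx + max_lag] = np.mean(f * shifted)
    return out

# An east-west elongated band decorrelates faster north-south than east-west,
# the kind of anisotropy that the fitted ellipse's F and O summarize.
y, x = np.mgrid[0:64, 0:64]
band = np.exp(-(((y - 32) / 4.0) ** 2)) * (1.0 + 0.1 * np.sin(x / 3.0))
lc = lag_correlation(band, max_lag=8)        # lc[8, 8] is the zero-lag value
```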

Fig. 5: Improved physical realism of existing datasets.

a Log-transformed histogram of precipitation intensity (2 mm/hr bins) at gauge locations only, aggregated over the test set. This panel highlights the ability of different datasets to reproduce the tail of the precipitation distribution (with the purple line as the ground truth). b Probability density functions (PDFs) of Pearson correlation coefficients (PCCs) between each dataset and the individual gauge observations. Higher PCC values indicate better structural fidelity to the ground truth. c Spatial lag-correlation maps, with the 0.6 PCC contour visualized for each dataset. Elliptical fits to these contours are used to quantify spatial coherence, including the major axis length (focal distance, F) and orientation angle (O), as summarized below panel (c). Colors in panels a–c are defined in the legend below.

Generalization test

PRIMER is not only effective for existing precipitation datasets but also exhibits a certain degree of generalization. Figure 6 illustrates PRIMER’s ability to correct biases in previously unseen operational precipitation forecasts, using the ECMWF High-Resolution Forecast (HRES) as a representative example52. Despite never being trained on HRES, PRIMER successfully corrects systematic biases in a typical precipitation event caused by a landfalling typhoon (Fig. 6a, e). The ensemble mean of \({P}_{\star }(x| {{{\mathcal{O}}}}_{{{\rm{HRES}}}})\) (Fig. 6b, f) aligns with HRES, while each member (Fig. 6c, g) captures a diverse range of physically plausible precipitation scenarios, reflecting the model’s ability to encode meaningful uncertainty. Maps of ΔCRPS (Fig. 6d, h) with widespread positive values (blue dots) indicate that PRIMER produces a reliable ensemble system for HRES. These improvements arise from the Bayesian posterior sampling mechanism. By drawing samples from \({P}_{\star }(x| {{{\mathcal{O}}}}_{{{\rm{HRES}}}})\), we effectively use the learned prior distribution \({P}_{\star }(x)\)—which has been calibrated to match gauge statistics—to adjust the original HRES forecasts. To illustrate these benefits more intuitively, we present time series at two representative gauge locations (Fig. 6i, j). The ensemble envelopes generated by PRIMER closely track observed precipitation peaks, confirming that the HRES guidance is effectively incorporated. Occasional deviations arise when the HRES forecasts and the local gauge observations exhibit divergent trends—for instance, a slower decrease in HRES versus a sharper observed decline (panel i at +27 h), or mismatched peaks (panel j at +27 h and +42 h). In such cases, the posterior may not fully capture the observed variability.

Fig. 6: Bias-correction for operational forecasts without retraining.

a, e HRES forecasts at 18-h and 36-h lead times (for other lead times, see SI Fig. 15), initialized at 00:00 UTC on 14 September 2016, coinciding with the landfall of Typhoon Meranti. b, f Ensemble means. c, g Four representative ensemble members, illustrating internal variability and structural diversity. d, h Spatial distribution of ΔCRPS, with blue indicating improvement and red indicating deterioration. i, j Precipitation time series at two representative gauged grid cells (for more stations, see SI Fig. 16); the gray envelope denotes the spread across 100 ensemble members. Maps were generated using Cartopy.

To assess PRIMER under a future scenario, we selected model simulation output from CMIP6 for the year 2050 (ref. 53), when elevated CO2 forcing is expected to alter precipitation regimes. Hourly precipitation was downscaled to 0.1° with PRIMER and compared against the raw model output. As shown in SI Fig. 17, the domain-mean precipitation curves from the raw output and the downscaled fields remain closely aligned, indicating that PRIMER preserves large-scale variability while adding fine-scale structure under a shifted climate state. Taken together, these results empirically suggest the broader utility of PRIMER as a foundation model for downstream applications without additional retraining (zero-shot adaptation).

Discussion

Existing precipitation datasets exhibit a persistent trade-off among spatial coverage, temporal resolution, and measurement accuracy, with no single data source simultaneously meeting these criteria. This fundamental limitation necessitates sophisticated fusion methods capable of integrating heterogeneous observations while overcoming the deficiencies of each source. Generative AI, particularly probabilistic diffusion models, offers a powerful approach by capturing the intricate distribution of precipitation patterns. However, practical application has been severely limited by the paradox of establishing reliable priors from individually imperfect and incomplete datasets.

To overcome these barriers, we introduce PRIMER, which conceptually represents precipitation as a continuous function, seamlessly incorporating sparse gauge observations alongside dense gridded data without destructive interpolation. Our two-stage training procedure uniquely exploits the complementary strengths of different data sources: we initially establish robust climatological priors by leveraging broadly available gridded products, which, despite their wide coverage, exhibit considerable uncertainties. These priors are then refined using sparse but accurate gauge observational information. Benchmark evaluations highlight PRIMER’s capability to effectively integrate gauge observations with gridded data, providing localized realism without sacrificing large-scale spatial coherence—a significant innovation we term the “climatological jailbreak”. Experimental results demonstrate PRIMER’s superiority in bias correction and super-resolution enhancement of existing precipitation datasets, consistently outperforming priors derived solely from single-source observations as well as two baseline methods. Furthermore, experiments show that incorporating additional gauge observations into posterior sampling markedly enhances accuracy, highlighting PRIMER’s potential for optimal interpolation in operational contexts. Crucially, PRIMER exhibits a certain degree of zero-shot generalization, maintaining consistency when applied to previously unseen operational forecasts and even future-scenario simulations.

Despite the impressive performance of PRIMER, several limitations remain. First, the scarcity of high-quality in situ gauge observations over oceanic regions constrains our ability to comprehensively evaluate model performance. Second, our current experiments are restricted to precipitation fusion within China rather than at the global scale; this decision was primarily driven by the substantial computational cost of global fusion, which exceeds our available resources. Third, from a methodological perspective, PRIMER does not provide a theoretical guarantee of temporal continuity across posterior samples at different time steps. A promising direction is to extend the framework from frame-wise priors to video-style priors that jointly model consecutive fields, thereby enhancing temporal consistency. Notwithstanding these limitations, precipitation itself is among the most complex and discontinuous variables in the climate system, which makes it a particularly stringent benchmark for validating our methodology before extending it to other variables and broader climate domains.

In practice, PRIMER is readily deployable: when integrated into operational forecasting chains, it can perform real-time post-processing of precipitation fields from numerical or AI-based forecasts, delivering both bias correction and downscaling. It also integrates seamlessly with optimal interpolation by weighting gauge observations against the background, thereby yielding substantially improved analysis states. Looking ahead, PRIMER advances three key principles for the community. First, because Earth-system data are inherently imperfect, AI for geoscience must be designed to be uncertainty-aware. By fusing heterogeneous precipitation records into a unified prior, PRIMER distills multi-source information into model parameters, in a manner analogous to how large language models compress corpus-level statistics, yielding greater accuracy than training on any single data source alone. Second, its flexible architecture and training framework naturally accommodate irregular observations alongside gridded products, providing a reusable template for broader geoscientific AI applications. Finally, PRIMER is intrinsically extensible: auxiliary variables (such as temperature, wind, and humidity) can be incorporated as additional input channels, enabling a more complete representation of the atmospheric state and ultimately strengthening both short-range weather forecasting and long-range climate simulation.

Methods

Problem formulation

A general formulation of the precipitation data fusion task involves two key components: (1) constructing an informative prior distribution, and (2) performing posterior inference given new observations.

Let x denote the target precipitation field. Different data sources—including gridded products such as satellite retrievals and reanalysis, as well as sparse gauge observations—provide multiple versions of x, each with varying spatial coverage and accuracy. Our goal is to effectively leverage these heterogeneous sources to construct a unified prior P(x). This prior plays a central role, as it is expected to integrate statistical characteristics of each source through a balanced fusion. A key innovation of this work lies in the design of a principled framework for modeling such a prior.

Once an informative prior is established, posterior inference is conducted as new observational evidence \({{\mathcal{O}}}\) becomes available. Posterior distribution \(P(x| {{\mathcal{O}}})\) can be factored into two components: the prior distribution P(x), and the likelihood \(P({{\mathcal{O}}}| x)\). Another innovation of our work is the effective implementation of posterior inference that balances the prior and the observations, ensuring the inferred precipitation field reflects both the climatological variability and the specific constraints provided by \({{\mathcal{O}}}\). This Bayesian framework naturally enables various downstream applications, such as super-resolution by conditioning on coarse data, bias correction by conditioning on biased estimates, and optimal interpolation by jointly conditioning on observations and background (Fig. 1a).

Preliminary on diffusion models

To construct a prior, we employ score-based diffusion models. To enable the model to distinguish between sources during training, we associate each sample with a corresponding entity embedding ei (e1 = (1, 0, 0), e2 = (0, 1, 0), e3 = (0, 0, 1))54, which is injected into the model. This embedding functions as a source identifier, enabling the model to learn distinct priors for different data sources. Specifically, e1 corresponds to ERA5, e2 to IMERG, and e3 to gauge observations. Here, we first outline the foundations of the traditional diffusion framework before extending its conceptual scope. The forward diffusion process evolves the data distribution into a tractable Gaussian through a stochastic differential equation (SDE)33,34,55,56:

$$d{x}_{t}=f({x}_{t},t)\,dt+g(t)\,d{W}_{t},$$
(1)

where \({x}_{t}\in {{\mathbb{R}}}^{n}\) is the state at time t, f(xt, t) is the drift function, g(t) is the diffusion coefficient, and Wt is a standard Wiener process. To generate samples, we solve the reverse-time SDE55,57:

$$d{x}_{t}=\left[f({x}_{t},t)-{g}^{2}(t){\nabla }_{{x}_{t}}\log {P}_{\theta }({x}_{t}| {e}_{i})\right]dt+g(t)\,d{W}_{t},$$
(2)

where the score function \({\nabla }_{{x}_{t}}\log {P}_{\theta }({x}_{t}| {e}_{i})\) denotes the gradient of the log-density. Since this score is intractable, we approximate it using a neural network fθ.
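Equation (2) can be illustrated with a minimal Euler–Maruyama sampler. This is a sketch, not the PRIMER implementation: it assumes zero drift (f = 0), a constant diffusion coefficient g(t) = σ, and substitutes a stand-in analytic score for the learned network fθ (in PRIMER the score is additionally conditioned on the entity embedding ei).

```python
import numpy as np

def reverse_sde_sample(score_fn, shape, n_steps=100, sigma=1.0, seed=0):
    """Euler-Maruyama integration of the reverse-time SDE
    dx = [f - g^2 * score] dt + g dW, here with f = 0 and g(t) = sigma."""
    rng = np.random.default_rng(seed)
    dt = 1.0 / n_steps
    x = sigma * rng.standard_normal(shape)  # start from the Gaussian reference
    for k in range(n_steps):
        t = 1.0 - k * dt
        # Stepping backward in time: the score term pulls samples toward the data.
        x = (x + sigma**2 * score_fn(x, t) * dt
             + sigma * np.sqrt(dt) * rng.standard_normal(shape))
    return x

# Stand-in score: the exact score of a standard Gaussian, for demonstration only.
samples = reverse_sde_sample(lambda x, t: -x, shape=(20000,))
```

With the Gaussian stand-in score, the sampler contracts toward a zero-mean distribution, which is all the sketch is meant to show.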

PRIMER

Traditional diffusion models rely heavily on U-Net architectures44, which require inputs and outputs to be uniformly gridded data at a fixed resolution. This architectural constraint limits their flexibility, particularly when processing discrete, sparse gauge observations. PRIMER utilizes a framework inspired by recent theoretical advances58,59,60,61, which generalizes diffusion models from finite-dimensional Euclidean space to an infinite-dimensional Hilbert space \({{\mathcal{H}}}\), as illustrated in SI Fig. 1. In this setting, each element \(x\in {{\mathcal{H}}}\) is a function \(x:{{\mathbb{R}}}^{n}\to {{\mathbb{R}}}^{d}\), where \({{\mathbb{R}}}^{n}\) denotes coordinates and \({{\mathbb{R}}}^{d}\) represents physical quantities. Both dense gridded data and sparse gauge observations are treated as partial realizations of an underlying function, allowing PRIMER to natively integrate heterogeneous records. Following ref. 58, we define \({{\mathcal{H}}}\) as \({L}^{2}({[0,1]}^{n}\to {{\mathbb{R}}}^{d})\), where L2 denotes the space of functions f such that \({\int }_{{[0,1]}^{n}}| f(x){| }^{2}\,dx < \infty\). The rationale for the name PRIMER is discussed in SI Section 1.

Mollification

Using white noise in the forward diffusion process, while tempting, poses a fundamental issue. Let ϵ(c) be white noise whose value at each \({{\bf{c}}}\in {{\mathbb{R}}}^{n}\) is sampled independently from \({{\mathcal{N}}}(0,1)\). For ϵ to lie in the Hilbert space \({{\mathcal{H}}}\), it must be square-integrable. However, ϵ(c) violates this, as its norm diverges. To address this, PRIMER applies a Gaussian kernel k to mollify the noise: \(\xi ({{\bf{c}}})=(k*\epsilon )({{\bf{c}}})={\int }_{{{\mathbb{R}}}^{n}}k({{\bf{c}}}-{{{\bf{c}}}}^{{\prime} })\epsilon ({{{\bf{c}}}}^{{\prime} })\,d{{{\bf{c}}}}^{{\prime} }.\) The resulting smoothed noise is square-integrable and thus belongs to \({{\mathcal{H}}}\), as rigorously proven in SI Section 2.2. PRIMER similarly mollifies x0, which ensures that Lx0 inherits the same smoothness properties. In practice, this operation is implemented efficiently using discrete Fourier transforms (DFT). In Fourier space, mollification attenuates each mode by \({e}^{-\parallel {{\boldsymbol{\omega }}}{\parallel }^{2}t}\), so recovering ϵ from ξ corresponds to: \(\epsilon ({{\boldsymbol{\omega }}})={e}^{\parallel {{\boldsymbol{\omega }}}{\parallel }^{2}t}\,\xi ({{\boldsymbol{\omega }}})\), where \({{\boldsymbol{\omega }}}\in {{\mathbb{R}}}^{n}\) denotes the frequency, and \(t={\sigma }^{2}/2\), with σ being the standard deviation of kernel k (a detailed derivation is provided in SI Section 2.3). Directly applying this inverse transformation is often numerically unstable; thus, we employ the Wiener filter, defined as58,62: \(\widetilde{\epsilon }({{\boldsymbol{\omega }}})=\frac{{e}^{-\parallel {{\boldsymbol{\omega }}}{\parallel }^{2}t}}{{e}^{-2\parallel {{\boldsymbol{\omega }}}{\parallel }^{2}t}+{\delta }^{2}}\,\xi ({{\boldsymbol{\omega }}})\), where δ is a small positive regularization parameter.
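The mollification and its Wiener-filter inverse can be sketched with NumPy FFTs. This illustrative version assumes a periodic unit domain and angular frequencies from np.fft.fftfreq; it is not the exact PRIMER code.

```python
import numpy as np

def _freq_sq(H, W):
    # Squared angular-frequency magnitudes ||w||^2 on the FFT grid.
    wx = 2 * np.pi * np.fft.fftfreq(H)
    wy = 2 * np.pi * np.fft.fftfreq(W)
    return wx[:, None] ** 2 + wy[None, :] ** 2

def mollify(eps, sigma):
    """Smooth a noise field with a Gaussian kernel via the FFT:
    each Fourier mode is multiplied by exp(-||w||^2 * t), t = sigma^2 / 2."""
    t = sigma ** 2 / 2
    w2 = _freq_sq(*eps.shape)
    return np.fft.ifft2(np.exp(-w2 * t) * np.fft.fft2(eps)).real

def demollify(xi, sigma, delta=1e-3):
    """Approximately invert mollification with a Wiener filter:
    filt = exp(-||w||^2 t) / (exp(-2||w||^2 t) + delta^2), for stability."""
    t = sigma ** 2 / 2
    w2 = _freq_sq(*xi.shape)
    filt = np.exp(-w2 * t) / (np.exp(-2 * w2 * t) + delta ** 2)
    return np.fft.ifft2(filt * np.fft.fft2(xi)).real
```

The regularization δ caps the amplification of high frequencies, trading exact inversion for numerical stability.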

Network architecture

Neural operators learn maps between function spaces61,63,64,65 and achieve discretization invariance by learning integral kernels parameterized via neural networks. Specifically, for an input function \(x:{{\mathbb{R}}}^{n}\to {{\mathbb{R}}}^{d}\), with observations at m distinct spatial locations, the operator K(x; θ) is defined as:

$$(K(x;\theta )x)({{\bf{c}}})={\int }_{{{\mathbb{R}}}^{n}}{\kappa }_{\theta }\left({{\bf{c}}},{{\bf{b}}},x({{\bf{c}}}),x({{\bf{b}}})\right)\,x({{\bf{b}}})\,d{{\bf{b}}},$$
(3)

where \({\kappa }_{\theta }:{{\mathbb{R}}}^{n}\times {{\mathbb{R}}}^{n}\times {{\mathbb{R}}}^{d}\times {{\mathbb{R}}}^{d}\to {\mathbb{R}}\) is a kernel function parameterized by θ, which captures complex non-local dependencies. PRIMER implements a hybrid multiscale architecture that synthesizes the strengths of neural operators and convolutional networks. PRIMER first processes the input \(x\in {{\mathbb{R}}}^{d\times m}\), together with the corresponding locations \(c\in {{\mathbb{R}}}^{n\times m}\), using a series of SparseConvResBlocks, which primarily employ sparse depthwise convolutions66, producing updated features with shape \({{\mathbb{R}}}^{D\times m}\), where \(D\gg d\). This embedding step projects low-dimensional input features into a higher-dimensional space, a crucial operation that enables the model to capture richer representations. For the motivation behind SparseConvResBlock, see SI Section 2.6. Since the features lie on an irregular set of discrete locations, we project them onto a coarse regular grid based on their spatial coordinates (see SI code 1). This transformation aligns the features to a structured grid layout. A U-Net is applied to this grid to capture multiscale context. As we are ultimately interested in observations at the original irregular target locations, the processed grid features are reprojected to these coordinates via bilinear interpolation, yielding a feature tensor of shape \({{\mathbb{R}}}^{D\times m}\). Finally, a subsequent series of SparseConvResBlocks is applied to produce the final output tensor of shape \({{\mathbb{R}}}^{d\times m}\). Details of the network are provided in SI Section 2.5.
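The project-and-reproject steps can be sketched as follows. The nearest-cell averaging and bilinear sampling below are simplified stand-ins for the coordinate-based projection and bilinear reprojection described above; the function names are ours, not PRIMER's.

```python
import numpy as np

def points_to_grid(feats, coords, H, W):
    """Average point features (D, m) into an (H, W, D) grid by nearest cell.
    coords: (2, m) with entries normalized to [0, 1]."""
    i = np.clip(np.rint(coords[0] * (H - 1)).astype(int), 0, H - 1)
    j = np.clip(np.rint(coords[1] * (W - 1)).astype(int), 0, W - 1)
    grid = np.zeros((H, W, feats.shape[0]))
    count = np.zeros((H, W))
    np.add.at(grid, (i, j), feats.T)   # unbuffered add handles duplicate cells
    np.add.at(count, (i, j), 1.0)
    return grid / np.maximum(count, 1.0)[..., None]

def grid_to_points(grid, coords):
    """Bilinearly sample an (H, W, D) grid back to point locations (2, m)."""
    H, W, D = grid.shape
    x, y = coords[0] * (H - 1), coords[1] * (W - 1)
    x0 = np.clip(np.floor(x).astype(int), 0, H - 1)
    y0 = np.clip(np.floor(y).astype(int), 0, W - 1)
    x1, y1 = np.minimum(x0 + 1, H - 1), np.minimum(y0 + 1, W - 1)
    wx, wy = (x - x0)[:, None], (y - y0)[:, None]
    vals = (grid[x0, y0] * (1 - wx) * (1 - wy) + grid[x1, y0] * wx * (1 - wy)
            + grid[x0, y1] * (1 - wx) * wy + grid[x1, y1] * wx * wy)
    return vals.T  # (D, m)
```

When points sit exactly on cell centers, the round trip is lossless; for irregular gauge locations the grid step is inherently lossy, which is why PRIMER keeps SparseConvResBlocks on both sides of the U-Net.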

Model training

The model is optimized by minimizing a simplified denoising objective33,55,58 (derivation provided in SI Section 2.4):

$${{\mathcal{L}}}={{\mathbb{E}}}_{t}\left[\parallel {f}_{\theta }({x}_{t},t,{e}_{i})-\xi {\parallel }_{{{\mathcal{H}}}}^{2}\right],$$
(4)

where xt denotes the noisy input at time step t, ei represents the entity embedding, ξ is the ground-truth noise, and \(\parallel \cdot {\parallel }_{{{\mathcal{H}}}}\) denotes the loss norm defined in Hilbert space \({{\mathcal{H}}}\). We adopt a two-stage training procedure. In Stage 1, the model is jointly trained on ERA5 (e1) and IMERG (e2). In Stage 2, we specialize the pretrained model to sparse gauge observations (e3), following a strategy akin to DreamBooth67. Specifically, we fine-tune the model using a shared-weight strategy, where training samples are proportionally drawn from multiple data sources. The total loss is computed as:

$${{{\mathcal{L}}}}_{\text{fine-tuning}}={\alpha }_{1}{{{\mathcal{L}}}}_{{{\rm{ERA5}}}}+{\alpha }_{2}{{{\mathcal{L}}}}_{{{\rm{IMERG}}}}+{\alpha }_{3}{{{\mathcal{L}}}}_{{{\rm{gauge}}}},$$
(5)

with weights α1 = 0.1, α2 = 0.4, and α3 = 0.5. Assigning α3 = 0.5 prevents catastrophic forgetting of ERA5 and IMERG knowledge while ensuring strong gauge influence68. Among the remaining weights, IMERG (α2 = 0.4) is favored over ERA5 (α1 = 0.1) given its finer resolution. Although not optimized through exhaustive search, this empirical configuration preserves climatological priors while adapting to high-fidelity signals, thereby grounding the generative manifold in real-world observations.
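Equation (5) reduces to a convex combination of per-source denoising losses; a minimal sketch with placeholder loss values (the numbers below are illustrative, not measured):

```python
# Hypothetical per-source denoising losses for a single batch (placeholder values).
losses = {"ERA5": 0.82, "IMERG": 0.64, "gauge": 0.91}
weights = {"ERA5": 0.1, "IMERG": 0.4, "gauge": 0.5}  # alpha_1, alpha_2, alpha_3

def finetune_loss(losses, weights):
    """Convex combination of per-source losses, as in Eq. (5)."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9  # weights form a partition
    return sum(weights[k] * losses[k] for k in weights)

total = finetune_loss(losses, weights)  # 0.1*0.82 + 0.4*0.64 + 0.5*0.91
```

In practice the same weights also set the proportions with which training samples are drawn from each source.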

The full training and inference pipelines are summarized in SI Algorithm 1 and SI Algorithm 2, with an overview schematic shown in SI Fig. 2. For the configuration of the hyperparameters, see SI Section 2.7.

Posterior sampling

In tasks such as bias correction, downscaling, and optimal interpolation, the objective is to infer an unknown target state x given observations \({{\mathcal{O}}}\). PRIMER enables the incorporation of prior knowledge through a prior P(x), facilitating posterior inference via Bayes’ theorem: \(P(x| {{\mathcal{O}}})\propto P({{\mathcal{O}}}| x)P(x).\) The standard reverse-time SDE can be modified to sample from the posterior distribution, yielding the following reverse diffusion process:

$$d{x}_{t}=\left[f({x}_{t},t)-{g}^{2}(t)\left({\nabla }_{{x}_{t}}\log {P}_{\theta }({x}_{t}| {e}_{i})+{\nabla }_{{x}_{t}}\log {P}_{\theta }({{\mathcal{O}}}| {e}_{i},{x}_{t})\right)\right]dt+g(t)\,d{W}_{t}.$$
(6)

This formulation requires two key components: the time-dependent score function \({\nabla }_{{x}_{t}}\log {P}_{\theta }({x}_{t}| {e}_{i})\), which can be approximated by a trained score network; and the gradient of the likelihood \({\nabla }_{{x}_{t}}\log {P}_{\theta }({{\mathcal{O}}}| {e}_{i},{x}_{t})\), which remains challenging to estimate due to the generally intractable dependency between \({{\mathcal{O}}}\) and xt. Several recent studies have proposed various strategies to address posterior sampling within the diffusion framework37,38,69. In light of the characteristics of our problem setting, we adopt two representative approaches: Inpainting70,71,72 and SDEdit73.

Inpainting reconstructs unobserved regions by conditioning on partial observations \({{\mathcal{O}}}\). A binary mask m indicates observed entries (mi = 1 if observed). At each reverse-time step t, a denoised estimate \({\widehat{x}}_{t}\) is first computed. To enforce consistency with known observations, we blend the latent state using

$${x}_{t}={{\bf{m}}}\odot q({x}_{t}| {{\mathcal{O}}})+(1-{{\bf{m}}})\odot {\widehat{x}}_{t},$$

where \(\odot\) denotes element-wise multiplication. The term \(q({x}_{t}| {{\mathcal{O}}})\) is constructed by applying the same forward noise process to \({{\mathcal{O}}}\); that is, for each observed entry, we simulate its noisy counterpart at step t under the forward SDE. This blending operation preserves observed values while allowing the model to impute missing regions, approximating the posterior distribution \(p(x| {{\mathcal{O}}})\). SDEdit can be viewed as a special case of inpainting in which the entire input field is treated as observed, i.e., m = 1. However, a key distinction lies in its use of a noise level parameter τ, which determines the strength of forward noise applied to the input before denoising. This parameter controls the extent to which the model is allowed to deviate from the original input, balancing fidelity and diversity. To select an appropriate τ, we conduct a sensitivity analysis on IMERG for 13 June 2016 at 23:00 UTC. For each noise level from 0.1 to 0.9 in steps of 0.1, we generate an ensemble of 50 samples from the posterior \({P}_{\star }(x| {{{\mathcal{O}}}}_{{{\rm{IMERG}}}})\) and compute both the RMSE and CRPS over 50 repeated subsampling trials, each selecting 10 members at random. As shown in SI Fig. 4, performance improves with increasing τ up to around 0.6, beyond which both RMSE and CRPS begin to deteriorate. This suggests an optimal trade-off at a noise level of 0.6, where PRIMER maintains sufficient variability to explore plausible outcomes while preserving alignment with observational constraints.
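A single inpainting blend step can be sketched in a few lines, assuming an additive Gaussian forward process whose noise level at step t is noise_std (a stand-in for the actual forward SDE schedule):

```python
import numpy as np

def inpaint_step(x_hat, obs, mask, noise_std, rng):
    """One inpainting blend: forward-noise the observations to the current
    diffusion level, then overwrite the observed entries of the denoised
    estimate x_hat (mask = 1 where observed)."""
    q_obs = obs + noise_std * rng.standard_normal(obs.shape)
    return mask * q_obs + (1.0 - mask) * x_hat
```

Repeating this blend at every reverse-time step keeps the trajectory consistent with the observations while the model imputes the unobserved regions.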

Statistical methods

Baseline methods

We employed two additional statistical methods for downscaling and bias correction, namely BCSD-EQM (bias correction and spatial disaggregation–equitable quantile mapping)48 and RM (random mixing)49. Owing to space limitations, the algorithmic flowcharts are provided in SI Section 4.

Evaluation metrics

Deterministic accuracy

To assess the accuracy of posterior sampling, we report the mean absolute error (MAE) and the Pearson correlation coefficient (PCC). MAE captures the average absolute deviation between the predicted ensemble mean \(\widehat{x}\) and the observed value x:

$${{\rm{MAE}}}=\frac{1}{N}{\sum }_{i}\left|{\widehat{x}}_{i}-{x}_{i}\right|,$$
(7)

where i indexes the gauge locations. PCC measures the linear association between predicted and observed spatial fields:

$${{\rm{PCC}}}=\frac{{\sum }_{i}({\widehat{x}}_{i}-\bar{\widehat{x}})({x}_{i}-\bar{x})}{\sqrt{{\sum }_{i}{({\widehat{x}}_{i}-\bar{\widehat{x}})}^{2}}\sqrt{{\sum }_{i}{({x}_{i}-\bar{x})}^{2}}}.$$
(8)

Here, \(\bar{\widehat{x}}\) and \(\bar{x}\) denote the spatial means of the predicted and observed fields, respectively.
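Both metrics can be computed directly with NumPy; a minimal sketch:

```python
import numpy as np

def mae(pred, obs):
    """Mean absolute error between prediction and observation, Eq. (7)."""
    return np.mean(np.abs(pred - obs))

def pcc(pred, obs):
    """Pearson correlation coefficient between two fields, Eq. (8)."""
    pa, oa = pred - pred.mean(), obs - obs.mean()
    return np.sum(pa * oa) / np.sqrt(np.sum(pa ** 2) * np.sum(oa ** 2))
```

For gridded fields the arrays are first flattened over the valid gauge locations.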

Probabilistic skill

We use the continuous ranked probability score (CRPS)74, a proper scoring rule that measures the quality of probabilistic forecasts by comparing the predicted cumulative distribution function (CDF) F with the observation y. It is defined as:

$${{\rm{CRPS}}}(F,y)={\int }_{-\infty }^{\infty }{\left(F(x)-{{{\bf{1}}}}_{\{x\ge y\}}\right)}^{2}\,dx,$$
(9)

where \({{{\bf{1}}}}_{\{x\ge y\}}\) is the Heaviside step function centered at y. A lower CRPS indicates a better-calibrated ensemble system.
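For a finite ensemble, the CRPS integral admits the standard empirical estimator E|X − y| − ½ E|X − X′|, with X, X′ drawn independently from the ensemble; a sketch:

```python
import numpy as np

def crps_ensemble(members, y):
    """Empirical CRPS of an ensemble against observation y:
    mean |X - y| minus half the mean pairwise member distance."""
    members = np.asarray(members, dtype=float)
    term1 = np.mean(np.abs(members - y))
    term2 = 0.5 * np.mean(np.abs(members[:, None] - members[None, :]))
    return term1 - term2
```

For a deterministic "ensemble" (all members equal), CRPS reduces to the absolute error, so it generalizes MAE to probabilistic forecasts.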

Evaluation tool

Spatial lagged correlation coefficient

We evaluate the spatial dependency of a field \(x\in {{\mathbb{R}}}^{H\times W}\) by computing its correlation with spatially shifted copies. For each fixed offset (Δi, Δj), we compute the PCC between x and its lagged version \({x}_{\Delta i,\Delta j}\) using only the overlapping valid gauge observations. This metric quantifies the degree to which values at one location are linearly correlated with values at a fixed spatial offset (lag) from that location, thus capturing the spatial dependency structure.

EOF

Given an anomaly matrix \(x\in {{\mathbb{R}}}^{N\times T}\), where each row corresponds to a spatial point and each column to a time instance, EOF decomposition factorizes x via75:

$$x=LY,$$
(10)

where \(L\in {{\mathbb{R}}}^{N\times N}\) contains orthonormal spatial modes (EOFs), and \(Y\in {{\mathbb{R}}}^{N\times T}\) holds the corresponding time coefficients (principal components). EOFs are derived as eigenvectors of the covariance matrix \(S=\frac{1}{N-1}x{x}^{\top }\), arranged in decreasing order of eigenvalues, which represent the explained variance of each mode.
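In practice, the EOF decomposition is conveniently computed via a singular value decomposition of the anomaly matrix; a minimal sketch:

```python
import numpy as np

def eof_decomposition(x):
    """EOF analysis of an anomaly matrix x (N space x T time) via SVD.
    Returns spatial modes L (columns are EOFs), time coefficients Y, and the
    fraction of variance explained by each mode."""
    U, s, Vt = np.linalg.svd(x, full_matrices=False)
    L = U                   # orthonormal spatial patterns
    Y = s[:, None] * Vt     # principal-component time series, so x = L @ Y
    frac = s ** 2 / np.sum(s ** 2)
    return L, Y, frac
```

The SVD route avoids forming the covariance matrix explicitly and yields the modes already sorted by explained variance.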

RAPSD

To quantify spatial variability76, we compute the radially averaged power spectral density (RAPSD) using the open-source Pysteps library77. Given a 2D scalar field \(f(x,y)\in {{\mathbb{R}}}^{H\times W}\), its discrete Fourier transform is \(F({k}_{x},{k}_{y})={\sum }_{x=0}^{H-1}{\sum }_{y=0}^{W-1}f(x,y)\,{e}^{-2\pi i\left(\frac{{k}_{x}x}{H}+\frac{{k}_{y}y}{W}\right)},\) and the corresponding power spectral density is

$$P({k}_{x},{k}_{y})=\frac{1}{HW}{\left|F({k}_{x},{k}_{y})\right|}^{2}.$$
(11)

RAPSD is obtained by averaging P(kx, ky) over annular bins of constant radial wavenumber \(k=\sqrt{{k}_{x}^{2}+{k}_{y}^{2}}\):

$${{\rm{RAPSD}}}(k)=\frac{1}{{N}_{k}}{\sum }_{({k}_{x},{k}_{y})\in {{{\mathcal{A}}}}_{k}}P({k}_{x},{k}_{y}),$$
(12)

where \({{{\mathcal{A}}}}_{k}\) denotes the set of wavenumber components in bin k and Nk their count. We express RAPSD as a function of wavelength λ = 1/k to highlight scale-dependent variability.
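A self-contained sketch of the computation follows (the experiments use Pysteps; this simplified version bins by integer radial wavenumber):

```python
import numpy as np

def rapsd(field):
    """Radially averaged power spectral density of a 2D field, Eqs. (11)-(12)."""
    H, W = field.shape
    psd = np.abs(np.fft.fft2(field)) ** 2 / (H * W)
    kx = np.fft.fftfreq(H) * H        # integer wavenumbers along rows
    ky = np.fft.fftfreq(W) * W        # integer wavenumbers along columns
    kr = np.sqrt(kx[:, None] ** 2 + ky[None, :] ** 2)
    idx = np.rint(kr).astype(int)     # assign each mode to its nearest radial bin
    kbins = np.arange(idx.max() + 1)
    out = np.zeros(len(kbins))
    for k in kbins:
        sel = idx == k
        if sel.any():
            out[k] = psd[sel].mean()  # average power within the annulus
    return kbins, out
```

A spatially constant field concentrates all its power in the k = 0 bin, which gives a quick sanity check.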

Normalized occurrence versus rank analysis

For each gauge and hour with ground truth y and an ensemble of N realizations \({\{\widehat{{y}^{(k)}}\}}_{k=1}^{N}\), we define \(r\,=\,\frac{1}{N}{\sum }_{k=1}^{N}{{\bf{1}}}\,\left\{\widehat{{y}^{(k)}}\le y\right\}.\) If the ensemble is perfectly calibrated, {r} are uniformly distributed on [0, 1]. We assess this by plotting a histogram of normalized occurrence versus rank and by comparing the empirical CDF of {r} against the y = x reference. Deviations from uniformity are diagnostic: U-shaped or dome-shaped histograms indicate under-dispersion or over-dispersion of the ensemble, respectively78,79.
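The rank value r for one gauge and hour can be computed as below; collecting these values over all cases and histogramming them yields the calibration diagnostic described above:

```python
import numpy as np

def pit_value(ensemble, y):
    """Fraction of ensemble members at or below the verifying observation y."""
    return float(np.mean(np.asarray(ensemble, dtype=float) <= y))
```

For a perfectly calibrated ensemble these values are uniform on [0, 1]; systematic clustering near 0 and 1 signals under-dispersion.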

Data

Pretraining uses two gridded datasets: Integrated Multi-satellitE Retrievals for GPM (IMERG)80 and ERA581. IMERG provides global precipitation estimates at 0.1° spatial and 30-min temporal resolution. To match ERA5’s hourly resolution, pairs of consecutive 30-min intervals are averaged to produce hourly estimates. The study focuses on East Asia (20°–45°N, 100°–125°E), a region of high population density. After cropping, IMERG data form 250 × 250 grids, with 2000–2020 (excluding 2016) used for training. ERA5, from ECMWF, provides hourly precipitation at 0.25° resolution, yielding 100 × 100 grids over the same domain. Both datasets are log-transformed as \({x}^{{\prime} }={\log }_{10}(0.1+x)\) and standardized using IMERG statistics. For fine-tuning, we use a gauge-assimilated gridded dataset from Shen et al.27, constructed from over 30,000 Automatic Weather Stations (AWS) across China, with a spatial resolution of 0.1° and a temporal resolution of 1 hour. Since we do not have direct access to raw gauge measurements, we select only grid cells containing at least one assimilated AWS observation as a proxy for gauge observations. We use data from 2015 and 2017 for training, reserving 2016 for testing to align with the Typhoon Meranti forecasting experiment. For evaluation, we use a subset of grid cells containing at least four AWS observations, assuming these provide more reliable ground truth due to higher observation density. Throughout this work, we refer to these densely observed grid cells simply as “gauge observations” (see SI Fig. 5 for the spatial distribution of grids with gauge observations). After identical cropping and preprocessing, the data were organized as two arrays: (N, 1) for precipitation intensity and (N, 2) for the grid indices (row, column) corresponding to each gauge’s longitude–latitude location on the 0.1° target domain, both of which are input into the model during fine-tuning.
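The preprocessing transform and its inverse can be sketched as below; the mean and std arguments are placeholders for the IMERG statistics, not the actual values used.

```python
import numpy as np

def to_log_space(x, mean, std):
    """Log-transform precipitation (x' = log10(0.1 + x)) and standardize."""
    return (np.log10(0.1 + np.asarray(x, dtype=float)) - mean) / std

def from_log_space(z, mean, std):
    """Invert standardization and the log transform; clip tiny negatives to 0."""
    return np.maximum(10.0 ** (np.asarray(z, dtype=float) * std + mean) - 0.1, 0.0)
```

The 0.1 offset keeps zero-rain values finite in log space, and the clip removes the sub-millimeter negatives the inverse transform can produce from rounding.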

IFS HRES is ECMWF’s flagship deterministic high-resolution model and is widely regarded as one of the best physics-based numerical weather forecast models in the world82,83. HRES produces hourly forecasts at a 0.1° horizontal resolution. We further used simulation outputs from CAM-MPAS-HR under the HighResMIP forced-atmosphere (2015–2050) configuration, with SST and sea ice prescribed from CMIP5 RCP8.553. The model has a nominal resolution of 0.25° (variant r1i1p1f1), and we used only the data for the year 2050. Precipitation fields were downscaled from 0.25° to 0.1° by PRIMER. These two datasets were included in our experiments to demonstrate PRIMER’s strong generalization to datasets it was not trained on.