Introduction

Nitrogen dioxide (NO2), emitted by combustion sources such as traffic, power plants, and biomass burning, is a critical trace gas influencing atmospheric composition. NO2 contributes substantially to tropospheric and stratospheric chemistry by driving acid deposition, serving as the primary precursor of tropospheric ozone (O3), and modulating the lifetime of greenhouse gases such as methane, thereby affecting the Earth’s radiative balance1,2,3,4. Epidemiological evidence links NO2 exposure to respiratory and cardiovascular disease, including asthma exacerbation, chronic obstructive pulmonary disease, and lung cancer, as well as to increased premature mortality5,6,7. Given its short atmospheric lifetime, NO2 concentrations peak near urban emission sources8,9, underscoring the need for precise quantification of ambient NO2 to inform regulatory planning and public health.

To better characterize the spatiotemporal behavior of NO2, a combination of in situ observations, numerical modeling, and satellite-based remote sensing has been widely adopted, with each offering distinct advantages and limitations10,11,12. Ground-based air quality monitoring networks provide high-fidelity data at specific sites but are often too sparse for comprehensive regional or global assessments, especially in rural and remote areas where monitoring infrastructure is limited13,14. Chemical transport models (CTMs), such as the GEOS-Chem15 and Community Multiscale Air Quality (CMAQ) model16, have been widely used to estimate air pollutant concentrations across vast areas in three dimensions but remain subject to uncertainties in emission inventories, meteorological inputs, and model parameterizations17,18,19. Satellite-based remote sensing has been instrumental in overcoming the limited spatial coverage of ground-based monitoring by providing extensive observations of NO2 loadings. Since the late 1990s, a series of polar-orbiting UV-visible spectrometers, including the Global Ozone Monitoring Experiment (GOME)20, SCanning Imaging Absorption SpectroMeter for Atmospheric CHartographY (SCIAMACHY)21, Ozone Monitoring Instrument (OMI)22, GOME-223, and TROPOspheric Monitoring Instrument (TROPOMI)24, have provided multi-decadal measurements of NO2 column densities. These datasets, which typically offer near-daily global coverage, have been essential for monitoring air pollution dynamics, evaluating emission trends, and advancing atmospheric chemistry research worldwide.

Recent advances in satellite instruments have introduced new observational capabilities through geostationary platforms, complementing earlier sun-synchronous sensors by enabling continuous intra-day observations of air pollution levels25. This new generation of geostationary instruments includes the Geostationary Environment Monitoring Spectrometer (GEMS)26,27, which provides hourly measurements of columnar loadings of air pollutants, including NO2, across Asia at 7 × 8 km resolution. The Tropospheric Emissions: Monitoring of Pollution (TEMPO)28,29 provides hourly snapshots of air quality over North America at a resolution of ~2.1 × 4.75 km, facilitating the detailed characterization of urban emissions and regional air pollution patterns. The Sentinel-4 mission30, recently launched by the European Space Agency in July 2025, will offer similar observation capabilities over Europe and North Africa. Such advances not only enhance monitoring capacity but also enable more precise adjustments to inventoried air pollutant emissions, beyond what ground-based monitoring alone can achieve1,31,32,33,34. They also provide top-down observational constraints for evaluating the reliability of CTM simulations, identifying systematic biases, and guiding improvements to model inputs and parameterizations35,36. More recently, satellite observations have been increasingly fused with ground-based measurements, CTM outputs, or land use and land cover (LULC) data through machine learning (ML) and deep learning (DL) approaches to refine estimates of surface NO2 concentrations37,38,39.

Despite such advances, satellite-based NO2 retrievals remain subject to systematic errors, which are often pronounced under certain viewing geometries or instrument-specific limitations. For example, retrievals made at coarse spatial resolution smooth out sub-grid NO2 gradients through spatial averaging, which usually leads to an underestimation of NO2 levels in polluted areas13,40. Furthermore, cloud contamination necessitates a data filtering process that may exclude a non-negligible portion of retrievals, resulting in information loss and potential systematic bias41. Beyond these observational constraints, the most critical source of uncertainty arises from the Air Mass Factor (AMF), which is required to convert slant column densities (SCDs) of NO2 into vertical column densities (VCDs)42,43. AMFs are typically calculated using radiative transfer models (RTMs) that require accurate ancillary inputs, including surface reflectance, cloud and aerosol properties, and the assumed vertical distribution of trace gases. To improve computational efficiency, these models often employ look-up tables (LUTs) that provide precomputed AMF values across a range of atmospheric and surface conditions, typically relying on simplified assumptions44,45,46. This process can introduce or amplify uncertainties when the assumed conditions deviate from real-world atmospheric conditions, thereby propagating errors into the resulting VCDs of air pollutants. Because stratospheric AMFs are generally stable and have low uncertainty, the majority of AMF-related retrieval error arises from the tropospheric component. The influence of such uncertainties has been well documented across multiple satellite platforms, including OMI47,48,49, GOME-250,51, and TROPOMI52,53,54. For example, up to 45% of retrieval error in OMI’s tropospheric NO2 VCD has been attributed to inaccuracies in AMF calculation and the separation process between stratospheric and tropospheric NO2 columns55. Ground-based evaluations of TROPOMI tropospheric NO2 columns using Pandora spectrometer data, collected by the Pandonia Global Network (PGN), report AMF uncertainties ranging from 10% to 35%, primarily driven by unaccounted aerosol effects and directional surface reflectance anisotropy56. Errors in AMF can also vary substantially across different radiative transfer schemes, by 31% to 42%, depending on the assumptions used to represent atmospheric conditions43.
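Because the vertical column is obtained as VCD = SCD / AMF, a relative error in the AMF maps nonlinearly onto the VCD. The short sketch below illustrates this arithmetic for two AMF errors chosen only as examples; the function name and the sign convention (an AMF that is too large) are assumptions made for illustration and are not part of any retrieval algorithm.

```python
def relative_vcd_error(amf_relative_error):
    """Relative VCD error implied by a relative AMF error, with the SCD held fixed.

    VCD = SCD / AMF, so using AMF_used = AMF_true * (1 + e) gives
    VCD_retrieved / VCD_true - 1 = 1 / (1 + e) - 1.
    """
    if amf_relative_error <= -1.0:
        raise ValueError("AMF relative error must be greater than -1")
    return 1.0 / (1.0 + amf_relative_error) - 1.0


# Illustrative AMF errors only; these are not values from any specific product.
for e in (0.10, 0.35):
    print(f"AMF error {e:+.0%} -> VCD error {relative_vcd_error(e):+.1%}")
```

Under these assumptions, an AMF that is 10% too large lowers the retrieved VCD by about 9%, and an AMF that is 35% too large lowers it by about 26%.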

Recent studies have revealed that recalculating AMF with regionally and temporally specific inputs into RTMs can substantially reduce VCD retrieval errors. In southern China, incorporating high-resolution surface reflectance and aerosol inputs reduced errors by 25% to 30%57. Observation-based corrections in aerosol representation, such as those using the Absorbing Aerosol Index, reduced AMF biases by 10% to 15% during intense biomass-burning events58. Regional optimization of AMFs based on a priori knowledge of NO2 vertical profiles, such as those derived from CTMs, improved retrieval accuracy by 15% to 25% in urban settings59. Replacing these profiles with long-term in situ observation data reduced uncertainties in AMF by up to 20%60. In addition to spatial representativeness, temporal mismatches between the satellite overpass time and the meteorological or profile inputs used in AMF calculation can introduce significant retrieval errors. Prior work has shown that aligning AMF inputs to the correct observation time, rather than using temporally coarse or offset fields, can meaningfully reduce these errors61. These results underscore the necessity for more accurate AMF calculations and bias-correction efforts that explicitly account for spatial and temporal variability in atmospheric conditions, thereby improving the accuracy of top-down NO2 column retrievals.

Deterministic corrections of satellite-derived NO2 columns have historically relied on empirical adjustments against ground-based measurements, improvements to RTMs, and refinements of a priori NO2 profile assumptions50,62,63. These approaches, typically involving regression-based bias correction or full-product reprocessing, have been effective at mitigating systematic errors associated with surface reflectance, aerosol, and cloud properties. However, they often struggle to fully capture the complex, multivariate dependencies that shape AMF behaviors and consequently influence NO2 retrieval accuracy. In recent years, deep learning-based approaches have been increasingly explored for improving satellite NO2 data, with applications including spatial resolution enhancement56, surface-level concentration estimation64,65, and gap-filling of missing retrievals66. Beyond these general applications, several studies have applied deep learning techniques to correct biases in satellite NO2 data more directly. For example, Wu et al. 67 employed a back-extrapolation framework that leveraged a random forest model trained on satellite NO2 columns and ground-level NO2-related predictors to correct biases in long-term daily surface NO2 concentrations across China. Oak et al. 68 developed a two-step bias-correction approach for GEMS tropospheric NO2 columns, first refining the original GEMS retrievals and then applying a Light Gradient Boosting Machine (LightGBM) trained on co-located TROPOMI NO2 columns to further reduce systematic biases in GEMS retrievals over East Asia. Ghahremanloo et al. 25 applied a deep convolutional neural network (Deep-CNN) to correct biases in hourly GEMS tropospheric NO2 columns from 2021 to 2023. The model was trained on 17,879 Pandora collocations using extensive feature selection, improving Pearson’s correlation coefficient (R) between the GEMS and Pandora NO2 columns from 0.68 to 0.88 and reducing mean absolute bias by more than 50%.

More recently, advances in deep learning have introduced a new class of architectures known as neural operators, which extend traditional neural networks beyond fixed-length vector inputs to learn mappings between spatial or spatiotemporal function fields, such as air pollutant concentrations, wind vectors, or temperature distributions. Unlike earlier neural networks that rely on discrete input-output pairs, neural operators can capture relationships between the fields that continuously vary over space and time. This makes the operators particularly well-suited for solving partial differential equations (PDEs) and modeling physics-based processes commonly encountered in scientific applications. For instance, neural operators have been applied to correcting biases in numerical weather prediction outputs, including temperatures and humidity69, as well as forecasting spatial distributions of carbon monoxide70. Although such applications to satellite observation data are still emerging, the potential of neural operators to address biases originating from geophysical processes makes them a compelling tool for air quality applications.

In this study, we present a physics-informed hybrid neural network that combines a transformer branch and a Fourier neural operator (FNO) branch to correct systematic biases in TEMPO NO2 VCDs. Unlike prior ML-based bias-correction approaches25 that directly adjust the retrieved VCD, our method predicts a correction to the AMF and then recomputes the vertical column. The transformer branch captures local interactions in 2D surface and geometric predictors, such as surface albedo and viewing geometry, while the FNO branch models global spatial patterns from vertically resolved 3D profile inputs, such as scattering weights and NO2 shape factors. These representations are fused through a cross-attention mechanism and passed to a shared prediction head that estimates a physically meaningful AMF correction, which consequently improves the accuracy of NO2 VCDs. The training objective includes dedicated loss terms that enforce domain-specific physical constraints, penalizing physically implausible corrections and promoting consistency with established atmospheric principles. We train the model using 74,919 collocated TEMPO-Pandora total NO2 VCDs across North America from August 2023 to December 2024. Once fully trained, the model operates on TEMPO inputs alone, without reliance on auxiliary inputs, enabling continuous, high-frequency bias correction across the full TEMPO coverage. An extensive evaluation was conducted using 10-fold and leave-one-station-out (LOSO)71,72 cross-validation strategies to assess model generalization across space and time. This framework provides a physically consistent approach to improving the accuracy of TEMPO total NO2 VCDs within a near-real-time processing pipeline. We focus on the total NO2 vertical column rather than the isolated tropospheric component, a choice consistent with the Pandora NVS product used for training, which provides accurate total NO2 columns.
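The idea of correcting the AMF and then recomputing the column, rather than adjusting the VCD directly, can be sketched as follows. This is only an illustration: the function name is hypothetical, and the multiplicative form of the correction is an assumption, since the text above does not specify whether the predicted correction is multiplicative or additive.

```python
def recompute_vcd(scd, amf_original, amf_correction):
    """Recompute a vertical column from a corrected air mass factor.

    amf_correction is a hypothetical multiplicative factor, standing in for
    the output of a trained model (assumption made for this sketch).
    """
    amf_corrected = amf_original * amf_correction
    if amf_corrected <= 0.0:
        raise ValueError("corrected AMF must be positive")
    return scd / amf_corrected, amf_corrected


# Illustrative numbers only (columns in molecules/cm2).
scd = 1.5e16
vcd_original = scd / 1.5
vcd_corrected, amf_corrected = recompute_vcd(scd, amf_original=1.5, amf_correction=0.8)
```

Keeping the correction in AMF space means the corrected column always remains tied to the measured SCD, which is one plausible motivation for this design.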

Results

Baseline evaluation: standard TEMPO NO2 Product vs. Pandora

To benchmark TEMPO’s NO2 retrieval accuracy and examine associated biases, we compared the hourly variation of TEMPO NO2 VCD and SCD against ground-based VCD measurements at 58 Pandora stations across the contiguous United States (CONUS) during the period from August 2023 to December 2024. After applying a quality control process to both TEMPO and Pandora (see Methods), we identified 74,919 matched retrieval-measurement pairs for analysis. Agreement varied substantially by site, with the coefficient of determination (R2) ranging from 0.09 to 0.60, index of agreement (IOA) from 0.52 to 0.85, mean absolute biases (MAB) from 7.7 × 10¹⁴ to 1.09 × 10¹⁶ molecules/cm2, and root mean square errors (RMSE) from 1.04 × 10¹⁵ to 1.44 × 10¹⁶ molecules/cm2. R2 quantifies the variance explained by a linear fit; IOA standardizes the magnitude of prediction errors73; MAB captures the average absolute difference; and RMSE reflects the total error, encompassing both systematic and random components. Regionally, baseline TEMPO VCD performance varies meaningfully across CONUS, with agreement strongest along the West Coast and weakest in the Midwest; errors are smallest in the Southwest, and biases are mostly positive, largest in the Southeast and Northeast, while the Mountain West shows a slight negative bias (Supplementary Table S1). Throughout this study, VCD refers to the total NO2 vertical column. For brevity, we hereafter refer to TEMPO NO2 VCD and SCD as TEMPO VCD and SCD, and Pandora NO2 VCD as Pandora VCD.
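For reference, the four agreement metrics can be computed as in the minimal sketch below. Two choices here are assumptions rather than details confirmed by the text: R2 is taken as the squared Pearson correlation, and IOA follows Willmott's original formulation. The function names are illustrative.

```python
import math


def r_squared(obs, pred):
    """Squared Pearson correlation, i.e. variance explained by a linear fit."""
    n = len(obs)
    mean_o, mean_p = sum(obs) / n, sum(pred) / n
    sxy = sum((o - mean_o) * (p - mean_p) for o, p in zip(obs, pred))
    sxx = sum((o - mean_o) ** 2 for o in obs)
    syy = sum((p - mean_p) ** 2 for p in pred)
    return sxy ** 2 / (sxx * syy)


def index_of_agreement(obs, pred):
    """Willmott's index of agreement (assumed form): 1 is perfect agreement."""
    mean_o = sum(obs) / len(obs)
    num = sum((p - o) ** 2 for o, p in zip(obs, pred))
    den = sum((abs(p - mean_o) + abs(o - mean_o)) ** 2 for o, p in zip(obs, pred))
    return 1.0 - num / den


def mean_absolute_bias(obs, pred):
    """Average absolute difference between prediction and observation."""
    return sum(abs(p - o) for o, p in zip(obs, pred)) / len(obs)


def rmse(obs, pred):
    """Root mean square error, covering systematic and random components."""
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(obs, pred)) / len(obs))


# Toy values only: the "retrieval" is the reference shifted by a constant offset.
pandora = [1.0, 2.0, 3.0, 4.0]
tempo = [2.0, 3.0, 4.0, 5.0]
metrics = (r_squared(pandora, tempo), index_of_agreement(pandora, tempo),
           mean_absolute_bias(pandora, tempo), rmse(pandora, tempo))
```

The toy case shows why several metrics are reported together: a constant offset leaves R2 at 1 while IOA, MAB, and RMSE all register the bias.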

Hourly R2 between TEMPO SCD and Pandora VCD, and between TEMPO VCD and Pandora VCD (Fig. 1a), exhibit a pronounced diurnal cycle when evaluated in local time, with lower correlations during the early morning hours, increasing steadily toward a broad maximum during late morning to early afternoon (approximately 09–13 local time), followed by a gradual decline toward the late afternoon. For each local hour, performance metrics were computed by converting individual TEMPO pixel observations from UTC to local solar time using longitude and aggregating all collocated TEMPO–Pandora pairs within each local-time bin. On average, TEMPO VCD shows lower correlation with Pandora VCD (mean R2 ≈ 0.55) than TEMPO SCD does (mean R2 ≈ 0.67), indicating that additional uncertainty is introduced during the AMF conversion process. TEMPO VCD exhibits a modest negative bias relative to Pandora VCD during the early morning hours. The bias approaches zero by late morning (approximately 10–12 local time) and becomes weakly positive during the mid-to-late afternoon, reaching a maximum of approximately (2–4) × 10¹⁴ molecules/cm2 around 16–17 local time, before decreasing again toward the early evening. The grey ±1σ band (on the order of 10¹⁵ molecules/cm2) shows that hour-to-hour scatter far exceeds these mean offsets. Such diurnal variations in bias are characteristic of geostationary instruments like TEMPO. The fixed viewing geometry combined with changing solar angles throughout the day systematically alters atmospheric path lengths and surface reflectance conditions, impacting trace gas retrievals63,68. Specifically, the changing angles modify the atmospheric path length of sunlight, significantly impacting the calculated AMF and its sensitivity to assumptions about the NO2 vertical profile, aerosols, and clouds35,43. The tighter clustering of SCD correlations suggests that radiative-transfer assumptions, ancillary inputs, and profile-shape errors drive much of the diurnal variability in VCD accuracy. A portion of the residual spread also reflects inherent pixel–point representativeness differences between Pandora and TEMPO (see Methods), which contribute to baseline discrepancies independent of AMF-related retrieval bias.
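The conversion from UTC to local solar time by longitude, followed by hourly binning, can be sketched as below. The sketch assumes mean solar time (15° of longitude per hour, ignoring the equation of time) and a hypothetical record layout; the actual processing may differ in both respects.

```python
from collections import defaultdict


def local_solar_hour(utc_hour, longitude_deg):
    """Mean local solar time in hours: 15 degrees of longitude equal one hour."""
    return (utc_hour + longitude_deg / 15.0) % 24.0


def bin_by_local_hour(records):
    """Group (tempo_vcd, pandora_vcd) pairs by integer local solar hour.

    records: iterable of (utc_hour, longitude_deg, tempo_vcd, pandora_vcd),
    a hypothetical layout chosen for this sketch.
    """
    bins = defaultdict(list)
    for utc_hour, lon, tempo_vcd, pandora_vcd in records:
        bins[int(local_solar_hour(utc_hour, lon))].append((tempo_vcd, pandora_vcd))
    return dict(bins)


# Illustrative records: two pixels near 90°W and one near 75°W.
records = [(18.0, -90.0, 1.0, 2.0), (18.5, -90.0, 3.0, 4.0), (20.0, -75.0, 5.0, 6.0)]
hourly = bin_by_local_hour(records)
```

Agreement metrics can then be evaluated separately on the pairs stored in each local-hour bin.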

Fig. 1: Hourly agreement and bias characteristics of TEMPO nitrogen dioxide (NO2) columns against Pandora ground references across the conterminous United States.

a shows the hourly coefficient of determination (R2) in local time between Pandora NO2 vertical column density (VCD) and both TEMPO tropospheric NO2 VCD (orange markers) and TEMPO NO2 slant column density (SCD; blue square markers) at 58 Pandora Global Network stations across the continental United States, using all collocated observations from August 2023 to December 2024. b displays hourly mean biases calculated as TEMPO NO2 VCD minus Pandora NO2 VCD (black circular markers), along with the one standard deviation (±1σ) spread shown as gray shaded bands, and the overall mean bias represented by a red solid line. Visual encodings: orange markers in (a) represent TEMPO tropospheric NO2 VCD, blue square markers in (a) represent TEMPO NO2 SCD, black circular markers in (b) represent hourly bias values, gray shading indicates the ±1σ spread, and the red solid line denotes the overall mean bias. Column density units are molecules per square centimeter (molecules/cm2), and σ refers to standard deviation.

Further analysis reveals that the retrieval bias varies systematically across the range of NO2 concentrations (Fig. S1). At low concentrations (<6 × 10¹⁵ molecules/cm2), TEMPO exhibits a positive mean bias of ~10–15%, indicating a slight overestimation under clean conditions. In the intermediate range of (8–12) × 10¹⁵ molecules/cm2, the mean bias approaches zero, suggesting good agreement with Pandora. At higher concentrations (>15 × 10¹⁵ molecules/cm2), TEMPO increasingly underestimates, with biases reaching −20% to −30% at column amounts above ~30 × 10¹⁵ molecules/cm2. This pattern indicates that TEMPO tends to overestimate NO2 loadings in clean regimes but underestimate in polluted conditions. Our evaluation is consistent with recent work74, which reported that TEMPO exhibits structured, environment-dependent biases and substantial site-to-site variability. That study also found systematic overestimation in clean conditions and underestimation at higher NO2 loadings, patterns that closely match the concentration-dependent behavior we observe. Our results extend these findings by showing how these biases differ across urban, suburban, and rural environments and by quantifying the extent to which our correction framework reduces these structured errors.
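A concentration-dependent bias of this kind can be summarized by binning the collocated pairs on the reference column, as in the sketch below. The definition of percent bias used here (the mean of per-pair relative differences) and the bin edges are assumptions for illustration; the text does not state how the percentages were computed.

```python
import bisect


def binned_relative_bias(pairs, edges):
    """Mean percent bias of the retrieval within bins of the reference column.

    pairs: iterable of (tempo_vcd, pandora_vcd); edges: ascending bin edges.
    Returns one value per bin, or None for bins with no samples.
    """
    sums = [0.0] * (len(edges) - 1)
    counts = [0] * (len(edges) - 1)
    for tempo_vcd, pandora_vcd in pairs:
        idx = bisect.bisect_right(edges, pandora_vcd) - 1
        if 0 <= idx < len(sums):
            sums[idx] += 100.0 * (tempo_vcd - pandora_vcd) / pandora_vcd
            counts[idx] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


# Toy pairs in units of 1e15 molecules/cm2; bin edges are hypothetical.
pairs = [(2.2, 2.0), (4.4, 4.0), (10.0, 10.0), (24.0, 30.0)]
bias_by_bin = binned_relative_bias(pairs, edges=[0.0, 6.0, 15.0, 40.0])
```

In this toy example the lowest bin is biased high, the middle bin is unbiased, and the highest bin is biased low, mirroring the qualitative pattern described above.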

Evaluation of the bias correction performances

We evaluated the ML model’s ability to reduce discrepancies between TEMPO and Pandora VCDs using three cross-validation approaches, each designed to evaluate a different type of data dependence. The first approach used 10-fold cross-validation (CV), in which retrieval-measurement pairs were randomly split to provide a baseline estimate of model performance. The second approach was station-based group k-fold CV. In this scheme, the 58 monitoring sites were divided into six non-overlapping groups, each comprising roughly ten stations, with one group reserved for validation in each fold. This ensures that training and validation sets do not share data from the same site, providing a more rigorous test of performance at previously unseen locations. The third approach was LOSO, in which all data from a single site were withheld in each iteration. This provides the strictest test of spatial generalization, as the model must predict at a completely unseen location. Both the station-based grouping and LOSO schemes are important for air-quality datasets, where spatial structure and localized emission patterns can otherwise bias performance estimates. To evaluate model improvements relative to the uncorrected TEMPO product, we used four metrics: R², IOA, MAB, and RMSE.
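The difference between the three splitting schemes can be made concrete with the dependency-free sketch below. The function names, the random assignment of stations to groups, and the seeds are illustrative assumptions; they are not the study's actual splitting code.

```python
import random


def random_kfold(n_samples, k, seed=0):
    """Random k-fold: pairs are shuffled, so a station can appear on both sides."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i, val in enumerate(folds):
        train = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield train, val


def group_kfold(station_ids, k, seed=0):
    """Station-based k-fold: whole stations are held out together."""
    stations = sorted(set(station_ids))
    random.Random(seed).shuffle(stations)
    for held_out in (set(stations[i::k]) for i in range(k)):
        val = [i for i, s in enumerate(station_ids) if s in held_out]
        train = [i for i, s in enumerate(station_ids) if s not in held_out]
        yield train, val


def leave_one_station_out(station_ids):
    """LOSO: each iteration withholds every sample from exactly one station."""
    for station in sorted(set(station_ids)):
        val = [i for i, s in enumerate(station_ids) if s == station]
        train = [i for i, s in enumerate(station_ids) if s != station]
        yield station, train, val


# Toy station labels, one per collocated pair.
station_ids = ["A", "A", "B", "B", "C", "C", "D"]
group_splits = list(group_kfold(station_ids, k=2))
loso_splits = list(leave_one_station_out(station_ids))
```

In practice, scikit-learn's `KFold`, `GroupKFold`, and `LeaveOneGroupOut` classes provide equivalent splitters.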

Figure 2 compares the bias-corrected model estimates (“Model”) and the original TEMPO VCDs (“TEMPO”) against Pandora observations under 10-fold CV. The corrected VCDs showed improved agreement with ground-based measurements. R2 increases from 0.58 to 0.80 and IOA from 0.87 to 0.96, while the regression slope improves from 0.74 to 0.82, reflecting reduced systematic bias. Error magnitudes also decline. MAB decreases from 2.53 × 10¹⁵ to 1.82 × 10¹⁵ molecules/cm2 (−28%), and RMSE from 4.42 × 10¹⁵ to 3.09 × 10¹⁵ molecules/cm2 (−30%). Normalized metrics also show improvements, with NMAB dropping from 33.3% to 23.2% and NRMSE from 59.8% to 37.6%. To illustrate ML model performance at the native TEMPO Level-2 pixel resolution, we present representative hourly AMF and total NO2 VCD comparisons for sampled overpasses on 26 January and 10 February 2025 (Figs. S4 and S5). This approach is used because the Level-2 swath geometry and pixel locations vary from hour to hour, which makes direct temporal averaging at native resolution difficult.

Fig. 2: 10-fold cross-validation comparison of Pandora total nitrogen dioxide vertical column density (NO2 VCD) with (a) ML model predictions and (b) original TEMPO retrievals.

a shows aggregated scatter-density points from all validation folds comparing Pandora total NO2 VCD and model-predicted NO2 VCD. b shows the same comparison using original TEMPO NO2 VCD pooled across all folds. In both panels, color shading shows sample density on a log scale, the red solid line shows the best-fit regression, and the gray dashed line shows the 1:1 reference. NO2 VCD units are molecules/cm2.

Seasonal stratification of 10-fold CV results, spanning winter (n = 8183), spring (n = 14,268), summer (n = 20,623), and fall (n = 31,850), confirms that the ML model effectively corrects biases across all seasons (Fig. 3). In winter, the model improves R2 from 0.58 to 0.84 and IOA from 0.87 to 0.95, while reducing MAB and RMSE by 37% and 39%, respectively (NMAB and NRMSE decline by a similar margin). Spring shows comparable improvements, with R2 increasing to 0.82 and IOA to 0.94, resulting in 27–34% reductions in MAB and RMSE. Summer exhibits the lowest baseline skill in the uncorrected TEMPO VCD, which may be influenced by the combined effects of more vigorous daytime vertical mixing and enhanced contributions from lightning-induced NOx in the upper troposphere, both of which can alter the sensitivity of AMF to errors in the assumed vertical profile75,76. Despite these factors, our model improves the summer R2 from 0.53 to 0.70, IOA from 0.85 to 0.89, and reduces MAB and RMSE by 18–22%. The fall performance mirrors the winter performance, with an R2 of 0.81, IOA of 0.94, and MAB and RMSE reduced by 29% and 36%, respectively. The reductions in normalized errors across all seasons demonstrate the model’s robustness to seasonal variability in meteorology, aerosol properties, surface conditions, and sampling biases.

Fig. 3: Seasonal bias-correction performance of the machine-learning (ML) model and original TEMPO retrievals.

a shows seasonal R2 (coefficient of determination) and RMSE (root mean square error, in molecules/cm2). b shows seasonal IOA (index of agreement) and MAB (mean absolute bias, in molecules/cm2). Seasons (winter, spring, summer, fall) summarize all hourly collocated samples. Blue circles with solid lines show the ML model, and red squares with solid lines show original TEMPO retrievals.

Spatial generalization

Supplementary Fig. S2 shows the station-based group k-fold CV results. After bias correction, R2 increases from 0.58 to 0.74, IOA from 0.87 to 0.92, and slope from 0.74 to 0.80. MAB falls from 2.53 × 10¹⁵ to 2.07 × 10¹⁵ molecules/cm2 and RMSE from 4.42 × 10¹⁵ to 3.36 × 10¹⁵ molecules/cm2, while NMAB and NRMSE improve from 30.8% to 25.2% and 53.7% to 40.9%, respectively. For a more stringent evaluation of spatial generalization, a LOSO-CV was performed across 58 Pandora sites. In each fold, observations from one station were withheld for testing while the model was trained on the remaining 57 stations. By preventing any station’s data from appearing in both training and evaluation, this scheme yields an unbiased estimate of performance at entirely unseen monitoring locations. Figure 4 and Supplementary Table S2 summarize the LOSO results. In the TEMPO correlation map, VCDs at many stations, particularly in the western U.S., exhibit poor correlation (R2 ≈ 0.1–0.4) with the Pandora measurements. In contrast, after bias correction, the model VCDs show a substantial increase in correlation with station measurements, achieving R2 of 0.5–0.9. Across all stations, mean R2 rose from 0.36 to 0.50, IOA increased from 0.71 to 0.76, RMSE fell by 1.16 × 10¹⁵ molecules/cm2 (from 3.79 × 10¹⁵ to 2.63 × 10¹⁵), and MAB was reduced by 0.68 × 10¹⁵ molecules/cm2 (from 2.59 × 10¹⁵ to 1.90 × 10¹⁵). These improvements indicate the higher fidelity of the corrected VCD, reflecting a significant reduction in both systematic bias and residual error.

Fig. 4: Spatial pattern of leave-one-site-out cross-validated R2 (coefficient of determination) for Pandora nitrogen dioxide vertical column density (NO2 VCD) comparisons.

Left panel shows TEMPO VCD versus Pandora NO2 VCD, and the right panel shows ML model VCD versus Pandora NO2 VCD, where R2 is computed separately for each held-out site. Colored circular markers show site-wise R2 values at 58 stations across the contiguous United States.

The regional insets in Fig. 4 highlight these improvements at a finer scale. In the western inset (blue box), our correction substantially improves the correlation with Pandora VCD for many sparsely monitored sites, where R2 increased from below 0.3 to the 0.4–0.7 range. Similarly, in the northeastern inset (red box), R2 at several urban and suburban stations increased from 0.4–0.6 to 0.7–0.9 after correction. The most pronounced enhancements occurred at five stations where our bias correction was particularly effective (average ΔR2 ≈ +0.38). For instance, at Pandora55s1, R2 and IOA increased from 0.32 to 0.77 (Δ +0.44) and from 0.73 to 0.90 (Δ +0.17), respectively, RMSE fell by 4.23 × 10¹⁵ molecules/cm2 (8.68 × 10¹⁵ → 4.45 × 10¹⁵ molecules/cm2), and MAB dropped by 2.46 × 10¹⁵ molecules/cm2 (5.54 × 10¹⁵ → 3.08 × 10¹⁵ molecules/cm2). Similar improvements in R2 were seen at Pandora142s1 (0.38 → 0.76; Δ +0.38), Pandora170s1 (0.15 → 0.52; Δ +0.37), Pandora247s1 (0.30 → 0.64; Δ +0.34), and Pandora157s1 (0.47 → 0.81; Δ +0.34). In these regions, spanning both urban and rural settings, the bias-corrected VCDs not only captured the broad diurnal and seasonal variability but also mitigated site-specific biases arising from local emission patterns and variable viewing geometries.

A minority of stations (approximately 5%) exhibited modest declines post-correction. The largest drop occurred at Pandora68s1, where R2 fell by 0.12 (0.29 → 0.17), IOA dropped by 0.29, RMSE increased by 1.61 × 10¹⁵ molecules/cm2, and MAB rose by 1.64 × 10¹⁵ molecules/cm2. This underperformance may be attributed to sparse training data in the vicinity of these stations, unmodeled local pollution sources, or site-specific measurement noise. Additionally, the model may have struggled to generalize to atypical conditions or concentration regimes that were not well represented during training25,49. Similarly, as Tang et al. 77 highlighted, machine learning models can underperform when applied to areas with varying environmental conditions, emphasizing the challenge of model transferability across diverse locations. Despite such isolated cases, the consistently strong results across nearly 75,000 independent retrieval–measurement pairs, particularly the gains in R² and IOA, demonstrate that physics-informed, machine-learning-based bias correction markedly enhances NO₂ VCD retrievals from geostationary observations.

To examine how local surface conditions influence bias-correction performance, we grouped stations into urban, suburban, and rural environments using the 2024 National Land Cover Database (NLCD). Each Pandora site was assigned to a land-use class based on the dominant NLCD category. The performance within each category was evaluated using the LOSO-CV, ensuring that all reported improvements represent true local-scale generalization to unseen stations rather than in-sample fitting (Fig. S3).

Urban sites (n = 26) are characterized by strong horizontal gradients in surface NO2 at scales smaller than a TEMPO pixel, caused by sharp contrasts between roads, industrial sources, and background areas, which lead to large pixel–point representativeness errors. These sites showed pronounced improvement after correction (ΔR2 = +0.16; RMSE reduced by 1.07 × 10¹⁵ molecules/cm2). Suburban sites (n = 13) exhibited moderate gains (ΔR2 = +0.12; RMSE reduced by 0.85 × 10¹⁵ molecules/cm2), reflecting the combined influence of both localized emission-driven gradients and broader scene-dependent retrieval factors. Rural stations (n = 19) exhibit weak horizontal emission-driven gradients within a TEMPO pixel but strong sensitivity to meteorological controls on the vertical distribution of NO2, particularly boundary layer depth, vertical mixing, and horizontal transport, which shape the assumed NO2 profile used in the AMF. These stations also experienced notable improvement (ΔR2 = +0.17; RMSE reduced by 1.16 × 10¹⁵ molecules/cm2).

The magnitude of improvement is broadly similar across land-use classes, indicating that the ML model reduces multiple sources of AMF-related error rather than preferentially correcting a single dominant mechanism. While improvements in urban regions are consistent with partial correction of unresolved fine-scale horizontal emission gradients and surface reflectance variability, comparable gains at rural sites highlight the role of meteorology and profile-driven AMF errors that affect all environments. As a result, these stratified results demonstrate that the model improves local-scale retrieval performance across diverse surface conditions, but do not uniquely isolate the relative contributions of spatial-gradient versus meteorological controls on AMF error. Future work could disentangle these contributions more explicitly by combining land-use stratification with controlled sensitivity experiments, such as perturbations to NO2 vertical profiles, meteorological inputs, or spatial-resolution matching, to better attribute AMF error sources.

Ablation study and component analysis

To validate the design of our hybrid ML physics-informed Transformer–FNO model (see Methods), we performed an ablation study to systematically assess the contribution of each core component (Table 1). We first optimized the physics-constraint weight (λ) across five levels: 0, 0.001, 0.0025, 0.005, and 0.0075. Subsequently, we evaluated two structural variants: MLP-Transformer, in which the FNO branch was replaced by a simple MLP, and FNO-MLP, in which the Transformer branch was substituted with an MLP.

Table 1 Ablation study results for TEMPO NO2 bias correction

We first assessed the impact of the physics-informed AMF penalty. The λ = 0 case, a purely data-driven model, establishes a baseline with R2 of 0.75 and a slope of 0.68, confirming that a systematic bias remains without the physics constraint. Among the five physics weights tested, λ = 0.005 delivers the closest agreement with reference VCDs, achieving R2 of 0.80 and a slope of 0.81. This is a clear improvement over the baseline model, as well as over the intermediate models at λ = 0.001 (R2 = 0.77, slope = 0.76) and λ = 0.0025 (R2 = 0.80, slope = 0.77). At the optimal weight of λ = 0.005, the model also attains the best error characteristics: IOA of 0.94, MAE of 1.82 × 10¹⁵ molecules/cm² with a corresponding normalized error of 22.1%, and RMSE of 2.91 × 10¹⁵ molecules/cm² (35.4%). In contrast, increasing the weight further to λ = 0.0075 resulted in a slight performance decline, with IOA of 0.92, MAE of 2.01 × 10¹⁵ molecules/cm² (24.5%), and RMSE of 3.21 × 10¹⁵ molecules/cm² (39.0%).

Using this optimal weight, we then compared the complete Transformer-FNO model against the two structural variants by evaluating their respective outputs against the Pandora VCDs. The VCDs from the MLP-Transformer show a slope of 0.75, MAE of 1.96 × 10¹⁵ molecules/cm² (23.8%), and RMSE of 2.99 × 10¹⁵ molecules/cm² (36.4%), and those from the FNO-MLP show a slope of 0.79, MAE of 1.94 × 10¹⁵ molecules/cm² (23.6%), and RMSE of 3.12 × 10¹⁵ molecules/cm² (37.9%). The degradation in performance when either component is removed confirms the necessity of the hybrid design. Collectively, these results indicate that our model’s hybrid architecture can effectively correct the retrieval bias by integrating two specialized components, each designed to address the distinct nature of the input data. The Transformer branch interprets the complex relationships within surface and geometric predictors, while the FNO branch concurrently captures the column-wide dependencies within the vertically resolved atmospheric profiles.

Discussions

Our results demonstrate the effectiveness of incorporating a physics-based constraint within a deep learning architecture to correct biases in AMF, thereby improving the final accuracy of TEMPO NO2 column retrievals. Our hybrid Transformer-FNO model achieves this by framing the AMF correction as an intermediate, physically constrained prediction. This approach directly addresses the primary source of systematic, concentration-dependent bias by penalizing deviations from radiative-transfer theory via a Huber loss. The success of this physics constraint, however, is dependent on the model’s ability to process the varied inputs that govern AMF. The ablation study confirms this: the Transformer branch is essential for interpreting the 2D surface and geometric predictors, while the FNO is critical for capturing column-wide dependencies within the vertically resolved 3D atmospheric profiles. The underperformance of the ablated variants demonstrates that these two components perform complementary roles. This finding validates our hybrid approach, where the physics-based constraint provides the necessary physical grounding, while the combination of the Transformer and FNO branches effectively interprets the geometric, surface, and atmospheric profile variables that collectively govern the AMF when retrieving NO2 columns.

The robustness of our physics-constrained model is confirmed through multi-stage validation, showing a substantial reduction in both systematic bias and residual error. Ten-fold CV shows a 0.26 increase in R2 and a 37% reduction in RMSE, and the more stringent LOSO CV confirms these improvements at entirely unseen monitoring stations. The model’s consistent performance across all seasons, including the summer months, further validates its reliability under a wide range of atmospheric conditions.

Beyond these metrics, our approach offers two key operational advantages for practical, large-scale deployment of this bias correction framework. First, the entire correction pipeline operates using only TEMPO data as input, eliminating the need for concurrent ground-based measurements or other auxiliary datasets. This self-sufficiency makes the method readily deployable in real-time processing streams, capable of providing continuous, high-frequency, bias-corrected NO2 VCD across the entire geostationary swath. Second, by embedding the AMF constraint directly in the loss function, the model is explicitly guided to enforce consistency with fundamental radiative-transfer physics. This prevents the model from making physically implausible adjustments and enhances the interpretability of the correction. Together, these features deliver the robust accuracy, practical deployability, and scientific integrity required for geostationary NO2 monitoring.

While the model effectively reduces systematic biases, it operates exclusively on TEMPO Level-2 scene-dependent inputs. It therefore inherits any uncertainty present in those variables (e.g., surface albedo, cloud parameters, aerosol treatment, or GEOS-CF–derived NO2 profiles). Because the model can only learn from the information available in these predictors, upstream retrieval biases or missing variability propagate into the corrected columns, contributing to the slight compression of the highest and lowest VCD values and explaining why the regression slope, though improved, remains below 1.0. In this context, the observed performance gains reflect the combined reduction of multiple AMF-related error sources, rather than isolation of a single dominant correction mechanism. Residual disagreements at certain stations should also be interpreted in the context of pixel–point representativeness differences between Pandora and TEMPO (see Methods), which cannot be removed by AMF bias correction alone. Although the present study focuses on total column correction, this framework can be extended to tropospheric retrievals by incorporating predictors that better characterize vertical structure or stratospheric contributions (e.g., layer-resolved scattering weights or alternative profile sources), enabling the model to correct total and tropospheric VCDs either jointly or in sequence.

Methods

Study area and data collocation

The study covers the North American domain (10–60° N, 140–60° W), encompassing the contiguous United States, southern Canada, and northern Mexico, as shown in Fig. 5. This region lies within TEMPO’s geostationary field of regard (centered at 92.85° W), enabling hourly daytime retrievals of NO2 columns across major urban corridors, including Los Angeles, the Northeast megalopolis, and Mexico City, as well as rural background environments. TEMPO NO2 columns were collocated with ground-based Pandora spectrometer measurements from the PGN for validation and bias correction.

Fig. 5: Study domain showing TEMPO satellite coverage and Pandora station locations across North America.

The dashed black line outlines the spatial extent of TEMPO observation coverage, and purple circular markers show the locations of Pandora ground stations.

The Tropospheric Emissions: Monitoring of Pollution (TEMPO) data

TEMPO, launched on April 7, 2023, as the first NASA Earth Venture Instrument mission, provides hourly daytime ultraviolet–visible imaging spectroscopy of key air pollutants over North America from a geostationary orbit at 92.85° W. Covering spectral ranges of ~293–494 nm and ~538–741 nm, TEMPO retrieves columns of NO2, ozone (O3), formaldehyde (HCHO), and sulfur dioxide (SO2), alongside glyoxal, water vapor, bromine monoxide, iodine monoxide, and aerosols, and measures cloud properties, foliage reflectance, and ultraviolet (UV) B flux, with data processing by the Smithsonian Astrophysical Observatory Science Data Processing Center28,29. In this study, we use the TEMPO Level-2 NO2 product (Version V03, provisional). Its nominal spatial resolution at the field-of-regard center (36.5° N, 100° W) is ~2.1 × 4.75 km² for NO2 (8 × 4.75 km² for ozone profiling), enabling fine-scale monitoring of urban corridors and regional backgrounds. The NO2 retrieval employs a 405–465 nm fitting window to derive SCDs, which are then converted to VCDs using radiative-transfer-derived air mass factors28,29. The AMF calculation incorporates a priori NO2 vertical profiles from a chemical transport model, surface reflectance based on climatological bidirectional reflectance distribution function (BRDF) products, and cloud parameters, including effective cloud fraction and cloud pressure. Aerosol effects are treated implicitly within the radiative-transfer framework. Only retrievals passing the recommended quality-assurance and cloud-screening flags are retained11,28. TEMPO achieved first light on August 2, 2023, and began nominal operations in October 2023, with products publicly accessible through NASA’s Atmospheric Science Data Center. As part of a geostationary air-quality constellation, including South Korea’s GEMS and Europe’s Sentinel-4, TEMPO’s hourly coverage captures diurnal variability and rapid emission events that polar-orbiting instruments cannot resolve.

Pandora data

The Pandora spectrometer, deployed within the PGN, is a ground-based passive remote-sensing instrument that acquires hyperspectral solar-irradiance measurements in the UV-visible range to retrieve total and tropospheric column densities of NO2, O3, and HCHO via direct-sun and multi-axis DOAS modes25,78,79. Although Pandora columns are inherently spatially and vertically smoothed relative to in situ probes, they serve as high-quality, independent references for satellite validation, with each observation accompanied by quality-assurance flags80,81,82. For this study, we utilized the Level-2 NVS (L2_rnvs3p1-8) data product from 58 PGN stations, which provides total (TotCol), tropospheric (TropCol), and near-surface (SurfConc) retrievals for NO2, O3, and HCHO processed under standardized algorithms and quality-control procedures.

Because our model corrects total NO2 columns, we use the direct-sun NVS TotCol measurements. Under direct-sun geometry, the AMF is close to unity, and the associated uncertainty is relatively low, typically 2–5%, as described in the PGN Data Product Readme. In contrast, the TropCol and SurfConc products require additional AMF calculations that depend on vertical profile shape, cloud fraction, surface reflectance, and viewing geometry. Because these additional dependencies introduce more variability into the retrieval, the uncertainties for TropCol and SurfConc are correspondingly higher than for the direct-sun TotCol product78. All Pandora data were obtained through the PGN portal (https://www.pandonia-global-network.org) and are fully documented in the official product readme (https://www.pandonia-global-network.org/wp-content/uploads/2023/11/PGN_DataProducts_Readme_v1-8-8.pdf).

National Land Cover Database (NLCD)

Land-use classification for the local-scale urban, suburban, and rural analysis was based on the 2024 NLCD, produced by the U.S. Geological Survey (USGS), which provides 30 m resolution land-cover information across the United States. For our analysis, stations were classified as urban (NLCD codes 23–24), suburban (21–22), or rural (all other codes) using the NLCD land-cover classes. The NLCD product was downloaded from EarthExplorer (https://earthexplorer.usgs.gov/).
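As a minimal sketch of this grouping, the function below maps a single NLCD code to one of the three classes. The function and constant names are hypothetical, and the text does not specify how one code is assigned to each station (for example, the 30 m pixel at the station versus a majority class within a buffer), so the input code is simply assumed to be given.

```python
# Hypothetical helper (not the authors' code): group NLCD codes into the
# three land-use classes used for the stratified analysis.
URBAN_CODES = {23, 24}      # developed, medium and high intensity
SUBURBAN_CODES = {21, 22}   # developed, open space and low intensity


def classify_station(nlcd_code: int) -> str:
    """Return 'urban', 'suburban', or 'rural' for one NLCD land-cover code."""
    if nlcd_code in URBAN_CODES:
        return "urban"
    if nlcd_code in SUBURBAN_CODES:
        return "suburban"
    return "rural"  # every other code, e.g. forest, cropland, or water


labels = [classify_station(code) for code in (24, 21, 41)]
```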

Data preparation and preprocessing

Our data-processing workflow integrates satellite-based NO2 from TEMPO with ground-based NO2 measurements collected from PGN. First, we verify the geographic location of each Pandora station to ensure that it lies within TEMPO’s field of view. We then compile station metadata and calibration records to ensure consistency in our comparisons. Next, we identify, retrieve, and process the TEMPO Level-2 granules that correspond spatially and temporally to each Pandora observation, accessed through NASA’s Earthdata Search portal and the Atmospheric Science Data Center (ASDC) at NASA Langley Research Center (LaRC) using earthaccess v0.5.1. Our spatial matching procedure utilizes polygon-intersection methods and nearest-neighbor resampling79 to align Pandora coordinates with TEMPO pixel boundaries. We identify the single TEMPO Level-2 pixel whose center lies closest to each Pandora station using a nearest-neighbor search, and we retain that pixel only if its center lies within 2 km of the site. This distance criterion is used solely as a screening step to ensure spatial consistency, and no spatial averaging across multiple TEMPO pixels is performed. We apply primary data-quality flags to TEMPO retrievals, retaining only pixels with cloud fraction <20% and positive tropospheric and stratospheric NO2 column densities. Then, we compute the total NO2 column as the sum of those two components. Because Pandora provides point-based measurements while TEMPO retrieves a ~2.1 × 4.75 km² pixel average, the two instruments differ in representativeness. Local enhancements observed by Pandora (e.g., roadway plumes or near-source gradients) may be spatially smoothed in the larger TEMPO footprint, and early/late-day slant-path geometries can intersect multiple pixels. These differences introduce representativeness error that is unrelated to AMF bias.
Throughout this study, we therefore treat Pandora as a high-quality reference rather than a perfect “truth,” and interpret remaining discrepancies as a combination of retrieval uncertainty and pixel–point mismatch. We achieve temporal alignment using two complementary methods. First, we pair each TEMPO measurement with the nearest Pandora observation within a 15-minute window. Second, we apply a Gaussian-weighted smoothing of Pandora observations over a temporal window (σ = 5 minutes) to produce a continuous series83. We then organize our spatiotemporally harmonized data into structured intermediate files, preserving measurement uncertainties and metadata for downstream analysis. We use four groups of predictors: angular variables (solar zenith angle, solar azimuth angle, viewing zenith angle, viewing azimuth angle, and relative azimuth angle); three-dimensional profiles (scattering weights, gas profile, temperature profile); two-dimensional fields (snow-ice fraction, terrain height, surface pressure, albedo, effective cloud fraction); and NO2 observations (total NO2 column, slant column). Next, we screen both the Pandora direct-sun total NO2 column (TotCol) and the TEMPO total NO2 columns for extreme values, excluding observations above the 99th percentile. We standardize all continuous meteorological and column-density variables (e.g., pressure, albedo, cloud fraction, column densities) via mean-centering and division by their standard deviation84; for the vertical profiles, this standardization is applied independently to each layer. We apply min-max normalization to latitude and longitude over our domain bounds and transform angular (solar and viewing zenith and azimuth, relative azimuth) and temporal (hour of day, day of year) predictors into paired sine-cosine channels to preserve cyclic structure83,84. We retain unscaled copies of all core inputs and apply a denormalization to the model outputs, restoring them to physical NO2 column densities for final predictions and loss calculations.
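The three scaling steps described above can be sketched as follows. This is an illustration under stated assumptions rather than the preprocessing code used in the study: the function names are hypothetical, the standard deviation is taken as the population value, and the period passed to the cyclic encoder (24 for hour of day, 360 for angles in degrees) is our assumption, since the text does not list the periods used.

```python
import math


def cyclic_encode(value: float, period: float) -> tuple[float, float]:
    """Map a periodic quantity onto the unit circle as a (sin, cos) pair."""
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def zscore(values: list[float]) -> list[float]:
    """Mean-center and divide by the (population) standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]


def minmax(value: float, lower: float, upper: float) -> float:
    """Scale a value to [0, 1] using fixed domain bounds."""
    return (value - lower) / (upper - lower)


# Hour 23 and hour 1 end up close together, unlike the raw values 23 and 1.
late = cyclic_encode(23.0, 24.0)
early = cyclic_encode(1.0, 24.0)

# Latitude scaled over the study-domain bounds of 10 to 60 degrees north.
lat_scaled = minmax(35.0, 10.0, 60.0)
```

The sine-cosine pair removes the artificial jump at the wrap-around point (midnight, or 0°/360° for azimuth), which a single scaled channel would retain.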
All preprocessing and model code were written in Python 3.10 using Xarray, Pandas, NumPy, SciPy, and PyTorch.
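The nearest-pixel screening and quality filtering described in this section can be illustrated with the sketch below. It is not the authors' pipeline: the dictionary keys, the function names, and the use of a haversine distance on a spherical Earth are assumptions made for the example, as is the choice to discard a match when the nearest pixel fails the quality checks rather than falling back to the next-nearest pixel.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (spherical approximation)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def match_pixel(station_lat, station_lon, pixels, max_km=2.0, max_cloud=0.20):
    """Return the nearest pixel with its total column, or None if screened out."""
    if not pixels:
        return None

    def distance(p):
        return haversine_km(station_lat, station_lon, p["lat"], p["lon"])

    nearest = min(pixels, key=distance)           # single closest pixel center
    if distance(nearest) > max_km:                # 2 km screening criterion
        return None
    if nearest["cloud_fraction"] >= max_cloud:    # keep cloud fraction < 20%
        return None
    if nearest["trop"] <= 0 or nearest["strat"] <= 0:
        return None                               # require positive partial columns
    return {**nearest, "total": nearest["trop"] + nearest["strat"]}


pixels = [
    {"lat": 40.005, "lon": -105.0, "cloud_fraction": 0.10, "trop": 3e15, "strat": 2e15},
    {"lat": 40.500, "lon": -105.0, "cloud_fraction": 0.00, "trop": 1e15, "strat": 2e15},
]
match = match_pixel(40.0, -105.0, pixels)
```

In this made-up example the first pixel lies roughly 0.56 km from the station and passes all checks, so its tropospheric and stratospheric columns are summed into the total column.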

Hybrid deep learning architecture for NO2 bias correction

Figure 6 illustrates our model, which processes heterogeneous inputs through two parallel branches. In the first branch, Transformer encoder layers85 extract global contextual features from surface and satellite predictors. In the second branch, an FNO86,87 captures large-scale atmospheric structures from vertically resolved profiles. We then fuse the two sets of features using a cross-attention module, and a final fully connected network estimates an intermediate AMF. This AMF is subsequently used to compute the VCD via the physical inversion relationship:

$${\rm{VCD}}=\frac{{\rm{SCD}}}{{\rm{AMF}}}$$
(1)
Fig. 6: Architecture schematic of the hybrid NO2 bias-correction machine-learning (ML) model.

A Transformer-based branch processes 2-D surface and column predictors, an FNO (Fourier Neural Operator) branch extracts features from 3-D vertical profiles, and a cross-attention fusion module combines both streams to predict AMF (air mass factor) used for NO2 VCD (vertical column density; molecules/cm²) estimation.

The Transformer branch processes input data \({{\rm{x}}}_{2{\rm{D}}}\) structured as a tensor with shape (batch, \({{\rm{n}}}_{2{\rm{D}}}\)), where \({{\rm{n}}}_{2{\rm{D}}}\) is the number of 2D features. In contrast, the FNO branch handles atmospheric profiles \({{\rm{x}}}_{3{\rm{D}}}\) formatted as a tensor with shape \(({\rm{batch}},{{\rm{n}}}_{3{\rm{D}}},{\rm{num\; layers}})\), where \({{\rm{n}}}_{3{\rm{D}}}\) is the number of profile variables and num layers (72 in this study) is the number of pressure layers. The model is trained by minimizing a composite loss function that primarily targets accurate total VCD by comparing the total VCD derived from the predicted AMF (i.e., \({\rm{SCD}}/{\rm{AM}}{{\rm{F}}}_{{\rm{pred}}}\)) against the reference Pandora VCD (\({{\rm{VCD}}}_{{\rm{true}}}\)). A weighted physics loss term is also included to enforce greater physical consistency in the estimated AMF.
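To make the tensor shapes and the role of Eq. 1 concrete, the following sketch builds placeholder inputs and converts a set of AMF values into VCDs. The batch size, the number of 2D features, the slant column, and the AMF values are made-up numbers for illustration; only the three profile variables and the 72 layers follow the text.

```python
import numpy as np

batch, n_2d, n_3d, num_layers = 4, 12, 3, 72   # n_2d = 12 is an arbitrary placeholder

x_2d = np.zeros((batch, n_2d))                 # surface and geometry predictors
x_3d = np.zeros((batch, n_3d, num_layers))     # scattering weights, gas and temperature profiles

scd = np.full(batch, 1.2e16)                   # slant column in molecules/cm^2 (made-up value)
amf_pred = np.array([1.0, 1.2, 1.5, 2.0])      # stand-in for the network output

vcd_pred = scd / amf_pred                      # Eq. 1: VCD = SCD / AMF
```

Because the AMF sits in the denominator, an AMF that is too large for a given scene yields a VCD that is too small, which is why correcting the AMF directly corrects the retrieved column.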

Transformer-based feature extraction

The Transformer branch extracts features from our 2D predictors: measurement angles, terrain height, and the other non-profile variables. Given an input feature vector \({x}_{2d}\in {R}^{{d}_{2d}}\) (where \({d}_{2d}\) is the number of input features), the inputs are mapped by a linear transformation to a higher-dimensional latent representation of dimension h (the embedding dimension):

$${z}_{{emb}}={W}_{{emb}}{x}_{2d}+{b}_{{emb}}$$
(2)

where \({W}_{{emb}}\in {R}^{h\times {d}_{2d}}\) and \({b}_{{emb}}\in {R}^{h}\) are learnable parameters. Following embedding, the latent representation \({z}_{{emb}}\) is processed by a stack of Transformer encoder layers85,88. Each layer utilizes a multi-head self-attention mechanism to model dependencies between features within the embedded vector. For each attention head, the layer input z (\({z}_{{emb}}\) for the first layer) is linearly transformed into query (Q), key (K), and value (V) matrices:

$${\rm{Q}}={{\rm{W}}}_{{\rm{Q}}}{\rm{z}},{\rm{K}}={{\rm{W}}}_{{\rm{K}}}{\rm{z}},{\rm{V}}={{\rm{W}}}_{{\rm{V}}}{\rm{z}}$$
(3)

where \({W}_{Q},{W}_{K},{W}_{V}\in {R}^{{h}_{k}\times h}\) are projection matrices for a head dimension \({h}_{k}\). The self-attention output is computed using scaled dot-product attention:

$$\mathrm{Attention}\,({\rm{Q}},{\rm{K}},{\rm{V}})=\mathrm{softmax}\left(\frac{{\rm{Q}}{{\rm{K}}}^{{\rm{T}}}}{\sqrt{{{\rm{h}}}_{{\rm{k}}}}}\right){\rm{V}}$$
(4)

Outputs from multiple heads are concatenated and linearly projected. A residual connection is added, followed by layer normalization, which yields \({{\rm{z}}}_{{\rm{attn}}}\). Subsequently, a position-wise feed-forward network (FFN), consisting of two linear layers with a ReLU activation in between, is applied, again followed by a residual connection and layer normalization:

$${{\rm{f}}}_{2{\rm{d}}}={\rm{LayerNorm}}\left({{\rm{z}}}_{{\rm{attn}}}+{\rm{FFN}}\left({{\rm{z}}}_{{\rm{attn}}}\right)\right)$$
(5)

The resulting feature vector \({{\rm{f}}}_{2{\rm{d}}}\in {{\rm{R}}}^{{\rm{h}}}\) encapsulates globally relevant information extracted from the auxiliary parameters. The use of Transformers is motivated by their proven ability to model complex dependencies88,89 and their increasing adoption in remote sensing and Earth sciences90.
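Equations 3 to 5 can be sketched in NumPy as a single encoder layer. This is a simplified illustration, not the trained model: it uses one attention head, omits biases, dropout, and the learnable gain and shift of layer normalization, and draws random weights. The text does not state how the embedded vector is arranged into tokens, so a short sequence of n_tokens vectors of size h is assumed purely to make the attention step non-trivial.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


def encoder_layer(z, p):
    """One single-head encoder layer; z has shape (n_tokens, h)."""
    q = z @ p["W_Q"].T                             # Eq. 3, applied to each token (row)
    k = z @ p["W_K"].T
    v = z @ p["W_V"].T
    h_k = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(h_k)) @ v     # Eq. 4
    z_attn = layer_norm(z + attn @ p["W_O"].T)     # output projection, residual, layer norm
    ffn = np.maximum(z_attn @ p["W_1"].T, 0.0) @ p["W_2"].T   # Linear -> ReLU -> Linear
    return layer_norm(z_attn + ffn)                # Eq. 5


rng = np.random.default_rng(0)
n_tokens, h, h_k, h_ff = 5, 16, 16, 32             # illustrative sizes only
shapes = {"W_Q": (h_k, h), "W_K": (h_k, h), "W_V": (h_k, h),
          "W_O": (h, h_k), "W_1": (h_ff, h), "W_2": (h, h_ff)}
params = {name: rng.normal(scale=0.1, size=shape) for name, shape in shapes.items()}

z = rng.normal(size=(n_tokens, h))
f_2d = encoder_layer(z, params)                    # shape (n_tokens, h)
```

In practice a framework layer such as PyTorch's Transformer encoder would replace this hand-written version; the sketch only exposes the arithmetic behind the equations.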

FNO-based feature extraction

The second branch of our model processes vertically resolved atmospheric profiles discretized into 72 pressure layers. These profiles, although discretized, are regarded as samples from an underlying continuous function \({\rm{u}}\left({\rm{x}}\right)\). Such a continuous formulation is necessary since calculating the AMF requires integrating the profile function continuously over altitude:

$${\rm{AMF}}=\int {\rm{W}}\left({\rm{z}}\right){\rm{S}}\left({\rm{z}}\right){\rm{c}}\left({\rm{z}}\right){\rm{dz}}$$
(6)

where \(W\left(z\right)\) represents the scattering weight at altitude z; \(S\left(z\right)\) is the shape factor, calculated as \(S\left(z\right)=n\left(z\right)/\int n\left(z\right){dz}\), with \(n\left(z\right)\) being the trace gas concentration at altitude \(z\); and \(c\left(z\right)\) is the temperature correction factor. In the retrieval of NO2, c(z) is determined by the empirical relationship46 \(c\left(z\right)=1-a\left[T\left(z\right)-{T}_{\sigma }\right]+b{\left[T\left(z\right)-{T}_{\sigma }\right]}^{2}\), with \(a=0.00316\), \(b=3.39\times {10}^{-6}\), \({{\rm{T}}}_{{\rm{\sigma }}}=220\,{\rm{K}}\), and \(T\left(z\right)\) the temperature at altitude z.
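On a discrete layer grid, Eq. 6 reduces to a weighted sum over layers. The sketch below assumes that the gas profile is supplied as partial columns per layer (so that the layer thickness dz is already absorbed) and uses the temperature-correction coefficients quoted above; the function names and the two-layer example are illustrative only.

```python
import numpy as np

A, B, T_SIGMA = 0.00316, 3.39e-6, 220.0   # coefficients and reference temperature from the text


def temperature_correction(temp_k):
    """c(z) = 1 - a [T - T_sigma] + b [T - T_sigma]^2, with T in kelvin."""
    dt = np.asarray(temp_k, dtype=float) - T_SIGMA
    return 1.0 - A * dt + B * dt ** 2


def amf_from_profile(scattering_weights, partial_columns, temp_k):
    """Discrete Eq. 6: sum over layers of W_l * S_l * c_l, with S_l = n_l / sum(n_l)."""
    w = np.asarray(scattering_weights, dtype=float)
    n = np.asarray(partial_columns, dtype=float)
    shape_factor = n / n.sum()             # normalized so that the S_l sum to one
    return float(np.sum(w * shape_factor * temperature_correction(temp_k)))


# Two-layer toy profile: most of the NO2 sits in the low-sensitivity bottom layer.
amf = amf_from_profile([0.5, 1.5], [3.0, 1.0], [220.0, 220.0])   # 0.5*0.75 + 1.5*0.25 = 0.75
```

The toy example shows why the a priori profile matters: shifting NO2 toward layers with low scattering weights lowers the AMF even when the total column is unchanged.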

Because the AMF is defined by continuous integration through the column, the second branch employs an FNO to extract the global, long-range interactions inherent in the vertical profiles. The profile variables are denoted as a function \(a\left(z\right)\) defined over a vertical domain \({D}_{z}\). In practice, this function is represented by its discretized form, a vector \(a\) sampled at specific levels \(\{{z}_{1},{z}_{2},\ldots ,{z}_{n}\}\). Within the FNO branch, the input tensor is first lifted internally to a higher-dimensional latent representation \({v}_{0}\left(z\right)\). This richer representation is then passed through a sequence of four Fourier layers. Within each layer, the global dependencies along the vertical dimension z are first computed via the Fourier domain:

$${g}_{t}\left(z\right)={{\mathcal{F}}}_{{\mathcal{Z}}}^{-1}\left({R}_{t}\cdot \left({{\mathcal{F}}}_{{\mathcal{Z}}}{v}_{t}\right)\left({k}_{z}\right)\right)\left(z\right)$$
(7)

where \({{\mathcal{F}}}_{{\mathcal{Z}}}\) and \({{\mathcal{F}}}_{{\mathcal{Z}}}^{-1}\) represent the 1D Fast Fourier Transform and its inverse along the vertical dimension z, while \({R}_{t}\) is a learnable linear transformation applied in the frequency domain (modes \({k}_{z}\)). This global component \({g}_{t}\left(z\right)\) is then combined with a parallel local linear transformation \({W}_{t}\) (1D convolution in our case) acting on the layer’s input \({v}_{t}\left(z\right)\), followed by a non-linear activation σ, to yield the updated latent profile \({v}_{t+1}\left(z\right)\):

$${v}_{t+1}\left(z\right)=\sigma \left({W}_{t}{v}_{t}\left(z\right)+{g}_{t}\left(z\right)\right)$$
(8)

After four iterations (i.e., the number of FNO layers), the final output is the latent profile representation \({v}_{T}\left(z\right)\). This structure allows the model to efficiently learn correlations across different pressure levels by integrating both global context (from the Fourier path) and local features (from the \({W}_{t}\) path). To obtain a fixed-size vector representation suitable for subsequent fusion, the resulting latent representation \({v}_{T}\left(z\right)\) is aggregated using a global average operation across the layers dimension z. This aggregation step yields a single feature vector for each sample in the batch, effectively summarizing the salient information from the vertical profiles extracted by the FNO component.
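A minimal NumPy sketch of Eqs. 7 and 8, followed by the global averaging step, is given below. It is an illustration under assumptions rather than the trained model: the weights are random, the lifting step is skipped (the input is taken to be already in the latent channel space), the number of retained Fourier modes and channels is arbitrary, the local path is a kernel-size-1 convolution, and ReLU stands in for the activation σ, which the text does not name.

```python
import numpy as np


def fourier_layer(v, R, W, b):
    """One Fourier layer.

    v: (channels, n_levels) real latent profile
    R: (channels, channels, modes) complex weights acting in the frequency domain
    W: (channels, channels) local linear map; b: (channels,) bias
    """
    n_levels = v.shape[-1]
    modes = R.shape[-1]
    v_hat = np.fft.rfft(v, axis=-1)                    # forward 1D FFT along z
    out_hat = np.zeros_like(v_hat)
    # Mix channels mode by mode, keeping only the lowest `modes` frequencies.
    out_hat[:, :modes] = np.einsum("oik,ik->ok", R, v_hat[:, :modes])
    g = np.fft.irfft(out_hat, n=n_levels, axis=-1)     # Eq. 7: back to the vertical grid
    local = W @ v + b[:, None]                         # pointwise (kernel-size-1) convolution
    return np.maximum(local + g, 0.0)                  # Eq. 8 with sigma = ReLU (assumed)


rng = np.random.default_rng(0)
channels, n_levels, modes, n_layers = 8, 72, 12, 4     # 72 levels and 4 layers follow the text

v = rng.normal(size=(channels, n_levels))              # stands in for the lifted profile v_0(z)
for _ in range(n_layers):
    real = rng.normal(scale=0.1, size=(channels, channels, modes))
    imag = rng.normal(scale=0.1, size=(channels, channels, modes))
    W = rng.normal(scale=0.1, size=(channels, channels))
    v = fourier_layer(v, real + 1j * imag, W, np.zeros(channels))

f_3d = v.mean(axis=-1)                                 # global average over the vertical dimension
```

Because each retained Fourier mode spans all 72 levels, a single layer already couples the surface and the top of the column, whereas a small convolution kernel would need many stacked layers to do so.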

Attention fusion and prediction head

After feature extraction, we merge the Transformer and FNO streams with a single-head cross-attention layer. Here, the Transformer branch output acts as the query Q, while the FNO branch outputs provide both keys K and values V. We compute the attention-weighted fusion using Eq. 4. This attention-based fusion strategy allows the model to dynamically weight and incorporate the most pertinent information captured by the FNO branch (representing vertical profile features) based on the context provided by the Transformer branch (representing other input features). Although analogous to multi-head attention, our implementation uses a single head, and the resulting fused vector is passed to the final prediction network. We then feed the fused features into a two-layer MLP (Linear → ReLU → Dropout → Linear) with a Softplus activation (β = 2) to predict the intermediate AMF values.
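The fusion and prediction head can be sketched as below. This is a simplified stand-in, not the authors' implementation: the learned query, key, and value projections are omitted, dropout is left out because it is inactive at inference, all weights are random, and the layer sizes are arbitrary. The Softplus form with β = 2 follows the standard definition (1/β) log(1 + exp(βx)).

```python
import numpy as np


def softplus(x, beta=2.0):
    """Softplus with sharpness beta; the output is strictly positive."""
    return np.logaddexp(0.0, beta * np.asarray(x, dtype=float)) / beta


def cross_attention(query, keys, values):
    """Single-head scaled dot-product attention (Eq. 4) for one query vector."""
    scores = keys @ query / np.sqrt(query.shape[-1])   # one score per key
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                           # softmax over the keys
    return weights @ values


def predict_amf(f_2d, f_3d_tokens, W1, b1, W2, b2):
    """Fuse both branches, then apply Linear -> ReLU -> Linear -> Softplus."""
    fused = cross_attention(f_2d, f_3d_tokens, f_3d_tokens)
    hidden = np.maximum(W1 @ fused + b1, 0.0)
    return softplus(W2 @ hidden + b2)


rng = np.random.default_rng(0)
h, hidden_dim = 16, 32                                 # illustrative sizes only
f_2d = rng.normal(size=h)                              # Transformer branch output (query)
f_3d_tokens = rng.normal(size=(1, h))                  # pooled FNO output (keys and values)
W1, b1 = rng.normal(scale=0.1, size=(hidden_dim, h)), np.zeros(hidden_dim)
W2, b2 = rng.normal(scale=0.1, size=(1, hidden_dim)), np.zeros(1)

amf = predict_amf(f_2d, f_3d_tokens, W1, b1, W2, b2)   # shape (1,), always positive
```

The Softplus output guarantees a positive AMF, which keeps the division in Eq. 1 well defined. Note that with a single pooled FNO vector the softmax weight in this simplified version is trivially one; the function accepts any number of key vectors.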

Training procedure and loss function

We train the model end-to-end by minimizing a composite loss function (Eq. 10) that combines a data-fidelity term (\({{\mathcal{L}}}_{{vcd}}\)) with a physics-informed constraint (\({{\mathcal{L}}}_{{physics}}\)). The data-fidelity loss is defined as the Huber loss (Eq. 9) on the residual a = VCDpred − VCDtrue, with δ = 1.0. Here, VCDpred is computed by combining the satellite SCD with the model-predicted AMF, as shown in Eq. 1. To ensure stable optimization, both the satellite SCD (used to compute VCDpred) and VCDtrue are standardized (zero mean, unit variance), which makes the residual a dimensionless and prevents high-magnitude scenes from dominating the loss.

$${{\mathcal{L}}}_{\delta }\left(a\right)=\left\{\begin{array}{l}\frac{1}{2}{a}^{2}\,\,{\text{if}}\left|a\right|\le \delta \\ \delta \left(\left|a\right|-\frac{1}{2}\delta \right)\,\,\text{otherwise},\end{array}\right.$$
(9)

Concurrently, the physics loss \({{\mathcal{L}}}_{{physics}}\) also uses the Huber formulation to penalize deviations between the predicted AMF and an independent AMF reference computed via Eq. 6. The total loss is then

$${{\mathcal{L}}}_{{total}}={{\mathcal{L}}}_{{vcd}}+{{\rm{\lambda }}}_{\text{physics}}\,{{\mathcal{L}}}_{{physics}}$$
(10)

where \({{\rm{\lambda }}}_{\text{physics}}=0.005\) balances the two objectives. We determined this weight, together with the optimal number of Fourier modes, Transformer layers, hidden dimensions, and other hyperparameters, via Ray Tune91. A concise layer-by-layer specification and the full hyperparameter search space are summarized in Tables S3 and S4. This specific value of \({{\rm{\lambda }}}_{\text{physics}}\) accounts for the differing ranges of the target variables: we normalize VCD values but do not normalize AMF, so the weight ensures a balanced contribution of both terms to the composite loss.
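Equations 9 and 10 can be written out in a few lines of NumPy. The sketch assumes a mean reduction over the batch and the same δ for both terms, neither of which is stated explicitly in the text, and the function names are hypothetical.

```python
import numpy as np


def huber(a, delta=1.0):
    """Eq. 9: quadratic for |a| <= delta, linear beyond it."""
    a = np.asarray(a, dtype=float)
    abs_a = np.abs(a)
    return np.where(abs_a <= delta, 0.5 * a ** 2, delta * (abs_a - 0.5 * delta))


def total_loss(vcd_pred, vcd_true, amf_pred, amf_ref, lam=0.005, delta=1.0):
    """Eq. 10: data-fidelity term plus the weighted physics term."""
    loss_vcd = huber(vcd_pred - vcd_true, delta).mean()       # standardized columns
    loss_physics = huber(amf_pred - amf_ref, delta).mean()    # unnormalized AMF
    return loss_vcd + lam * loss_physics


# One sample: VCD residual of 0.5 (standardized), AMF residual of 1.0.
loss = total_loss(np.array([0.5]), np.array([0.0]), np.array([2.0]), np.array([1.0]))
# 0.5 * 0.5**2 + 0.005 * (0.5 * 1.0**2) = 0.125 + 0.0025 = 0.1275
```

The example also shows the scale argument made above: an AMF residual of 1.0 contributes only 0.0025 to the total, so the physics term nudges the prediction without overriding the data-fidelity term.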