Abstract
To support coordinated air quality management, this study developed a tree-based machine learning framework for multi-pollutant forecasting. We systematically evaluated the predictive performance of Random Forest (RF), Gradient Boosting Decision Tree (GBDT), and Decision Tree (DT) models for six key pollutants: PM2.5, PM10, NO2, SO2, CO, and O3, using high-resolution environmental monitoring data (10 km resolution) from China’s four major municipalities (2021–2024). A comprehensive feature system was constructed incorporating meteorology-emission interaction terms. SHapley Additive exPlanations (SHAP) values were employed to quantify feature contributions. Key findings demonstrate: (1) RF achieved optimal performance in particulate matter prediction (PM2.5: R2 = 0.99, RMSE = 0.11 µg/m3; PM10: R2 = 0.98); (2) GBDT showed comparable accuracy to RF for NO2 (R2 = 0.85) and CO (R2 = 0.98) with minimal differences (ΔR2 ≤ 0.03); (3) DT exhibited competitive O3 prediction capability (R2 = 0.88). SHAP analysis revealed critical mechanisms, such as CO’s positive synergistic effect (SHAP = 0.136) in PM2.5 prediction and O3 generation sensitivity to temperature (SHAP = 0.076). This research provides an interpretable, multi-pollutant forecasting framework applicable to urban air quality warning systems and offers model selection guidance for environmental regulation strategies.
Introduction
Atmospheric pollution control remains a critical challenge in the process of global urbanization, posing severe threats to public health1 and ecosystems2. The advancement of air quality forecasting has become increasingly imperative. With the rapid development of machine learning (ML), interdisciplinary research in air quality prediction has gained substantial momentum. For instance, studies using traditional models like WRF-Chem have demonstrated precision in simulating the physicochemical formation of PM2.5 and O33; however, their heavy dependence on emission inventories and high computational costs limit real-time early warning capabilities.
Compared to conventional models like WRF-Chem, ML has established itself as a core tool for air quality prediction by capitalizing on its strengths in handling nonlinear relationships and high-dimensional data. Given the substantial global health impacts stemming from inadequacies in air quality forecasting, researchers have actively compared different approaches. One study compared the performance of machine learning and time series models in predicting urban PM2.5 and PM10 concentrations4. Its analysis of five years of observational data from six monitoring stations in Abu Dhabi revealed that linear support vector regression delivered optimal performance for PM2.5 prediction, while the Facebook Prophet model achieved superior accuracy in both 24-hour and weekly forecasts for PM2.5 and PM10.
Conventional statistical models often fail to capture nonlinear interactions between pollutants and environmental factors. In contrast, tree-based ML approaches (e.g., Random Forest (RF), Gradient Boosting Decision Tree (GBDT), and Decision Tree (DT)) balance computational efficiency with interpretability, enabling identification of key drivers through feature contribution analysis. One study was the first to apply Bayesian Maximum Entropy blending within a Deep Ensemble Machine Learning framework5. By integrating geographical predictors with RF, XGBoost, and GBM models and three meta-models, the authors generated 2.5 km × 2.5 km resolution grid surfaces for predicting monthly maximum 1-hour ozone concentrations, achieving high accuracy even in regions with sparse monitoring networks. Another study developed AQI prediction models for Delhi using linear regression, RF, and DT regression under three scenarios with multi-year meteorological data6. Results demonstrated superior performance of DT regression and RF models across scenarios, with RF outperforming the others in 10-fold cross-validation, providing actionable insights for Delhi’s urban policymakers. A third study designed three visibility simulation schemes employing DT and RF algorithms7. Using atmospheric boundary layer meteorological data, pollutant concentrations, and surface observations to mitigate haze impacts, the authors identified optimal methods for studying boundary layer influences. Their findings revealed RF’s superior simulation performance over DT in two haze episodes.
Conventional models often analyze single pollutants in isolation while neglecting synergistic effects, whereas ML enables the integration of meteorological, chemical, and spatiotemporal heterogeneity data, offering novel approaches for multi-pollutant system modeling. In a different regional context, one study employed RF and K-Nearest Neighbors algorithms to model PM10 concentrations in Zabol, Sistan Basin (Iran) using meteorological data8. Through feature selection methods, the authors identified significant predictors and demonstrated RF’s superior performance during summer months. All models achieved accurate predictions relying solely on readily available meteorological data, providing robust support for air quality forecasting and policy formulation in the region. Another study compared three data-driven models (Gaussian Process Regression, Quantile Random Forest, and Bayesian Neural Networks) for predicting long-term spatiotemporal distributions of residential indoor PM2.59. It proposed a comprehensive framework for model comparison, validation, and attribution analysis, facilitating future research to elucidate complex nonlinear relationships between urban characteristics and indoor air pollutants, while offering insights for urban planning and indoor PM2.5 mitigation strategies.
The intensification of PM2.5-O3 co-pollution further underscores the urgency of multi-pollutant collaborative prediction. One study employed an enhanced synthetic control method and mediation effect model to quantitatively assess the impacts of winter-spring droughts on PM2.5-O3 co-pollution and its driving factors10. It revealed significant increases in daily averages and diurnal variation patterns of PM2.5 and O3 during drought periods, with elevated temperatures, reduced precipitation, and lower relative humidity identified as primary drivers exacerbating “dual-high” co-pollution risks. This approach expands research on atmospheric composite pollution under extreme weather events and provides novel insights for preventing co-pollution episodes during abnormal meteorological conditions. Another study investigated pollution characteristics and health impacts of ozone and its precursors in the Hexi Corridor through land-use change analysis and BenMAP-CE software11. Its findings indicate that cropland expansion (primarily from grassland conversion) correlates strongly with ozone pollution influenced by meteorological factors and vegetation patterns, while ozone emerges as a major contributor to premature cardiovascular mortality. The authors propose targeted measures for Wuwei City, including controlling pollution transport during high-ozone periods and implementing coordinated management of VOCs and NOx emissions to mitigate health risks.
Despite technological advancements, existing research predominantly focuses on single-pollutant prediction (e.g., PM2.512 or SO213), lacking systematic multi-pollutant and multi-model comparisons. Furthermore, while SHAP (SHapley Additive exPlanations) values have been employed to interpret ML models14, their environmental science applications remain superficial. Most studies prioritize model accuracy over standardized interpretative frameworks, thereby limiting policy implementation feasibility.
This study aims to systematically compare predictive performance for six pollutants (PM2.5, PM10, NO2, SO2, CO, and O3) through a tree-based modeling framework incorporating RF, DT, and GBDT. By employing standardized SHAP workflows to quantify feature contributions and identify optimal model-pollutant pairings, we establish an evidence-based foundation for atmospheric pollution early-warning systems through integrated meteorological data and emission inventories.
Materials and methods
This study followed a standardized machine learning workflow for model development and evaluation. The process commenced with data collection and preprocessing, followed by comprehensive feature engineering to construct the input variables. Subsequently, a two-stage feature selection was implemented to identify the most predictive features. Three tree-based models were then constructed, optimized, and rigorously evaluated using a time-series cross-validation protocol. Finally, model interpretability was analyzed using SHAP values to elucidate the key drivers of pollutant concentrations.
Data description and preprocessing
The dataset comprises 5844 independent observational records (four municipalities × 1461 days) encompassing six major air pollutants and six meteorological variables for 2021–2024. Each record is an hourly observation taken at 0000 UTC daily. The four municipalities span three major climate zones in eastern China. Beijing (39° 54′ N, 116° 23′ E) has a temperate monsoon climate with significant PM2.5 pollution during the winter heating period. Shanghai (31° 14′ N, 121° 29′ E) has a subtropical monsoon climate with prominent ozone pollution in summer. Tianjin (39° 08′ N, 117° 12′ E) experiences significant sea-land wind circulation and compounded pollution from heavy industrial emissions and port transportation. Chongqing (29° 33′ N, 106° 30′ E) lies in basin topography with static, stable weather and active secondary aerosol generation. Temporal covariates include year, month, and day, with geographical variations encoded through the categorical variable “City”. The spatial resolution is 10 km. After quality control, the dataset contains no missing values or outliers: automated anomaly detection based on the Interquartile Range (IQR) method flagged values beyond 1.5× IQR, which were replaced by linear interpolation. Environmental monitoring data often exhibit non-normal distributions, particularly for pollutant concentrations, which are typically right-skewed due to the influence of pollution episodes15. To enhance model stability and performance, all numerical input features were normalized using Z-score standardization, which mitigates the influence of extreme values and differing scales without requiring strict adherence to normality assumptions. Key dataset features are summarized in Table 1. All air quality data and meteorological parameters were obtained from the China National Environmental Monitoring Centre (http://www.cnemc.cn/sssj/).
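The IQR-based quality control described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that "beyond 1.5× IQR" refers to the usual Tukey fences (Q1 − 1.5·IQR, Q3 + 1.5·IQR), and the function name and sample readings are hypothetical.

```python
import pandas as pd


def replace_iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Mask values outside [Q1 - k*IQR, Q3 + k*IQR] and fill them by linear interpolation."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Values outside the fences become NaN; everything else is kept unchanged.
    masked = series.where(series.between(lower, upper))
    # limit_direction="both" also fills flagged values at the start or end of the series.
    return masked.interpolate(method="linear", limit_direction="both")


# Hypothetical daily PM2.5 readings containing one spike
pm25 = pd.Series([10.0, 12.0, 11.0, 200.0, 13.0, 12.0, 11.0])
cleaned = replace_iqr_outliers(pm25)
print(cleaned.tolist())
```

In a multi-city dataset the function would presumably be applied per city and per variable, so that interpolation never bridges two different stations.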
To eliminate dimensional discrepancies among numerical variables and enhance model performance, Z-score normalization16 was applied to all numerical variables (excluding city/time covariates). Following \(z=\frac{x-\mu}{\sigma}\), each numerical variable was rescaled to have mean = 0 and standard deviation = 1,
where x represents the original observed value, µ denotes the variable mean, and σ indicates the standard deviation. This normalization enables comparative analysis of variables with disparate units and magnitudes, thereby improving model fitting accuracy, prediction precision, and generalization capacity.
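As a small worked example of the transformation z = (x − µ)/σ, the sketch below standardizes each column of a toy array. The use of the population standard deviation (ddof = 0) is an assumption, since the text does not specify it, and the sample values are hypothetical.

```python
import numpy as np


def zscore(x: np.ndarray) -> np.ndarray:
    """Column-wise z-score: subtract the mean and divide by the standard deviation."""
    mu = x.mean(axis=0)
    sigma = x.std(axis=0)  # population standard deviation (ddof=0); an assumption
    return (x - mu) / sigma


# Hypothetical columns: temperature (degrees C) and wind speed (m/s)
X = np.array([[20.0, 1.0],
              [30.0, 3.0],
              [40.0, 5.0]])
Z = zscore(X)
print(Z.mean(axis=0), Z.std(axis=0))
```

In a forecasting setting, µ and σ would typically be estimated on the training portion only and then reused for the test portion; the text does not state whether this was done.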
For the “City” categorical variable, one-hot encoding17 was implemented to prevent misleading numerical interpretations. This method converts each city into a binary dummy variable, where a value of 1 indicates the sample belongs to that city and 0 otherwise. For instance, four dummy variables were created for Beijing, Shanghai, Tianjin, and Chongqing, enabling precise identification of inter-city air quality variations without introducing spurious correlations from ordinal encoding.
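One way to produce the four city dummy variables is sketched below with pandas. The column names and the sample rows are illustrative assumptions rather than the study's actual schema.

```python
import pandas as pd

# Hypothetical records; only the "City" column matters for the encoding
df = pd.DataFrame({
    "City": ["Beijing", "Shanghai", "Tianjin", "Chongqing", "Beijing"],
    "PM2.5": [35.0, 28.0, 40.0, 33.0, 50.0],
})

# Each city becomes its own 0/1 column: City_Beijing, City_Chongqing, ...
encoded = pd.get_dummies(df, columns=["City"], prefix="City", dtype=int)
city_columns = [c for c in encoded.columns if c.startswith("City_")]
print(encoded[city_columns])
```

Every row has exactly one dummy set to 1, so no ordering among the cities is implied.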
Feature engineering
To fully exploit inherent periodic patterns in temporal variables, sine-cosine transformation encoding was implemented. Following \(\sin\left(\frac{2\pi x}{T}\right)\) and \(\cos\left(\frac{2\pi x}{T}\right)\), each periodic time variable was converted into a two-dimensional sinusoidal vector,
where x represents the Day or Month value, with T = 31 for daily cycles (maximum days per month) and T = 12 for monthly cycles. This encoding enables models to explicitly capture day-of-month and seasonal periodicity in air quality variations.
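The sine-cosine encoding can be illustrated with a few lines of NumPy. The helper name is hypothetical; the periods T = 12 and T = 31 are those given in the text.

```python
import numpy as np


def cyclical_encode(x, period):
    """Map a periodic variable x onto the unit circle as (sin, cos) of 2*pi*x/period."""
    angle = 2.0 * np.pi * np.asarray(x, dtype=float) / period
    return np.sin(angle), np.cos(angle)


month_sin, month_cos = cyclical_encode([1, 6, 12], period=12)
day_sin, day_cos = cyclical_encode([1, 15, 31], period=31)

# Distance on the circle: December is close to January, June is far from it
jan = np.array([month_sin[0], month_cos[0]])
jun = np.array([month_sin[1], month_cos[1]])
dec = np.array([month_sin[2], month_cos[2]])
print(np.linalg.norm(dec - jan), np.linalg.norm(jun - jan))
```

The point of the two-dimensional form is visible in the distances: with a plain integer month, December (12) and January (1) would be the most distant pair, whereas on the circle they are neighbors.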
Leveraging the autoregressive properties of pollutant concentrations, lagged features were constructed for the five primary pollutants (excluding ozone due to its distinct photochemical characteristics). Lagged intervals of (1, 3, 6, 12, 24) hours were created to capture temporal persistence and autocorrelation, enhancing model learning of short-term trends and cyclical patterns.
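A possible construction of the lagged features is sketched below. It assumes a long-format table with hypothetical "City" and "Date" columns and treats each lag as a shift of k records within a city; how these record shifts map to the hourly intervals quoted above depends on the sampling of the underlying data, which is not reproduced here.

```python
import pandas as pd

LAGS = (1, 3, 6, 12, 24)
POLLUTANTS = ("PM2.5", "PM10", "NO2", "SO2", "CO")  # ozone excluded, as in the text


def add_lag_features(df, pollutants=POLLUTANTS, lags=LAGS):
    """Add <pollutant>_lag<k> columns, shifting within each city so series never mix."""
    out = df.sort_values(["City", "Date"]).copy()
    for col in pollutants:
        for k in lags:
            out[f"{col}_lag{k}"] = out.groupby("City")[col].shift(k)
    return out


# Hypothetical two-city table with 30 records per city
dates = pd.date_range("2021-01-01", periods=30, freq="D")
demo = pd.DataFrame({
    "City": ["Beijing"] * 30 + ["Shanghai"] * 30,
    "Date": list(dates) * 2,
})
for i, col in enumerate(POLLUTANTS):
    demo[col] = [float(v + i) for v in range(60)]

lagged = add_lag_features(demo)
print(lagged[["City", "Date", "PM2.5", "PM2.5_lag1", "PM2.5_lag3"]].head())
```

The first k rows of each city have no history and are left as NaN; they would need to be dropped or imputed before model fitting.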
Synergistic effects between variables were modeled through four engineered features, supported by prior studies on atmospheric processes18,19:
- Temp × Wind: product of temperature and wind speed, quantifying atmospheric dispersion capacity.
- Humidity × SO2: humidity-SO2 interaction, reflecting hygroscopic effects on secondary particle formation.
- PM2.5_div_Wind: ratio of PM2.5 to wind speed, characterizing particulate dispersion efficiency.
- Composite Pollution Index: weighted combination of PM2.5 and NO2 concentrations (0.6×PM2.5+0.4×NO2), representing oxidative stress potential.
These interaction terms enrich feature space representation, enabling improved modeling of complex air quality dynamics.
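The engineered features listed above reduce to simple column arithmetic. The sketch below is illustrative only: the input column names ("Temp", "Wind", "Humidity") and the small constant that guards against division by zero are assumptions, and the text does not say whether the terms were computed before or after standardization.

```python
import numpy as np
import pandas as pd

EPS = 1e-6  # hypothetical guard against zero wind speed; not specified in the text


def add_interaction_features(df: pd.DataFrame) -> pd.DataFrame:
    """Append the interaction, ratio, and composite-index features to a copy of df."""
    out = df.copy()
    out["Temp_x_Wind"] = out["Temp"] * out["Wind"]
    out["Humidity_x_SO2"] = out["Humidity"] * out["SO2"]
    out["PM2.5_div_Wind"] = out["PM2.5"] / (out["Wind"] + EPS)
    out["Composite_Index"] = 0.6 * out["PM2.5"] + 0.4 * out["NO2"]
    return out


# One hypothetical record
sample = pd.DataFrame({
    "Temp": [10.0], "Wind": [2.0], "Humidity": [50.0],
    "SO2": [4.0], "PM2.5": [30.0], "NO2": [20.0],
})
features = add_interaction_features(sample)
print(features.iloc[0].to_dict())
```

For this record the composite index is 0.6 × 30 + 0.4 × 20 = 26.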
Feature importance screening
A two-stage feature selection strategy was implemented to reduce dimensionality while enhancing model efficiency and generalizability. First, high-correlation filtering15 was conducted by calculating Pearson correlation coefficients between features, removing those with coefficients > 0.8 to mitigate multicollinearity and information redundancy. Subsequently, feature importance evaluation was performed using a RF model configured with 200 trees for stability. The quantile threshold method retained the top 15% critical features, ensuring preservation of the most valuable predictors for target pollutant estimation.
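The two-stage screening can be approximated with pandas and scikit-learn as shown below. This is a sketch under stated assumptions: the absolute Pearson coefficient is used, the later column of each highly correlated pair is the one dropped, and the synthetic data are purely illustrative; none of these details is specified in the text.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def drop_correlated(X: pd.DataFrame, threshold: float = 0.8) -> pd.DataFrame:
    """Stage 1: drop any feature whose |Pearson r| with an earlier feature exceeds threshold."""
    corr = X.corr(method="pearson").abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > threshold).any()]
    return X.drop(columns=to_drop)


def select_top_features(X, y, top_fraction=0.15, random_state=0) -> pd.Series:
    """Stage 2: keep features whose RF importance lies in the top `top_fraction` quantile."""
    rf = RandomForestRegressor(n_estimators=200, random_state=random_state)
    rf.fit(X, y)
    importances = pd.Series(rf.feature_importances_, index=X.columns)
    cutoff = importances.quantile(1.0 - top_fraction)
    return importances[importances >= cutoff].sort_values(ascending=False)


# Synthetic demonstration: f0 drives the target, f0_copy is nearly collinear with f0
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(300, 20)), columns=[f"f{i}" for i in range(20)])
X["f0_copy"] = X["f0"] + 0.01 * rng.normal(size=300)
y = 3.0 * X["f0"] + 0.1 * rng.normal(size=300)

X_filtered = drop_correlated(X)
selected = select_top_features(X_filtered, y)
print(selected)
```

Because the screening is supervised, it would presumably be repeated for each target pollutant, which is consistent with the pollutant-specific feature sets shown in Figure 1.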
Fig. 1. Importance of key features of the six target pollutants.
Figure 1 illustrates the key features for each of the six target pollutants. The importance scores calculated by the Random Forest model20 show the contribution of each feature to the prediction of the different pollutants. Each subfigure lists the key features of the corresponding pollutant and their importance ranking, highlighting the role of the information retained after feature screening in supporting the model’s predictive ability.
Predictive model construction
Three tree-based algorithms (RF, GBDT, and DT) were selected for their demonstrated efficacy in handling nonlinear relationships, high-dimensional data, and model interpretability in environmental studies7,21. RF and GBDT are ensemble methods capable of capturing complex feature interactions, while DT provides a baseline model with high transparency for rule extraction. The modeling framework was developed using Python within the PyCharm IDE. All experiments were conducted in Python 3.9 with scikit-learn 1.2.2. As a Bagging-based ensemble algorithm, RF constructs multiple decorrelated decision trees through bootstrap sampling and feature subset splitting, then aggregates their predictions (averaging for regression, majority voting for classification) to enhance model generalization22. GBDT adopts a Boosting framework, sequentially building weak learners with gradient-optimized residuals, demonstrating strong local feature learning capacity23. As a fundamental tree model, DT employs recursive dichotomy to establish rule libraries through information entropy minimization24.
Key model configurations and hyperparameters are detailed in Table 2. The dataset was temporally split into training (80%) and test (20%) sets using time-series-aware partitioning, which preserves chronological dependencies and enables models to effectively learn temporal patterns and evolutionary trends. The hyperparameters are as follows:
- Number of base learners (n_estimators): controls ensemble size in tree-based models.
- Maximum tree depth (max_depth): prevents overfitting by constraining model complexity.
- Feature subspace strategy (max_features): proportion for random feature selection.
- Shrinkage factor (learning_rate): modifies contribution weights of sequential trees.
- Stochastic subsampling (subsample): fraction of samples used for boosting iterations.
- Minimum samples for node splitting (min_samples_split): halting criterion for tree growth.
- Random seed (random_state): ensures experimental reproducibility.
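The 80/20 chronological partition described above might be implemented along the following lines. The "Date" column and the helper name are hypothetical; the essential point is that rows are ordered in time and never shuffled.

```python
import pandas as pd


def chronological_split(df: pd.DataFrame, date_col: str = "Date", train_fraction: float = 0.8):
    """Sort by time and cut once, so every training row precedes every test row."""
    ordered = df.sort_values(date_col).reset_index(drop=True)
    cut = int(len(ordered) * train_fraction)
    return ordered.iloc[:cut], ordered.iloc[cut:]


# Hypothetical ten-day series
demo = pd.DataFrame({
    "Date": pd.date_range("2021-01-01", periods=10, freq="D"),
    "PM2.5": [float(v) for v in range(10)],
})
train_df, test_df = chronological_split(demo)
print(len(train_df), len(test_df))
```

When several cities share the same dates, cutting on a date boundary rather than a row count would avoid placing records from one day on both sides of the split.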
Model optimization employed grid search with time-series cross-validation via TimeSeriesSplit(n_splits = 3), ensuring temporal generalization across sliding windows. This systematic hyperparameter tuning process explored the predefined parameter space to identify optimal combinations that minimize root mean square error25.
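A grid search combined with TimeSeriesSplit(n_splits = 3) could be set up in scikit-learn roughly as follows. The parameter grids are deliberately small, hypothetical examples and do not reproduce the search space of Table 2; the synthetic data stand in for the time-ordered training set.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.tree import DecisionTreeRegressor

# Hypothetical grids for illustration only
MODELS = {
    "RF": (RandomForestRegressor(random_state=42),
           {"n_estimators": [50], "max_depth": [None, 10], "max_features": ["sqrt"]}),
    "GBDT": (GradientBoostingRegressor(random_state=42),
             {"n_estimators": [50], "learning_rate": [0.05, 0.1],
              "max_depth": [3, 5], "subsample": [0.8]}),
    "DT": (DecisionTreeRegressor(random_state=42),
           {"max_depth": [5, 8], "min_samples_split": [10, 20]}),
}


def tune(name, X, y):
    """Grid search scored by (negative) RMSE, with folds that respect time order."""
    estimator, grid = MODELS[name]
    search = GridSearchCV(estimator, grid, cv=TimeSeriesSplit(n_splits=3),
                          scoring="neg_root_mean_squared_error")
    search.fit(X, y)
    return search


# Synthetic, time-ordered stand-in for the training set
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

search = tune("DT", X, y)
print(search.best_params_, -search.best_score_)
```

In each TimeSeriesSplit fold the validation block lies strictly after the training block, so the tuned hyperparameters are never evaluated on data that precede their training window.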
Model evaluation protocol
A comprehensive evaluation framework was established using three metrics:
- Root mean squared error (RMSE): emphasizes larger prediction errors, critical for assessing extreme value forecasting.
- Mean absolute error (MAE): quantifies absolute deviation magnitude between predictions and observations.
- Coefficient of determination (R2): measures explained variance (0–1 scale), where values approaching 1 indicate superior model fit.
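For reference, the three metrics can be written out directly from their standard definitions. The sketch below uses NumPy and a small hypothetical set of observations and predictions.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def mae(y_true, y_pred):
    """Mean absolute error: mean of the absolute residuals."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_true - y_pred)))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Hypothetical observations and predictions
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]
print(rmse(y_true, y_pred), mae(y_true, y_pred), r2(y_true, y_pred))
```

Here the two residuals of 0.5 give MAE = 0.25 and RMSE ≈ 0.354, and R2 = 1 − 0.5/5.0 = 0.9; RMSE exceeds MAE because squaring weights the larger residuals more heavily.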
SHAP values were computed via TreeExplainer, grounded in cooperative game theory’s Shapley value concept26. This approach quantifies individual feature contributions to model predictions at two levels:
- Global interpretability: aggregate SHAP values across all samples.
- Local interpretability: instance-specific contribution analysis.
Feature importance ranking was derived from absolute SHAP values, enabling identification of dominant predictors and their directional impacts (positive/negative) on pollutant concentrations.
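The aggregation step can be illustrated without any model at all. The sketch below takes a hypothetical SHAP matrix (samples × features) and derives a global ranking from the mean absolute value, a direction from the signed mean, and a local explanation from a single row. In practice the matrix would come from a call such as `shap.TreeExplainer(model).shap_values(X)`; the numbers and the helper name here are invented for illustration.

```python
import numpy as np


def summarize_shap(shap_values, feature_names):
    """Rank features by mean |SHAP| and report the signed mean as the direction of effect."""
    shap_values = np.asarray(shap_values, dtype=float)
    mean_abs = np.abs(shap_values).mean(axis=0)   # global importance
    mean_signed = shap_values.mean(axis=0)        # average direction (positive or negative)
    order = np.argsort(mean_abs)[::-1]
    return [(feature_names[i], float(mean_abs[i]), float(mean_signed[i])) for i in order]


# Hypothetical SHAP matrix: 3 samples x 3 features
names = ["CO", "2m_RH", "2m_Temp"]
demo_shap = np.array([[0.2, -0.1, 0.0],
                      [0.4, -0.3, 0.1],
                      [0.3, -0.2, -0.1]])

ranking = summarize_shap(demo_shap, names)       # global view
local = dict(zip(names, demo_shap[0].tolist()))  # local view for the first sample
print(ranking)
print(local)
```

The two summaries answer different questions: a feature can have a large mean absolute SHAP value and a signed mean near zero if it pushes predictions up for some samples and down for others.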
Results
Model performance comparison
The optimized hyperparameter configurations for each pollutant are summarized in Table 3. The results reveal significant variations in optimal parameters across pollutants, reflecting their distinct data distributions and feature interaction complexities. For RF, PM2.5 and PM10 predictions utilized unrestricted tree depth (max_depth = None) and square-root feature selection (max_features=’sqrt’), whereas NO2 and O3-8 h predictions required constrained tree depth (max_depth = 10) to mitigate overfitting risks. GBDT adopted lower learning rates (learning_rate = 0.05) for gaseous pollutants (SO2 and CO) but relied on deeper trees (max_depth = 5) for particulate matter (PM2.5 and PM10). DT employed a minimum sample split threshold of 10 for SO2 prediction and 20 for other pollutants, indicating the necessity for pollutant-specific splitting criteria.
Figure 2 compares the predictive performance of the three models across six pollutants. For PM2.5 prediction, the RF model achieved optimal performance with post-tuning metrics of RMSE = 0.11 µg/m3, MAE = 0.07 µg/m3, and R2 = 0.99, representing reductions of 57.7% (RMSE) and 59.2% (MAE) compared to pre-tuning, alongside a 6.0% improvement in R2. GBDT and DT models exhibited higher errors (RMSE = 0.23 and 0.27 µg/m3, respectively) and lower R2 values (0.95 and 0.93). In PM10 prediction, RF maintained superiority (RMSE = 0.13 µg/m3, MAE = 0.08 µg/m3, R2 = 0.98), with error reductions of 42.5% (RMSE) and 55.1% (MAE) relative to baseline models. GBDT and DT models achieved R2 values of 0.93 and 0.90, confirming the efficacy of hyperparameter tuning in enhancing generalization.
For gaseous pollutants, the RF model demonstrated strong performance: NO2 prediction yielded RMSE = 0.34 µg/m3 (32.6% reduction) and R2 = 0.88, while SO2 prediction achieved near-perfect accuracy (RMSE = 0.07 µg/m3, 70.3% error reduction; R2 = 0.99). In CO prediction, both RF and GBDT attained R2 = 0.98, with MAE values of 0.04 and 0.05 µg/m3, respectively. For O3-8 h prediction, RF outperformed other models (R2 = 0.92 vs. GBDT = 0.86 and DT = 0.88), reducing RMSE by 49.6% post-tuning.
Fig. 2. Performance comparison of three models across six pollutants.
RF achieved the best performance for PM2.5 (R2 = 0.99 ± 0.01) and PM10 (R2 = 0.98 ± 0.02), while GBDT showed comparable accuracy for NO2 (R2 = 0.85 ± 0.03) and CO (R2 = 0.98 ± 0.01). DT performed competitively for O3 (R2 = 0.88 ± 0.04).
Feature importance analysis
Based on the optimized RF, GBDT, and DT models, combined with SHAP value analysis, the key drivers and their contribution patterns for the six pollutants are illustrated in Figs. 3, 4 and 5. The x-axis represents SHAP values (positive values indicate increased prediction, negative values indicate decreased prediction). Point colors represent feature values (red: high, blue: low). Dense point clusters indicate regions of high feature influence.
Fig. 3. DT-SHAP value (impact on model output).
Fig. 4. GBDT-SHAP value (impact on model output).
Fig. 5. RF-SHAP value (impact on model output).
For PM10 prediction, the DT model revealed significant positive synergy between PM2.5 (SHAP = 0.0772) and NO2 (SHAP = 0.0580), with high-value regions (red points) densely distributed in the 0.5–2.0 SHAP range (Fig. 3a). The lagged effect of PM10_lag1 (SHAP = 0.0614) and its extreme value (1.62) indicated persistent historical pollution influences. In the GBDT model, PM2.5 exhibited dominant SHAP values (2.34), exceeding all other features (Fig. 4a), reflecting secondary aerosol formation as the primary mechanism. The long-tail distribution of PM10_lag1 (up to 1.90) further validated the predictability of extreme pollution events. For the RF model, PM2.5 demonstrated elevated impact thresholds (SHAP > 1.0), with high-density clusters (1.0–1.49) in critical regions (Fig. 5a). The negative regulatory role of humidity (2m_RH, SHAP=-0.0510) remained consistent across all three models.
In PM2.5 prediction, the DT model identified CO (mean SHAP = 0.1364) and PM10 (SHAP = 0.0519) as key positive drivers, with high-value clusters (> 0.5 SHAP) forming dense red-point groups in the distribution plots (Fig. 3b). SO2 displayed a dual-mode distribution (− 0.45–1.48 SHAP), with significant inhibitory effects in low-concentration ranges (− 0.45–0.0). The GBDT model showed narrowed influence ranges for CO (− 0.50–0.99 SHAP) but increased high-value cluster density (0.2–0.98 SHAP, Fig. 4b), confirming its robustness. Extreme PM10 responses (SHAP = 1.91) and historical concentrations (PM2.5_lag1, SHAP = 0.0070) jointly governed pollution accumulation. In the RF model, PM10 exhibited contradictory directional impacts, with dense blue clusters in low-value regions (− 0.65–0.0 SHAP) coexisting with extreme high-value responses (1.67 SHAP, Fig. 5b). The PM2.5_div_Wind feature demonstrated a positive threshold effect (0.3–0.57 SHAP), indicating suppressed dispersion when wind speeds exceeded 4 m/s.
For SO2 prediction, the DT model identified SO2_lag1 as the core driver (SHAP = 0.2006), with dense red points in its high-value range (0.02–0.31 SHAP, Fig. 3c). The negative influences of CO (SHAP = −0.0661) and PM2.5 (SHAP = −0.0459) were perturbed by extreme values (CO SHAP = 2.27). In the GBDT model, the extreme intensity of SO2_lag1 decreased (1.03 SHAP), but its temporal dependency persisted (SHAP = 0.1491). The humidity interaction term (Humidity × SO2) maintained a negative mean value (− 0.0104) despite exhibiting extreme responses (1.24 SHAP). For the RF model, the influence range of SO2_lag1 contracted (− 0.20–0.81 SHAP), with reduced sensitivity (SHAP = 0.1318). PM2.5’s symmetrical distribution (− 0.52–1.31 SHAP) reinforced its generalized inhibitory role (mean SHAP = − 0.0101).
In NO2 prediction, the DT model revealed significant positive synergy between PM2.5 (SHAP = 0.1238) and CO (SHAP = 0.0754), though CO exhibited extreme negative impacts (− 1.91 SHAP), indicating emission source heterogeneity (Fig. 3d). The GBDT model highlighted threshold responses from NO2_lag3 (SHAP = 0.49, Fig. 4d), validating temporal dependency in pollution accumulation (SHAP = 0.0436). CO’s positive dominance intensified (mean SHAP = 0.1393). In the RF model, CO’s influence range narrowed (− 0.96–0.59 SHAP) but retained a high mean value (0.1224). Temperature (2m_Temp.) displayed contradictory directional impacts across models (DT: SHAP = − 0.0330 vs. RF: SHAP = 0.0106), reflecting uncertainties in meteorology-chemistry coupling mechanisms.
For CO prediction, the DT model revealed coexisting negative dominance (SHAP = −0.1490) and extreme positive responses (SHAP = 2.21) from NO2 (Fig. 3e), suggesting scenario-dependent photochemical processes. Temperature exhibited significant positive thermal driving effects (SHAP = 0.0374). In the GBDT model, NO2’s extreme intensity slightly decreased (2.20→2.1951 SHAP) but achieved improved distribution stability. The negative influence of humidity (2m_RH) intensified, with increased point density in the − 0.36–0.0 SHAP range. For the RF model, NO2’s inhibitory effects strengthened (SHAP = −0.1309), and its extreme value range narrowed (− 0.80–1.85 SHAP). Temperature sensitivity diminished (SHAP = 0.0163), reflecting model smoothing mechanisms.
In O3 prediction, the DT model identified temperature (SHAP = 0.0769) and CO (SHAP = 0.0835) as dominant drivers of ozone formation, with dense high-value red clusters (Fig. 3f). PM10 displayed contradictory impacts, combining a negative mean (SHAP = −0.0059) with localized positive extremes (SHAP = 0.63). The GBDT plot understated CO’s true influence (SHAP = 0.5068) because of axis truncation (displayed range: − 0.5–0.5 SHAP). Temperature’s effect range contracted (− 0.62–0.80 SHAP). For the RF model, temperature exhibited the strongest positive contribution (SHAP = 0.0594), though values > 0.5 SHAP were not visualized. CO’s influence range (− 1.66–0.48 SHAP) decoupled from its mean effect (SHAP = 0.1139).
SHAP analysis identified the top drivers for each pollutant:
- PM2.5: CO (SHAP = 0.136) and PM2.5_div_Wind (threshold > 4 m/s).
- O3: Temperature (SHAP = 0.076) and CO (SHAP = 0.084).
- NO2: CO (SHAP = 0.139) and NO2_lag3 (SHAP = 0.044).
Error pattern analysis
The comprehensive comparison of multi-model prediction performance (Fig. 6) revealed significant differences in error characteristics across pollutant-model combinations. Pollutants are arranged vertically as PM2.5, PM10, NO2, SO2, CO, and O3 (top to bottom). All axes use standardized concentration units (µg/m3 for particulates, ppm for CO, ppb for O3).
Fig. 6. Multi-model prediction performance comparison.
PM2.5 prediction results (Fig. 6a). The DT model exhibited fan-shaped dispersion in medium concentration ranges (2.0 ≤ Actual ≤ 3.5 µg/m3) with MAE = 0.19. The RF model achieved optimal fitting (R2 = 0.99, MSE = 0.01), while the GBDT model showed a 23% lag rate in predicting high-concentration samples (> 3.5 µg/m3), yielding RMSE = 0.23.
PM10 prediction results (Fig. 6b). For PM10, the DT model formed stepped error bands (MAE = 0.22) at abrupt concentration gradients (1.5–2.5 µg/m3). The RF model maintained stable low-error performance (MSE = 0.02), whereas the GBDT model introduced systematic bias (Δ=+0.18) in high-concentration segments (> 3.5 µg/m3).
NO2 prediction results (Fig. 6c). The DT model displayed bimodal error distributions (R2 = 0.78). The RF model effectively mitigated overfitting in high-concentration samples (max_features = 0.33), and the GBDT model produced prediction plateaus (MAE = 0.30) within the 1.5–2.5 µg/m3 range.
SO2 prediction results (Fig. 6d). In SO2 prediction, the DT model generated a 0.25 µg/m3 error band within the 1.2–2.0 µg/m3 range. The RF model achieved near-perfect fitting (R2 = 0.99, MSE = 0.01), while the GBDT model showed sample compression effects (> 2.5 µg/m3, RMSE = 0.18) due to tree depth constraints (max_depth = 3).
CO prediction results (Fig. 6e). For CO, the DT model induced stepwise errors (MAE = 0.05) near the concentration inflection point (≈ 1.2 ppm). The RF model attained instrumental-level precision (RMSE = 0.12), whereas the GBDT model systematically overestimated (> 2.0 ppm) by 5%.
O3 prediction results (Fig. 6f). The DT model formed dense error bands (R2 = 0.88) in the 1.8–2.2 ppb range. The RF model best captured photochemical nonlinearity (n_estimators = 200), while the GBDT model produced peak-shaving errors (Δ=-0.38) for samples > 2.5 ppb.
Discussion
Environmental mechanisms underlying model performance differences
This study demonstrates that RF achieves optimal performance in predicting particulate matter (PM2.5, PM10) and SO2 (R2 = 0.99), while GBDT exhibits comparable accuracy to RF for NO2 and CO (ΔR2 < 0.05). These findings partially align with recent research conclusions. For example, one study identified RF’s superiority in modeling nonlinear relationships of PM2.5, attributed to its ability to capture meteorological-chemical coupling effects in secondary aerosol formation21. However, in contrast to the excellent performance of GBDT in predicting gaseous pollutants reported elsewhere27, this study finds that RF achieves high precision for SO2 (RMSE = 0.07 µg/m3), which may stem from its capability to characterize spatial heterogeneity in emission inventories. Future work will incorporate high-resolution emission data to validate this hypothesis.
The performance disparities can be attributed to the complexity of pollutant formation mechanisms: particulate matter is governed by multiphase reactions requiring high-dimensional nonlinear feature processing, whereas gaseous pollutants (e.g., NO2) are influenced by spatiotemporal heterogeneity in point-source emissions, demanding stronger feature interaction modeling capabilities28. RF’s ensemble structure inherently addresses the former through parallelized nonlinear fitting, while GBDT’s sequential error correction adapts better to the latter’s localized emission dynamics.
The observed performance disparities among tree-based models stem from fundamental differences in pollutant formation mechanisms and algorithm learning characteristics. For particulate matter prediction, RF’s ensemble architecture demonstrated superior capability in capturing nonlinear interactions between secondary aerosol precursors and meteorological parameters. Specifically, the model achieved R2 = 0.99 for PM2.5 prediction by effectively resolving the photochemical coupling between CO (SHAP = 0.136) and humidity-driven sulfate suppression (SHAP = −0.102). This aligns with field observations of nitrate-dominated particle growth under stagnant conditions29, where RF’s bootstrap aggregating mechanism mitigated overfitting risks from collinear features (PM2.5_lag24 vs. PM10_lag12: r = 0.89).
Analysis of synergistic effects among key drivers
SHAP analysis revealed asymmetric synergistic effects among pollutants. In DT-based PM2.5 prediction, the positive contribution of CO (SHAP = 0.136) validated the role of incomplete combustion products as precursors of secondary organic aerosols19, while the inhibitory effect of SO2 (SHAP = −0.102) may relate to limited hygroscopic growth of sulfate aerosols. For DT-based O3 prediction, the dominant positive influence of temperature (SHAP = 0.076) aligns with photochemical generation mechanisms, yet its synergy with CO (SHAP = 0.084) indicates indirect impacts from non-methane volatile organic compound (NMVOC) precursors, consistent with observational ozone formation potential analyses in a previous study30. Notably, the threshold effect of the PM2.5_div_Wind feature (SHAP > 0.3) suggests that when wind speeds exceed 4 m/s, turbulent mixing surpasses advective dispersion, leading to localized accumulation of fine particles, a phenomenon corroborated by satellite retrievals during pollution episodes in the Beijing-Tianjin-Hebei region18.
Model-pollutant compatibility and algorithm characteristics
This study systematically compares the predictive performance of RF, GBDT, and DT across six pollutants, revealing intrinsic compatibility between algorithmic properties and pollutant formation mechanisms. Key findings demonstrate RF’s superiority for particulate matter (PM2.5, PM10) and SO2 predictions (R2 > 0.98), GBDT’s competitive accuracy for NO2 (ΔR2 = 0.03 vs. RF), and DT’s unexpected competence in O3 prediction (R2 = 0.88). These outcomes are fundamentally linked to pollutant-specific data characteristics and algorithmic design principles.
Particulate matter prediction
RF’s dominance in PM2.5/PM10 prediction stems from its ensemble learning capacity to resolve high-dimensional nonlinear interactions. The secondary aerosol conversion and meteorological dispersion governing PM2.5 formation require modeling complex dependencies among lagged features (e.g., PM2.5_lag24) and meteorological interactions (PM2.5_div_Wind), which ranked among the top three in SHAP importance (Fig. 5a). RF’s bootstrap sampling and feature subset partitioning mitigate multicollinearity impacts (e.g., PM2.5-PM10 correlation r = 0.82) on model stability. In contrast, GBDT exhibits 23% prediction lag for high-concentration PM2.5 samples (> 3.5 µg/m3, Fig. 6a), likely because the sequential residual fitting used in gradient boosting is sensitive to extreme values.
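The lagged and ratio features named above, and the correlation used to flag collinearity between them, might be built as in the following sketch. The helper names, the short toy series, the small lags (2 and 1 steps rather than 24 and 12), and the `eps` guard against division by zero are all assumptions made for illustration. The study's exact feature definitions are not given in this section.

```python
def lag(series, k):
    """Shift a series by k steps; the first k entries have no history."""
    if k <= 0:
        return list(series)
    return ([None] * k + list(series[:-k]))[:len(series)]


def ratio_feature(numerator, denominator, eps=1e-6):
    """Element-wise ratio, e.g. PM2.5 / wind speed (eps is an assumed guard)."""
    return [n / (d + eps) for n, d in zip(numerator, denominator)]


def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


# Hypothetical hourly readings
pm25 = [35.0, 42.0, 50.0, 47.0, 60.0, 72.0, 65.0, 58.0]
pm10 = [60.0, 70.0, 85.0, 80.0, 98.0, 120.0, 110.0, 95.0]
wind = [2.0, 1.5, 1.0, 1.2, 0.8, 0.5, 0.9, 1.4]

pm25_lag2 = lag(pm25, 2)
pm10_lag1 = lag(pm10, 1)
pm25_div_wind = ratio_feature(pm25, wind)

# Correlate the two lagged predictors on rows where both are defined
pairs = [(a, b) for a, b in zip(pm25_lag2, pm10_lag1)
         if a is not None and b is not None]
r = pearson_r([a for a, _ in pairs], [b for _, b in pairs])
```

A high value of `r` between two predictors signals the kind of collinearity that random feature subsetting in RF is meant to dilute, since correlated features are less likely to compete at every split.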
SO2 prediction
RF’s exceptional SO2 accuracy (RMSE = 0.07 µg/m3) correlates with its ability to characterize spatial heterogeneity in emission inventories. SO2_lag1’s leading SHAP value (0.1318, Fig. 5c) underscores historical emission persistence as the primary driver. DT’s neglect of feature interactions (e.g., SO2_lag12-SO2_lag24 correlation r = 0.91) results in expanded error bands (Fig. 6d).
NO2/CO prediction
GBDT achieves comparable performance to RF for NO2 (R2 = 0.85, ΔR2 = 0.03) and CO (R2 = 0.98), reflecting algorithmic adaptability to emission dynamics. NO2’s traffic-derived pulse emissions (skewness = 1.32) demand spatiotemporal heterogeneity modeling, where GBDT’s boosting-driven residual correction (Fig. 6c) outperforms RF’s plateau effects. For CO, both RF and GBDT converge in modeling linear emission sources (e.g., stationary combustion), achieving identical R2 = 0.98 (Fig. 6e).
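The skewness statistic quoted for NO2 summarizes how strongly a distribution is pulled toward high-concentration pulses. A minimal sketch of the moment-based (population) skewness follows. The source does not state which estimator was used, so this particular form is an assumption, and the two series are hypothetical.

```python
def skewness(values):
    """Moment-based skewness g1 = m3 / m2^(3/2) (population form, no bias correction)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical series: a symmetric one and one with a single traffic-like pulse
symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
pulsed = [20.0, 22.0, 21.0, 23.0, 20.0, 80.0, 22.0, 21.0]

g_symmetric = skewness(symmetric)
g_pulsed = skewness(pulsed)
```

A positive value indicates a long right tail, which is the situation in which a small number of peak samples contribute disproportionately to squared-error loss.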
O3 prediction
DT’s competitive O3 prediction (R2 = 0.88) highlights compatibility between algorithmic simplicity and photochemical thresholds. O3 formation, governed by nonlinear NOx-VOCs ratios and radiation thresholds, exhibits piecewise-linear feature interactions (e.g., abrupt increases in the O3 generation rate at T > 25 °C). DT’s binary tree splitting inherently aligns with such mechanisms (Fig. 3f), where an 8-layer tree depth (Table 3) matches O3 concentration inflection points. Ensemble models such as RF suffer from prediction smoothing, which diminishes their peak-capturing ability (Fig. 6f).
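The way a binary split can lock onto a threshold of this kind is illustrated by the following sketch of a single regression-tree split (a decision stump) chosen by minimizing squared error. The temperature and O3 values are synthetic, constructed with a jump between 24 and 26 °C to mirror the example in the text. They are not observations from this study.

```python
def sum_squared_error(values):
    """Sum of squared deviations from the mean of `values`."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, sse) for the single split that minimizes squared error."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    xs = [x[i] for i in order]
    ys = [y[i] for i in order]
    best_threshold, best_sse = None, float("inf")
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue  # no valid threshold between identical feature values
        sse = sum_squared_error(ys[:i]) + sum_squared_error(ys[i:])
        if sse < best_sse:
            # Place the threshold midway between neighboring feature values
            best_threshold, best_sse = (xs[i - 1] + xs[i]) / 2.0, sse
    return best_threshold, best_sse


# Synthetic data with an abrupt O3 increase between 24 and 26 degrees C
temperature = [15.0, 18.0, 21.0, 24.0, 26.0, 28.0, 31.0, 34.0]
o3 = [40.0, 42.0, 41.0, 43.0, 80.0, 84.0, 82.0, 86.0]

threshold, sse = best_split(temperature, o3)
```

The stump recovers a threshold of 25.0 for this synthetic series. A full tree repeats the same search recursively within each partition, while averaging many such trees fitted on resampled data tends to blur the step, which is one way to picture the smoothing attributed to RF.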
Limitations and future research directions
Despite these advances, this study has three key limitations. First, the temporal scope of the dataset (2021–2024) excludes extreme climate events such as El Niño31, potentially underestimating long-term meteorological impacts. Second, the spatial resolution of emission inventories (10 km) limits precise characterization of industrial point sources. Third, prediction errors remain pronounced during extreme pollution episodes (PM2.5 > 250 µg/m3, RMSE = 38.7 µg/m3), necessitating robustness optimization methods such as adversarial training32.
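Error during extreme episodes is only visible when RMSE is computed on the high-concentration subset rather than over all samples. A minimal sketch of such a stratified evaluation is shown here. The function names and the observation and prediction values are hypothetical, and only the 250 µg/m3 cutoff is taken from the text.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean squared error over all paired samples."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def rmse_above(y_true, y_pred, threshold):
    """RMSE restricted to samples whose observed value exceeds `threshold`.

    Returns None when no observation exceeds the threshold.
    """
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t > threshold]
    if not pairs:
        return None
    return rmse([t for t, _ in pairs], [p for _, p in pairs])


# Hypothetical PM2.5 observations and predictions (ug/m3)
obs = [40.0, 85.0, 120.0, 260.0, 310.0]
pred = [42.0, 80.0, 125.0, 230.0, 270.0]

overall = rmse(obs, pred)
extreme = rmse_above(obs, pred, 250.0)
```

Reporting both numbers side by side shows how much of the aggregate error is concentrated in the few samples that matter most for early warning.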
Future research will focus on integrating WRF-Chem simulations with ML models to establish a multi-scale hybrid forecasting system. Additionally, dynamic pollution control strategies will be explored through SHAP-based optimization, targeting feature-specific interventions aligned with key driver thresholds.
Conclusions
This study systematically elucidates the performance disparities of tree-based models in multi-pollutant atmospheric prediction and their environmental mechanisms, and demonstrates that these models offer complementary strengths: RF excels in particulate matter, GBDT in traffic-related gases, and DT in photochemical O3.
SHAP attribution analysis quantified synergistic interactions among key drivers, revealing CO’s dominant role as a secondary organic precursor (SHAP = 0.136) in PM2.5 prediction and the inhibitory effect of humidity-SO2 interactions (Humidity × SO2 = -0.0104) on sulfate aerosol formation, providing mechanistic foundations for coordinated pollution control.
Current models remain limited by the 10 km resolution of emission inventories, which constrains characterization of industrial point sources, and by the exclusion of climate anomalies such as El Niño, resulting in elevated errors during extreme pollution events (PM2.5 > 250 µg/m3, RMSE = 38.7 µg/m3). Future research will integrate WRF-Chem simulations with machine learning to develop multi-scale hybrid forecasting systems, enhance robustness via adversarial training, and optimize heavy-pollution emergency responses through SHAP-driven dynamic feature weighting. The standardized prediction-interpretation framework proposed here has been operationally implemented in the Beijing-Tianjin-Hebei regional air quality warning system, offering a transferable paradigm for urban atmospheric pollution management.
Data availability
Owing to confidentiality regulations, the datasets generated and/or analyzed during the current study are not publicly available but can be obtained from the corresponding authors upon reasonable request.
References
Cao, W. et al. Short-term air pollution exposure and risk of respiratory pathogen infections: an 11-year case-crossover study in Guangzhou, China. BMC Public. Health. 25, 1411 (2025).
Watson, A. S. & Bai, R. S. Studies on ecosystem services and air-pollution mitigation in tropical urban vegetation using i-Tree eco model. Environ. Dev. Sustain. https://doi.org/10.1007/s10668-025-06016-7 (2025).
Inlaung, K., Chotamonsak, C., Macatangay, R. & Surapipith, V. Assessment of transboundary PM2.5 from biomass burning in Northern Thailand using the WRF-Chem model. Toxics 12, 462 (2024).
Abuouelezz, W. et al. Exploring PM2.5 and PM10 ML forecasting models: a comparative study in the UAE. Sci. Rep. 15, 9797 (2025).
Hanigan, I. C. et al. Deep ensemble machine learning with bayesian blending improved accuracy and precision of modelled ground-level Ozone for region with sparse monitoring: Australia, 2005–2018. Environ. Model. Softw. 187, 106378 (2025).
Pande, C. B. et al. Forecasting of monthly air quality index and understanding the air pollution in the urban city, India based on machine learning models and cross-validation. J. Atmos. Chem. 82, 1 (2025).
Wang, F. et al. Ground visibility prediction using tree-based and random-forest machine learning algorithm: comparative study based on atmospheric pollution and atmospheric boundary layer data. Atmospheric Pollution Res. 15, 102270 (2024).
Kalantari, E., Gholami, H., Malakooti, H., Kaskaoutis, D. G. & Saneei, P. An integrated feature selection and machine learning framework for PM10 concentration prediction. Atmospheric Pollution Res. 16, 102456 (2025).
Dai, H. et al. Comparison and evaluation of machine learning models for predicting indoor PM2.5 concentrations on a large Spatiotemporal scale. Build. Simul. https://doi.org/10.1007/s12273-025-1276-0 (2025).
Wu, B. et al. Winter-spring droughts exacerbated PM2.5-O3 compound pollution? Evidence from China. Sci. Total Environ. 959, 178309 (2025).
Lei, S. et al. Study on O3-NOx-VOCs combined air pollution and Ozone health effects in the Hexi corridor. Environ. Sci. Pollut Res. 31, 49837–49854 (2024).
Cao, W., Wang, L., Li, R., Zhou, W. & Zhang, D. Unveiling the nonlinear relationships and co-mitigation effects of green and blue space landscapes on PM2.5 exposure through explainable machine learning. Sustainable Cities Soc. 122, 106234 (2025).
Corradino, C., Jouve, P., La Spina, A. & Del Negro, C. Monitoring earth’s atmosphere with Sentinel-5 TROPOMI and artificial intelligence: quantifying volcanic SO2 emissions. Remote Sens. Environ. 315, 114463 (2024).
Ren, W., Zhou, A. S. & Ma, J. C. Prediction of residual stresses in GFRP strips under wind-sand erosion by interpretable machine learning methods: feature engineering and SHAP analysis. Multiscale Multidiscip. Model. Exp. Des. 8, 270 (2025).
Gao, Y. et al. Adversarial neural collaborative filtering with embedding dimension correlations. Data Intell. 5, 786–806 (2023).
Imai, S., Koriyama, T., Yonekura, S., Sugasawa, S. & Nishiyama, Y. Fully Data-Driven normalized and exponentiated kernel density estimator with Hyvärinen score. J. Bus. Economic Stat. 43, 110–121 (2025).
Sun, Y. et al. Modifying the one-hot encoding technique can enhance the adversarial robustness of the visual model for symbol recognition. Expert Syst. Appl. 250, 123751 (2024).
Ji, Y., Wang, Y., Wang, C., Tang, X. & Song, M. Remote sensing fine Estimation model of PM2.5 concentration based on improved long Short-Term memory network: A case study on Beijing–Tianjin–Hebei urban agglomeration in China. Remote Sens. 16, 4306 (2024).
Qu, Q. et al. Response of organic aerosol in Beijing to emission reductions during the XXIV olympic winter games. Sci. Total Environ. 914, 170033 (2024).
Petrović, N., Moyà-Alcover, G., Jaume-i-Capó, A. & Buades Rubio, J. M. Enhancing generalization in sickle cell disease diagnosis through ensemble methods and feature importance analysis. Eng. Appl. Artif. Intell. 142, 109875 (2025).
Salehie, O., Jamal, M. H. B. & Shahid, S. Characterization and prediction of PM2.5 levels in Afghanistan using machine learning techniques. Theor. Appl. Climatol. 155, 9081–9097 (2024).
Zeng, Y., Ye, H. & Li, G. Adaptive dynamic service placement approach for Edge-Enabled vehicular networks based on SAC and RF. Concurrency Comput. 37, e70041 (2025).
Zhang, Y. et al. Semantic distance of icons: impact on user cognitive performance and a new model for semantic distance classification. Int. J. Ind. Ergon. 102, 103610 (2024).
Kobourov, S. et al. The influence of dimensions on the complexity of computing decision trees. Artif. Intell. 343, 104322 (2025).
Zhang, J. et al. HyperTuneFaaS: A serverless framework for hyperparameter tuning in image processing models. Displays 87, 102990 (2025).
Deng, T. et al. Multi-classification prediction of PM2.5 concentration based on improved adaptive boosting rotation forest. J. Environ. Chem. Eng. 12, 114658 (2024).
Almalawi, A. et al. An IoT based system for magnify air pollution monitoring and prognosis using hybrid artificial intelligence technique. Environ. Res. 206, 112576 (2022).
Li, T. et al. Long-term reconstruction of NO2 photolysis rate coefficients using machine learning and its impact on secondary pollution: A case study in a megacity of the Sichuan Basin, China. Atmospheric Pollution Res. 16, 102526 (2025).
Su, H. et al. Growth of nitrate contribution to aerosol pollution during wintertime in Xi’an, Northwest China: formation mechanism and effects of NH3. Particuology 87, 303–315 (2024).
Yadav, P., Lal, S., Tripathi, S. N., Jain, V. & Mandal, T. K. Role of sources of NMVOCs in O3, OH reactivity, and secondary organic aerosol formation over Delhi. Atmospheric Pollution Res. 15, 102082 (2024).
Liu, L., Liang, Y. & Zhang, T. El Niño-Southern Oscillation and its impact on population exposure to Ozone pollution and heatwave compound events in China. Atmos. Environ. 351, 121209 (2025).
Xu, L., Zhang, C., Xu, Z. & Long, D. Z. A nonparametric robust optimization approach for Chance-Constrained knapsack problem. SIAM J. Optim. 35, 739–766 (2025).
Funding
This work was supported by the Shenyang University of Technology talent introduction research funds (project code: 524537). We thank the editors and reviewers for their suggestions and help.
Author information
Contributions
Xiaofeng Zhu: Conceptualization, Methodology, Software, Validation, Formal Analysis, Investigation, Data Curation, Writing - Original Draft, Writing - Review & Editing, Visualization. Bo Li: Resources, Writing - Review & Editing, Supervision. Yan Cao: Investigation, Data Curation. Qian Zhang: Conceptualization, Resources, Writing - Review & Editing, Supervision, Project Administration, Funding Acquisition.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Zhu, X., Li, B., Cao, Y. et al. Applicability analysis of tree-based ensemble learning for air pollutant prediction models. Sci Rep 16, 9602 (2026). https://doi.org/10.1038/s41598-025-32652-0