Applicability analysis of tree-based ensemble learning for air pollutant prediction models

Zhu, Xiaofeng; Li, Bo; Cao, Yan; Zhang, Qian

doi:10.1038/s41598-025-32652-0


Article
Open access
Published: 25 February 2026


Scientific Reports volume 16, Article number: 9602 (2026)


Abstract

To support coordinated air quality management, this study developed a tree-based machine learning framework for multi-pollutant forecasting. We systematically evaluated the predictive performance of Random Forest (RF), Gradient Boosting Decision Tree (GBDT), and Decision Tree (DT) models for six key pollutants (PM_2.5, PM₁₀, NO₂, SO₂, CO, and O₃) using high-resolution environmental monitoring data (10 km resolution) from China’s four major municipalities (2021–2024). A comprehensive feature system was constructed incorporating meteorology-emission interaction terms. SHapley Additive exPlanations (SHAP) values were employed to quantify feature contributions. Key findings demonstrate: (1) RF achieved optimal performance in particulate matter prediction (PM_2.5: R² = 0.99, RMSE = 0.11 µg/m³; PM₁₀: R² = 0.98); (2) GBDT showed comparable accuracy to RF for NO₂ (R² = 0.85) and CO (R² = 0.98) with minimal differences (ΔR² ≤ 0.03); (3) DT exhibited competitive O₃ prediction capability (R² = 0.88). SHAP analysis revealed critical mechanisms, such as CO’s positive synergistic effect (SHAP = 0.136) in PM_2.5 prediction and O₃ generation sensitivity to temperature (SHAP = 0.076). This research provides an interpretable, multi-pollutant forecasting framework applicable to urban air quality warning systems and offers model selection guidance for environmental regulation strategies.

Introduction

Atmospheric pollution control remains a critical challenge in the process of global urbanization, posing severe threats to public health¹ and ecosystems². The advancement of air quality forecasting has become increasingly imperative. With the rapid development of machine learning (ML), interdisciplinary research in air quality prediction has gained substantial momentum. For instance, studies using traditional models like WRF-Chem have demonstrated precision in simulating the physicochemical formation of PM_2.5 and O₃³; however, their heavy dependence on emission inventories and high computational costs limit real-time early warning capabilities.

Compared to conventional models like WRF-Chem, ML has established itself as a core tool for air quality prediction by capitalizing on its strengths in handling nonlinear relationships and high-dimensional data. Given the substantial global health impacts stemming from inadequacies in air quality forecasting, researchers have actively compared different approaches. One study compared the performance of machine learning and time series models in predicting urban PM_2.5 and PM₁₀ concentrations⁴. Analysis of five years of observational data from six monitoring stations in Abu Dhabi revealed that linear support vector regression delivered optimal performance for PM_2.5 prediction, while the Facebook Prophet model achieved superior accuracy in both 24-hour and weekly forecasts for PM_2.5 and PM₁₀.

Conventional statistical models often fail to capture nonlinear interactions between pollutants and environmental factors. In contrast, tree-based ML approaches (e.g., Random Forest (RF), Gradient Boosting Decision Tree (GBDT), and Decision Tree (DT)) balance computational efficiency with interpretability, enabling identification of key drivers through feature contribution analysis. One study was the first to apply Bayesian Maximum Entropy blending within a Deep Ensemble Machine Learning framework⁵. By combining geographical predictors with RF, XGBoost, and GBM models and three meta-models, the authors generated 2.5 km × 2.5 km resolution grid surfaces for predicting monthly maximum 1-hour ozone concentrations, achieving high accuracy even in regions with sparse monitoring networks. Another study developed AQI prediction models for Delhi using linear regression, RF, and DT regression under three scenarios with multi-year meteorological data⁶. Results demonstrated superior performance of DT regression and RF models across scenarios, with RF outperforming others in 10-fold cross-validation, providing actionable insights for Delhi’s urban policymakers. A third study designed three visibility simulation schemes employing DT and RF algorithms⁷. Using atmospheric boundary layer meteorological data, pollutant concentrations, and surface observations to mitigate haze impacts, the authors identified optimal methods for studying boundary layer influences. Findings revealed RF’s superior simulation performance over DT in two haze episodes.

Conventional models often analyze single pollutants in isolation while neglecting synergistic effects, whereas ML enables the integration of meteorological, chemical, and spatiotemporal heterogeneity data, offering novel approaches for multi-pollutant system modeling. In a different regional context, one study employed RF and K-Nearest Neighbors algorithms to model PM₁₀ concentrations in Zabol, Sistan Basin (Iran) using meteorological data⁸. Through feature selection methods, the authors identified significant predictors and demonstrated RF’s superior performance during summer months. All models achieved accurate predictions relying solely on readily available meteorological data, providing robust support for air quality forecasting and policy formulation in the region. Another study compared three data-driven models (Gaussian Process Regression, Quantile Random Forest, and Bayesian Neural Networks) for predicting long-term spatiotemporal distributions of residential indoor PM_2.5⁹. That study proposed a comprehensive framework for model comparison, validation, and attribution analysis, facilitating future research to elucidate complex nonlinear relationships between urban characteristics and indoor air pollutants, while offering insights for urban planning and indoor PM_2.5 mitigation strategies.

The intensification of PM_2.5-O₃ co-pollution further underscores the urgency of multi-pollutant collaborative prediction. One study employed an enhanced synthetic control method and mediation effect model to quantitatively assess the impacts of winter-spring droughts on PM_2.5-O₃ co-pollution and its driving factors¹⁰. It revealed significant increases in daily averages and diurnal variation patterns of PM_2.5 and O₃ during drought periods, with elevated temperatures, reduced precipitation, and lower relative humidity identified as primary drivers exacerbating “dual-high” co-pollution risks. This approach expands research on atmospheric composite pollution under extreme weather events and provides novel insights for preventing co-pollution episodes during abnormal meteorological conditions. Another study investigated pollution characteristics and health impacts of ozone and its precursors in the Hexi Corridor through land-use change analysis and BenMAP-CE software¹¹. Its findings indicate that cropland expansion (primarily from grassland conversion) correlates strongly with ozone pollution influenced by meteorological factors and vegetation patterns, while ozone emerges as a major contributor to premature cardiovascular mortality. That study proposes targeted measures for Wuwei City, including controlling pollution transport during high-ozone periods and implementing coordinated management of VOCs and NOx emissions to mitigate health risks.

Despite technological advancements, existing research predominantly focuses on single-pollutant prediction (e.g., PM_2.5¹² or SO₂¹³), lacking systematic multi-pollutant and multi-model comparisons. Furthermore, while SHAP (SHapley Additive exPlanations) values have been employed to interpret ML models¹⁴, their environmental science applications remain superficial. Most studies prioritize model accuracy over standardized interpretative frameworks, thereby limiting policy implementation feasibility.

This study aims to systematically compare predictive performance for six pollutants (PM_2.5, PM₁₀, NO₂, SO₂, CO, and O₃) within a tree-based modeling framework incorporating RF, DT, and GBDT. By employing a standardized SHAP workflow to quantify feature contributions and identify optimal model-pollutant pairings, we establish an evidence-based foundation for atmospheric pollution early-warning systems through integrated meteorological data and emission inventories.

Materials and methods

This study followed a standardized machine learning workflow for model development and evaluation. The process commenced with data collection and preprocessing, followed by comprehensive feature engineering to construct the input variables. Subsequently, a two-stage feature selection was implemented to identify the most predictive features. Three tree-based models were then constructed, optimized, and rigorously evaluated using a time-series cross-validation protocol. Finally, model interpretability was analyzed using SHAP values to elucidate the key drivers of pollutant concentrations.

Data description and preprocessing

The dataset comprises 5844 independent observational records encompassing six major air pollutants and six meteorological variables for four municipalities over 2021–2024. Each record is the hourly observation taken at 0000 UTC, one per city per day (4 cities × 1461 days = 5844 records). The four municipalities span three major climate zones in eastern China. Beijing (39° 54′ N, 116° 23′ E) has a temperate monsoon climate with significant PM_2.5 pollution during the winter heating period. Shanghai (31° 14′ N, 121° 29′ E) has a subtropical monsoon climate with prominent ozone pollution in summer. Tianjin (39° 08′ N, 117° 12′ E) experiences significant sea-land wind circulation and compounded pollution from heavy industrial emissions and port transportation. Chongqing (29° 33′ N, 106° 30′ E) lies in a basin with static, stable weather and active secondary aerosol generation. Temporal covariates include year, month, and day, with geographical variations encoded through the categorical variable “City”. The spatial resolution is 10 km. Data quality was ensured through automated anomaly detection using the Interquartile Range (IQR) method, with values beyond 1.5 × IQR replaced by linear interpolation; after this step the dataset contains no missing values or outliers. Environmental monitoring data often exhibit non-normal distributions, particularly for pollutant concentrations, which are typically right-skewed due to the influence of pollution episodes¹⁵. To enhance model stability and performance, all numerical input features were normalized using Z-score standardization, which removes differences in scale without requiring strict adherence to normality assumptions. Key dataset features are summarized in Table 1. All air quality data and meteorological parameters were obtained from the China National Environmental Monitoring Centre (http://www.cnemc.cn/sssj/).
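The IQR screening and interpolation step can be illustrated with a minimal sketch. It assumes the conventional fences Q1 − 1.5 × IQR and Q3 + 1.5 × IQR and a per-series application; the function name and the sample readings are hypothetical, and the source does not state whether screening was done per city or per variable.

```python
import pandas as pd

def replace_iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Mask values outside [Q1 - k*IQR, Q3 + k*IQR], then fill them linearly."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Values outside the fences become NaN so that interpolation can fill them.
    masked = series.where(series.between(lower, upper))
    # limit_direction="both" also fills outliers at the start or end of the series.
    return masked.interpolate(method="linear", limit_direction="both")

# Hypothetical daily PM2.5 readings containing one spike.
pm25 = pd.Series([30.0, 32.0, 31.0, 400.0, 33.0, 35.0, 34.0])
cleaned = replace_iqr_outliers(pm25)
print(cleaned.tolist())
```

In this toy series the spike is replaced by the midpoint of its two neighbours, while all other readings are left unchanged.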

Table 1 Characteristics of the dataset.


To eliminate dimensional discrepancies among numerical variables and enhance model performance, Z-score normalization¹⁶ was applied to all numerical variables (excluding city/time covariates). Following \(z=\frac{x - \mu}{\sigma}\), each numerical variable was rescaled to mean = 0 and standard deviation = 1, where x represents the original observed value, µ denotes the variable mean, and σ indicates the standard deviation. The transformation changes location and scale only; it does not make a skewed variable normally distributed.

This normalization enables comparative analysis of variables with disparate units and magnitudes, thereby improving model fitting accuracy, prediction precision, and generalization capacity.
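As a small worked instance of the Z-score formula, the sketch below standardizes a hypothetical temperature vector with NumPy. It uses the population standard deviation (ddof = 0), which is also what scikit-learn's StandardScaler uses; the source does not say whether µ and σ were estimated on the training portion only, so that choice is left open here.

```python
import numpy as np

def zscore(x: np.ndarray) -> np.ndarray:
    """Standardize x to zero mean and unit (population) standard deviation."""
    mu = x.mean()
    sigma = x.std()  # ddof=0, i.e. the population standard deviation
    return (x - mu) / sigma

# Hypothetical 2 m temperature values; mean is 5.0 and population std is 2.0.
temps = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
z = zscore(temps)
print(z)
```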

For the “City” categorical variable, one-hot encoding¹⁷ was implemented to prevent misleading numerical interpretations. This method converts each city into a binary dummy variable, where a value of 1 indicates the sample belongs to that city and 0 otherwise. For instance, four dummy variables were created for Beijing, Shanghai, Tianjin, and Chongqing, enabling precise identification of inter-city air quality variations without introducing spurious correlations from ordinal encoding.
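The one-hot encoding of the “City” variable might look like the following sketch with pandas. The column prefix and the five sample rows are illustrative assumptions, not taken from the source.

```python
import pandas as pd

# Hypothetical sample of the categorical "City" column.
df = pd.DataFrame({"City": ["Beijing", "Shanghai", "Tianjin", "Chongqing", "Beijing"]})

# One binary dummy column per city; exactly one of them is 1 in each row.
dummies = pd.get_dummies(df["City"], prefix="City", dtype=int)
encoded = pd.concat([df, dummies], axis=1)
print(encoded)
```

Because each row activates exactly one dummy, no artificial ordering between cities is introduced.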

Feature engineering

To fully exploit inherent periodic patterns in temporal variables, sine-cosine transformation encoding was implemented. Following \(\sin (\frac{2\pi x}{T})\) and \(\cos (\frac{2\pi x}{T})\), each periodic time variable was converted into a two-dimensional sinusoidal vector, where x represents the Day or Month value, with T = 31 for the day-of-month cycle (maximum days per month) and T = 12 for the monthly cycle. This encoding enables models to explicitly capture day-of-month and seasonal periodicity in air quality variations.
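The sine-cosine encoding can be sketched as below; the helper name is hypothetical. The example shows the property that motivates the encoding: December and January end up close together in the encoded space, whereas December and June are far apart.

```python
import numpy as np

def cyclical_encode(x, period):
    """Map a periodic variable to (sin, cos) coordinates on the unit circle."""
    angle = 2.0 * np.pi * np.asarray(x, dtype=float) / period
    return np.sin(angle), np.cos(angle)

# Months 12 (December), 1 (January) and 6 (June) with T = 12.
month_sin, month_cos = cyclical_encode([12, 1, 6], period=12)
points = np.column_stack([month_sin, month_cos])

dist_dec_jan = np.linalg.norm(points[0] - points[1])
dist_dec_jun = np.linalg.norm(points[0] - points[2])
print(dist_dec_jan, dist_dec_jun)
```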

Leveraging the autoregressive properties of pollutant concentrations, lagged features were constructed for the five primary pollutants (excluding ozone due to its distinct photochemical characteristics). Lagged intervals of (1, 3, 6, 12, 24) hours were created to capture temporal persistence and autocorrelation, enhancing model learning of short-term trends and cyclical patterns.
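A minimal sketch of lag construction with pandas follows. It assumes rows are already sorted chronologically within each city and that lags are counted in rows (time steps); the source describes the intervals in hours. The column naming pattern `<pollutant>_lag<k>` mirrors feature names such as SO₂_lag1 used later in the paper, while the function name and sample values are hypothetical.

```python
import pandas as pd

def add_lag_features(df, columns, lags=(1, 3, 6, 12, 24), group_col="City"):
    """Add <col>_lag<k> columns, shifting within each city so series never mix."""
    out = df.copy()
    for col in columns:
        for lag in lags:
            out[f"{col}_lag{lag}"] = out.groupby(group_col)[col].shift(lag)
    return out

# Hypothetical toy data: three consecutive records for each of two cities.
toy = pd.DataFrame({
    "City": ["Beijing"] * 3 + ["Shanghai"] * 3,
    "SO2": [5.0, 6.0, 7.0, 1.0, 2.0, 3.0],
})
lagged = add_lag_features(toy, ["SO2"], lags=(1, 2))
print(lagged)
```

The first rows of each city have no history and therefore contain NaN; such rows would need to be dropped or imputed before model fitting.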

Synergistic effects between variables were modeled through four engineered features, supported by prior studies on atmospheric processes¹⁸,¹⁹:

Temp × Wind: product of temperature and wind speed, quantifying atmospheric dispersion capacity.
Humidity × SO₂: humidity-SO₂ interaction, reflecting hygroscopic effects on secondary particle formation.
PM_2.5_div_Wind: ratio of PM_2.5 to wind speed, characterizing particulate dispersion efficiency.
Composite Pollution Index: weighted combination of PM_2.5 and NO₂ concentrations (0.6 × PM_2.5 + 0.4 × NO₂), representing oxidative stress potential.

These interaction terms enrich feature space representation, enabling improved modeling of complex air quality dynamics.
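The engineered interaction terms can be computed as in the sketch below. The input column names are hypothetical, and treating zero wind speed as a missing ratio is an assumption made here to avoid division by zero; the source does not describe how calm conditions were handled.

```python
import numpy as np
import pandas as pd

def add_interaction_features(df: pd.DataFrame) -> pd.DataFrame:
    """Append the product, ratio and weighted-sum features described in the text."""
    out = df.copy()
    # Assumption: calm conditions (wind == 0) give a missing ratio rather than inf.
    wind = out["Wind"].replace(0, np.nan)
    out["Temp_x_Wind"] = out["Temp"] * out["Wind"]
    out["Humidity_x_SO2"] = out["Humidity"] * out["SO2"]
    out["PM2.5_div_Wind"] = out["PM2.5"] / wind
    out["Composite_Index"] = 0.6 * out["PM2.5"] + 0.4 * out["NO2"]
    return out

# Hypothetical single observation.
sample = pd.DataFrame({
    "Temp": [10.0], "Wind": [2.0], "Humidity": [50.0],
    "SO2": [4.0], "PM2.5": [60.0], "NO2": [30.0],
})
features = add_interaction_features(sample)
print(features.iloc[0])
```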

Feature importance screening

A two-stage feature selection strategy was implemented to reduce dimensionality while enhancing model efficiency and generalizability. First, high-correlation filtering¹⁵ was conducted by calculating Pearson correlation coefficients between features, removing those with coefficients > 0.8 to mitigate multicollinearity and information redundancy. Subsequently, feature importance was evaluated using an RF model configured with 200 trees for stability. A quantile threshold then retained the top 15% of features by importance, preserving the most valuable predictors for target pollutant estimation.
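The two-stage screening can be sketched as follows under stated assumptions: within each highly correlated pair the later column is dropped (the source does not say which member is removed), absolute correlations are compared against 0.8, and the top 15% is implemented as the 0.85 quantile of RF importances. Function names and the synthetic data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def drop_correlated(X: pd.DataFrame, threshold: float = 0.8) -> pd.DataFrame:
    """Stage 1: drop the later column of every pair with |Pearson r| > threshold."""
    corr = X.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return X.drop(columns=to_drop)

def select_top_features(X, y, quantile=0.85, n_trees=200, seed=0) -> pd.Series:
    """Stage 2: keep features whose RF importance reaches the given quantile."""
    rf = RandomForestRegressor(n_estimators=n_trees, random_state=seed).fit(X, y)
    importances = pd.Series(rf.feature_importances_, index=X.columns)
    cutoff = importances.quantile(quantile)
    return importances[importances >= cutoff].sort_values(ascending=False)

# Synthetic data: "a_copy" nearly duplicates "a"; the target depends on "a" only.
rng = np.random.default_rng(0)
a = rng.normal(size=300)
X = pd.DataFrame({
    "a": a,
    "a_copy": a + 0.05 * rng.normal(size=300),
    "c": rng.normal(size=300),
})
y = 3.0 * a + 0.1 * rng.normal(size=300)

filtered = drop_correlated(X)
selected = select_top_features(filtered, y)
print(list(filtered.columns), list(selected.index))
```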

Figure 1 illustrates the key important features corresponding to the six target pollutants. The importance scores of the features calculated by the Random Forest model²⁰ visualize the contribution of each feature to the prediction of different pollutants. Each subfigure clearly lists the key features of the corresponding pollutants and their importance ranking, which highlights the important role of the key information retained after feature screening in supporting the prediction ability of the model.

Predictive model construction

Three tree-based algorithms (RF, GBDT, and DT) were selected for their demonstrated efficacy in handling nonlinear relationships and high-dimensional data and for their interpretability in environmental studies⁷,²¹. RF and GBDT are ensemble methods capable of capturing complex feature interactions, while DT provides a baseline model with high transparency for rule extraction. The modeling framework was developed in Python 3.9 with scikit-learn 1.2.2, using the PyCharm IDE. As a bagging-based ensemble algorithm, RF constructs multiple decorrelated decision trees through bootstrap sampling and random feature subsets at each split, then aggregates their outputs (by averaging in the regression setting used here) to enhance model generalization²². GBDT adopts a boosting framework, sequentially building weak learners that each fit the residual errors (negative gradients of the loss) of the current ensemble, demonstrating strong local feature learning capacity²³. As a fundamental tree model, DT applies recursive binary splitting to build a set of decision rules by minimizing node impurity²⁴.
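As an illustration of the three model families, the sketch below fits scikit-learn regressors on synthetic data and checks that, in the regression setting, the RF prediction is the average of its individual trees' predictions. The data, hyperparameter values, and dictionary keys are hypothetical and are not the configurations reported in Table 2.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic nonlinear regression problem (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(300, 3))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=300)

models = {
    "DT": DecisionTreeRegressor(max_depth=5, random_state=0),
    "RF": RandomForestRegressor(n_estimators=50, random_state=0),
    "GBDT": GradientBoostingRegressor(
        n_estimators=50, learning_rate=0.1, max_depth=3, random_state=0
    ),
}
for model in models.values():
    model.fit(X, y)

# Bagging for regression: the forest output equals the mean over its trees.
rf = models["RF"]
tree_mean = np.mean([tree.predict(X[:5]) for tree in rf.estimators_], axis=0)
print(np.allclose(tree_mean, rf.predict(X[:5])))
```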

Key model configurations and hyperparameters are detailed in Table 2. The dataset was temporally split into training (80%) and test (20%) sets using time-series-aware partitioning, which preserves chronological dependencies and enables models to effectively learn temporal patterns and evolutionary trends. The hyperparameters are:

Number of base learners (n_estimators): controls ensemble size in tree-based models.
Maximum tree depth (max_depth): prevents overfitting by constraining model complexity.
Feature subspace strategy (max_features): proportion for random feature selection.
Shrinkage factor (learning_rate): modifies contribution weights of sequential trees.
Stochastic subsampling (subsample): fraction of samples used for boosting iterations.
Minimum samples for node splitting (min_samples_split): halting criterion for tree growth.
Random seed (random_state): ensures experimental reproducibility.
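The 80/20 chronological split can be sketched as below. The sketch cuts on dates rather than on row counts so that all cities observed on a given day fall on the same side of the split; this detail, the `Date` column name, and the toy records are assumptions, since the source only states that the partitioning was time-series-aware.

```python
import pandas as pd

def chronological_split(df, date_col="Date", train_frac=0.8):
    """Split on a cutoff date so that every test record is later than training."""
    dates = df[date_col].drop_duplicates().sort_values().reset_index(drop=True)
    cutoff = dates.iloc[int(len(dates) * train_frac)]
    train = df[df[date_col] < cutoff]
    test = df[df[date_col] >= cutoff]
    return train, test

# Hypothetical records: 10 days for each of two cities.
days = pd.date_range("2021-01-01", periods=10, freq="D")
records = pd.DataFrame({
    "Date": list(days) * 2,
    "City": ["Beijing"] * 10 + ["Shanghai"] * 10,
    "PM2.5": range(20),
})
train, test = chronological_split(records)
print(len(train), len(test))
```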

Table 2 Model parameter configurations.


Model optimization employed 5-fold grid search with time-series cross-validation via TimeSeriesSplit(n_splits = 3), ensuring temporal generalization across sliding windows. This systematic hyperparameter tuning process explored the predefined parameter space to identify optimal combinations that minimize root mean square error²⁵.
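A grid search driven by TimeSeriesSplit might be set up as in the following sketch. The parameter grid, synthetic data, and seed are hypothetical (the grid in Table 2 is not reproduced here); `n_splits = 3` follows the text. With scikit-learn's default settings, each fold trains on all samples preceding its validation block, so validation data always lie in the future relative to training data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic, chronologically ordered data (illustrative only).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=200)

cv = TimeSeriesSplit(n_splits=3)
param_grid = {"n_estimators": [20, 50], "max_depth": [None, 10]}  # hypothetical grid

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=cv,
    scoring="neg_root_mean_squared_error",  # maximizing this minimizes RMSE
)
search.fit(X, y)
best_rmse = -search.best_score_

# Inspect fold sizes: training windows grow while validation blocks move forward.
fold_sizes = [(len(tr), len(te)) for tr, te in cv.split(X)]
print(search.best_params_, round(best_rmse, 3), fold_sizes)
```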

Model evaluation protocol

A comprehensive evaluation framework was established using three metrics:

Root mean squared error (RMSE): emphasizes larger prediction errors, critical for assessing extreme value forecasting.
Mean absolute error (MAE): quantifies absolute deviation magnitude between predictions and observations.
Coefficient of determination (R²): measures explained variance (0–1 scale), where values approaching 1 indicate superior model fit.
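The three metrics can be written out directly, as in this minimal NumPy sketch (function name and sample values are hypothetical). The toy example has a single large error, which shows why RMSE exceeds MAE when errors are concentrated in a few samples.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE and R² for paired observations and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    rmse = np.sqrt(np.mean(err ** 2))
    mae = np.mean(np.abs(err))
    ss_res = np.sum(err ** 2)                         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return rmse, mae, r2

# Three perfect predictions and one error of magnitude 2.
rmse, mae, r2 = regression_metrics([1, 2, 3, 4], [1, 2, 3, 6])
print(rmse, mae, r2)
```

Note that R² computed this way can fall below 0 when a model predicts worse than the mean of the observations.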

SHAP values were computed via TreeExplainer, grounded in cooperative game theory’s Shapley value concept²⁶. This approach quantifies individual feature contributions to model predictions through:

Global interpretability: aggregate SHAP values across all samples.
Local interpretability: instance-specific contribution analysis.

Feature importance ranking was derived from absolute SHAP values, enabling identification of dominant predictors and their directional impacts (positive/negative) on pollutant concentrations.
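The global and local views can be illustrated on a small, made-up SHAP matrix (rows are samples, columns are features). With the shap package such a matrix would typically be obtained from `shap.TreeExplainer(model).shap_values(X)`; the numbers, feature names, and base value below are purely hypothetical and are not results from this study.

```python
import numpy as np

feature_names = ["CO", "PM10", "SO2"]
# Hypothetical SHAP matrix: 4 samples x 3 features.
shap_values = np.array([
    [0.30, 0.10, -0.20],
    [0.10, -0.05, -0.10],
    [-0.10, 0.15, 0.00],
    [0.25, 0.00, -0.10],
])
base_value = 1.0  # hypothetical expected model output

# Global view: mean absolute SHAP gives importance, mean signed SHAP gives direction.
mean_abs = np.abs(shap_values).mean(axis=0)
mean_signed = shap_values.mean(axis=0)
ranking = [feature_names[i] for i in np.argsort(-mean_abs)]

# Local view: contributions for one instance; by additivity they sum, together
# with the base value, to that instance's prediction.
local = dict(zip(feature_names, shap_values[0]))
prediction_0 = base_value + shap_values[0].sum()
print(ranking, mean_signed, prediction_0)
```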

Results

Model performance comparison

The optimized hyperparameter configurations for each pollutant are summarized in Table 3. The results reveal significant variations in optimal parameters across pollutants, reflecting their distinct data distributions and feature interaction complexities. For RF, PM_2.5 and PM₁₀ predictions utilized unrestricted tree depth (max_depth = None) and square-root feature selection (max_features=’sqrt’), whereas NO₂ and O₃-8 h predictions required constrained tree depth (max_depth = 10) to mitigate overfitting risks. GBDT adopted lower learning rates (learning_rate = 0.05) for gaseous pollutants (SO₂ and CO) but relied on deeper trees (max_depth = 5) for particulate matter (PM_2.5 and PM₁₀). DT employed a minimum sample split threshold of 10 for SO₂ prediction and 20 for other pollutants, indicating the necessity for pollutant-specific splitting criteria.

Table 3 Optimized hyperparameter configurations for each pollutant.


Figure 2 compares the predictive performance of the three models across six pollutants. For PM_2.5 prediction, the RF model achieved optimal performance with post-tuning metrics of RMSE = 0.11 µg/m³, MAE = 0.07 µg/m³, and R² = 0.99, representing reductions of 57.7% (RMSE) and 59.2% (MAE) compared to pre-tuning, alongside a 6.0% improvement in R². GBDT and DT models exhibited higher errors (RMSE = 0.23 and 0.27 µg/m³, respectively) and lower R² values (0.95 and 0.93). In PM₁₀ prediction, RF maintained superiority (RMSE = 0.13 µg/m³, MAE = 0.08 µg/m³, R² = 0.98), with error reductions of 42.5% (RMSE) and 55.1% (MAE) relative to baseline models. GBDT and DT models achieved R² values of 0.93 and 0.90, confirming the efficacy of hyperparameter tuning in enhancing generalization.
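The percentage reductions quoted above follow the usual relative-change formula, (before − after) / before. As a sanity check, the sketch below inverts that formula to obtain the pre-tuning RMSE implied by the reported PM_2.5 figures; the implied value is derived here from the stated numbers and is not itself reported in the source.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fractional reduction from `before` to `after`."""
    return (before - after) / before

def implied_baseline(after: float, reduction: float) -> float:
    """Value before tuning implied by a post-tuning value and a fractional reduction."""
    return after / (1.0 - reduction)

# Reported for RF on PM2.5: post-tuning RMSE = 0.11 with a 57.7% reduction.
pre_rmse = implied_baseline(0.11, 0.577)
print(round(pre_rmse, 2), round(relative_reduction(pre_rmse, 0.11), 3))
```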

For gaseous pollutants, the RF model demonstrated strong performance: NO₂ prediction yielded RMSE = 0.34 µg/m³ (32.6% reduction) and R² = 0.88, while SO₂ prediction achieved near-perfect accuracy (RMSE = 0.07 µg/m³, 70.3% error reduction; R² = 0.99). In CO prediction, both RF and GBDT attained R² = 0.98, with MAE values of 0.04 and 0.05 µg/m³, respectively. For O₃-8 h prediction, RF outperformed other models (R² = 0.92 vs. GBDT = 0.86 and DT = 0.88), reducing RMSE by 49.6% post-tuning.

RF achieved the best performance for PM_2.5 (R² = 0.99 ± 0.01) and PM₁₀ (R² = 0.98 ± 0.02), while GBDT showed comparable accuracy for NO₂ (R² = 0.85 ± 0.03) and CO (R² = 0.98 ± 0.01). DT performed competitively for O₃ (R² = 0.88 ± 0.04).

Feature importance analysis

Based on the optimized RF, GBDT, and DT models, combined with SHAP value analysis, the key drivers and their contribution patterns for the six pollutants are illustrated in Figs. 3, 4 and 5. The x-axis represents SHAP values (positive values indicate increased prediction, negative values indicate decreased prediction). Point colors represent feature values (red: high, blue: low). Dense point clusters indicate regions of high feature influence.

For PM₁₀ prediction, the DT model revealed significant positive synergy between PM_2.5 (SHAP = 0.0772) and NO₂ (SHAP = 0.0580), with high-value regions (red points) densely distributed in the 0.5–2.0 SHAP range (Fig. 3a). The lagged effect of PM₁₀_lag1 (SHAP = 0.0614) and its extreme value (1.62) indicated persistent historical pollution influences. In the GBDT model, PM_2.5 exhibited dominant SHAP values (2.34), exceeding all other features (Fig. 4a), reflecting secondary aerosol formation as the primary mechanism. The long-tail distribution of PM₁₀_lag1 (up to 1.90) further validated the predictability of extreme pollution events. For the RF model, PM_2.5 demonstrated elevated impact thresholds (SHAP > 1.0), with high-density clusters (1.0–1.49) in critical regions (Fig. 5a). The negative regulatory role of humidity (2m_RH, SHAP=-0.0510) remained consistent across all three models.

In PM_2.5 prediction, the DT model identified CO (mean SHAP = 0.1364) and PM₁₀ (SHAP = 0.0519) as key positive drivers, with high-value clusters (> 0.5 SHAP) forming dense red-point groups in the distribution plots (Fig. 3b). SO₂ displayed a dual-mode distribution (− 0.45–1.48 SHAP), with significant inhibitory effects in low-concentration ranges (− 0.45–0.0). The GBDT model showed narrowed influence ranges for CO (− 0.50–0.99 SHAP) but increased high-value cluster density (0.2–0.98 SHAP, Fig. 4b), confirming its robustness. Extreme PM₁₀ responses (SHAP = 1.91) and historical concentrations (PM_2.5_lag1, SHAP = 0.0070) jointly governed pollution accumulation. In the RF model, PM₁₀ exhibited contradictory directional impacts, with dense blue clusters in low-value regions (− 0.65–0.0 SHAP) coexisting with extreme high-value responses (1.67 SHAP, Fig. 5b). The PM_2.5_div_Wind feature demonstrated a positive threshold effect (0.3–0.57 SHAP), indicating suppressed dispersion when wind speeds exceeded 4 m/s.

For SO₂ prediction, the DT model identified SO₂_lag1 as the core driver (SHAP = 0.2006), with dense red points in its high-value range (0.02–0.31 SHAP, Fig. 3c). The negative influences of CO (SHAP=-0.0661) and PM_2.5 (SHAP=-0.0459) were perturbed by extreme values (CO SHAP = 2.27). In the GBDT model, the extreme intensity of SO₂_lag1 decreased (1.03 SHAP), but its temporal dependency persisted (SHAP = 0.1491). The humidity interaction term (Humidity × SO₂) maintained a negative mean value (− 0.0104) despite exhibiting extreme responses (1.24 SHAP). For the RF model, the influence range of SO₂_lag1 contracted (− 0.20–0.81 SHAP), with reduced sensitivity (SHAP = 0.1318). PM_2.5’s symmetrical distribution (− 0.52–1.31 SHAP) reinforced its generalized inhibitory role (mean SHAP = − 0.0101).

In NO₂ prediction, the DT model revealed significant positive synergy between PM_2.5 (SHAP = 0.1238) and CO (SHAP = 0.0754), though CO exhibited extreme negative impacts (− 1.91 SHAP), indicating emission source heterogeneity (Fig. 3d). The GBDT model highlighted threshold responses from NO₂_lag3 (SHAP = 0.49, Fig. 4d), validating temporal dependency in pollution accumulation (SHAP = 0.0436). CO’s positive dominance intensified (mean SHAP = 0.1393). In the RF model, CO’s influence range narrowed (− 0.96–0.59 SHAP) but retained a high mean value (0.1224). Temperature (2m_Temp.) displayed contradictory directional impacts across models (DT: SHAP = − 0.0330 vs. RF: SHAP = 0.0106), reflecting uncertainties in meteorology-chemistry coupling mechanisms.

For CO prediction, the DT model revealed coexisting negative dominance (SHAP = −0.1490) and extreme positive responses (SHAP = 2.21) from NO₂ (Fig. 3e), suggesting scenario-dependent photochemical processes. Temperature exhibited significant positive thermal driving effects (SHAP = 0.0374). In the GBDT model, NO₂’s extreme intensity slightly decreased (2.20→2.1951 SHAP) but achieved improved distribution stability. The negative influence of humidity (2m_RH) intensified, with increased point density in the − 0.36–0.0 SHAP range. For the RF model, NO₂’s inhibitory effects strengthened (SHAP = −0.1309), and its extreme value range narrowed (− 0.80–1.85 SHAP). Temperature sensitivity diminished (SHAP = 0.0163), reflecting model smoothing mechanisms.

In O₃ prediction, the DT model identified temperature (SHAP = 0.0769) and CO (SHAP = 0.0835) as dominant drivers of ozone formation, with dense high-value red clusters (Fig. 3f). PM₁₀ displayed contradictory impacts, combining a negative mean (SHAP = −0.0059) with localized positive extremes (SHAP = 0.63). For the GBDT model, the plot understates CO’s influence (SHAP = 0.5068) because the axis is truncated (displayed range: −0.5 to 0.5 SHAP). Temperature’s effect range contracted (−0.62 to 0.80 SHAP). For the RF model, temperature exhibited the strongest positive contribution (SHAP = 0.0594), though values > 0.5 SHAP were not visualized. CO’s influence range (−1.66 to 0.48 SHAP) decoupled from its mean effect (SHAP = 0.1139).

SHAP analysis identified the top drivers for each pollutant:

PM_2.5: CO (SHAP = 0.136) and PM_2.5_div_Wind (threshold > 4 m/s).
O₃: Temperature (SHAP = 0.076) and CO (SHAP = 0.084).
NO₂: CO (SHAP = 0.139) and NO₂_lag3 (SHAP = 0.044).

Error pattern analysis

The comprehensive comparison of multi-model prediction performance (Fig. 6) revealed significant differences in error characteristics across pollutant-model combinations. Pollutants are arranged vertically as PM_2.5, PM₁₀, NO₂, SO₂, CO, and O₃ (top to bottom). All axes use standardized concentration units (µg/m³ for particulates, ppm for CO, ppb for O₃).

PM_2.5 prediction results (Fig. 6a). The DT model exhibited fan-shaped dispersion in medium concentration ranges (2.0 ≤ Actual ≤ 3.5 µg/m³) with MAE = 0.19. The RF model achieved optimal fitting (R² = 0.99, MSE = 0.01), while the GBDT model showed a 23% lag rate in predicting high-concentration samples (> 3.5 µg/m³), yielding RMSE = 0.23.

PM₁₀ prediction results (Fig. 6b). For PM₁₀, the DT model formed stepped error bands (MAE = 0.22) at abrupt concentration gradients (1.5–2.5 µg/m³). The RF model maintained stable low-error performance (MSE = 0.02), whereas the GBDT model introduced systematic bias (Δ=+0.18) in high-concentration segments (> 3.5 µg/m³).

NO₂ prediction results (Fig. 6c). The DT model displayed bimodal error distributions (R² = 0.78). The RF model effectively mitigated overfitting in high-concentration samples (max_features = 0.33), and the GBDT model produced prediction plateaus (MAE = 0.30) within the 1.5–2.5 µg/m³ range.

SO₂ prediction results (Fig. 6d). In SO₂ prediction, the DT model generated a 0.25 µg/m³ error band within the 1.2–2.0 µg/m³ range. The RF model achieved near-perfect fitting (R² = 0.99, MSE = 0.01), while the GBDT model showed sample compression effects (> 2.5 µg/m³, RMSE = 0.18) due to tree depth constraints (max_depth = 3).

CO prediction results (Fig. 6e). For CO, the DT model induced stepwise errors (MAE = 0.05) near the concentration inflection point (≈ 1.2 ppm). The RF model attained instrumental-level precision (RMSE = 0.12), whereas the GBDT model systematically overestimated (> 2.0 ppm) by 5%.

O₃ prediction results (Fig. 6f). The DT model formed dense error bands (R² = 0.88) in the 1.8–2.2 ppb range. The RF model best captured photochemical nonlinearity (n_estimators = 200), while the GBDT model produced peak-shaving errors (Δ=-0.38) for samples > 2.5 ppb.

Discussion

Environmental mechanisms underlying model performance differences

This study demonstrates that RF achieves optimal performance in predicting particulate matter (PM_2.5, PM₁₀) and SO₂ (R² = 0.99), while GBDT exhibits comparable accuracy to RF for NO₂ and CO (ΔR² < 0.05). These findings partially align with recent research. For example, one study identified RF’s superiority in modeling nonlinear relationships of PM_2.5, attributed to its ability to capture meteorological-chemical coupling effects in secondary aerosol formation²¹. However, in contrast to the strong GBDT performance for gaseous pollutants reported elsewhere²⁷, this study finds that RF is highly precise for SO₂ (RMSE = 0.07 µg/m³), which may stem from its capability to characterize spatial heterogeneity in emission inventories. Future work will incorporate high-resolution emission data to validate this hypothesis.

The performance disparities can be attributed to the complexity of pollutant formation mechanisms: particulate matter is governed by multiphase reactions requiring high-dimensional nonlinear feature processing, whereas gaseous pollutants (e.g., NO₂) are influenced by spatiotemporal heterogeneity in point-source emissions, demanding stronger feature interaction modeling capabilities²⁸. RF’s ensemble structure inherently addresses the former through parallelized nonlinear fitting, while GBDT’s sequential error correction adapts better to the latter’s localized emission dynamics.

The observed performance disparities among tree-based models stem from fundamental differences in pollutant formation mechanisms and algorithm learning characteristics. For particulate matter prediction, RF’s ensemble architecture demonstrated superior capability in capturing nonlinear interactions between secondary aerosol precursors and meteorological parameters. Specifically, the model achieved R² = 0.99 for PM_2.5 prediction by effectively resolving the photochemical coupling between CO (SHAP = 0.136) and humidity-driven sulfate suppression (SHAP = −0.102). This aligns with field observations of nitrate-dominated particle growth under stagnant conditions²⁹, where RF’s bootstrap aggregating mechanism mitigated overfitting risks from collinear features (PM_2.5_lag24 vs. PM₁₀_lag12: r = 0.89).

Analysis of synergistic effects among key drivers

SHAP analysis revealed asymmetric synergistic effects among pollutants. In DT-based PM_2.5 prediction, the positive contribution of CO (SHAP = 0.136) is consistent with the role of incomplete combustion products as precursors of secondary organic aerosols¹⁹, while the inhibitory effect of SO₂ (SHAP = −0.102) may relate to limited hygroscopic growth of sulfate aerosols. For DT-based O₃ prediction, the dominant positive influence of temperature (SHAP = 0.076) aligns with photochemical generation mechanisms, and its synergy with CO (SHAP = 0.084) indicates indirect impacts from non-methane volatile organic compound (NMVOC) precursors, consistent with the observational ozone formation potential analyses of a previous study³⁰. Notably, the threshold effect of the PM_2.5_div_Wind feature (SHAP > 0.3) suggests that when wind speeds exceed 4 m/s, turbulent mixing surpasses advective dispersion, leading to localized accumulation of fine particles, a phenomenon corroborated by satellite retrievals during pollution episodes in the Beijing-Tianjin-Hebei region¹⁸.

Model-pollutant compatibility and algorithm characteristics

This study systematically compares the predictive performance of RF, GBDT, and DT across six pollutants, revealing intrinsic compatibility between algorithmic properties and pollutant formation mechanisms. Key findings demonstrate RF’s superiority for particulate matter (PM_2.5, PM₁₀) and SO₂ predictions (R² > 0.98), GBDT’s competitive accuracy for NO₂ (ΔR² = 0.03 vs. RF), and DT’s unexpected competence in O₃ prediction (R² = 0.88). These outcomes are fundamentally linked to pollutant-specific data characteristics and algorithmic design principles.
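The comparison metrics quoted above (R², RMSE, and the gap ΔR² between two models) can be sketched as follows. The observation and prediction values are invented for illustration only, and the variable names `pred_a` and `pred_b` are placeholders rather than outputs of the RF or GBDT models in this study.

```python
from math import sqrt


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean squared error, in the units of the target."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Made-up observations and two made-up sets of predictions
y_true = [1.0, 2.0, 3.0, 4.0]
pred_a = [1.1, 1.9, 3.2, 3.8]
pred_b = [1.5, 1.5, 3.5, 3.5]

delta_r2 = r2_score(y_true, pred_a) - r2_score(y_true, pred_b)
print(r2_score(y_true, pred_a), r2_score(y_true, pred_b), delta_r2)
print(rmse(y_true, pred_a))
```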

Particulate matter prediction

RF’s dominance in PM_2.5/PM₁₀ prediction stems from its ensemble learning capacity to resolve high-dimensional nonlinear interactions. The secondary aerosol conversion and meteorological dispersion governing PM_2.5 formation require modeling complex dependencies among lagged features (e.g., PM_2.5_lag24) and meteorological interactions (PM_2.5_div_Wind), which ranked among the top three in SHAP importance (Fig. 5a). RF’s bootstrap sampling and random feature subsetting mitigate the impact of multicollinearity (e.g., PM_2.5-PM₁₀ correlation r = 0.82) on model stability. In contrast, GBDT exhibits a 23% prediction lag for high-concentration PM_2.5 samples (> 3.5 µg/m³, Fig. 6a), likely because sequential boosting is sensitive to extreme values.
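The two feature types named above, a lagged concentration and a concentration-to-wind ratio, can be built as in the following minimal sketch. It assumes hourly, time-ordered records and a plain ratio with a small constant to avoid division by zero; the exact definitions used in this study are not restated in this section, so the function name, keys, and `eps` value are illustrative assumptions.

```python
def add_lag_and_ratio_features(records, lag_hours=24, eps=1e-6):
    """Add a lagged PM2.5 value and a PM2.5/wind ratio to hourly records.

    `records` is a time-ordered list of dicts with keys 'pm25' and 'wind'.
    Rows without a full lag window are dropped (an assumption, not the
    authors' documented handling of missing lags).
    """
    enriched_rows = []
    for t, row in enumerate(records):
        if t < lag_hours:
            continue  # no value available `lag_hours` steps back
        enriched = dict(row)
        enriched[f"pm25_lag{lag_hours}"] = records[t - lag_hours]["pm25"]
        enriched["pm25_div_wind"] = row["pm25"] / (row["wind"] + eps)
        enriched_rows.append(enriched)
    return enriched_rows


# Made-up hourly series: PM2.5 rises by 1 each hour, wind is constant
hourly = [{"pm25": float(t), "wind": 2.0} for t in range(26)]
features = add_lag_and_ratio_features(hourly)
print(features[0])
```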

SO₂ prediction

RF’s exceptional SO₂ accuracy (RMSE = 0.07 µg/m³) correlates with its ability to characterize spatial heterogeneity in emission inventories. SO₂_lag1’s leading SHAP value (0.1318, Fig. 5c) underscores historical emission persistence as the primary driver. DT’s neglect of feature interactions (e.g., SO₂_lag12-SO₂_lag24 correlation r = 0.91) results in expanded error bands (Fig. 6d).
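A correlation between two lagged copies of the same series, such as the r value quoted above, can be computed as in this sketch. The periodic series is invented, and the helper names are hypothetical; with a strict 12-step period the 12-step and 24-step lags coincide, so the example returns a correlation of 1.

```python
from math import sqrt


def pearson_r(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def lag_correlation(series, lag_a, lag_b):
    """Correlation between the series lagged by `lag_a` and by `lag_b` steps."""
    start = max(lag_a, lag_b)
    a = [series[t - lag_a] for t in range(start, len(series))]
    b = [series[t - lag_b] for t in range(start, len(series))]
    return pearson_r(a, b)


# Made-up series with an exact 12-step cycle
series = [float(t % 12) for t in range(48)]
r = lag_correlation(series, 12, 24)
print(r)
```

Strongly correlated lag features carry nearly the same information, which is the situation in which a single tree's split choices become unstable and averaging over many trees helps.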

NO₂/CO prediction

GBDT achieves performance comparable to RF for NO₂ (R² = 0.85, ΔR² = 0.03) and CO (R² = 0.98), reflecting algorithmic adaptability to emission dynamics. NO₂’s traffic-derived pulse emissions (skewness = 1.32) demand spatiotemporal heterogeneity modeling, where GBDT’s boosting-driven residual correction (Fig. 6c) is less prone to the plateau effects seen with RF. For CO, RF and GBDT converge in modeling linear emission sources (e.g., stationary combustion), achieving an identical R² = 0.98 (Fig. 6e).
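The residual-correction idea behind gradient boosting can be sketched with one-split regression stumps fitted in sequence to the remaining error under squared loss. This is a simplified illustration, not the GBDT configuration used in this study: the pulse-shaped data, the number of rounds, and the learning rate are all made up.

```python
def fit_stump(xs, residuals):
    """Return (threshold, left_mean, right_mean) minimizing squared error."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        stumps.append((threshold, left_mean, right_mean))
        preds = [p + learning_rate * (left_mean if x <= threshold else right_mean)
                 for x, p in zip(xs, preds)]
    return base, stumps, preds


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up hourly signal with a short emission pulse at hours 4 and 5
xs = list(range(10))
ys = [1.0, 1.0, 1.0, 1.0, 8.0, 9.0, 1.0, 1.0, 1.0, 1.0]
base, stumps, preds = boost(xs, ys)
print(mse(ys, [base] * len(ys)), mse(ys, preds))
```

Each round targets whatever error the previous rounds left behind, so a short pulse that a constant or averaged prediction misses is progressively isolated by later stumps.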

O₃ prediction

DT’s competitive O₃ prediction (R² = 0.88) highlights compatibility between algorithmic simplicity and photochemical thresholds. O₃ formation, governed by nonlinear NOx-VOCs ratios and radiation thresholds, exhibits piecewise-linear feature interactions (e.g., abrupt increases in the O₃ generation rate at T > 25 °C). DT’s binary tree splitting inherently aligns with such mechanisms (Fig. 3f), where an 8-layer tree depth (Table 3) matches O₃ concentration inflection points. Ensemble models like RF suffer from prediction smoothing, which diminishes their peak-capturing ability (Fig. 6f).
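Why binary splitting suits threshold-like behaviour can be seen in a minimal regression-tree sketch that chooses each split by minimizing squared error. The temperature and O₃ values below are synthetic, with an abrupt jump placed above 25 °C purely for illustration; the depth and minimum leaf size are arbitrary and do not reproduce the settings in Table 3.

```python
def build_tree(xs, ys, depth=0, max_depth=2, min_samples=2):
    """Recursively build a one-feature regression tree.

    A leaf is a float (mean target); an internal node is a tuple
    (threshold, left_subtree, right_subtree).
    """
    mean_y = sum(ys) / len(ys)
    if depth >= max_depth or len(set(xs)) < 2:
        return mean_y
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        if len(left) < min_samples or len(right) < min_samples:
            continue
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold)
    if best is None:
        return mean_y
    threshold = best[1]
    left_pairs = [(x, y) for x, y in zip(xs, ys) if x <= threshold]
    right_pairs = [(x, y) for x, y in zip(xs, ys) if x > threshold]
    return (
        threshold,
        build_tree([x for x, _ in left_pairs], [y for _, y in left_pairs],
                   depth + 1, max_depth, min_samples),
        build_tree([x for x, _ in right_pairs], [y for _, y in right_pairs],
                   depth + 1, max_depth, min_samples),
    )


def predict(tree, x):
    """Walk from the root to a leaf and return the leaf mean."""
    while isinstance(tree, tuple):
        threshold, left, right = tree
        tree = left if x <= threshold else right
    return tree


# Synthetic data: slow rise up to 25 degrees C, abrupt jump above it
temps = list(range(15, 35))
ozone = [40 + 0.5 * (t - 15) if t <= 25 else 80 + 2 * (t - 25) for t in temps]
tree = build_tree(temps, ozone)
print(tree[0], predict(tree, 20), predict(tree, 30))
```

On this synthetic series the root split lands between 25 and 26 °C, showing how a single tree can place a decision boundary directly at an abrupt change, whereas averaging many trees tends to blur it.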

Limitations and future research directions

Despite significant advancements, this study has three key limitations. First, the temporal scope of the dataset (2021–2024) excludes extreme climate events such as El Niño³¹, potentially underestimating long-term meteorological impacts. Second, the spatial resolution of emission inventories (10 km) limits precise characterization of industrial point sources. Third, prediction errors remain pronounced during extreme pollution episodes (PM_2.5 > 250 µg/m³, RMSE = 38.7 µg/m³), necessitating robustness optimization methods like adversarial training³².
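Reporting error separately for extreme episodes, as in the third limitation, amounts to computing RMSE within concentration regimes. The sketch below assumes a split on the observed value at 250 µg/m³; the observations and predictions are invented and do not reproduce the RMSE reported here.

```python
from math import sqrt


def rmse_by_regime(y_true, y_pred, threshold=250.0):
    """RMSE computed separately for observations above and below `threshold`.

    Returns None for a regime that contains no samples.
    """
    squared_errors = {"normal": [], "extreme": []}
    for t, p in zip(y_true, y_pred):
        regime = "extreme" if t > threshold else "normal"
        squared_errors[regime].append((t - p) ** 2)
    return {
        regime: (sqrt(sum(errs) / len(errs)) if errs else None)
        for regime, errs in squared_errors.items()
    }


# Made-up PM2.5 observations and predictions (µg/m³)
observed = [50.0, 100.0, 300.0, 400.0]
predicted = [52.0, 97.0, 270.0, 360.0]
scores = rmse_by_regime(observed, predicted)
print(scores)
```

A pooled RMSE would hide the gap between the two regimes, since the few extreme samples are outnumbered by ordinary ones in most monitoring records.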

Future research will focus on integrating WRF-Chem simulations with ML models to establish a multi-scale hybrid forecasting system. Additionally, dynamic pollution control strategies will be explored through SHAP-based optimization, targeting feature-specific interventions aligned with key driver thresholds.

Conclusions

This study systematically elucidates the performance disparities of tree-based models in multi-pollutant atmospheric prediction and their environmental mechanisms, and demonstrates that these models offer complementary strengths for multi-pollutant forecasting: RF excels in particulate matter, GBDT in traffic-related gases, and DT in photochemical O₃.

SHAP attribution analysis quantified synergistic interactions among key drivers. It revealed CO’s dominant role as a secondary organic precursor (SHAP = 0.136) in PM_2.5 prediction and the inhibitory effect of humidity-SO₂ interactions (Humidity × SO₂ = -0.0104) on sulfate aerosol formation, providing mechanistic foundations for coordinated pollution control.

Current models remain limited by the 10 km resolution of the emission inventory, which hampers the characterization of industrial point sources, and by a study period that excludes climate anomalies such as El Niño. Errors also remain elevated during extreme pollution events (PM_2.5 > 250 µg/m³, RMSE = 38.7 µg/m³). Future research will integrate WRF-Chem simulations with machine learning to develop multi-scale hybrid forecasting systems, enhance robustness via adversarial training, and optimize heavy-pollution emergency responses through SHAP-driven dynamic feature weighting. The standardized prediction-interpretation framework proposed here has been operationally implemented in the Beijing-Tianjin-Hebei regional air quality warning system, offering a transferable paradigm for urban atmospheric pollution management.

Data availability

Owing to confidentiality regulations, the datasets generated and/or analyzed during the current study are not publicly available, but they can be obtained from the corresponding author upon reasonable request.

References

1. Cao, W. et al. Short-term air pollution exposure and risk of respiratory pathogen infections: an 11-year case-crossover study in Guangzhou, China. BMC Public. Health. 25, 1411 (2025).
2. Watson, A. S. & Bai, R. S. Studies on ecosystem services and air-pollution mitigation in tropical urban vegetation using i-Tree eco model. Environ. Dev. Sustain. https://doi.org/10.1007/s10668-025-06016-7 (2025).
3. Inlaung, K., Chotamonsak, C., Macatangay, R. & Surapipith, V. Assessment of transboundary PM2.5 from biomass burning in Northern Thailand using the WRF-Chem model. Toxics 12, 462 (2024).
4. Abuouelezz, W. et al. Exploring PM2.5 and PM10 ML forecasting models: a comparative study in the UAE. Sci. Rep. 15, 9797 (2025).
5. Hanigan, I. C. et al. Deep ensemble machine learning with bayesian blending improved accuracy and precision of modelled ground-level Ozone for region with sparse monitoring: Australia, 2005–2018. Environ. Model. Softw. 187, 106378 (2025).
6. Pande, C. B. et al. Forecasting of monthly air quality index and Understanding the air pollution in the urban city, India based on machine learning models and cross-validation. J. Atmos. Chem. 82, 1 (2025).
7. Wang, F. et al. Ground visibility prediction using tree-based and random-forest machine learning algorithm: comparative study based on atmospheric pollution and atmospheric boundary layer data. Atmospheric Pollution Res. 15, 102270 (2024).
8. Kalantari, E., Gholami, H., Malakooti, H., Kaskaoutis, D. G. & Saneei, P. An integrated feature selection and machine learning framework for PM10 concentration prediction. Atmospheric Pollution Res. 16, 102456 (2025).
9. Dai, H. et al. Comparison and evaluation of machine learning models for predicting indoor PM2.5 concentrations on a large Spatiotemporal scale. Build. Simul. https://doi.org/10.1007/s12273-025-1276-0 (2025).
10. Wu, B. et al. Winter-spring droughts exacerbated PM2.5-O3 compound pollution? Evidence from China. Sci. Total Environ. 959, 178309 (2025).
11. Lei, S. et al. Study on O3-NOx-VOCs combined air pollution and Ozone health effects in the Hexi corridor. Environ. Sci. Pollut Res. 31, 49837–49854 (2024).
12. Cao, W., Wang, L., Li, R., Zhou, W. & Zhang, D. Unveiling the nonlinear relationships and co-mitigation effects of green and blue space landscapes on PM2.5 exposure through explainable machine learning. Sustainable Cities Soc. 122, 106234 (2025).
13. Corradino, C., Jouve, P., La Spina, A. & Del Negro, C. Monitoring earth’s atmosphere with Sentinel-5 TROPOMI and artificial intelligence: quantifying volcanic SO2 emissions. Remote Sens. Environ. 315, 114463 (2024).
14. Ren, W. et al. Prediction of residual stresses in GFRP strips under wind-sand erosion by interpretable machine learning methods: feature engineering and SHAP analysis. Multiscale Multidiscip Model. Exp. Des. 8, 270 (2025).
15. Gao, Y. et al. Adversarial neural collaborative filtering with embedding dimension correlations. Data Intell. 5, 786–806 (2023).
16. Imai, S., Koriyama, T., Yonekura, S., Sugasawa, S. & Nishiyama, Y. Fully Data-Driven normalized and exponentiated kernel density estimator with Hyvärinen score. J. Bus. Economic Stat. 43, 110–121 (2025).
17. Sun, Y. et al. Modifying the one-hot encoding technique can enhance the adversarial robustness of the visual model for symbol recognition. Expert Syst. Appl. 250, 123751 (2024).
18. Ji, Y., Wang, Y., Wang, C., Tang, X. & Song, M. Remote sensing fine Estimation model of PM2.5 concentration based on improved long Short-Term memory network: A case study on Beijing–Tianjin–Hebei urban agglomeration in China. Remote Sens. 16, 4306 (2024).
19. Qu, Q. et al. Response of organic aerosol in Beijing to emission reductions during the XXIV olympic winter games. Sci. Total Environ. 914, 170033 (2024).
20. Petrović, N., Moyà-Alcover, G., Jaume-i-Capó, A. & Buades Rubio, J. M. Enhancing generalization in sickle cell disease diagnosis through ensemble methods and feature importance analysis. Eng. Appl. Artif. Intell. 142, 109875 (2025).
21. Salehie, O., Jamal, M. H. B. & Shahid, S. Characterization and prediction of PM2.5 levels in Afghanistan using machine learning techniques. Theor. Appl. Climatol. 155, 9081–9097 (2024).
22. Zeng, Y., Ye, H. & Li, G. Adaptive dynamic service placement approach for Edge-Enabled vehicular networks based on SAC and RF. Concurrency Comput. 37, e70041 (2025).
23. Zhang, Y. et al. Semantic distance of icons: impact on user cognitive performance and a new model for semantic distance classification. Int. J. Ind. Ergon. 102, 103610 (2024).
24. Kobourov, S. et al. The influence of dimensions on the complexity of computing decision trees. Artif. Intell. 343, 104322 (2025).
25. Zhang, J. et al. HyperTuneFaaS: A serverless framework for hyperparameter tuning in image processing models. Displays 87, 102990 (2025).
26. Deng, T. et al. Multi-classification prediction of PM2.5 concentration based on improved adaptive boosting rotation forest. J. Environ. Chem. Eng. 12, 114658 (2024).
27. Almalawi, A. et al. An IoT based system for magnify air pollution monitoring and prognosis using hybrid artificial intelligence technique. Environ. Res. 206, 112576 (2022).
28. Li, T. et al. Long-term reconstruction of NO2 photolysis rate coefficients using machine learning and its impact on secondary pollution: A case study in a megacity of the Sichuan Basin, China. Atmospheric Pollution Res. 16, 102526 (2025).
29. Su, H. et al. Growth of nitrate contribution to aerosol pollution during wintertime in Xi’an, Northwest china: formation mechanism and effects of NH3. Particuology 87, 303–315 (2024).
30. Yadav, P., Lal, S., Tripathi, S. N., Jain, V. & Mandal, T. K. Role of sources of NMVOCs in O3, OH reactivity, and secondary organic aerosol formation over Delhi. Atmospheric Pollution Res. 15, 102082 (2024).
31. Liu, L., Liang, Y. & Zhang, T. El Niño-Southern Oscillation and its impact on population exposure to Ozone pollution and heatwave compound events in China. Atmos. Environ. 351, 121209 (2025).
32. Xu, L., Zhang, C., Xu, Z. & Long, D. Z. A nonparametric robust optimization approach for Chance-Constrained knapsack problem. SIAM J. Optim. 35, 739–766 (2025).

Funding

This work was supported by the “Shenyang University of Technology talent introduction research funds (project code: 524537)”. We thank the editors and reviewers for their suggestions and help.

Author information

Xiaofeng Zhu and Bo Li contributed equally to this work.

Authors and Affiliations

College of Chemical Equipment, Shenyang University of Technology, Liaoyang, 111003, China
Xiaofeng Zhu, Yan Cao & Qian Zhang
Appraisal Center for Environment and Engineering, Ministry of Ecology and Environment, Beijing, 100144, China
Bo Li

Contributions

Xiaofeng Zhu: Conceptualization, Methodology, Software, Validation, Formal Analysis, Investigation, Data Curation, Writing - Original Draft, Writing - Review & Editing, Visualization. Bo Li: Resources, Writing - Review & Editing, Supervision. Yan Cao: Investigation, Data Curation. Qian Zhang: Conceptualization, Resources, Writing - Review & Editing, Supervision, Project Administration, Funding Acquisition.

Corresponding author

Correspondence to Qian Zhang.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Zhu, X., Li, B., Cao, Y. et al. Applicability analysis of tree-based ensemble learning for air pollutant prediction models. Sci Rep 16, 9602 (2026). https://doi.org/10.1038/s41598-025-32652-0

Received: 19 August 2025
Accepted: 11 December 2025
Published: 25 February 2026
Version of record: 23 March 2026
DOI: https://doi.org/10.1038/s41598-025-32652-0
