Assessment model of ozone pollution based on SHAP-IPSO-CNN and its application

Zhou, Xiaolei; Wang, Xingyue; Guo, Ruifeng

doi:10.1038/s41598-025-87702-4


Article
Open access
Published: 27 January 2025

Scientific Reports volume 15, Article number: 3404 (2025)

Abstract

The problem of ground-level ozone (O₃) pollution has become a global environmental challenge with far-reaching impacts on public health and ecosystems. Effective control of ozone pollution still faces complex challenges from factors such as complex precursor interactions, variable meteorological conditions and atmospheric chemical processes. To address this problem, a convolutional neural network (CNN) model combining the improved particle swarm optimization (IPSO) algorithm and SHAP analysis, called SHAP-IPSO-CNN, is developed in this study, aiming to reveal the key factors affecting ground-level ozone pollution and their interaction mechanisms. Firstly, an atmospheric dispersion model is utilized to predict the distribution concentration of VOCs emitted by enterprises in the park at the target monitoring stations based on the ozone generation mechanism. Then three mainstream machine learning models are compared for SHAP analysis to obtain the significance results of relevant features. Finally, the IPSO algorithm is combined with SHAP analysis to dynamically adjust the training features to optimize the performance of the CNN model. The model integrates atmospheric pollutants and related meteorological data to explore the nonlinear influence relationship of ozone formation in depth. The performance of the model is validated by the comprehensive evaluation indexes of R², MAE and RMSE, and the results show that the present model outperforms the IPSO-CNN and SHAP-PSO-CNN models with the performance indexes of R² of 0.9492, MAE of 0.0061 mg/m³ and RMSE of 0.0084 mg/m³. This study not only advances the understanding of ozone pollution formation mechanisms, but also provides an assessment of the impact of VOCs emissions from enterprises in the park, which provides empirical support for environmental management.

Introduction

Ground-level ozone (O₃) pollution has become a major environmental problem globally, and severe ozone pollution has significant impacts on human health and socio-economic development³. In China, especially in some cities, ozone has become another important pollutant affecting the improvement of urban air quality in addition to PM2.5^1,2. Although various measures have been taken by local governments and scientific research institutions to control and manage ozone pollution, effective control of ground-level ozone remains a challenge due to the complexity of its formation mechanism. Ozone formation is affected by the interaction of various factors such as precursors, e.g., volatile organic compounds (VOCs) and nitrogen oxides (NO_x), atmospheric chemical processes, and meteorological factors^4,5,6. Therefore, it is important to make full use of observational data to analyze the causes of ozone pollution as well as the relationships among multiple drivers to promote air pollution prevention and control.

Several studies have applied air quality models to assess the independent contribution of different factors to ozone formation^7,8,9. However, these models are limited by high computational costs, high sensitivity to input data errors, and difficulties in reconstructing accurate processes under complex meteorological and chemical conditions. In recent years, rapid advances in data science and machine learning techniques have provided new solutions in the field of air pollution control¹⁰. Compared to traditional air quality models, machine learning methods can be modeled in a data-driven manner by processing large-scale datasets without the need to directly simulate complex physical and chemical processes in the atmosphere. These methods can effectively capture the complex nonlinear relationships between ozone and its drivers, reducing the computational burden while improving model robustness to data noise and input perturbations¹¹. In recent years, various machine learning methods have been extensively applied to predict air pollutant concentrations and analyze the causes of air pollution^{12,13,14,15,16,17}. In addition, some studies have used ML techniques to quantify the relative contributions of input variables to model the prediction targets, e.g., Dang et al. applied a multiple linear regression (MLR) model to quantify the contributions of important meteorological variables to ground-level ozone concentrations, and efficiently quantitatively resolved the effects of different factors on ozone formation⁷. In addition, Xu et al. combined an optimized random forest (RF) model with structure mining techniques to deeply reveal the combined effects of multiple drivers on ozone formation in the environment¹⁸. Wang et al. constructed a hybrid deep learning model based on observational data, which was combined with the GEOS-Chem atmospheric chemical transport model, to investigate the changes of surface ozone concentration in the North China Plain from 2015 to 2021. 
Their study presents a comprehensive analysis of the meteorological and anthropogenic drivers of summer surface ozone concentration changes in that region over the same period¹⁹.

The aforementioned studies effectively utilized machine learning techniques to analyze the formation process of atmospheric ozone and its relationship with various meteorological and anthropogenic factors. This study focuses on how to adjust and optimize the parameters of the deep learning model through advanced optimization algorithms to further enhance the accuracy and performance of the model in analyzing atmospheric ozone formation.

Particle swarm optimization (PSO) algorithms have been validated in several fields. For example, Wang et al. combined the PSO algorithm with a BP neural network to create a prediction model for spontaneous combustion risk in gas extraction boreholes (PSO-BPNN), which predicted the risk of spontaneous combustion in mines with high accuracy and stability²⁰. Similarly, Huang et al. proposed an air quality index (AQI) prediction method that optimizes a BP neural network with an improved PSO algorithm, further confirming the ability of PSO to optimize algorithm parameters²¹. Ordóñez-De León et al. used PSO to improve the Adaptive Neuro-Fuzzy Inference System (ANFIS) and thereby enhance the performance of a PM10 particle concentration model²². You et al. proposed a deep learning model for predicting nationwide forest fire risk by optimizing convolutional neural networks with PSO, showing the potential of PSO for the design of deep learning architectures²³. Murillo-Escobar et al. used a PSO-optimized support vector regression (SVR) model to predict air pollutant concentrations ahead of time, showing the utility of PSO in improving the performance of prediction models²⁴. Cen et al. proposed a CO₂ mass concentration prediction method for sheep barn environments based on the RF-PSO-LSTM model, exemplifying the utility of PSO in parameter optimization of composite models²⁵. These studies demonstrate the wide range of applications of PSO in environmental science, air pollution control and other fields, and indicate the importance of PSO-based model optimization and parameter tuning for improving model accuracy and stability.

We propose an integrated model, SHAP-IPSO-CNN, that combines a convolutional neural network, an improved particle swarm optimization algorithm and SHAP analysis. Firstly, based on the ozone generation mechanism, we use an atmospheric dispersion model to predict the concentration of VOCs emitted by the enterprises in the chemical industry park at the target monitoring stations. These predicted concentrations, together with the concentrations of other pollutants at the monitoring stations and meteorological factors, constitute the model input features. The improved PSO algorithm is then combined with the feature importance from SHAP analysis to dynamically adjust the training features and optimize the CNN model. The improved particle swarm optimization algorithm dynamically adjusts the inertia weights and balances the global and local search capabilities by comparing the particle fitness, the population average fitness and the historical best fitness. In addition, an adaptive learning rate mechanism based on particle performance is introduced to further improve the flexibility and efficiency of the particle search process and enhance the overall optimization performance.

The model combines the efficient feature extraction capability of CNN, the global search and optimization properties of IPSO and the interpretability of SHAP analysis to improve the accuracy and interpretability of ozone pollution assessment under complex conditions. This study contributes to sustainable development by optimizing a deep learning model to improve the accuracy of ozone formation analysis and is important for effective environmental management and policy formulation.

Methods

Atmospheric dispersion model

Ozone formation is driven by a complex series of photochemical reactions triggered by volatile organic compounds and nitrogen oxides when exposed to sunlight. These reactions are particularly active in sunny environments and contribute significantly to the increase in ozone concentrations near the surface, a phenomenon commonly described as photochemical smog formation. Therefore, accurate prediction of the concentration distribution of VOCs emitted by firms as they propagate through the atmosphere to reach the monitoring station is essential for a deeper understanding of their impact on ozone concentrations in the vicinity of the monitoring station. To this end, a Gaussian plume model was used to estimate the concentration distribution of VOCs emitted by different enterprises at the target monitoring station²⁶.

The calculation takes the following inputs: pollutant emission concentration $Q$, wind speed at the emission point $U$, emission height $H$, distance from the source $x$, ambient temperature $T$, relative humidity $RH$, atmospheric pressure $P$, and diffusion coefficients $\sigma _{y}$ and $\sigma _{z}$ for the lateral and vertical directions, respectively. To accommodate changes in temperature, humidity, and pressure, we define an environmental adjustment factor, calculated by Eq. (1). This factor modifies the diffusion coefficients to account for the effects of these environmental variables on pollutant dispersion:

$$Adjustment\_factor = 1 + 0.01*(T - 20) + 0.005*(RH - 50) + 0.0001*P$$

(1)

Here, $T$, $RH$ and $P$ are the measured temperature, relative humidity and atmospheric pressure. The temperature and humidity terms enter as deviations from baseline conditions of 20 °C and 50%, respectively, whereas no baseline is specified for pressure.

After the diffusion coefficients are adjusted, the pollutant concentration $C(x)$ at a distance $x$ from the source is calculated with the Gaussian plume model, Eq. (2):

$$C(x) = \frac{Q}{{2\pi U\sigma _{y} \sigma _{z} }}\exp \left( { - \frac{{x^{2} }}{{2\sigma _{y}^{2} }}} \right)\left( {\exp \left( { - \frac{{(H)^{2} }}{{2\sigma _{z}^{2} }}} \right) + \exp \left( { - \frac{{( - H)^{2} }}{{2\sigma _{z}^{2} }}} \right)} \right)$$

(2)

The output of the Gaussian plume model will be used as input features to the model for fitting the contribution of ozone formation. This integration forms an important part of our study.
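Eqs. (1) and (2) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function and parameter names are hypothetical, the units of the inputs are left to the caller, and the text does not say how the adjustment factor modifies the diffusion coefficients, so the sketch assumes a simple multiplicative scaling of both $\sigma _{y}$ and $\sigma _{z}$.

```python
import math


def adjustment_factor(temp_c, rel_humidity, pressure):
    """Environmental adjustment factor of Eq. (1)."""
    return 1 + 0.01 * (temp_c - 20) + 0.005 * (rel_humidity - 50) + 0.0001 * pressure


def plume_concentration(q, u, h, x, sigma_y, sigma_z, factor=1.0):
    """Concentration C(x) of Eq. (2), written exactly as printed.

    Assumption: `factor` multiplies both diffusion coefficients.
    """
    sy = sigma_y * factor
    sz = sigma_z * factor
    # Exponential term in the distance x, as it appears in Eq. (2).
    distance_term = math.exp(-x ** 2 / (2 * sy ** 2))
    # The two vertical terms of Eq. (2) are equal because (H)^2 == (-H)^2.
    vertical_term = 2 * math.exp(-h ** 2 / (2 * sz ** 2))
    return q / (2 * math.pi * u * sy * sz) * distance_term * vertical_term


# Hypothetical inputs, chosen only to exercise the functions.
factor = adjustment_factor(temp_c=25.0, rel_humidity=60.0, pressure=0.0)
c = plume_concentration(q=1.0, u=2.0, h=10.0, x=50.0,
                        sigma_y=40.0, sigma_z=20.0, factor=factor)
```

At the baseline conditions (20 °C, 50% humidity) with a zero pressure term the factor equals 1, so the diffusion coefficients are left unchanged.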

Machine learning models

This study considers three machine learning models: random forest (RF), XGBoost, and ridge regression. RF improves overall model accuracy and robustness by integrating the predictions of numerous decision trees²⁷. The XGBoost model, an optimized implementation of gradient boosting decision trees, is widely used for its computational efficiency and accurate prediction performance²⁸. Ridge regression is chosen for its ability to handle multicollinearity and prevent overfitting, especially with high-dimensional data²⁹. We compare these three models using R², RMSE and MAE as key performance indicators. The performance of the models determines their suitability for the subsequent analysis. The parameter settings of the selected models are detailed in Table 1.

Table 1 Hyperparameter settings for machine learning models.


SHAP algorithm

A feature importance analysis of the best performing machine learning model is performed using the SHAP (SHapley Additive exPlanations) algorithm³⁰ to identify features that have a significant impact on the target variable. SHAP is a method for explaining machine learning models that quantifies, for each feature, how much it contributes to the model's predicted outcome. The indexes of the important features identified by the SHAP analysis are placed into a feature queue as an initial feature selection scheme. The feature indexes in the queue are dynamically updated during the optimization process.
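The idea behind SHAP values and the initial feature queue can be illustrated with exact Shapley values on a toy model. The sketch below is not the SHAP library and not the paper's random forest: `toy_model`, the sample values and the baseline are hypothetical, and the attribution is computed by brute-force enumeration of feature orderings, which is only practical for a handful of features.

```python
from collections import deque
from itertools import permutations


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict(x) relative to predict(baseline)."""
    n = len(x)
    phi = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        current = list(baseline)
        prev = predict(current)
        for j in order:
            current[j] = x[j]          # switch feature j from baseline to x
            now = predict(current)
            phi[j] += now - prev       # marginal contribution of feature j
            prev = now
    return [p / len(orders) for p in phi]


def toy_model(features):
    """Hypothetical stand-in for a trained model (T, RH, WS as inputs)."""
    t, rh, ws = features
    return 0.5 * t - 0.2 * rh + 0.1 * t * ws


names = ["T", "RH", "WS"]
baseline = [20.0, 50.0, 0.0]
samples = [[30.0, 40.0, 2.0], [25.0, 70.0, 1.0], [35.0, 55.0, 3.0]]

# Global importance: mean absolute Shapley value over the samples.
mean_abs = [0.0] * len(names)
for x in samples:
    for j, value in enumerate(shapley_values(toy_model, x, baseline)):
        mean_abs[j] += abs(value) / len(samples)

# Initial feature queue, most important feature first.
ranked = sorted(range(len(names)), key=lambda j: mean_abs[j], reverse=True)
feature_queue = deque(names[j] for j in ranked)
```

For each sample the Shapley values sum to the difference between the prediction and the baseline prediction, which is the additivity property that SHAP relies on.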

Improved particle swarm optimization algorithm

Particle Swarm Optimization (PSO), introduced by Kennedy and Eberhart in 1995³¹, is a nature-inspired optimization algorithm that simulates the social behavior of organisms like birds or fish schools to solve global optimization problems.

Particle swarm optimization principles and workflow

In PSO, a set of candidate solutions, referred to as “particles,” navigate through the search space to find the optimal solution. The particles communicate with each other and adjust their positions based on individual and group experiences.

Each particle has a position, representing a potential solution, and a velocity, determining the direction and speed of its movement. Particle fitness represents how good a solution a particle’s position is, as evaluated by a fitness function. It quantifies the particle’s performance in the context of the optimization objective.

The movement of particles is influenced by two factors:

1) The personal best position ($P_{{best}}$), representing the best solution a particle has achieved so far.
2) The global best position ($G_{{best}}$), which is the best solution found by any particle in the swarm.

The workflow of the PSO is as follows:
a. Initialization: The algorithm begins by initializing a set of particles with random positions and velocities. Each particle represents a potential solution to the problem.
b. Fitness evaluation: Each particle’s fitness is evaluated based on a defined fitness function. This function measures the quality of the particle’s current position.
c. Velocity update: Particles update their velocities based on $P_{{best}}$ and $G_{{best}}$. The velocity update rule for each particle is described in Eq. (3):
  $$V_{i}^{{(t + 1)}} = \omega \cdot V_{i}^{{(t)}} + c_{1} \cdot r_{1} \cdot (P_{{best,i}} - X_{i}^{{(t)}} ) + c_{2} \cdot r_{2} \cdot (G_{{best}} - X_{i}^{{(t)}} )$$
  (3)
  
  where:
  - $V_{i}^{{(t)}}$ denotes the velocity of the $i$th particle at the $t$th iteration,
  - $\omega$ denotes the inertia weight coefficient, balancing exploration and exploitation,
  - $c_{1}$ and $c_{2}$ denote the acceleration coefficients,
  - $r_{1}$ and $r_{2}$ are random numbers between 0 and 1,
  - $P_{{best,i}}$ is the personal best position of the $i$th particle,
  - $G_{{best}}$ is the global best position of all the particles,
  - $X_{i}^{{(t)}}$ denotes the position of the $i$th particle at the $t$th iteration.
d. Position update: After updating velocities, particles adjust their positions according to Eq. (4):
  $$X_{i}^{{(t + 1)}} = X_{i}^{{(t)}} + V_{i}^{{(t + 1)}}$$
  (4)
  This step moves each particle towards better solutions based on both personal and global information.
e. Termination: The algorithm iterates until a termination criterion is met, such as reaching a maximum number of iterations or achieving a desired accuracy. Over time, particles converge towards the global best solution.
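The workflow above can be sketched as a basic PSO loop with fixed coefficients. This is a minimal illustration under stated assumptions, not the configuration used in the paper: the function name `pso`, the search bounds, the values of `w`, `c1` and `c2`, the use of minimization, and the sphere test function are all hypothetical choices made for the example.

```python
import random


def pso(fitness, dim, n_particles=20, n_iter=100, w=0.7, c1=1.5, c2=1.5,
        bounds=(-5.0, 5.0), seed=0):
    """Basic PSO that minimizes `fitness` (lower is better, by assumption)."""
    rng = random.Random(seed)
    lo, hi = bounds
    # a. Initialization: random positions and velocities.
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n_particles)]
    # b. Fitness evaluation of the initial swarm.
    p_best = [p[:] for p in pos]
    p_best_fit = [fitness(p) for p in pos]
    g_idx = min(range(n_particles), key=lambda i: p_best_fit[i])
    g_best, g_best_fit = p_best[g_idx][:], p_best_fit[g_idx]

    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # c. Velocity update, Eq. (3).
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (p_best[i][d] - pos[i][d])
                             + c2 * r2 * (g_best[d] - pos[i][d]))
                # d. Position update, Eq. (4).
                pos[i][d] += vel[i][d]
            fit = fitness(pos[i])
            if fit < p_best_fit[i]:
                p_best[i], p_best_fit[i] = pos[i][:], fit
                if fit < g_best_fit:
                    g_best, g_best_fit = pos[i][:], fit
    # e. Termination: here simply a fixed number of iterations.
    return g_best, g_best_fit


def sphere(x):
    """Simple test function with its minimum of 0 at the origin."""
    return sum(v * v for v in x)


best_pos, best_fit = pso(sphere, dim=2)
```

In this sketch the global best is refreshed as soon as any particle improves on it, so later particles in the same iteration already see the new value; updating it once per iteration is an equally common variant.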

The improved particle swarm optimization

The improved PSO algorithm (IPSO) in this study introduces a dynamic inertia weight adjustment mechanism, so that the inertia weights change according to the relative performance of the particles instead of being fixed. The mechanism compares each particle's fitness with the average fitness of all particles and with the best fitness, and uses the result to balance the global and local search capabilities. If a particle’s fitness is higher than the average, its inertia weight is kept at its maximum value. If the particle’s fitness is lower than the average, its inertia weight is adjusted according to Eq. (5):

$$\omega = \omega _{{\max }} - mul \cdot (f_{{avg}} - f_{i} ) \cdot (\omega _{{\max }} - \omega _{{\min }} ) \cdot (f_{{avg}} - f_{{\min }} )$$

(5)

where $\omega$ is the inertia weight of the current iteration, $\omega _{{\max }}$ and $\omega _{{\min }}$ are the maximum and minimum values of the inertia weight, respectively, and $mul$ is an adjustable parameter that controls the rate of change of the inertia weight. $f_{i}$ is the fitness of the current particle, $f_{{avg}}$ is the average fitness of all particles, and $f_{{\min }}$ is the minimum fitness observed in the population. This adaptive strategy allows particles to explore more extensively when they perform better than average.
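The rule of Eq. (5) can be written as a small function. This is a sketch under assumptions: the default values of `w_max`, `w_min` and `mul` are hypothetical, a particle whose fitness equals the average is treated like an above-average particle, and the result is clamped to the interval between `w_min` and `w_max`, which the text does not state explicitly.

```python
def inertia_weight(f_i, f_avg, f_min, w_max=0.9, w_min=0.4, mul=1.0):
    """Dynamic inertia weight following Eq. (5)."""
    if f_i >= f_avg:
        # Fitness at or above the average: keep the maximum inertia weight.
        return w_max
    w = w_max - mul * (f_avg - f_i) * (w_max - w_min) * (f_avg - f_min)
    # Assumption: keep the result inside [w_min, w_max].
    return min(w_max, max(w_min, w))


# Hypothetical fitness values for three cases.
w_above = inertia_weight(f_i=2.0, f_avg=1.0, f_min=0.5)
w_below = inertia_weight(f_i=0.5, f_avg=1.0, f_min=0.5)
w_clamped = inertia_weight(f_i=0.5, f_avg=1.0, f_min=0.5, mul=100.0)
```

Larger values of `mul` pull below-average particles towards the minimum inertia weight more quickly, which shifts them from global exploration to local search.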

Similarly, we employ an adaptive learning rate based on particle performance, which further improves the flexibility and efficiency of the particle search. When updating the global optimal position, the algorithm considers not only the individual historical optimal value but also the historical optimal value of the population. The learning rate is calculated according to Eq. (6):

$$c = c_{{\max }} - \left(\frac{{c_{{\max }} - c_{{\min }} }}{{f_{{\max }} - f_{{\min }} }}\right) \cdot (f_{{\max }} - f_{i} )$$

(6)

where $c$ is the individual or global learning rate, and $c_{{\max }}$ and $c_{{\min }}$ are the maximum and minimum values of the learning rate, respectively. $f_{i}$ is the fitness of the current particle, and $f_{{\max}}$ and $f_{{\min }}$ are the maximum and minimum fitness values of all particles in the population, respectively. The adaptive learning rate thus computed ensures that particles close to the global optimum will have a more conservative learning rate to fine-tune their position, while particles far from the global optimum will have a more aggressive learning rate to facilitate a broader search.
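Eq. (6) is a linear interpolation between the minimum and maximum learning rate, as the following sketch shows. The default values of `c_max` and `c_min` are hypothetical, and the handling of a swarm in which all particles have the same fitness is an assumption added to avoid a division by zero.

```python
def learning_rate(f_i, f_max, f_min, c_max=2.0, c_min=0.5):
    """Adaptive learning rate following Eq. (6)."""
    if f_max == f_min:
        # Assumption: with identical fitness values, fall back to c_max.
        return c_max
    return c_max - (c_max - c_min) / (f_max - f_min) * (f_max - f_i)


# Hypothetical fitness values spanning the population range [1.0, 3.0].
c_at_max = learning_rate(f_i=3.0, f_max=3.0, f_min=1.0)
c_at_mid = learning_rate(f_i=2.0, f_max=3.0, f_min=1.0)
c_at_min = learning_rate(f_i=1.0, f_max=3.0, f_min=1.0)
```

A particle with fitness $f_{{\max}}$ receives $c_{{\max}}$, a particle with fitness $f_{{\min}}$ receives $c_{{\min}}$, and every other particle falls linearly in between.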

Convolutional neural network

We employ a deep convolutional neural network. The model gradually extracts deep features of the data by sequentially stacking multiple convolutional, pooling and fully connected layers. Each convolutional layer is equipped with a ReLU activation function to introduce nonlinearities, while a batch normalization layer is used to speed up the training process and improve the generalization ability of the model. A dropout layer mitigates overfitting by randomly dropping a portion of the network connections, which increases the robustness of the model. After a max pooling layer reduces the dimensionality of the data, a flattening operation converts the multidimensional data to one dimension so that it can be passed to the fully connected layer that immediately follows. The output layer consists of a single neuron that predicts the ozone concentration values after Box-Cox transformation. The CNN model structure is shown in Fig. 1. The key hyperparameters of the CNN model, including the number of filters, the number of epochs and the dropout probability, are found through IPSO.

Figure 2 illustrates the process of optimizing a CNN using IPSO. The optimization begins with partitioning the dataset and initializing key components, including training feature subsets and IPSO parameters. Each particle in the swarm represents a candidate solution for the CNN’s hyperparameters.

During each iteration, a CNN model is constructed based on the current position of each particle, which reflects its hyperparameter values. The CNN is then trained using the training dataset, and its performance is evaluated using a predefined fitness function. The inertia weights and learning factors are calculated, and these values guide the particle’s velocity and position updates. As particles adjust their velocities and positions, both the $P_{{best}}$ and $G_{{best}}$ solutions are updated accordingly.

This iterative process continues until either the maximum number of iterations is reached or the fitness function converges. Finally, the optimized CNN model is constructed using the $G_{{best}}$ solution, and its performance is evaluated on the test dataset to ensure the model’s effectiveness.

Indicators for model assessment

In this paper, three evaluation metrics, root mean square error (RMSE), mean absolute error (MAE), and coefficient of determination (R²), were selected, with mean absolute error and root mean square error providing information about prediction accuracy and precision, and R² providing information about goodness of fit. The formulas are described by Eq. (7) to (9):

$$R^{2} = 1 - \frac{{\sum\nolimits_{{i = 1}}^{n} {(y_{i} - \hat{y}_{i} )^{2} } }}{{\sum\nolimits_{{i = 1}}^{n} {(y_{i} - \bar{y})^{2} } }}$$

(7)

$$MAE = \frac{1}{n}\sum\limits_{{i = 1}}^{n} {\left| {y_{i} - \hat{y}_{i} } \right|}$$

(8)

$$RMSE = \sqrt {\frac{1}{n}\sum\limits_{{i = 1}}^{n} {(y_{i} - \hat{y}_{i})^{2} } }$$

(9)

In Eqs. (7) to (9), $y_{i}$ is the actual value, $\hat{y}_{i}$ is the predicted value, $\bar{y}$ is the mean value of the dependent variable in the test set, and $n$ is the sample size in the test set.
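The three metrics translate directly into code. The sketch below uses only the Python standard library; the function names and the three-sample example are hypothetical and serve only to show the arithmetic.

```python
import math


def r2_score(y_true, y_pred):
    """Coefficient of determination, Eq. (7)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot


def mae(y_true, y_pred):
    """Mean absolute error, Eq. (8)."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error, Eq. (9)."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true))


# Hypothetical values: one of three predictions is off by 1.
y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 4.0]
scores = (r2_score(y_true, y_pred), mae(y_true, y_pred), rmse(y_true, y_pred))
```

In this example the residual sum of squares is 1 and the total sum of squares is 2, so R² is 0.5, MAE is 1/3 and RMSE is the square root of 1/3.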

Experimental

In this section, we describe and analyze the experimental data, the experimental process, and the experimental results. The experiments were run on a computer with the Windows 11 operating system, an Intel(R) Core(TM) i5-8265U CPU, and 8 GB of RAM. The experimental program is written in Python on the TensorFlow framework, with Jupyter Notebook as the programming tool.

Dataset and pre-processing

In this study, we analyzed data from an environmental monitoring project implemented from May to October between 2020 and 2023 in the chemical park of a city. The project set up 59 monitoring points in the park to track VOCs and other environmental parameters such as temperature, humidity, wind speed and wind direction in real time. Air pollutants such as O₃, NO₂, SO₂, CO, PM2.5 and PM10 were monitored simultaneously at fixed monitoring stations in the park. The data collection interval was 1 h. Data preprocessing included outlier detection, missing value imputation and standardization. First, to ensure the accuracy of data analysis, outliers were removed using Tukey’s method based on the interquartile range (IQR). Specifically, the first (Q1) and third (Q3) quartiles of each feature were used to establish outlier thresholds, and any observation falling below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR was considered an outlier and excluded from the dataset. For handling missing data, a nearest-neighbor imputation strategy was adopted to maintain the continuity and structural integrity of the data. This method effectively preserves the short-term stability of atmospheric pollutant concentrations. Additionally, to address the skewed distribution of the target variable, a Box-Cox transformation was applied to approximate a normal distribution. All numerical features were also standardized by subtracting the mean and dividing by the standard deviation, so that each feature has a mean of zero and a standard deviation of one. The experimental dataset integrates the VOC concentrations at the target monitoring station, as calculated by the atmospheric dispersion model, with the NO₂, SO₂, CO, PM10 and PM2.5 concentrations and the meteorological conditions (T, RH, P, WS, WD) collected at that station. It comprises a total of 11,947 samples. Detailed information about the variables after preprocessing is shown in Table 2. The distribution of these data is shown in Fig. 3. The data are approximately normally distributed, with kurtosis and skewness values in the range of −1 to +1 and −2 to +2, respectively. Finally, to evaluate the generalization ability of the model, we divided the dataset into a 70% training set and a 30% test set.
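The outlier filter, the Box-Cox transformation and the standardization step can be sketched with the Python standard library. Several details are assumptions, since the text does not specify them: the quartile convention (here the default of `statistics.quantiles`), the use of the population standard deviation, the Box-Cox parameter `lam`, and the sample readings, which are hypothetical. The sketch also works on a single column, whereas the paper applies the steps to each feature of the dataset.

```python
import math
import statistics


def tukey_filter(values, k=1.5):
    """Drop values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if lower <= v <= upper]


def box_cox(value, lam):
    """Box-Cox transform of a positive value for a given lambda."""
    if lam == 0:
        return math.log(value)
    return (value ** lam - 1) / lam


def standardize(values):
    """Z-score standardization: zero mean and unit standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # assumption: population standard deviation
    return [(v - mean) / std for v in values]


# Hypothetical readings with one obvious outlier.
readings = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
kept = tukey_filter(readings)
z_scores = standardize(kept)
```

With these readings the fences are −5.5 and 16.5, so only the value 100 is removed before standardization.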

Table 2 Statistical results of the dataset.


Figure 4 shows the heat map of the correlations between meteorological variables and atmospheric pollutant concentrations. Temperature (T), relative humidity (RH) and wind speed (WS) are significantly correlated with ozone concentration, highlighting the key role of these meteorological factors in the regional ozone formation process. Particularly noteworthy is the highly significant positive correlation between temperature and ozone concentration, reflecting the enhancement of photochemical reaction activity in high-temperature environments, which may lead to an increase in ozone production. The heat map also shows a strong positive correlation (0.64) between VOCs and ozone concentration, suggesting that VOCs play a key driving role in ozone production. NO₂, in contrast, is negatively correlated (−0.49) with ozone concentration, which agrees with the findings of Huang, Yan, et al.³²: although NO₂ can generate ozone, excessive NOx titrates ozone and lowers its concentration, reflecting the complex role of NOx in the ozone formation mechanism. Although the direct correlation between particulate matter (PM10 and PM2.5) and ozone is not significant, particulate matter may indirectly modulate the ozone distribution by influencing photochemical reactions or interacting with other meteorological variables. Studies have shown that secondary organic carbon (SOC) in PM2.5 competes with ozone by sharing the same precursors, while high PSA levels promote the heterogeneous production of NO₃ on particle surfaces, which in turn competes for the NOx required for ozone formation and may thus affect ozone production. Therefore, the potential impact of particulate matter on regional ozone pollution cannot be ignored.

Results of SHAP analysis

After completing the data preprocessing, we trained the three mainstream machine learning models. All of them performed well, but a comparison of evaluation metrics such as R² and RMSE showed that the random forest model performed slightly better than XGBoost and ridge regression on the dataset of this study. The model performance comparison is shown in Table 3: the R² of the random forest model reaches 0.9023, compared with 0.8731 for XGBoost and 0.8594 for ridge regression. Based on this result, and in order to interpret the model predictions and the driving factors behind them more accurately, we applied the SHAP analysis method to assess feature importance in the random forest model, and visualized the key features and the strength of their influence on the prediction results with the SHAP summary plot.

Table 3 Machine learning performance.


Figure 5 presents the SHAP summary plot, while Table 4 provides the corresponding feature importance rankings. Both indicate that WS, T, and RH are the primary meteorological factors influencing ozone concentration. Additionally, NO₂ and VOCs are shown to have a significant impact, underscoring their role as key precursors in ozone formation. In terms of SHAP values, WS, WD, T, NO₂ and VOCs exhibit positive SHAP values, meaning that higher values of these features are associated with increased ozone concentrations. In contrast, RH has negative SHAP values, indicating that higher humidity tends to reduce ozone levels. PM2.5, PM10, and P have relatively lower SHAP values, suggesting their limited influence on ozone concentration predictions in this context.

Subsequently, a feature queue ordered by the feature importance ranking is established so that the training features can be adjusted dynamically. In each training cycle, the top seven features in the queue are selected for training. During training, the feature queue update interval is adjusted according to changes in the loss value. If the change in the loss value exceeds a threshold, the update interval is shortened and an update (re-ordering) of the feature queue is triggered. Otherwise, the update interval is extended and the feature queue is not updated. The threshold is set to 0.01 and the initial update interval is set to 5 iterations. The interval is shortened or lengthened by 1 at a time and cannot fall below 1.
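The interval rule described above can be sketched as follows. The function names are hypothetical, the change in loss is assumed to be the absolute difference between two consecutive checks, and the order of the example queue is illustrative only and is not the ranking reported in Table 4. How the queue is re-ordered when an update is triggered is not specified in the text and is therefore left out.

```python
def adjust_interval(prev_loss, curr_loss, interval,
                    threshold=0.01, step=1, min_interval=1):
    """Return (new_interval, reorder_queue) for the feature queue."""
    if abs(curr_loss - prev_loss) > threshold:
        # Large change in loss: check more often and re-order the queue.
        return max(min_interval, interval - step), True
    # Small change in loss: check less often and leave the queue unchanged.
    return interval + step, False


def select_features(feature_queue, k=7):
    """Pick the top k features from the front of the queue."""
    return list(feature_queue)[:k]


# Hypothetical queue order over the eleven input features.
queue = ["WS", "T", "RH", "NO2", "VOCs", "WD", "SO2", "CO", "PM10", "PM2.5", "P"]
training_features = select_features(queue)

interval = 5  # initial update interval stated in the text
interval, reorder = adjust_interval(prev_loss=0.10, curr_loss=0.05, interval=interval)
```

In this example the loss changes by 0.05, which exceeds the threshold of 0.01, so the interval drops from 5 to 4 and a re-ordering is requested.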

Table 4 Results of feature importance analysis.


Model results

The initial parameter settings of the IPSO algorithm are shown in Table 5. The inertia weight w and the learning factors c₁ and c₂ are given initial values of 0.5, 0.4 and 0.6, respectively. These initial settings influence the optimization process, but the key innovation of our approach is that the parameters are not fixed: they are dynamically adjusted throughout the optimization. This variability allows the algorithm to adapt more effectively to the changing landscape of the search space, enhancing its exploration and exploitation capabilities. Additionally, r₁ and r₂ are randomly generated values ranging from 0 to 1, which contribute to the stochastic behavior of the algorithm and improve its ability to escape local optima. The neighborhood size is set to 4, and the Minkowski distance rule is configured with a value of 2, facilitating local interactions among particles. The population consists of 50 particles, and the algorithm is run for 50 iterations, allowing sufficient exploration of the search space. The multiplication factor (mul) is set to 100; as defined for Eq. (5), it controls the rate of change of the inertia weight.

The CNN model was trained using the Adam optimizer with mean square error as the loss function to quantify the difference between the predicted and actual observations. Early stopping was used to prevent overfitting, with a minimum change threshold of 1e-4 and a patience of 30. The CNN hyperparameters obtained by the IPSO algorithm are Filters = 32, Epochs = 50, Dropout = 0.2 and Batch size = 64. The CNN model was then trained with these hyperparameters, and the variation of the training loss is shown in Fig. 6.

Table 5 IPSO algorithm parameter initialization.


Figure 6 exhibits the decreasing trend of training loss and validation loss with the number of iterations. In the initial stage, although the initial loss of the model is high, the loss decreases rapidly, indicating that the model can effectively learn from the data; subsequently, the validation loss gradually stabilizes with the depth of iterations, and when the network is trained for more than 20 epochs, the value of the training loss gradually stabilizes between 0.0009 and 0.0010, indicating that the model achieves a good generalization performance.

Once training is complete, we obtain the final SHAP-IPSO-CNN model of this paper. Fig. 7 shows how well the model fits the ozone mass concentration in the park: the estimated concentration follows the overall trend of the measured concentration, although some peaks are still estimated with a bias.

Model comparison

To compare the performance of the proposed SHAP-IPSO-CNN model with that of other models, we conducted comparison experiments with RF, RF-PSO-LSTM²⁵, SHAP-PSO-CNN and IPSO-CNN. The original RF-PSO-LSTM paper used RF to analyze feature importance and select the training features, whereas this study performs the SHAP feature importance analysis based on RF. In the RF-PSO-LSTM experiments, we therefore directly used the training features selected from the SHAP analysis results and kept the other hyperparameters consistent with the settings of the original paper. All models were tested in the same experimental environment and compared using the same evaluation criteria. The results are shown in Table 6.

Fig. 8 shows how the loss of the SHAP-PSO-CNN, IPSO-CNN and SHAP-IPSO-CNN models decreases during training. The loss of the SHAP-IPSO-CNN model is high in the initial stage, but it then decreases rapidly and gradually stabilizes. The SHAP-IPSO-CNN model has the lowest loss throughout training and shows the best convergence behavior and stability of the three. Although all models successfully learn the patterns in the dataset, the SHAP-IPSO-CNN model therefore performs best during training. In addition, the decreasing loss of SHAP-IPSO-CNN and IPSO-CNN supports the effectiveness of the improved particle swarm optimization algorithm in optimizing the CNN parameters.

Table 6 Model assessment indicator data.

The comparison shows that the convolutional neural network model combining IPSO and SHAP analysis (SHAP-IPSO-CNN) delivers a clear performance improvement; detailed data are given in Table 6. Compared with the SHAP-PSO-CNN model, the SHAP-IPSO-CNN model improves R² by 1.66% and reduces RMSE and MAE by 12.5% and 10.3%, respectively, indicating better overall accuracy and stability. Compared with the IPSO-CNN model, it reduces RMSE and MAE by 11.6% and 9.0%, respectively, and improves R² by 1.5%, which further confirms the positive effect of feature selection on model performance. Compared with the RF-PSO-LSTM model, it is better on all key metrics: R² improves by 1.74%, while RMSE decreases by 13.4% and MAE by 13.0%.
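Percentage comparisons of this kind can be reproduced with a one-line helper. The sketch below assumes the percentages are relative changes with respect to the baseline model, which the text does not state explicitly. The baseline RMSE in the usage line is a hypothetical value chosen to make the arithmetic easy to follow, not a figure taken from Table 6; only the 0.0084 mg/m³ RMSE of SHAP-IPSO-CNN is reported in the paper.

```python
def relative_change_percent(baseline, proposed):
    """Relative change of `proposed` with respect to `baseline`, in percent.

    Negative values mean a decrease (desirable for RMSE and MAE),
    positive values an increase (desirable for R²).
    """
    return (proposed - baseline) / baseline * 100.0

# Hypothetical baseline RMSE of 0.0096 mg/m³ against the reported 0.0084 mg/m³:
# the result is a decrease of roughly 12.5 percent.
change = relative_change_percent(0.0096, 0.0084)
```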

As Fig. 9 shows, the SHAP-IPSO-CNN model obtained the highest R² of all models in the comparison. This result indicates that ranking feature importance with the SHAP algorithm first and then training on a dynamically selected feature subset significantly reduces model error and improves reliability compared with training on the full feature set. The lower RMSE and MAE, together with the higher R², support the accuracy and credibility of the method in estimating ozone mass concentration.

Overall, the SHAP-IPSO-CNN model improves not only accuracy but also goodness of fit. It combines the efficient feature extraction of convolutional neural networks, the faster convergence of the improved particle swarm optimization algorithm, and dynamic feature selection based on the SHAP analysis results. Combining this dynamic feature selection strategy with IPSO optimization significantly enhances the performance of the CNN model. The experiments verify the effectiveness and adaptability of this integrated strategy, which provides solid support for estimating the contribution of VOCs emitted by enterprises in chemical parks to ozone formation.

Impact assessment

Using the trained SHAP-IPSO-CNN model, this study further quantifies the impact of each enterprise's VOC emissions on surface ozone formation in the park. In the experimental design, the samples associated with an enterprise's ID were excluded from the training when that enterprise was evaluated, which is equivalent to “turning off” its emissions in the simulation environment while keeping all other conditions unchanged. The results of this hypothetical simulation were compared with the model predictions for the actual situation to calculate, for each enterprise, the average impact score of its VOC emissions on changes in ozone concentration.

The evaluation iterates over each enterprise ID in the dataset and modifies the corresponding samples in the test dataset. The modified dataset is then passed through the trained SHAP-IPSO-CNN model to estimate the ozone concentration under that condition, i.e., in the absence of emissions from that enterprise. The average impact score of each enterprise on ozone concentration is derived from the difference between these outputs and the model outputs for the unmodified dataset. Using the data for July 19, 2021 as an example, the results are shown in Table 7.
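The evaluation loop described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the sample layout (a dictionary with hypothetical `enterprise_id` and `voc` keys), the choice to zero the VOC feature as the way of "turning off" an enterprise, the averaging over all test samples, and the toy linear predictor are all assumptions; in the paper the predictor would be the trained SHAP-IPSO-CNN model.

```python
def enterprise_impact_scores(predict, samples, voc_key="voc", id_key="enterprise_id"):
    """Average drop in predicted ozone when one enterprise's VOC input is zeroed.

    predict: callable mapping one sample (dict) to a predicted ozone concentration.
    samples: list of dicts, each carrying an enterprise ID and a VOC feature.
    """
    baseline = [predict(s) for s in samples]  # predictions for the unmodified data
    scores = {}
    for eid in sorted({s[id_key] for s in samples}):
        # Copy and modify only the samples that belong to this enterprise.
        modified = [dict(s, **{voc_key: 0.0}) if s[id_key] == eid else s
                    for s in samples]
        counterfactual = [predict(s) for s in modified]
        diffs = [b - c for b, c in zip(baseline, counterfactual)]
        scores[eid] = sum(diffs) / len(diffs)
    return scores

# Toy usage with a hypothetical linear stand-in for the trained model.
toy_model = lambda s: 0.02 + 0.5 * s["voc"]
toy_samples = [
    {"enterprise_id": "A", "voc": 0.01},
    {"enterprise_id": "B", "voc": 0.03},
]
toy_scores = enterprise_impact_scores(toy_model, toy_samples)
```

Because the original samples are copied rather than edited in place, the baseline predictions stay valid across all enterprises and each enterprise is evaluated against the same reference.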

Table 7 Impact assessment results.

Conclusion

The SHAP-IPSO-CNN model proposed in this study combines an improved Particle Swarm Optimization algorithm with SHAP analysis to achieve dynamic feature selection, significantly enhancing the model’s predictive performance. The improved PSO algorithm effectively accelerates the convergence of CNN model training. Through in-depth analysis, this study not only deepens the understanding of the mechanisms behind ozone pollution formation but also provides a robust quantitative method to assess the contribution of industrial VOC emissions to ground-level ozone production. The practical case studies presented here successfully evaluate the impact of individual enterprise VOC emissions on ozone concentrations, offering valuable data support and decision-making tools for policymakers and environmental management authorities in the context of chemical park pollution control and regulation.

Overall, this study demonstrates the substantial potential of deep learning techniques in addressing environmental science challenges, particularly in investigating the complex interactions between atmospheric pollutants and environmental factors. Due to limitations in data availability, the current model does not fully account for seasonal variations. Future work will focus on expanding the dataset to better capture the seasonal effects on ozone levels and refining VOC classification to further optimize the model, enabling more accurate and predictive monitoring and analysis of environmental pollutants.

Data availability

The datasets used and analysed during the current study are available from the corresponding author on reasonable request.

References

Nelson, D. et al. A comprehensive approach combining positive matrix factorization modeling, meteorology, and machine learning for source apportionment of surface ozone precursors: Underlying factors contributing to ozone formation in Houston, Texas. Environ. Pollut. 334, 122223 (2023).
Martinez, J. New report: Houston’s air quality gets mixed grades, residents exposed to unhealthy air pollution. American Lung Association. https://www.lung.org/media/press-releases/sota-houston-fy22 (2022).
Cao, X. et al. Characterization, reactivity, source apportionment, and potential source areas of ambient volatile organic compounds in a typical tropical city. J. Environ. Sci. 123, 417–429. https://doi.org/10.1016/j.jes.2022.08.005 (2023).
Hopke, P. K. et al. Global review of recent source apportionments for airborne particulate matter. Sci. Total Environ. 740, 140091. https://doi.org/10.1016/j.scitotenv.2020.140091 (2020).
Ding, D. et al. Impacts of emissions and meteorological changes on China’s ozone pollution in the warm seasons of 2013 and 2017. Front. Environ. Sci. Eng. 13(5), 1–9. https://doi.org/10.1007/s11783-019-1160-1 (2019).
Luecken, D. J. et al. Sensitivity of ambient atmospheric formaldehyde and ozone to precursor species and source types across the United States. Environ. Sci. Technol. 52(8), 4668–4675. https://doi.org/10.1021/acs.est.7b05509 (2018).
Dang, R., Liao, H. & Fu, Y. Quantifying the anthropogenic and meteorological influences on summertime surface ozone in China over 2012–2017.
Wang, H. et al. Seasonality and reduced nitric oxide titration dominated ozone increase during COVID-19 lockdown in eastern China. Npj Climate Atmos. Sci. https://doi.org/10.1038/s41612-022-00249-3 (2022).
Wang, S. et al. A high-performance convolutional neural network for ground-level ozone estimation in eastern China. Remote Sens. 14, 1640. https://doi.org/10.3390/rs14071640 (2022).
Cheng, N. N. Research on ozone emission characteristics, ozone influencing mechanism and pollution source tracing in typical chemical industrial park. PhD thesis, Zhejiang University, Hangzhou, China (2022).
Wu, Q. et al. Bias correction of the secondary inorganic aerosol modeling based on machine learning algorithm. Acta Sci. Circumst. 43, 121–130 (2023).
Ren, X., Mi, Z. & Cai, T., et al. Flexible Bayesian ensemble machine learning framework for predicting local ozone concentrations.
Xing, J. et al. Deep learning for prediction of the air quality response to emission changes. Environ. Sci. Technol. 54, 8589–8600. https://doi.org/10.1021/acs.est.0c02923 (2020).
Gao, M., Yin, L. & Ning, J. Artificial neural network model for ozone concentration estimation and Monte Carlo analysis. Atmos. Environ. 184, 129–139. https://doi.org/10.1016/j.atmosenv.2018.03.027 (2018).
Sun, W. & Li, Z. Hourly PM2.5 concentration forecasting based on mode decomposition-recombination technique and ensemble learning approach in severe haze episodes of China. J. Clean. Product. 263, 121442 (2020).
Sun, W. & Li, Z. Hourly PM2.5 concentration forecasting based on feature extraction and stacking-driven ensemble model for the winter of the Beijing-Tianjin-Hebei area. Atmos. Pollut. Res. 11(6), 110–121 (2020).
Huang, C. J. & Kuo, P. H. A deep CNN-LSTM model for particulate matter (PM2.5) forecasting in smart cities. Sensors 18(7), 2220 (2018).
Xu, H. et al. Machine learning coupled structure mining method visualizes the impact of multiple drivers on ambient ozone. Commun. Earth Environ. 4(1), 265 (2023).
Wang, M. et al. Meteorological and anthropogenic drivers of surface ozone change in the North China Plain in 2015–2021. Sci. Total Environ. 906, 167763 (2024).
Wang, W. et al. Prediction model of spontaneous combustion risk of extraction borehole based on PSO-BPNN and its application. Sci. Rep. 14(1), 5 (2024).
Huang, Y. et al. Air quality prediction using improved PSO-BP neural network. IEEE Access 8, 99346–99353 (2020).
Ordóñez-De León, B. et al. An improved particle swarm optimization (PSO): Method to enhance modeling of airborne particulate matter (PM10). Evolving Syst. 11, 615–624 (2020).
You, X. et al. A PSO-CNN-based deep learning model for predicting forest fire risk on a national scale. Forests 15(1), 86 (2023).
Murillo-Escobar, J. et al. Forecasting concentrations of air pollutants using support vector regression improved with particle swarm optimization: Case study in Aburrá Valley, Colombia. Urban Climate 29, 100473 (2019).
Cen, H. et al. A method to predict CO₂ mass concentration in sheep barns based on the RF-PSO-LSTM model. Animals 13(8), 1322 (2023).
Lotrecchiano, N. et al. Pollution dispersion from a fire using a Gaussian plume model. Int. J. Saf. Secur. Eng 10, 431–439 (2020).
Boateng, E. Y., Joseph, O. & Daniel, A. A. Basic tenets of classification algorithms K-nearest-neighbor, support vector machine, random forest and neural network: A review. J. Data Anal. Inf. Process. 8(4), 341–357 (2020).
Li, J. et al. Application of XGBoost algorithm in the optimization of pollutant concentration. Atmos. Res. 276, 106238 (2022).
Maronna, R. A. Robust ridge regression for high-dimensional data. Technometrics 53(1), 44–53 (2011).
Lundberg, S. M. & Lee, S.-I. A unified approach to interpreting model predictions. Adv. Neural Inf. Process. Syst. 30, 4765–4774 (2017).
Eberhart, R. & Kennedy, J. Particle swarm optimization. In Proceedings of the IEEE international conference on neural networks. Vol. 4. (1995).
Yan, H. et al. Identification of response regulation governing ozone formation based on influential factors using a random forest approach. Available at SSRN 4728997.

Acknowledgements

This research was funded by the “Xing Shen Talent Plan” high-level technological innovation talent project (2023-28) and the Shenyang Young and Middle-aged Science and Technology Innovation Talent Support Plan (RC230230).

Author information

Authors and Affiliations

Shenyang Institute of Computing Technology, Chinese Academy of Sciences, Shenyang, 110168, Liaoning, China
Xiaolei Zhou, Xingyue Wang & Ruifeng Guo
University of Chinese Academy of Sciences, Beijing, 100049, China
Xiaolei Zhou, Xingyue Wang & Ruifeng Guo

Contributions

Z.X.L. and W.X.Y. processed the data, built the model, performed the tests, and wrote the manuscript. G.R.F. checked the manuscript. All authors approved the manuscript.

Corresponding author

Correspondence to Xingyue Wang.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Zhou, X., Wang, X. & Guo, R. Assessment model of ozone pollution based on SHAP-IPSO-CNN and its application. Sci Rep 15, 3404 (2025). https://doi.org/10.1038/s41598-025-87702-4

Received: 22 April 2024
Accepted: 21 January 2025
Published: 27 January 2025
Version of record: 27 January 2025
DOI: https://doi.org/10.1038/s41598-025-87702-4
