Introduction

Ground-level ozone (O3) pollution has become a major environmental problem globally, and severe ozone pollution has significant impacts on human health and socio-economic development3. In China, especially in some cities, ozone has joined PM2.5 as a major pollutant hindering the improvement of urban air quality1,2. Although local governments and scientific research institutions have taken various measures to control and manage ozone pollution, effective control of ground-level ozone remains a challenge because of the complexity of its formation mechanism. Ozone formation is affected by the interaction of various factors, including precursors such as volatile organic compounds (VOCs) and nitrogen oxides (NOx), atmospheric chemical processes, and meteorological factors4,5,6. Therefore, it is important to make full use of observational data to analyze the causes of ozone pollution and the relationships among its multiple drivers in order to support air pollution prevention and control.

Several studies have applied air quality models to assess the independent contributions of different factors to ozone formation7,8,9. However, these models are limited by high computational costs, high sensitivity to input data errors, and difficulties in accurately reconstructing processes under complex meteorological and chemical conditions. Rapid advances in data science and machine learning (ML) have provided new solutions in the field of air pollution control10. Unlike traditional air quality models, ML methods are data-driven: they learn from large-scale datasets without directly simulating the complex physical and chemical processes in the atmosphere. These methods can effectively capture the complex nonlinear relationships between ozone and its drivers, reducing the computational burden while improving robustness to data noise and input perturbations11. In recent years, various ML methods have been applied extensively to predict air pollutant concentrations and analyze the causes of air pollution12,13,14,15,16,17. Some studies have also used ML techniques to quantify the relative contributions of input variables to the prediction target. For example, Dang et al. applied a multiple linear regression (MLR) model to quantify the contributions of important meteorological variables to ground-level ozone concentrations, efficiently resolving the effects of different factors on ozone formation7. Xu et al. combined an optimized random forest (RF) model with structure mining techniques to reveal the combined effects of multiple drivers on ozone formation18. Wang et al. constructed a hybrid deep learning model based on observational data and combined it with the GEOS-Chem atmospheric chemical transport model to comprehensively analyze changes in summer surface ozone concentration in the North China Plain from 2015 to 2021, covering both meteorological and anthropogenic drivers19.

The aforementioned studies effectively utilized machine learning techniques to analyze the formation process of atmospheric ozone and its relationship with various meteorological and anthropogenic factors. This study focuses on how to adjust and optimize the parameters of the deep learning model through advanced optimization algorithms to further enhance the accuracy and performance of the model in analyzing atmospheric ozone formation.

The effectiveness of particle swarm optimization (PSO) algorithms has been demonstrated in several fields. For example, Wang et al. combined the PSO algorithm with a BP neural network to create a prediction model for spontaneous combustion risk in gas extraction boreholes (PSO-BPNN), which predicted the risk of spontaneous combustion in mines with high accuracy and stability20. Similarly, Huang et al. proposed a new air quality index (AQI) prediction method by optimizing a BP neural network with an improved PSO algorithm, further confirming the ability of PSO to optimize algorithm parameters21. Ordóñez-De León et al. used PSO to improve the Adaptive Neuro-Fuzzy Inference System (ANFIS) and enhance the performance of a PM10 concentration model22. You et al. proposed a deep learning model for predicting nationwide forest fire risk by optimizing convolutional neural networks with PSO, showing the potential of PSO for the design of deep learning architectures23. Murillo-Escobar et al. used a PSO-optimized support vector regression (SVR) model to predict air pollutant concentrations ahead of time, showing the utility of PSO in improving the performance of prediction models24. Cen et al. proposed a CO2 mass concentration prediction method for sheep barn environments based on the RF-PSO-LSTM model, exemplifying the utility of PSO in parameter optimization of composite models25. These studies demonstrate the wide range of applications of PSO algorithms in environmental science, air pollution control and other fields, and indicate the importance of PSO-based model optimization and parameter tuning for improving model accuracy and stability.

We propose SHAP-IPSO-CNN, an integrated model combining a convolutional neural network (CNN), an improved particle swarm optimization (IPSO) algorithm and SHAP analysis. First, based on the ozone generation mechanism, we use an atmospheric dispersion model to predict the concentrations of VOCs emitted by enterprises in the chemical industry park at the target monitoring stations. These concentrations, together with the concentrations of other pollutants and the meteorological factors measured at the monitoring stations, constitute the model input features. The improved PSO algorithm is then combined with the feature importance from SHAP analysis to dynamically adjust the training features and optimize the CNN model. The improved algorithm dynamically adjusts the inertia weights and balances global and local search by comparing each particle's fitness with the population average fitness and the historical best fitness. In addition, an adaptive learning rate mechanism based on particle performance is introduced to further improve the flexibility and efficiency of the particle search and enhance the overall optimization performance.

The model combines the efficient feature extraction capability of CNN, the global search and optimization properties of IPSO and the interpretability of SHAP analysis to improve the accuracy and interpretability of ozone pollution assessment under complex conditions. This study contributes to sustainable development by optimizing a deep learning model to improve the accuracy of ozone formation analysis and is important for effective environmental management and policy formulation.

Methods

Atmospheric dispersion model

Ozone formation is driven by a complex series of photochemical reactions triggered by volatile organic compounds and nitrogen oxides when exposed to sunlight. These reactions are particularly active in sunny environments and contribute significantly to the increase in ozone concentrations near the surface, a phenomenon commonly described as photochemical smog formation. Therefore, accurate prediction of the concentration distribution of VOCs emitted by firms as they propagate through the atmosphere to reach the monitoring station is essential for a deeper understanding of their impact on ozone concentrations in the vicinity of the monitoring station. To this end, a Gaussian plume model was used to estimate the concentration distribution of VOCs emitted by different enterprises at the target monitoring station26.

The inputs to the calculation are as follows: pollutant emission concentration \(Q\), wind speed at the emission point \(U\), emission height \(H\), distance from the source \(x\), ambient temperature \(T\), relative humidity \(RH\), atmospheric pressure \(P\), and the diffusion coefficients \(\sigma _{y}\) and \(\sigma _{z}\) for the lateral and vertical directions, respectively. To accommodate changes in temperature, humidity, and pressure, we define an environmental adjustment factor, calculated by Eq. (1). This factor modifies the diffusion coefficients to account for the effects of these environmental variables on pollutant dispersion:

$$Adjustment\_factor = 1 + 0.01*(T - 20) + 0.005*(RH - 50) + 0.0001*P$$
(1)

Here, \(T\) and \(RH\) enter as deviations from baseline conditions (a temperature of 20 °C and a relative humidity of 50%), whereas \(P\) enters directly because no baseline pressure is specified.

After the diffusion coefficients are adjusted, the pollutant concentration \(C(x)\) at a distance \(x\) from the source is given by the Gaussian plume equation, Eq. (2):

$$C(x) = \frac{Q}{{2\pi U\sigma _{y} \sigma _{z} }}\exp \left( { - \frac{{x^{2} }}{{2\sigma _{y}^{2} }}} \right)\left( {\exp \left( { - \frac{{(H)^{2} }}{{2\sigma _{z}^{2} }}} \right) + \exp \left( { - \frac{{( - H)^{2} }}{{2\sigma _{z}^{2} }}} \right)} \right)$$
(2)

The output of the Gaussian plume model will be used as input features to the model for fitting the contribution of ozone formation. This integration forms an important part of our study.
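As an illustration, Eqs. (1) and (2) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the implementation used in the study: the text does not say how the adjustment factor is applied, so the code assumes that it multiplies both diffusion coefficients, and the function and argument names are hypothetical. Units are left to the caller because the text does not specify them.

```python
import math


def adjustment_factor(temp_c, rel_humidity, pressure):
    """Environmental adjustment factor, Eq. (1)."""
    return 1 + 0.01 * (temp_c - 20) + 0.005 * (rel_humidity - 50) + 0.0001 * pressure


def plume_concentration(q, u, h, x, sigma_y, sigma_z, temp_c, rel_humidity, pressure):
    """Concentration C(x) from Eq. (2), using adjusted diffusion coefficients."""
    # Assumption: the factor scales both coefficients multiplicatively.
    factor = adjustment_factor(temp_c, rel_humidity, pressure)
    sy = sigma_y * factor
    sz = sigma_z * factor
    # Exponential term in x, exactly as written in Eq. (2).
    lateral = math.exp(-x**2 / (2 * sy**2))
    # The two height terms of Eq. (2), for H and -H.
    vertical = math.exp(-h**2 / (2 * sz**2)) + math.exp(-(-h) ** 2 / (2 * sz**2))
    return q / (2 * math.pi * u * sy * sz) * lateral * vertical
```

At the baseline temperature and humidity with \(P = 0\), the factor equals 1 and the expression reduces to the unadjusted plume equation.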

Machine learning models

This study considers three machine learning models: random forest (RF), XGBoost, and ridge regression. RF improves overall model accuracy and robustness by combining the predictions of many decision trees27. XGBoost, an optimized implementation of gradient boosting decision trees, is widely used for its computational efficiency and accurate predictions28. Ridge regression is chosen for its ability to handle multicollinearity and prevent overfitting, especially with high-dimensional data29. We compare the three models using R², RMSE and MAE as key performance indicators. The performance of the models determines their suitability for the subsequent analysis. The parameter settings of the selected models are detailed in Table 1.

Table 1 Hyperparameter settings for machine learning models.

SHAP algorithm

The SHAP (SHapley Additive exPlanations) algorithm30 is used to analyze feature importance in the best-performing machine learning model and identify the features that have a significant impact on the target variable. SHAP is an explainable machine learning method that quantifies how much each feature contributes to the model's prediction. The indexes of the important features identified by the SHAP analysis are placed into a feature queue as an initial feature selection scheme, and the queue is updated dynamically during the optimization process.

Improved particle swarm optimization algorithm

Particle Swarm Optimization (PSO), introduced by Kennedy and Eberhart31 in 1995, is a nature-inspired optimization algorithm that simulates the social behavior of organisms such as bird flocks or fish schools to solve global optimization problems.

Particle swarm optimization principles and workflow

In PSO, a set of candidate solutions, referred to as “particles,” navigate through the search space to find the optimal solution. The particles communicate with each other and adjust their positions based on individual and group experiences.

Each particle has a position, representing a potential solution, and a velocity, determining the direction and speed of its movement. Particle fitness represents how good a solution a particle’s position is, as evaluated by a fitness function. It quantifies the particle’s performance in the context of the optimization objective.

The movement of particles is influenced by two factors:

  1) The personal best position (\(P_{{best}}\)), representing the best solution a particle has achieved so far.

  2) The global best position (\(G_{{best}}\)), which is the best solution found by any particle in the swarm.

The workflow of the PSO is as follows:

  a. Initialization: The algorithm begins by initializing a set of particles with random positions and velocities. Each particle represents a potential solution to the problem.

  b. Fitness evaluation: Each particle’s fitness is evaluated based on a defined fitness function. This function measures the quality of the particle’s current position.

  c. Velocity update: Particles update their velocities based on \(P_{{best}}\) and \(G_{{best}}\). The velocity update rule for each particle is described in Eq. (3):

      $$V_{i}^{{(t + 1)}} = \omega \cdot V_{i}^{{(t)}} + c_{1} \cdot r_{1} \cdot (P_{{best,i}} - X_{i}^{{(t)}} ) + c_{2} \cdot r_{2} \cdot (G_{{best}} - X_{i}^{{(t)}} )$$
      (3)

      where:

      • \(V_{i}^{{(t)}}\) denotes the velocity of the \(i\)th particle at the \(t\)th iteration,

      • \(\omega\) denotes the inertia weight coefficient, balancing exploration and exploitation,

      • \(c_{1}\) and \(c_{2}\) denote the acceleration coefficients,

      • \(r_{1}\) and \(r_{2}\) are random numbers between 0 and 1,

      • \(P_{{best,i}}\) is the personal best position of the \(i\)th particle,

      • \(G_{{best}}\) is the global best position of all particles,

      • \(X_{i}^{{(t)}}\) denotes the position of the \(i\)th particle at the \(t\)th iteration.

  d. Position update: After updating velocities, particles adjust their positions according to Eq. (4):

      $$X_{i}^{{(t + 1)}} = X_{i}^{{(t)}} + V_{i}^{{(t + 1)}}$$
      (4)

      This step moves each particle towards better solutions based on both personal and global information.

  e. Termination: The algorithm iterates until a termination criterion is met, such as reaching a maximum number of iterations or achieving a desired accuracy. Over time, particles converge towards the global best solution.
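The workflow above can be sketched as a short Python routine. This is a minimal illustration of the basic PSO in Eqs. (3) and (4), not the code used in the study: the function name `pso`, the default coefficients, the search bounds, and the sphere test function are all assumptions chosen for the example. Fitness is minimized here, and \(r_{1}\) and \(r_{2}\) are drawn separately for each dimension, which is one common implementation choice.

```python
import random


def pso(fitness, dim, n_particles=20, n_iter=100, w=0.7, c1=1.5, c2=1.5,
        bounds=(-5.0, 5.0), seed=0):
    """Minimize `fitness` with the basic PSO updates of Eqs. (3) and (4)."""
    rng = random.Random(seed)
    lo, hi = bounds
    # a. Initialization: random positions and velocities.
    x = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    v = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n_particles)]
    # b. Fitness evaluation of the initial swarm.
    p_best = [pos[:] for pos in x]
    p_best_fit = [fitness(pos) for pos in x]
    g_idx = min(range(n_particles), key=lambda i: p_best_fit[i])
    g_best, g_best_fit = p_best[g_idx][:], p_best_fit[g_idx]
    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # c. Velocity update, Eq. (3).
                v[i][d] = (w * v[i][d]
                           + c1 * r1 * (p_best[i][d] - x[i][d])
                           + c2 * r2 * (g_best[d] - x[i][d]))
                # d. Position update, Eq. (4).
                x[i][d] += v[i][d]
            fit = fitness(x[i])
            if fit < p_best_fit[i]:
                p_best[i], p_best_fit[i] = x[i][:], fit
                if fit < g_best_fit:
                    g_best, g_best_fit = x[i][:], fit
    # e. Termination: stop after a fixed number of iterations.
    return g_best, g_best_fit


def sphere(pos):
    """Toy objective with its minimum at the origin."""
    return sum(p * p for p in pos)


best_pos, best_fit = pso(sphere, dim=2)
```

In the setting of this paper, the fitness function would instead train and score a CNN for each particle, which is far more expensive than the toy objective used here.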

The improved particle swarm optimization

The improved PSO algorithm (IPSO) in this study introduces a dynamic inertia weight adjustment mechanism, so that the inertia weight changes according to the relative performance of each particle instead of being fixed. The mechanism compares each particle's fitness with the average fitness of all particles and the best fitness in order to balance global and local search. If a particle’s fitness is higher than the average, its inertia weight is kept at its maximum value. If the particle’s fitness is lower than the average, the inertia weight is adjusted according to Eq. (5):

$$\omega = \omega _{{\max }} - mul \cdot (f_{{avg}} - f_{i} ) \cdot (\omega _{{\max }} - \omega _{{\min }} ) \cdot (f_{{avg}} - f_{{\min }} )$$
(5)

Here, \(\omega\) is the inertia weight in the current iteration, \(\omega _{{\max }}\) and \(\omega _{{\min }}\) are the maximum and minimum values of the inertia weight, and \(mul\) is an adjustable parameter that controls the rate of change of the inertia weight. \(f_{i}\) is the fitness of the current particle, \(f_{{avg}}\) is the average fitness of all particles, and \(f_{{\min }}\) is the minimum fitness observed in the population. This adaptive strategy allows particles to explore more extensively when they perform better than average.
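A minimal sketch of the inertia weight rule around Eq. (5) is given below. The function name and the default values of `w_max`, `w_min` and `mul` are hypothetical and chosen only for illustration. The code applies Eq. (5) exactly as printed and does not clip the result to \([\omega _{{\min }}, \omega _{{\max }}]\), since the text does not mention clipping.

```python
def inertia_weight(f_i, f_avg, f_min, w_max=0.9, w_min=0.4, mul=1.0):
    """Dynamic inertia weight for one particle, following Eq. (5)."""
    if f_i >= f_avg:
        # Fitness at or above the population average: keep the maximum weight.
        return w_max
    # Fitness below the average: apply Eq. (5) as written.
    return w_max - mul * (f_avg - f_i) * (w_max - w_min) * (f_avg - f_min)
```

For example, with these illustrative defaults, `inertia_weight(0.3, 0.5, 0.1)` evaluates \(0.9 - 1 \cdot 0.2 \cdot 0.5 \cdot 0.4 = 0.86\).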

Similarly, we employ an adaptive learning rate based on particle performance, which further improves the flexibility and efficiency of the particle search. When updating positions, the algorithm considers both the historical best value of each individual particle and the historical best value of the population. The learning rate is calculated according to Eq. (6):

$$c = c_{{\max }} - \left(\frac{{c_{{\max }} - c_{{\min }} }}{{f_{{\max }} - f_{{\min }} }}\right) \cdot (f_{{\max }} - f_{i} )$$
(6)

Here, \(c\) is the individual or global learning rate, \(c_{{\max }}\) and \(c_{{\min }}\) are the maximum and minimum values of the learning rate, \(f_{i}\) is the fitness of the current particle, and \(f_{{\max}}\) and \(f_{{\min }}\) are the maximum and minimum fitness values in the population. The adaptive learning rate ensures that particles close to the global optimum use a more conservative learning rate to fine-tune their positions, while particles far from the global optimum use a more aggressive learning rate to search more broadly.
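Eq. (6) is a linear interpolation between \(c_{{\min }}\) and \(c_{{\max }}\), which the following sketch makes explicit. The function name and the default bounds are hypothetical. The handling of a degenerate population, in which all particles share the same fitness, is an assumption added to avoid division by zero and is not described in the text.

```python
def adaptive_learning_rate(f_i, f_max, f_min, c_max=2.0, c_min=0.5):
    """Adaptive learning rate for one particle, following Eq. (6)."""
    if f_max == f_min:
        # Assumption: fall back to c_max when all fitness values coincide.
        return c_max
    slope = (c_max - c_min) / (f_max - f_min)
    return c_max - slope * (f_max - f_i)
```

With this form, a particle at \(f_{{\max }}\) receives \(c_{{\max }}\), a particle at \(f_{{\min }}\) receives \(c_{{\min }}\), and intermediate fitness values are mapped linearly between the two.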

Convolutional neural network

We employ a deep convolutional neural network (CNN). The model gradually extracts deep features of the data by sequentially stacking multiple convolutional, pooling and fully connected layers. Each convolutional layer uses a ReLU activation function to introduce nonlinearity, and a batch normalization layer is used to speed up training and improve the generalization ability of the model. A dropout layer mitigates overfitting by randomly dropping a portion of the network connections, increasing the robustness of the model. After a max pooling layer reduces the dimensionality of the data, a flattening operation converts the multidimensional data into a one-dimensional vector so that it can be passed to the fully connected layer that follows. Finally, the output layer consists of a single neuron that predicts the Box-Cox-transformed ozone concentration. The CNN model structure is shown in Fig. 1. The key hyperparameters of the CNN model, including the number of filters, the number of training iterations (epochs) and the dropout probability, are found through IPSO.

Fig. 1 Schematic diagram of CNN model structure. The basic elements are derived from the open-source project “ML Visuals” (https://github.com/dair-ai/ml-visuals).

Figure 2 illustrates the process of optimizing a CNN using IPSO. The optimization begins with partitioning the dataset and initializing key components, including training feature subsets and IPSO parameters. Each particle in the swarm represents a candidate solution for the CNN’s hyperparameters.

During each iteration, a CNN model is constructed based on the current position of each particle, which reflects its hyperparameter values. The CNN is then trained using the training dataset, and its performance is evaluated using a predefined fitness function. The inertia weights and learning factors are calculated, and these values guide the particle’s velocity and position updates. As particles adjust their velocities and positions, both Pbest and Gbest solutions are updated accordingly.

This iterative process continues until either the maximum number of iterations is reached or the fitness function converges. Finally, the optimized CNN model is constructed using the Gbest solution, and its performance is evaluated on the test dataset to ensure the model’s effectiveness.
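The text does not specify how a particle position encodes the CNN hyperparameters. One possible encoding, shown purely as an assumption, maps each coordinate of a position in \([0, 1]\) linearly onto a hyperparameter range. The search ranges and the names `decode_particle` and `SEARCH_SPACE` below are hypothetical and are not taken from the paper.

```python
def decode_particle(position, search_space):
    """Map a particle position in [0, 1]^d to a dict of CNN hyperparameters."""
    params = {}
    for value, (name, (low, high, kind)) in zip(position, search_space.items()):
        # Keep each coordinate inside the unit interval before scaling.
        clipped = min(max(value, 0.0), 1.0)
        scaled = low + clipped * (high - low)
        params[name] = int(round(scaled)) if kind == "int" else scaled
    return params


# Hypothetical search ranges: (lower bound, upper bound, type).
SEARCH_SPACE = {
    "filters": (8, 64, "int"),
    "epochs": (10, 100, "int"),
    "dropout": (0.0, 0.5, "float"),
    "batch_size": (16, 128, "int"),
}

params = decode_particle([0.5, 0.0, 1.0, 2.0], SEARCH_SPACE)
```

The fitness of a particle would then be obtained by building and training a CNN with the decoded values and scoring it with the chosen fitness function.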

Fig. 2 IPSO optimized CNN algorithm flow.

Indicators for model assessment

In this paper, three evaluation metrics were selected: root mean square error (RMSE), mean absolute error (MAE), and coefficient of determination (R2). MAE and RMSE measure prediction error, and R2 measures goodness of fit. The formulas are given in Eqs. (7) to (9):

$$R^{2} = 1 - \frac{{\sum\nolimits_{{i = 1}}^{n} {(y_{i} - \hat{y}_{i} )^{2} } }}{{\sum\nolimits_{{i = 1}}^{n} {(y_{i} - \bar{y})^{2} } }}$$
(7)
$$MAE = \frac{1}{n}\sum\limits_{{i = 1}}^{n} {\left| {y_{i} - \hat{y}_{i} } \right|}$$
(8)
$$RMSE = \sqrt {\frac{1}{n}\sum\limits_{{i = 1}}^{n} {(y_{i} - \hat{y}_{i})^{2} } }$$
(9)

In Eqs. (7) to (9), \(y_{i}\) is the actual value, \(\hat{y}_{i}\) is the predicted value, \(\bar{y}\) is the mean of the actual values in the test set, and \(n\) is the sample size of the test set.
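The three metrics can be computed directly from their definitions. The sketch below uses only the Python standard library, and the function name is illustrative.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (R2, MAE, RMSE) as defined in Eqs. (7) to (9)."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    # Residual and total sums of squares for R2.
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_true) ** 2 for y in y_true)
    r2 = 1 - ss_res / ss_tot
    mae = sum(abs(y - p) for y, p in zip(y_true, y_pred)) / n
    rmse = math.sqrt(ss_res / n)
    return r2, mae, rmse
```

For the toy inputs `y_true = [1, 2, 3, 4]` and `y_pred = [1, 2, 3, 6]`, the residual sum of squares is 4 and the total sum of squares is 5, giving \(R^{2} = 0.2\), \(MAE = 0.5\) and \(RMSE = 1\).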

Experimental

In this section, we describe and analyze the experimental data, process, and results. The experiments were run on a computer with the Windows 11 operating system, an Intel(R) Core(TM) i5-8265U CPU, and 8 GB of RAM. The experimental program is implemented in Python on the TensorFlow framework, using Jupyter Notebook as the programming tool.

Dataset and pre-processing

In this study, we analyzed data from an environmental monitoring project carried out between 2020 and 2023, from May to October, in a chemical park in a city. The project set up 59 monitoring points in the park to track VOCs and environmental parameters such as temperature, humidity, wind speed and wind direction in real time. Air pollutants such as O3, NO2, SO2, CO, PM2.5 and PM10 were monitored simultaneously at fixed monitoring stations in the park. The data collection interval was 1 h. Data preprocessing included outlier detection, missing value imputation and standardization. First, outliers were removed using Tukey’s method based on the interquartile range (IQR). Specifically, the first (Q1) and third (Q3) quartiles of each feature were used to establish outlier thresholds, and any observation falling below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR was considered an outlier and excluded from the dataset. Missing data were filled using a nearest-neighbor imputation strategy to maintain the continuity and structural integrity of the data; this method preserves the short-term stability of atmospheric pollutant concentrations. To address the skewed distribution of the target variable, a Box-Cox transformation was applied so that it approximates a normal distribution. All numerical features were then standardized by subtracting the mean and dividing by the standard deviation, so that each feature has a mean of zero and a standard deviation of one. The experimental dataset combines the VOC concentrations at the target monitoring station, as calculated by the atmospheric dispersion model, with the NO2, SO2, CO, PM10 and PM2.5 concentrations and the meteorological conditions (T, RH, P, WS, WD) collected at that station; it comprises a total of 11,947 samples. Detailed information about the variables after preprocessing is shown in Table 2, and their distributions are shown in Fig. 3. The distributions are approximately normal, with kurtosis and skewness values in the range of −1 to +1 and −2 to +2, respectively. Finally, to evaluate the generalization ability of the model, we divided the dataset into a 70% training set and a 30% test set.
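Two of the preprocessing steps, Tukey's IQR rule and standardization, can be sketched for a single feature column as follows. This is an illustration rather than the pipeline used in the study: the quartile interpolation method and the use of the population standard deviation are assumptions, since the text does not specify either, and imputation and the Box-Cox transformation are omitted. The function names are hypothetical.

```python
import statistics


def tukey_bounds(values, k=1.5):
    """Lower and upper outlier thresholds: Q1 - k*IQR and Q3 + k*IQR."""
    # Assumption: quartiles use the "inclusive" interpolation method.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_outliers(values, k=1.5):
    """Drop observations outside the Tukey thresholds."""
    low, high = tukey_bounds(values, k)
    return [v for v in values if low <= v <= high]


def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


cleaned = remove_outliers([1, 2, 3, 4, 5, 100])
scaled = standardize(cleaned)
```

In this toy column, Q1 = 2.25 and Q3 = 4.75, so the thresholds are −1.5 and 8.5 and the value 100 is excluded. In a multi-feature dataset, the whole observation (row) would be dropped when any of its features falls outside the thresholds.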

Table 2 Statistical results of the dataset.
Fig. 3 Distribution of relevant data.

Figure 4 shows the heat map of the correlations between meteorological variables and atmospheric pollutant concentrations. Temperature (T), relative humidity (RH) and wind speed (WS) are all significantly correlated with ozone concentration, highlighting the key role of these meteorological factors in regional ozone formation. Particularly noteworthy is the highly significant positive correlation between temperature and ozone concentration, which reflects the enhanced photochemical activity in high-temperature environments and may lead to increased ozone production. The heat map also shows a strong positive correlation (0.64) between VOCs and ozone concentration, suggesting that VOCs play a key driving role in ozone production. NO2, in contrast, is negatively correlated with ozone concentration (−0.49), in agreement with the findings of Huang, Yan, et al.32: although NO2 can generate ozone, excessive NOx titrates ozone and lowers its concentration, reflecting the complex role of NOx in the ozone formation mechanism. Although the direct correlation between particulate matter (PM10 and PM2.5) and ozone is not significant, particulate matter may indirectly modulate the ozone distribution by influencing photochemical reactions or interacting with other meteorological variables. Studies have shown that secondary organic carbon (SOC) in PM2.5 competes with ozone for the same precursors, while high PSA levels promote heterogeneous reactions that produce NO3 on particle surfaces and compete for the NOx required for ozone formation, which in turn may affect ozone production. Therefore, the potential impact of particulate matter on regional ozone pollution cannot be ignored.

Fig. 4 Data correlation analysis.

Results of SHAP analysis

After data preprocessing, we trained the three machine learning models, all of which performed well. Comparing evaluation metrics such as R2 and RMSE, we found that the random forest model performed slightly better than XGBoost and ridge regression on the dataset of this study. The comparison is shown in Table 3: the R² of the random forest model reaches 0.9023, compared with 0.8731 for XGBoost and 0.8594 for ridge regression. Based on this result, and to better understand the model's predictions and their driving factors, we applied SHAP analysis to assess feature importance in the random forest model and visualized the key features and the strength of their influence on the predictions with a SHAP summary plot.

Table 3 Machine learning performance.

Figure 5 presents the SHAP summary plot, while Table 4 provides the corresponding feature importance rankings. Both indicate that WS, T, and RH are the primary meteorological factors influencing ozone concentration. Additionally, NO2 and VOCs are shown to have a significant impact, underscoring their role as key precursors in ozone formation. In terms of SHAP values, WS, WD, T, NO2 and VOCs exhibit positive SHAP values, meaning that higher values of these features are associated with increased ozone concentrations. In contrast, RH has negative SHAP values, indicating that higher humidity tends to reduce ozone levels. PM2.5, PM10, and P have relatively lower SHAP values, suggesting their limited influence on ozone concentration predictions in this context.

Subsequently, a feature queue ordered by feature importance is established to support dynamic adjustment of the training features; in each training cycle, the top seven features in the queue are used for training. During training, the update interval of the feature queue is adjusted dynamically according to the change in the loss value. If the change in the loss exceeds a threshold, the update interval is shortened and the feature queue is updated (re-ordered). Otherwise, the update interval is extended and the feature queue is left unchanged. The threshold is set to 0.01 and the initial update interval to 5 (i.e., every 5 iterations); the interval is shortened or extended in steps of 1 and cannot fall below 1.
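The interval adjustment rule can be sketched as follows. This is an illustrative reading of the description, not the authors' code: the function names are hypothetical, the loss change is taken as the absolute difference between two consecutive checks, and the re-ordering of the queue itself is not shown because the text does not describe how it is computed.

```python
def adjust_update_interval(interval, prev_loss, loss,
                           threshold=0.01, step=1, min_interval=1):
    """Return (new_interval, reorder_queue) for one check of the loss change."""
    if abs(prev_loss - loss) > threshold:
        # Large change: shorten the interval and trigger a queue re-ordering.
        return max(min_interval, interval - step), True
    # Small change: extend the interval and leave the queue untouched.
    return interval + step, False


def select_features(feature_queue, k=7):
    """Pick the top-k features from the importance-ordered queue."""
    return feature_queue[:k]
```

Starting from the initial interval of 5, a loss change of 0.05 shortens the interval to 4 and triggers a re-ordering, whereas a change of 0.005 extends it to 6 and leaves the queue as it is.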

Table 4 Results of feature importance analysis.
Fig. 5 SHAP summary.

Model results

The initial parameter settings of the IPSO algorithm are shown in Table 5. The inertia weight w and the learning factors c1 and c2 are initialized to 0.5, 0.4 and 0.6, respectively. These initial settings influence the optimization process, but the key feature of our approach is that the parameters are not fixed; instead, they are adjusted dynamically throughout the optimization according to Eqs. (5) and (6). This allows the algorithm to adapt to the changing landscape of the search space and improves its exploration and exploitation capabilities. In addition, r1 and r2 are random values between 0 and 1, which give the algorithm its stochastic behavior and help it escape local optima. The neighborhood size is set to 4 and the Minkowski distance parameter is set to 2, facilitating local interactions among particles. The population consists of 50 particles, and the algorithm is run for 50 iterations, allowing sufficient exploration of the search space. The multiplication factor (mul), which controls the rate of change of the inertia weight in Eq. (5), is set to 100.

The CNN model was trained using the Adam optimizer, with mean squared error as the loss function to quantify the difference between predicted and observed values. Early stopping was used to prevent overfitting, with a minimum change threshold of 1e-4 and a patience of 30. The CNN hyperparameters obtained by the IPSO algorithm are Filters = 32, Epochs = 50, Dropout = 0.2 and Batch size = 64. The CNN model was then trained with these hyperparameters; the variation of the loss values during training is shown in Fig. 6.

Table 5 IPSO algorithm parameter initialization.
Fig. 6 Variation of loss values during training.

Figure 6 shows the decreasing trend of the training and validation losses with the number of epochs. Although the initial loss is high, it decreases rapidly, indicating that the model learns effectively from the data. The validation loss then gradually stabilizes as training proceeds, and after more than 20 epochs the training loss settles between 0.0009 and 0.0010, indicating that the model achieves good generalization performance.

Fig. 7 Prediction results curve of SHAP-IPSO-CNN model.

Once training is complete, we obtain the final SHAP-IPSO-CNN model. Fig. 7 shows how well the model fits the ozone mass concentration in the park. The overall trend of the estimated ozone mass concentration is similar to that of the observed concentration, and the model captures the main trend of the data, but some peaks are still biased.

Model comparison

In order to verify the performance difference between our proposed SHAP-IPSO-CNN model and other models, we conducted comparison experiments including RF, RF-PSO-LSTM25, SHAP-PSO-CNN and IPSO-CNN. In the experiments of RF-PSO-LSTM model, the original paper used RF to analyze the feature importance and select the training features, while this study performs SHAP feature importance analysis based on RF. Therefore, in the RF-PSO-LSTM experiments, we directly used the training features filtered based on the results of the SHAP analysis and ensured that the other hyperparameters were consistent with the original paper settings. All models were tested in the same experimental environment and performance comparisons were made using uniform evaluation criteria. The results are shown in Table 6.

Fig. 8 Comparison of changes in loss values during training of each model.

The decreasing trend of the loss of SHAP-PSO-CNN, IPSO-CNN and SHAP-IPSO-CNN models during the training process is presented in Fig. 8. As can be seen from the figure, the SHAP-IPSO-CNN model has a high loss value in the initial stage, but then it decreases rapidly and gradually stabilizes. The SHAP-IPSO-CNN model has the lowest loss value throughout the training process, showing the best convergence behavior and stability among the three. This result indicates that although all models successfully learn the patterns in the dataset, the SHAP-IPSO-CNN model exhibits superior performance during the training process. In addition, the decreasing trend of loss for SHAP-IPSO-CNN and IPSO-CNN validates the effectiveness of the improved particle swarm optimization algorithm in optimizing the CNN parameters.

Table 6 Model assessment indicator data.
Fig. 9 Comparison of model performance.

The comparison shows that the SHAP-IPSO-CNN model, a convolutional neural network that combines IPSO with SHAP analysis, performs better than the other models. Detailed data are shown in Table 6. Compared with the SHAP-PSO-CNN model, the SHAP-IPSO-CNN model improves R2 by 1.66% and reduces RMSE and MAE by 12.5% and 10.3%, respectively, which indicates that the overall accuracy and stability are enhanced. Compared with the IPSO-CNN model, it reduces RMSE and MAE by 11.6% and 9.0%, respectively, and improves R2 by 1.5%, further verifying the positive impact of feature selection on model performance. Compared with the RF-PSO-LSTM model, the SHAP-IPSO-CNN model performs better in all key metrics: R2 improves by 1.74%, while RMSE and MAE decrease by 13.4% and 13.0%, respectively.
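The three evaluation metrics and the percentage comparisons between models can be computed along the following lines. This is a minimal sketch, assuming that the reported percentages are relative changes with respect to the baseline model (the text does not state the formula). The function names and the toy arrays are illustrative and are not taken from Table 6.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between observed and estimated values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error between observed and estimated values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def relative_change(baseline, proposed):
    """Percentage change of `proposed` relative to `baseline` (assumed definition)."""
    return 100.0 * (proposed - baseline) / baseline


# Toy values for illustration only (not measurements from the park).
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 5.0]

print("RMSE:", rmse(y_true, y_pred))
print("MAE:", mae(y_true, y_pred))
print("R2:", r2(y_true, y_pred))
# A negative value means the proposed model lowers the error metric.
print("Relative RMSE change (%):", relative_change(0.8, 0.7))
```

Under this definition, a negative relative change in RMSE or MAE and a positive relative change in R2 both indicate an improvement over the baseline model.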

As shown in Fig. 9, the SHAP-IPSO-CNN model obtained the highest R2 of all the models in the comparative analysis. This result shows that first ranking feature importance with the SHAP algorithm and then training the model on a dynamic feature subset significantly reduces the model error and improves the reliability of the model compared to a model using the full feature set. The lower RMSE and MAE, together with the higher R2, confirm the accuracy and credibility of the method in estimating ozone mass concentration.
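The idea of ranking features by SHAP importance and then training on a subset can be illustrated as follows. This is a minimal sketch, assuming that global importance is measured as the mean absolute SHAP value per feature and that a subset is formed from the top k features. How the paper chooses the subset dynamically is not reproduced here, and the feature names and SHAP values below are hypothetical.

```python
def rank_features_by_shap(shap_values, feature_names):
    """Rank features by mean absolute SHAP value, most important first.

    `shap_values` is a list of rows, one per sample, each holding one
    SHAP value per feature (assumed to be precomputed elsewhere).
    """
    n_samples = len(shap_values)
    importance = {
        name: sum(abs(row[j]) for row in shap_values) / n_samples
        for j, name in enumerate(feature_names)
    }
    return sorted(importance, key=importance.get, reverse=True)


def top_k_subset(ranked_features, k):
    """Return the k most important features as a candidate training subset."""
    return ranked_features[:k]


# Hypothetical feature names and SHAP values, for illustration only.
feature_names = ["NOx", "temperature", "toluene"]
shap_values = [
    [0.5, -2.0, 0.1],
    [-0.5, 1.0, 0.3],
]

ranked = rank_features_by_shap(shap_values, feature_names)
print("Ranking:", ranked)
print("Top-2 subset:", top_k_subset(ranked, 2))
```

In such a scheme, the subset size k could itself be treated as a quantity to tune, so that candidate subsets of different sizes are compared on validation error.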

Overall, the SHAP-IPSO-CNN model improves not only accuracy but also how well the model fits the data. It integrates the advantages of convolutional neural networks for efficient feature extraction, the improved particle swarm optimization algorithm to enhance convergence, and dynamic feature selection based on SHAP analysis results. Combining this dynamic feature selection strategy with optimization by the IPSO algorithm significantly enhances the performance of the CNN model. The effectiveness and adaptability of this integrated strategy were verified, providing solid support for estimating the contribution of VOCs emitted by enterprises in chemical parks to ozone formation.

Impact assessment

Using the trained SHAP-IPSO-CNN model, this study further quantifies the impact of the VOC emissions of each enterprise in the park on surface ozone formation. An experimental design was used in which the samples carrying an enterprise's ID were excluded from the training when that enterprise was evaluated, which is equivalent to “turning off” the emissions from that enterprise in the simulation environment while keeping all other conditions unchanged. The results of this hypothetical simulation were compared with the model predictions for the actual situation to calculate the average impact score of VOC emissions on ozone concentration changes for each enterprise.

The evaluation process first iterates over each enterprise ID in the dataset and modifies the corresponding samples in the test dataset. The modified dataset is then passed through the trained SHAP-IPSO-CNN model to estimate the ozone concentration values under that condition, i.e., the ozone concentration values in the absence of emissions from that enterprise. The average impact score of each enterprise on ozone concentrations is derived from the difference between these output values and the model output values for the unmodified dataset. Using the July 19, 2021 data as an example, the results are shown in Table 7.
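The evaluation loop described above can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in this study: `predict` stands in for the trained SHAP-IPSO-CNN model, "modifying" a sample is taken to mean setting its VOC features to zero, the impact score is averaged over the enterprise's own samples, and the field names `enterprise_id` and `voc` are hypothetical.

```python
from statistics import mean


def enterprise_impact_scores(predict, samples, id_key="enterprise_id", voc_keys=("voc",)):
    """Average change in estimated ozone when one enterprise is "turned off".

    For each enterprise ID, the VOC features of its samples are set to zero
    (an assumed form of modification) and the model output is compared with
    the output for the unmodified samples.
    """
    scores = {}
    for eid in sorted({s[id_key] for s in samples}):
        own_samples = [s for s in samples if s[id_key] == eid]
        differences = []
        for sample in own_samples:
            # Copy the sample and zero its VOC features; other fields are unchanged.
            modified = {**sample, **{k: 0.0 for k in voc_keys}}
            differences.append(predict(sample) - predict(modified))
        scores[eid] = mean(differences)
    return scores


def toy_predict(sample):
    """Hypothetical stand-in for the trained model: ozone rises linearly with VOC."""
    return 40.0 + 2.0 * sample["voc"]


# Toy samples for illustration only (not data from the park).
toy_samples = [
    {"enterprise_id": "A", "voc": 1.0},
    {"enterprise_id": "A", "voc": 3.0},
    {"enterprise_id": "B", "voc": 5.0},
]

toy_scores = enterprise_impact_scores(toy_predict, toy_samples)
print(toy_scores)
```

A positive score means that the estimated ozone concentration is higher with the enterprise's emissions than without them. Other ways of modifying the samples, or of averaging the differences, would give different scores.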

Table 7 Impact assessment results.

Conclusion

The SHAP-IPSO-CNN model proposed in this study combines an improved Particle Swarm Optimization algorithm with SHAP analysis to achieve dynamic feature selection, significantly enhancing the model’s predictive performance. The improved PSO algorithm effectively accelerates the convergence of CNN model training. Through in-depth analysis, this study not only deepens the understanding of the mechanisms behind ozone pollution formation but also provides a robust quantitative method to assess the contribution of industrial VOC emissions to ground-level ozone production. The practical case studies presented here successfully evaluate the impact of individual enterprise VOC emissions on ozone concentrations, offering valuable data support and decision-making tools for policymakers and environmental management authorities in the context of chemical park pollution control and regulation.

Overall, this study demonstrates the substantial potential of deep learning techniques in addressing environmental science challenges, particularly in investigating the complex interactions between atmospheric pollutants and environmental factors. Due to limitations in data availability, the current model does not fully account for seasonal variations. Future work will focus on expanding the dataset to better capture the seasonal effects on ozone levels and refining VOC classification to further optimize the model, enabling more accurate and predictive monitoring and analysis of environmental pollutants.