Introduction

General background of study

Air pollution is a major issue worldwide due to its negative effects on human health, the environment, and the climate1. Among the criteria pollutants used to determine levels of air pollution, six parameters are generally considered most concerning: SO2, NO2, CO, O3, PM2.5, and PM102. According to the World Health Organisation (WHO), nine out of ten individuals breathe air in which these pollutants exceed the WHO guideline limits, and air pollution causes roughly seven million premature deaths globally each year. In addition to impairing visibility, air pollution can alter the balance of solar radiation directly or indirectly3 and may even trigger more severe weather conditions such as drought and flooding4. Meteorological conditions influence the concentration of air pollutants in ambient air and contribute to their transport, emission, chemical formation, and deposition; accounting for these effects is therefore crucial to any action or management initiative to reduce air pollution5.

Variations in meteorological conditions may lead to inaccurate conclusions about the effectiveness of management or intervention measures and can make it difficult to determine trends in different air contaminants accurately6. It is therefore essential to separate weather effects from air quality data trends and to identify precisely which changes in air quality are policy-driven4. Estimating air pollution parameters with theoretical, algorithm-based models is a reliable alternative; regression models, for example, mathematically represent statistical correlations, quantifying the influence of multiple independent variables on a single dependent variable. Big data is gradually affecting every aspect of daily life, and data resources will become increasingly valuable. The use of big data thinking and artificial intelligence (AI) diagnostic tools can benefit environmental governance7. In addition, online sensor networks providing real-time data, combined with citizen participation in environmental governance, offer a novel scientific basis for government decision-making in public ecological tracking and early warning8,9. The number of countries monitoring air quality has increased dramatically in recent years10,11. These developments in monitoring infrastructure can be attributed to recently constructed or expanded government monitoring networks and to the crucial contributions of non-governmental groups and concerned citizens worldwide. Although progress has been achieved, several regions still lack air quality monitoring, leaving a substantial portion of the population without the information needed to manage pollution exposure and make informed health decisions.

Literature review

Deep learning (DL) and machine learning (ML) models have significantly contributed to recent developments in air quality monitoring and forecasting. These models have demonstrated extraordinary potential in predicting air pollution levels and identifying contamination sources12,13.

A co-training framework proposed for real-time air quality monitoring in Beijing and Shanghai is noteworthy. By integrating spatial and temporal classifiers such as artificial neural networks (ANN) and conditional random fields (CRF), this approach outperformed traditional models such as decision trees and linear interpolation. The use of real-time meteorological and traffic flow data proved beneficial in improving accuracy14. The key technological advancement here was the integration of various data sources for real-time monitoring. However, a significant challenge remains in the complex integration of these models, particularly when scaling to larger urban areas.

In addition, wavelet-ANN models were applied to short-term air pollution forecasting in Xi’an and Lanzhou. The wavelet-ANN (WANN) model demonstrated superior performance in predicting air pollution indices (API), providing a higher R-value (0.8906) than traditional ANN models. This improvement was attributed to the WANN’s ability to capture non-linear patterns in pollution data. However, the computational cost and the complexity of processing large datasets remained a challenge15.

The use of mobile air quality monitoring systems has increased, as illustrated by a study in Beijing in which electric vehicles equipped with real-time sensors collected PM2.5 data. These mobile measurements were mapped using decision tree models, which significantly outperformed fixed monitoring stations. The advantage of mobile sensors is their ability to provide high-resolution air quality data, but the approach poses challenges in fleet management and sensor calibration across a wide area16.

In terms of cost-effective solutions, the Deep-MAPS framework utilized mobile and fixed air quality sensors to estimate PM2.5 concentrations, delivering results at a resolution of 1 km × 1 km and 1 h. This model reduced hardware costs by up to 90% compared to conventional fixed sensor methods, providing a more economical way to monitor urban air quality. However, challenges persist in ensuring secure network coverage and in expanding the sensor network for broader geographic coverage17.

The MCST-Tree model proposed for space–time learning of air quality in Chengdu incorporated both mobile and fixed sensor data, achieving high accuracy (R2 = 0.94 for PM2.5) even with sparse data. This model can handle space–time data gaps, but maintaining high accuracy with limited sensor data remains an ongoing challenge18. Also in Chengdu, the Multi-AP learning system was introduced for high-resolution pollutant mapping. This method produced detailed hourly pollution maps, though at some cost to computational efficiency and accuracy. Managing the complexity of real-time predictions, particularly for large-scale urban areas, remains a challenge19.

For long-term forecasting, an ANN-based model for PM2.5 concentrations in Liaocheng demonstrated high accuracy (R = 0.9570) using Bayesian regularization. Although this approach was successful in long-term predictions, mitigating overfitting and guaranteeing consistent performance across a variety of conditions remain challenges20. For forecasting volatility, a hybrid XGBoost-GARCH-MLP model was employed for PM2.5 volatility prediction, providing better long-term prediction accuracy. The hybrid model’s strength was in incorporating volatility into the forecasting process, yet the complexity and high computational requirements of such models limit their practical application21. In addition, AI-based models such as ANN, CNN, and LSTM have been employed for climate and air quality forecasting in cities such as Jinan and Hohhot, where CNN-LSTM models showed superior performance. These multimodal forecasting techniques provide a great opportunity to improve air quality predictions, though challenges persist in model generalization and integration across diverse regions22. ST-Exposure, a promising model, utilizes fixed and mobile sensors to predict PM2.5 exposure on a pixel-wise basis. This model achieved an SMAPE below 15%, indicating its potential in high-resolution exposure predictions. However, sparse sensor deployment and data integration remain obstacles to achieving optimal accuracy23.

While significant efforts have been made in applying ML and DL techniques to air quality forecasting, challenges related to data quality, computational complexity, model integration, and coverage persist. Future advances in sensor technology, data fusion, and model optimization will be crucial for extending these methods to broader geographical applications and improving their practical utility.

Research objectives and novelty

While deep learning models such as ANN, CNN, and LSTM have shown strong performance in predicting air pollution, each comes with its own set of challenges. ANN models are susceptible to overfitting and often struggle to generalize across different geographical areas. CNNs are good at identifying spatial features, but they typically need large datasets and may fail to capture time-based patterns. LSTMs handle temporal data well, but they require significant computational resources and can perform poorly when data is noisy or incomplete. Additionally, many deep learning approaches have trouble integrating diverse data types—like weather, traffic, and sensor inputs—and scaling efficiently in complex urban settings. These limitations point to the importance of enhanced data preprocessing, hybrid model approaches, and transfer learning techniques. This research therefore addresses the forecasting of six fundamental pollutants—Carbon Monoxide (CO), Nitrogen Dioxide (NO2), Ozone (O3), Sulphur Dioxide (SO2), Fine Particulate Matter (PM2.5), and Coarse Particulate Matter (PM10)—using state-of-the-art machine learning approaches, specifically Deep Neural Networks (DNNs) and Convolutional Neural Networks (CNNs). These pollutants are particularly important due to their adverse effects on human health and the environment.

The research uses data from the Beijing Municipal Environmental Monitoring Center (BMEMC) from March 2013 to February 2017, which includes meteorological data and pollutant levels from 10 nationally controlled monitoring sites. This study intends to provide hourly predictions that enable a more accurate assessment of the health-related impacts of air pollution, unlike traditional models, which often offer poorer temporal resolution.

This study presents a unique approach to air pollution prediction using DNNs and CNNs based on multivariate time series analysis. Until now, only a few ML studies have attempted to apply such high temporal resolution to pollutant concentration prediction using feature-encoded DNN and CNN frameworks with both encoded and unencoded features. The contribution lies in capturing intricate spatiotemporal structures that tend to be masked in heavily polluted urban settings, where pollution levels fluctuate dramatically over time. The study utilizes sophisticated data pre-processing, including interpolation of missing values, as well as thorough exploratory analysis using box plots to ensure the accuracy and integrity of the data. This strategy enhances the accuracy of air quality predictions and the model’s scalability for real-time forecasting; likewise, urban planners and public health regulators can monitor hourly pollutant concentrations in their areas to implement effective pollution management methods. This work aims to provide reliable emission prediction models that could guide decisions regarding human health protection and environmental sustainability.

Study area and data

In January 2013, Beijing established 36 air-quality monitoring sites, 35 of which are Beijing Municipal Environmental Monitoring Center (BMEMC) sites, with one at the US Embassy in Beijing24. The current dataset comprises hourly air pollutant data from 10 national air quality monitoring stations: Aotizhongxin, Changping, Dongsi, Guanyuan, Huairou, Nongzhanguan, Shunyi, Tiantan, Wanliu, and Wanshouxigong. These ten stations were chosen because their data are freely accessible. The meteorological data for each air-quality site are matched with the nearest weather station of the China Meteorological Administration (CMA). The data were recorded hourly from 01/03/2013 to 28/02/2017. The datasets include four time attributes (year, month, day, and hour), six principal air pollutants (PM2.5, PM10, SO2, NO2, CO, and O3 concentrations, all in µg/m3), and six relevant meteorological variables (temperature (°C), pressure (hPa), dew point temperature (°C), rainfall (mm), wind direction, and wind speed (m/s)).

Artificial intelligence models

Deep neural networks (DNN)

Artificial neural networks (ANNs) are an effective machine learning technique inspired by the structure of the human brain. Through self-learning, ANNs can identify patterns and hidden correlations in datasets25. Furthermore, a particular type of ANN called a deep neural network (DNN) has numerous layers of connected nodes, which enables it to represent more complex data relationships and perform better than traditional ANNs26.

The DNN model typically has an input layer, three or more hidden layers, and an output layer. The input layer receives the data, which is then transformed by the hidden layers, and the output layer generates a forecast. Figure 1a displays the DNN architecture employed in this investigation. A single neuron receives inputs, multiplies each input by the corresponding weight (\(W\)), adds a bias (\(b\)), and then passes the sum through an activation function (\(f(x)\)) to produce an output. The weights and biases determine the impact of each input, while the activation function provides nonlinearity and enables the model to learn complex patterns27.

Fig. 1

Schematic diagrams of deep learning models; (a) DNN, (b) CNN.

A backpropagation approach is employed to adjust the weights between the nodes when training a DNN model. This strategy reduces the disparity between the forecasted and actual output by updating the weights until the network can reliably predict new data. Given sufficient training, a DNN model can identify complex relationships in data and generate precise predictions when presented with new data28.
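To make this architecture concrete, the following is a minimal sketch assuming a Keras/TensorFlow implementation (the paper does not name the framework); the 32 → 16 → 8 hidden-layer sizes follow the configuration reported later in Table 2, while the input shape and feature count are illustrative.

```python
# Minimal DNN sketch for one station's next-hour pollutant forecast.
# Assumptions: Keras/TensorFlow backend; inputs are windows of shape
# (window_size, n_features) that are flattened before the dense layers.
import tensorflow as tf
from tensorflow.keras import layers

def build_dnn(window_size: int = 10, n_features: int = 16) -> tf.keras.Model:
    """Fully connected network mapping a flattened input window to one pollutant value."""
    model = tf.keras.Sequential([
        layers.Input(shape=(window_size, n_features)),
        layers.Flatten(),                     # flatten the 10-hour window into one feature vector
        layers.Dense(32, activation="relu"),  # hidden layer 1
        layers.Dense(16, activation="relu"),  # hidden layer 2
        layers.Dense(8, activation="relu"),   # hidden layer 3
        layers.Dense(1),                      # output: next-hour concentration
    ])
    return model
```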

Convolutional neural network (CNN)

The convolutional and pooling layers are the essential elements of feature extraction in a CNN29. 1D-CNNs are typically applied to time series data owing to their strong feature extraction capabilities30. Alternating convolutional and pooling layers in the 1D-CNN enable the extraction of non-linear features from raw data, and the fully connected layer completes adaptive feature learning31.

The basic architecture of the CNN is outlined in Fig. 1b, comprising an input layer, several convolutional layers, several pooling layers, a fully connected layer, and an output layer. The convolutional and pooling layers are connected in an alternating fashion. In this arrangement, the CNN feature extraction module comprises the input, convolutional, and pooling layers, while the output module includes the fully connected and output layers32. The complete calculation formula is outlined in Eq. (1).

$$y_{j} = f\left( \sum_{i \in M_{j}} x_{i}^{l-1} \otimes w_{i,j}^{l} + b_{j}^{l} \right)$$
(1)

where \(f\) is the activation function, \(\otimes\) is the convolution operator, \(w\) is the weight matrix, and \(b\) is the bias term.
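A corresponding minimal 1D-CNN sketch is given below, again assuming Keras/TensorFlow; the 32 filters and kernel size of 2 follow the configuration reported later in Table 2, while the pooling and dense-layer sizes are illustrative.

```python
# Minimal 1D-CNN sketch following Fig. 1b and Eq. (1).
# Assumptions: Keras/TensorFlow backend; input windows of shape (window_size, n_features).
import tensorflow as tf
from tensorflow.keras import layers

def build_cnn(window_size: int = 10, n_features: int = 16) -> tf.keras.Model:
    """Conv1D + pooling feature extraction module followed by a fully connected output module."""
    model = tf.keras.Sequential([
        layers.Input(shape=(window_size, n_features)),
        layers.Conv1D(filters=32, kernel_size=2, activation="relu"),  # convolutional feature extraction
        layers.MaxPooling1D(pool_size=2),                             # pooling layer
        layers.Flatten(),
        layers.Dense(16, activation="relu"),                          # fully connected layer
        layers.Dense(1),                                              # output: next-hour concentration
    ])
    return model
```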

Model development and configuration

Ten nationally controlled air-quality monitoring sites—Aotizhongxin, Changping, Dongsi, Guanyuan, Huairou, Nongzhanguan, Shunyi, Tiantan, Wanliu, and Wanshouxigong—provided hourly air quality data for this study. The dataset spans from March 1, 2013, to February 28, 2017, and includes four temporal attributes (year, month, day, and hour), six major air pollutants (PM2.5, PM10, SO2, NO2, CO, and O3 in µg/m3), and six meteorological variables (dew point temperature (°C), air temperature (°C), pressure (hPa), wind direction, wind speed (m/s), and precipitation (mm)).

In preprocessing, each station’s dataset was analyzed to identify and impute missing values using linear interpolation (Table 1). The final cleaned dataset contained 35,064 instances per station. These were split into training, validation, and testing sets using a time-series window size of 10, yielding input/target shapes of (25,000, 10, number of features)/(25,000) for training, (5000, 10, number of features)/(5000) for validation, and (5054, 10, number of features)/(5054) for testing, as sketched below.

Table 1 Distribution of missing values (NaN) across columns in multiple stations.
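The interpolation, windowing, and chronological split can be sketched as follows with pandas/NumPy; the target column name and exact split indices are assumptions chosen to reproduce the reported (25,000 / 5,000 / 5,054) shapes rather than the authors' actual code.

```python
# Sketch of per-station preprocessing: impute gaps, build sliding windows, split chronologically.
import numpy as np
import pandas as pd

def make_windows(df: pd.DataFrame, target: str = "PM2.5", window_size: int = 10):
    """Linear interpolation of missing values, then windows of `window_size` past hours."""
    df = df.interpolate(method="linear").bfill()       # impute NaNs (bfill covers leading gaps)
    values = df.to_numpy(dtype="float32")
    y_col = df.columns.get_loc(target)
    X, y = [], []
    for i in range(len(df) - window_size):
        X.append(values[i:i + window_size])             # past 10 hours of all features
        y.append(values[i + window_size, y_col])        # next-hour target concentration
    X, y = np.array(X), np.array(y)
    # chronological split matching the reported shapes (35,064 rows -> 35,054 windows)
    X_train, y_train = X[:25000], y[:25000]
    X_val, y_val = X[25000:30000], y[25000:30000]
    X_test, y_test = X[30000:], y[30000:]
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)
```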

To improve prediction accuracy (Fig. 2), the proposed method incorporates both temporal and spatial features. Temporal features (e.g., hour, day, month) are inherently cyclical. To model this periodicity, each cyclical feature was encoded using sine and cosine transformations, allowing the model to capture repeating patterns, specifically Eqs. (2)–(4):

Fig. 2

Model development and configuration for forecasting hourly air pollutants in 10 Beijing stations.

$$hour_{sin}=\sin\left(\frac{2\pi \cdot hour}{24}\right),\quad hour_{cos}=\cos\left(\frac{2\pi \cdot hour}{24}\right)$$
(2)
$$day_{sin}=\sin\left(\frac{2\pi \cdot day}{31}\right),\quad day_{cos}=\cos\left(\frac{2\pi \cdot day}{31}\right)$$
(3)
$$month_{sin}=\sin\left(\frac{2\pi \cdot month}{12}\right),\quad month_{cos}=\cos\left(\frac{2\pi \cdot month}{12}\right)$$
(4)

These transformations help preserve the cyclical continuity (e.g., hour 23 to hour 0) and support the model in learning seasonal or diurnal effects, especially during transitions such as dawn/dusk or seasonal changes.
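A possible implementation of this encoding is shown below, assuming integer hour, day, and month columns in a pandas DataFrame; dropping the raw columns afterwards is an illustrative choice, not something the paper specifies.

```python
# Sketch of the cyclical sine/cosine encoding in Eqs. (2)-(4).
import numpy as np
import pandas as pd

def encode_cyclical(df: pd.DataFrame) -> pd.DataFrame:
    """Add sine/cosine pairs for hour (period 24), day (31), and month (12)."""
    df = df.copy()
    for col, period in [("hour", 24), ("day", 31), ("month", 12)]:
        df[f"{col}_sin"] = np.sin(2 * np.pi * df[col] / period)
        df[f"{col}_cos"] = np.cos(2 * np.pi * df[col] / period)
    return df.drop(columns=["hour", "day", "month"])  # keep only the encoded versions (optional)
```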

Spatial features are derived from the geographical and industrial characteristics unique to each monitoring site. These include proximity to traffic, industrial zones, and residential areas, which introduce local dependencies into pollution patterns. Rather than generalizing across all locations, the model trains separately for each station to account for such location-specific dynamics.

The rationale for selecting DNN and CNN lies in their respective strengths:

  i. DNNs are effective for learning non-linear feature interactions, especially when the dataset includes mixed data types (e.g., meteorological and pollutant data).

  ii. CNNs are chosen for their ability to capture local patterns across the time dimension, as convolutional filters can detect trends and abrupt changes in short sequences—an important feature in hourly air pollution data.

Both models were implemented for each station independently to capture station-specific pollution dynamics. Furthermore, in both models we set window_size = 10, which refers to the number of consecutive time steps (or rows) used to construct a single input sample for the time series forecasting model. The encoded approach outperformed the unencoded baseline in capturing temporal fluctuations and spatial heterogeneity. This approach led to improved air quality forecasting accuracy and provided insights into region-specific pollution trends.

As per Table 2, to ensure fair and optimal performance of both DNN and CNN architectures across unencoded and encoded feature sets, a systematic hyperparameter tuning process was conducted using grid search and fivefold cross-validation on the training data, aiming to minimize validation RMSE. The model architectures were designed to balance complexity and generalization, with DNNs using three dense layers (32 → 16 → 8 units) for progressive feature abstraction and CNNs employing a kernel size of 2 with 32 filters for efficient local pattern extraction. The Adam optimizer with a learning rate of 0.005—selected from the range {0.001, 0.003, 0.005, 0.01}—offered stable convergence and the lowest validation error. MSE was used as the loss function, while RMSE served as the evaluation metric for its interpretability. Models were trained for 100 epochs, with early stopping applied to prevent overfitting.

Table 2 Components and descriptions for DNN and CNN models with Unencoded and Encoded features.
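The training configuration described above might be implemented roughly as follows in Keras/TensorFlow; the batch size and early-stopping patience are not reported in the paper and are therefore illustrative assumptions.

```python
# Sketch of the training setup: Adam (lr = 0.005), MSE loss, RMSE metric,
# 100 epochs with early stopping on the validation loss.
import tensorflow as tf

def train(model, X_train, y_train, X_val, y_val):
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.005),
        loss="mse",                                          # training loss
        metrics=[tf.keras.metrics.RootMeanSquaredError()],   # evaluation metric
    )
    early_stop = tf.keras.callbacks.EarlyStopping(
        monitor="val_loss",
        patience=10,                  # patience not reported in the paper; illustrative
        restore_best_weights=True,
    )
    return model.fit(
        X_train, y_train,
        validation_data=(X_val, y_val),
        epochs=100,
        batch_size=64,                # batch size not reported in the paper; illustrative
        callbacks=[early_stop],
        verbose=0,
    )
```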

Forecasting metrics

In this paper, the following metrics were applied to evaluate the performance of the unencoded and encoded deep learning models:

  i. Mean Absolute Error (MAE)

     $$MAE=\frac{1}{N}\times \sum_{i=1}^{N}\left|{P}_{i}-{O}_{i}\right|$$
     (5)

  ii. Mean Squared Error (MSE)

     $$MSE=\frac{1}{N}\times \sum_{i=1}^{N}{\left({P}_{i}-{O}_{i}\right)}^{2}$$
     (6)

  iii. Root Mean Square Error (RMSE)

     $$RMSE=\sqrt{MSE}=\sqrt{\frac{1}{N}\times \sum_{i=1}^{N}{\left({P}_{i}-{O}_{i}\right)}^{2}}$$
     (7)
  iv. Coefficient of Determination (R2)

     $${R}^{2}= 1-\frac{\sum_{i=1}^{N}{\left({P}_{i}-{O}_{i}\right)}^{2}}{\sum_{i=1}^{N}{\left({O}_{i}-\overline{O}\right)}^{2}}$$
     (8)
  v. Willmott Index (WI)

     $$WI= 1-\frac{{\sum }_{i=1}^{N}{({O}_{i}-{P}_{i})}^{2}}{{\sum }_{i=1}^{N}{(\left|{P}_{i}-\overline{O}\right|+\left|{O}_{i}-\overline{O}\right|)}^{2}}$$
     (9)
  vi. Kling-Gupta Efficiency (KGE)

     $$KGE= 1-\sqrt{{\left(PCC-1\right)}^{2}+{\left(\frac{std}{rd}-1\right)}^{2}+{\left(\frac{\overline{P}}{\overline{O}}-1\right)}^{2}}$$
     (10)

In the above equations, \({O}_{i}\) is the observed (actual) value of air pollutants. \({P}_{i}\) is the forecasted value of air pollutants. \(\overline{O }\) and \(\overline{P }\) are the average values of the observed and forecasted values of air pollutants, respectively. \(PCC\), \(std,\) and \(rd\) are the Pearson correlation coefficient, the standard deviation of forecasted values, and the standard deviation of observation values, respectively.
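For reference, Eqs. (5)–(10) can be computed directly with NumPy; the sketch below assumes P and O are one-dimensional arrays of forecasted and observed concentrations and follows the standard Kling-Gupta form of Eq. (10).

```python
# Sketch of the forecasting metrics in Eqs. (5)-(10).
import numpy as np

def forecast_metrics(P: np.ndarray, O: np.ndarray) -> dict:
    """Compute MAE, MSE, RMSE, R2, Willmott Index, and KGE for forecasts P against observations O."""
    mae = np.mean(np.abs(P - O))
    mse = np.mean((P - O) ** 2)
    rmse = np.sqrt(mse)
    r2 = 1 - np.sum((P - O) ** 2) / np.sum((O - O.mean()) ** 2)
    wi = 1 - np.sum((O - P) ** 2) / np.sum((np.abs(P - O.mean()) + np.abs(O - O.mean())) ** 2)
    pcc = np.corrcoef(P, O)[0, 1]                      # Pearson correlation coefficient
    kge = 1 - np.sqrt((pcc - 1) ** 2
                      + (P.std() / O.std() - 1) ** 2   # variability ratio (std / rd)
                      + (P.mean() / O.mean() - 1) ** 2)  # bias ratio
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2, "WI": wi, "KGE": kge}
```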

Results and discussion

Aotizhongxin station

At Aotizhongxin station (Fig. 3), DNN and CNN models showed distinct strengths across pollutants. CNN-Unencoded performed best for CO (RMSE: 483.5 µg/m3) and PM10 (KGE: 0.921), while DNN-Encoded led in NO2 (KGE: 0.914), O3 (RMSE: 12.4 µg/m3), and SO2 (KGE: 0.952). The dew point was the dominant predictor in ~ 83% of top-performing models. Rainfall and hourly features contributed minimally (< 10%). PM2.5 forecasts exhibited the highest variability. Overall, performance was driven by broad temporal and environmental patterns, with each model excelling in specific pollutant contexts.

Fig. 3

Forecasting results of six air pollutants using Unencoded and Encoded deep learning models at Aotizhongxin Station.

Changping station

At Changping Station (Fig. 4), CNN-Encoded models showed consistently strong performance across pollutants. They achieved the lowest MAE and RMSE in 67% of cases and ranked highest in R2 or KGE for ~ 50%. For CO, CNN-Encoded had R2 = 0.849 and MAE = 279.9 µg/m3, outperforming CNN-Unencoded despite a slightly lower KGE. In NO2, CNN-Encoded reduced MAE by ~ 21% compared to DNN-Unencoded. O3 forecasts showed close performance: DNN-Encoded had the highest KGE (0.943), while CNN-Encoded achieved the lowest MAE (7.3 µg/m3). For PM2.5 and PM10, CNN-Unencoded slightly outperformed in KGE and R2, but CNN-Encoded had lower error metrics. SO2 results were mixed, with CNN-Unencoded leading in R2 (0.983), while DNN-Encoded topped in KGE (0.852). Dew point, month_cosine, and month were key predictors in over 80% of models, while rainfall and hourly features had < 10% impact.

Fig. 4

Forecasting results of six air pollutants using Unencoded and Encoded deep learning models at Changping Station.

Dongsi station

At Dongsi Station (Fig. 5), CNN models—particularly CNN-Encoded—demonstrated superior forecasting accuracy for most pollutants. CNN-Encoded achieved the lowest MAE and RMSE in 60% of pollutants, excelling in CO (MAE: 244.2 µg/m3), NO2 (MAE: 6.93 µg/m3; R2: 0.912), and SO2 (MAE: 2.21 µg/m3). CNN-Unencoded led in KGE and WI for PM2.5 (KGE: 0.959) and PM10 (KGE: 0.934), while DNN-Encoded had the lowest RMSE for PM2.5 (22.05 µg/m3) and the highest KGE for O3 (0.934). Dew point, month_cosine, and temperature were key predictors in > 80% of cases, while rainfall had < 10% influence. Overall, CNN models outperformed others in MAE/RMSE in most cases, confirming their robustness in pollutant forecasting.

Fig. 5

Forecasting results of six air pollutants using Unencoded and Encoded deep learning models at Dongsi Station.

Guanyuan station

At Guanyuan Station (Fig. 6), DNN-Encoded models outperformed in 60% of pollutants, achieving top KGE and NSE for CO (KGE: 0.949; NSE: 0.9), PM2.5, and SO2, along with the lowest MAE/RMSE for CO and PM2.5. CNN-Encoded led in O3 (KGE: 0.936; WI: 0.97) and PM10 (KGE: 0.949; WI: 0.982). DNN-Unencoded had the highest R2 for NO2 (0.916) and lowest MAE/RMSE for O3, despite lower KGE/NSE. Key predictors across models included dew point, month_cosine, and temperature (relevant in > 80% of top-performing models), while rainfall and hour_sine had < 10% impact. Overall, encoded models performed better on KGE/NSE/WI, while unencoded models excelled in MAE/RMSE for select pollutants, emphasizing pollutant-specific model suitability.

Fig. 6

Forecasting results of six air pollutants using Unencoded and Encoded deep learning models at Guanyuan Station.

Huairou station

At Huairou Station (Fig. 7), CNN-Encoded models outperformed others in ~ 70% of pollutant forecasts, achieving top KGE (0.951) and WI (0.976) for CO, and lowest errors for CO, NO2, and O3. DNN-Encoded excelled for PM2.5 (KGE: 0.949), PM10, and SO2, showing better KGE and NSE in ~ 30% of cases. Encoding improved performance across all pollutants, particularly for CO, NO2, and O3. Key predictors—dew point, month, and temperature—were influential in over 80% of top-performing models, while rainfall and hourly features had minimal impact (< 10%). Overall, encoded models consistently delivered superior accuracy by effectively capturing temporal and environmental patterns.

Fig. 7

Forecasting results of six air pollutants using Unencoded and Encoded deep learning models at Huairou Station.

Nongzhanguan station

At Nongzhanguan Station (Fig. 8), CNN-Encoded models outperformed others in ~ 70% of pollutants, achieving top KGE for CO (0.960), O3 (0.962), and NO2 (0.943), with the lowest MAE/RMSE (e.g., CO MAE: 209.2 μg/m3; O3 MAE: 7.1 μg/m3). DNN-Unencoded excelled in PM2.5 with the highest KGE (0.976) and lowest RMSE (19.7 μg/m3). Forecasts for CO, NO2, and O3 had strong accuracy (KGE > 0.92; R2 > 0.91), while SO2 had the lowest performance (max KGE: 0.933; R2: 0.88). Dew point, temperature, and pressure were key predictors in > 80% of cases; rainfall had a < 10% impact. Overall, CNN-Encoded models proved most effective for pollutant forecasting, driven by strong meteorological and temporal features.

Fig. 8

Forecasting results of six air pollutants using Unencoded and Encoded deep learning models at Nongzhanguan Station.

Shunyi station

At Shunyi Station (Fig. 9), CNN-Encoded models outperformed in ~ 60% of pollutants, achieving top metrics for CO (KGE: 0.946, R2: 0.907), PM10 (KGE: 0.945, R2: 0.937), and SO2 (MAE: 2.62 µg/m3, RMSE: 6.56 µg/m3). DNN-Encoded led in NO2 (KGE: 0.928, R2: 0.908), O3 (KGE: 0.938), and PM2.5 (KGE: 0.969, R2: 0.953). CO and O3 had the strongest model performance (KGE > 0.94), linked to high meteorological sensitivity. Dew point and month_cosine were key drivers in > 80% of cases, while rainfall had a < 10% impact. Overall, CNN-Encoded models provided superior forecasts when pollutant levels were strongly meteorology-dependent.

Fig. 9

Forecasting results of six air pollutants using Unencoded and Encoded deep learning models at Shunyi Station.

Tiantan station

At Tiantan Station (Fig. 10), encoded models outperformed unencoded ones in ~ 70% of cases. CNN-Encoded achieved top results for CO, PM2.5, and SO2 (KGE > 0.93, R2 > 0.88, MAE: 210.2 µg/m3 for CO; 10.53 µg/m3 for PM2.5). DNN-Encoded led in NO2 (KGE: 0.941, MAE: 7.32 µg/m3) and O3 (NSE: 0.917, R2: 0.92). PM10 was best predicted by CNN-Unencoded (KGE: 0.938). Key drivers (dew point, temperature, pressure, wind) influenced forecasts in > 80% of cases, while rainfall had a < 10% impact. Overall, encoded models improved accuracy by effectively capturing seasonal and atmospheric patterns.

Fig. 10

Forecasting results of six air pollutants using Unencoded and Encoded deep learning models at Tiantan Station.

Wanliu station

At Wanliu Station (Fig. 11), CNN-Encoded models outperformed unencoded ones in ~ 75% of cases, achieving top accuracy for CO, PM2.5, and SO2 (KGE > 0.93, R2 > 0.88, MAE: 240.55 μg/m3 for CO; 9.78 μg/m3 for PM2.5). DNN-Encoded led NO2 forecasting (KGE: 0.938, MAE: 7.22 μg/m3) and O3 (NSE: 0.912, R2: 0.923). PM10 predictions were best by CNN-Unencoded (KGE: 0.931, MAE: 15.64 μg/m3). Key meteorological drivers (dew point, temperature, pressure, wind) influenced > 80% of results, while rainfall had minimal (< 10%) impact. Encoded features improved forecast accuracy by effectively capturing complex temporal and atmospheric patterns.

Fig. 11

Forecasting results of six air pollutants using Unencoded and Encoded deep learning models at Wanliu Station.

Wanshouxigong station

At Wanshouxigong Station (Fig. 12), CNN-Encoded models led in ~ 80% of cases, achieving the highest accuracy for CO (KGE: 0.964, R2: 0.938, MAE: 192.96 µg/m3) and PM2.5 (KGE: 0.958, R2: 0.948, MAE: 1.85 µg/m3). O3 was driven by temperature, wind, and hourly cycles; NO2 and SO2 showed strong seasonal (month) effects. PM10 performed best with DNN-Encoded (R2: 0.935, RMSE: 28.20 µg/m3). SO2 had lower accuracy but CNN-Encoded still improved errors (RMSE: 4.02 µg/m3). Rainfall had a minimal (< 10%) impact. Encoding boosted forecast reliability by capturing seasonal and short-term variations and reducing errors.

Fig. 12

Forecasting results of six air pollutants using Unencoded and Encoded deep learning models at Wanshouxigong Station.

Remarks and comparison

As per Tables 3, 4, 5, 6, 7 and 8, model performance across pollutants and locations is generally high, with R2 values mostly exceeding 0.85, reflecting strong predictive accuracy. PM2.5 and PM10 exhibit the highest and most consistent R2 scores, often above 0.94, indicating excellent model fit across all sites and methods. O3 predictions also show robust results, generally above 0.89. CO and NO2 show slightly more variation but still maintain strong performance, with CNN models—especially those using encoded inputs—tending to have a slight edge over DNNs. SO2 predictions are the most variable and generally lower, with some locations such as Changping and Guanyuan showing R2 values closer to 0.76–0.78, suggesting more complexity or noise in the data. Locations such as Nongzhanguan, Wanshouxigong, and Tiantan consistently yield higher R2 values across pollutants and models, indicating more stable data or better model generalization, whereas Changping and Guanyuan often show comparatively lower performance. Overall, CNN architectures with encoded inputs generally offer marginal improvements, particularly for more challenging pollutants like SO2 and CO. Moreover, based on Table 9, CNN achieves the highest R2 in 70% of the cases (14 out of 20 combinations) across both unencoded and encoded features. DNN follows, ranking highest in 25% of cases, while LSTM leads only once (5%). ANN consistently underperforms, with the lowest R2 in 90% of the stations when features are encoded. Top-performing stations like Nongzhanguan, Tiantan, and Wanliu record R2 values above 0.96 with CNN, while lower-performing stations like Changping and Huairou have values around 0.94 or below, highlighting site-specific variability in model accuracy.

Table 3 Comparison of R2 values for pollutant CO across locations.
Table 4 Comparison of R2 values for pollutant NO2 across locations.
Table 5 Comparison of R2 values for pollutant O3 across locations.
Table 6 Comparison of R2 values for pollutant PM2.5 across locations.
Table 7 Comparison of R2 values for pollutant PM10 across locations.
Table 8 Comparison of R2 values for pollutant SO2 across locations.
Table 9 Comparison of R2 values for PM2.5 prediction across stations using deep learning models with unencoded and encoded features.

Conclusion, limitations and future directions

This research highlights how effective deep learning models—specifically DNN and CNN frameworks—are at predicting major urban air pollutants across several monitoring locations using four years of hourly data. Both models delivered strong predictive performance, showing a high level of alignment between real and predicted pollutant values. Notably, the inclusion of feature encoding greatly boosted model accuracy, leading to steady gains of about 2–5% in key evaluation metrics like R2, NSE, and KGE.

The findings show that CNNs excelled at detecting spatial and temporal pollution patterns, particularly for pollutants such as CO and PM2.5. Meanwhile, DNNs demonstrated strong results across a wider range of pollutants. Feature encoding proved essential in enhancing the models’ ability to generalize and reduce prediction errors, underscoring the value of preprocessing in forecasting air quality over time.

Differences between monitoring sites showed the models could adapt to varying pollution trends and levels, reinforcing their usefulness in a variety of urban contexts. These insights suggest that deep learning, especially when supported by encoded features, holds significant potential for delivering accurate and scalable air quality predictions—tools that could be crucial for city planning and public health efforts.

Despite these strong results, the study is not without limitations. The deep learning models require significant computational resources—GPU or cloud-based infrastructure—which may limit their application in resource-constrained settings. Moreover, external drivers beyond the available meteorological variables, such as traffic emissions, industrial output, and humidity, were not included in the modeling pipeline. Incorporating these factors could potentially improve model accuracy by up to 10–15%, based on evidence from related literature. Additionally, the geographic scope of the study was limited to Beijing, reducing the model’s generalizability to regions with different climatic and socio-economic profiles.

The practical implications of this research are remarkable. High-accuracy pollutant forecasting, with R2 values above 0.90 in many cases, can support early warning systems, enabling city authorities to issue timely health advisories and reduce exposure risks. The integration of these models into smart city infrastructure could lead to more efficient urban planning, including dynamic traffic control and targeted industrial regulation. Furthermore, the framework demonstrated in this study provides a scalable foundation for AI-driven air quality management, capable of being deployed in various urban areas.

For future research, expanding the input feature space to include additional meteorological and socioeconomic variables is essential. Preliminary studies indicate that adding weather-related variables can increase forecasting accuracy by 8–12%. Model interpretability should also be prioritized using tools such as SHAP or attention mechanisms to uncover the influence of specific features on predictions. Furthermore, incorporating data from multiple cities with different pollution profiles would enhance the framework’s adaptability and general applicability. Finally, transitioning to real-time, cloud-based deployment can provide scalable, on-demand predictions. Hybrid models that combine deep learning with physical or statistical modeling may improve prediction robustness by 10–20%, offering a promising direction for next-generation environmental forecasting systems.