Abstract
This study presents an air quality prediction framework that integrates factor analysis with deep learning models for accurate prediction of the original variables. Using data from Beijing’s Tiantan station, factor analysis was applied to reduce dimensionality. We embed the factor score matrix into a Transformer model, leveraging self-attention to capture long-term dependencies and marking a clear advance over traditional LSTM methods. The hybrid framework outperforms LSTM and surpasses N-BEATS and Informer models combined with principal component analysis and factor analysis. Residual analysis and \({R}^{2}\) evaluation confirmed superior accuracy and stability, with the maximum likelihood factor analysis Transformer model achieving an MSE of 0.1619 and \({R}^{2}\) of 0.8520 for Factor 1, and an MSE of 0.0476 and \({R}^{2}\) of 0.9563 for Factor 2. Additionally, we introduced a CNN-BILSTM-ATTENTION model with discrete wavelet transform, which improves predictive performance by extracting local features, capturing temporal dependencies, and emphasizing key time steps; its MSE was 0.0405, with all \({R}^{2}\) values above 0.94, demonstrating exceptional performance. This study emphasizes the integration of factor analysis with deep learning, transforming latent relationships among variables into inputs for predictive models. Future plans include optimizing factor extraction, exploring external data sources, and developing more efficient deep learning architectures.
Introduction
Air quality prediction plays a crucial role in modern society’s environmental governance and public health management. With the acceleration of urbanization, air pollution has become increasingly severe, directly impacting residents’ quality of life and health. Accurate air quality prediction can help governments and the public take timely and effective measures to mitigate the harm caused by air pollution, and provide scientific support for the formulation and implementation of environmental policies. In recent years, with the rapid development of data science and deep learning technologies, significant progress has been made in air quality prediction models. However, when facing high-dimensional complex data and non-linear pollution trends, traditional prediction methods remain inadequate, especially in capturing long-term temporal dependencies and extracting latent associative features.

To address these issues, this study proposes an innovative air quality prediction framework aimed at improving the accuracy and stability of prediction models by combining factor analysis with deep learning techniques. In terms of research methodology, factor analysis was applied to the air quality data from the Tiantan station in Beijing, extracting the main factors and providing a simplified, efficient data structure for subsequent modeling. During the time series prediction phase, we introduced the Transformer model based on the self-attention mechanism, using factor scores as input and fully leveraging its advantages in capturing long-term temporal dependencies and modeling sequential data. Additionally, the CNN-BILSTM-ATTENTION hybrid deep learning architecture, combined with wavelet transform, further enhanced the model’s performance.

The fundamental purpose of this study is to investigate the effectiveness of the combined factor analysis and deep learning prediction framework in forecasting air quality. We believe that this hybrid modeling method can effectively address the shortcomings of standard models when dealing with large volumes of data, complex pollutant dynamics, and temporal dependencies, resulting in better and more efficient predictions. This research aims to provide a novel methodological framework for air quality analysis and prediction, as well as scientific support for environmental governance and policy formation, thereby serving as a reference for air quality monitoring and management in similar settings.
This paper first proposes a deep learning prediction approach based on factor analysis. Initially, factor analysis is conducted on the data, revealing that two interpretable factors can be extracted: Factor 1 explains four original variables, while Factor 2 explains two. We then integrate factor analysis with the Transformer model for overall factor prediction. A series of comparative studies follows, contrasting this integrated approach with advanced baselines, including models combined with principal component analysis, as well as predictions from two different factor analysis methods. The results indicate that factor analysis using the maximum likelihood method combined with the Transformer model stands out in performance, making it suitable as the model for overall factor prediction. Next, we employ the discrete wavelet transform to decompose the output of this model into multiple scales. These decomposed components serve as feature inputs for the CNN-BILSTM-ATTENTION model, which we use to conduct localized, variable-specific predictions for the original variables explained by each of the two factors. In summary, the research logic of this paper follows a holistic-to-local approach.
Related work
Time series analysis has extensive applications across various fields. For instance, Saeed and Aldera1 proposed an adaptive renewable energy forecasting method based on a PCA-Transformer. This approach leverages the advantages of Principal Component Analysis (PCA) and the Transformer model, significantly enhancing the accuracy of renewable energy output forecasting by using PCA for dimensionality reduction and Transformers for capturing long-term dependencies in time series data. The study demonstrated that this method excels in handling complex renewable energy datasets, dynamically adjusting the model structure to accommodate varying data complexity and scale, thereby improving prediction efficiency and reliability.
In a 2022 study, Wan et al.2 introduced a model that integrates Convolutional Neural Networks, Long Short-Term Memory networks, and attention mechanisms for short-term power load forecasting. The model extracts high-dimensional features through one-dimensional CNN layers, captures temporal dependencies within time series using LSTM layers, and incorporates an attention mechanism to optimize the weights of LSTM outputs, enhancing the impact of critical information. Experimental results indicated that this model outperformed traditional LSTM models in predicting the power load of two thermal power plant units, improving prediction accuracy by 7.3% and 5.7%, respectively. This research highlights the effectiveness of incorporating attention mechanisms into power load forecasting, providing a more efficient solution for energy management.
In another 2022 study, Zhao et al.3 proposed a convolutional neural network combining wavelet transform and attention mechanisms for image classification. The model employs discrete wavelet transform to decompose feature maps into low-frequency and high-frequency components, storing the structural information of basic objects and detailed features or noise, respectively. Attention mechanisms are then applied to capture detailed information.
In recent years, deep learning has significantly advanced meteorological prediction, particularly in handling nonlinear spatiotemporal data. Gong et al.4 proposed a CNN-LSTM hybrid model that uses CNN for spatial feature extraction and LSTM for temporal dependencies, achieving high accuracy in historical temperature prediction with MAE optimization, supporting agriculture and energy management. Similarly, Shen et al.5 designed a multi-scale CNN-LSTM-Attention model that incorporates attention mechanisms to focus on key features, achieving an MSE of 1.98 and RMSE of 0.81 in temperature prediction for eastern China, validating the effectiveness of multi-scale feature fusion.
In air quality prediction, Bekkar et al.6 explored a CNN-LSTM hybrid model, integrating spatiotemporal features such as PM2.5 data from adjacent stations and meteorological variables, achieving superior performance in hourly PM2.5 prediction in Beijing compared to traditional models. This work provided methodological references for subsequent studies like Kumar’s multi-view model.
For severe convective weather forecasting, Zhang et al.7 developed the CNN-BILSTM-AM model, combining bidirectional LSTM and attention mechanisms, using ERA5 data to predict precipitation 0–6 h in advance. This model outperformed traditional numerical models like WRF and highlighted the importance of total precipitable water (PWAT) and convective available potential energy (CAPE). In extreme weather event prediction, Alijoyo et al.8 proposed a hybrid CNN-BILSTM model optimized with a genetic algorithm and fruit fly optimizer (FFO), achieving a 99.4% accuracy in cyclone intensity prediction, significantly outperforming existing methods such as VGG-16.
Multi-step forecasting and data decomposition methods have further enhanced model stability. Coutinho et al.9 compared SD-CNN-LSTM and EEMD-CNN-LSTM, finding SD-CNN-LSTM better for single-step forecasts and EEMD-CNN-LSTM more stable for long-term forecasts, with error metrics significantly lower than undecomposed models. For multivariate meteorological data, Bai et al.10 proposed a hybrid model combining CNN-BILSTM with Random Forest, achieving MAE and MSE reductions of 35.6% and 57.5%, respectively, in temperature prediction for Changsha.
In pollutant prediction, Kumar and Kumar11 proposed MvS CNN-BILSTM, reducing RMSE by 3.8–7.1% in PM2.5 prediction. Park et al.12 used CNN with dynamic climate data (KPOPS) for monthly PM2.5 prediction in Seoul, while Yin and Sun13 combined CEEMD-DWT decomposition with BILSTM-InceptionV3-Transformer, reducing MAE by 57.41% in wind power prediction.
Although current research has made breakthroughs in model architecture optimization and multi-source data fusion, it has not sufficiently explored dimensionality-reduction prediction tasks. Lv et al.14 used factor analysis to select key indicators in wastewater treatment, improving the accuracy of machine learning predictions. However, no study in meteorological prediction has combined factor analysis with deep learning models. This gap limits model efficiency and generalization on high-dimensional meteorological data. Future research should urgently explore integrating factor analysis with deep learning models to unlock more prediction potential.
Dataset overview
We adopted the “Beijing Multi-Site Air-Quality Data Set” from Kaggle, which covers air quality monitoring data for the Tiantan station in Beijing from March 1, 2013 to February 28, 2017. The dataset includes hourly pollutant concentration records from multiple nationally controlled air quality monitoring stations. The data were sourced from the Beijing Environmental Monitoring Center and cross-referenced with meteorological records provided by the China Meteorological Administration to ensure completeness and accuracy. The selection of air quality features is based on sensor availability: with advances in modern environmental monitoring technology, many sensors can measure pollutant concentrations and meteorological conditions in real time and with high accuracy. Pollutants such as PM2.5, PM10, SO2, NO2, CO, and O3, as well as meteorological variables such as temperature, pressure, dew point, and wind speed, can therefore all be obtained from existing sensors, together providing a solid data foundation for precise air quality prediction.
Research methodology
Data preprocessing and visualization
Data cleaning
To ensure the integrity of the time series data and address issues of missing values, we used the column-wise mean value imputation method. By filling in missing values with the column mean, the bias caused by missing data can be reduced, as this method assumes that the missing values are random and similar to the mean of other observations. At the same time, after filling in the missing values, the integrity of the dataset is maintained, which helps the model utilize all available information during training.
Standardization
During the data preprocessing phase, in order to eliminate differences in scale among the variables and enhance the numerical stability of the data, we applied Z-score standardization to all variables. The mathematical expression for Z-score standardization is:

\({Z}_{i}=\frac{{X}_{i}-\mu }{\sigma }\)

where \({Z}_{i}\) represents the standardized data value, \({X}_{i}\) denotes the original data value, and \(\mu\) and \(\sigma\) correspond to the mean and standard deviation of the variable. Z-score standardization ensures that different variables are optimized on the same scale during model training, thereby enhancing the learning efficiency and predictive accuracy of the model.
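As an illustration, the imputation and standardization steps can be sketched as follows. This is a minimal sketch: the file path and column list mirror the Kaggle dataset’s naming and are assumptions of the sketch, not a verbatim reproduction of our pipeline.

```python
import pandas as pd

# Load the Tiantan station file from the Kaggle dataset (path is illustrative).
df = pd.read_csv("PRSA_Data_Tiantan_20130301-20170228.csv")

# Columns used in this study; names follow the Kaggle dataset.
cols = ["PM2.5", "PM10", "SO2", "NO2", "CO", "O3", "TEMP", "PRES", "DEWP", "WSPM"]

# Column-wise mean imputation: replace each missing value with its column mean.
df[cols] = df[cols].fillna(df[cols].mean())

# Z-score standardization: Z = (X - mu) / sigma, applied per column.
z = (df[cols] - df[cols].mean()) / df[cols].std()
```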
Data visualization
To intuitively understand the distribution characteristics of various air pollutant indicators and meteorological factors, we conducted a visualization analysis of the standardized data. Violin plots were generated to illustrate the probability density distribution of each variable. These plots combine the statistical information of box plots with kernel density estimation, providing a more comprehensive representation of the data distribution. Figure 1 shows the distribution of air pollutant and meteorological variables using violin plots.
From the visualization results, it can be observed that PM2.5 and PM10 have relatively wide distributions, indicating large concentration variations and the potential presence of extreme values. NO2 and CO have more concentrated distributions with relatively smaller variations, while SO2 has a narrower distribution, reflecting lower variability. The distribution of O3 is likewise relatively concentrated, showing smaller variability.
Correlation analysis
We subsequently calculated the correlation matrix and visualized it using a heatmap to intuitively present the correlations between variables. Figure 2 presents the heatmap of the correlation matrix for air quality variables.
The results indicate that there is a strong correlation between PM2.5, PM10, NO2, and CO, suggesting that they may be driven by the same latent factor.
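Figures 1 and 2 can be reproduced with standard plotting tools; below is a sketch assuming the standardized frame `z` from the preprocessing step above, with figure sizing chosen for illustration.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Violin plots of the standardized variables (cf. Fig. 1).
fig, ax = plt.subplots(figsize=(12, 5))
sns.violinplot(data=z, ax=ax)
ax.set_ylabel("Standardized value")
plt.show()

# Correlation heatmap of the air quality variables (cf. Fig. 2).
corr = z.corr()
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", square=True)
plt.show()
```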
Framework overview
Figure 3 illustrates the overall flowchart of the air quality prediction framework.
Factor analysis
Factor analysis applicability analysis
Next, we conducted an applicability test on the air quality data to ensure that it meets the basic assumptions and requirements for factor analysis. For this purpose, we used the KMO test and Bartlett’s sphericity test. The KMO statistic is calculated as:

\(KMO=\frac{\sum \sum_{i\ne j}{r}_{ij}^{2}}{\sum \sum_{i\ne j}{r}_{ij}^{2}+\sum \sum_{i\ne j}{a}_{ij}^{2}}\)

where \({r}_{ij}\) represents the simple correlation coefficient between variables \(i\) and \(j\), and \({a}_{ij}\) represents their partial correlation coefficient.
The statistic for Bartlett’s sphericity test is:

\({\chi }^{2}=-\left[(n-1)-\frac{2p+5}{6}\right]\text{ln}|R|\)

where \(n\) is the sample size, \(p\) is the number of variables, and \(|R|\) is the determinant of the correlation matrix. Under the null hypothesis that the correlation matrix is an identity matrix, this statistic follows a chi-square distribution with \(p(p-1)/2\) degrees of freedom.
Through computation, the KMO value is 0.6799, which exceeds the minimum applicability threshold of 0.6 for factor analysis. The p-value for Bartlett’s sphericity test is 0.0, which is significantly less than 0.05, indicating significant correlations among the variables and confirming the suitability for conducting factor analysis.
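Both tests can be computed directly from the standardized data. Below is a minimal sketch using the open-source factor_analyzer package, assuming `z` is the standardized data frame from the preprocessing step.

```python
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity, calculate_kmo

# Bartlett's sphericity test: returns the chi-square statistic and p-value.
chi_square, p_value = calculate_bartlett_sphericity(z)

# KMO: returns per-variable KMO values and the overall KMO statistic.
kmo_per_variable, kmo_total = calculate_kmo(z)

print(f"Bartlett chi2 = {chi_square:.1f}, p = {p_value:.4f}")  # p << 0.05 expected
print(f"Overall KMO = {kmo_total:.4f}")                        # ~0.68 reported above
```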
Factor analysis models
The factor analysis model assumes that the observed variables are generated by linear transformations of latent factors plus error terms. The model can be expressed as:

\(X=\Lambda z+\epsilon\)

where \(\Lambda\) is the factor loading matrix, describing the influence of the factors on the observed variables, \(\epsilon\) denotes the error term, and \(z\) represents the factors, which are assumed to be independent of the errors.
We estimated the parameters of the factor analysis model using the maximum likelihood method and the principal component method. Under the maximum likelihood method, with model covariance \(\Sigma =\Lambda {\Lambda }^{T}+\Psi\), the log-likelihood to be maximized is:

\(\ell (\Lambda ,\Psi )=-\frac{N}{2}\left[\text{ln}|\Sigma |+\text{tr}({\Sigma }^{-1}S)\right]+c\)

where \(S\) is the sample covariance matrix and \(c\) is a constant. Since the factors \(z\) are latent variables, we employed the Expectation–Maximization algorithm for estimation15.
The principal component method is a commonly used approach that determines factor loadings by calculating the eigenvalues and eigenvectors of the data’s covariance or correlation matrix. First, solve for the eigenvalues \({\lambda }_{i}\) and corresponding eigenvectors \({e}_{i}\) of the correlation matrix \(R\) of the observed variables:

\(R{e}_{i}={\lambda }_{i}{e}_{i}\)

Then, select the eigenvectors corresponding to the \(k\) largest eigenvalues to form the initial estimate of the factor loading matrix \(L\):

\(L=\left[{e}_{1},{e}_{2},\dots ,{e}_{k}\right]\Lambda\)

where \(\Lambda =\text{diag}(\sqrt{{\lambda }_{1}},\dots ,\sqrt{{\lambda }_{k}})\) is a diagonal matrix containing the square roots of the selected eigenvalues.
To enhance the interpretability of the factors, orthogonal rotation is typically performed.
Orthogonal rotation is a transformation method aimed at adjusting the factor structure so that each variable’s loading on a particular factor tends to be extreme, thereby improving the interpretability of the factors. The most common orthogonal rotation method is Varimax Rotation, which seeks to maximize the variance of the factor loadings, resulting in each factor having high loadings on a few variables and low loadings on others.
Let the rotation matrix be \(R\). The rotated factor loading matrix \(L^{{\prime }}\) is given by:

\({L}^{{\prime}}=LR\)

where \(R\) is an orthogonal matrix satisfying \({R}^{T}R=I\).
The rotation matrix \(R\) is obtained through iterative optimization (varimax with Kaiser normalization), which makes the rotated factor loading matrix sparser.
Finally, the factor scores are computed from the rotated loadings using the regression estimator described below, thereby completing the process of factor analysis.
The communalities of the variables were computed as:

\({h}_{i}^{2}=\sum_{j=1}^{m}{\lambda }_{ij}^{2}\)

where \({h}_{i}^{2}\) denotes the communality of variable \(i\), \({\lambda }_{ij}\) is the loading of variable \(i\) on factor \(j\), and \(m\) is the number of factors.
In order to determine the number of factors to be utilized in subsequent predictive modeling, we calculated the variance contribution of the four factors. Tables 1 and 2 show the factor loading matrices from maximum likelihood and principal component methods to determine the number of factors for predictive modeling. The results are as follows:
As can be seen from the results in Tables 1 and 2, given that the cumulative variance contribution of the first two factors exceeds 0.7, while the variance contributions of the third and fourth factors are too low, we decided to exclude the latter two factors in order to extract the key features and facilitate subsequent interpretation.
The results indicate that TEMP has a communality close to 1, suggesting that its variance is almost entirely explained by the extracted factors. In terms of explained variance, Factor 1 and Factor 2 account for 3.515 and 1.944, respectively, highlighting Factor 1’s dominant role in the data structure.
After iteration, we obtained the factor loading matrix. For the sake of convenient exposition, we have selected the factor loading matrix and factor score matrix derived from the maximum likelihood method of factor analysis. Table 3 presents the factor loading matrix.
The factor loadings indicate that Factor 1 has a strong loading on PM2.5, PM10, NO2, and CO, all exceeding 0.7, suggesting that these pollutants may share a common source, such as traffic emissions or industrial pollution. Factor 2 has strong loadings on TEMP and DEWP, indicating its relationship with temperature-related meteorological factors.
We calculated the factor score matrix using the regression method under the maximum likelihood solution, expressed as:

\(F=X{\left(\Lambda {\Lambda }^{T}+\Psi \right)}^{-1}\Lambda\)

where \(F\) is the factor score matrix, \(\Lambda\) is the factor loading matrix, \(\Psi\) is the error covariance matrix, and \(X\) is the standardized observation data matrix. Table 4 shows the factor score matrix.
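For reproducibility, both estimation routes are available in the factor_analyzer package. The sketch below assumes that package’s API (method="ml" / "principal", rotation="varimax"), with transform() returning regression-based factor scores; it is an illustration rather than our exact code.

```python
from factor_analyzer import FactorAnalyzer

# Maximum likelihood factor analysis with varimax rotation, two retained factors.
fa_ml = FactorAnalyzer(n_factors=2, method="ml", rotation="varimax")
fa_ml.fit(z)

# Principal component method under the same settings, for comparison.
fa_pc = FactorAnalyzer(n_factors=2, method="principal", rotation="varimax")
fa_pc.fit(z)

loadings = fa_ml.loadings_                 # rotated loading matrix (cf. Table 3)
communalities = fa_ml.get_communalities()  # h_i^2 for each variable
variance = fa_ml.get_factor_variance()     # variance, proportion, cumulative
scores = fa_ml.transform(z)                # factor score matrix F (cf. Table 4)
```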
Factor analysis combined with transformer models
Combining two factor analysis methods with transformer models
After the aforementioned analysis, we have identified two factors encompassing six original variables. Our subsequent research focuses on prediction based on dimensionality reduction through factor analysis. We embed the factor scores as features into a Transformer-based time series prediction model to enhance predictive performance. The scaled dot-product attention is calculated as:

\(\text{Attention}(Q,K,V)=\text{softmax}\left(\frac{Q{K}^{T}}{\sqrt{{d}_{k}}}\right)V\)

where \(Q\), \(K\), and \(V\) represent the query matrix, key matrix, and value matrix, respectively, and \({d}_{k}\) is the dimension of the key. The multi-head attention mechanism applies self-attention multiple times, which can be expressed as:

\(\text{MultiHead}(Q,K,V)=\text{Concat}\left({\text{head}}_{1},\dots ,{\text{head}}_{h}\right){W}^{O}\)
Each head’s computation is identical to standard attention, and the results are linearly transformed via \({W}^{O}\). The Transformer effectively captures nonlinear patterns and long-term dependencies in time series data through this mechanism. We serialize the factor scores into an input matrix:

\(X=\left[\begin{array}{cc}{f}_{1,1}& {f}_{1,2}\\ {f}_{2,1}& {f}_{2,2}\\ \vdots & \vdots \\ {f}_{T,1}& {f}_{T,2}\end{array}\right]\)

where \({f}_{t,1}\) and \({f}_{t,2}\) represent the scores of Factor 1 and Factor 2 at time step \(t\), respectively. The target matrix is defined as the factor scores shifted one step ahead:

\(Y=\left[\begin{array}{cc}{f}_{2,1}& {f}_{2,2}\\ \vdots & \vdots \\ {f}_{T+1,1}& {f}_{T+1,2}\end{array}\right]\)
We aim to use the factor scores at time \(t\) to predict the trend of factor changes at the next time step \(t+1\). The optimization objective of the model is to minimize the mean squared error, with the loss function defined as:

\(\mathcal{L}=\frac{1}{n}\sum_{i=1}^{n}{\left({y}_{i}-{\widehat{y}}_{i}\right)}^{2}\)

where \({y}_{i}\) is the true value, and \({\widehat{y}}_{i}\) is the model’s predicted value.
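For illustration, the serialization into one-step-ahead training pairs can be sketched as follows; the window length of 24 h is an assumed value for the sketch, and `scores` is the \(T\times 2\) factor score matrix from the factor analysis step.

```python
import numpy as np

def make_one_step_pairs(scores: np.ndarray, window: int = 24):
    """Build (input window, next-step target) pairs from the T x 2 score matrix.

    X[i] holds `window` consecutive rows of factor scores; y[i] is the row
    immediately after the window, i.e. the factor scores at time step t + 1.
    """
    X, y = [], []
    for i in range(len(scores) - window):
        X.append(scores[i : i + window])
        y.append(scores[i + window])
    return np.asarray(X), np.asarray(y)

X, y = make_one_step_pairs(scores)  # X: (N, 24, 2), y: (N, 2)
```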
We optimized the hyper-parameters of the Transformer model through grid search combined with tenfold cross-validation; the same cross-validation protocol was used for all tests in this study. The results are shown in Table 5. The best hyper-parameter combination was found to be: 2 attention heads, a key dimension of 32, 64 units in the dense layer, and a learning rate of 0.001. This combination performs excellently on multiple metrics, including MSE, RMSE, and \({R}^{2}\).
Tables 6, 7, 8 and 9 present the architecture and hyper-parameter tables of the prediction models combining two factor analysis methods with the Transformer.
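To make the configuration concrete, the following is a minimal Keras sketch of a Transformer encoder block using the grid-searched hyper-parameters (2 attention heads, key dimension 32, 64 dense units, learning rate 0.001). The surrounding layer stack (normalization placement, pooling, window length) is an illustrative assumption rather than the exact architecture of Tables 6, 7, 8 and 9.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_factor_transformer(window: int = 24, n_factors: int = 2) -> tf.keras.Model:
    inputs = layers.Input(shape=(window, n_factors))
    # Self-attention with the grid-searched settings: 2 heads, key_dim 32.
    attn = layers.MultiHeadAttention(num_heads=2, key_dim=32)(inputs, inputs)
    x = layers.LayerNormalization()(inputs + attn)    # residual + norm
    ff = layers.Dense(64, activation="relu")(x)       # 64-unit dense layer
    ff = layers.Dense(n_factors)(ff)
    x = layers.LayerNormalization()(x + ff)           # residual + norm
    x = layers.GlobalAveragePooling1D()(x)
    outputs = layers.Dense(n_factors)(x)              # next-step factor scores
    return tf.keras.Model(inputs, outputs)

model = build_factor_transformer()
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001), loss="mse")
```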
Comparative study with two advanced baselines
Then we incorporated the factor scores derived from the maximum likelihood factor analysis into two advanced forecasting models, N-Beats and Informer, and compared their prediction results with those of our model. Tables 10, 11, 12 and 13 are the architecture tables and hyper-parameter tables of the prediction models that combine Maximum Likelihood Factor Analysis with N-Beats and Informer models.
Principal component analysis comparative experiments
To further compare our models, we then applied Principal Component Analysis to the dataset, and the results are as follows:
As shown in Table 14, the first two principal components account for more than 70% of the variance, which is generally considered a reasonable way to simplify the data since they retain most of the information. Therefore, we extracted the two main components to reduce the dimensionality of the data while preserving its primary information. The extracted principal component feature matrix was then used to construct time series forecasting models. We input these principal component feature matrices into Informer and N-Beats, respectively. Tables 15, 16, 17 and 18 present the architecture tables and hyper-parameter settings for the prediction models that integrate Principal Component Analysis with N-Beats and Informer models, respectively.
Model robustness assessment
To further validate the robustness of our model and ensure the rigor of our comparative study, we tested our approach on additional datasets.
In the predictive analysis conducted on the “Shunyi” dataset, we compared four models: Maximum Likelihood Factor Analysis combined with the Transformer model, PCA combined with the Transformer model, Maximum Likelihood Factor Analysis combined with the N-Beats model, and Maximum Likelihood Factor Analysis combined with the Informer model.
In order to compare the performance of different models at their optimal states, we employed grid search to identify new optimal hyper-parameters and retrained the models.
The following tables present the hyper-parameters of the relevant models. Tables 19, 20, 21, 22, 23, 24, 25 and 26 present the architecture and hyper-parameter tables for four different models.
Discrete wavelet transform
To further enhance prediction accuracy and feature-extraction capability, we introduced the Discrete Wavelet Transform (DWT) to decompose the Transformer model’s output into multiple scales, thereby providing richer feature inputs for subsequent prediction models16. DWT recursively decomposes signals using a set of orthogonal wavelet bases, simultaneously capturing high-frequency details and low-frequency trends in time series data, thus characterizing data variation patterns more comprehensively. In this study, we employed the Daubechies wavelet basis function to decompose the Transformer model’s output, enabling the model to extract features at multiple scale levels. Assuming the Transformer model’s output is a time series \(X\), the DWT process can be expressed as:

\(X={A}_{K}+{D}_{K}+{D}_{K-1}+\dots +{D}_{1}\)

where \({A}_{K}\) is the approximation component at level \(K\), reflecting the long-term trend of the data, and \({D}_{K},{D}_{K-1},\dots ,{D}_{1}\) are the detail components at different scales, capturing short-term fluctuations. In the computation, DWT uses filter banks and downsampling operations, mathematically expressed as:

\({a}_{j+1}[n]=\sum_{k}{a}_{j}[k]\,g[2n-k],\qquad {d}_{j+1}[n]=\sum_{k}{a}_{j}[k]\,h[2n-k]\)

where \(g\) and \(h\) denote the low-pass and high-pass decomposition filters, respectively.
Through recursive computation, we obtain feature representations at different scales. To integrate DWT with the Transformer, we first use the Transformer to extract high-dimensional feature representations of the time series data, with output denoted \(H\in {\mathbb{R}}^{N\times d}\), where \(d\) is the feature dimension. We then apply DWT to each feature channel \({H}_{i}\), transforming it into multiple low- and high-frequency scale components:

\(\text{DWT}({H}_{i})=\left\{{A}_{i,K},{D}_{i,K},{D}_{i,K-1},\dots ,{D}_{i,1}\right\},\quad i=1,\dots ,d\)

The resulting components \({A}_{i,K}\) and \({D}_{i,K},{D}_{i,K-1},\dots ,{D}_{i,1}\) serve as inputs to the subsequent CNN-BILSTM-ATTENTION model17, thereby enhancing the model’s ability to learn features at different time scales. This multi-scale feature extraction not only reduces data dimensionality and improves computational efficiency but also captures local details while preserving global trends, providing more precise inputs for air quality prediction.
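A sketch of this decomposition using the PyWavelets library follows. The db4 basis order and the three-level decomposition are assumptions of the sketch (the text specifies only a Daubechies family basis), and `h` stands for one channel of the Transformer output from the earlier sketches.

```python
import numpy as np
import pywt

# One feature channel of the Transformer output as a 1-D series (illustrative).
h = np.asarray(model.predict(X))[:, 0]

# Three-level DWT with a Daubechies wavelet: returns [A3, D3, D2, D1].
coeffs = pywt.wavedec(h, wavelet="db4", level=3)
approx, *details = coeffs

# Reconstruct each component back to the original length so the approximation
# and detail series can be stacked as aligned model features.
parts = [pywt.upcoef("a", approx, "db4", level=3, take=len(h))]
for k, d in enumerate(details):
    parts.append(pywt.upcoef("d", d, "db4", level=3 - k, take=len(h)))
features = np.stack(parts, axis=-1)  # shape: (T, 4)
```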
CNN-BILSTM-ATTENTION model
Relationship between transformer prediction and CNN-BILSTM-ATTENTION
The Transformer model’s self-attention mechanism captures long-term dependencies in time series data, effectively filtering noise, smoothing the data, and reducing overfitting risks. At the same time, combining DWT with the CNN-BILSTM-ATTENTION model downstream of the Transformer yields more accurate prediction results18. The Transformer excels at global dependency modeling, while the CNN-BILSTM-ATTENTION model is suited to predicting local features and dynamic changes. Their combination creates a synergistic effect, jointly enhancing prediction performance. The reasons for selecting these two models over more complex alternatives are discussed in the “Reasons for selecting the model architecture” section.
CNN-BILSTM-ATTENTION model architecture
We propose an innovative hybrid deep learning architecture that integrates the aforementioned wavelet transform, convolutional neural network, bidirectional long short-term memory network, and attention mechanism into a CNN-BILSTM-ATTENTION model19. The core innovation of this architecture lies in its use of wavelet transform for multi-scale feature extraction of input data, followed by further local feature extraction through convolutional neural networks, and the capture of long-term dependencies in time series data through bidirectional LSTM and attention mechanisms. We utilized a grid search combined with cross-validation to identify the optimal hyper-parameters for the model. Table 27 describes the architecture of the CNN-BILSTM-ATTENTION model.
The convolutional neural network extracts local features by sliding convolutional kernels over the input data. We use a three-layer convolutional network, with each layer’s kernel size set to 2. The convolution operation is expressed as:

\({y}_{i}=\sum_{j=0}^{k-1}{w}_{j}{x}_{i+j}+b\)

where \(x\) is the input data, \(w\) is the convolutional kernel weight, \(b\) is the bias term, and \(k\) is the kernel size. Through convolution operations, we extract local features from the input data and reduce feature dimensionality via pooling layers.
The bidirectional LSTM captures both forward and backward temporal dependencies. We employ a two-layer bidirectional LSTM, with 100 and 50 units in each layer, respectively. The gate computations are:

\({f}_{t}=\sigma \left({W}_{f}\cdot \left[{h}_{t-1},{x}_{t}\right]+{b}_{f}\right)\)
\({i}_{t}=\sigma \left({W}_{i}\cdot \left[{h}_{t-1},{x}_{t}\right]+{b}_{i}\right)\)
\({o}_{t}=\sigma \left({W}_{o}\cdot \left[{h}_{t-1},{x}_{t}\right]+{b}_{o}\right)\)
\({C}_{t}={f}_{t}\odot {C}_{t-1}+{i}_{t}\odot \text{tanh}\left({W}_{C}\cdot \left[{h}_{t-1},{x}_{t}\right]+{b}_{C}\right)\)
\({h}_{t}={o}_{t}\odot \text{tanh}\left({C}_{t}\right)\)

where \({f}_{t}\), \({i}_{t}\), and \({o}_{t}\) represent the forget gate, input gate, and output gate, respectively, \({C}_{t}\) is the cell state, and \({h}_{t}\) is the hidden state.
The attention mechanism automatically learns the importance of different time steps, expressed as:

\({e}_{t}={v}^{T}\text{tanh}\left({W}_{h}{h}_{t}+{W}_{s}{s}_{t-1}\right),\qquad {\alpha }_{t}=\frac{\text{exp}\left({e}_{t}\right)}{\sum_{k}\text{exp}\left({e}_{k}\right)}\)

where \({\alpha }_{t}\) is the attention weight, \({h}_{t}\) is the hidden state of the LSTM, \({s}_{t-1}\) is the context vector from the previous time step, and \(v\), \({W}_{h}\), and \({W}_{s}\) are learnable parameters.
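A compact Keras sketch of this stack follows: three Conv1D layers with kernel size 2, two bidirectional LSTM layers with 100 and 50 units, and an additive attention layer over time steps. The filter counts and single-output head are illustrative assumptions; the full configuration is given in Table 27.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_cnn_bilstm_attention(window: int, n_features: int) -> tf.keras.Model:
    inputs = layers.Input(shape=(window, n_features))
    x = inputs
    for filters in (64, 64, 32):                     # filter counts assumed
        x = layers.Conv1D(filters, kernel_size=2, padding="same",
                          activation="relu")(x)
    x = layers.MaxPooling1D(pool_size=2)(x)          # reduce feature dimensionality
    x = layers.Bidirectional(layers.LSTM(100, return_sequences=True))(x)
    h = layers.Bidirectional(layers.LSTM(50, return_sequences=True))(x)
    # Additive attention over time steps: score each step, softmax, weighted sum.
    scores = layers.Dense(1, activation="tanh")(h)
    weights = layers.Softmax(axis=1)(scores)
    context = layers.Lambda(
        lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([h, weights])
    outputs = layers.Dense(1)(context)               # predicted variable value
    return tf.keras.Model(inputs, outputs)
```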
Training strategy
We implemented an early stopping20 callback function, where training automatically halts if the validation loss does not show significant improvement over 7 consecutive epochs. This effectively prevents overfitting and saves training time.
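This rule maps directly onto Keras’s EarlyStopping callback. In the sketch below, the 7-epoch patience follows the text, while the epoch budget, batch size, validation split, and the `X_train`/`y_train` placeholders are assumptions.

```python
from tensorflow.keras.callbacks import EarlyStopping

# Stop when validation loss fails to improve for 7 consecutive epochs,
# restoring the weights from the best epoch seen so far.
early_stop = EarlyStopping(monitor="val_loss", patience=7,
                           restore_best_weights=True)

model.fit(X_train, y_train, validation_split=0.1,
          epochs=200, batch_size=64, callbacks=[early_stop])
```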
Results
Two factor analysis methods with transformer models results
Predictive results
The factor scores derived from the principal component method of factor analysis and the maximum likelihood method of factor analysis were respectively embedded into the Transformer model to conduct predictions on the test set. Subsequently, a comparative analysis was performed with the traditional LSTM model. Figures 4 and 5 illustrate the predictive outcomes of the two models pertaining to Factor 1 and Factor 2.
In the principal component method of factor analysis, the \({R}^{2}\) values for Factor 1 and Factor 2 are 0.8835 and 0.9233, respectively. In contrast, the \({R}^{2}\) values for Factor 1 and Factor 2 obtained from the maximum likelihood method of factor analysis are 0.8511 and 0.9532, respectively. These results indicate that the Transformer model effectively captured the fluctuation trends in the data.
Prediction residual plot
Figures 6 and 7 present the residual plots for the predictions of Factor 1 and Factor 2.
The predictions for Factor 1 performed well in regions with significant data fluctuations, particularly in capturing peaks and troughs accurately. The residuals were mainly concentrated within [− 1, 1], indicating small prediction errors. The predictions for Factor 2 were stronger still, precisely capturing the sharp fluctuations in the data. Compared with the traditional LSTM model, the Transformer exhibited smaller deviations at certain peaks and troughs, further validating its advantage in modeling complex time series data.
Specifically, for the maximum likelihood factor analysis Transformer, the mean squared error for factor 1 was 0.1619, with an \({R}^{2}\) of 0.8520; for factor 2, the MSE was 0.0476, the \({R}^{2}\) was 0.9563, and the 95% confidence interval was (0.0433, 0.0918). The principal component analysis factor analysis Transformer model performed slightly less well, with an MSE of 0.1668 and an \({R}^{2}\) of 0.8447 for factor 1, and a 95% confidence interval of (− 0.0055, 0.0389); for factor 2, the MSE was 0.0905, the \({R}^{2}\) was 0.9256, and the 95% confidence interval was (0.0825, 0.1319).
Histogram of residual distribution
Next, we plotted the residual distributions of the factor analysis model based on the maximum likelihood method, as shown in Fig. 8.

As Fig. 8 shows, the residuals for both Factor 1 and Factor 2 are approximately normally distributed, indicating random and unbiased model errors. For Factor 1, residuals cluster near 0 in a symmetric distribution with frequencies gradually decreasing on both sides, showing no obvious skewness or outliers. This means the model predicts Factor 1 with high accuracy and stability, with small, evenly distributed errors and no systematic bias. Similarly, Factor 2 residuals are centered around 0 with good symmetry.
In summary, the maximum likelihood factor analysis method combined with the Transformer model demonstrates superior performance in predicting Factor 2, with a higher \({R}^{2}\), lower MSE, and narrower confidence interval. For Factor 1, the differences between the two methods are relatively minor, but the maximum likelihood method still holds a slight edge in terms of MSE and \({R}^{2}\). Therefore, considering the overall performance, the maximum likelihood factor analysis method combined with the Transformer model exhibits greater capability in capturing the fluctuation trends in the data and is thus the preferred choice.
We have conducted an analysis of this result. In terms of data characteristics, the maximum likelihood method, which takes into account the probability distribution of the data, may exhibit greater robustness against noise or outliers in the data. Regarding model complexity, the maximum likelihood method typically generates more complex models. In contrast, the principal component method usually produces simpler models, which in some cases may be less precise than those generated by the maximum likelihood method.
Prediction accuracy in random time intervals
We further evaluated the predictive performance of the Transformer model through visualization analysis. Specifically, we divided the data into groups of ten hours and plotted two-dimensional graphs illustrating the relationship between the actual and predicted values, as shown in Fig. 9.
Figure 9 displays the model’s prediction results across different time periods, where blue points represent the actual values and red points indicate the predicted values. To enhance representativeness, we implemented a random function to randomly select six time periods from each factor for presentation. The results demonstrate that the model’s prediction errors are relatively small within these selected time periods, with the lowest error rate being 2.93% and the highest error rate being 5.08%. This indicates that the model achieves high prediction accuracy during these periods. Through this visualization method, we are able to gain a more comprehensive understanding of the model’s predictive capability and error distribution.
Based on these evaluation metrics, we can conclude that the proposed method exhibits good generalization ability and robustness when dealing with complex datasets. These results suggest that the constructed Transformer model is capable of stably capturing patterns in the data during cross-validation and achieving efficient predictions based on factor analysis.
Comparison with two baseline predictions
The prediction results of these models are evaluated and illustrated in Figs. 10 and 11.
For the N-Beats model (Fig. 10), the MSE for Factor 1 is 0.1912, the RMSE is 0.4373, and the coefficient of determination \({R}^{2}\) reaches 0.8220, with a 95% confidence interval of (− 0.0746, − 0.0382). These metrics indicate that although the model achieves a relatively good fit for Factor 1, the negative values in the 95% confidence interval suggest instability in the predictions. For Factor 2, the MSE is 0.1616, the RMSE is 0.4020, \({R}^{2}\) is 0.8671, and the 95% confidence interval is (− 0.0248, 0.0314); the higher \({R}^{2}\) value indicates a better fit for this factor. However, despite these metrics demonstrating the N-Beats model’s capability in predicting Factors 1 and 2, its predictive performance falls short of the Maximum Likelihood Factor Analysis combined with the Transformer model.
Figure 11 presents the predictive outcomes of the Informer model. For Factor 1, the Transformer model performed slightly better than the Informer model: the Transformer achieved an MSE of 0.1619 and an \({R}^{2}\) of 0.8520, with a 95% confidence interval of (− 0.0124, 0.0305), indicating relatively stable performance, whereas the Informer model had an MSE of 0.1694, an \({R}^{2}\) of 0.8423, and a 95% confidence interval of (− 0.0017, 0.0420). These results suggest that although the two models performed similarly, the Transformer model had a slight edge in predictive accuracy and stability.
When predicting Factor 2, the Transformer model significantly outperformed the Informer model. The Transformer model recorded an MSE of 0.0476 and \({R}^{2}\) of 0.9563, with a 95% confidence interval of (0.0433, 0.0918), whereas the Informer model had an MSE of 0.0688, \({R}^{2}\) of 0.9434, and a 95% confidence interval of (0.1641, 0.2118).
Our visualization analysis of the prediction plots further supported the statistical results. For Factor 1, both models showed a reasonable alignment between predicted and actual values, with the Transformer model’s predictions clustering more closely around the actual values. For Factor 2, the Transformer model’s predictions were even closer to the actual values, especially in areas where data fluctuations were more pronounced.
The similarity in results between these two models may stem from their core mechanisms, such as the use of self-attention mechanisms to process sequential data. This similarity allows them to achieve comparable predictive outcomes in certain tasks.
Principal component analysis comparative experiments results
The predictive results are presented as follows:
Figures 12 and 13 present the predictive outcomes of PCA combined with the N-Beats and Informer models.
For PC1, the N-Beats model shows a significant improvement in performance, with MSE of 0.4717, RMSE of 0.6868, and \({R}^{2}\) of 0.8946. The 95% confidence interval is (− 0.0949, − 0.0054). The 95% confidence interval is narrower than in previous results, suggesting an increase in the stability of the predictions. For PC2, the N-Beats model also demonstrates better performance, with MSE of 0.6030, RMSE of 0.7765, \({R}^{2 }\) of 0.7668, and a 95% confidence interval of (0.1616, 0.2315), showing reasonable reliability in the predictions.
For PC1, the Informer model performs worse than the N-Beats model, with an MSE of 0.6291, an RMSE of 0.8594, and an \({R}^{2}\) of 0.8594. The 95% confidence interval is (− 0.2884, − 0.2061), indicating relative stability, but there is still room for improvement. For PC2, the Informer model’s performance is less satisfactory, with an MSE of 0.7215, an RMSE of 0.7209, and an \({R}^{2}\) of 0.7209, although the 95% confidence interval of (0.2159, 0.2731) suggests some level of reliability.
Visualizations of the PC1 and PC2 predictions show that the N-Beats model has reduced the discrepancies between predicted and actual values, demonstrating a better ability to capture data fluctuations. The Informer model, while slightly better in some aspects, still does not cluster predictions closely around the true values, indicating limited predictive power.
The results indicate that both the N-Beats and Informer models have room for improvement when dealing with PCA-derived data. However, it is worth noting that PCA, which focuses on maximizing data variance, may not always provide clear interpretative meaning. Factor analysis aims to uncover latent factors behind observed variables, which are usually more interpretable. Using these interpretable factor scores for prediction is more effective than PCA combined with a Transformer model in predictive tasks. Although PCA can be a useful tool, it is not as effective as factor analysis combined with a Transformer model.
Model robustness test results
Figure 14 shows the prediction results for the “Shunyi” dataset. Table 28 compares the performance of the four models on this dataset, where MLFA denotes Maximum Likelihood Factor Analysis and PCFA denotes Principal Component Factor Analysis.
Based on our comparative analysis of various models on the “Shunyi” dataset, it has been demonstrated that the Maximum Likelihood Factor Analysis combined with the Transformer model and the Maximum Likelihood Factor Analysis combined with the Informer model both exhibit significant predictive accuracy and robustness, significantly outperforming the other two models. When predicting Factor 1, the Maximum Likelihood Factor Analysis with the Transformer model achieved MSE of 0.1652 and \({R}^{2}\) value of 0.8231, with a 95% confidence interval of (− 0.0115, 0.0348), indicating low prediction error and high stability. For Factor 2, the model showed even more exceptional performance, with MSE of 0.0441, \({R}^{2}\) of 0.9597, and a 95% confidence interval of (0.0003, 0.0464), further confirming that the model’s predictions for Factor 2 were not only more accurate but also highly stable and reliable.
The Maximum Likelihood Factor Analysis combined with the Informer model also showed high reliability, especially in predicting Factor 2, with \({R}^{2}\) value of 0.9433, MSE of 0.2600, RMSE of 0.5099, and a 95% confidence interval of (0.0806, 0.1149). However, when predicting Factor 1, the Maximum Likelihood Factor Analysis with the Informer model had \({R}^{2}\) value of 0.8408, MSE of 0.2867, RMSE of 0.5354, and a 95% confidence interval of (− 0.1679, − 0.0275), indicating slightly lower performance compared to the Maximum Likelihood Factor Analysis with the Transformer model for this factor.
The Maximum Likelihood Factor Analysis combined with the N-Beats model performed well in predicting Factor 2, achieving \({R}^{2}\) value of 0.8932, MSE of 0.1616, RMSE of 0.4021, and a 95% confidence interval of (− 0.0247, 0.0313). However, its performance in predicting Factor 1 was slightly inferior, with \({R}^{2}\) value of 0.7852, MSE of 0.1912, RMSE of 0.4373, and a 95% confidence interval of (− 0.0746, − 0.0381).
On the other hand, the Principal Component Factor Analysis combined with the Transformer model showed better predictive performance for PC 2, with \({R}^{2}\) value of 0.9474, MSE of 0.2579, RMSE of 0.5079, and a 95% confidence interval of (− 0.1256, − 0.0563), but its performance in predicting PC 1 was not as strong as the other models, with \({R}^{2}\) value of 0.8198, MSE of 0.6977, RMSE of 0.8353, and a 95% confidence interval of (− 0.0618, 0.0279).
The Maximum Likelihood Factor Analysis combined with the Transformer model performs exceptionally well in both Factor 1 and Factor 2, closely competing with the Maximum Likelihood Factor Analysis combined with the Informer model. Although it slightly lags behind the Informer model in terms of \({R}^{2}\) value, it surpasses in convergence speed and MSE, and significantly outperforms the Principal Component Factor Analysis with Transformer and Maximum Likelihood Factor Analysis with N-Beats models.
In summary, the combination model of Maximum Likelihood Factor Analysis and the Transformer model consistently excels on other datasets, maintaining low prediction errors and high stability. It is still a highly competitive choice for the time-series prediction tasks of the two factors assessed in this study.
CNN-BILSTM-ATTENTION model results
Predictive results
We plotted the final prediction results and residual graphs for the six original variables explained by the two factors to visually demonstrate the model’s predictive performance. Figure 15 shows the prediction results of the CNN-BILSTM-ATTENTION model for the original variables:
The prediction plots illustrate the comparison between predicted and actual values. In most cases, the predicted values closely align with the actual values, indicating the model’s strong predictive performance. The \({R}^{2}\) values are all close to 1, further demonstrating the model’s high accuracy.
Residual distribution of original variable predictions
Figure 16 presents the residual plots for the predictions of the original variables:
The residual plots display the differences between predicted and actual values. The vast majority of residuals are concentrated near zero, suggesting that the model’s prediction errors are minimal.
Prediction accuracy in random time intervals
We then plotted comparison charts of the actual and predicted values over time for these six original variables. As before, we grouped the data into 10-h periods and used a random function to select one plot per variable for display; Fig. 17 shows the result:
Error rate results
We also generated an error-rate table for randomly selected intervals, shown in Table 29, which further substantiates the accuracy and stability of our model.
Performance evaluation
We employed the MSE and RMSE metrics to evaluate the model’s performance. On the test set, the model achieved an MSE of 0.06184 and an RMSE of 0.2486, indicating high predictive precision.
Reasons for selecting the model architecture
In this study, we have chosen to employ both the Transformer model and the CNN-BILSTM-ATTENTION model, rather than opting for more complex deep learning models. This decision was made after careful consideration. The focus of our research lies in the integration of factor analysis with forecasting, tightly combining statistical methods with deep learning to explore the concept of dimensionality reduction in predictions, rather than merely pursuing the complexity of the model. The versatility of the model is an important factor in our considerations. We aimed to select a model that performs well across a variety of tasks, thereby reducing the complexity of parameter tuning across different datasets and tasks. The Transformer model has demonstrated robustness in handling sequential data and has shown strong performance in a range of forecasting tasks, aligning well with our research objectives.
Summary
In this study, we successfully integrated factor analysis with deep learning to propose an innovative framework for air quality prediction. By employing factor analysis for dimensionality reduction and feature extraction, we enhanced the model’s interpretability and prediction accuracy. Experimental results demonstrate that this framework excels in capturing long-term dependencies in time series and improving prediction stability. Future research directions include optimizing factor extraction methods and exploring external data sources to enhance the comprehensiveness of predictions. Additionally, we plan to investigate more efficient deep learning architectures to further optimize time-series modeling capabilities and enhance the model’s practical applicability. These efforts aim to provide more reliable solutions for air quality prediction and related fields. By combining factor analysis with deep learning models, our research offers new perspectives and methods for predicting complex environmental data.
Computational resources for model training
The model was trained on a computer equipped with an NVIDIA V100 GPU, which has 16 GB of VRAM, and the system memory is 32 GB. After data preprocessing and hyperparameter tuning, the model completed training in 2–3 h.
Data source
The data utilized in this research were sourced from the Beijing Multi-Site Air-Quality Data Set available on Kaggle (https://www.kaggle.com/datasets/sid321axn/beijing-multisite-airquality-data-set?select=PRSA_Data_Tiantan_20130301-20170228.csv).
Data availability
The data used in this study are sourced from the "Beijing Multi-Site Air-Quality Data Set," which is available on Kaggle. This dataset was originally published by Zhang et al. (2017) in the Proceedings of the Royal Society A. The download URL for the dataset is: https://www.kaggle.com/datasets/sid321axn/beijing-multisite-airquality-data-set?select=PRSA_Data_Tiantan_20130301-20170228.csv.
References
Saeed, F. & Aldera, S. Adaptive renewable energy forecasting utilizing a data driven PCA transformer architecture. IEEE Access 12, 109269–109280 (2024).
Wan, A., Chang, Q., Al-Bukhaiti, K. & He, J. Short-term power load forecasting for combined heat and power using CNN-LSTM enhanced by attention mechanism. Energy 2852, 128274 (2022).
Zhao, X., Huang, P. & Shu, X. Wavelet-attention CNN for image classification. Multimed. Syst. 28, 915–924 (2022).
Gong, Y., Zhang, Y., Wang, F., & Lee, C. H. Deep learning for weather forecasting: A CNN-LSTM hybrid model for predicting historical temperature data. arXiv:2410.14963 (2024).
Shen, J., Wu, W., & Xu, Q. Accurate prediction of temperature indicators in eastern china using a multi-scale CNN-LSTM-ATTENTION model. arXiv:2412.07997 (2024).
Bekkar, A., Hssina, B., Douzi, S. & Douzi, K. Air-pollution prediction in smart city, deep learning approach. J. Big Data 8, 1–21 (2021).
Zhang, J., Yin, M., Wang, P. & Gao, Z. A method based on deep learning for severe convective weather forecast: CNN-BILSTM-AM (version 1.0). Atmosphere 15(10), 1229 (2024).
Alijoyo, F. A. et al. Advanced hybrid CNN-BILSTM model augmented with GA and FFO for enhanced cyclone intensity forecasting. Alex. Eng. J. 92, 346–357 (2024).
Coutinho, E. R. et al. Multi-step forecasting of meteorological time series using CNN-LSTM with decomposition methods. Water Resour. Manage. 39, 3173–3198 (2025).
Bai, X., Zhang, L., Feng, Y., Yan, H. & Mi, Q. Multivariate temperature prediction model based on CNN-BILSTM and random forest. J. Supercomput. 81(1), 162 (2025).
Kumar, S. & Kumar, V. Multi-view stacked CNN-BILSTM (MvS CNN-BILSTM) for urban PM2.5 concentration prediction of India’s polluted cities. J. Clean. Prod. 444, 141259 (2024).
Park, I., Ho, C. H., Kim, J., Kim, J. H. & Jun, S. Y. Development of a monthly PM2.5 forecast model for Seoul, Korea, based on the dynamic climate forecast and a convolutional neural network algorithm. Atmos. Res. 309, 107576 (2024).
Yin, L. & Sun, Y. BILSTM-InceptionV3-Transformer-fully-connected model for short-term wind power forecasting. Energy Convers. Manage. 321, 119094 (2024).
Lv, J. et al. Enhancing effluent quality prediction in wastewater treatment plants through the integration of factor analysis and machine learning. Biores. Technol. 393, 130008 (2024).
Peterson, R. A. A meta-analysis of variance accounted for and factor loadings in exploratory factor analysis. Mark. Lett. 11, 261–275 (2000).
Hou, Y., Zheng, F. & Shao, Y. The multi-timescale climate change and its impact on runoff based on cross-wavelet transformation. J. Water Resour. Res. 5, 564–571 (2016).
Luan, Y., & Lin, S. Research on text classification based on CNN and LSTM. In 2019 IEEE international conference on artificial intelligence and computer applications (ICAICA) 352–355 (IEEE, 2019).
Siami-Namini, S., Tavakoli, N., & Namin, A. S. The performance of LSTM and BILSTM in forecasting time series. In 2019 IEEE International Conference on Big Data (Big Data) 3285–3292 (IEEE, 2019).
Yang, Y. et al. A study on water quality prediction by a hybrid CNN–LSTM model with attention mechanism. Environ. Sci. Pollut. Res. 28(39), 55129–55139 (2021).
Anam, M. K., Defit, S., Haviluddin, H., Efrizoni, L. & Firdaus, M. B. Early stop on CNN–LSTM development to improve classification performance. J. Appl. Data Sci. 5(3), 1175–1188 (2024).
Author information
Contributions
L. was the principal contributor to the study. L. conceived and implemented the innovative idea of integrating factor analysis with the Transformer model and the CNN-BILSTM-ATTENTION architecture, including the integration of discrete wavelet transform with the Transformer model. H. made significant contributions to data collection and provided valuable suggestions for model improvement. He also played an important role in the writing of the paper, particularly in the factor analysis section, where H.'s contributions were outstanding. Both authors reviewed and approved the final version of the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Liu, S., Hu, Y. Air quality prediction based on factor analysis combined with Transformer and CNN-BILSTM-ATTENTION models. Sci Rep 15, 20014 (2025). https://doi.org/10.1038/s41598-025-03780-4