Introduction

Air quality is closely related to people’s lives. Severe air pollution significantly impacts the economy, society, and human health1,2,3. When the human body is exposed to high concentrations of air pollutants for extended periods, it can lead to a range of health issues, including lung diseases, cardiovascular diseases, and respiratory diseases4,5,6,7,8,9. The Air Quality Index (AQI), serving as a quantitative descriptor of air quality, can effectively depict the present level of air pollution or cleanliness and its health implications, as per the Chinese standard 'Technical Regulation on Ambient Air Quality Index (on trial) (HJ633-2012)'. Hence, precise prediction of AQI enables early warning and forecasting of air pollution, aiding governmental bodies in formulating proactive air pollution prevention measures.

In recent years, various models have been employed for air quality prediction, categorized into mechanism models and non-mechanism models based on their establishment methods. Mechanism models are numerical prediction models that account for numerous physical and chemical processes, encompassing the movement and transformation of atmospheric pollutants and photochemical reactions. These models often involve intricate equations and are optimized through meteorological factors, emission inventories, and other data10. Common mechanism models include the AMS/EPA Regulatory Model (AERMOD), the Community Multiscale Air Quality Modelling System (CMAQ), the Comprehensive Air Quality Model with Extensions (CAMx) and the Weather Research and Forecasting model coupled with Chemistry (WRF-Chem)11. Yang et al. utilized the WRF-CMAQ model to forecast the Air Quality Index (AQI) and its corresponding levels in Changzhou. However, the predicted AQI values tended to be lower than the actual values. The accuracy of AQI level prediction varied across different seasons, reaching its highest level at 54.90%12. Sengupta et al. developed a high-resolution air quality early warning system based on the WRF-Chem model to forecast particulate matter concentration (PM10 and PM2.5) and Air Quality Index (AQI) in New Delhi, India. This system integrated meteorological parameters including air temperature, relative humidity, wind speed, wind direction, and short-wave radiation. However, the prediction performance for PM10 and PM2.5 concentrations, as well as AQI, exhibited instability, with root mean squared errors (RMSE) generally exceeding 60.00. Additionally, the PM10 concentration was consistently overestimated13. The numerical prediction model based on the mechanism of pollution processes heavily relies on both emission inventories of pollution sources and meteorological field data. 
The emission list of pollution sources is derived from the Technical Guidelines for the Preparation of Air Pollutant Emission Lists of Various Industries. Annual pollutant emissions are calculated using the emission coefficient method, typically based on data from the preceding 1–2 years14,15. In reality, daily emissions of industrial waste gas exhibit significant fluctuations, rendering the emission inventory of pollution sources unable to reflect real-time changes in pollutant emissions. Moreover, the determination process of physical and chemical parameters is complex, and the absence of established scientific index thresholds further compounds the issue14. These factors contribute to significant uncertainty in the predictive efficacy of the model. Consequently, the applicability of the numerical prediction model is limited.

Non-mechanism models mainly include statistical models and machine learning models16,17, such as Random Forests (RF), Extreme Learning Machine (ELM), Support Vector Regression (SVR), and Artificial Neural Network (ANN). These models can mine and analyze the characteristics of various factors affecting AQI. Therefore, non-mechanism models can predict the trend of pollutants over a period. Compared with numerical prediction models, non-mechanism models do not require extensive data, such as meteorological data and pollutant discharge inventories. The data dimensions needed for modelling and forecasting are relatively small, typically including historical air quality data and meteorological data. Liu and Zhang used ELM to predict the AQI in Beijing, Tianjin, and Shijiazhuang, and the results showed that the RMSE for each model was above 37.0018. Zhu et al. predicted the AQI in Xingtai, China using a hybrid model that combined the LS-SVR and seasonal ARIMA algorithm. The RMSE was 24.4619. Xu et al. combined an intelligent optimization search algorithm with SVR for AQI prediction in Taiyuan, achieving a MAPE (Mean Absolute Percentage Error) of 37.28%20. Qin et al. utilized the improved Grasshopper Optimization algorithm (IGOA) to optimize the parameters of the BP neural network, establishing an air quality prediction model for Taiyuan City21. The experimental results showed that the MAPE of the optimized BP neural network model could achieve 31.91%. The methods discussed in the literature above confirm that machine learning models can indeed be applied to AQI prediction. However, all these methods overlook the time-series nature of AQI. Different algorithms are suitable for different data characteristics, mainly due to the principles they use to process data. Non-time-series algorithms, such as RF and SVR, usually assume that data points are independently and identically distributed, and do not consider temporal dependencies in the data. 
Therefore, these algorithms are not suitable for handling time series data with significant temporal correlation22.

The AQI time series exhibits instability and nonlinear complexity19,20,21,23,24. The time series characteristics of AQI indicate that past air quality impacts future air quality, showing a certain trend and periodicity. Therefore, future AQI can be predicted through statistical analysis of historical AQI data. Additionally, AQI is influenced by pollution-related meteorological conditions and pollution sources, leading to sudden changes. Thus, it is necessary to combine meteorological characteristics and the dynamic changes of pollution sources to predict air quality. Currently, commonly used air quality time series prediction models include ARIMA (Autoregressive Integrated Moving Average Model), RNN (Recurrent Neural Network), LSTM (Long Short-Term Memory Network), and CNN (Convolutional Neural Network), among others. Zhang et al. used a novel spatiotemporal algorithm, the bidirectional gated recurrent unit integrated with an attention mechanism (BiGRU), to predict AQI for ten cities in the Huaihai Economic Zone. The results showed that the prediction accuracy of the proposed model outperformed traditional machine learning methods, with an RMSE of 31.1025. Chhikara et al. established an air quality prediction model of Delhi, India, based on CNN-LSTM, and the results showed that the RMSE was 221.6826. Sethi and Mittal used the ARIMA model to predict air quality in Gurugram, India. Considering the concentrations of O3, CO, SO2, PM2.5, and NO2 pollutants along with meteorological parameters, their prediction accuracy had an RMSE of 66.8027. The above studies have considered the historical features of AQI when predicting air quality. However, these methods have certain limitations. For example, the ARIMA model is a univariate time series prediction model that requires the time series to be stable and linear. ARIMA typically cannot capture the nonlinear relationship within the data and does not align with the nonlinear and irregular characteristics of air pollutants. 
Therefore, the prediction accuracy of the ARIMA model is significantly limited by its linear mapping capability28. In the application of RNN in long time series, there are issues of gradient vanishing and explosion. Although LSTM, which is an improvement over RNN, mitigates these problems, it still cannot completely solve them in long time series29,30.

In the above studies, Zhu et al., Xu et al. and Qin et al. considered the influence of air pollutant concentrations19,20,21. Meanwhile, Zhang et al.25 and Shishegaran et al.28 considered both meteorological factors and pollutant concentrations. Additionally, Liu and Zhang18, Li et al.23 and Chhikara et al.26 incorporated the influence of historical AQI series. Hu et al. developed a hybrid prediction model for air quality at sparse monitoring stations by leveraging spatio-temporal features extracted from both the target station and its surrounding stations’ air quality and meteorological data30. Sarkar et al. established a hybrid model for air quality prediction, aiming to enhance prediction accuracy through diverse feature selection and classification techniques31. Regional air quality is comprehensively affected by meteorological factors, pollutant emissions, past air quality and the transmission of pollutants from surrounding areas. Achieving high accuracy and effectively combating air pollution solely by considering a single factor in AQI prediction poses a challenge. A multi-factor forecasting method is expected to yield better prediction results. However, meteorological data and historical air quality data exhibit distinct characteristics. For instance, meteorological factors entail strong uncertainty, while historical air quality data is characterized by its time-series nature. Consequently, combining meteorological factors and historical air quality data as input variables in the same model may inevitably affect prediction accuracy. This is because different machine learning models typically perform more effectively when handling different types of data features. Mixing data from different features in a single model may hinder the simultaneous capture of each feature’s influence32,33. Hence, meteorological factors and historical air quality are modelled separately to thoroughly explore their respective impacts, and then integrated in subsequent steps. 
This approach accommodates the distinct data features of each factor and leverages the strengths of each sub-model in handling different feature groups30.

In an earlier study by our research group34, an air quality meteorological correlation model was developed using the RF algorithm. The model utilized daily industrial exhaust emissions and meteorological factors as input variables. By leveraging meteorological conditions, the model achieved the objective of dynamically adjusting the key operations of polluting enterprises and mitigating the conflict between environmental preservation and economic interests34,35. Building on the meteorological correlation model, this study aims to separately investigate the impact of past air quality. The GHMM model, a classical machine learning sequential prediction model, is employed to establish the air quality historical correlation model. GHMM has the advantages of simple structure, few parameters and strong generalization ability. A further explanation regarding the transferability and generalizability of GHMM is provided in the Supplementary. The number of samples required by GHMM is much smaller than that of the neural network model, and GHMM can capture the time series characteristics of AQI series in the modelling process. The air quality historical correlation model is then integrated with the air quality meteorological correlation model to improve prediction accuracy (the structure of this study is displayed in Fig. 1). Additionally, the transmission of pollutants from surrounding areas exhibits spatial correlation and dynamic changes. This characteristic introduces significant challenges and complexities in modelling due to the variability in pollutant dispersion influenced by various factors such as weather conditions in surrounding regions and geographic features. These aspects will be further explored in subsequent research.

Fig. 1

The structure of this study.

Data and research methodologies

Study area and data source

Zhangdian District (located in Zibo City, Shandong Province, between 36°04′30″–36°54′00″ N and 117°55′40″–118°12′20″ E) is one of the most important industrial bases in Shandong Province. In 2019, there were 133 industrial enterprises above designated size in the district, including 108 heavy industries. In recent years, the air quality of Zibo City has consistently ranked last in Shandong Province, with the number of days with heavy pollution consistently higher than the provincial average36.

The datasets of this study include daily AQI, daily concentrations of six major pollutants (PM10, PM2.5, SO2, NO2, CO, O3), and daily industrial exhaust emissions (including NOx emissions, SO2 emissions, TSP emissions, and total waste gas emissions) recorded in Zhangdian District from 1/1/2017 to 31/12/2019. The daily emissions of industrial exhaust are obtained from Zhangdian District Bureau of Ecology and Environment. The daily AQI and concentrations of the six major air pollutants are collected from three national air monitoring stations in Zhangdian District: People’s New District, Dongfeng Chemical Plant, and New District. Data from these stations are sourced from https://www.aqistudy.cn/. The location of each monitoring station is shown in Fig. 2. The meteorological data are sourced from https://wheata.cn.

Fig. 2

Map of study area. (a) Study area geographical position. (b) The location of monitoring stations in Zhangdian District. The map was generated with ArcGIS10.2 (https://www.esri.com/en-us/arcgis/products/develop-with-arcgis/overview).

AQI calculation equation

According to the 'Technical Regulation on Ambient Air Quality Index (on trial) (HJ 633-2012)', the Individual Air Quality Index (IAQI) is calculated using Eq. (1), and the AQI is calculated using Eq. (2).

$${\text{IAQI}}_{\text{P}}=\frac{{\text{IAQI}}_{\text{Hi}}-{\text{IAQI}}_{\text{Lo}}}{{\text{BP}}_{\text{Hi}}-{\text{BP}}_{\text{Lo}}}\left({\text{C}}_{\text{P}}-{\text{BP}}_{\text{Lo}}\right)+{\text{IAQI}}_{\text{Lo}}$$
(1)

In the above equation: IAQIP is the individual air quality index of pollutant P; CP is the concentration of pollutant P; BPHi and BPLo are the high and low values of the pollutant concentration limits bracketing CP; IAQIHi and IAQILo are the IAQI values corresponding to BPHi and BPLo, respectively.

$$\text{AQI}=\text{max}\{{\text{IAQI}}_{1},{\text{IAQI}}_{2},{\text{IAQI}}_{3},\dots ,{\text{IAQI}}_{\text{n}}\}$$
(2)

In the above equation: n is the number of pollutants considered, where n = 6, corresponding to PM10, PM2.5, SO2, NO2, CO, and O3.
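Eqs. (1) and (2) can be illustrated with a minimal Python sketch. The breakpoint table below is an assumption included only for illustration: it lists the 24-hour PM2.5 concentration limits as we understand them from HJ 633-2012 and should be checked against the standard before use. The function and variable names are hypothetical.

```python
from bisect import bisect_left

# Assumed 24-hour PM2.5 concentration limits (ug/m3) and the matching IAQI
# levels; verify against HJ 633-2012 before relying on them.
PM25_BP = [0, 35, 75, 115, 150, 250, 350, 500]
IAQI_LEVELS = [0, 50, 100, 150, 200, 300, 400, 500]


def iaqi(c, bp=PM25_BP, levels=IAQI_LEVELS):
    """Eq. (1): linear interpolation between the limits bracketing c."""
    if c < bp[0] or c > bp[-1]:
        raise ValueError("concentration outside the breakpoint table")
    hi = max(1, bisect_left(bp, c))  # index of BP_Hi
    lo = hi - 1                      # index of BP_Lo
    return (levels[hi] - levels[lo]) * (c - bp[lo]) / (bp[hi] - bp[lo]) + levels[lo]


def aqi(iaqi_by_pollutant):
    """Eq. (2): the AQI is the largest individual index."""
    return max(iaqi_by_pollutant.values())
```

In practice each of the six pollutants would need its own breakpoint table; only one is shown here to keep the sketch short.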

Improved GHMM algorithm

Hidden Markov Model (HMM) is widely used in speech recognition, fault diagnosis, biological information and financial markets37. HMM is a probabilistic model consisting of dual stochastic processes. The state sequence of the model is unobservable, referred to as the hidden state sequence38,39. Transitions between two hidden states in the sequence are random. Each hidden state randomly generates observation data according to the observation probability distribution function \({b}_{{S}_{i}}({O}_{t})\). The observation sequences consist of a series of observation data. However, the observation probability function of HMM is a discrete probability distribution function, while AQI series are continuous variables. Therefore, GHMM with a continuous observation probability function is suitable for constructing the air quality history correlation model. In this study, the observation sequences serve as the input variables of the model, which can be categorized into three types: AQI, IAQI and exhaust emissions. The structure of air quality historical correlation model based on GHMM is shown in Fig. 3, where \({a}_{ij}\) represents the transition probability between state \({S}_{i}\) and state \({S}_{j}\), and \({b}_{{S}_{i}}({O}_{t})\) is the probability of generating observation \({O}_{t}\) from state \({S}_{i}\) at time t.
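To make the roles of \({a}_{ij}\) and \({b}_{{S}_{i}}({O}_{t})\) concrete, the following sketch evaluates the likelihood of a short observation sequence with the forward recursion. It assumes a scalar Gaussian density per hidden state, which is one common choice of continuous observation function and not necessarily the exact form used in this study; all names and parameter values are hypothetical.

```python
import math


def gaussian_pdf(x, mean, std):
    """Continuous observation density b_Si(O_t) for one hidden state."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def forward_likelihood(obs, start, trans, means, stds):
    """Likelihood of obs under the model, computed by the forward recursion.

    start[i]    : initial probability of state S_i
    trans[i][j] : transition probability a_ij from S_i to S_j
    means, stds : Gaussian parameters of each state's observation density
    """
    n = len(start)
    alpha = [start[i] * gaussian_pdf(obs[0], means[i], stds[i]) for i in range(n)]
    for o in obs[1:]:
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(n))
            * gaussian_pdf(o, means[j], stds[j])
            for j in range(n)
        ]
    return sum(alpha)
```

For long sequences the recursion would normally be carried out with scaling or in log space to avoid numerical underflow; that step is omitted here for brevity.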

Fig. 3

The structure of air quality history correlation model based on industrial exhaust emissions.

Select the optimal number of hidden states. In the Hidden Markov model, the complexity and running time of the model are influenced by the number of hidden states. The complexity of the GHMM increases with the number of hidden states40. However, if the number of hidden states is too small, the model’s accuracy will decrease, and the desired prediction effect cannot be achieved. Therefore, selecting an appropriate number of hidden states improves the model’s accuracy and reduces the running time41. The learning algorithm of the traditional GHMM model is typically based on a known number of hidden states42, which might not be suitable for substantive research. For example, Lolea and Stamule directly specified the maximum number of hidden states as 4 based on economic justification43. However, determining the actual meaning of hidden states in the context of air quality prediction can be challenging. Various methods, such as the Akaike information criterion (AIC), the Bayesian information criterion (BIC), and Odd–Even-Half-Sampling (OEHS), are used to select the number of hidden states38. Nevertheless, the AIC tends to underestimate the complexity of a model, the BIC tends to overestimate the complexity of the models, and the OEHS criterion divides the original series into odd-position sequences and even-position sequences, which may not be entirely suitable for GHMM based on continuous-time.

A traversal method is proposed to determine the number of hidden states in this paper. Firstly, the interval within which the number of hidden states will be traversed is specified. Then, within this interval, each number of hidden states is used to establish the corresponding GHMM under the same input data. Finally, the prediction results of GHMMs with different numbers of hidden states are compared, and the number of hidden states that corresponds to the highest prediction accuracy is selected as the optimal number of hidden states. Hidden states can be assigned a range of values based on their actual meaning in specific applications, or they can be independent of their physical meaning. They can be viewed as an abstract representation within the model, whose specific interpretation may not directly correspond to a physical concept in the real problem, thus avoiding the limitations of subjective knowledge. This abstract nature of hidden states can aid the model in better learning patterns and regularities within the data, free from the constraints of prior knowledge42,43. The study determines the range of hidden state numbers (2–7) based on specific physical meanings. For example, two states might represent increasing and decreasing trends, while seven states could denote significant increase, moderate increase, slight increase, no change, slight decrease, moderate decrease, and significant decrease. This approach ensures that the model captures sufficient information without becoming overly complex and prone to overfitting.
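The traversal can be sketched as a simple search loop. The callable `fit_and_score` is a hypothetical placeholder: it is assumed to train a GHMM with the given number of hidden states on the same input data and return a prediction error (for example, RMSE on held-out days), so that a lower value means higher accuracy.

```python
def select_hidden_states(candidates, fit_and_score):
    """Return the candidate state count with the lowest prediction error.

    candidates    : iterable of hidden-state counts, e.g. range(2, 8)
    fit_and_score : hypothetical callable; trains a model with n states on
                    fixed data and returns its prediction error
    """
    best_n, best_err = None, float("inf")
    for n in candidates:
        err = fit_and_score(n)
        if err < best_err:
            best_n, best_err = n, err
    return best_n, best_err
```

With the interval used in this study, the call would look like `select_hidden_states(range(2, 8), fit_and_score)`.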

The Multi-day weighted matching method. The traditional GHMM parameters are trained using the Baum-Welch algorithm and the forward-backward algorithm. The optimal hidden state sequence is then determined through the Viterbi algorithm. Using the optimal hidden state sequence and the state transition probability matrix, the most probable state for the next day is identified. Subsequently, the AQI for the following day is computed based on the observation probability density function. However, this forecasting method may yield a fixed trend (monotonically increasing, monotonically decreasing, or no change) in the forecasted values for the next few days, resulting in a notable disparity between the predicted and actual values.

Considering the limitations of traditional GHMM in forecasting, the Multi-day Weighted Matching method is proposed as an alternative approach (the algorithm process is shown in Fig. 4).

Fig. 4

The Multi-day weighted matching method.

Step 1: Choose the training data, train the GHMM model, and derive the optimal hidden state sequence (referred to as the optimal path) along with the state transition probability matrix.

Step 2: Determine the state (assumed to be S2) and its corresponding probability \({P}_{T(S2)}\) for the last day (Day T) of the training data.

Step 3: In the state transition probability matrix, find the probability values most similar to \({P}_{T(S2)}\)​ and their corresponding dates. These dates are the “matching days”.

Step 4: Calculate the absolute difference between \({P}_{T(S2)}\) and \({P}_{j(S2)} \left(1\le j<T\right)\), \(\Delta P=\left|{P}_{j(S2)}-{P}_{T(S2)}\right|\), sort the ∆P values from small to large, and select the d smallest values, whose corresponding \({P}_{j(S2)}\) are denoted \({P}_{1}\), \({P}_{2}\), \({P}_{3},\dots ,{P}_{d}\).

Step 5: Calculate the weights and the predicted AQI (see Eqs. (3) and (4)).

$${\omega }_{i}=\frac{\log\left|{P}_{i}-{P}_{T({S}_{2})}\right|}{\sum_{k=1}^{d}\log\left|{P}_{k}-{P}_{T({S}_{2})}\right|}$$
(3)
$$y={y}_{T}+\sum_{i=1}^{d}\left({y}_{i}-{y}_{i+1}\right)\bullet {\omega }_{i}$$
(4)

In the above equations: \({\omega }_{i}\) is the weight of \({P}_{i}\); d is the number of matching days; T is the length of the training set; y is the predicted AQI value; \({y}_{T}\) is the AQI value of the last day of the training set; \({y}_{i}\) is the AQI value on the day corresponding to \({P}_{i}\); and \({y}_{i+1}\) is the AQI value on the day after that.
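Steps 2 to 5 can be sketched as follows. The sketch takes Eqs. (3) and (4) exactly as printed, including the sign of the difference term. It assumes that a per-day probability of the last-day state is already available as a list (how that series is extracted from the trained GHMM is not shown), and the small constant `eps` guarding the logarithm against an exact match is our own addition. All names are hypothetical.

```python
import math


def multi_day_weighted_match(probs, aqi, d=3, eps=1e-12):
    """Predict the next AQI value from the d best-matching past days.

    probs[j] : assumed probability of the last-day state on day j
    aqi[j]   : observed AQI on day j; the final index is day T
    eps      : assumption, avoids log(0) when a past day matches exactly
    """
    if len(probs) != len(aqi) or len(probs) < 2:
        raise ValueError("need two aligned series with at least two days")
    last = len(probs) - 1
    p_T, y_T = probs[last], aqi[last]
    # Steps 3-4: rank earlier days by |P_j - P_T| and keep the d closest.
    ranked = sorted((abs(probs[j] - p_T), j) for j in range(last))
    matches = ranked[:d]
    # Eq. (3): weights from the logarithm of each difference.
    logs = [math.log(max(dp, eps)) for dp, _ in matches]
    total = sum(logs)
    if total == 0:
        weights = [1.0 / len(logs)] * len(logs)
    else:
        weights = [v / total for v in logs]
    # Eq. (4): weighted sum of day-to-day AQI differences added to y_T.
    y = y_T + sum(w * (aqi[j] - aqi[j + 1]) for w, (_, j) in zip(weights, matches))
    return y, weights, [j for _, j in matches]
```

Because every difference between two probabilities is below 1, the logarithms are negative and the resulting weights are positive and sum to 1, with closer matches receiving larger weights.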

Fixing the length of the training set. To enhance AQI prediction performance, the Fixed Training Set Length method is proposed to mitigate error accumulation. After each prediction, the observed value from the test data, rather than the predicted value, is added to the training set. The earliest data point in the training set is then removed to maintain a fixed training set length. Compared to the traditional GHMM approach, this method employs a consistent training set length, thereby reducing error accumulation and uncertainty stemming from increasing training set lengths. Algorithm 1 (refer to the Supplementary) outlines the pseudo-code for the Fixed Training Set Length method.
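Since Algorithm 1 is given only in the Supplementary, the following is an independent minimal sketch of the sliding-window idea described above, not a reproduction of that pseudo-code. The callable `predict_next` is a hypothetical stand-in for retraining the model on the current window and forecasting one day ahead.

```python
from collections import deque


def rolling_forecast(train, test, predict_next):
    """One-step-ahead forecasts with a training window of fixed length.

    train        : initial training series
    test         : observed values, revealed one day at a time
    predict_next : hypothetical callable mapping a window (list) to a forecast
    """
    window = deque(train, maxlen=len(train))
    predictions = []
    for observed in test:
        predictions.append(predict_next(list(window)))
        # Append the observed value, not the prediction; the bounded deque
        # drops the oldest point so the window length stays constant.
        window.append(observed)
    return predictions
```

Feeding back observed values keeps a forecast error on one day from propagating into the training data used for later days.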

The air quality historical correlation model based on improved GHMM algorithm

Data analysis

The research datasets consist of daily AQI data, daily air pollutant concentration and industrial exhaust emissions data from Zhangdian District, Zibo City, spanning from January 1, 2017 to December 31, 2019. Statistical analysis revealed the presence of 5 groups of zero values in the datasets, which were subsequently removed, resulting in 1089 remaining groups of datasets. The GHMM algorithm places greater emphasis on the conditional probability among input variables and the distribution characteristics of the data. Therefore, utilizing the original data directly aligns with the principles of GHMM and benefits AQI prediction. The statistical features such as mean values, standard deviation, maximum, and minimum values of the datasets are presented in Table 1. The frequency histograms of the above datasets are displayed in Fig. 5. The trend diagram of AQI and six air pollutants is shown in Fig. 6.

Table 1 The statistical features of the data set.
Fig. 5

The Frequency histograms of the data set.

Fig. 6

The trend of AQI and six air pollutants.

As depicted in Figs. 5 and 6, the overall AQI in Zhangdian District during 2017–2019 exhibited significant fluctuations, with peaks typically occurring in winter. The AQI values were primarily clustered around the mean value. Furthermore, the bulk of the distribution lay to the left of the mean (a right-skewed distribution), suggesting that most AQI values fell below the mean of 99.77. Table 1 illustrates a standard deviation of 49.63, indicating relatively large fluctuations. The minimum AQI value recorded was 13, corresponding to excellent air quality, while the maximum value of 313 indicated severely polluted air. The considerable difference between these extremes underscores the extensive variation range of the AQI series in Zhangdian District, characterized by instability and nonlinearity. Predominantly, air quality levels ranged from good to lightly polluted. The AQI trend paralleled that of PM10, PM2.5, and NO2, all peaking in winter with significant fluctuations. In contrast, the CO trend exhibited a relatively gentle pattern, while O3 concentration was more pronounced in summer and less so in winter.

Figure 7 shows the Pearson correlation coefficient matrix for each IAQI and the AQI. Coefficients from 0.8 to 1 indicate very strong correlation, from 0.6 to 0.8 strong correlation, from 0.4 to 0.6 moderate correlation, and less than 0.3 weak correlation. In the Pearson analysis, significance levels were set as follows: **** indicates p ≤ 0.0001, *** indicates p ≤ 0.001, ** indicates p ≤ 0.01, and * indicates p ≤ 0.05. The results showed that except for the correlation between AQI and IAQIO3, which was marked with *, all other correlations were marked with ****, indicating a high level of statistical significance. According to the Pearson correlation analysis results, the AQI in Zhangdian District is positively correlated with all IAQI variables. Among them, the AQI has the strongest correlation with IAQIPM2.5, with a value of 0.78, indicating that PM2.5 concentration has the most significant impact on the AQI. The correlation between IAQIO3 and AQI is the weakest, with a value of only 0.07, suggesting that ozone concentration has a minimal effect on the AQI in this area. Therefore, ozone may not be a primary factor influencing AQI in Zhangdian District. Additionally, there is a strong positive correlation between IAQIPM2.5 and IAQIPM10, with a correlation coefficient of 0.89, indicating a high similarity in their sources and variation trends. Except for IAQIO3, all other IAQI variables are positively correlated with each other, while IAQIO3 is negatively correlated with the other IAQI variables. This could be due to the fact that the formation and depletion of ozone are closely related to the concentration changes of other pollutants such as NOx. Typically, in photochemical reactions, NOx reacts with volatile organic compounds (VOCs) to form ozone. Therefore, when the concentration of other pollutants is high, ozone may be consumed or its formation inhibited. 
In Zhangdian District, PM2.5 is the main pollutant affecting AQI and should be the primary focus for control. Moreover, controlling particulate matter pollution requires comprehensive consideration of the sources and mitigation measures for both PM2.5 and PM10. Due to the negative correlation between ozone and other pollutants, managing ozone pollution necessitates a systematic approach to the coordinated control of NOx and VOCs.
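For reference, the Pearson coefficient underlying the matrix in Fig. 7 can be computed as in the short sketch below. It covers the coefficient only; the significance levels would require a separate test that is not shown. The function name is hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)
```

Applying the function to every pair of daily IAQI and AQI series would produce a symmetric matrix of the kind plotted in the figure.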

Fig. 7

Pearson correlation coefficient plot.

Experiments and results

Four experiments with different input variables are designed, referred to as Experiments 1, 2, 3, and 4. Data from 1/1/2017 to 30/11/2019 are used to train the model, while data from 1/12/2019 to 31/12/2019 are reserved for testing to evaluate the model’s performance. For details on the modelling process and parameters, please refer to the Supplementary.

Table 2 Prediction results of improved and traditional GHMM.

Experiment 1. Experiment 1 uses the direct mode for AQI prediction, with the input variable being only the historical AQI series. The model proposed in this study is compared with the traditional GHMM, ARIMA, and LSTM models. The experimental results are shown in Table 2. The summer month forecast result is provided in the Supplementary. It can be seen that the proposed model greatly improves AQI forecasting accuracy compared to the traditional GHMM, ARIMA, and LSTM models, and demonstrates a notable advantage in predicting AQI time series with non-linear and unstable characteristics. Therefore, the improved GHMM model proposed in this study is used for further analysis in Experiments 2, 3, and 4.

Experiment 2. To obtain more accurate results, Experiment 2 predicted the AQI using the indirect mode, since the AQI is calculated from the Individual Air Quality Index (IAQI). According to the Pearson correlation matrix plot (Fig. 7), the IAQIs affect each other. Therefore, different combinations of input variables are designed to predict each IAQI (shown in Fig. 8). After predicting each daily IAQI, the corresponding daily AQI is calculated using Eq. (2). The optimal combination of input variables for each IAQI, the corresponding prediction results for each IAQI, and the accuracy of the calculated AQI are shown in Table 3. As seen in Table 3, the accuracy of AQI prediction in the indirect mode is higher than in the direct mode of Experiment 1, with MAPE improved by 14.64%, RMSE by 11.46%, and MAE by 24.08%.

Fig. 8

The input combination of IAQI in experiment 2.

Table 3 IAQI and AQI prediction accuracy in Experiment 2.

Experiment 3. The AQI is not only closely related to the IAQI but also influenced by pollutant emissions. To comprehensively predict the AQI, it is necessary to include industrial exhaust emissions. In Experiment 3, exhaust emissions are combined with the AQI as input variables for the proposed model. The combinations and corresponding prediction accuracy of the AQI are shown in Table 4. It can be seen that the prediction accuracy of AQI is highest when the input variables are AQI and total waste gas emissions compared to other input combinations.

Table 4 AQI prediction precision in Experiment 3.

Experiment 4. In Experiment 4, the model’s input data includes both IAQI and industrial exhaust emissions. The combinations of each IAQI and industrial exhaust emissions, built on Experiment 2, are shown in Fig. 9. Table 5 presents the best combinations of IAQI and industrial exhaust emissions as input variables and their corresponding prediction accuracy. The AQI values and evaluation indices are calculated based on the predicted IAQI, which are also shown in Table 5. It is found that the prediction accuracy of the IAQI, except for IAQICO and IAQIO3, is improved when industrial exhaust emissions are considered as input variables. Furthermore, the prediction accuracy of AQI has been improved in Experiment 4.

Fig. 9

The input combination of IAQI and emissions in experiment 4.

Table 5 IAQI and AQI prediction precision in Experiment 4.

Comparison and analysis of each experiment. The results from Experiment 1 to Experiment 4 are summarized in Table 6 and Fig. 10. The AQI prediction results of Experiment 3 are based on the combination of Total waste gas emissions and AQI. According to the evaluation metrics MAPE, RMSE, and MAE, the following conclusions can be drawn:

Table 6 Comparison of prediction results of each experiment.
Fig. 10

AQI prediction effect in each experiment.

(i) Compared with the traditional GHMM, the precision of the improved GHMM model is significantly higher when predicting AQI in Zhangdian District.

(ii) The indirect mode has higher precision in predicting AQI than the direct mode (Experiment 1 compared with Experiment 2, Experiment 3 compared with Experiment 4).

(iii) Compared with Experiment 1, considering exhaust emissions in Experiment 3 does not significantly improve AQI prediction. This may be due to the varying primary pollutants on different days (as seen in Fig. 11). Consequently, the prediction accuracy of Experiment 3 (d) (AQI and total waste gas emissions) is higher than that of the other combinations in Experiment 3.

Fig. 11

Daily primary pollutants.

(iv) The prediction precision in Experiment 4 is higher than in all other experiments. This demonstrates that the indirect mode combined with industrial exhaust emissions yields relatively higher AQI prediction precision when fully considering historical AQI. Compared to Experiment 1, the AQI prediction results of Experiment 4 improved MAPE, RMSE, and MAE by 24.87%, 26.95%, and 30.38%, respectively. The prediction precision of IAQICO and IAQIO3 in Experiment 4 is lower than in Experiment 2 due to the lack of CO emissions and VOCs emissions data.
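The three evaluation metrics used in these comparisons can be written out as below. The helper `relative_improvement` rests on an assumption: the text does not state how the reported percentage improvements were computed, and the sketch treats them as relative reductions of the baseline error. All function names are hypothetical.

```python
import math


def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def relative_improvement(baseline_error, improved_error):
    """Assumed definition: percentage reduction relative to the baseline."""
    return 100.0 * (baseline_error - improved_error) / baseline_error
```

MAPE is undefined when an actual value is zero, which is consistent with the removal of zero-valued records during data preparation.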

Model fusion

In the previous project, the RF (Random Forest) algorithm was used to develop a meteorological correlation model for predicting AQI. This algorithm, based on multiple decision trees, is capable of handling high-dimensional meteorological features. In this meteorological correlation model, the input variables include various meteorological factors and pollutant emission data from January 1, 2017, to December 31, 2019. The specific meteorological factors are precipitation, air temperature, relative humidity, wind speed, air pressure, and total sunshine intensity.

The air quality historical correlation model based on the improved GHMM proposed in this study is combined with the air quality meteorological correlation model based on the RF algorithm established earlier in the project. The two models are fused using the weighted average method (Eqs. (5), (6), and (7)) to establish an ensemble model for air quality prediction. In this fusion, \({w}_{G}\) and \({G}_{i}\) represent the weight and predicted value of the proposed GHMM model, respectively, \({w}_{R}\) and \({R}_{i}\) represent the weight and predicted value of the RF model, respectively, and \({y}_{i}\) is the actual value of the AQI. The AQI prediction values of the improved GHMM model are taken from the prediction results of Experiment 4. The calculated weights are \({w}_{G}\) = 0.5050 and \({w}_{R}\) = 0.4950. The fusion results are shown in Table 7 and Fig. 12. The results demonstrate that the AQI prediction precision is further improved by comprehensively considering the effects of meteorology, exhaust emissions, and historical air quality.

Table 7. The results of model fusion.
Fig. 12

AQI prediction results of GHMM, RF, and Ensemble models.

$${Y}_{i}={w}_{G}\cdot {G}_{i}+{w}_{R}\cdot {R}_{i}$$
(5)
$${w}_{G}=\frac{1}{31}\sum_{i=1}^{31}\frac{\left|{y}_{i}-{R}_{i}\right|}{\left|{y}_{i}-{G}_{i}\right|+\left|{y}_{i}-{R}_{i}\right|}\times 100\text{\%}$$
(6)
$${w}_{R}=\frac{1}{31}\sum_{i=1}^{31}\frac{\left|{y}_{i}-{G}_{i}\right|}{\left|{y}_{i}-{G}_{i}\right|+\left|{y}_{i}-{R}_{i}\right|}\times 100\text{\%}$$
(7)
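The weighted average fusion of Eqs. (5), (6), and (7) can be sketched as follows. Each model's weight is the mean share of the other model's absolute error, so the model with the smaller error on a given day receives the larger share. The weights are kept as fractions, consistent with the reported values of 0.5050 and 0.4950. The function names, the handling of days on which both models are exact, and the sample values are assumptions made for illustration.

```python
# Illustrative sketch of the weighted average fusion; not the authors' code.

def fusion_weights(y, g, r):
    """Return (w_G, w_R) following the form of Eqs. (6) and (7)."""
    n = len(y)
    w_g = w_r = 0.0
    for yi, gi, ri in zip(y, g, r):
        err_g, err_r = abs(yi - gi), abs(yi - ri)
        total = err_g + err_r
        if total == 0:
            # Both models exact on this day: split evenly (assumption, not in source).
            w_g += 0.5
            w_r += 0.5
        else:
            w_g += err_r / total  # GHMM gains weight when the RF error is larger
            w_r += err_g / total
    return w_g / n, w_r / n


def fuse(g, r, w_g, w_r):
    """Eq. (5): weighted average of the two models' predictions."""
    return [w_g * gi + w_r * ri for gi, ri in zip(g, r)]


# Hypothetical values: actual AQI, GHMM predictions, RF predictions.
y = [100, 120, 90]
g = [95, 125, 92]
r = [110, 118, 85]
w_g, w_r = fusion_weights(y, g, r)
fused = fuse(g, r, w_g, w_r)
print(round(w_g, 4), round(w_r, 4), fused)
```

The two weights sum to one by construction, since the per-day shares in Eqs. (6) and (7) are complementary.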

However, the fusion results show that the ensemble model failed to predict the AQI peak on the 9th to 10th days. This may be because regional air quality is influenced not only by meteorological factors, pollutant emissions, and historical air quality, but also by the transmission of pollutants from surrounding areas. This study did not consider pollutant transmission from neighbouring regions because the characteristics of the various influencing factors differ, and mixing these inputs into a single model would inevitably affect prediction accuracy. Different machine learning models perform differently when handling different types of data features. The transmission of pollutants from surrounding areas has spatial correlations and dynamic variability, and is influenced by meteorological conditions, geographical features, and other factors affecting pollutant dispersion. GHMM and RF models struggle to capture these spatial diffusion and transmission characteristics. Future research can explore inter-city spatial correlation models of regional transmission to further improve the accuracy of AQI predictions.

Conclusions

This study introduces an indirect mode for predicting the Air Quality Index (AQI), contrasting with the commonly employed direct mode. Initially, Individual Air Quality Index (IAQI) values are predicted, followed by the calculation of AQI based on these predictions, aligning more closely with the AQI calculation principle. Additionally, a novel air quality historical prediction model is established using GHMM, incorporating industrial exhaust emissions and historical AQI data as input factors. To address issues of randomness and subjectivity in selecting the number of hidden states in the model, a traversal method is employed. Furthermore, methods such as the Multi-day weighted matching and Fixed training set length are introduced to mitigate error accumulation in long time series prediction, thereby enhancing the model’s accuracy. Lastly, a novel air quality fusion method is proposed, integrating meteorological factors, historical air quality data, and industrial exhaust emissions, providing a more comprehensive approach to air quality forecasting.

Air quality is correlated with meteorological factors, air pollutant discharge, and historical air quality. Considering the characteristics of the AQI time series, an improved GHMM was developed to establish a novel air quality historical prediction model. Additionally, a new prediction mode called the indirect mode was introduced for AQI prediction, aligning more closely with the AQI calculation principle. To evaluate the proposed model, numerous comparative experiments were conducted using AQI time series data from Zhangdian District, covering the period from January 1, 2017, to December 31, 2019. Both the indirect and direct modes were evaluated for AQI prediction. The experimental results demonstrate that the air quality historical prediction model proposed in this study achieves its best prediction performance in the indirect mode. Furthermore, by integrating the air quality historical correlation model, based on the improved GHMM, with the air quality meteorological correlation model, based on the RF algorithm established earlier, the prediction performance of the ensemble model surpasses that of the individual models.

The study did not directly consider emissions from residential and traffic sources because their emission levels are relatively stable. Since the prediction of air pollutant concentrations is based on daily-scale data, and residential and traffic sources change slowly and at low frequency, they lack significant daily-scale fluctuations and cannot explain rapid day-to-day changes. Including these low-frequency, slowly varying factors in the model could introduce noise and thereby reduce the model’s prediction accuracy. In the long run, residential and traffic sources will inevitably affect air quality, but this impact accumulates over time. This study proposes the fixed training set length method: in each new round of prediction, the predicted data are replaced with the test data and the first data point of the original training set is removed, so that the training data length stays fixed. This method indirectly reflects the impact of residential and traffic sources by treating the long-term stable changes in these sources as systematic errors. It also reduces the errors generated by outdated socio-economic information. Finally, air quality is also influenced by factors in surrounding regions. In future research, an intercity correlation model for air quality will be developed to further improve AQI prediction accuracy.
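The fixed training set length method described above can be sketched as a rolling window. After each prediction the observed value is appended to the training set and the oldest point is dropped. The `naive_predict` function is a hypothetical placeholder that stands in for the improved GHMM, and the sample series is invented for illustration.

```python
from collections import deque

# Illustrative sketch of a fixed-length rolling training window; not the authors' code.

def rolling_forecast(train, test, predict):
    """Predict each test point from a training window of constant length."""
    window = deque(train, maxlen=len(train))  # maxlen keeps the length fixed
    preds = []
    for actual in test:
        preds.append(predict(list(window)))
        # Append the observed value rather than the prediction;
        # the deque drops the oldest training point automatically.
        window.append(actual)
    return preds


def naive_predict(history):
    """Placeholder forecaster (mean of the last three values), standing in for the GHMM."""
    recent = history[-3:]
    return sum(recent) / len(recent)


# Hypothetical AQI series, not data from the study.
train = [80, 90, 100, 110]
test = [120, 130]
print(rolling_forecast(train, test, naive_predict))
```

Feeding observed values back into the window, instead of earlier predictions, is what keeps errors from accumulating across rounds in this sketch.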