Introduction

Air pollution is one of the most concerned issues of our society due to its impacts on human health and the economy1,2. Many studies have shown that the mortality rate increases with increasing particulate matter (PM) concentration in the air. For example, Dockery et al.3 reported that an increase of 10 μg/m3 in the PM2.5 concentration would make the death rate increase by 1.5% in 6 cities in the United Stated4. A high amount of PM in the air also showed a strong correlation with lung cancer and other cardiovascular and lung related diseases which makes it one of the known causes for deaths in adults in the United States5. An increased concentration of 10 μg/m3 PM10 would result in a 0.5% increase in the number of deaths per day in 20 big cities with more than 50 million inhabitants6,7. A similar outcome was also confirmed by Katsouyanni et al.8 in a study of over 29 cities. Not only does it have negative effects on human health, PM air pollution also causes huge economic loss. Because of air pollution, each year around 4.1 billion Euro loss was reported in Switzerland9 and a loss of 40–50 million USD was revealed by a statistic made by The National Center for Health Statistics of the United States in 200110. According to the World Bank, in 2016 air pollution cost the world’s economy around 5.11 trillion USD in welfare losses. In South Asia and East Asia and the Pacific, losses were around 7.4 percent and 7.5 percent of the regional gross domestic product (GDP), respectively11. It has been long recognized that meteorological conditions and PM concentration have a close relationship. Zhao et al.12, in a study on PM2.5 pollution from 2005 to 2007, highlighted a clear seasonal variation in the concentration of PM10–2.5 in which the PM concentration at the rural area was at the minimum level in winter and maximum level in spring and summer, while the urban region experienced the opposite trend. This study also pointed out that precipitation had an important contribution to the seasonal pattern of PM2.5 in the urban area, while monsoon was the main factor for that in the countryside. In a similar study, Duo et al.13 concluded that temperature in Lhasa, Tibet was the dominant factor governing all air pollutants including PM10 and PM2.5 in spring, while relative humidity and atmospheric pressure were the major meteorological drivers during summertime. Spatial distribution of fine to coarse PM showed an inverse relationship with wind speed14,15. Srimuruganandam and Nagendra16 revealed that low wind speed was highly correlated with PM concentration. It was however reported by Giri et al.17 that PM10 concentration in Kathmandu, Nepal increased with wind speed and atmospheric pressure. Wang and Ogawa18 also reported that PM2.5 was positively proportional to wind speed higher than 3 m/s, and negatively proportional to the wind speed lower than that level.

Based on the relationship between the PM10 concentration and meteorological factors, efforts have been made to construct data-driven models for predicting the PM10 concentration from meteorological data using different statistical methods19,20,21. Among these methods, the Multiple Linear Regression (MLR)-based data-driven models are the most popular and have been widely used. The MLR algorithm is usually used to formulate the linear relationship between meteorological factors (including temperature, relative humidity, wind speed, and wind direction) with the PM10 or PM2.5 concentration. Although there are some studies that reported good prediction results22, it is generally seen that the MLR-based models are yet to present consistently satisfactory results due to the linearizing of the non-linear system as reviewed by Shahraiyni and Sodoudi23. With the ability to represent complex non-linear problems, the ANN-based data-driven model with different architectures has been extensively used to estimate the PM concentration. Several studies reported that the Artificial Neural Network (ANN) models produced satisfactory prediction results24. Although some studies showed that the ANN models performed better than the MLR models24, the ANN models have a more complicated structure and still present some limitations in terms of handling high dimensional input variables, local minima or interpretability (the black-box model problem). As a result, for each case study, it is necessary to compare these two models to select the more suitable model.

PM10 concentration recorded in air quality monitoring statios only cover an area surrounding those stations, thus it is necessary to use interpolation techniques for mapping the PM10 concentration. There are multiple interpolation techniques for this purpose. Wong et al.25 provided an excellent review on these techniques and divided them into four groups, namely spatial averaging, nearest neighbor, inverse distance weighting and kriging. These interpolation techniques have been successfully employed to construct PM10 maps in many studies. For example, Perez26 applied the nearest neighbor technique to provide a forecasting map of PM10 in Santiago, Chile. Li et al27 developed two IDW-based spatiotemporal interpolation techniques to evaluate the spatial variation of PM2.5 concentration over the contiguous United States. Kim et al. employed the ordinary kriging to interpolate the PM10 concentration from 226 urban-ambient monitoring sites in South Korea28. Raja et al.29 used spatiotemporal kriging with the external drift to explore spatio-temporal variations of PM10 concentrations in Ankara, Turkey. The main drawback of these studies was that they only used the PM10 concentration measured at the air quality stations for interpolation without considering the impacts of the spatio-temporal variations of the meteorological factors on PM10 variation. As a result, it is crucial to develop an interpolation technique that can account for information from both air quality stations and meteorological data.

With the increasing population and rapid urbanization, air quality in Vietnam, especially in large cities like Hanoi has been significantly degraded. Hopke et al.30 indicated that Hanoi was one of the cities which had the worst air quality in Asia. Saksena et al.31 showed that the average value of PM10 concentrations in the streets in Hanoi could reach up to 455 μg/m3, which is much higher than the Vietnamese daily standard for the PM10 concentration (150 μg/m3). This has posed a negative effect on the city’s public health. It was reported that ambient and in-house air pollution was becoming the major reason for deaths related to the environment in Vietnam, just second to smoking32. As a result, there has been increasing attention and demand from both the local community and the government of Hanoi for a study on air quality and its controlling factors with PM10–2.5 concentration prediction being the top priority.

There are two objectives to this study. The first objective is to develop a hybrid mapping approach that combines the data-driven model and IDW interpolation to produce the PM10 concentration maps from global meteorological data. The second objective is to employ this approach to construct the monthly PM10 maps for the central districts of Hanoi and analyze its spatio-temporal variations.

Methodology and material

In this study, we developed a hybrid approach that combines data-driven models and interpolation techniques to construct the PM10 concentration maps from global meteorological data. As shown in Fig. 1, this approach consists of two main steps, namely, (1) development of data-driven models at each air quality monitoring station and (2) construction of PM10 maps from global meteorological data using the IDW interpolation technique. Details of these two steps are presented below.

Figure 1
figure 1

Hybrid data-driven and interpolation scheme.

Development of data-driven models

Before developing data-driven models, a set of input features derived from meteorological factors were constructed. After that, the data-driven models linking these selected input features and the PM10 concentration using two machine-learning algorithms, namely, MLR and ANN were developed. Finally, based on their performance, the more accurate models were selected for mapping the PM10 concentration.

Construction of input features

Multiple meteorological factors can be considered in data-driven models to predict the PM10 concentration. However, based on the availability of meteorological data at the air quality monitoring stations and their correlation with the PM10 concentration, the following variables were taken into account: mean daily temperature, maximum daily temperature, minimum daily temperature, mean daily humidity, mean daily wind speed and mean daily atmospheric pressure. Next, we assumed that the PM10 concentration was linked to meteorological factors by a quadratic function as follows:

$$PM_{10} = f(X_{i} , \;X_{i}^{2} , \;X_{i} X_{j} )\quad i,\;j = 1, 2, \ldots 6,\;\;i \ne j$$
(1)

in which Xi (i = 1, 2, …, 6) is the mean daily pressure (X1), mean daily temperature (X2), mean daily humidity (X3), mean daily wind speed (X4), maximum daily temperature (X5), minimum daily temperature (X6). Totally, there are 27 features considered in Eq. (1). Since the size of features is considerably large, features with low correlation coefficients with the PM10 concentration or close correlation with other previously-selected features were removed from the equation.

Next, the selected input features and PM10 concentration were standardized to avoid the effects of differences in scale of features that could significantly influence the performance of regression models. After standardization, both input features and PM10 concentration were unitless values with a mean of 0 and a standard deviation of 1. The standardized features and PM10 concentration were used as inputs and outputs for the data-driven models.

Development of data-driven models

Multiple linear regression model

MLR is a statistical technique used to find a linear relationship between a response variable (dependent variable) and explanatory variables (independent variables). It is one of the most common methods to generalize the relationship of PM concentration with its determinants23. Generally, the MLR model is defined by the following equation:

$$y_{i} = c_{0} + c_{1} x_{i1} + c_{2} x_{i2} + \cdots + c_{n} x_{in} + \epsilon$$
(2)

in which, \(y_{i}\) is dependent variable/response variable; \(x_{i}\) = independent/explanatory variables; \(c_{0}\) is the intercept; \(c_{n}\) is the slope coefficient for each independent variables; \(\epsilon\) is the error term. The \(y_{i}\), in this study, is the standardized PM10 concentration, and \(x_{i}\) is the standardized features constructed in the previous step. At each air quality monitoring station, a data-driven model based on the MLR algorithm was developed from measurement data of PM10 concentration and meteorological-derived input features using the method of least squares.

Artificial neural network model

In this study, a feed forward neural network (FFNN) model, a common architecture of ANN model, was employed to build data-driven models for each air quality monitoring station using the same input and output data as in the MLR models for comparison. The structure of a FFNN model consists of three layers (input layer, hidden layer and output layer). We refered to Sanger33 for more detailed information about the FFNN algorithm.

In order to develop the FFNN data-driven models, the input and output data were randomly sampled into 3 sub-sets with 70% of data for training, 15% for validation and 15% for testing. Since the number of nodes in the input and output layers was determined, the determination of the FFNN structure focused on determining the number of hidden nodes. In this study, the trial-and-error method was used to find the number of hidden nodes for each air quality monitoring station.

Development of a hybrid interpolation approach for PM10 concentration mapping

Based on the data-driven models developed in the previous section, this study constructed the monthly PM10 concentration maps from meteorological data using a new approach based on the IDW interpolation method. In order to predict the PM concentration at a given location (interpolated location) from surrounding air quality monitoring stations, the IDW method determines the weighting factors of each station as below:

$$w_{ik} = \frac{{{\raise0.7ex\hbox{$1$} \!\mathord{\left/ {\vphantom {1 {d_{ik}^{2} }}}\right.\kern-\nulldelimiterspace} \!\lower0.7ex\hbox{${d_{ik}^{2} }$}}}}{{\mathop \sum \nolimits_{i = 1}^{N} d_{ik}^{2} }}$$
(3)

in which wik is the weighting factor of station ith at interpolated location kth. \(d_{ik}^{2}\) is the distance from station ith to interpolated location kth. N is the total number of air quality monitoring stations used for interpolation. Using the weighting factors, the PM10 concentration was estimated as below:

$$PM_{10}^{k} = w_{ik} f_{i}^{{PM_{10} }} ({\varvec{X}}_{k} )$$
(4)

in which \(f_{i}^{{PM_{10} }}\) is the data-driven model developed for the station ith; \({\varvec{X}}_{k}\) is the input feature vector which was derived from meteorological data at interpolated location kth. The novel of this approach is that instead of using the PM10 concentration values at the air quality monitoring stations like in traditional approaches, it used meteorological data at the interpolated location to feed the data-driven models. This hybrid approach allows us to consider the impacts of both local conditions (via the data-driven model developed for each station) and spatio-temporal variations of meteorological factors.

All the above steps including feature construction, development of MLR and ANN-based models and mapping the PM10 concentration were programmed on Matlab (coding of this program is provided in the Supplementary document). This programming language allows for quick implementation of the algorithms and easy visualization of the results without using any other additional software.

The method developed in this study can be well applied for other cities where meteorological and PM10 concetration observations are avaiable. However, the selection of meteorological factors and development of data-driven models in this study were purely relied on data from air quality monitoring stations in the city of Hanoi. As a result, the data-driven models may not be applicable to other cities. The data-driven models should be developed for each city based on the availability of observation data in that city.

Study area

Hanoi is the capital city of Vietnam with an area of 3358 km2 (following the administrative expansion in 2008) and more than 7.4 million people in 201734. The study area consists of eleven central districts of Hanoi (Fig. 2). This is the most crowded area of Hanoi where 41% population of the city resides in 7.7% of the total city area. With a high population density and a large number of vehicles and construction activities, this area has been severely suffered from air pollution. As for the weather conditions, Hanoi has four distinct seasons including spring (March–May), summer (June–August), autumn (September–November) and winter (December–February). In winter, the weather is cold and dry, while summer has high humidity and rainfall35. With a lower temperature and low humidity, the PM10 concentration in winter is much higher than the other seasons (Fig. 3). While the 24 h PM10 concentration is mostly below the National Technical Regulation on Ambient Air Quality (QCVN 05:2013/BTNMT, PM10 = 150 μg/m3), there are many days in which PM10 concentration is above this standard in winter. Due to the harmful impacts of high PM10 concentration on human health and the importance of the study area, it is necessary to construct high resolution PM10 concentration maps for this area in order to provide more detail air quality information for local residents who are most likely to be affected.

Figure 2
figure 2

Location of 11 air quality stations used in the study.

Figure 3
figure 3

Daily PM10 in 2017 in Trung Yen 3 air quality monitoring station.

Data availability

The data used in this study was collected from two sources corresponding to two objectives. For the development of the data-driven models, input data was collected from 11 air quality monitoring stations located across the study area (Fig. 2). These stations include three fixed stations (Minh Khai, Trung Yen 3 and Nguyen Van Cu) and eight sensor stations (Table 1). Nguyen Van Cu station is under the management of the Vietnam Environment Administration. The remaining stations are under the management of the Hanoi Department of Environmental Protection. Hourly PM10 concentration and meteorological data (atmospheric pressure, temperature, humidity, wind speed) at all stations from 01/06/2017 to 31/12/2018 were collected. This dataset covers one and a haft year, and therefore, can represent the temporal variations of the PM10 concentration and meteorological factors over four seasons of the year. In addition to PM10, other air pollutants were also collected, although they were not considered in the scope of this study. The hourly PM10 concentration and meteorological data were averaged to generate a daily dataset to reduce the measurement errors and remove their diurnal variation.

Table 1 Summary information of air quality monitoring stations.

Since the data collected from 11 meteorological stations were limited, the PM10 concentration calculated from these data was not representative for its spatial variation in the study area. Therefore, high spatial resolution maps of the PM10 concentration were needed. For mapping the monthly PM10 concentration, we used the global meteorological data from the WorldClim 2.0 database (https://www.worldclim.org/), which contains temperature (mean, maximum, minimum), precipitation, solar radiation, vapor pressure, and wind speed data with a spatial resolution of 1 km2. This is a reliable data source that was validated with gauged data (correlation coefficient with gauged data \(r \ge 0.99\) for temperature and vapor pressure, \(r \ge 0.86\) for precipitation and \(r \ge 0.76\) for wind speed). After the global meteorological data was downloaded, they were extracted for the region of the study area. Because the relative humidity was not available, it was calculated from actual and saturated vapor pressure. The atmospheric pressure was calculated from the location altitude and air temperature. Figure 4 shows the temperature, wind speed, relative humidity and air pressure in February obtained from the WorldClim database as an example. As shown in the figure, the data extracted from the WorldClim can well reflect the spatial variation of meteorological factors.

Figure 4
figure 4

Spatial distribution of Temperature, Humidity, Atmospheric Pressure, and Wind Speed derived from WorldClim 2.0 in January.

Results

Construction of input features for data-driven models

The meteorological data collected from 01/06/2017 to 31/12/2018 were used to construct the input features for the data-driven models. The total number of features considered in this study was 27. In order to reduce this number of features, the correlation coefficients between each feature with the PM10 concentration and between features were estimated. Figure 5 presents the correlation matrix which indicates the correlation coefficients of input features with each other and with the PM10 concentration. The figure shows that the correlation coefficients between the input features and the PM10 concentration range from − 0.46 to 0.46. All six meteorological factors (mean daily atmospheric pressure, mean daily temperature; mean daily humidity, mean daily wind speed, maximum daily temperature and minimum daily temperature) have a relatively high correlation with the PM10 concentration with absolute correlation coefficients greater than 0.24. It is interesting that of these meteorological factors, only the mean daily pressure is positively correlated with the PM10 concentration, while the other factors have negative correlation coefficients. This implies that the PM10 concentration increases with increasing mean daily pressure and decreasing other factors. The features with the highest correlation with the PM10 concentration are the mean daily pressure (X1) and its quadratic term (\(X_{1}^{2}\)) (correlation coefficients = 0.46), while the product of the mean daily pressure and the mean daily humidity (X1X3) has the lowest correlation (correlation coefficient = − 0.22).

Figure 5
figure 5

Coerrelation matrix of features and between features with the PM10 concentration. X1 is the mean daily pressure; X2 is the mean daily temperature; X3 is the mean daily humidity; X4 is the mean daily wind speed; X5 is the maximum daily temperature; X6 is the minimum daily temperature.

In order to select input features for the data-driven models, we evaluated the correlation of each feature with the PM10 concentration and with the other features. The mean daily pressure (X1), mean daily temperature (X2), mean daily humidity (X3), mean daily wind speed (X4) are well correlated with the PM10 concentration and are independent from each other. As a result, they were added to the input features. The maximum (X5) and minimum daily temperature (X6) are well-correlated to the mean daily temperature with correlation coefficients of 0.98. Hence, these two features and their associated features (X1X5, X2X5, X3X5, X4X5, X5X5, X5X6; X1X6, X2X6, X3X6, X4X6, X5X6, X6X6) were not included in the input features. Features X1X1, X1X2, X1X3, X1X4, X1X5 and X1X6 that are functionally correlated with X1 were not considered either. Of the features associated with the mean daily temperature (X2) and mean daily humidity (X3), only features X2X3, X2X4, and X3X4 are relatively independent on the others factors and have a high correlation with the PM10 concentration. Therefore, these features were selected as inputs for the data-driven models. In total, the input features consist of X1, X2, X3, X4, X2X4, and X3X4.

Development of data-driven models

Multiple linear regression model

Using the input features selected in the previous section, the MLR-based model (Eq. 2) was written as below:

$$PM_{10} = c_{1} + c_{2} X_{1} + c_{3} X_{2} + c_{4} X_{3} + c_{5} X_{4} + c_{6} X_{2} X_{3} + c_{7} X_{2} X_{4} + c_{8} X_{3} X_{4}$$
(5)

in which the coefficient ci (i = 1…8) was determined from measurement data using the method of least squares. Comparing with previous studies that usually used the MLR algorithm to construct the relationship between the PM10 concentration and meteorological factors, this study considered both the meteorological factors (X1, X2, X3, X4) and their combinative quadratic terms (X2X3, X2X4, X3X4), and therefore, could account for the nonlinear quadratic form of this relationship. In addition, to account for the seasonal dependence of the PM10 concentration on meteorological factors, this study built two MLR models corresponding to the winter and spring period and the summer and autumn periods.

Artificial neutron network model

Although the MLR-based data-driven models developed in this study accounted for the nonlinear and seasonal relationship between the PM10 concentration and meteorological factors, they only considered the quadratic nonlinear relationship. For comparison, we employed the ANN algorithm, which could formulate this relationship in a more complicated manner. The ANN model included three layers (input, hidden and output) in which the number of nodes in the input layer and output layer was equal to 7 (number of input features) and 1 (PM10), respectively. After using the trial-and-error method, we found that the number of the hidden layers with 12 nodes was the most suitable for all air quality monitoring stations. Figure 6 illustrates the architecture of the ANN model used in this study.

Figure 6
figure 6

Schematization of the ANN model structure. The model includes 3 layers: input (7 nodes), hidden (12 nodes) and output (1 node).

In order to build the ANN model for each station, the measurement data were divided into three sub-datasets (70% for training, 15% for validation and 15% for testing) to avoid overfitting. Figure 7 compares the PM10 concentration between the ANN modeling and measurement in each sub-dataset and the whole dataset at the Trung Yen 3 station as an example. The figure shows that the ANN model could well simulate the PM10 concentration at all datasets, and therefore, could be reliably used for predicting PM10 concentration.

Figure 7
figure 7

ANN models results for Trung Yen 3 air quality monitoring station.

Comparison of model performance

To evaluate the performance of the MLR and ANN models, we used three statistical indices, namely, Root Mean Squared Error (RMSE), correlation coefficient (r) and Nash–Sutcliffe Efficiency (NSE), which are formulated as below:

$$r = \frac{{\mathop \sum \nolimits_{i = 1}^{N} \left( {Y_{o}^{i} - \overline{{Y_{o} }} } \right)\left( {Y_{m}^{i} - \overline{{Y_{m} }} } \right)}}{{\sqrt {\mathop \sum \nolimits_{i = 1}^{N} \left( {Y_{o}^{i} - \overline{{Y_{o} }} } \right)^{2} } \sqrt {\mathop \sum \nolimits_{i = 1}^{N} \left( {Y_{m}^{i} - \overline{{Y_{m} }} } \right)^{2} } }}$$
(6)
$$RMSE = \sqrt {\frac{1}{N}\mathop \sum \limits_{i = 1}^{N} \left( {Y_{m}^{i} - Y_{o}^{i} } \right)^{2} }$$
(7)
$$NSE = 1 - \frac{{\mathop \sum \nolimits_{i = 1}^{N} \left( {Y_{m}^{i} - Y_{o}^{i} } \right)^{2} }}{{\mathop \sum \nolimits_{i = 1}^{N} \left( {Y_{o}^{i} - \overline{{Y_{o} }} } \right)^{2} }}$$
(8)

where N is total number of data points; Ym is the modeled data, Yo is the observed data. The correlation coefficient ranges from − 1 to 1 in which the higher value corresponds to the closer positive relationship between the modeling and measurement. RMSE measures the differences between modeling and measurement. The lower RMSE indicates a better agreement between the modeling and measurement. NSE varies in the range from \(- \infty\) to 1 in which NSE = 1 indicates a perfect match between the modeling and measurement, NSE ≤ 0 implies that the model predictions have the same or lower accuracy than the mean of measurements.

Table 2 compares the indices for both data-driven models. The table indicates that the ANN models performed much better than the MLR models in all eleven stations. The ANN outputs are well correlated with the measurement data with an average correlation coefficient of 0.7. No station has a correlation coefficient below 0.65. Meanwhile, the average correlation coefficient of the MLR outputs with measurements is 0.58 in which the Hoan Kiem station has the lowest coefficient (r = 0.51). As for the modeling errors, the average RMSE of the ANN and MLR models are 14.1 and 15.6, respectively. The NSE criterion ranges from 0.41 to 0.57 for the ANN models and from 0.26 to 0.53 for the MLR models. It is clear that the differences between modeling and measurement of the ANN models are lower than those of the MLR models. The reason for this fact is that the ANN algorithm accounts for more complicated interactions between the input and output than the MLR model. For their better performance, the ANN models were employed for mapping the PM10 concentration.

Table 2 Performance comparison between ANN model and MLR model.

Of all air quality monitoring stations, the performance of both MLR and ANN at the Trung Yen 3 3, Nguyen Van Cu and Minh Khai stations are much better than the others. Indeed, the average correlation coefficient and NSE corresponding to the ANN models for the three stations are respectively equal to 0.74 and 0.55, while these indices for the other stations are much lower (0.68 for the correlation coefficient and 0.46 for the NSE). This could be explained by the fact that Trung Yen 3, Nguyen Van Cu, Minh Khai are the main stations of Hanoi, which have been frequently checked and performed quality control. As a result, the quality of measurement data at these stations is better than the others.

Monthly PM10 concentration mapping

Mapping the monthly PM10 concentration for the study area of eleven central districts in Hanoi from the WorldClim global meteorological data was performed using the ANN models developed in the previous section and the hybrid interpolation approach (Eq. 4). Figures 8 and 9 below presents the monthly and seasonal maps of the PM10 concentration. The spatial resolution of these maps is equal to that of meteorological data (1 km2). The seasonal PM10 concentration maps were generated by assembling monthly maps.

Figure 8
figure 8

Monthly maps of PM10 concentration.

Figure 9
figure 9

Seasonal maps of the PM10 concentration. The red-solid lines represent the administrative boundary of eleven districts.

As for the temporal variation of the PM10 concentration, it can be seen that the PM10 concentration reaches its peaks in the winter season (January, February, November and December). For example, the mean PM10 concentration in November is up to 71 μg/m3. The low temperature (ranging from 16 to 21 °C in these months) and the temperature inversion phenomenon in winter are likely the causes of the high PM10 concentration in these months. By contrast, because the air temperature in June–August reaches its highest level (~ 29 °C), the concentration of PM10 hits a trough in this period with the PM10 concentration ranging from 46 to 48 μg/m3 As regards to the seasonal variation, points out that the PM10 concentration is highest in winter and lowest in summer. In addition to the air temperature, the humidity and wind speed, which are lowest in winter, are also the reason for the higher PM10 concentration in winter than in the other seasons.

As for the spatial variation, the concentration of PM10 in Long Bien district which is situated in the northeast of the study area is much lower than the other districts. The main reason is that compared to the other districts in the study area, the density of population in this district is lowest in the study area (around 4.5 thousand people/km2, versus 11.6 thousand people/km2 in other districts). This lower population density leads to less intensive traffic. Besides, due to the sparse air quality monitoring network, the PM10 concentration in Long Bien district strongly depends on the PM10 concentration at the Nguyen Van Cu station, which situates relatively far away from transportation routes. On the other hand, as pointed out by Nghiem et al.36, since the inauguration of Vinh Tuy Bridge in 2010 and Nhat Tan Bridge in 2015, the flow of traffic vehicles through Nguyen Van Cu road was decreased. As a result, the annual average of PM10 concentration at Nguyen Van Cu station from 2010 to 2018 slightly declined which makes the PM10 concentration lower. On the contrary, the highest concentration is found at the Pham Van Dong station located in the southwest of the study area. This station is placed on the Pham Van Dong Street, one of the main route to access Hanoi from Noi Bai International Airport, thus the traffic in this street is normally quite intensive. Besides, a high number of active construction works in this area might be an important factor for the increased level of PM10. Meteorological conditions also influence the spatial distribution of the PM10 concentration. The highest PM10 concentration in the northwest region is partly caused by the low air temperature in this region (see Fig. 4). However, Fig. 8 shows that the impact of local factors (e.g., street, population, transportation intensity) on the spatial variation of the PM10 concentration is larger than that of meteorological factors.

Conclusion

In this study, a combinative approach of data-driven models and IDW interpolation technique was developed to construct the PM10 concentration maps for the central area of Hanoi. The construction of data-driven models consisted of two steps, feature construction and model development. The feature construction is responsible for constructing optimal features from meteorological factors. By evaluating the correlation between the PM10 concentration with each feature and correlation between features, a set of features was selected as the input for the data-driven models. The model development step built the data-driven models that link the PM10 concentration with the input features using the MLR and ANN algorithms for each air quality monitoring station. The obtained results indicate that the ANN-based data-driven models provided much better results than the MRL-based models. In order to construct the PM10 concentration maps, the IDW interpolation technique was used to calculate the weighting factors for each air quality monitoring stations. While many other studies obtained the unknown PM10 concentration by interpolating the PM10 concentration at the air quality monitoring stations without considering meteorological factors, this study accounted for the meteorological factors in the data-driven models. Using this approach, both the local PM10 surrounding monitoring stations and the dependence of PM10 on meteorological factors were taken into account hence provided a better representation of the current situation in the study area.

Due to a lack of high spatial resolution of meteorological data, this study used the 1 km2 resolution monthly WorldClim data as the input to predict monthly PM10 concentration via combination of the established data-driven models and interpolation method. The monthly PM10 maps were then aggregated to construct seasonal maps. The temporal analysis revealed that the PM10 concentration was highest in the winter months and lowest in the summer months, which was mainly caused by the negative dependence of the PM10 concentration on air temperature and low humidity. The spatial analysis indicated that the northeast region was the region with the lowest PM10 concentration because the urbanization in this region was less developed than the others. The northwest region had the highest PM10 concentration because of the high population and ongoing constructions of new buildings and roads, which together elevated the PM10 concentration. The meteorological factors also influenced the spatial variation of the PM10 concentration but with a lower impact level compared to the local sources of PM10 generation surrounding the monitoring stations. This study also pointed out that although the spatial variation of meteorological factors was taken into account, the low density of air quality monitoring stations might reduce the accuracy of PM10 concentration maps. Hence, it is necessary to establish a denser air quality monitoring stations network to better cover the spatial variation of the PM10 concentration. The approach developed in this study can be applied to provide the forecasting PM10 concentration maps based on predicting meteorological information. These results could also provide a very meaningful foundation for the local authority in deriving and implementing city air quality management activities and urban planning in Hanoi.