Combination of data-driven models and interpolation technique to develop of PM10 map for Hanoi, Vietnam

Nguyen, Dung Anh; Duong, Son Hong; Tran, Phuong Anh; Cao, Hai Hoang; Ho, Bang Quoc

doi:10.1038/s41598-020-75547-y

Article
Open access
Published: 06 November 2020

Scientific Reports volume 10, Article number: 19268 (2020)

Abstract

The degradation of air quality is one of the most pressing concerns of our society because of its harmful impacts on human health, especially in cities with rapid urbanization and population growth such as Hanoi, the capital of Vietnam. This study aims to develop a new approach that combines data-driven models and an interpolation technique to produce PM₁₀ concentration maps from meteorological factors for the central area of Hanoi. Data-driven models that relate the PM₁₀ concentration to the meteorological factors at the air quality monitoring stations in the study area were developed using the Multiple Linear Regression (MLR) and Artificial Neural Network (ANN) algorithms. A comparison of model performance showed that the ANN models yielded better goodness-of-fit indices than the MLR models at all stations in the study area, with an average coefficient of correlation (r) and Nash–Sutcliffe Efficiency Index (NSE) of 0.7 and 0.49 for the ANN models, and 0.58 and 0.34 for the MLR models. Because the ANN-based data-driven models outperformed the MLR-based models, they were combined with the Inverse Distance Weighting (IDW) interpolation technique to map the monthly PM₁₀ concentration at a spatial resolution of 1 km from global meteorological data. With this combination, the PM₁₀ concentration maps account for both the local PM₁₀ concentration and the impacts of spatio-temporal variations of meteorological factors on the PM₁₀ concentration. This study provides a promising method to predict the PM concentration with a high spatio-temporal resolution from meteorological data.

Introduction

Air pollution is one of the most pressing concerns of our society because of its impacts on human health and the economy^1,2. Many studies have shown that the mortality rate increases with increasing particulate matter (PM) concentration in the air. For example, Dockery et al.³ reported that an increase of 10 μg/m³ in the PM_2.5 concentration would raise the death rate by 1.5% in 6 cities in the United States⁴. A high amount of PM in the air also showed a strong correlation with lung cancer and other cardiovascular and lung-related diseases, which makes it one of the known causes of death in adults in the United States⁵. An increase of 10 μg/m³ in the PM₁₀ concentration would result in a 0.5% increase in the number of deaths per day in 20 big cities with more than 50 million inhabitants^6,7. A similar outcome was also confirmed by Katsouyanni et al.⁸ in a study of over 29 cities. In addition to its negative effects on human health, PM air pollution causes large economic losses. Air pollution was reported to cost Switzerland around 4.1 billion Euro each year⁹, and a loss of 40–50 million USD was reported in statistics from the National Center for Health Statistics of the United States in 2001¹⁰. According to the World Bank, in 2016 air pollution cost the world’s economy around 5.11 trillion USD in welfare losses. In South Asia and in East Asia and the Pacific, losses were around 7.4 percent and 7.5 percent of the regional gross domestic product (GDP), respectively¹¹.

It has long been recognized that meteorological conditions and PM concentration are closely related. Zhao et al.¹², in a study on PM_2.5 pollution from 2005 to 2007, highlighted a clear seasonal variation in the concentration of PM_10–2.5 in which the PM concentration in the rural area was at its minimum in winter and at its maximum in spring and summer, while the urban region experienced the opposite trend.
This study also pointed out that precipitation had an important contribution to the seasonal pattern of PM_2.5 in the urban area, while monsoon was the main factor for that in the countryside. In a similar study, Duo et al.¹³ concluded that temperature in Lhasa, Tibet was the dominant factor governing all air pollutants including PM₁₀ and PM_2.5 in spring, while relative humidity and atmospheric pressure were the major meteorological drivers during summertime. Spatial distribution of fine to coarse PM showed an inverse relationship with wind speed^14,15. Srimuruganandam and Nagendra¹⁶ revealed that low wind speed was highly correlated with PM concentration. It was however reported by Giri et al.¹⁷ that PM₁₀ concentration in Kathmandu, Nepal increased with wind speed and atmospheric pressure. Wang and Ogawa¹⁸ also reported that PM_2.5 was positively proportional to wind speed higher than 3 m/s, and negatively proportional to the wind speed lower than that level.

Based on the relationship between the PM₁₀ concentration and meteorological factors, efforts have been made to construct data-driven models for predicting the PM₁₀ concentration from meteorological data using different statistical methods^19,20,21. Among these methods, the Multiple Linear Regression (MLR)-based data-driven models are the most popular and have been widely used. The MLR algorithm is usually used to formulate the linear relationship between meteorological factors (including temperature, relative humidity, wind speed, and wind direction) with the PM₁₀ or PM_2.5 concentration. Although there are some studies that reported good prediction results²², it is generally seen that the MLR-based models are yet to present consistently satisfactory results due to the linearizing of the non-linear system as reviewed by Shahraiyni and Sodoudi²³. With the ability to represent complex non-linear problems, the ANN-based data-driven model with different architectures has been extensively used to estimate the PM concentration. Several studies reported that the Artificial Neural Network (ANN) models produced satisfactory prediction results²⁴. Although some studies showed that the ANN models performed better than the MLR models²⁴, the ANN models have a more complicated structure and still present some limitations in terms of handling high dimensional input variables, local minima or interpretability (the black-box model problem). As a result, for each case study, it is necessary to compare these two models to select the more suitable model.

The PM₁₀ concentration recorded at an air quality monitoring station only represents the area surrounding that station; thus interpolation techniques are needed to map the PM₁₀ concentration. There are multiple interpolation techniques for this purpose. Wong et al.²⁵ provided an excellent review of these techniques and divided them into four groups, namely spatial averaging, nearest neighbor, inverse distance weighting and kriging. These interpolation techniques have been successfully employed to construct PM₁₀ maps in many studies. For example, Perez²⁶ applied the nearest neighbor technique to provide a forecasting map of PM₁₀ in Santiago, Chile. Li et al.²⁷ developed two IDW-based spatiotemporal interpolation techniques to evaluate the spatial variation of PM_2.5 concentration over the contiguous United States. Kim et al. employed ordinary kriging to interpolate the PM₁₀ concentration from 226 urban-ambient monitoring sites in South Korea²⁸. Raja et al.²⁹ used spatiotemporal kriging with external drift to explore spatio-temporal variations of PM₁₀ concentrations in Ankara, Turkey. The main drawback of these studies is that they interpolated only the PM₁₀ concentration measured at the air quality stations, without considering the impacts of the spatio-temporal variations of meteorological factors on PM₁₀ variation. As a result, it is crucial to develop an interpolation technique that can account for information from both air quality stations and meteorological data.

With the increasing population and rapid urbanization, air quality in Vietnam, especially in large cities like Hanoi, has been significantly degraded. Hopke et al.³⁰ indicated that Hanoi was one of the cities with the worst air quality in Asia. Saksena et al.³¹ showed that the average PM₁₀ concentration in the streets of Hanoi could reach up to 455 μg/m³, which is much higher than the Vietnamese daily standard for the PM₁₀ concentration (150 μg/m³). This has had a negative effect on the city’s public health. It was reported that ambient and in-house air pollution was becoming the major cause of environment-related deaths in Vietnam, second only to smoking³². As a result, there has been increasing attention and demand from both the local community and the government of Hanoi for a study on air quality and its controlling factors, with PM_10–2.5 concentration prediction being the top priority.

There are two objectives to this study. The first objective is to develop a hybrid mapping approach that combines the data-driven model and IDW interpolation to produce the PM₁₀ concentration maps from global meteorological data. The second objective is to employ this approach to construct the monthly PM₁₀ maps for the central districts of Hanoi and analyze its spatio-temporal variations.

Methodology and material

In this study, we developed a hybrid approach that combines data-driven models and interpolation techniques to construct the PM₁₀ concentration maps from global meteorological data. As shown in Fig. 1, this approach consists of two main steps, namely, (1) development of data-driven models at each air quality monitoring station and (2) construction of PM₁₀ maps from global meteorological data using the IDW interpolation technique. Details of these two steps are presented below.

Development of data-driven models

Before developing data-driven models, a set of input features derived from meteorological factors were constructed. After that, the data-driven models linking these selected input features and the PM₁₀ concentration using two machine-learning algorithms, namely, MLR and ANN were developed. Finally, based on their performance, the more accurate models were selected for mapping the PM₁₀ concentration.

Construction of input features

Multiple meteorological factors can be considered in data-driven models to predict the PM₁₀ concentration. However, based on the availability of meteorological data at the air quality monitoring stations and their correlation with the PM₁₀ concentration, the following variables were taken into account: mean daily temperature, maximum daily temperature, minimum daily temperature, mean daily humidity, mean daily wind speed and mean daily atmospheric pressure. Next, we assumed that the PM₁₀ concentration was linked to meteorological factors by a quadratic function as follows:

$$PM_{10} = f(X_{i} , \;X_{i}^{2} , \;X_{i} X_{j} )\quad i,\;j = 1, 2, \ldots 6,\;\;i \ne j$$

(1)

in which X_i (i = 1, 2, …, 6) is the mean daily pressure (X₁), mean daily temperature (X₂), mean daily humidity (X₃), mean daily wind speed (X₄), maximum daily temperature (X₅) and minimum daily temperature (X₆). In total, 27 features are considered in Eq. (1): 6 linear terms, 6 squared terms and 15 pairwise products. Since this number of features is large, features with low correlation coefficients with the PM₁₀ concentration, or with close correlation to previously selected features, were removed from the equation.
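The feature expansion of Eq. (1) can be sketched in a few lines of Python. This is only an illustration, not the authors' MATLAB code; the variable names in `NAMES` and the sample values are hypothetical.

```python
from itertools import combinations

# Hypothetical short names for the six daily meteorological variables X1..X6.
NAMES = ["pressure", "t_mean", "humidity", "wind", "t_max", "t_min"]


def quadratic_features(x):
    """Expand one day's six values into linear, squared and pairwise-product terms."""
    if len(x) != len(NAMES):
        raise ValueError("expected six meteorological values")
    feats = {}
    for name, value in zip(NAMES, x):
        feats[name] = value                       # X_i
    for name, value in zip(NAMES, x):
        feats[name + "^2"] = value * value        # X_i^2
    for (i, a), (j, b) in combinations(enumerate(x), 2):
        feats[NAMES[i] + "*" + NAMES[j]] = a * b  # X_i X_j with i < j
    return feats


# Made-up values for a single day, in the order given by NAMES.
example = quadratic_features([1012.0, 24.5, 78.0, 2.1, 29.0, 21.0])
print(len(example))  # 6 linear + 6 squared + 15 cross terms
```

Counting each unordered pair once is what gives 15 cross terms rather than 30.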

Next, the selected input features and PM₁₀ concentration were standardized to avoid the effects of differences in scale of features that could significantly influence the performance of regression models. After standardization, both input features and PM₁₀ concentration were unitless values with a mean of 0 and a standard deviation of 1. The standardized features and PM₁₀ concentration were used as inputs and outputs for the data-driven models.
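A minimal Python sketch of this standardization follows. It assumes the population standard deviation is used (the text does not say which variant was applied), and the sample PM₁₀ values are made up.

```python
from statistics import mean, pstdev


def standardize(values):
    """Return z-scores plus the mean and standard deviation used."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        raise ValueError("constant series cannot be standardized")
    return [(v - mu) / sigma for v in values], mu, sigma


def destandardize(z, mu, sigma):
    """Map a standardized prediction back to physical units."""
    return z * sigma + mu


pm10 = [42.0, 55.0, 61.0, 38.0, 74.0]  # made-up daily values in ug/m3
z, mu, sigma = standardize(pm10)
```

Keeping `mu` and `sigma` matters in practice: a model trained on standardized PM₁₀ returns standardized predictions, which have to be mapped back to μg/m³ before they can be drawn on a map.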

Development of data-driven models

Multiple linear regression model

MLR is a statistical technique used to find a linear relationship between a response variable (dependent variable) and explanatory variables (independent variables). It is one of the most common methods to generalize the relationship of PM concentration with its determinants²³. Generally, the MLR model is defined by the following equation:

$$y_{i} = c_{0} + c_{1} x_{i1} + c_{2} x_{i2} + \cdots + c_{n} x_{in} + \epsilon$$

(2)

in which $y_{i}$ is the dependent (response) variable; $x_{i1}, \ldots, x_{in}$ are the independent (explanatory) variables; $c_{0}$ is the intercept; $c_{1}, \ldots, c_{n}$ are the slope coefficients of the independent variables; and $\epsilon$ is the error term. In this study, $y_{i}$ is the standardized PM₁₀ concentration, and $x_{i1}, \ldots, x_{in}$ are the standardized features constructed in the previous step. At each air quality monitoring station, a data-driven model based on the MLR algorithm was developed from measurements of the PM₁₀ concentration and the meteorologically derived input features using the method of least squares.
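For reference, the least-squares estimate has a standard closed form. Write the m daily samples at a station as a design matrix $\mathbf{X}$ whose i-th row is $(1, x_{i1}, \ldots, x_{in})$, and the standardized PM₁₀ values as a vector $\mathbf{y}$. The symbols $\mathbf{X}$, $\mathbf{y}$, $\mathbf{c}$ and m are notation introduced here for illustration, not taken from the paper. The coefficients then minimize the sum of squared residuals:

$$\hat{\mathbf{c}} = \arg\min_{\mathbf{c}} \sum_{i = 1}^{m} \left( y_{i} - c_{0} - \sum_{j = 1}^{n} c_{j} x_{ij} \right)^{2} = \left( \mathbf{X}^{\top} \mathbf{X} \right)^{-1} \mathbf{X}^{\top} \mathbf{y}$$

The closed form holds when $\mathbf{X}^{\top} \mathbf{X}$ is invertible, a condition that is easier to satisfy once strongly correlated features have been removed.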

Artificial neural network model

In this study, a feed-forward neural network (FFNN) model, a common ANN architecture, was employed to build data-driven models for each air quality monitoring station, using the same input and output data as the MLR models for comparison. The structure of an FFNN model consists of three layers (input layer, hidden layer and output layer). We refer the reader to Sanger³³ for more detailed information about the FFNN algorithm.

In order to develop the FFNN data-driven models, the input and output data were randomly sampled into 3 sub-sets with 70% of data for training, 15% for validation and 15% for testing. Since the number of nodes in the input and output layers was determined, the determination of the FFNN structure focused on determining the number of hidden nodes. In this study, the trial-and-error method was used to find the number of hidden nodes for each air quality monitoring station.
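The random 70/15/15 partition can be sketched as follows in Python. The function name, the fixed seed and the sample size of 200 are assumptions made for illustration only.

```python
import random


def split_indices(n, train=0.70, val=0.15, seed=0):
    """Shuffle 0..n-1 and cut it into training, validation and test index lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(round(train * n))
    n_val = int(round(val * n))
    # Whatever remains after the first two cuts becomes the test set.
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


train_idx, val_idx, test_idx = split_indices(200)
print(len(train_idx), len(val_idx), len(test_idx))
```

Splitting indices rather than the data itself keeps the input features and the PM₁₀ targets aligned when both are sliced with the same lists.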

Development of a hybrid interpolation approach for PM₁₀ concentration mapping

Based on the data-driven models developed in the previous section, this study constructed the monthly PM₁₀ concentration maps from meteorological data using a new approach based on the IDW interpolation method. In order to predict the PM concentration at a given location (interpolated location) from surrounding air quality monitoring stations, the IDW method determines the weighting factors of each station as below:

$$w_{ik} = \frac{1/d_{ik}^{2}}{\sum_{j = 1}^{N} 1/d_{jk}^{2}}$$

(3)

in which w_ik is the weighting factor of the ith station at the kth interpolated location, d_ik is the distance from the ith station to the kth interpolated location, and N is the total number of air quality monitoring stations used for interpolation. The weighting factors of all stations sum to 1 at each interpolated location. Using the weighting factors, the PM₁₀ concentration was estimated as below:

$$PM_{10}^{k} = \sum_{i = 1}^{N} w_{ik} f_{i}^{{PM_{10} }} ({\varvec{X}}_{k} )$$

(4)

in which $f_{i}^{{PM_{10} }}$ is the data-driven model developed for the ith station, and ${\varvec{X}}_{k}$ is the input feature vector derived from meteorological data at the kth interpolated location. The novelty of this approach is that, instead of interpolating the PM₁₀ concentration values measured at the air quality monitoring stations as in traditional approaches, it feeds the meteorological data at the interpolated location into the data-driven model of every station and weights the resulting predictions. This hybrid approach allows us to consider the impacts of both local conditions (via the data-driven model developed for each station) and spatio-temporal variations of meteorological factors.
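The hybrid scheme can be sketched in Python as below. This is an illustration and not the authors' MATLAB implementation: it assumes normalized inverse-square-distance weights summed over all stations, planar coordinates, and two toy station models with hypothetical coefficients standing in for the fitted models.

```python
import math


def idw_weights(station_xy, point_xy, power=2.0):
    """Normalized inverse-distance weights of every station for one grid point."""
    inv = []
    for i, (sx, sy) in enumerate(station_xy):
        d = math.hypot(sx - point_xy[0], sy - point_xy[1])
        if d == 0.0:
            # The point coincides with station i: give that station all the weight.
            return [1.0 if j == i else 0.0 for j in range(len(station_xy))]
        inv.append(1.0 / d ** power)
    total = sum(inv)
    return [v / total for v in inv]


def hybrid_pm10(station_xy, station_models, point_xy, features_at_point):
    """Weighted sum of each station model evaluated on the grid point's own features."""
    weights = idw_weights(station_xy, point_xy)
    return sum(w * model(features_at_point)
               for w, model in zip(weights, station_models))


# Toy stand-ins for two fitted station models (hypothetical coefficients).
models = [lambda x: 50.0 + 2.0 * x[0], lambda x: 60.0 - 1.0 * x[0]]
stations = [(0.0, 0.0), (2.0, 0.0)]
estimate = hybrid_pm10(stations, models, (1.0, 0.0), [1.5])
```

The key difference from plain IDW is in `hybrid_pm10`: each station contributes its model's prediction for the grid cell's meteorology, not its own measured concentration.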

All the above steps, including feature construction, development of the MLR- and ANN-based models and mapping of the PM₁₀ concentration, were programmed in MATLAB (the code is provided in the Supplementary document). This programming language allows for quick implementation of the algorithms and easy visualization of the results without any additional software.

The method developed in this study can be applied to other cities where meteorological and PM₁₀ concentration observations are available. However, the selection of meteorological factors and the development of data-driven models in this study relied purely on data from air quality monitoring stations in the city of Hanoi. As a result, the fitted data-driven models may not be transferable to other cities; they should be redeveloped for each city based on the observation data available there.

Study area

Hanoi is the capital city of Vietnam, with an area of 3358 km² (following the administrative expansion in 2008) and more than 7.4 million people in 2017³⁴. The study area consists of eleven central districts of Hanoi (Fig. 2). This is the most crowded area of Hanoi, where 41% of the city's population resides in 7.7% of the total city area. With a high population density and a large number of vehicles and construction activities, this area has suffered severely from air pollution. As for the weather conditions, Hanoi has four distinct seasons: spring (March–May), summer (June–August), autumn (September–November) and winter (December–February). In winter, the weather is cold and dry, while summer has high humidity and rainfall³⁵. With lower temperature and humidity, the PM₁₀ concentration in winter is much higher than in the other seasons (Fig. 3). While the 24 h PM₁₀ concentration is mostly below the limit of the National Technical Regulation on Ambient Air Quality (QCVN 05:2013/BTNMT, PM₁₀ = 150 μg/m³), there are many days in winter on which the PM₁₀ concentration exceeds this standard. Due to the harmful impacts of high PM₁₀ concentration on human health and the importance of the study area, it is necessary to construct high-resolution PM₁₀ concentration maps for this area in order to provide more detailed air quality information for the local residents who are most likely to be affected.

Data availability

The data used in this study were collected from two sources corresponding to the two objectives. For the development of the data-driven models, input data were collected from 11 air quality monitoring stations located across the study area (Fig. 2). These stations include three fixed stations (Minh Khai, Trung Yen 3 and Nguyen Van Cu) and eight sensor stations (Table 1). Nguyen Van Cu station is under the management of the Vietnam Environment Administration. The remaining stations are under the management of the Hanoi Department of Environmental Protection. Hourly PM₁₀ concentration and meteorological data (atmospheric pressure, temperature, humidity, wind speed) at all stations from 01/06/2017 to 31/12/2018 were collected. This dataset covers one and a half years, and therefore can represent the temporal variations of the PM₁₀ concentration and meteorological factors over the four seasons of the year. In addition to PM₁₀, other air pollutants were also collected, although they were not considered in the scope of this study. The hourly PM₁₀ concentration and meteorological data were averaged to generate a daily dataset in order to reduce measurement errors and remove diurnal variation.
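The hourly-to-daily aggregation could look like the following Python sketch. How gaps in the hourly record were handled is not described in the text, so skipping missing values is an assumption here, and the records are made up.

```python
from collections import defaultdict
from datetime import datetime


def daily_means(hourly):
    """Average (timestamp, value) pairs by calendar day, skipping missing values."""
    buckets = defaultdict(list)
    for stamp, value in hourly:
        if value is not None:
            buckets[stamp.date()].append(value)
    return {day: sum(vals) / len(vals) for day, vals in sorted(buckets.items())}


records = [
    (datetime(2017, 6, 1, 0), 40.0),
    (datetime(2017, 6, 1, 1), None),   # gap in the sensor record
    (datetime(2017, 6, 1, 2), 50.0),
    (datetime(2017, 6, 2, 0), 30.0),
]
daily = daily_means(records)
```

A stricter variant might discard days with too few valid hours, since a daily mean built from a handful of readings would retain part of the diurnal cycle.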

Table 1 Summary information of air quality monitoring stations.

Since data were available at only 11 monitoring stations, the PM₁₀ concentration calculated from these data alone could not represent its spatial variation across the study area. Therefore, high spatial resolution maps of the PM₁₀ concentration were needed. For mapping the monthly PM₁₀ concentration, we used the global meteorological data from the WorldClim 2.0 database (https://www.worldclim.org/), which contains temperature (mean, maximum, minimum), precipitation, solar radiation, vapor pressure, and wind speed data with a spatial resolution of 1 km². This is a reliable data source that was validated with gauged data (correlation coefficient with gauged data $r \ge 0.99$ for temperature and vapor pressure, $r \ge 0.86$ for precipitation and $r \ge 0.76$ for wind speed). After the global meteorological data were downloaded, they were clipped to the study area. Because the relative humidity was not available, it was calculated from the actual and saturated vapor pressure. The atmospheric pressure was calculated from the location altitude and air temperature. Figure 4 shows the temperature, wind speed, relative humidity and air pressure in February obtained from the WorldClim database as an example. As shown in the figure, the data extracted from WorldClim reflect the spatial variation of meteorological factors well.
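The text does not state which formulas were used for these two derived variables. The Python sketch below shows one common choice, offered purely as an assumption: the Tetens approximation for saturation vapor pressure (in kPa, with temperature in °C) and the isothermal barometric formula with a standard sea-level pressure of 1013.25 hPa.

```python
import math


def saturation_vapor_pressure_kpa(temp_c):
    """Tetens approximation for saturation vapor pressure over water, in kPa."""
    return 0.6108 * math.exp(17.27 * temp_c / (temp_c + 237.3))


def relative_humidity_pct(actual_vp_kpa, temp_c):
    """Relative humidity as the ratio of actual to saturation vapor pressure, capped at 100."""
    rh = 100.0 * actual_vp_kpa / saturation_vapor_pressure_kpa(temp_c)
    return min(rh, 100.0)


def pressure_hpa(altitude_m, temp_c, sea_level_hpa=1013.25):
    """Isothermal barometric formula; the sea-level pressure is assumed, not measured."""
    g, molar_mass, gas_const = 9.80665, 0.0289644, 8.31446  # SI units
    temp_k = temp_c + 273.15
    return sea_level_hpa * math.exp(-g * molar_mass * altitude_m / (gas_const * temp_k))
```

Because central Hanoi lies close to sea level, a pressure field derived this way would vary only slightly across the study area, and mostly with terrain height.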

Results

Construction of input features for data-driven models

The meteorological data collected from 01/06/2017 to 31/12/2018 were used to construct the input features for the data-driven models. The total number of features considered in this study was 27. In order to reduce this number, the correlation coefficients between each feature and the PM₁₀ concentration, and between pairs of features, were estimated. Figure 5 presents the correlation matrix, which shows the correlation coefficients of the input features with each other and with the PM₁₀ concentration. The figure shows that the correlation coefficients between the input features and the PM₁₀ concentration range from − 0.46 to 0.46. All six meteorological factors (mean daily atmospheric pressure, mean daily temperature, mean daily humidity, mean daily wind speed, maximum daily temperature and minimum daily temperature) have a relatively high correlation with the PM₁₀ concentration, with absolute correlation coefficients greater than 0.24. Interestingly, of these meteorological factors, only the mean daily pressure is positively correlated with the PM₁₀ concentration, while the other factors have negative correlation coefficients. This implies that the PM₁₀ concentration increases with increasing mean daily pressure and with decreasing values of the other factors. The features with the highest correlation with the PM₁₀ concentration are the mean daily pressure (X₁) and its quadratic term ($X_{1}^{2}$) (correlation coefficient = 0.46), while the product of the mean daily pressure and the mean daily humidity (X₁X₃) has the lowest correlation (correlation coefficient = − 0.22).

In order to select input features for the data-driven models, we evaluated the correlation of each feature with the PM₁₀ concentration and with the other features. The mean daily pressure (X₁), mean daily temperature (X₂), mean daily humidity (X₃) and mean daily wind speed (X₄) are well correlated with the PM₁₀ concentration and are independent of each other. As a result, they were added to the input features. The maximum (X₅) and minimum daily temperature (X₆) are strongly correlated with the mean daily temperature, with correlation coefficients of 0.98. Hence, these two features and their associated features (X₁X₅, X₂X₅, X₃X₅, X₄X₅, X₅X₅, X₅X₆; X₁X₆, X₂X₆, X₃X₆, X₄X₆, X₅X₆, X₆X₆) were not included in the input features. Features X₁X₁, X₁X₂, X₁X₃, X₁X₄, X₁X₅ and X₁X₆, which are functionally correlated with X₁, were not considered either. Of the features associated with the mean daily temperature (X₂) and mean daily humidity (X₃), only X₂X₃, X₂X₄, and X₃X₄ are relatively independent of the other features and have a high correlation with the PM₁₀ concentration. Therefore, these features were selected as inputs for the data-driven models. In total, the input features consist of the seven terms X₁, X₂, X₃, X₄, X₂X₃, X₂X₄, and X₃X₄.
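The selection described above was made by inspecting the correlation matrix. An automated approximation of the same two criteria might look like the Python sketch below; the greedy ordering and both thresholds (`min_target_r`, `max_pair_r`) are hypothetical choices, not values from the paper, and the data are made up.

```python
from statistics import mean


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def select_features(features, target, min_target_r=0.2, max_pair_r=0.9):
    """Greedy filter: keep a feature if it tracks the target and is not redundant."""
    ranked = sorted(features, key=lambda k: abs(pearson(features[k], target)), reverse=True)
    kept = []
    for name in ranked:
        if abs(pearson(features[name], target)) < min_target_r:
            continue  # too weakly related to PM10
        if all(abs(pearson(features[name], features[k])) < max_pair_r for k in kept):
            kept.append(name)  # not a near-duplicate of anything already kept
    return kept


target = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
features = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 7.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0, 14.0],   # exactly 2 * a, hence redundant
    "c": [1.0, -1.0, 1.0, -1.0, 1.0, -1.0],
}
kept = select_features(features, target)
```

A purely automatic filter of this kind would not reproduce every judgment in the text, for example preferring the mean temperature over the maximum and minimum temperature for interpretability.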

Development of data-driven models

Multiple linear regression model

Using the input features selected in the previous section, the MLR-based model (Eq. 2) was written as below:

$$PM_{10} = c_{1} + c_{2} X_{1} + c_{3} X_{2} + c_{4} X_{3} + c_{5} X_{4} + c_{6} X_{2} X_{3} + c_{7} X_{2} X_{4} + c_{8} X_{3} X_{4}$$

(5)

in which the coefficients c_i (i = 1, …, 8) were determined from measurement data using the method of least squares. Compared with previous studies, which usually used the MLR algorithm to relate the PM₁₀ concentration to the meteorological factors alone, this study considered both the meteorological factors (X₁, X₂, X₃, X₄) and their pairwise product terms (X₂X₃, X₂X₄, X₃X₄), and therefore could account for the nonlinear quadratic form of this relationship. In addition, to account for the seasonal dependence of the PM₁₀ concentration on meteorological factors, this study built two MLR models, one for the winter and spring period and one for the summer and autumn period.

Artificial neural network model

Although the MLR-based data-driven models developed in this study accounted for the nonlinear and seasonal relationship between the PM₁₀ concentration and meteorological factors, they only considered a quadratic nonlinear relationship. For comparison, we employed the ANN algorithm, which can represent this relationship in a more flexible manner. The ANN model included three layers (input, hidden and output), in which the numbers of nodes in the input layer and output layer were equal to 7 (the number of input features) and 1 (PM₁₀), respectively. Using the trial-and-error method, we found that a hidden layer with 12 nodes was the most suitable for all air quality monitoring stations. Figure 6 illustrates the architecture of the ANN model used in this study.
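To make the 7-12-1 architecture concrete, the Python sketch below runs one forward pass through an untrained network of that shape. The tanh hidden activation, the linear output node and the random initial weights are assumptions for illustration; the text does not specify the activation functions or the training algorithm.

```python
import math
import random


def init_network(n_in=7, n_hidden=12, seed=0):
    """Random small weights for a 7-12-1 feed-forward network (untrained)."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    b2 = 0.0
    return w1, b1, w2, b2


def forward(x, params):
    """One forward pass: tanh hidden layer (an assumption) and a linear output node."""
    w1, b1, w2, b2 = params
    hidden = [math.tanh(sum(w * v for w, v in zip(row, x)) + b) for row, b in zip(w1, b1)]
    return sum(w * h for w, h in zip(w2, hidden)) + b2


def n_parameters(params):
    """Count weights and biases in both layers."""
    w1, b1, w2, _ = params
    return sum(len(row) for row in w1) + len(b1) + len(w2) + 1


params = init_network()
# Seven made-up standardized input features for one day.
y_std = forward([0.1, -0.3, 0.8, -1.2, 0.05, 0.4, -0.6], params)
```

With 7 inputs and 12 hidden nodes, such a network has 7 × 12 + 12 + 12 + 1 = 109 trainable parameters, compared with 8 coefficients in the MLR model of Eq. (5).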

In order to build the ANN model for each station, the measurement data were divided into three sub-datasets (70% for training, 15% for validation and 15% for testing) to limit overfitting. Figure 7 compares the modeled and measured PM₁₀ concentration in each sub-dataset and in the whole dataset at the Trung Yen 3 station as an example. The figure shows that the ANN model simulated the PM₁₀ concentration well in all sub-datasets, and therefore could be reliably used for predicting the PM₁₀ concentration.

Comparison of model performance

To evaluate the performance of the MLR and ANN models, we used three statistical indices, namely, Root Mean Squared Error (RMSE), correlation coefficient (r) and Nash–Sutcliffe Efficiency (NSE), which are formulated as below:

$$r = \frac{{\mathop \sum \nolimits_{i = 1}^{N} \left( {Y_{o}^{i} - \overline{{Y_{o} }} } \right)\left( {Y_{m}^{i} - \overline{{Y_{m} }} } \right)}}{{\sqrt {\mathop \sum \nolimits_{i = 1}^{N} \left( {Y_{o}^{i} - \overline{{Y_{o} }} } \right)^{2} } \sqrt {\mathop \sum \nolimits_{i = 1}^{N} \left( {Y_{m}^{i} - \overline{{Y_{m} }} } \right)^{2} } }}$$

(6)

$$RMSE = \sqrt {\frac{1}{N}\mathop \sum \limits_{i = 1}^{N} \left( {Y_{m}^{i} - Y_{o}^{i} } \right)^{2} }$$

(7)

$$NSE = 1 - \frac{{\mathop \sum \nolimits_{i = 1}^{N} \left( {Y_{m}^{i} - Y_{o}^{i} } \right)^{2} }}{{\mathop \sum \nolimits_{i = 1}^{N} \left( {Y_{o}^{i} - \overline{{Y_{o} }} } \right)^{2} }}$$

(8)

where N is total number of data points; Y_m is the modeled data, Y_o is the observed data. The correlation coefficient ranges from − 1 to 1 in which the higher value corresponds to the closer positive relationship between the modeling and measurement. RMSE measures the differences between modeling and measurement. The lower RMSE indicates a better agreement between the modeling and measurement. NSE varies in the range from $- \infty$ to 1 in which NSE = 1 indicates a perfect match between the modeling and measurement, NSE ≤ 0 implies that the model predictions have the same or lower accuracy than the mean of measurements.
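Equations (6) to (8) translate directly into code. The Python sketch below follows the three definitions term by term; the observed and modeled series are made-up numbers used only to exercise the functions.

```python
import math


def pearson_r(obs, mod):
    """Correlation coefficient between observed and modeled series, Eq. (6)."""
    n = len(obs)
    mo, mm = sum(obs) / n, sum(mod) / n
    num = sum((o - mo) * (m - mm) for o, m in zip(obs, mod))
    den = math.sqrt(sum((o - mo) ** 2 for o in obs)) * math.sqrt(sum((m - mm) ** 2 for m in mod))
    return num / den


def rmse(obs, mod):
    """Root mean squared error, Eq. (7)."""
    return math.sqrt(sum((m - o) ** 2 for o, m in zip(obs, mod)) / len(obs))


def nse(obs, mod):
    """Nash-Sutcliffe efficiency, Eq. (8)."""
    mo = sum(obs) / len(obs)
    return 1.0 - sum((m - o) ** 2 for o, m in zip(obs, mod)) / sum((o - mo) ** 2 for o in obs)


observed = [40.0, 55.0, 62.0, 48.0, 70.0]   # made-up values
modeled = [44.0, 50.0, 60.0, 52.0, 65.0]
print(pearson_r(observed, modeled), rmse(observed, modeled), nse(observed, modeled))
```

Predicting the observed mean everywhere gives NSE = 0, which is the baseline behind the statement that NSE ≤ 0 means a model is no better than the mean of the measurements.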

Table 2 compares the indices for both data-driven models. The table indicates that the ANN models performed much better than the MLR models at all eleven stations. The ANN outputs are well correlated with the measurement data, with an average correlation coefficient of 0.7, and no station has a correlation coefficient below 0.65. Meanwhile, the average correlation coefficient of the MLR outputs with measurements is 0.58, with the Hoan Kiem station having the lowest coefficient (r = 0.51). As for the modeling errors, the average RMSE values of the ANN and MLR models are 14.1 and 15.6, respectively. The NSE ranges from 0.41 to 0.57 for the ANN models and from 0.26 to 0.53 for the MLR models. It is clear that the differences between modeling and measurement are lower for the ANN models than for the MLR models. The reason is that the ANN algorithm accounts for more complicated interactions between the inputs and the output than the MLR model. Because of their better performance, the ANN models were employed for mapping the PM₁₀ concentration.

Table 2 Performance comparison between ANN model and MLR model.

Of all air quality monitoring stations, the performance of both the MLR and ANN models at the Trung Yen 3, Nguyen Van Cu and Minh Khai stations is much better than at the others. Indeed, the average correlation coefficient and NSE of the ANN models for these three stations are 0.74 and 0.55, respectively, while these indices for the other stations are much lower (0.68 for the correlation coefficient and 0.46 for the NSE). This could be explained by the fact that Trung Yen 3, Nguyen Van Cu and Minh Khai are the main stations of Hanoi, which are frequently checked and subjected to quality control. As a result, the quality of measurement data at these stations is better than at the others.

Monthly PM₁₀ concentration mapping

The monthly PM₁₀ concentration for the study area of eleven central districts in Hanoi was mapped from the WorldClim global meteorological data using the ANN models developed in the previous section and the hybrid interpolation approach (Eq. 4). Figures 8 and 9 present the monthly and seasonal maps of the PM₁₀ concentration. The spatial resolution of these maps is equal to that of the meteorological data (1 km²). The seasonal PM₁₀ concentration maps were generated by aggregating the monthly maps.

As for the temporal variation, the PM₁₀ concentration peaks in the cold months (January, February, November and December). For example, the mean PM₁₀ concentration in November is up to 71 μg/m³. The low temperature (ranging from 16 to 21 °C in these months) and the temperature inversion phenomenon in winter are likely causes of the high PM₁₀ concentration in these months. By contrast, when the air temperature reaches its highest level in June–August (~ 29 °C), the PM₁₀ concentration hits a trough, ranging from 46 to 48 μg/m³. As regards the seasonal variation, Fig. 9 shows that the PM₁₀ concentration is highest in winter and lowest in summer. In addition to the air temperature, the humidity and wind speed, which are lowest in winter, also contribute to the higher PM₁₀ concentration in winter than in the other seasons.

As for the spatial variation, the concentration of PM₁₀ in Long Bien district, which is situated in the northeast of the study area, is much lower than in the other districts. The main reason is that the population density of this district is the lowest in the study area (around 4.5 thousand people/km², versus 11.6 thousand people/km² in other districts). This lower population density leads to less intensive traffic. Besides, due to the sparse air quality monitoring network, the PM₁₀ concentration in Long Bien district strongly depends on the PM₁₀ concentration at the Nguyen Van Cu station, which is situated relatively far away from transportation routes. In addition, as pointed out by Nghiem et al.³⁶, since the inauguration of Vinh Tuy Bridge in 2010 and Nhat Tan Bridge in 2015, the flow of vehicles through Nguyen Van Cu road has decreased. As a result, the annual average PM₁₀ concentration at the Nguyen Van Cu station declined slightly from 2010 to 2018, which contributes to the lower PM₁₀ concentration in this district. On the contrary, the highest concentration is found at the Pham Van Dong station, located in the northwest of the study area. This station is placed on Pham Van Dong Street, one of the main routes to access Hanoi from Noi Bai International Airport, so the traffic on this street is normally quite intensive. Besides, a high number of active construction works in this area might be an important factor in the increased level of PM₁₀. Meteorological conditions also influence the spatial distribution of the PM₁₀ concentration. The highest PM₁₀ concentration in the northwest region is partly caused by the low air temperature in this region (see Fig. 4). However, Fig. 8 shows that the impact of local factors (e.g., street, population, transportation intensity) on the spatial variation of the PM₁₀ concentration is larger than that of meteorological factors.

Conclusion

In this study, an approach combining data-driven models with the IDW interpolation technique was developed to construct PM₁₀ concentration maps for the central area of Hanoi. The construction of the data-driven models consisted of two steps: feature construction and model development. The feature construction step derives optimal features from meteorological factors. By evaluating the correlation between the PM₁₀ concentration and each feature, as well as the correlation between features, a set of features was selected as the input for the data-driven models. The model development step built the data-driven models that link the PM₁₀ concentration with the input features using the MLR and ANN algorithms for each air quality monitoring station. The results indicate that the ANN-based data-driven models performed much better than the MLR-based models. To construct the PM₁₀ concentration maps, the IDW interpolation technique was used to calculate the weighting factor of each air quality monitoring station. While many other studies obtained the unknown PM₁₀ concentration by interpolating the PM₁₀ concentration at the air quality monitoring stations without considering meteorological factors, this study accounted for the meteorological factors in the data-driven models. With this approach, both the local PM₁₀ concentration surrounding the monitoring stations and the dependence of PM₁₀ on meteorological factors were taken into account, which provided a better representation of the current situation in the study area.
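The combination described above can be illustrated with a minimal sketch. It assumes that the estimate at a grid point is the IDW-weighted sum of every station model evaluated with the meteorology of that grid point; the function names, the distance power of 2, the station coordinates and the toy linear models are all hypothetical and are not taken from the study.

```python
import numpy as np

def idw_weights(point, stations, power=2.0, eps=1e-12):
    """Normalized inverse-distance weights of each station for one grid point."""
    d = np.linalg.norm(stations - point, axis=1)
    if np.any(d < eps):
        # A grid point that coincides with a station uses that station only.
        w = (d < eps).astype(float)
    else:
        w = 1.0 / d**power
    return w / w.sum()

def hybrid_pm10(point, meteo_at_point, stations, station_models, power=2.0):
    """Blend all station models, each evaluated with the grid point's meteorology."""
    w = idw_weights(point, stations, power)
    preds = np.array([model(meteo_at_point) for model in station_models])
    return float(w @ preds)

# Hypothetical setup: three stations (coordinates in km) with toy linear models
# standing in for the fitted per-station ANN or MLR models.
stations = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 3.0]])
station_models = [
    lambda m: 90.0 - 1.5 * m["temp_c"],
    lambda m: 80.0 - 1.0 * m["temp_c"],
    lambda m: 70.0 - 0.8 * m["temp_c"],
]

estimate = hybrid_pm10(np.array([1.0, 1.0]), {"temp_c": 20.0}, stations, station_models)
```

Because the weights are normalized, the estimate always lies between the smallest and largest station-model prediction for the given meteorology, and it reduces to a single station's model at that station's location.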

Due to a lack of meteorological data with high spatial resolution, this study used the monthly WorldClim data at 1 km² resolution as the input to predict the monthly PM₁₀ concentration by combining the established data-driven models with the interpolation method. The monthly PM₁₀ maps were then aggregated to construct seasonal maps. The temporal analysis revealed that the PM₁₀ concentration was highest in the winter months and lowest in the summer months, which was mainly caused by the negative dependence of the PM₁₀ concentration on air temperature and by low humidity. The spatial analysis indicated that the northeast region had the lowest PM₁₀ concentration because urbanization in this region was less developed than in the others. The northwest region had the highest PM₁₀ concentration because of its high population and the ongoing construction of new buildings and roads, which together elevated the PM₁₀ concentration. The meteorological factors also influenced the spatial variation of the PM₁₀ concentration, but to a lesser degree than the local sources of PM₁₀ surrounding the monitoring stations. This study also pointed out that, although the spatial variation of meteorological factors was taken into account, the low density of air quality monitoring stations might reduce the accuracy of the PM₁₀ concentration maps. Hence, it is necessary to establish a denser network of air quality monitoring stations to better cover the spatial variation of the PM₁₀ concentration. The approach developed in this study can be applied to produce forecast PM₁₀ concentration maps from predicted meteorological information. These results could also provide a meaningful foundation for the local authority in deriving and implementing air quality management activities and urban planning in Hanoi.
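The aggregation of monthly maps into seasonal maps can be sketched as follows. This is only an illustration that assumes a simple cell-wise mean over the months of each season; the aggregation rule actually used is not stated here. The winter and summer month groupings follow the text, the other seasons are left out because their months are not defined in this section, and the constant-valued rasters are hypothetical.

```python
import numpy as np

# Winter and summer months as given in the text; other seasons omitted.
SEASONS = {"winter": [11, 12, 1, 2], "summer": [6, 7, 8]}

def seasonal_maps(monthly_maps, seasons=SEASONS):
    """Average monthly rasters (dict: month -> 2-D array) into one raster per season."""
    return {
        name: np.nanmean(np.stack([monthly_maps[m] for m in months]), axis=0)
        for name, months in seasons.items()
    }

# Hypothetical 2 x 2 rasters, one constant value per month (not real data).
monthly = {m: np.full((2, 2), float(50 + m)) for m in range(1, 13)}
maps = seasonal_maps(monthly)
```

Using `np.nanmean` lets grid cells that are missing in one month still receive a seasonal value from the remaining months.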

References

Pope, C. A. III., Ezzati, M. & Dockery, D. W. Fine particulate air pollution and life expectancy in the United States. N. Engl. J. Med. 360, 376–386 (2009).
Kunzli, N., Perez, L. & Rapp, R. Air Quality and Health (ERS, Lausanne, 2010).
Dockery, D. W. et al. An association between air pollution and mortality in six United-States cities. N. Engl. J. Med. 329, 1753–1759 (1993).
Laden, F., Schwartz, J., Speizer, F. E. & Dockery, D. W. Air pollution and mortality: a continued follow-up in the Harvard six cities study. Epidemiology 12, S81 (2001).
Pope, C. A. et al. Particulate air pollution as a predictor of mortality in a prospective study of United-States adults. Am. J. Respir. Crit. Care Med. 151, 669–674 (1995).
Samet, J. M., Dominici, F., Curriero, F., Coursac, I. & Zeger, S. L. Fine Particulate air pollution and mortality in 20 US cities, 1987–1994. N. Engl. J. Med. 343(24), 1742–1749 (2000).
Samet, J. M. et al. The National Morbidity, Mortality, and Air Pollution Study. Part II: morbidity and mortality from air pollution in the United States. Res. Rep. Health Effects Inst. 94, 5–70 (2000).
Katsouyanni, K. et al. Short term effects of ambient sulphur dioxide and particulate matter on mortality in 12 European cities: Results from time series data from the APHEA project. BMJ 314, 1658–1663 (1997).
Seethaler, R. Health Costs due to Road Traffic-related Air Pollution, Air Pollution Attributable Cases. An impact assessment project of Austria, France and Switzerland, Prepared for the WHO-Ministerial Conference on Environment and Health, London, 16–18 June 1999. Synthesis. Berne, Paris (1999).
Fuchs, V. R. & Frank, S. R. Air pollution and medical care use by older Americans: a cross-area analysis. Health Affairs (Millwood) 21(6), 207–214 (2002).
World Bank. The cost of air pollution. Strengthening the Economic Case for Action. The World Bank and Institute for Health Metrics and Evaluation, University of Washington, Seattle (2016).
Zhao, X. et al. Seasonal and diurnal variations of ambient PM2.5 concentration in urban and rural environments in Beijing. Atmos. Environ. 43, 2893–2900 (2009).
Duo, B. et al. Observations of atmospheric pollutants at Lhasa during 2014–2015: pollution status and the influence of meteorological factors. J. Environ. Sci. 63, 28–42 (2018).
Li, X., Ma, Y., Wang, Y., Liu, N. & Hong, Y. Temporal and spatial analyses of particulate matter (PM10 and PM2.5) and its relationship with meteorological parameters over an urban city in northeast China. Atmos. Res. 198, 185–193 (2017).
Hartog, J. J. et al. Relationship between different size classes of particulate matter and meteorology in three European cities. J. Environ. Monit. 7, 302–310 (2005).
Srimuruganandam, B. & Nagendra, S. Impact of meteorology on roadside ambient particulate matter concentrations. Mod. Traffic Transp. Eng. Res. 2(3), 141–152 (2013).
Giri, D., Krishna Murthy, V. & Adhikary, P. R. The influence of meteorological conditions on PM10 concentrations in Kathmandu Valley. Int. J. Environ. Res. 2(1), 49–60 (2003).
Wang, J. & Ogawa, S. Effects of meteorological conditions on PM2.5 concentrations in Nagasaki, Japan. Int. J. Environ. Res. Public Health 12, 9089–9101 (2015).
Liu, Z. et al. Seasonal and diurnal variation in particulate matter (PM10 and PM2.5) at an urban site of Beijing: analyses from a 9-year study. Environ. Sci. Pollut. Res. 22, 627–642 (2015).
Koutrakis, P. et al. Analysis of PM10, PM2.5, and PM2.5–10 concentrations in Santiago, Chile, from 1989 to 2001. J. Air Waste Manag. Assoc. 55, 342–351 (2005).
Clements, N., Hannigan, M. P., Miller, S. L., Peel, J. L. & Milford, J. B. Comparisons of urban and rural PM10–2.5 and PM2.5 mass concentrations and semi-volatile fractions in northeastern Colorado. Atmos. Chem. Phys. 16, 7469–7484 (2016).
Chellali, M., Abderrahim, H., Hamou, A., Nebatti, A. & Janovec, J. Artificial neural network models for prediction of daily fine particulate matter concentrations in Algiers. Environ. Sci. Pollut. Res. 23, 14008–14017 (2016).
Shahraiyni, H. T. & Sodoudi, S. Statistical modeling approaches for PM10 prediction in urban areas: a review of 21st-century studies. Atmosphere 7(2), 15 (2016).
Chaloulakou, A., Kassomenos, P., Spyrellis, N., Demokritou, P. & Koutrakis, P. Measurements of PM10 and PM2.5 particle concentrations in Athens, Greece. Atmos. Environ. 37(5), 649–660 (2003).
Wong, D. W., Yuan, L. & Perlin, S. A. Comparison of spatial interpolation methods for the estimation of air quality data. J. Exposure Sci. Environ. Epidemiol. 14(5), 404–415 (2004).
Perez, P. Combined model for PM10 forecasting in a large city. Atmos. Environ. 60, 271–276 (2012).
Li, L., Losser, T., Yorke, C. & Piltner, R. Fast inverse distance weighting-based spatiotemporal interpolation: a web-based application of interpolating daily fine particulate matter PM2.5 in the contiguous US using parallel programming and kd-tree. Int. J. Environ. Res. Public Health 11(9), 9101–9141 (2014).
Kim, S. Y. et al. Ordinary kriging approach to predicting long-term particulate matter concentrations in seven major Korean cities. Environ. Health Toxicol. 29, e2014012 (2014).
Raja, N. B., Aydin, O., Turkoglu, N. & Cicek, I. Characterising the seasonal variations and spatial distribution of ambient PM10 in Urban Ankara, Turkey. Environ. Process. 5(2), 349–362 (2018).
Hopke, P. K. et al. Urban air quality in the Asian region. Sci. Total Environ. 404(1), 103–112 (2008).
Saksena, S., Quang, T. N., Nguyen, T., Dang, P. N. & Flachsbart, P. Commuters’ exposure to particulate matter and carbon monoxide in Hanoi, Vietnam. Transp. Res. Part D Transp. Environ. 13(3), 206–211 (2008).
Global Burden of Disease (GBD). Visualizations 2013. Institute for Health Metrics and Evaluation (2013).
Sanger, T. D. Optimal unsupervised learning in a single-layer linear feedforward neural network. Neural Netw. 2(6), 459–473 (1989).
General Statistics Office of Vietnam (GSO). Vietnam Statistics Yearbook in 2017 (2018).
Hien, P. D., Bac, V. T., Tham, H. C., Nhan, D. D. & Vinh, L. D. Influence of meteorological conditions on PM2.5 and PM2.5–10 concentrations during the monsoon season in Hanoi, Vietnam. Atmos. Environ. 36, 3473–3484 (2002).
Nghiem, T.-D., Mac, D.-H., Nguyen, A.-D. & Le, N. C. An integrated approach for analyzing air quality monitoring data: a case study in Hanoi, Vietnam. Air Qual. Atmos. Health https://doi.org/10.1007/s11869-020-00907-6 (2020).

Acknowledgements

The authors gratefully acknowledge the Vietnam Environment Administration and the Hanoi Department of Natural Resources and Environment for providing the data, and the Irish Research Council COALESCE Research Fund 2019 (IRC-COALESCE-2020-31) under the HealthyAIR project.

Author information

Authors and Affiliations

Department of Science and Technology, Ministry of Natural Resources and Environment, 10 Ton That Thuyet Street, My Dinh 2 Ward, Nam Tu Liem District, Hanoi City, Vietnam
Dung Anh Nguyen
Water Resources Institute, 8 Phao Dai Lang Street, Lang Thuong Ward, Dong Da District, Hanoi City, Vietnam
Son Hong Duong, Phuong Anh Tran & Hai Hoang Cao
Department of Water Resources Engineering and Technology, Water Resources Institute, 8 Phao Dai Lang Street, Lang Thuong Ward, Dong Da District, Hanoi City, Vietnam
Phuong Anh Tran & Hai Hoang Cao
Air Pollution and Climate Change Research Center (APAC), Institute for Environment and Resources (IER), 142 To Hien Thanh, Ward 14, District 10, Ho Chi Minh City, Vietnam
Bang Quoc Ho
Vietnam National University-Ho Chi Minh City (VNU-HCM), Linh Trung Ward, Thu Duc District, Ho Chi Minh City, Vietnam
Bang Quoc Ho

Contributions

N.A.D. coordinated the research and developed the air pollution map. D.H.S. developed the methodology for this paper. T.A.P. analysed the data for this paper. C.H.H. analysed the meteorological data for this paper. H.Q.B. prepared the manuscript and developed the air pollution map.

Corresponding author

Correspondence to Bang Quoc Ho.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Nguyen, D.A., Duong, S.H., Tran, P.A. et al. Combination of data-driven models and interpolation technique to develop of PM10 map for Hanoi, Vietnam. Sci Rep 10, 19268 (2020). https://doi.org/10.1038/s41598-020-75547-y

Received: 11 June 2020
Accepted: 28 September 2020
Published: 06 November 2020
DOI: https://doi.org/10.1038/s41598-020-75547-y

This article is cited by

Statistical modeling approach for PM10 prediction before and during confinement by COVID-19 in South Lima, Perú
- Rita Jaqueline Cabello-Torres
- Manuel Angel Ponce Estela
- Javier Linkolk López-Gonzales
Scientific Reports (2022)
