Evolution and forecasting of PM10 concentration at the Port of Gijon (Spain)

Sánchez Lasheras, Fernando; García Nieto, Paulino José; García Gonzalo, Esperanza; Bonavera, Laura; de Cos Juez, Francisco Javier

doi:10.1038/s41598-020-68636-5


Article
Open access
Published: 16 July 2020


Scientific Reports volume 10, Article number: 11716 (2020)


Abstract

The name PM₁₀ refers to small particles with a diameter of less than 10 microns. The present research analyses different models capable of predicting PM₁₀ concentration using the previous values of PM₁₀, SO₂, NO, NO₂, CO and O₃ as input variables. The models were trained on data from January 2010 to December 2017. The models trained were autoregressive integrated moving average (ARIMA), vector autoregressive moving average (VARMA), multilayer perceptron neural networks (MLP), support vector machines as regressor (SVMR) and multivariate adaptive regression splines (MARS). Predictions were performed from 1 to 6 months in advance. The performance of the different models was measured in terms of root mean squared error (RMSE). For forecasting 1 month ahead, the best results were obtained with an SVMR model of six variables that gave an RMSE of 4.2649, but MLP results were very close, with an RMSE of 4.3402. For forecasts 6 months in advance, the best results correspond to an MLP model of six variables with an RMSE of 6.0873, followed by an SVMR model, also with six variables, with an RMSE of 6.1010. For forecasts both 1 and 6 months ahead, ARIMA outperformed VARMA models.


Introduction

The town of Gijón and its Port

Gijón is a town located on the north coast of Spain, in the Principality of Asturias. It is the most populated municipality of this region, with a total of 273,422 inhabitants according to the 2016 census. This town, together with Oviedo (220,648 inhabitants), Avilés (79,514 inhabitants) and other small towns, forms a metropolitan area with more than 850,000 inhabitants. It was founded in the fifth century B.C. During the twentieth century it underwent significant development due to industry, which is still of great importance to the local economy.

The weather in Gijón is defined by its proximity to the sea and the low mean altitude. The annual level of precipitation is quite high, with a total of 920 L per square meter and year. Regarding temperature, the coldest month is January, with an average temperature of 8.9 °C, while the hottest is August with 19.7 °C. The average annual temperature is 13.8 °C. Winds are sporadic and seasonal. The wind regime is dominated by two main components¹. During winter it blows from W-WSW, while in summer it comes from E-ENE on the coast.

The Port of Gijón, named El Musel, is one of the main ports of the Atlantic Arc and the leading port in the movement of solid bulk in Spain. It is located in the Cantabrian Sea (43°34′N, 5°41′W). Figure 1a shows its position on the North Atlantic Spanish coast and Fig. 1b is an aerial picture of the town, where the location of the port can be observed.

The commercial exploitation of this port started in 1907. In the 1990s a development plan doubled its area and led to a significant increase in its activity. Its infrastructure is adapted to modern market requirements in terms of drafts, docks and storage areas, and it offers a range of services with the best standards of quality. It has 415 hectares of land surface and 7,000 linear meters of dock, structured in areas with the appropriate characteristics to serve each kind of traffic, i.e. specialized terminals for solid bulks, liquids and containers, and multi-purpose facilities for various types of traffic.

In the beginning, the exports were mainly iron ore and coal. Subsequently, the port expanded its breakwaters and piers, and in the 1940s it became the main Spanish port in traffic movement. The industrial activity of the Principality of Asturias has its main ally in the Port of Gijón. Currently, it is the main bulk port in Spain and one of the most important ports of the Atlantic Arc. According to the traffic statistics of the Annual Report of 2018, a total of 18,226 ships entered the port during that year, which meant a total of 79,294 containers and 12.7 million tonnes in the dry bulk terminal, of which 6.4 corresponded to iron ore, 3.4 to iron steel and 2.8 to steam coal. The net revenue in 2018 was 42.2 million euros.

Pollution and particulate matter studies

The World Health Organisation has reported that air pollution has an adverse effect on people’s health and development². It is well-known that long-term exposure to high levels of air pollution is linked to decrements in lung function in children³. A Swiss study found increased levels of allergic sensitisation in adults living in proximity to busy roads for periods longer than 10 years⁴. Also, the PM₁₀ pollutant is amongst those regulated under the Air Quality Framework Directive on ambient air quality assessment and management⁵.

A continuous exposure to pollutants such as carbon monoxide (CO), carbon dioxide (CO₂), oxides of nitrogen (NOₓ), and particulate matter is reported to cause health problems in the population living in the affected areas⁶,⁷. Particulate matter is formed by different chemical products, mostly produced by anthropogenic processes⁶, and with significantly variable diameters. Their anthropogenic origin is the reason why they are more present in urban areas⁸ than in unpopulated areas.

Air quality issues are relevant in ports and nearby areas. In general, the duty cycle of marine vessels is longer than that of road vehicles. This means that ship engines generally use older technology than cars and, due to their engine power, they are also much more polluting⁹. Previous studies have analysed PM₁₀ concentrations in ports and coastal areas like the Bay of Algeciras in Southern Spain¹⁰. Another study analysed the impact of PM$_{2.5}$ particles from ship emissions in southern California¹¹. In Turkey, shipping emissions in the regions of Candarli Gulf¹² and Ambarli Port¹³, both with heavy shipping traffic, were investigated. Research carried out in the port¹⁴ of Tarragona, Spain, made use of multi-linear regression models to study the contribution of different harbour activities to the levels of particulate matter in its area. Along the same lines, another study¹⁵ performed in Barcelona's harbour, also located in Spain and about 80 km as the crow flies from Tarragona, estimated that around 50–55% of PM₁₀ and PM$_{2.5}$ concentrations measured at the port could be attributed to harbour activities, and that such activities contribute about 9–12% of the total PM₁₀ concentration and about 11–15% of PM$_{2.5}$ in the air of the metropolitan area of this city.

Another interesting and innovative study¹⁶ that deals with the problem of particulate matter in ports was performed in the port of Zhejiang. In this research, with the help of an unmanned aerial vehicle that integrated different sensors, the authors were able to create a profile of the vertical distribution of PM$_{2.5}$, PM₁₀ and total suspended particles from ground level to a height of 120 m. A study made at the port of Volos¹⁷, in Greece, found that the highest PM₁₀ concentration values were associated with days of calm winds, meaning a wind speed under $0.5\;\text{m/s}$. The only research into ports that made use of a machine learning methodology was the one concerning the port of Koper¹⁸. Koper is the only port in Slovenia and is located at the northern tip of the Adriatic Sea. Researchers made use of hourly PM₁₀ concentrations and employed k-means clustering with Euclidean and city-block distances to cluster days. The results obtained showed the influence of rain intensity and wind speed on the clusters obtained, but the influence of other pollutants was not studied. Finally, another study of interest was performed at the Port of Cork, which, like Gijón, is located on the Atlantic coast¹⁹.

Use of machine learning techniques to forecast pollutant concentrations

In general, machine learning can be understood as a subset of methodologies of the artificial intelligence field that are able to learn in an automatic way. In other words, they can learn from data and predict future events. Nowadays, the use of machine learning methodologies has extended to almost all branches of science, including environmental studies. One of the main reasons for the use of machine learning approaches for air quality forecasting is the ability of these methodologies to capture non-linear relationships among variables.

Interest in the forecasting of air pollution in urban areas dates back more than a century, to when large cities began to have problems with pollution²⁰. In the 1970s, several statistical models for pollution forecasting were proposed²¹,²². The first applications of machine learning methodologies in this field were in the 1990s. In those days most of the research performed made use of artificial neural networks²³,²⁴.

Since then, the different studies performed have made use of other techniques such as genetic algorithms²⁵, Hierarchical Agglomerative Clustering²⁶, k-means²⁷ or support vector machines as regressors²⁸.

Genetic algorithms have been employed as a supporting methodology for selecting the input variables and designing the high-level architecture of neural network models. In certain research works²⁵, they were applied to the selection of the architecture and input variables of a multilayer perceptron model for forecasting hourly concentrations of NO. One of the limitations of this technique is that training each neural network model is a time-consuming task and therefore the number of parameters to be tuned must be limited.

Hierarchical agglomerative clustering is employed to group objects that are similar in subsets called clusters. The agglomerative clustering methodology starts with many small clusters and merges them together to create large ones. It has been successfully applied in order to study ozone exposure and cardiovascular-related mortality in Canada²⁶. The results obtained showed that this methodology is useful for studying the long-term effects of air pollution on cardiovascular diseases.

A recent study has shown how k-means clustering can be employed to categorize different locations in a big and populated city, representing the variability of pollution according to the variables employed for the study²⁷. Finally, the use of support vector machines as a regressor has also been reported in some studies²⁸,²⁹. In one of these²⁸ the support vector machine is employed as a regressor model for the forecast of the daily Beijing air quality index from 1 January 2014 to mid-2016, while in the others²⁹,³⁰ they are employed for the forecast of the daily average PM₁₀.

The aim of the present research is to forecast the air quality in a port area, specifically in the port area of the city of Gijón. For this purpose, the article applies different machine learning models (multilayer perceptron neural networks, support vector machine as regressor and MARS) and compares the performance of the predictions obtained for different time intervals with those given by two time series methodologies, one of them univariate (ARIMA) and the other multivariate (VARMA). This means that an exhaustive comparison is made of the performance of five methods for predictions from 1 to 6 months in advance. This provides an interesting framework for the comparison of methodologies. All these methods have been employed in the past for pollution forecasting, but never all in the same research, as far as the authors know. Therefore, the relevance of the present research is that it deals with the topic of monitoring air quality in a city, comparing different machine learning methodologies applied to the same data set.

The database

The information employed for this research has been obtained from one of the meteorological stations belonging to the Air Quality Monitoring network of the Government of the Principality of Asturias, and more specifically from the one closest to the Port of Gijón, which is located at Argentina Avenue. This station records environmental measurements hourly. As is usual in databases of this kind, some values were missing: about 0.23% of the raw observations taken every 15 min for all variables. They were imputed with the help of the Multivariate Imputation by Chained Equations (MICE) algorithm³¹.

Table 1 shows the minimum, mean, maximum and standard deviation of the pollutants measured at Gijón Port for the period of study. The values considered for the present research were average monthly measurements from January 2010 to June 2018. Information from January 2010 to December 2017 was employed to forecast values from January to June 2018. Pollutants measured at the Port of Gijón were SO₂, NO, NO₂, CO, O₃ and PM₁₀.
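As a rough illustration of the data preparation described above, the following sketch averages raw readings by calendar month and splits the study period into training months (January 2010 to December 2017) and forecast months (January to June 2018). The function names and the sample readings are hypothetical and are not taken from the study.

```python
from datetime import date

def monthly_means(readings):
    """Average (day, value) readings by calendar month."""
    sums = {}
    for day, value in readings:
        key = (day.year, day.month)
        total, count = sums.get(key, (0.0, 0))
        sums[key] = (total + value, count + 1)
    return {key: total / count for key, (total, count) in sorted(sums.items())}

def split_by_month(months, last_training_month):
    """Split (year, month) pairs into training and forecast periods."""
    train = [m for m in months if m <= last_training_month]
    test = [m for m in months if m > last_training_month]
    return train, test

# Every (year, month) pair in the study period, January 2010 to June 2018.
study_months = [(y, m) for y in range(2010, 2019) for m in range(1, 13)
                if (y, m) <= (2018, 6)]
train_months, test_months = split_by_month(study_months, (2017, 12))

# Hypothetical readings, for illustration only (not data from the study).
demo = [(date(2018, 1, 1), 30.0), (date(2018, 1, 2), 28.0), (date(2018, 2, 1), 27.0)]
demo_means = monthly_means(demo)

print(len(train_months), len(test_months))
```

With monthly averages, the training period amounts to 96 observations per pollutant and the forecast period to 6, which is worth keeping in mind when comparing models with many parameters.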

Table 1 Port of Gijón. Minimum, mean, maximum and standard deviation of the variables of the study: sulfur dioxide (SO₂), nitrogen monoxide (NO), nitrogen dioxide (NO₂), carbon monoxide (CO), ozone (O₃) and particulate matter with a diameter less than 10 µm (PM₁₀).


Materials and methods

The present research calculates predictive models of PM₁₀ concentration by means of autoregressive integrated moving average (ARIMA), vector autoregressive moving-average (VARMA), multilayer perceptron neural networks (MLP), support vector machines as regressor (SVMR) and multivariate adaptive regression splines (MARS) models. In all cases the PM₁₀ values were calculated in two ways: firstly, using the concentration of the six pollutants available as input variables and afterwards employing only four: SO₂, NO, NO₂ and PM₁₀. The main reason why models using only four of the six available variables are also trained and validated is that many meteorological stations, including some belonging to the Air Quality Monitoring network of the Government of the Principality of Asturias, are only able to measure these four variables. In other words, the use of only the aforementioned four variables allows us to compare the model performance according to the input variables employed and will serve as a reference for future studies. Please note that this applies to all the models of the present research except ARIMA, where only concentrations of PM₁₀ are employed for the forecasting. In all cases, for continuous variables minimum, mean, maximum and standard deviation were calculated.

Forecasts are performed from 1 to 6 months in advance. The reason why it might be of interest to perform forecasts 6 months in advance is two-fold. On the one hand, high PM₁₀ concentrations have adverse effects on human health and, on the other, having such a forecast would be helpful in order to take measures that would make it possible to comply with European air quality standards. According to the results obtained, the best forecast of PM₁₀ concentration 1 month ahead is obtained by the SVMR model calculated with six variables. In the case of the forecast 6 months ahead, the results of the MLP with six variables are slightly better. In other words, in the short term the best forecasts are given by SVMR but in the long term it is outperformed by MLP.

Autoregressive integrated moving average (ARIMA)

ARIMA models can be considered as being an extension of ARMA (autoregressive moving average) known for their ability to provide a parsimonious description of a stationary stochastic process³². ARMA models are composed of two polynomial terms, one for autoregression (AR) and another for moving average (MA). Given a time series of data $X_{t}$, the ARMA model can be expressed as:

$$X_{t} = c + \varepsilon_{t} + \mathop \sum \limits_{i = 1}^{p} \varphi_{i} X_{t - i} + \mathop \sum \limits_{i = 1}^{q} \sigma_{i} \varepsilon_{t - i}$$

where $c$ is a constant, $\varepsilon_{t}$ are white noise error terms, $\sum\nolimits_{i = 1}^{p} {\varphi_{i} X_{t - i} }$ is the autoregressive addend where $\varphi_{i}$ are parameters and $X_{t - i}$ is the value of variable $X$ at time $t - i$, and $\sum\nolimits_{i = 1}^{q} {\sigma_{i} \varepsilon_{t - i} }$ is the moving-average addend where $\sigma_{i}$ are the parameters of the model.
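To make the ARMA expression concrete, the following minimal sketch evaluates a one-step-ahead forecast from past values and past error terms, setting the unknown future error to its mean of zero. The function name `arma_one_step` and all coefficient values are hypothetical; they are not estimates from the study.

```python
def arma_one_step(history, residuals, c, phi, sigma):
    """One-step-ahead ARMA(p, q) forecast.

    history:   past values X, most recent last
    residuals: past error terms, most recent last
    phi:       autoregressive parameters (length p)
    sigma:     moving-average parameters (length q)
    The future error term is replaced by its expected value, zero.
    """
    ar_part = sum(phi[i] * history[-1 - i] for i in range(len(phi)))
    ma_part = sum(sigma[i] * residuals[-1 - i] for i in range(len(sigma)))
    return c + ar_part + ma_part

# Illustrative ARMA(2, 1) with made-up coefficients.
forecast = arma_one_step(
    history=[10.0, 12.0],
    residuals=[0.5, -1.0],
    c=2.0,
    phi=[0.5, 0.2],
    sigma=[0.4],
)
print(round(forecast, 4))
```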

ARIMA models are appropriate for those observation sets that are not necessarily generated by a stationary time series, as is the case of the present problem. They considerably improve the empirical description of non-stationary time series²⁹. A stochastic process can be characterized as an ARIMA model if the d-th difference of $X_{t}$ constitutes a stationary and invertible ARMA process of orders $p$ and $q$.

In this case, $p$ represents the order of the autoregressive part of the model, $q$ is the order of the weighted moving average and another parameter called $d$ represents the number of differencing operations required to reach stationarity³³. If the differencing operator is denoted by $\nabla$, the general ARIMA equation can be written as follows³⁰:

$$\phi_{p} \left( B \right)\nabla^{d} \left( {X_{t} - L} \right) = \theta_{q} \left( B \right)\varepsilon_{t}$$

where $\phi_{p} \left( B \right)$ and $\theta_{q} \left( B \right)$ are the autoregressive and weighted moving-average polynomials in the backshift operator $B$, respectively, and $\varepsilon_{t}$ is the model perturbation:

$$\begin{aligned} & \phi_{p} \left( B \right) = 1 - \phi_{1} B - \phi_{2} B^{2} - \cdots - \phi_{p} B^{p} \\ & \theta_{q} \left( B \right) = 1 - \theta_{1} B - \theta_{2} B^{2} - \cdots - \theta_{q} B^{q} \\ \end{aligned}$$
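The role of the differencing order $d$ can be illustrated with a short sketch: each application of $\nabla$ replaces the series by the differences between consecutive values, so a series with a quadratic trend becomes constant after two differences. The helper name `difference` and the sample series are illustrative assumptions only.

```python
def difference(series, d=1):
    """Apply the differencing operator d times: (nabla X)_t = X_t - X_{t-1}."""
    result = list(series)
    for _ in range(d):
        result = [later - earlier for earlier, later in zip(result, result[1:])]
    return result

# A made-up series with a quadratic trend.
quadratic = [1, 4, 9, 16, 25]
first = difference(quadratic, d=1)
second = difference(quadratic, d=2)
print(first, second)
```

Each differencing step shortens the series by one observation, which matters for short monthly series such as the one used here.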

A more in-depth explanation of ARIMA models goes beyond the scope of this research and can be found elsewhere³⁴. All the models employed in the present research were calculated with the help of the statistical software R³⁵. ARIMA models were calculated with the help of the series library³⁶.

Vector autoregressive moving-average (VARMA)

The vector autoregressive moving-average (VARMA) method models the next step in each time series using an ARMA model. In other words, it can be considered the generalization of ARMA to multivariate time series. This kind of model makes it possible to compute a set of time series at the same time, obtaining their within-correlations and cross-correlations³². The calculations for these models were performed with the help of the MTS library³⁷.

If a k-dimensional time series is represented by $z_{t}$, the vector autoregressive moving-average VARMA $\left( {p,q} \right)$ process can be expressed as:

$$\phi \left( B \right)z_{t} = \phi_{0} + \theta \left( B \right)a_{t}$$

where $\phi_{0}$ is a constant vector,

$$\begin{aligned} & \phi \left( B \right) = I_{k} - \mathop \sum \limits_{i = 1}^{p} \phi_{i} B^{i} \\ & \theta \left( B \right) = I_{k} - \mathop \sum \limits_{i = 1}^{q} \theta_{i} B^{i} \\ \end{aligned}$$

are two matrix polynomials in the backshift operator $B$, and $a_{t}$ is a sequence of independent and identically-distributed random vectors with mean zero and positive-definite covariance matrix $\Sigma_{a}$.

A general VARMA $\left( {p,q} \right)$ model is represented as follows³⁷:

$$z_{t} = \phi_{0} + \mathop \sum \limits_{i = 1}^{p} \phi_{i} z_{t - i} + a_{t} - \mathop \sum \limits_{j = 1}^{q} \theta_{j} a_{t - j}$$

In this equation $p$ and $q$ are nonnegative integers, $\phi_{0}$ is a vector of constants, $\phi_{i}$ and $\theta_{j}$ are constant matrices and $\left\{ {a_{t} } \right\}$ is a sequence of independent and identically-distributed random vectors with mean zero and positive-definite covariance matrix.
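As a hedged illustration of the VARMA recursion, the sketch below computes a one-step-ahead forecast for a two-dimensional series, pairing the $i$-th autoregressive matrix with the value $i$ steps back and the $j$-th moving-average matrix with the innovation $j$ steps back, and setting the future innovation to zero. The function names and all matrices are made up for illustration and do not come from the fitted models.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def varma_one_step(z_hist, a_hist, phi0, phis, thetas):
    """One-step-ahead VARMA(p, q) forecast.

    z_hist, a_hist: lists of k-vectors (values and innovations), most recent last
    phis:   list of p autoregressive k x k matrices
    thetas: list of q moving-average k x k matrices
    """
    forecast = list(phi0)
    for i, phi in enumerate(phis, start=1):
        term = mat_vec(phi, z_hist[-i])
        forecast = [f + t for f, t in zip(forecast, term)]
    for j, theta in enumerate(thetas, start=1):
        term = mat_vec(theta, a_hist[-j])
        forecast = [f - t for f, t in zip(forecast, term)]
    return forecast

# Illustrative VARMA(1, 1) for k = 2 with made-up values.
forecast = varma_one_step(
    z_hist=[[20.0, 10.0]],
    a_hist=[[1.0, -1.0]],
    phi0=[1.0, 0.0],
    phis=[[[0.5, 0.1], [0.0, 0.5]]],
    thetas=[[[0.2, 0.0], [0.0, 0.2]]],
)
print([round(v, 4) for v in forecast])
```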

According to Tsay and Wood³⁷, the VARMA model expressed in the previous equation can be rewritten in a more convenient way as follows:

$$z_{t} = \phi_{0} + \mathop \sum \limits_{i = 1}^{p} \phi_{i} z_{t - i} + Lb_{t} - \mathop \sum \limits_{j = 1}^{q} \theta_{j}^{*} b_{t - j}$$

where $a_{t} = Lb_{t}$, $\theta_{j}^{*} = \theta_{j} L$ and $L$ is a lower triangular matrix with ones on the diagonal. The determination of $p$ and $q$ values was performed following a methodology suggested in previous research³⁸. The Akaike information criterion³⁹ (AIC) and the Schwarz information criterion⁴⁰ (SIC) were employed to balance the improvement in the value of the log-likelihood function against the loss of degrees of freedom that results from increasing the lag order of a time series model. With the help of both criteria, the maximum $p$ and $q$ values were calculated. All models with $p$ and $q$ values less than or equal to these maxima were calculated and, finally, those with the best RMSE are presented in this paper.
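The trade-off that AIC and SIC express can be sketched with their usual definitions, $\mathrm{AIC} = 2k - 2\ln \hat{L}$ and $\mathrm{SIC} = k\ln n - 2\ln \hat{L}$, where $k$ is the number of estimated parameters and $n$ the number of observations. In the example below the log-likelihood values are invented purely for illustration, and the parameter counts assume a four-variable VARMA with an intercept vector and full coefficient matrices; none of these numbers come from the study.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike information criterion (lower is better)."""
    return 2 * n_params - 2 * log_likelihood

def sic(log_likelihood, n_params, n_obs):
    """Schwarz (Bayesian) information criterion (lower is better)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood

# (p, q) -> (hypothetical log-likelihood, parameter count 4 + 16 * (p + q)).
candidates = {
    (1, 0): (-280.0, 20),
    (1, 1): (-255.0, 36),
    (2, 1): (-250.0, 52),
}
n_obs = 96  # monthly observations from January 2010 to December 2017

best_aic = min(candidates, key=lambda pq: aic(*candidates[pq]))
best_sic = min(candidates, key=lambda pq: sic(*candidates[pq], n_obs))
print(best_aic, best_sic)
```

In this made-up case the two criteria disagree, because SIC penalizes each extra parameter by $\ln n$ rather than 2 and so favours the smaller model.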

Multilayer perceptron neural networks (MLP)

One of the first bio-inspired machine learning models was the one-layer perceptron. This kind of network was proposed by Rosenblatt⁴¹ as a possible model of the neurons of the human brain. The perceptron adaptation rule consists of a supervised iterative method that modifies the neuron weights. The multilayer perceptron is useful as a way to model a function. In a neural network the outcome is modelled by an intermediate set of unobservable variables called hidden variables, which are linear combinations of the original predictors. However, this linear combination is typically transformed by a nonlinear function.

Kolmogorov⁴² demonstrated that a two-layer network (one hidden layer and one output layer) with a non-linear differentiable activation function is able to approximate any smooth mapping if the number of neurons in the hidden layer is high enough. If a two-layer network like the one employed in the present research is considered, the operations for a system with $p$ input variables, one output variable and $q$ neurons in the hidden layer can be expressed as:

$$y\left( n \right) = \sigma \left( {w^{y} \cdot \varphi \left( {w^{h} \cdot x\left( n \right)} \right)} \right)$$

where $y\left( n \right)$ and $x\left( n \right)$ are the output and input of the net; $\sigma$ is the activation function of the output layer; $\varphi$ is the activation function of the hidden layer; $w^{y}$ and $w^{h}$ are the weights matrix for the output and hidden layer respectively.

One main requirement in order to make possible the MLP training⁴³ is that $\sigma$ and $\varphi$ be continuously-differentiable functions. Training is performed with the backpropagation method, which is a recursive application of the gradient descent method. For the purposes of this research, the neural network models were trained and validated with the help of the library neuralnet⁴⁴. The activation function employed is the logistic function. A more in-depth explanation of the foundations of neural networks may be found elsewhere⁴⁵.
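The forward computation of such a two-layer network can be sketched as follows, using the logistic function in the hidden layer. As in the equation for $y(n)$, bias terms are omitted. The identity output activation, the function names and the weights are assumptions made for this illustration and are not the configuration trained in the study.

```python
import math

def logistic(v):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-v))

def mlp_forward(x, w_hidden, w_output, hidden_act=logistic, output_act=lambda v: v):
    """Forward pass of a network with one hidden layer and one output neuron.

    x:        p input values
    w_hidden: q rows of p weights (one row per hidden neuron)
    w_output: q weights of the output neuron
    """
    hidden = [hidden_act(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    return output_act(sum(w * h for w, h in zip(w_output, hidden)))

# Made-up weights: p = 2 inputs, q = 2 hidden neurons.
w_hidden = [[0.3, -0.2], [0.1, 0.4]]
w_output = [1.0, 2.0]
y_zero = mlp_forward([0.0, 0.0], w_hidden, w_output)
y_other = mlp_forward([1.0, 2.0], w_hidden, w_output)
print(y_zero, round(y_other, 4))
```

With a zero input every hidden neuron outputs 0.5, so the result is simply half the sum of the output weights, which gives a quick sanity check of the implementation.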

Support vector machines as regressor (SVMR)

Support Vector Machines were introduced by the work of Vapnik⁴⁶. Although they were created for binary classification, nowadays they are used for different kinds of problems. Those employed for regression problems are called SVMR²⁹.

Given a training data set $S = \left\{ {\left( {x_{1} ,y_{1} } \right), \ldots ,\left( {x_{n} ,y_{n} } \right)} \right\}$, where $x_{i} \in \Re^{d}$ and $y_{i} \in \Re$, the regression task involves finding those parameters $w = \left( {w_{1} , \ldots ,w_{d} } \right)$ that define the following linear function²⁷:

$$f\left( x \right) = w_{1} x_{1} + \cdots + w_{d} x_{d} + b$$

As in practice it is not possible to find these parameters with a prediction error equal to zero, a concept called soft margin is employed. For this, variable $\xi_{i}$ is employed and the equation is written as follows:

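$$\min \frac{1}{2}\left\langle {w,w} \right\rangle + C\mathop \sum \limits_{i = 1}^{n} \left( {\xi_{i}^{ + } + \xi_{i}^{ - } } \right)$$

Please note that $\xi_{i}^{ + } > 0$ when the forecast of the model $f\left( {x_{i} } \right)$ is larger than its real value $y_{i}$ by more than the tolerance $\varepsilon$, $\xi_{i}^{ - } > 0$ when it is smaller than $y_{i}$ by more than $\varepsilon$, and both are zero in other cases.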
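The soft-margin idea can be illustrated with a small sketch that computes the two slack variables for a single prediction and evaluates the objective. It assumes the standard $\varepsilon$-insensitive formulation of support vector regression, in which deviations smaller than a tolerance $\varepsilon$ cost nothing; the function names and all numbers are illustrative assumptions.

```python
def slack_variables(y_true, y_pred, epsilon):
    """Return (xi_plus, xi_minus) for one observation.

    xi_plus  > 0 when the forecast exceeds the real value by more than epsilon.
    xi_minus > 0 when the forecast falls below the real value by more than epsilon.
    """
    xi_plus = max(0.0, (y_pred - y_true) - epsilon)
    xi_minus = max(0.0, (y_true - y_pred) - epsilon)
    return xi_plus, xi_minus

def svr_objective(w, slacks, C):
    """Soft-margin objective: 0.5 * <w, w> + C * sum of all slack variables."""
    return 0.5 * sum(wi * wi for wi in w) + C * sum(p + m for p, m in slacks)

# Made-up values: real value 29, three different forecasts, epsilon = 1.
slacks = [slack_variables(29.0, pred, 1.0) for pred in (32.0, 28.5, 25.0)]
objective = svr_objective([1.0, 2.0], slacks, C=10.0)
print(slacks, objective)
```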

With the help of the lagrangian function and the Karush–Kuhn–Tucker conditions, the problem can be expressed as follows:

$$f\left( x \right) = \mathop \sum \limits_{i = 1}^{n} \left( {\alpha_{i}^{ - } - \alpha_{i}^{ + } } \right)\left\langle {x,x_{i} } \right\rangle + b^{*}$$

where

$$\begin{aligned} & \alpha_{i}^{ + } = C - \beta_{i}^{ + } \\ & \alpha_{i}^{ - } = C - \beta_{i}^{ - } \\ & b^{*} = y_{i} - \left\langle {w^{*} ,x_{i} } \right\rangle \pm \varepsilon \\ \end{aligned}$$

In those cases where data cannot be fitted with the help of a linear function, kernels are employed⁴⁷. Kernels transform data into a new space called the feature space.

The regressor associated with the linear function in the new space is as follows:

$$f\left( x \right) = \mathop \sum \limits_{i = 1}^{n} \left( {\alpha_{i}^{ - } - \alpha_{i}^{ + } } \right)K\left( {x,x_{i} } \right)$$

Please note that $b^{*}$ is not included in the function as it can be included as a constant inside the kernel. The kind of kernel function to be employed depends on the problem to be solved. For example, the radial basis function has been shown to be very effective, but in those cases where the data set comes from a linear regression, the linear kernel function obtains better results⁴⁸. The SVM as regressor models have been implemented with the functionalities of the library e1071⁴⁹. A good explanation of the use of SVM as regressor can be found in the work of Drucker et al.⁵⁰.
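A minimal sketch of the kernel regressor is given below, using a radial basis function kernel of the common form $K(x, z) = \exp(-\gamma \lVert x - z \rVert^{2})$. The support vectors, dual coefficients and the value of $\gamma$ are hypothetical; in practice they result from solving the optimization problem, which this sketch does not attempt.

```python
import math

def rbf_kernel(x, z, gamma):
    """Radial basis function kernel exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def svr_predict(x, support_vectors, alpha_minus, alpha_plus, kernel):
    """Evaluate f(x) = sum_i (alpha_minus_i - alpha_plus_i) * K(x, x_i)."""
    return sum(
        (a_minus - a_plus) * kernel(x, sv)
        for sv, a_minus, a_plus in zip(support_vectors, alpha_minus, alpha_plus)
    )

# Made-up support vectors and dual coefficients.
support_vectors = [[0.0, 0.0], [1.0, 1.0]]
alpha_minus = [2.0, 0.0]
alpha_plus = [0.0, 1.0]

prediction = svr_predict(
    [0.0, 0.0], support_vectors, alpha_minus, alpha_plus,
    kernel=lambda x, z: rbf_kernel(x, z, gamma=1.0),
)
print(round(prediction, 4))
```

Swapping the `kernel` argument for a plain dot product gives the linear case, which is one way to compare the two kernel choices mentioned above on the same data.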

Multivariate adaptive regression splines (MARS)

MARS is a non-parametric modelling method driven by the following equation⁵¹:

$$y_{t} = f\left( {x_{t} } \right) = \beta_{0} + \mathop \sum \limits_{i = 1}^{k} \beta_{i} \cdot B\left( {x_{it} } \right)$$

where $y_{t}$ is the output variable for each time $t$ and $\beta_{i}$ are the model parameters for the different $x_{it}$. $\beta_{0}$ is the intercept and $B$ represents the model basis functions.

One of the main characteristics of the MARS models is that they do not make use of any a priori hypothesis concerning the relationships among the variables⁵². The basis functions are defined as follows:

$$\begin{aligned} & B^{ - } = \left\{ {\begin{array}{*{20}l} {\left( {t - x} \right)^{q} } \hfill & \quad {if \;x < t} \hfill \\ 0 \hfill & \quad {otherwise} \hfill \\ \end{array} } \right. \\ & B^{ + } = \left\{ {\begin{array}{*{20}l} {\left( {x - t} \right)^{q} } \hfill & \quad {if\; x \ge t} \hfill \\ 0 \hfill & \quad {otherwise} \hfill \\ \end{array} } \right. \\ \end{aligned}$$

$q$ is the power of the basis function and is always a value equal to or larger than zero. In order to fit a MARS model and decide which basis functions are to be included, MARS makes use of generalized cross validation (GCV). This represents the mean squared error divided by a penalty term that is defined by the model complexity⁵³. The complexity measure is as follows:

$$C\left( M \right) = M + 1 + d \cdot M$$

where $M$ represents the number of basis functions in the equation and $d$ is a penalty parameter for each base function included in the model. For this research, a value of 2 has been assigned to such a parameter, while the maximum number of tracer interaction type base functions is restricted to 3. The MARS models employed in this research are based on those programmed in the library earth⁵⁴. A complete explanation of MARS models can be found in the original work of Friedman⁵¹. Also, an easy-to-read introduction to this methodology can be found in the works of Put et al.⁵⁵.
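The pieces of the MARS description can be put together in a short sketch: the two hinge basis functions around a knot $t$, the complexity measure $C(M)$ with penalty $d = 2$, and a GCV score. The GCV form used here, mean squared error divided by $\left(1 - C(M)/n\right)^{2}$, is assumed from Friedman's original formulation, and the function names and data are illustrative only.

```python
def hinge_minus(x, t, q=1):
    """Left hinge: (t - x)^q when x < t, else 0."""
    return (t - x) ** q if x < t else 0.0

def hinge_plus(x, t, q=1):
    """Right hinge: (x - t)^q when x >= t, else 0."""
    return (x - t) ** q if x >= t else 0.0

def complexity(M, d=2):
    """Complexity measure C(M) = M + 1 + d * M for M basis functions."""
    return M + 1 + d * M

def gcv(y_true, y_pred, M, d=2):
    """Generalized cross validation score (assumed Friedman form)."""
    n = len(y_true)
    mse = sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / n
    return mse / (1.0 - complexity(M, d) / n) ** 2

# Made-up data: 20 observations, each forecast off by exactly 1.
y_true = [1.0] * 20
y_pred = [0.0] * 20
score = gcv(y_true, y_pred, M=3)
print(hinge_plus(5.0, 3.0), hinge_minus(1.0, 3.0), complexity(3), score)
```

Because the denominator shrinks as $M$ grows, adding a basis function only lowers the score when it reduces the error enough to offset the extra complexity.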

Results and discussion

Table 2 shows the Pearson’s correlation coefficients of all the variables in the study. The largest correlation coefficient in absolute value corresponds to variables NO and NO₂ with 0.8626, followed by NO and O₃ with −0.7593 (inverse relationship), and by SO₂ and NO₂ and SO₂ and NO with 0.7160 and 0.7090 respectively. Correlation coefficients of variables SO₂, NO, NO₂, CO and O₃ with PM₁₀ can be considered moderate in absolute value terms, as they range from 0.4320 (CO and PM₁₀) to 0.5251 (NO₂ and PM₁₀).
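For reference, the Pearson correlation coefficient reported in Table 2 can be computed as in the following sketch. The function name and the sample series are illustrative and do not reproduce the study's data.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long series."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

# Made-up series: one perfectly direct and one perfectly inverse relationship.
direct = pearson([1, 2, 3, 4], [2, 4, 6, 8])
inverse = pearson([1, 2, 3], [3, 2, 1])
print(round(direct, 4), round(inverse, 4))
```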

Table 2 Pearson’s correlation coefficients of the variables of the study.


Table 3 shows the results of the ARIMA model using the previous values of PM₁₀ as the input variable. Tables 4, 5, 6, 7 and 8 show the results obtained using the different models of four (SO₂, NO, NO₂ and PM₁₀) and six variables (SO₂, NO, NO₂, CO, O₃ and PM₁₀) employed in the present research. In all cases, the results are presented in the same way. The first line represents the forecast for the following 6 months performed using information from January 2010 to December 2017 as training values. The second line shows the forecast performed using information from January 2010 to January 2018 as training values, which covers the months from February 2018 (1 month ahead) to June 2018 (5 months ahead). The following lines proceed in the same way. For all cases, and in order to allow an easy comparison of real values with forecasts, the root mean squared errors (RMSE) of the forecasts from 1 to 6 months ahead and 1 month ahead are presented for all models in Table 9. In the case of the ARIMA model (Table 3), the one that only makes use of past PM₁₀ concentrations in order to predict their future values, the RMSE obtained for forecasts performed 1 month ahead was 6.3163, while the RMSE for forecasts performed from 1 to 6 months ahead was 7.6312. Please note that when we speak about the RMSE obtained for a forecast performed 1 month ahead, we refer to the values that are on the diagonal of the table (in the case of Table 3: 22.2217, 32.0564, 19.7957, 22.9000, 34.6428 and 29.6487) as they are the ones calculated 1 month ahead. Regarding the forecast from 1 to 6 months ahead, we compare real values with the forecast of the first row of the table from January 2018 to June 2018 (in the case of Table 3: 22.2217, 31.5194, 19.8269, 20.1082, 37.0095 and 31.8833). Please note that the real monthly averaged values from January to June 2018 were 29, 27, 26, 31, 29 and 24 respectively. These values are included in Tables 3, 4, 5, 6, 7 and 8 to make comparisons more direct.
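The two ARIMA error figures can be checked directly from the values quoted above. The sketch below computes the RMSE of the diagonal (1 month ahead) and first-row (1 to 6 months ahead) forecasts against the real monthly averages; the function and variable names are our own, while the numbers are those given in the text. The results agree with the reported 6.3163 and 7.6312 up to the rounding of the tabulated values.

```python
import math

def rmse(actual, forecast):
    """Root mean squared error between two equally long series."""
    squared = [(a - f) ** 2 for a, f in zip(actual, forecast)]
    return math.sqrt(sum(squared) / len(squared))

# Real monthly averaged PM10 values, January to June 2018.
actual = [29, 27, 26, 31, 29, 24]

# ARIMA forecasts quoted from Table 3.
one_month_ahead = [22.2217, 32.0564, 19.7957, 22.9000, 34.6428, 29.6487]
one_to_six_ahead = [22.2217, 31.5194, 19.8269, 20.1082, 37.0095, 31.8833]

rmse_diagonal = rmse(actual, one_month_ahead)
rmse_first_row = rmse(actual, one_to_six_ahead)
print(round(rmse_diagonal, 3), round(rmse_first_row, 3))
```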

Table 3 Port of Gijón. Results of the ARIMA models using variable PM₁₀.


Table 4 Port of Gijón. Results of the VARMA models using variables SO₂, NO, NO₂ and PM₁₀.


Table 5 Port of Gijón. Results of the VARMA models using variables SO₂, NO, NO₂, CO, O₃ and PM₁₀.


Table 6 Port of Gijón. Results of the MLP models with variables SO₂, NO, NO₂ and PM₁₀ and with variables SO₂, NO, NO₂, CO, O₃ and PM₁₀.


Table 7 Port of Gijón. Results of the SVMR models with variables SO₂, NO, NO₂ and PM₁₀ and with variables SO₂, NO, NO₂, CO, O₃ and PM₁₀.


Table 8 Port of Gijón. Results of the MARS models with variables SO₂, NO, NO₂ and PM₁₀ and with variables SO₂, NO, NO₂, CO, O₃ and PM₁₀.


Table 9 RMSE values 1 and up to 6 months ahead of all the models employed in the present study.


The RMSE values achieved 1 and up to 6 months ahead for all the models trained in the present research are shown in Table 9. For forecasting 1 month ahead, the best results are obtained by the six-variable SVMR and MLP models, followed by the same models including only four variables. These results suggest that all the variables included in the study have a certain relevance for performing an accurate PM₁₀ prediction. After the MLP and SVMR models, according to RMSE values the next best in forecasting 1 month ahead is the ARIMA model, the only one that makes exclusive use of past PM₁₀ values in order to forecast future concentrations. The ARIMA model is followed by MARS with six and four variables, while the VARMA models give the worst performance.

For forecasts of up to 6 months ahead, the best performance according to RMSE value is also achieved by the six-variable MLP and SVMR models, followed by the same models using only four variables. A remarkable change with respect to the forecast 1 month ahead is that the MARS model with six variables performs better than the ARIMA model. Finally, as with the forecasts 1 month ahead, the worst performance was shown by the VARMA models.

From our point of view, a remarkable fact is that model performance in terms of RMSE, for both 1- and 6-month-ahead forecasts, is linked not only to the number of variables considered but also to the kind of model selected. In other words, a model of only one variable (ARIMA) can perform better than others that include six variables (VARMA) in both 1- and 6-month-ahead predictions. Finally, the importance of each variable is easy to assess with the help of a MARS model. The order of importance found for the prediction of PM₁₀ was as follows: previous values of PM₁₀, followed by the previous measurements of CO, NO, O₃, SO₂ and NO₂.

The main limitation of this study is that, although the original data are recorded every 15 min, forecasts are performed for average monthly values. Average monthly values were forecast because the results obtained by the authors for daily or hourly forecasts were not as stable. This is due to the influence of port traffic on pollution in the area; unlike urban traffic, port traffic does not follow a fixed cycle. Another limitation, to be addressed in future studies, is the absence of meteorological variables such as temperature, humidity, pressure, solar radiation, rainfall and wind speed and direction, whose inclusion could improve the results obtained.
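The aggregation from 15-min readings to monthly averages can be sketched as follows. This is only an illustration under stated assumptions: the helper `monthly_means` is hypothetical, the readings are synthetic constants rather than measured concentrations, and gaps in the record (which a real series would need to handle) are ignored.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def monthly_means(readings):
    """Average (timestamp, value) readings per calendar month.

    Returns a dict mapping (year, month) to the mean value.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for timestamp, value in readings:
        key = (timestamp.year, timestamp.month)
        totals[key] += value
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in sorted(totals)}


# Synthetic illustration only (not measured data): a constant 30.0 in
# January 2018 and 20.0 in February 2018, sampled every 15 minutes.
readings = []
t = datetime(2018, 1, 1)
step = timedelta(minutes=15)
while t < datetime(2018, 3, 1):
    readings.append((t, 30.0 if t.month == 1 else 20.0))
    t += step

averages = monthly_means(readings)
for (year, month), mean in averages.items():
    print(f"{year}-{month:02d}: {mean:.1f}")
```

Each month is reduced from 96 readings per day to a single value, which is the series the models described here are trained on.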

Conclusions

The results obtained in this research show that it is possible to predict PM₁₀ concentration from the past values of this variable and the concentrations of other pollutants by means of statistical and machine learning models. In addition, as had already been found in previous studies⁵⁶, the use of the concentrations of other pollutants helps to obtain a more accurate prediction. In fact, the most accurate results were obtained with two kinds of machine learning models, SVMR and MLP, when they made use of the values of the six available variables. The results obtained show how regression-based models like SVMR, MARS and MLP outperform univariate and multivariate time series-based models (ARIMA and VARMA). According to the findings of this paper and previous ones²⁹, this is because the short-term relationships among pollutants are stronger than the temporal relationships of PM₁₀ concentration values with their own past values and with those of other variables. In other words, although it is possible to find certain seasonal patterns in monthly average pollutant values, the relationship of PM₁₀ with the concentration of other pollutants is more important than the seasonal pattern.

Finally, this research affords the reader the opportunity to compare different machine learning and time series methodologies applied to the same data set and to establish whether they are useful for PM₁₀ concentration forecasting. If the average monthly values of PM₁₀ from January to June 2018 are compared with those corresponding to the same months of the previous year, the resulting RMSE is 6.8557. In forecasts 1 month ahead, this reference value is outperformed by the MLP and SVMR models of four and six variables and by the MARS model of six variables, as well as by the ARIMA model (RMSE of 6.3163). In forecasts 6 months ahead, it is outperformed by the MLP models of four and six variables and by the SVMR model of six variables. Although the proposed methodologies do not always outperform the mere use of the average values of PM₁₀ concentrations of the same months of the previous year, they are a useful complementary tool for planning and taking decisions in advance.
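To give a sense of the size of the gain over the previous-year reference, the sketch below expresses the RMSE values reported in the abstract for the six-variable models as a relative reduction with respect to 6.8557. The relative-improvement measure and the name `relative_improvement` are assumptions introduced here for illustration; the paper itself compares the RMSE values directly.

```python
# RMSE obtained when the same months of the previous year are used as
# the forecast (value quoted in the text).
BASELINE_RMSE = 6.8557

# RMSE values reported in the abstract for the six-variable models.
reported_rmse = {
    ("SVMR", "1 month ahead"): 4.2649,
    ("MLP", "1 month ahead"): 4.3402,
    ("MLP", "1 to 6 months ahead"): 6.0873,
    ("SVMR", "1 to 6 months ahead"): 6.1010,
}


def relative_improvement(model_rmse, baseline_rmse=BASELINE_RMSE):
    """Fraction by which a model's RMSE is lower than the baseline RMSE."""
    return 1.0 - model_rmse / baseline_rmse


improvements = {
    key: relative_improvement(value) for key, value in reported_rmse.items()
}
for (model, horizon), gain in improvements.items():
    print(f"{model:5s} {horizon:20s} {gain:6.1%}")
```

On this measure the advantage of the six-variable models is considerably larger 1 month ahead than 6 months ahead, in line with the comparison made above.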

References

González-Marco, D., Pau Sierra, J., Fernández de Ybarra, O. & Sánchez-Arcilla, A. Implications of long waves in harbour management: the Gijon port case study. Ocean Coast. Manag. 51, 180–201 (2018).
World Health Organization. Effects of air pollution on children’s health and development: A review of the evidence. (2005).
Gauderman, W. J. et al. The effect of air pollution on lung development from 10 to 18 years of age. New Engl. J. Med. 351(11), 1057–1067 (2004).
Wyler, C. et al. Exposure to motor vehicle traffic and allergic sensitization. Epidemiology 11(4), 450–456 (2000).
European Commission. Council Directive 1996/62/EC of 27 September 1996 on ambient air quality assessment and management. Official Journal of the European Communities, 55–63 (1996).
Ganguly, R., Sharma, D. & Kumar, P. Trend analysis of observational PM₁₀ concentrations in Shimla city, India. Sustain. Cities Soc. 51, 101719 (2019).
Grange, S. K., Salmond, J. A., Trompetter, W. J., Davy, P. K. & Ancelet, T. Effect of atmospheric stability on the impact of domestic wood combustion to air quality of a small urban township in winter. Atmos. Environ. 70, 28–38 (2013).
Yadav, R., Sahu, L. K., Jaaffrey, S. N. A. & Beig, G. Temporal variation of particulate matter (PM) and potential sources at an urban station of Udaipur in Western India. Aerosol. Air Qual. Res. 14, 1613–1629 (2014).
Mueller, D., Uibel, S., Takemura, M., Klingelhoefer, D. & Groneberg, D. A. Ships, ports and particulate air pollution—an analysis of recent studies. J. Occup. Med. Toxicol. 5, 6–31. https://doi.org/10.1186/1745-6673-6-3 (2011).
Pandolfi, M., Gonzalez-Castanedo, Y., Alastuey, A., de la Rosa, J. D., Mantilla, E., de la Campa, A. S., Querol, X., Pey, J., Amato, F. & Moreno, T. Source apportionment of PM(10) and PM(2.5) at multiple sites in the strait of Gibraltar by PMF: impact of shipping emissions. Environ. Sci. Pollut. R. Int. 18(2), 260–269. https://doi.org/10.1007/s11356-010-0373-4 (2011).
Agrawal, H. et al. Primary particulate matter from ocean-going engines in the Southern California Air Basin. Environ. Sci. Technol. 43, 5398–5402 (2009).
Deniz, C., Kilic, A. & Civkaroglu, G. Estimation of shipping emissions in Candarli Gulf, Turkey. Environ. Monit. Assess. 17(1–4), 219–228. https://doi.org/10.1007/s10661-009-1273-2 (2010).
Deniz, C. & Kilic, A. Estimation and assessment of shipping emissions in the region of Ambarli Port, Turkey. Environ. Prog. Sustain. 29(1), 107–115 (2009).
Alastuey, A. et al. Contribution of harbour activities to levels of particulate matter in a harbour area: Hada Project-Tarragona Spain. Atmos. Environ. 41(30), 6366–6378 (2007).
Pérez, N. et al. Impact of harbour emissions on ambient PM₁₀ and PM₂.₅ in Barcelona (Spain): evidences of secondary aerosol formation within the urban area. Sci. Total Environ. 571, 237–250 (2016).
Shen, J. et al. Vertical distribution of particulates within the near-surface layer of dry bulk port and influence mechanism: a case study in China. Sustainability 11(24), 1–16 (2019).
Manoli, E. et al. Polycyclic aromatic hydrocarbons and trace elements bounded to airborne PM10 in the harbor of Volos, Greece: Implications for the impact of harbor activities. Atmos. Environ. 167, 61–72 (2017).
Žibert, J. & Pražnikar, J. Cluster analysis of particulate matter (PM₁₀) and black carbon (BC) concentrations. Atmos. Environ. 57, 1–12 (2012).
Healy, R. M. et al. Characterisation of single particles from in-port ship emissions. Atmos. Environ. 43, 6408–6414. https://doi.org/10.1016/j.atmosenv.2009.07.039 (2009).
Meisner Rosen, C. Businessmen against pollution in late nineteenth century Chicago. Bus. Hist. Rev. 69(3), 351–397 (1995).
Desalu, A., Gould, L. & Schweppe, F. Dynamic estimation of air pollution. IEEE Trans. Automat. Contr. 19(6), 904–910. https://doi.org/10.1109/TAC.1974.1100742 (1974).
Lamb, R. G. & Neiburger, M. An interim version of a generalized urban air pollution model. Atmos. Environ. 5, 239–264 (1971).
Roadknight, C. M., Balls, G. R., Mills, G. E. & Palmer-Brown, D. Modeling complex environmental data. IEEE Trans. Neural Netw. 8(4), 852–862. https://doi.org/10.1109/72.595883 (1997).
Spellman, G. An application of artificial neural networks to the prediction of surface ozone concentrations in the United Kingdom. Appl. Geogr. 19(2), 123–136 (1999).
Niska, H., Hiltunen, T., Karppinen, A., Ruuskanen, J. & Kolehmainen, M. Evolving the neural network model for forecasting air pollution time series. Eng. Appl. Artif. Intell. 17(2), 159–167. https://doi.org/10.1016/j.engappai.2004.02.002 (2004).
Cakmak, S., Hebbern, C., Vanos, J., Crouse, D. L. & Burnett, R. Ozone exposure and cardiovascular-related mortality in the Canadian Census Health and Environment Cohort (CANCHEC) by spatial synoptic classification zone. Environ. Pollut. 214, 589–599. https://doi.org/10.1016/j.envpol.2016.04.067 (2016).
Govender, P. & Sivakumar, V. Application of k-means and hierarchical clustering techniques for analysis of air pollution: a review (1980–2019). Atmos. Pollut. Res. 11(1), 40–56 (2020).
Liu, B. C., Binaykia, A., Chang, P. C., Tiwari, M. K. & Tsao, C. C. Urban air quality forecasting based on multi-dimensional collaborative support vector regression (SVR): a case study of Beijing–Tianjin–Shijiazhuang. PLoS ONE 12(7), 1–17 (2017).
García Nieto, P. J., Sánchez Lasheras, F., García-Gonzalo, E. & de Cos Juez, F. J. Estimation of PM₁₀ concentration from air quality data in the vicinity of a major steelworks site in the metropolitan area of Avilés (Northern Spain) using machine learning techniques. Stoch. Env. Res. Risk A. 32(11), 3287–3298 (2018).
Riesgo García, M. V., Krzemień, A., del Campo, M., García-Miranda, C. E. & Sánchez Lasheras, F. Rare earth elements price forecasting by means of transgenic time series developed with ARIMA models. Resour. Policy. 59, 95–102 (2018).
Van Buuren, S. & Groothuis-Oudshoorn, K. Mice: multivariate imputation by chained equations in R. J. Stat. Softw. 45, 1–67 (2011).
Tsay, R. S. Multivariate Time Series Analysis with R and Financial Applications (Wiley, New York, 2014).
Ordóñez, C., Sánchez Lasheras, F., Roca-Pardiñas, J. & de Cos Juez, F. J. A hybrid ARIMA–SVM model for the study of the remaining useful life of aircraft engines. J. Comput. Appl. Math. 346, 184–191 (2019).
Brockwell, P. J. & Davis, R. A. Introduction to Time Series and Forecasting (Springer, New York, 2002).
R Core Team. R: a language and environment for statistical computing. R Foundation for Statistical Computing (Vienna, Austria, 2019). https://www.R-project.org/.
Trapletti, A. & Hornik, K. tseries: Time Series Analysis and Computational Finance. R package version 0.10-47.
Tsay, R. S. & Wood, D. MTS: All-Purpose Toolkit for Analyzing Multivariate Time Series (MTS) and Estimating Multivariate Volatility Models. R package version 1.0. https://CRAN.R-project.org/package=MTS (2018).
Martin, V., Hurn, S. & Harris, D. Econometric Modelling with Time Series. Specification, Estimation and Testing (Cambridge University Press, Cambridge, 2013).
Akaike, H. A new look at the statistical model identification. IEEE Trans. Autom. Control 19, 716–723 (1974).
Schwarz, G. Estimating the dimension of a model. Ann. Stat. 6, 461–464 (1978).
Rosenblatt, F. Principles of Neurodynamics (Spartan Books, Washington, 1962).
Kolmogorov, A. N. On the representation of continuous functions of many variables by superposition of continuous functions of one variable and addition. Dokl. Akad. Nauk SSSR 114(5), 953–956 (1957).
García-Nieto, P. J., Martínez Torres, J., de Cos Juez, F. J. & Sánchez Lasheras, F. Using multivariate adaptive regression splines and multilayer perceptron networks to evaluate paper manufactured using Eucalyptus globulus. Appl. Math. Comput. 219(2), 755–763 (2012).
Fritsch, S., Guenther, F. & Wright, M.N. neuralnet: Training of Neural Networks. R package version 1.44.2. https://CRAN.R-project.org/package=neuralnet (2019).
Haykin, S. Neural Networks: A Comprehensive Foundation (Prentice Hall, Upper Saddle River, 1998).
Vapnik, V. The Nature of Statistical Learning Theory (Springer, Berlin, 2000).
Suárez Sánchez, A., Riesgo Fernández, P., Sánchez Lasheras, F., de Cos Juez, F. J. & García Nieto, P. J. Prediction of work-related accidents according to working conditions using support vector machines. Appl. Math. Comput. 218(7), 3539–3552 (2011).
Kuhn, M. & Johnson, K. Applied Predictive Modeling (Springer, New York, 2013).
Meyer, D., Dimitriadou, E., Hornik, K., Weingessel, A. & Leisch, F. e1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien. R package version 1.7-2. https://CRAN.R-project.org/package=e1071 (2019).
Drucker, H., Burges, C., Kaufman, L., Smola, A. & Vapnik, V. Support Vector Regression Machines. Adv. Neural Inf. 9, 155–161 (1997).
Friedman, J. H. Multivariate adaptive regression splines. Ann. Stat. 19(1), 1–67. https://doi.org/10.1214/aos/1176347963 (1991).
Sánchez Lasheras, F., García Nieto, P. J., de Cos Juez, F., Mayo Bayón, R. & González Suárez, V. A hybrid PCA-CART-MARS-based prognostic approach of the remaining useful life for aircraft engines. Sensors. 15(3), 7062–7083 (2015).
de Andrés Suárez, J., Lorca Fernández, P. & Sánchez Lasheras, F. Bankruptcy forecasting: a hybrid approach using Fuzzy c-means clustering and Multivariate Adaptive Regression Splines (MARS). Expert Syst. Appl. 38(3), 1866–1875 (2011).
Milborrow, S. Derived from mda:mars by Trevor Hastie and Rob Tibshirani. Uses Alan Miller's Fortran utilities with Thomas Lumley's leaps wrapper. earth: Multivariate Adaptive Regression Splines. R package version 5.1.1. https://CRAN.R-project.org/package=earth (2019).
Put, R., Xu, Q. S., Massart, D. L. & Vander Heyden, Y. Multivariate adaptive regression splines (MARS) in chromatographic quantitative structure–retention relationship studies. J. Chromatogr. A 1055(1–2), 11–19. https://doi.org/10.1016/j.chroma.2004.07.112 (2004).
García Nieto, P. J., Sánchez Lasheras, F., García-Gonzalo, E. & de Cos Juez, F. J. PM₁₀ concentration forecasting in the metropolitan area of Oviedo (Northern Spain) using models based on SVM, MLP, VARMA and ARIMA: a case study. Sci. Total Environ. 621, 753–761 (2018).

Author information

Authors and Affiliations

Department of Mathematics, Faculty of Sciences, University of Oviedo, c/ Federico García Lorca 18, 33007, Oviedo, Spain
Fernando Sánchez Lasheras, Paulino José García Nieto & Esperanza García Gonzalo
Department of Physics, Faculty of Sciences, University of Oviedo, c/ Federico García Lorca 18, 33007, Oviedo, Spain
Laura Bonavera
Department of Mining Exploitation and Prospecting, University of Oviedo, c/ Independencia 13, 33004, Oviedo, Spain
Francisco Javier de Cos Juez

Contributions

F.S.L. conceived the ideas, F.S.L. and P.J.G.N. designed the study and retrieved the information. F.S.L. and F.J.C.J. trained and validated the machine learning models. F.S.L., L.B. and E.G.G. wrote the draft of the manuscript. L.B. revised the manuscript.

Corresponding author

Correspondence to Fernando Sánchez Lasheras.

Ethics declarations

Competing interests

Laura Bonavera acknowledges the PGC 2018 project PGC2018-101948-B-I00 (MINECO/FEDER) and PAPI-19-EMERG-11 (UNIOVI). The rest of the authors declare no competing financial interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Sánchez Lasheras, F., García Nieto, P.J., García Gonzalo, E. et al. Evolution and forecasting of PM10 concentration at the Port of Gijon (Spain). Sci Rep 10, 11716 (2020). https://doi.org/10.1038/s41598-020-68636-5

Received: 10 January 2020
Accepted: 30 June 2020
Published: 16 July 2020
DOI: https://doi.org/10.1038/s41598-020-68636-5
