Modeling health outcomes of air pollution in the Middle East by using support vector machines and neural networks

Ayesha; Noor-ul-Amin, Muhammad; Albalawi, Olayan; Mushtaq, Nadia; Mahmoud, Emad E.; Yasmeen, Uzma; Nabi, Muhammad

doi:10.1038/s41598-024-71694-8


Article
Open access
Published: 14 September 2024


Ayesha¹, Muhammad Noor-ul-Amin², Olayan Albalawi³, Nadia Mushtaq⁴, Emad E. Mahmoud⁵, Uzma Yasmeen¹,⁶ & Muhammad Nabi⁷

Scientific Reports volume 14, Article number: 21517 (2024)


Abstract

This study investigates the impact of air pollution on health outcomes in Middle Eastern countries, a region facing severe environmental challenges. Such findings are important because they can inform policy-level and interventional changes in public health practice. Numerical analysis of the data and of its association with health parameters was carried out using analytical tools such as AIR Data, ARIMA, ANN, SVM and Exponential Smoothing. Among the models, the Support Vector Machine again came out on top, achieving high accuracy with a Mean Absolute Percentage Error of approximately 1%. For mortality from air pollution in Qatar, the value is reported as 959, compared with 11.096 for the Autoregressive Integrated Moving Average, 9.892 for Exponential Smoothing and 4.61 for Artificial Neural Networks. These results indicate that modeling strategies need to be adapted to the context and that ML models can feasibly be implemented in public health planning. The paper presents methodological frameworks for modeling and analyzing the EHDs and offers guidance to policy makers intending to reduce the health effects of airborne pollution.

Introduction

The Middle East has experienced economic development, expansion of built-up areas, and rising levels of industrial activity and motorised traffic. While these developments have supported the region's growth, deteriorating air quality has increasingly threatened the health of its people. The health risks posed by pollution are easier to address when the relationship between air quality and human welfare in the Middle East is properly understood. This paper therefore explores the relationship between various types of pollution and public health in the Middle East region. Specifically, it intends to forecast and analyse the health impact of pollution, drawing attention to the critical decade of 2020–2030. The research explores the complex aspects of this issue by examining various pollutants, their origins, and the socio-economic factors that affect population vulnerability in the region. Pollution, in this context, refers to the emission of substances or energy into the atmosphere, leading to harmful impacts on the environment, including risks to human health, damage to ecosystems, and interference with environmental functions and other legitimate uses. This definition is provided by the European Environment Agency^1,2. Land pollution, as defined by the Texas Disposal System, is the degradation of the Earth's land surfaces, both above ground and underground³. Water pollution is the contamination of water sources by substances that render the water unsuitable for activities such as cooking, drinking, swimming and cleaning. This information is provided by Harvard T.H. Chan School of Public Health⁴. Air pollution is a perilous form of environmental contamination. As outlined by the National Institute of Environmental Health Sciences (NIH), it is a combination of harmful substances originating from both human-made and natural sources⁵.

Nature releases hazardous substances into the air, including gases produced from the decay of organic matter in soils, gases and ashes from volcanic explosions, and fumes from wildfires, often ignited by human activities. This information is sourced from the National Institute of Health Sciences⁶.

Ambient Particulate Matter Pollution (PMP) comprises solid and liquid particles present in the air, including dirt, soot, smoke, and dust. These particles do not originate from a single source; they form a mixture that varies in composition, size, and shape. Household Air Pollution (HAP) is produced through the burning of household fuels, resulting in indoor air pollution and adding to the overall burden of outdoor air pollution. This insight is provided by the World Health Organization⁷. According to the Pan American Health Organization (PAHO), the inefficient burning of fuels is one of the primary factors contributing to Disability-Adjusted Life Years (DALYs) and fatalities worldwide⁸. Ozone, a gas composed of three oxygen atoms, exists naturally and is also produced through human activities. It is highly reactive and is found in both the troposphere and stratosphere, with the majority located in the stratosphere. The upper stratosphere contains a protective layer of ozone that shields the Earth from harmful ultraviolet radiation emitted by the sun. An atmosphere contaminated by pollution poses a threat to everyone, with infants being particularly vulnerable. The number of infant deaths resulting from air pollution is staggering and contributes significantly to global child mortality each year. This issue is especially prominent in underdeveloped and developing countries. Air pollution, whether indoors or outdoors, is the primary culprit behind these life-threatening diseases. Notably, even in developed nations such as China, the toll of premature mortality due to air pollution remains alarmingly high⁹.

Previous work has assessed the effectiveness of employing different models¹⁰. MLM has been useful for in-depth studies on the suitability of various existing architectures for diverse forecasting tasks¹¹. A comprehensive examination of all air pollutant factors has been conducted by correlating these elements with one another¹². The SARIMA model has been employed to project future concentrations of PM2.5, anticipating an increase in the number of PM2.5 particles in the upcoming year. The forecast provides both the minimum and maximum predictions, ranging over 100 µg/m³. In previous literature, the Autoregressive Integrated Moving Average (ARIMA) model has been utilized to predict the prevalence of diseases owing to its capability and ease in explaining datasets^13,16,17,18. Holt-Winters exponential smoothing (ES) has found application in modeling and forecasting various issues, including but not limited to electricity consumption¹⁶. Artificial Neural Networks (ANN) were utilized to predict the daily cases of COVID-19 in various countries, including China, Iran, Italy, Japan, South Africa, Singapore, and the USA¹⁵. In forecasting the outbreak of COVID-19, ANN have been sparingly employed to predict death and recovery cases¹⁹. Another study utilized an ANN to accurately predict COVID-19 confirmed cases and deaths, demonstrating the efficacy of the ANN model in forecasting future trends and emphasizing the importance of stringent control measures to curb the pandemic's spread²⁰. ANN proved to be dependable in forecasting increases in timber prices, yet encountered challenges when faced with unforeseen surges. That research indicated their superiority over traditional methods and highlighted the need for enhancements, such as incorporating exogenous variables and optimizing model structures, to achieve more accurate predictions of timber prices²¹.

A comparison of statistical and machine learning models (ARIMA, SARIMA, NAR, LSTM) based on Mean Absolute Percentage Error (MAPE) demonstrated that, in general, open-loop models outperformed closed-loop ones, although exceptions were observed at specific stations. While no clearly optimal method was identified for open loops, machine learning techniques, particularly Long Short-Term Memory (LSTM) and Neural Autoregression (NAR-NN), exhibited superior performance compared to statistical methods in closed-loop setups. The study underscored a notable correlation between MAPE and Relative Standard Deviation (RSD) for both loop types²². Another study utilized artificial neural networks for mortality forecasting in the Kurdistan Region, citing their superior predictive accuracy. It underscored the importance of data quality and recommended the adoption of the KRG-HIS program to enhance health data collection. The study highlighted that widespread implementation of such a system could contribute to better health planning and more accurate mortality forecasts²³.

This study provides a complete comparative analysis of classical statistical models and advanced machine learning techniques for predicting air pollution levels and the health impacts linked with them. The study recognizes the strengths and weaknesses of each approach by evaluating the performance of models such as ARIMA, ES, SVM, and ANN. This benchmarking is decisive for selecting the most appropriate model for specific forecasting tasks. The outcomes offer valuable insights into model performance, practical guidance for model selection and support for informed decision-making in public health and policy. This study contributes to environmental health research by highlighting the potential of advanced machine learning techniques to improve predictive accuracy and reliability.

Data source and methodology

This research used data from the Institute for Health Metrics and Evaluation (IHME) Global Burden of Disease (GBD) website for 1990–2019, focusing on eight Middle East Arab States: Syrian Arab Republic, Iraq, Kuwait, United Arab Emirates, Iran, Lebanon, Saudi Arabia, and Qatar.

The research employs the aforementioned dataset, covering the years 1990–2019 for Middle Eastern countries and consisting of 240 observations (eight countries with 30 values each). In applying machine learning techniques, 80% of the data (1990–2013) was used for model training, while the remaining 20% (2014–2019) was reserved for model testing. Forecasting plots were created to visually represent the predicted values generated by the models. The Mean Absolute Percentage Error (MAPE) was utilized for model validation. Both traditional statistical methods and machine learning algorithms, including ARIMA, Exponential Smoothing (ES), Support Vector Machines (SVM), and Neural Networks (NN), were employed in constructing these forecasting models.
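The chronological 80/20 split described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `chronological_split` and the placeholder values are assumptions, and the real series would come from the GBD data. With 30 annual values per country, 80% corresponds to 24 training years and 6 test years.

```python
def chronological_split(years, values, train_fraction=0.8):
    """Split one country's yearly series into train and test parts without shuffling."""
    if len(years) != len(values):
        raise ValueError("years and values must have the same length")
    ordered = sorted(zip(years, values))             # oldest year first
    n_train = round(train_fraction * len(ordered))   # 30 years * 0.8 = 24
    return ordered[:n_train], ordered[n_train:]


# Hypothetical placeholder values, used only to make the sketch runnable.
years = list(range(1990, 2020))                      # 30 annual observations
values = [100.0 + 2.5 * i for i in range(len(years))]

train, test = chronological_split(years, values)
print("training years:", train[0][0], "to", train[-1][0])
print("test years:", test[0][0], "to", test[-1][0])
```

Keeping the split chronological, rather than random, means that the test years always lie after the training years, which matches how the forecasts are later used.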

ARIMA model

The ARIMA modeling approach with the Box-Jenkins methodology offers a conventional framework for time series analysis. Its strengths lie in its flexibility, in automatic model selection through functions such as auto.arima() in R, and in the interpretability of its components. The steps for the ARIMA model are illustrated in Fig. 1.

The ARIMA is denoted as “ARIMA(p, d, q)”. The expression is as follows:

$${\widehat{Y}}_{t}^{\prime} = \mu +{\phi }_{1}{y}_{t-1}^{\prime} + \cdots +{\phi }_{p} {y}_{t-p}^{\prime}+ {\theta }_{1} {\varepsilon }_{t-1}+\cdots +{\theta }_{q} {\varepsilon }_{(t-q)}+{\varepsilon }_{t}$$

(1)

where $d$ is the order of differencing (the integrated part), $p$ the order of the autoregressive process, $q$ the order of the moving average process, $\mu$ the mean, $\phi$ the autoregressive coefficients, $\theta$ the moving average coefficients, ${y}_{t}$ the time series, ${y}_{t}^{\prime}$ the differenced time series, and $\varepsilon$ the random error.
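To make Eq. (1) concrete, the sketch below implements only its simplest special case, ARIMA(1,1,0): the series is differenced once, a single autoregressive coefficient and the constant are estimated by least squares, and forecasts are obtained recursively. It is an illustration under these assumptions and is not the auto.arima() procedure mentioned above; all function names are hypothetical.

```python
def difference(series):
    """First difference y'_t = y_t - y_{t-1} (the d = 1 case)."""
    return [b - a for a, b in zip(series, series[1:])]


def fit_ar1(diffed):
    """Least-squares estimates of mu and phi_1 in y'_t = mu + phi_1 * y'_{t-1} + e_t."""
    x, y = diffed[:-1], diffed[1:]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    phi = sxy / sxx if sxx > 0 else 0.0   # constant differences carry no AR signal
    mu = mean_y - phi * mean_x
    return mu, phi


def forecast_arima110(series, horizon):
    """Forecast the differenced series recursively, then undo the differencing."""
    diffed = difference(series)
    mu, phi = fit_ar1(diffed)
    level, last_diff = series[-1], diffed[-1]
    forecasts = []
    for _ in range(horizon):
        last_diff = mu + phi * last_diff  # future errors replaced by their mean, zero
        level += last_diff
        forecasts.append(level)
    return forecasts


# A series that grows by 2 each year is extrapolated along the same line.
print(forecast_arima110([10, 12, 14, 16, 18, 20], horizon=3))
```

Higher orders $p$ and $q$ add further lagged terms to the same recursion; estimating the moving average terms requires an iterative fit that is outside the scope of this sketch.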

Exponential smoothing method

The Single Exponential Smoothing (ES) method was utilized to generate forecasts for the time series dataset; it was selected because the data show no seasonal pattern and no significant upward or downward trend. This approach estimates the current level of the time series using a single smoothing parameter, α, which determines the weight assigned to the most recent observation. The value of α varies between 0 and 1. The method is a weighted moving average with exponentially decreasing weights; the calculation is given by the following formula.

$${s}_{t}=\alpha {x}_{t}+\left(1-\alpha \right){s}_{t-1}={s}_{t-1}+\alpha \left({x}_{t} - {s}_{t-1}\right)$$

(2)

where $t$ is the time, ${x}_{t}$ the observed value (the quantity of interest), ${s}_{t}$ the smoothed statistic, ${s}_{t-1}$ the previous smoothed statistic, and $\alpha$ the smoothing factor. In the course of the analysis, Holt's Exponential Smoothing (ES) method was chosen because no periodic (seasonal) relations were identified in the observed data. The goal of employing Holt's method is to capture the fundamental patterns in the data, leading to more precise forecasts and better-informed decision-making. The formula is expressed as: ${s}_{1}={x}_{1}$ and ${b}_{1}={x}_{1}-{x}_{0}$, and for $t>1$,

$${s}_{t}=\alpha {x}_{t}+\left(1 - \alpha \right)\left({s}_{t-1}+{b}_{t-1}\right)$$

(3)

$${b}_{t}=\beta \left({s}_{t} - {s}_{t-1}\right)+\left(1 - \beta \right){b}_{t-1}$$

(4)

Here, $t$ is the time, ${x}_{t}$ the observed value, $\alpha$ the smoothing factor, ${s}_{t}$ the smoothed statistic, ${s}_{t-1}$ the previous smoothed statistic, ${b}_{t}$ the best estimate of the trend at time $t$, and $\beta$ the trend smoothing factor, with $0<\beta <1$.
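The recursions in Eqs. (3) and (4), with the initial values ${s}_{1}={x}_{1}$ and ${b}_{1}={x}_{1}-{x}_{0}$, can be sketched as below. The function name `holt_forecast` is hypothetical, and the $h$-step-ahead forecast ${s}_{n}+h\,{b}_{n}$ is the standard Holt forecast function, which the text does not state explicitly.

```python
def holt_forecast(x, alpha, beta, horizon):
    """Holt's linear exponential smoothing following Eqs. (3) and (4)."""
    if len(x) < 2:
        raise ValueError("need at least two observations to initialise the trend")
    if not (0 < alpha < 1 and 0 < beta < 1):
        raise ValueError("alpha and beta must lie strictly between 0 and 1")
    s, b = x[1], x[1] - x[0]                           # s_1 = x_1, b_1 = x_1 - x_0
    for t in range(2, len(x)):
        s_prev = s
        s = alpha * x[t] + (1 - alpha) * (s_prev + b)  # Eq. (3): level
        b = beta * (s - s_prev) + (1 - beta) * b       # Eq. (4): trend
    # Standard Holt forecast: last level plus h times the last trend estimate.
    return [s + h * b for h in range(1, horizon + 1)]


# For a perfectly linear series the level tracks the data and the trend stays at 1.
print(holt_forecast([0, 1, 2, 3, 4, 5], alpha=0.5, beta=0.5, horizon=3))
```

Setting the trend term aside, that is using only ${s}_{t}=\alpha {x}_{t}+(1-\alpha ){s}_{t-1}$, gives the single exponential smoothing of Eq. (2).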

Support Vector Machine Regression

Support Vector Machine (SVM) regression is a popular machine learning technique that can be applied to time series data. SVMs are particularly good at working with large and complex data. One significant advantage is that they tend not to overfit: the fitted function depends only on the support vectors, the most influential points, which limits the influence of outliers. This characteristic makes SVMs flexible and able to generalize to new data, which improves accuracy in different applications.

SVR was implemented using a linear kernel, defined as $({\text{K}}\left({\text{x}}, {\text{x}}^{\prime}\right)= {\text{x}}^{{\rm T}}{\text{x}}^{\prime}).$ This choice of kernel allows effective modeling of linear relationships within the data. Alternative kernels, such as the polynomial or radial basis function (RBF) kernel, could be employed for more complex data. The regularization parameter (C) was set to 1.0, which balances the trade-off between minimizing training error and maintaining model simplicity to prevent overfitting. The ε-insensitive loss function was applied to ignore errors within a margin ε, so that only deviations exceeding this threshold are penalized. SVR minimizes the following objective function:

$$\text{Objective:}\hspace{1em}\frac{1}{2}{\Vert w\Vert }^{2}+C{\sum }_{i=1}^{n}{\upxi }_{i}$$

subject to the constraints:

$${y}_{i}-\left({w}^{T}\upphi \left({x}_{i}\right)+b\right)\le\upepsilon +{\upxi }_{i}$$

$$\left({w}^{T}\upphi \left({x}_{i}\right)+b\right)-{y}_{i}\le\upepsilon +{\upxi }_{i}$$

where ${\upxi }_{i}$ are slack variables, w represents the weight vector, $(\upphi \left({\text{x}}_{\text{i}}\right))$ denotes the kernel-transformed feature space, and b is the bias term. Predictions are obtained using the function:

$$\text{f}\left(\text{x}\right)={\sum }_{\text{i}=1}^{\text{n}}{{\alpha }}_{\text{i}}\text{K}\left({\text{x}}_{\text{i}},\text{x}\right)+\text{b}$$

where $({{\alpha }}_{\text{i}})$ are the Lagrange multipliers from the training phase and $(\text{K}\left({\text{x}}_{\text{i}},\text{x}\right))$ is the kernel function applied to the test input $(\text{x}).$ Predictions were made on the test set to assess the model's performance, and a forward-looking forecasting loop was incorporated. This loop generates predictions for future years based on the last three observations in the processed data. Figure 2 presents the structure of the SVM model.
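Two ingredients of this section can be illustrated without a full solver: the ε-insensitive loss that enters the objective, and the forward-looking loop that feeds each prediction back in as one of the last three observations. The sketch below is illustrative only. The value ε = 0.1, the example weights and all function names are assumptions; only C = 1.0, the linear kernel and the three-observation window come from the text, and the weights are not fitted to any data.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Zero inside the epsilon tube, growing linearly outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


def linear_predict(weights, bias, x):
    """Linear-kernel prediction w^T x + b."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def svr_objective(weights, bias, X, y, C=1.0, epsilon=0.1):
    """0.5 * ||w||^2 + C * sum of slacks, each slack being the epsilon-insensitive loss."""
    penalty = 0.5 * sum(w * w for w in weights)
    slack = sum(
        epsilon_insensitive_loss(yi, linear_predict(weights, bias, xi), epsilon)
        for xi, yi in zip(X, y)
    )
    return penalty + C * slack


def recursive_forecast(history, predict, horizon, n_lags=3):
    """Forecast step by step, feeding each prediction back into the lag window."""
    window = list(history[-n_lags:])        # last three observations, oldest first
    forecasts = []
    for _ in range(horizon):
        nxt = predict(window)
        forecasts.append(nxt)
        window = window[1:] + [nxt]         # drop the oldest value, append the new one
    return forecasts


# Hypothetical weights (not fitted): predict "last value plus one".
weights, bias = [0.0, 0.0, 1.0], 1.0
print(recursive_forecast([1.0, 2.0, 3.0], lambda w: linear_predict(weights, bias, w), 3))
```

In practice the weights would be obtained by minimising the objective with a quadratic programming or library solver; the loop itself is independent of how the model was trained.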

Artificial neural networks

Artificial neural networks are a family of algorithms inspired by the cognitive processes of the human brain, and they are at the forefront of current advances in prediction. These networks show a remarkable ability to approximate a wide range of functions without requiring extensive knowledge of the data's particulars, and they are extensively applied in time series forecasting. Neural networks form the cornerstone of deep learning, a subfield of machine learning inspired by the intricate structure of the human brain. As shown in Fig. 3, these networks pass data through their layers and can be trained to detect particular patterns. This ability allows them to predict the output for new, similar data sets, emulating the brain's multi-step processing of information. The methodology adopted in the present study is described below:

Data preprocessing

The first step of the methodology is a preliminary analysis of the time series to ensure their compatibility as input data for the neural network model. The data are first ordered chronologically to preserve the sequential structure required in time series analysis. The target variable is then standardized: its mean is shifted to zero and its standard deviation scaled to one, which improves the stability and performance of the neural network during training.
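A minimal sketch of this standardization step is shown below. The function names are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale a series to zero mean and unit standard deviation."""
    mu, sigma = mean(values), pstdev(values)   # population SD: an assumption
    if sigma == 0:
        raise ValueError("a constant series cannot be standardized")
    return [(v - mu) / sigma for v in values], mu, sigma


def destandardize(scaled, mu, sigma):
    """Map standardized values (for example forecasts) back to the original units."""
    return [z * sigma + mu for z in scaled]


series = [120.0, 135.0, 150.0, 170.0, 200.0]   # hypothetical placeholder values
scaled, mu, sigma = standardize(series)
print([round(z, 3) for z in scaled])
```

The returned `mu` and `sigma` are kept so that forecasts produced on the standardized scale can be converted back before they are compared with the observed values.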

Dataset construction

The data for training the neural network are expressed as sequences of historical observations. This involves creating tuples, each of which has the past time steps as input and the value to be predicted at the next time step as output. This sequence-based approach helps the model, as temporal dependencies and patterns can be learned from the data. The length of the input sequence is a critical parameter that determines the extent of the temporal information the model can learn.
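The construction of input and output tuples can be sketched with a sliding window. The function name `make_windows` and the example window length of three are illustrative assumptions.

```python
def make_windows(series, n_lags):
    """Pair each run of n_lags consecutive values with the value that follows it."""
    inputs, targets = [], []
    for end in range(n_lags, len(series)):
        inputs.append(list(series[end - n_lags:end]))  # past time steps (input)
        targets.append(series[end])                    # next value (output)
    return inputs, targets


X, y = make_windows([1, 2, 3, 4, 5], n_lags=3)
print(X)  # [[1, 2, 3], [2, 3, 4]]
print(y)  # [4, 5]
```

A series of length n yields n minus n_lags training tuples, so a longer window gives the model more temporal context per example but fewer examples, which matters for short annual series.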

Neural network architecture

To operationalize the proposed forecasting model, a feedforward neural network was used. The network comprises multiple layers.

A typical Neural Network (NN) design comprises an input layer, hidden layers and an output layer. The input layer receives the input data, and each neuron in this layer corresponds to a feature in the data. The hidden layers lie between the input and output layers; their neurons apply weights to the inputs and then an activation function such as sigmoid or tanh to introduce non-linearity into the model. The output layer delivers the results, with each output neuron corresponding to a particular outcome. In forward propagation the network produces predictions, while in backpropagation the weights of the network are adjusted according to the errors. The structure of an NN can be basic, with a single hidden layer, or deep, with many layers, or it can use specialized layers as in convolutional or recurrent NNs. Figure 3 depicts the structure of the NN.
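The forward propagation step for a network with one hidden layer can be sketched as follows. The tanh activation, the linear output neuron and the example weights are assumptions made for illustration; they are not fitted values from the study.

```python
import math


def forward(x, hidden_weights, hidden_biases, output_weights, output_bias):
    """One forward pass: inputs -> tanh hidden layer -> single linear output."""
    hidden = [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(hidden_weights, hidden_biases)
    ]
    return sum(v * h for v, h in zip(output_weights, hidden)) + output_bias


# Hypothetical weights for three lagged inputs and two hidden neurons.
hidden_weights = [[0.2, -0.1, 0.4], [0.0, 0.3, -0.2]]
hidden_biases = [0.1, -0.1]
output_weights = [0.7, -0.5]
output_bias = 0.05

print(forward([0.5, -1.0, 0.25], hidden_weights, hidden_biases, output_weights, output_bias))
```

Backpropagation would then adjust these weights by following the gradient of the prediction error; that training step is omitted here to keep the sketch short.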

Model training and evaluation

The neural network is trained on one sample of the data, the training set, and its performance is evaluated on another sample, the testing set. The accuracy of the fitted model is measured with the Mean Absolute Percentage Error (MAPE), which expresses prediction error as a percentage of the actual values.
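MAPE is the mean of the absolute errors divided by the actual values, expressed as a percentage. A minimal sketch, with a hypothetical function name and made-up numbers:

```python
def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Two forecasts that are each off by 10% of the actual value.
print(mape([100.0, 200.0], [110.0, 180.0]))
```

Because each error is scaled by its actual value, MAPE allows series of very different magnitude, such as mortalities and DALYs, to be compared on one scale.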

Forecasting and visualization

Once the model has been fitted, it is applied to forecast subsequent periods. The next value is predicted from the most recent data, and this is repeated iteratively. The forecasts are joined with the historical data so that the performance of the model can be visualized. Plots of the actual and forecasted values are created, which allows the predictive accuracy of the model to be assessed.

This flexible, layered structure enables the neural network to be trained on the patterns that exist in the time series data, yielding robust predictions. Its ability to learn complex relationships within the data makes it reliable for time series forecasting. The Neural Network model can be represented as follows:

$${y}_{t}=f\left({\mathbf{y}}_{t-1}\right)+{\epsilon }_{t}$$

(5)

where ${\mathbf{y}}_{t-1}={\left({y}_{t-1},{y}_{t-2},\dots ,{y}_{t-8}\right)}^{\prime}$ is a vector of past values of the series, and $f$ represents an ANN with 2 hidden nodes. The error series $\{{\epsilon }_{t}\}$ is presumed to be homoscedastic. If ${\epsilon }_{T+1}^{*}$ is a stochastic draw from the error distribution at time $T+1$, then ${y}_{T+1}^{*}=f\left({\mathbf{y}}_{T}\right)+{\epsilon }_{T+1}^{*}$ represents a potential realization from the predictive distribution for ${y}_{T+1}$. Setting

$${\mathbf{y}}_{T+1}^{*}=({y}_{T+1}^{*},{y}_{T},\cdots ,{y}_{T-6})^{\prime}$$

the step is repeated to obtain

$${y}_{T+2}^{*}=f\left({\mathbf{y}}_{T+1}^{*}\right)+{\epsilon }_{T+2}^{*}.$$

(6)

Here ${y}_{T+2}^{*}$ is a simulated value of the series at time $T+2$, $f$ is the function that describes the relationship between the past values and the future values of the series, ${\mathbf{y}}_{T+1}^{*}$ is the vector of lagged values whose most recent entry is the simulated value ${y}_{T+1}^{*}$, and ${\epsilon }_{T+2}^{*}$ is an error draw at time $T+2$. This method allows future sample paths to be simulated iteratively. By repeatedly simulating such paths, the researcher builds up the distribution of all future values, based on the fitted ANN.
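The iterative simulation of sample paths can be sketched as below. Drawing the errors by resampling past residuals is an assumption (the text only says they are drawn from the error distribution), and `predict` stands in for the fitted network $f$; all names are hypothetical.

```python
import random


def simulate_paths(history, predict, residuals, horizon, n_paths, n_lags=8, seed=0):
    """Simulate future sample paths: each step adds a random error draw to f(lags)."""
    if len(history) < n_lags:
        raise ValueError("history must contain at least n_lags observations")
    rng = random.Random(seed)                  # seeded for reproducibility
    paths = []
    for _ in range(n_paths):
        window = list(history[-n_lags:])       # lag vector, stored oldest first
        path = []
        for _ in range(horizon):
            value = predict(window) + rng.choice(residuals)  # y* = f(lags) + e*
            path.append(value)
            window = window[1:] + [value]      # simulated value becomes the newest lag
        paths.append(path)
    return paths


# Stand-in for the fitted network: "last value plus one" (purely illustrative).
demo = simulate_paths(
    history=[float(v) for v in range(1, 9)],
    predict=lambda w: w[-1] + 1.0,
    residuals=[-0.5, 0.0, 0.5],
    horizon=3,
    n_paths=5,
)
print(demo)
```

Taking, for each future year, percentiles across the simulated paths would give an approximate prediction interval for that year.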

Descriptive study

The time series analysis includes variables such as fatalities attributed to PMP, HAP and AOP as well as DALYs associated with PMP, AOP and HAP. The descriptive overview of these variables for specific Middle East countries is presented in Table 1.

Table 1 Descriptive statistics for time-series variables in Middle East countries.


Table 1 offers a detailed overview of descriptive statistics for time-series variables related to pollution in various Middle Eastern countries, including Mortalities and Disability-Adjusted Life Years (DALYs) associated with Particulate Matter Pollution (PMP), Ambient Outdoor Pollution (AOP), and Household Air Pollution (HAP). In the United Arab Emirates, the mean Mortality from PMP is 1418.2, with a standard deviation of 761.60, reflecting moderate variability. Mortality ranges from a minimum of 587.9 to a maximum of 3252.0. In the Syrian Arab Republic, the mean Mortality resulting from HAP is 173.9, with a high standard deviation of 1066.18, indicating significant variability. The Mortality range for AOP is between 125.8 and 229.9. In Iraq, the mean Mortality due to PMP is notably high at 20,391, with a substantial standard deviation of 2594.06, underscoring considerable variability. The Mortality for AOP ranges from a minimum of 111.3 to a maximum of 259.4. Kuwait exhibits a mean Mortality from PMP of 905.6, with a lower standard deviation of 319.68, indicating less variability. The range of Mortality in Kuwait extends from a minimum of 4.646 to a maximum of 24.666, a narrower range than in other countries. For the Islamic Republic of Iran, mean Mortality resulting from PMP stands at 35,029, indicating a higher impact, with a standard deviation of 3782.674 signifying a significant degree of variability. The minimum of 495.2 and maximum of 1794.2 highlight the broad spectrum of Mortality resulting from AOP. This analysis reveals distinctive patterns in mortality due to different types of pollution across the region.

In Lebanon, the mean values for Mortality resulting from PMP, AOP and HAP stand at 2636, 71.01 and 2692 respectively. The corresponding DALYs means are 72,760, 1231.61 and 73,737. Lebanon shows relatively moderate variability. Saudi Arabia, on the other hand, portrays higher mean values across all categories such as 13,036 for Mortality resulting from PMP and 13,182 for Mortality resulting from HAP. The DALYs mean values are also substantial indicating a significant impact on public health. Saudi Arabia exhibits moderate to high variability especially in Mortality resulting from PMP and DALYs related to PMP and HAP. These findings collectively underscore the diverse impacts of different pollutants on public health, considering both Mortalities and DALYs.

Results

The examination of Mortalities and DALYs attributed to PMP, HAP and AOP in the Middle East countries Syria, Iraq, Kuwait, United Arab Emirates, Iran, Lebanon, Saudi Arabia and Qatar covered the years 1990 to 2019. The study employed ARIMA, ES, SVM and Neural Network methodologies for forecasting purposes. The time series analysis focused on Mortalities due to PMP, HAP and AOP as well as DALYs related to these three pollution categories. The selected countries' data from 1990 to 2019 served as the foundation for these forecasting techniques, providing a comprehensive understanding of the trends and potential future trajectories in pollution-related health outcomes in the Middle East.

Time series plot

Time series plots were generated for the period spanning 1990 to 2019 to explore data patterns.

Figure 4a,e show that Mortality resulting from PMP and HAP follows a rising pattern in the selected Middle East countries. Figure 4b,f show that DALYs resulting from HAP and PMP follow a rising pattern in the selected Middle East countries except for Iran, where they show a U-shaped pattern. Figure 4c,d show that mortalities and DALYs resulting from AOP follow a rising pattern in the selected Middle East countries, with Iran having higher mortalities and DALYs resulting from AOP and a subtle increase from 2005 to 2019.

Model summaries

ARIMA and Exponential Smoothing models were compared using the Bayesian Information Criterion (BIC). The results are presented in Table 2 for all the countries. In the context of time series analysis, this criterion helps in evaluating and selecting the model that best balances complexity and accuracy, ultimately guiding the choice of the most appropriate forecasting model for each country.

Table 2 Comparison with BIC.


The ARIMA(p,d,q) model was selected using the Box-Jenkins methodology, with the model chosen on the basis of the smaller BIC score. ES performs better than ARIMA, having a lower BIC score (see Table 2). Across all countries and metrics, the ES model consistently achieves lower BIC scores than the ARIMA model. This indicates that, for the given data, the ES model provides a better fit than the ARIMA model for predicting both DALYs and mortalities across the examined Middle Eastern countries. This insight can guide future health data modeling and improve the accuracy of health outcome predictions, ultimately aiding better health policy formulation and implementation.
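The paper does not state how the BIC values were computed. As an illustration only, the sketch below uses the common Gaussian-error form, n·ln(SSE/n) + k·ln(n), which equals the BIC up to an additive constant that depends only on the number of observations n. The function name and the residual values are hypothetical.

```python
import math


def gaussian_bic(residuals, n_params):
    """BIC up to an additive constant, assuming Gaussian errors: n*ln(SSE/n) + k*ln(n)."""
    n = len(residuals)
    sse = sum(e * e for e in residuals)
    if n == 0 or sse == 0:
        raise ValueError("need at least one non-zero residual")
    return n * math.log(sse / n) + n_params * math.log(n)


# Hypothetical in-sample residuals from two candidate models on the same series.
residuals_a = [1.2, -0.8, 0.5, -1.1, 0.9, -0.4]
residuals_b = [0.6, -0.5, 0.3, -0.7, 0.4, -0.2]

print(gaussian_bic(residuals_a, n_params=3))
print(gaussian_bic(residuals_b, n_params=2))
```

The first term rewards a closer fit and the second penalizes additional parameters, so the model with the lower value is preferred.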

Visualization for ARIMA

The forecasting charts for the selected Middle East countries are shown in Fig. 5.

The projected values for the next 10 years were visualized for ARIMA. In Fig. 5a,e, depicting mortality resulting from PMP and HAP, an ascending trend is observed in Iraq, suggesting an increase in the number of mortalities each year over the forecast period. Conversely, all other series exhibit relatively constant forecasts, indicating a consistent number for the next 10 years. Figure 5c,d, illustrating mortalities and DALYs resulting from AOP, reveal an ascending trend in Kuwait and Iran, signaling a yearly increase in both Mortalities and DALYs resulting from AOP. In Fig. 5b,f, representing DALYs resulting from HAP and PMP, an upward trend is evident across all countries, indicating a yearly rise in DALYs.

Visualization for exponential smoothing

Forecast plots for 2020–2030 were produced for the selected Middle East countries.

The projected values for the next ten years were visualized for ES. In Fig. 6a,e, depicting mortality resulting from PMP and HAP, an ascending trend is observed in Iraq, suggesting an increase in the number of mortalities each year over the forecast period. Conversely, all other series exhibit relatively constant forecasts, indicating a consistent number for the next 10 years. Figure 6c,d, illustrating mortalities and DALYs resulting from AOP, reveal an ascending trend in Kuwait and Iran, signaling a yearly increase in both Mortalities and DALYs resulting from AOP. In Fig. 6b,f, showing DALYs resulting from HAP and PMP, a rising trend is evident across all Middle East countries, indicating a yearly rise in DALYs.

Visualization for SVM

The plots for eight Middle East countries, examining trends from 2020 to 2030, were generated using SVM projections. In Fig. 7a,b, which represent mortalities and DALYs resulting from PMP, an upward trend is observed in all countries except Syria and the Maldives, indicating an increase in the number of mortalities and DALYs each year. In Syria and the Maldives, a downward trend suggests a decrease in both mortalities and DALYs over the same period.

Figure 7e,f, depicting mortalities and DALYs resulting from HAP, show a rising trend in all Middle East countries, signaling an annual rise in mortalities and DALYs resulting from HAP, except for Kuwait and Iraq, where no change is observed. Figure 7c,d, illustrating mortalities and DALYs resulting from AOP, reveal a rising pattern in the selected Middle East countries except for Iran, which exhibits a different pattern.

Visualizations for NN

Plots were produced for eight Middle East countries to analyze trends over 2020–2030.

The predicted values for the years 2020–2030 were visualized for the ANN models. In Fig. 8a,b, illustrating mortalities and DALYs resulting from PMP, an upward trend is evident in all countries, indicating an annual increase in the number of mortalities and DALYs resulting from PMP, except in Syria and the Maldives, where a downward trend suggests a decrease in both mortalities and DALYs. Figure 8e,f, portraying mortalities and DALYs resulting from HAP, exhibit an upward pattern in the selected Middle East countries, signaling an annual rise in mortalities and DALYs resulting from HAP, except for Kuwait and Iraq, where no change is observed. Figure 8c,d, representing mortalities and DALYs resulting from AOP, reveal a rising trend in the Middle East countries, except in Iran, which displays a different pattern.

Comparative study

The study used the ARIMA, Exponential Smoothing (ES), Support Vector Machine (SVM) and Artificial Neural Network (ANN) methods for data analysis. The accuracy of these models was measured using the Mean Absolute Percentage Error (MAPE). MAPE is easy to interpret because it expresses the average absolute difference between the forecast and the actual outcome as a percentage of the actual outcome; lower values indicate more accurate forecasts. This makes it a convenient metric for comparing the effectiveness of each forecasting method in time series analysis.
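For reference, MAPE is commonly defined as MAPE = (100/n) × Σ |A_t − F_t| / |A_t|, where A_t is the actual value and F_t the forecast at time t. The following Python function is a minimal sketch of that standard definition; the example values are hypothetical and the handling of zero actual values (raising an error) is an assumption, since the evaluation code of the study is not shown.

```python
def mape(actual, forecast):
    """Mean Absolute Percentage Error, in percent, using the standard definition."""
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must have the same length")
    if len(actual) == 0:
        raise ValueError("at least one observation is required")
    if any(a == 0 for a in actual):
        # The percentage error is undefined when the actual value is zero.
        raise ValueError("MAPE is undefined when an actual value is zero")
    total = sum(abs((a - f) / a) for a, f in zip(actual, forecast))
    return 100.0 * total / len(actual)


# Hypothetical values, for illustration only (not data from the study).
observed = [100, 200]
predicted = [110, 180]
print(mape(observed, predicted))
```

In this example each forecast is off by 10% of the actual value, so the resulting MAPE is 10%.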

Table 3 evaluates the forecasting performance of the applied models, namely ARIMA, Exponential Smoothing (ES), Support Vector Machines (SVM) and Artificial Neural Networks (ANN), in terms of Mean Absolute Percentage Error (MAPE). SVM and ANN, a machine learning and a deep learning model respectively, generally achieve greater accuracy than the traditional statistical models ARIMA and ES. For the pollution variables considered in the Middle Eastern countries, SVM outperformed the other models in a consistent manner. In the United Arab Emirates (UAE), SVM provides lower MAPE values than ARIMA, ES and ANN across the different types of pollution. For instance, for mortality due to Particulate Matter Pollution (PMP), SVM achieves a MAPE of 0.900, which is far lower than that of the other models: 24.271 for ARIMA, 20.607 for ES and 22.44 for ANN. SVM likewise gives low MAPE values for the other pollution variables, 5.786 for AOP and 0.520 for HAP.

Table 3 Comparison of models by MAPE.

While SVM currently exhibits superior performance in this context, it is important to recognize that ANN has the potential to excel, especially with more extensive datasets. The findings highlight SVM’s reliability for pollution-related health predictions in the Middle East, but researchers should also consider ANN’s scalability and potential for improved performance with larger datasets. This underscores the importance of selecting the most suitable model based on the specific dataset characteristics and forecasting requirements.
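Selecting the most suitable model for each series amounts to choosing the lowest MAPE per series. The short Python sketch below illustrates this using only the MAPE values quoted in the text of this article (UAE PMP mortality, and DALYs from PMP in Iraq as reported in the Conclusion). The dictionary layout and the helper name `best_model` are assumptions made for illustration, and the ES value for Iraq is omitted because it is not quoted in the text.

```python
# MAPE values as quoted in the text; lower is better.
mape_scores = {
    "UAE, mortality from PMP": {"ARIMA": 24.271, "ES": 20.607, "SVM": 0.900, "ANN": 22.44},
    "Iraq, DALYs from PMP": {"ARIMA": 2.191, "SVM": 0.887, "ANN": 0.794},
}


def best_model(scores):
    """Return the model name with the lowest MAPE in one series."""
    return min(scores, key=scores.get)


for series_name, scores in mape_scores.items():
    print(series_name, "->", best_model(scores))
```

With these quoted values, SVM is selected for the UAE series and ANN for the Iraq series, which shows why the choice is made per series rather than once for the whole study.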

Discussion

The unique aspect of this research lies in its dedicated focus on the Middle East countries and their contributions to air pollution, along with the consideration of Air Quality Action Plans (AOP). The research goes a step further by examining the forecasted health consequences, including potential mortalities and disabilities resulting from air pollution in the Middle East. It combines time series analysis with machine learning methods, and the use of machine learning models marks a departure from common practice in air pollution forecasting. The complexity of the research is increased by the need to gather and integrate data from various sources on air quality, health and environmental factors across several Middle East countries, and the study presents a novel solution to this data puzzle. The research extends beyond ordinary prediction, aiming to provide actionable insights that policymakers can use to make informed decisions. It acknowledges the unique characteristics and challenges of each Middle East country, tailoring its forecasts to the specific situation in each nation. It also adopts a global perspective, suggesting that the methods and findings could offer valuable insights for other regions dealing with similar pollution issues. The study thus contributes to a broader understanding of the health impacts of air pollution in the Middle East, offering a fresh perspective on this critical issue and presenting a model that could inform strategies for fighting pollution worldwide.

The implications of the results of this study are significant in several fields. By comparing classical statistical models with more sophisticated machine learning approaches, the research helps public health departments predict pollution-related health effects and makes timely and efficient interventions possible. These findings can also help environmental agencies improve how air quality is monitored, managed and mitigated. Health departments and government agencies can reduce inefficiency in the use of resources by applying the most accurate predictive models to areas where the risks to public health are greatest. Policymakers may use the results to develop evidence-based regulations and guidelines to control air pollution and ultimately enhance public health. The study also helps raise awareness of the adverse health effects of air pollution and thereby increases public support for environmental conservation. Finally, it provides guidance for further scientific work, opening the way for improvements and innovations in the field of environmental health prediction. Overall, the study offers a positive return on investment to society by improving public health approaches, environmental surveillance, and decisions and policies.

Conclusion

This paper analyses the effects of air quality on health status in Middle Eastern countries using traditional and machine learning forecasting models. Large differences in pollution impacts were detected between countries, as revealed by differences in the mean, standard deviation and range for each country considered. The analysis showed rising trends in mortalities and DALYs due to the three types of pollution under study from 1990 to 2019. Among the models, SVM generally outperformed the others in accuracy, with a MAPE of 0.900 for predicting PMP mortalities in the UAE, compared to 24.271 for ARIMA, 20.607 for ES and 22.44 for ANN. While SVM demonstrated the best overall performance, ANN showed potential, particularly with larger datasets, as seen in predicting DALYs from PMP in Iraq, where it achieved a MAPE of 0.794 versus 0.887 for SVM and 2.191 for ARIMA. The comparison of models using the MAPE metric thus reveals SVM's reliability and accuracy in predicting health outcomes related to air pollution, while leaving room for ANN to excel with more extensive datasets. The study is distinctive in comparing traditional time series analysis with advanced neural network methodologies, and its implications extend globally: it contributes to a broader understanding of this critical issue by offering a framework that could inform strategies worldwide, bridging the gap between traditional and advanced methodologies for more accurate and actionable predictions. These findings underscore the importance of selecting appropriate forecasting models based on dataset characteristics and offer valuable insights for policymakers aiming to mitigate the health impacts of air pollution in the Middle East and similar regions globally. The study's forecasts aid in emergency preparedness, while its comparative analysis of forecasting models helps refine predictive methodologies. Overall, it supports a comprehensive approach to mitigating air pollution and its health effects.

Data availability

All data analyzed in the course of this research were sourced from the “Institute of Health Metrics and Evaluation GBD” website, covering the years 1990 to 2019. The data can be accessed through the following link: https://ghdx.healthdata.org/gbd-2019.

Acknowledgements

The authors extend their appreciation to Taif University, Saudi Arabia, for supporting this work through Project Number (TU-DSPP-2024-94).

Author information

Authors and Affiliations

The University of Lahore, Lahore, Pakistan
Ayesha & Uzma Yasmeen
Department of Statistics, COMSATS University Islamabad-Lahore Campus, Lahore, Pakistan
Muhammad Noor-ul-Amin
Department of Statistics, Faculty of Science, University of Tabuk, Tabuk, Saudi Arabia
Olayan Albalawi
Forman Christian College (A Chartered University), Lahore, Pakistan
Nadia Mushtaq
Department of Mathematics and Statistics, College of Science, Taif University, P.O. Box 11099, 21944, Taif, Saudi Arabia
Emad E. Mahmoud
Brock University, St. Catharines, Canada
Uzma Yasmeen
Khost Mechanics Institute, Khost, Afghanistan
Muhammad Nabi

Contributions

A.B. and M.N.A. conceived and designed the study. A.B. and N.M. conducted the data collection and preprocessing. M.N.A. and U.Y. performed the statistical analysis and modeling. O.A. worked on the evaluation of the time series/ARIMA models, contributed to the analysis, and improved the language and presentation of the manuscript. N.M. and M.N. prepared the figures and tables. E.E.M. reviewed the manuscript, made technical corrections and contributed to the analysis of figures. M.N. is the corresponding author and managed the manuscript submission process. All authors reviewed the manuscript and approved the final version.

Corresponding author

Correspondence to Muhammad Nabi.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Ayesha, Noor-ul-Amin, M., Albalawi, O. et al. Modeling health outcomes of air pollution in the Middle East by using support vector machines and neural networks. Sci Rep 14, 21517 (2024). https://doi.org/10.1038/s41598-024-71694-8

Received: 15 March 2024
Accepted: 30 August 2024
Published: 14 September 2024
DOI: https://doi.org/10.1038/s41598-024-71694-8
