Introduction

Air pollution is a critical global issue that poses severe threats to public health, environmental sustainability, and economic stability. Industrialization, urbanization, and vehicular emissions have led to deteriorating air quality, resulting in respiratory diseases, cardiovascular complications, and increased mortality rates. Given the need for effective environmental management and public health interventions, accurate air pollution prediction has become an essential research area. Traditional air quality monitoring stations, while precise, are costly to install and maintain, making them infeasible for widespread implementation. To address these limitations, computational models have been developed to estimate air pollutant concentrations across broader spatial regions1.

Recent advancements in machine learning (ML) have significantly improved the ability to model complex atmospheric interactions while reducing computational costs. ML models have demonstrated high efficiency in predicting air quality indices (AQI) and various air pollutants, thereby offering real-time predictions at a reduced financial burden. Instead of relying on high-end air quality monitoring equipment, ML models can utilize data from low-cost sensors to predict intricate pollution indicators such as Incremental Lifetime Cancer Risk (ILCR), as explored in this study.

Artificial Neural Networks (ANN) have been widely used in air quality forecasting because of their capacity to capture nonlinear relationships. For example, studies in Liaocheng, Shanghai, and Chongqing have successfully applied ANN and wavelet-based ANN frameworks for PM₂.₅ and PM₁₀ prediction2,3,4,5. These works demonstrate the suitability of ANN in short-term pollution forecasting. More recent studies also extend such models to climate and ozone prediction, employing architectures such as CNNs, LSTMs, and hybrid wavelet-deep learning frameworks6,7,8,9. Such approaches highlight the growing potential of advanced neural networks to model complex atmospheric processes more effectively than traditional methods. Several prior studies have successfully applied ML models to predict air pollutant concentrations10,11,12,13 and AQI14,15. For instance, a study comparing four ML models for AQI prediction using air pollutants and weather parameters achieved RMSE values of 24.14, 15.97, and 18.72 for ANN, XGBoost, and decision tree models, respectively, outperforming traditional multilinear regression (MLR) models14. Another study evaluating seven regression and seven classification models found that random forest performed the best, achieving an R² value of 0.91 and an MSE of 0.0067 (ref. 15). These findings highlight the necessity of ML models in air quality prediction, as traditional regression models struggle to capture complex relationships.

Optimizing ML models by selecting the most relevant input variables enhances their predictive performance. Several studies have explored different methods to minimize the number of input variables while maintaining model accuracy. A study on NO₂ prediction using an ANN model reduced weather parameters to two derived stochastic variables13, while others have employed techniques such as sensitivity analysis, genetic algorithms, principal component analysis (PCA), and correlation coefficient methods to refine input selection10,11,16,17. The primary goal of these studies was to ensure that input variables were highly relevant to the target variable, ultimately improving model efficiency. In contrast, this study aims to maintain input relevance across multiple locations, leading to an increase in the number of input variables.

Machine learning models also hold great potential for reducing the cost of air pollution monitoring. For example, a study in China utilized ML models to predict PM2.5 concentrations in metropolitan areas, such as Xinzhuang, Sanchong, and Cailiao, based on data from pollution measurement stations. This approach demonstrated the feasibility of reducing the number of expensive monitoring stations18. Similarly, other research efforts have sought to minimize reliance on costly air pollution monitoring devices by developing ML models capable of real-time or hourly pollutant predictions. One study trained an ANN model using meteorological data (temperature, relative humidity, wind speed, and wind direction) to predict hourly pollutant levels, achieving R² values of 0.87, 0.87, 0.85, 0.77, and 0.92 for PM10, NOx, NO₂, O₃, and CO, respectively12.

Efforts to replace traditional air pollution monitoring stations with virtual ones have gained traction. A study investigating the concentrations of PM2.5, PM10, and NO₂ used five ML models (support vector regressor, ridge regressor, random forest, XGBoost, and extra tree regressor) to predict pollution levels, with the objective of reducing reliance on physical monitoring stations19. Another study leveraged the Adaptive Neuro-Fuzzy Inference System (ANFIS) to predict same-day and one-day-ahead air quality, reducing computational costs and improving prediction accuracy16. To achieve high-resolution temporal and spatial predictions of NO₂, O₃, and HCHO, researchers have employed physics-informed neural networks (PINNs), which integrate domain knowledge with ML algorithms20.

Beyond conventional pollutants like PMs and VOCs, ML models have also been utilized for predicting more complex environmental compounds such as polycyclic aromatic hydrocarbons (PAHs). A study using support vector regression (SVR) achieved an R² of 0.9468 and an RMSE of 7.3116 when predicting total PAHs based on total petroleum hydrocarbons (TPH) in soil21. Another study applied a backpropagation ANN model to predict PAH concentrations in soil, achieving an R² of 0.9994 (ref. 22). Furthermore, PAH concentrations in the air have been predicted using recurrent neural networks trained on data related to forest fires, air emissions, sea ice cover, and meteorological parameters, with RMSE values ranging from 0.51 to 46.36 (ref. 23).

This study aims to predict the ILCR due to the 16 most hazardous PAHs in the air, as identified by the United States Environmental Protection Agency (USEPA), using PM2.5 concentrations and weather parameters. The key motivation is to develop a model capable of utilizing low-cost sensor data for real-time ILCR prediction, which would otherwise require substantial financial and computational resources. Additionally, the study introduces a novel approach to converting wind data into “source factors” (SF), which enhances model transferability across different locations. Since wind direction data from different locations cannot be directly combined or used interchangeably, the proposed method first identifies local pollution sources and then integrates this information with wind direction and speed to generate source factors. These factors serve as input variables for the ML model, improving both predictive accuracy and cross-location applicability. The study systematically compares this novel approach to conventional methods, where raw meteorological data is directly fed into the models. To achieve this, three ML models, Artificial Neural Networks (ANN), eXtreme Gradient Boosting (XGB), and Random Forest (RF), were systematically optimized and trained using data from two locations in India.

Results

PAHs profile and source factor-specific ILCR distribution

Figure 1 presents the profile of polycyclic aromatic hydrocarbons (PAHs) and the corresponding Incremental Lifetime Cancer Risk (ILCR) distribution associated with different local pollution sources influenced by wind direction. The analysis reveals that vehicular emissions contribute to the highest PAH concentrations, leading to the highest ILCR values. This aligns with existing literature, where road traffic emissions have been identified as a dominant source of PAHs due to incomplete combustion of fossil fuels and vehicular exhaust24. Conversely, the lowest PAH levels and ILCR values are observed when wind is flowing from the railway tracks. This can likely be attributed to the increasing electrification of railway networks, reducing dependence on diesel-powered locomotives, which are traditionally known to be significant PAH sources25. To provide context on the baseline pollution levels of the study region, the sum of the targeted PAH concentrations was observed to be 16.31 ng/m³ at Jorhat and 21.54 ng/m³ at Shyamnagar. Correspondingly, the BaPeq concentrations of all 16 PAHs, estimated using the TEF values, were 4.10 ng/m³ at Jorhat and 4.40 ng/m³ at Shyamnagar. These values indicate moderate PAH burdens in the ambient air, which serve as the baseline exposure levels for evaluating the ILCR in this study.

Fig. 1: PAH profile and source-specific ILCR distribution across different pollution source factors.

The sharp contrast between vehicular and railway emissions highlights the effectiveness of transitioning towards cleaner energy sources in reducing carcinogenic air pollutants.

Industrial areas and densely populated urban regions also show considerable PAH concentrations and moderate ILCR values, suggesting that industrial combustion processes and domestic activities contribute significantly to PAH emissions.

Correlation and sensitivity analysis

The correlation analysis (Fig. 2) and sensitivity analysis (Fig. 3) provide critical insights into the relationship between the input parameters and the Incremental Lifetime Cancer Risk (ILCR). The correlation heatmaps (Fig. 2) indicate that PM₂.₅ exhibits the strongest correlation with ILCR (0.85), reinforcing its significance in predicting health risks associated with air pollution exposure. This strong association is further validated by the sensitivity analysis (Fig. 3), where PM₂.₅ demonstrates the highest sensitivity, highlighting its dominant influence on ILCR predictions.

Fig. 2: Correlation heatmaps of input parameters with ILCR.

a Conventional meteorological (CM) inputs, b Pollution source method (PSM) inputs.

Fig. 3: Sensitivity analysis of input parameters in ILCR prediction.

Apart from PM₂.₅, temperature and relative humidity (RH) also show a reasonable correlation with ILCR (−0.6 and −0.43, respectively) and exhibit moderate sensitivity values in Fig. 3. These findings suggest that meteorological parameters, despite being indirect contributors, play a crucial role in influencing pollutant dispersion and human exposure levels. Interestingly, wind direction in the conventional method (CM) presents a very weak correlation (−0.094) with ILCR, which suggests that its direct impact on cancer risk assessment is minimal when used in its traditional form.

When examining the pollution source method (PSM) parameters in Fig. 2b, it becomes evident that many of these parameters exhibit weak correlations with ILCR. However, despite their poor correlation coefficients, their sensitivity values (Fig. 3) are comparable to those of conventional meteorological inputs. This highlights a crucial aspect of machine learning (ML) modelling: a strong correlation is not a prerequisite for a feature to contribute significantly to predictive models26. ML models can capture complex, non-linear interactions among variables, which makes them well suited to features that do not exhibit high linear correlations but still influence the outcome through intricate dependencies.
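The gap between linear correlation and actual dependence can be illustrated with a minimal sketch. The data below are synthetic and are not taken from this study; the helper `pearson` and the quadratic relationship are assumptions chosen only to make the point visible.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Synthetic feature, symmetric around zero (illustrative, not study data).
x = [i / 10 for i in range(-50, 51)]
# The target is fully determined by x, but through a quadratic law.
y = [v ** 2 for v in x]

# Linear correlation between the raw feature and the target is close to zero.
r_linear = pearson(x, y)
# After a non-linear transform of the feature, the correlation is essentially one.
r_transformed = pearson([v ** 2 for v in x], y)

print(f"raw feature: r = {r_linear:.3f}")
print(f"squared feature: r = {r_transformed:.3f}")
```

A model restricted to linear terms would treat `x` as uninformative here, whereas a non-linear learner can recover the dependence. This is the kind of situation in which a weakly correlated input can still carry high sensitivity.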

Artificial Neural Network (ANN)

The performance of ANN models trained using the Pollution Source Method (PSM) and Conventional Method (CM) varies significantly depending on the activation function. Figure 4a highlights that while the ‘Purelin’ activation function ensures consistency, ‘Tansig’ performs better in terms of accuracy for both methods. Due to this, Fig. 4b focuses on ‘Tansig’ to explore the influence of neuron count and layer depth. A cyclic pattern in Fig. 4b suggests that model performance increases with the number of neurons per layer. The influence of the number of layers appears marginal, showing a peak at around 5–6 layers before slightly declining. A direct comparison in Fig. 5a along with Fig. 4 demonstrates that PSM-ANN consistently outperforms CM-ANN across various parameter configurations. The selected best model configurations are summarized in Table 1.

Fig. 4: ANN model performance for different training parameters.

a Training and testing R² for different activation functions. b Training and testing R² for each parameter ID (see Table S1 for the parameter combination corresponding to each ID).

Fig. 5: Evaluation of ANN model outputs.

a Actual ILCR compared with model predictions. b Residuals of both ANN models.

Table 1 Finalised parameters for each model.

Both modelling methods closely follow the general trend of the observed ILCR values, as illustrated in Fig. 5a. However, upon detailed inspection of Fig. 5a and the residual plot presented in Fig. 5b, it becomes evident that the Pollution Source Method (PSM) demonstrates superior performance. Specifically, the PSM approach yields lower residuals, particularly at higher ILCR values, where deviations between the predicted and observed data are more pronounced. This indicates that the PSM provides more accurate predictions during periods of elevated cancer risk compared to the Conventional Method (CM), effectively capturing critical variations that traditional meteorological inputs alone may miss. When analysing the scatter plot of model predictions against actual normalized ILCR values in Fig. S3, it is evident that both models struggle with higher ILCR values. However, the scatter plot of residuals against actual ILCR in Fig. S4, having a closer spread around the zero line for PSM, shows that PSM-ANN exhibits lower residuals than CM-ANN, indicating better overall accuracy. Figure S3 further supports this observation, as PSM-ANN predictions align more closely with the ideal 45-degree reference line. The residual spread in Fig. S4 reveals greater dispersion for CM-ANN, reinforcing that PSM-based modelling yields lower prediction errors and improved stability.

XGBoost

The XGBoost model’s performance for different hyperparameter combinations, as illustrated in Fig. 6, follows a cyclic pattern where a higher min_child_weight leads to lower accuracy, while a lower min_child_weight enhances model efficiency. The influence of learning rate and max_depth reduces this disparity, allowing for improved stability in model training. Notably, PSM-XGB consistently outperforms CM-XGB across all parameter combinations, showing superior training and testing R² values.

Fig. 6: Training and testing R² of both XGBoost models for different parameter combinations (see Table S2 for the parameter combination corresponding to each parameter ID).

Table 1 presents the final model parameters, while Fig. 7 compares predictions and residuals. Figure 7 reveals that residuals are higher for elevated ILCR values, yet PSM-XGB consistently produces lower errors than CM-XGB. A scatter plot of predicted versus actual values for the XGBoost models (Fig. S5) further confirms this, as CM-XGB predictions deviate more from the ideal model line, even for lower ILCR values, highlighting that PSM captures the underlying ILCR distribution more effectively. A plot of residuals against actual ILCR values (Fig. S6) shows the narrower spread of PSM residuals around the zero-error line.

Fig. 7: Evaluation of XGBoost model outputs.

a Actual ILCR compared with XGBoost model predictions. b Residuals of both XGBoost models.

Furthermore, XGBoost models are often preferred for structured data applications due to their ability to handle complex feature interactions, making them highly effective for environmental predictions27. XGBoost has shown promise in environmental applications, such as improving the accuracy of PM2.5 predictions in air quality models28. The PSM approach provides additional advantages by incorporating refined wind pollution mapping, which enhances the model’s ability to capture localized pollution source impacts more effectively than CM models.

Random Forest (RF)

Random Forest models perform well with complex datasets due to their ensemble nature. They are capable of handling high-dimensional data effectively, as at each split, only a random subset of features is considered, reducing computational complexity and preventing overfitting to irrelevant features.

Figure 8 illustrates that model performance improves as minLeafSize decreases, with other parameters showing minimal effect. Notably, PSM-RF exhibits greater stability and higher test R² values when minLeafSize is set to 1. Table 1 summarizes the final model parameters.

Fig. 8: Training and testing R² of both RF models for different parameter combinations (see Table S3 for the parameter combination corresponding to each parameter ID).

Prediction accuracy is assessed in Fig. 9 and Figs. S7 and S8, with residual distributions in Figs. 9b and S8. Figure 9a shows that both models capture the trend well, though Fig. 9b suggests that PSM-RF performs slightly better than CM-RF. A scatter plot of predicted versus actual values for the RF models is presented in Fig. S7, and a plot of residuals against actual ILCR values is shown in Fig. S8. Both plots show that the PSM output lies closer to the ideal line, with residuals nearer zero, than the CM output. The overall findings reaffirm that PSM-based models yield superior accuracy and generalization capabilities compared to CM-based models.

Fig. 9: Evaluation of RF model outputs.

a Actual ILCR compared with RF model predictions. b Residuals of both RF models.

Model overfitting evaluation

To evaluate the risk of overfitting in the trained models, the study compared the R² values obtained during training and testing across a wide range of parameter combinations. As shown in Figs. 4, 6, and 8 for ANN, XGB, and RF, respectively, the general trend of training and testing R² remains consistent for all three models, suggesting good generalization capability. Only the ANN model shows occasional deviations where the test R² drops relative to the training R², specifically for parameter combinations related to hidden layers 4, 7, and 10. These cases suggest possible overfitting, but they are exceptions rather than the norm.

To investigate this further, we calculated the relative R² gap, defined as (R²_train − R²_test)/R²_train, and plotted it for all parameter sets under both the PSM and CM methods. The results are provided in the supplementary information as Figs. S9 (ANN), S10 (XGB), and S11 (RF). The XGB and RF models show consistently low relative gaps across all parameter IDs, indicating little to no overfitting. In the case of ANN (Fig. S9), a few parameter settings show noticeably higher gaps, confirming some level of overfitting in those cases. However, the majority of parameter combinations still maintain low gaps, reinforcing that the model was generally not overtrained. Overall, these assessments confirm that while some overfitting is observed in isolated cases for ANN, the trained models demonstrate stable and generalizable performance.
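The relative gap described above is simple to compute. The following is a minimal sketch; the R² pairs and the 0.10 flagging threshold are hypothetical values chosen for illustration and do not come from this study.

```python
def relative_r2_gap(r2_train, r2_test):
    """Relative R2 gap: (R2_train - R2_test) / R2_train."""
    return (r2_train - r2_test) / r2_train

# Hypothetical (train R2, test R2) pairs keyed by parameter ID.
scores = {
    1: (0.95, 0.93),
    2: (0.97, 0.78),
    3: (0.90, 0.89),
}

gaps = {pid: relative_r2_gap(tr, te) for pid, (tr, te) in scores.items()}

# Assumed cut-off above which a configuration is treated as possibly overfitted.
threshold = 0.10
flagged = [pid for pid, gap in gaps.items() if gap > threshold]

for pid, gap in gaps.items():
    print(f"parameter ID {pid}: relative gap = {gap:.3f}")
print("possibly overfitted parameter IDs:", flagged)
```

Normalizing by the training R² makes gaps comparable between configurations that reach different absolute accuracy levels.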

Model comparison

The statistical distributions of the actual normalized ILCR values and of the values predicted by the different models are presented in Fig. 10. Among all models, PSM-ANN aligns most closely with the actual data, confirming its effectiveness. Although many models perform well in the lower range (<0.3), they fail to match that performance at higher ILCR values. The closer agreement of PSM-ANN may be attributed to the ability of ANN to capture relationships effectively, even with a limited amount of data.

Fig. 10: Statistical distributions, means, and standard deviations of the actual values and of the values predicted by the various models developed.

The evaluation metrics of the selected models for the test set are summarized in Table 2. Of the two methods used in this study, PSM consistently achieves higher R² values and lower MAE, MSE, and RMSE than CM. Table 2 includes metrics for both normalized and original ILCR values to provide a comprehensive assessment. The high MAPE values can be attributed to the small magnitude (near zero) of the ILCR data: the MAPE calculation divides by the actual value, so as the actual value approaches zero, the percentage error grows without bound29. The MLR results for both methods (Table 2) show that MLR can capture part of the variance in ILCR under limited data conditions, but its prediction accuracy and error metrics remain less favourable than those of the advanced ML models. The PSM-ANN model achieved the highest R² value (0.944), demonstrating superior predictive capability, followed by PSM-XGB, PSM-RF, CM-ANN, CM-XGB, and CM-RF. Therefore, ANN is found to be a better ML technique than XGBoost and random forest for developing predictive models in work similar to this study.
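The sensitivity of MAPE to small target values can be seen in a short numerical sketch. The numbers below are invented for illustration and are not ILCR values from this study; the same absolute error is applied to a large-valued and a near-zero target.

```python
def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    return 100 * sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)

# Two hypothetical targets that differ only in scale.
actual_large = [10.0, 20.0, 30.0]
actual_small = [0.01, 0.02, 0.03]

# Every prediction is off by the same absolute amount.
error = 0.005
pred_large = [a + error for a in actual_large]
pred_small = [a + error for a in actual_small]

mae_large, mae_small = mae(actual_large, pred_large), mae(actual_small, pred_small)
mape_large, mape_small = mape(actual_large, pred_large), mape(actual_small, pred_small)

print(f"large targets: MAE = {mae_large:.4f}, MAPE = {mape_large:.3f}%")
print(f"small targets: MAE = {mae_small:.4f}, MAPE = {mape_small:.3f}%")
```

The MAE is essentially identical in both cases, while the MAPE of the near-zero target is orders of magnitude larger, which is why scale-dependent metrics are reported alongside it.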

Table 2 Evaluation parameters for all the trained models.

The Regression Error Characteristic (REC) curve is a graphical evaluation metric used to assess the performance of regression models. It plots the cumulative percentage of predictions that fall within a given error tolerance against the error threshold. Unlike traditional metrics such as RMSE or MAE, the REC curve provides a visual representation of model accuracy across different error levels30. The REC curve in Fig. 11 further supports the interpretation of the distribution curve (Fig. 10). As the ideal model’s REC curve should be a vertical line at an error threshold of zero, a model with better predictive accuracy will have its REC curve positioned closer to the upper left corner of the plot. The larger the area under the REC curve, the better the model’s overall performance. The plot shows that PSM-ANN performs best among all models, followed by PSM-XGB, further reinforcing the advantages of PSM-based models.
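A minimal sketch of how an REC curve and its area can be computed is given below. The absolute error is assumed as the error measure, the tolerance grid is arbitrary, and the actual and predicted values are hypothetical; none of these choices is taken from the study's own implementation.

```python
def rec_curve(actual, predicted, tolerances):
    """Fraction of predictions whose absolute error is within each tolerance."""
    abs_errors = [abs(a - p) for a, p in zip(actual, predicted)]
    n = len(abs_errors)
    return [sum(e <= t for e in abs_errors) / n for t in tolerances]

def area_under_curve(xs, ys):
    """Trapezoidal area under a curve sampled at points (xs, ys)."""
    return sum((xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2 for i in range(len(xs) - 1))

# Hypothetical normalized targets and predictions from two models.
actual = [0.10, 0.20, 0.30, 0.40]
pred_a = [0.11, 0.19, 0.32, 0.38]  # small errors
pred_b = [0.15, 0.12, 0.40, 0.30]  # larger errors

# Error tolerances from 0 to 0.10 in steps of 0.01 (assumed grid).
tolerances = [i / 100 for i in range(11)]

rec_a = rec_curve(actual, pred_a, tolerances)
rec_b = rec_curve(actual, pred_b, tolerances)
auc_a = area_under_curve(tolerances, rec_a)
auc_b = area_under_curve(tolerances, rec_b)

print("area under REC curve, model A:", round(auc_a, 4))
print("area under REC curve, model B:", round(auc_b, 4))
```

The model whose curve rises earlier accumulates more area, so comparing areas gives a single-number summary of the visual ranking.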

Fig. 11: Regression Error Characteristic (REC) curve of all the models developed.

Discussion

From an ML perspective, merely supplying wind data from multiple locations and expecting accurate predictions related to air pollution (in this study, ILCR) can confound the model, because different local sources lie in different directions at each location. Converting wind parameters into pollution source factors using PSM provides a more structured and informative input for ML models, resulting in superior predictions.

The distribution of errors across different models provides insight into the reliability of each approach. Residual plots and histograms reveal that PSM-based models demonstrate a tighter clustering of errors around zero, indicating less bias and more precise predictions. CM-based models, on the other hand, exhibit a broader error distribution, highlighting increased uncertainty.

All three ML models evaluated in this study, ANN, XGBoost, and Random Forest, demonstrate a clear advantage when trained using PSM over CM. Despite having a lower correlation coefficient and similar sensitivity of PSM features to those of CM, PSM consistently delivers improved model performance, reducing residual errors and achieving higher predictive accuracy. Moreover, the application of PSM not only enhances individual model performance but also ensures broader applicability across multiple locations, making it a more generalizable and robust approach for air pollution modelling. Our results are consistent with earlier findings that ANN and hybrid models can effectively capture nonlinear interactions between meteorology and air pollutants2,3,4,5,9,31,32. However, unlike prior studies that mainly predicted pollutant concentrations, our work shows that using transformed wind parameters for the pollution source method significantly enhances the reliability of ILCR predictions. This aligns with recent studies in applying advanced models such as LSTM and CNN for climate and air quality forecasting6,7,8,33, while addressing a novel application in health risk assessment.

This study presents strong preliminary evidence supporting the use of transformed wind parameters to enhance ILCR prediction accuracy through machine learning models. Among the evaluated models, the PSM-ANN model demonstrated the best predictive performance, achieving an R² value of 0.944 with a low Mean Absolute Error (MAE) of 0.037. Additionally, the multiple linear regression (MLR) model exhibited significantly lower predictive performance, indicating its limitation in capturing the complex, nonlinear relationships inherent in the data and reinforcing the need for more advanced machine learning approaches. In contrast, the CM-XGB model showed the weakest performance among advanced ML models, with an R² of 0.799 and the highest MAE of 0.061, suggesting that conventional meteorological inputs are insufficient to fully represent pollution dispersion dynamics. Across all error metrics, PSM-based models consistently outperformed CM-based models, with notably lower RMSE and MAPE values. The stability of the PSM-ANN model in high ILCR concentration scenarios highlights the effectiveness of using transformed wind features to improve the reliability of health risk predictions.

The implementation of this model in real-time applications presents considerable potential for public health and environmental decision-making. By integrating real-time PM2.5 and meteorological data from affordable sensors, cities can estimate ILCR continuously and dynamically, enabling individuals to make better-informed decisions about outdoor exposure. This approach is not only practical but also cost-effective, as traditional ILCR estimation methods based on PAH measurements involve time-consuming sampling and expensive laboratory analysis. Machine learning-based prediction models like PSM-ANN offer a scalable solution for real-time pollution risk assessment at a fraction of the cost.

While the proposed methodology significantly improves prediction accuracy, certain limitations must be acknowledged. In this context, it is worth noting that physics-based dispersion models such as AERMOD have also been successfully integrated with machine learning frameworks, as demonstrated in a recent study34. Integrating such models could further enhance the spatial and physical representativeness of ILCR predictions. However, due to the unavailability of detailed emission inventory data in the present study region, this integration was not feasible. Future research could combine AERMOD-based dispersion outputs with machine learning approaches to improve model interpretability and predictive performance. The dataset primarily focuses on two Indian cities, and further studies are required to validate the approach in different geographical regions. Additionally, integrating real-time air quality monitoring data could enhance model responsiveness and adaptability. While the present analysis focuses on external exposure through ambient PAH concentrations, future studies integrating both internal and external exposure pathways, as demonstrated in recent literature35,36,37, would provide a more comprehensive assessment of cumulative health risks. Although the dataset used in this study was relatively limited in size, it was adequate to train and validate the models and demonstrate their feasibility for ILCR prediction. However, we recognize that a larger and more diverse dataset would allow the models to capture broader variability in emission patterns and meteorological conditions, thereby further enhancing their generalizability. The framework developed in this study can be easily expanded, and future studies can use techniques such as transfer learning and domain adaptation to strengthen robustness across different geographic and environmental contexts. Future work may also explore deep learning techniques to further optimize predictive performance.

Methods

Study sites

For the development of an ANN model, aerosol sampling and data acquisition for weather parameters were done for two sites, Jorhat and Shyamnagar. Jorhat is a city in Assam located in north-east India at 26°45′ N, 94°13′ E, with an average elevation of 116 m. Jorhat has a population of approximately 1.26 lakhs as of the 2011 census38. The sampling was done at the Council of Scientific and Industrial Research - North East Institute of Science and Technology (CSIR-NEIST) in the west direction of Jorhat. The other sampling site, Shyamnagar, is a semi-urban town in West Bengal, India, located at 22°49′ N, 88°23′ E, with an average elevation of 2 m. The locations of both sites are shown in Fig. 12.

Fig. 12: Sampling sites at Jorhat (https://github.com/sssmartsearch/India_Boundary_Updated) and Shyamnagar (https://maps.google.com/). The red star represents the sampling location inside the city/town.

Sampling and PAHs analysis

Air samples were obtained using the Speciation Air Sampler System (SASS) from Met One Instruments, operating at an average flow rate of 6.72 L/min. This system features multiple channels designed for sample collection on various substrates, including Quartz, Teflon, and Nylon. Following collection, the samples were preserved at −19°C until further chemical analysis. Meteorological parameters such as temperature, rainfall, and humidity were recorded using the AIO2 weather station (Met One Inc., OR, USA) installed at the sampling site. The study sites adhered to a 24-hour time-integrated ambient aerosol sampling schedule, conducted every alternate day from January 1 to December 31, 2019.

PM2.5 samples were collected on quartz filter paper (47 mm diameter) over a 24-hour period on alternate days of the sampling period for the analysis of polycyclic aromatic hydrocarbons (PAHs). Additionally, 47 mm Teflon filters were used to measure gravimetric mass using a microbalance. PAHs in the samples were extracted in a 1:1 mixture by volume of dichloromethane (DCM) and acetone. This solvent combination was selected as it provided maximum extraction efficiency when compared to other common solvents such as toluene and hexane (Rajeev et al. 2021). A recorded number of punches (each of area 3.14 cm²) were taken in Q cups based on the PM2.5 concentration of the samples and loaded in the energized dispersive extractor (EDGE, CEM Corporation, USA). Extraction was done with DCM and acetone (1:1 v/v) at 120 °C and 60–70 psi pressure with a holding time of 4 min for each sample. The samples were extracted in 30 ml of solvent by adding 10 ml top volume, 10 ml bottom volume, and 10 ml rinse volume of solvent. After extraction, the samples were concentrated to 1–2 drops in the CentriVap Concentrator. The concentrator was programmed to vaporize the solvent at 30 °C for the first 30 min and at 50 °C for the next 80 min. Vaporization temperatures were selected based on the normal boiling points of DCM (39.6 °C) and acetone (56 °C). Toluene was added to the concentrated samples to make up the final volume to 2 ml, and the resulting solutions were sonicated for 20–25 min for proper dissolution of PAHs. The amount of each PAH in the extracted sample was determined using a gas chromatography-mass spectrometry system (GC–MS; Agilent Technologies; GC: 7890B; MSD: 5977B). A fused-silica DB-5 capillary column with polyimide coating was employed for this analysis.
External standards for 5-point calibration were prepared by serial dilution of a 16-PAH mix solution (EPA 610 PAH Kit, 16 analytes in methanol, Sigma-Aldrich) at concentrations of 5, 10, 25, 50, and 100 ppb for quantification of PAH compounds. The helium flow rate was kept at 1 ml/min. To separate compounds based on their boiling points and to reduce total run time, oven temperature ramping was used. The oven temperature was ramped over 90 °C–200 °C, 200 °C–260 °C, and 260 °C–310 °C at 15 °C/min, 4 °C/min, and 9 °C/min, respectively (Rajeev et al. 2021). Peaks were identified using the retention time of each PAH. Replicate field blank samples were run on the instrument, and the results obtained were used for field sample correction. This methodology has been published by our group in a previous study24. Out of 175 data points, 25 randomly selected data points were kept separate for testing and were not used in training or validation.
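The hold-out of 25 test points from 175 can be sketched as a random index split. The text does not state how the random selection was made, so the shuffling approach, the helper name `holdout_split`, and the seed below are assumptions for illustration only.

```python
import random

def holdout_split(n_samples, n_test, seed=0):
    """Randomly reserve n_test indices for testing; the rest are for training/validation."""
    rng = random.Random(seed)  # fixed seed (assumed) so the split is reproducible
    indices = list(range(n_samples))
    rng.shuffle(indices)
    test_idx = sorted(indices[:n_test])
    train_idx = sorted(indices[n_test:])
    return train_idx, test_idx

# 175 data points in total, 25 held out for testing, as stated in the text.
train_idx, test_idx = holdout_split(n_samples=175, n_test=25)

print(len(train_idx), "points for training and validation")
print(len(test_idx), "points held out for testing")
```

Fixing the held-out indices before any model tuning keeps the test set untouched by hyperparameter selection.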

Quality assurance and quality control (QA/QC)

Extraction efficiency was evaluated by repeated extraction and analysis of selected aerosol samples (n = 10), which confirmed a recovery of ~95%. In addition, as a quality control measure, the response of the internal standards (phenanthrene-d10 and perylene-d12) was found to be within ± 5%.

Clean quartz microfiber filters were extracted and analysed with every 15 samples as field blanks, while solvent blanks were also included and analysed every 6 samples on the GC–MS. The PAHs detected in blanks were subtracted from sample concentrations.

ILCR calculation

PAHs can impact human health in several ways, including toxic, carcinogenic, teratogenic, and mutagenic effects. People may be exposed to these compounds through multiple pathways, such as inhaling polluted air, consuming contaminated food or water, or contact with soil. According to the USEPA (United States Environmental Protection Agency) list, 16 PAHs (naphthalene, acenaphthylene, acenaphthene, fluorene, phenanthrene, anthracene, fluoranthene, pyrene, benzo(a)anthracene, chrysene, benzo(b,j)fluoranthene, benzo(k)fluoranthene, benzo(a)pyrene, dibenzo(a,h)anthracene, indeno(1,2,3-cd)pyrene, and benzo(g,h,i)perylene) have been identified as compounds of grave concern, of which seven have been marked as the most probable human carcinogens: benzo(a)anthracene, benzo(a)pyrene, benzo(b,j)fluoranthene, benzo(k)fluoranthene, chrysene, dibenzo(a,h)anthracene, and indeno(1,2,3-cd)pyrene24. Benzo(a)pyrene is one of the most potent carcinogens among the 16 PAHs and is used as a marker for all PAHs in determining carcinogenic potency. In the current study, the applied formulas follow USEPA risk assessment guidelines39,40. The benzo(a)pyrene equivalent is the parameter calculated for risk assessment, as follows41:

$${\rm{B}}\left[{\rm{a}}\right]{\rm{Peq}}=\,{C}_{i}\,\times \,{{TEF}}_{i}$$
(1)

Ci is the concentration of the ith species, and TEFi is the Toxic Equivalent Factor (TEF) of the ith species. TEF values for each of the 16 PAHs were taken from a previous study41.

Incremental lifetime cancer risk (ILCR) represents the additional risk of cancer-related mortality beyond the natural background risk due to prolonged exposure to carcinogenic substances such as polycyclic aromatic hydrocarbons (PAHs). It is determined from the lifetime average daily dose (LADD), which quantifies the daily intake of a chemical per kilogram of body weight. This measure helps evaluate potential health hazards associated with specific compounds. The formulas for LADD and ILCR are as follows42:

$${\rm{LADD}}\left({\rm{mg}}{{\rm{kg}}}^{-1}{{\rm{day}}}^{-1}\right)=\frac{\left({\rm{CP}}\times {\rm{AIR}}\times {\rm{UCF}}\times {\rm{EF}}\times {\rm{LED}}\right)}{\left({\rm{BW}}\times {\rm{AT}}\right)}$$
(2)
$$\mathrm{ILCR}=\mathrm{LADD}\times \mathrm{Cancer\; slope\; factor}\,\left(\mathrm{CSF}\right)$$
(3)

CP represents the BaPeq concentration of individual PAHs in ng/m³. To estimate the Incremental Lifetime Cancer Risk (ILCR), the concentration of each PAH was first converted into its benzo(a)pyrene-equivalent (BaPeq) concentration by applying the respective Toxic Equivalency Factor (TEF), with benzo(a)pyrene (BaP) taken as the reference compound (Eq. 1). This study reports LADD and ILCR for adults. AIR refers to the air inhalation rate, set at 20 m³/day. UCF is the unit conversion factor from ng to mg (10⁻⁶). EF denotes the exposure frequency, standardized at 350 days per year. LED represents the lifetime exposure duration, taken as 24 years. BW corresponds to body weight, taken as 70 kg. AT signifies the averaging time, taken as 25,550 days (70 years × 365 days) (Rajeev et al., 2021; Singh & Gupta, 2016b). CSF, or cancer slope factor, is the key parameter for assessing carcinogenic hazards and is defined as the risk per unit dose:

$$\mathrm{CSF}=\mathrm{risk\; per\; unit\; dose}=\mathrm{risk\; per}\,\left(\mathrm{mg}\,{\mathrm{kg}}^{-1}\,{\mathrm{day}}^{-1}\right)$$
(4)

Previous research has reported the CSF value for benzo(a)pyrene as 3.1 (mg kg⁻¹ day⁻¹)⁻¹, with a geometric standard deviation of 1.8, for risk assessment42,43,44.
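As a worked illustration of Eqs. 1–3, the following minimal sketch applies the adult exposure parameters listed above. The function names, the example concentrations, and the placeholder TEF are hypothetical and are not values from this study; only the TEF of 1 for benzo(a)pyrene follows from its role as the reference compound. Summing the per-species BaPeq values before computing the risk is an assumption of this sketch.

```python
# Adult exposure parameters as listed in the text.
AIR = 20.0        # AIR, air inhalation rate, m^3/day
UCF = 1e-6        # UCF, unit conversion factor, ng -> mg
EF = 350.0        # EF, days per year
LED = 24.0        # LED, lifetime exposure duration, years
BW = 70.0         # BW, body weight, kg
AT = 25550.0      # AT, days (70 x 365)
CSF = 3.1         # cancer slope factor reported for benzo(a)pyrene


def bap_equivalent(conc_ng_m3, tef):
    """Eq. 1: BaPeq concentration (ng/m^3) of one PAH species."""
    return conc_ng_m3 * tef


def ladd(cp_ng_m3):
    """Eq. 2: lifetime average daily dose in mg kg^-1 day^-1."""
    return (cp_ng_m3 * AIR * UCF * EF * LED) / (BW * AT)


def ilcr(cp_ng_m3):
    """Eq. 3: incremental lifetime cancer risk (dimensionless)."""
    return ladd(cp_ng_m3) * CSF


# Hypothetical (concentration in ng/m^3, TEF) pairs. The second TEF is a
# placeholder for illustration, not a value taken from the study.
sample = {
    "benzo(a)pyrene": (2.0, 1.0),
    "placeholder_PAH": (5.0, 0.1),
}

per_species = {name: bap_equivalent(c, tef) for name, (c, tef) in sample.items()}
total_bapeq = sum(per_species.values())  # assumption: contributions are additive

print(f"Total BaPeq: {total_bapeq:.2f} ng/m^3")
print(f"LADD: {ladd(total_bapeq):.3e} mg/kg/day")
print(f"ILCR: {ilcr(total_bapeq):.3e}")
```

Because LADD is linear in CP, the risk computed from the summed BaPeq equals the sum of the per-species risks.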

Model development

In this study, two approaches were employed to develop machine learning (ML) models using PM2.5 and weather data as input. The first approach, referred to as the Conventional Method (CM), used the meteorological parameters in their original form: atmospheric temperature, relative humidity (RH), wind direction, and wind speed, which are commonly used in traditional ML model development. In the second approach, termed the Pollution Source Method (PSM), wind direction and wind speed data were transformed into novel variables called ‘source factors’, while the other meteorological parameters, atmospheric temperature and RH, were retained in their original form. Figure 13 shows the methodology flow chart for both the CM and PSM methods. To assess the need for advanced machine learning models, multiple linear regression (MLR) models were also developed for both methods and compared with the results of the other models.

Fig. 13

Flow chart of Model development using CM and PSM methods.

Source Factor (SF) calculations

For this study, eight common air pollution sources were identified for both cities: urban/densely populated areas, industries, villages, forests, rivers/water bodies, roads (vehicular emissions), railway tracks, and airports. The maps of the sampling locations were divided into 16 equal sectors, and the locations of these sources were determined using a combination of Google Maps data and local surveys.

When the wind originated from the direction corresponding to a given sector, all pollution sources within that sector were assigned a value of 1 multiplied by the wind speed (to account for its weightage), while all other source factors were assigned a value of 0. Equation (5) calculates the source factor (SFi) for each pollution source by summing the product of the wind direction factor (WDFi) and wind speed (WSt) over a 24-hour period, and Equation (6) defines the wind direction factor (WDFi):

$${{SF}}_{i}=\,\mathop{\sum }\limits_{t=0}^{1440}\left({{WDF}}_{i}\,\times \,{{WS}}_{t}\right)$$
(5)
$${{WDF}}_{i}=\left\{\begin{array}{l}1,When\,wind\,is\,blowing\,from\,sector\,of\,source\,i\\ 0,When\,wind\,is\,not\,blowing\,from\,sector\,of\,source\,i\end{array}\right.$$
(6)

Where SFi is the source factor for pollution source i, WDFi is the wind direction factor, WSt is the wind speed at time t, and t represents time in minutes (0–1440 minutes in a day).

The sector divisions and the spatial distribution of pollution sources are illustrated in Fig. 14 for Jorhat and in Figs. S1 and S2 for Shyamnagar. For example, if a wind with a speed of 1.2 m/s was blowing from the east toward the sampling location in Jorhat, the source factors for the urban/densely populated area, industries, and airport within the corresponding sector were assigned a value of 1.2, while the other source factors were assigned a value of 0. These calculations were performed every minute based on real-time data from the weather station. To determine daily source factor values, the minute-wise values for each factor were summed. Before training the models, the source factors were normalized along with the other input variables.
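The minute-wise accumulation in Eqs. 5 and 6 can be sketched as follows. This is an illustration under stated assumptions, not the study's implementation: the sector numbering (sector 0 centred on north, increasing clockwise), the short source names, and the sector-to-source lookup `SECTOR_SOURCES` are hypothetical, whereas in the study the lookup comes from maps and local surveys.

```python
SECTOR_WIDTH = 360.0 / 16  # 16 equal sectors of 22.5 degrees

SOURCES = ["urban", "industry", "village", "forest",
           "water", "road", "railway", "airport"]

# Hypothetical lookup: sector index -> sources located in that sector.
SECTOR_SOURCES = {
    4: {"urban", "industry", "airport"},   # sector centred on east (90 deg)
    8: {"village", "forest"},              # sector centred on south (180 deg)
}


def sector_index(wind_dir_deg):
    """Map a wind direction (degrees the wind blows FROM) to a sector 0-15.

    Assumption: sector 0 is centred on north and indices increase clockwise.
    """
    return int(((wind_dir_deg + SECTOR_WIDTH / 2) % 360.0) // SECTOR_WIDTH)


def daily_source_factors(minute_records):
    """Sum WDF_i * WS_t over minute-wise (wind_dir_deg, wind_speed) records."""
    sf = {name: 0.0 for name in SOURCES}
    for wind_dir_deg, wind_speed in minute_records:
        upwind = SECTOR_SOURCES.get(sector_index(wind_dir_deg), set())
        for name in upwind:          # WDF_i = 1 for sources in the upwind sector
            sf[name] += wind_speed   # every other source keeps WDF_i = 0
    return sf


# Three minutes of easterly wind at 1.2 m/s, then one minute from the south.
records = [(90.0, 1.2), (90.0, 1.2), (90.0, 1.2), (180.0, 0.5)]
factors = daily_source_factors(records)
```

With a full day of data, `records` would hold one entry per minute, and the resulting dictionary would give the eight daily source factors that replace wind direction and wind speed as model inputs.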

Fig. 14

Sector division based on direction and pollution source locations of Jorhat city (https://maps.google.com/).

Artificial Neural Network (ANN) model development

Artificial Neural Networks (ANNs) are computational models inspired by the structure and functioning of the human brain and are widely used for solving complex nonlinear problems in various fields, including environmental science and air pollution studies. ANNs consist of interconnected layers of nodes, also known as neurons, that process input data through weighted connections, non-linear activation functions, and bias terms to produce output predictions. In this study, a feedforward neural network architecture trained using the backpropagation algorithm45 was employed to predict the incremental lifetime cancer risk (ILCR) associated with PM2.5 exposure. The ANN architecture was optimized through a systematic search. The number of hidden layers was varied from 1 to 10, while the number of neurons per layer was tested from 5 to 20 in increments of 5. Three activation functions (‘logsig’, ‘tansig’, and ‘purelin’) were evaluated to determine their impact on model performance. Each unique combination of these parameters was trained and validated 10 times to ensure consistency and reliability. The best result for each combination was recorded and compared to identify the optimal architecture for the study. Input variables included conventional meteorological data (e.g., temperature, relative humidity) and transformed pollution source factors derived from wind parameters, as described earlier. The ANN model developed using the Conventional Method (CM) was designated as ‘CM-ANN’, while the model developed using the Pollution Source Method (PSM) was designated as ‘PSM-ANN’.
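To make the size of this search space concrete, the sketch below enumerates the architecture grid described above. It only lists configurations and does not train any network. The dictionary layout is hypothetical, and it assumes that every hidden layer within a configuration has the same number of neurons, which the text does not state explicitly.

```python
from itertools import product

hidden_layers = range(1, 11)            # 1 to 10 hidden layers
neurons_per_layer = range(5, 21, 5)     # 5, 10, 15, 20
activations = ["logsig", "tansig", "purelin"]
REPEATS = 10                            # each combination is trained 10 times

# Assumption: all hidden layers in one configuration share the same width.
configurations = [
    {"layer_sizes": [n] * depth, "activation": act}
    for depth, n, act in product(hidden_layers, neurons_per_layer, activations)
]

total_runs = len(configurations) * REPEATS
print(len(configurations), "configurations,", total_runs, "training runs")
```

Under these assumptions the grid contains 10 × 4 × 3 = 120 configurations, so the repeated training amounts to 1,200 runs per method.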

XGBoost (eXtreme Gradient Boosting) model development

Extreme Gradient Boosting (XGBoost) is an advanced machine learning algorithm based on decision-tree ensembles, widely recognized for its speed and accuracy in predictive modelling tasks. XGBoost uses a gradient boosting framework that improves model predictions iteratively: each new tree is fitted to the gradient of the loss function with respect to the current ensemble's predictions, so that the loss is progressively minimized46. Both methods, i.e., CM and PSM, were used to develop two models, designated as CM-XGB and PSM-XGB, respectively.

The XGBoost models were trained using a grid search approach to optimize hyperparameters, including maximum tree depth (max_depth) varied between 3 and 7, learning rate (learning_rate) tested at 0.1, 0.3, and 0.5, minimum child weight (min_child_weight) evaluated at 1, 5, and 10, and subsample ratio (subsample) set at 0.8 and 1.0 to control the fraction of data used in each boosting iteration. Each unique combination of these parameters was trained and validated 10 times to ensure consistency and robustness. The best result for each parameter combination was recorded, and the corresponding results were analysed to identify the optimal model configuration.

Random Forest (RF) model development

Random Forest (RF) is a widely used ensemble learning method known for its robustness and predictive accuracy in regression and classification tasks. RF constructs multiple decision trees during the training phase, each using a randomly selected subset of features and data samples. By averaging the outputs of these trees, RF reduces overfitting and enhances generalization, making it particularly effective for datasets with complex, non-linear relationships and high-dimensional feature spaces47. The inherent randomness in RF also provides a built-in mechanism for estimating the importance of individual features, adding interpretability to the model’s predictions. MATLAB’s ‘TreeBagger’ function was employed for regression modelling, offering flexibility in customizing critical hyperparameters to achieve optimal performance.

A grid search optimization approach was applied to tune three key hyperparameters: the number of trees in the ensemble (optimized over the range [50, 100, 200]), the number of features considered at each split (optimized over the range [3, 5, 7]), and the minimum leaf size for terminal nodes (optimized over the range [1, 5, 10]). Each hyperparameter configuration was evaluated 10 times to account for the stochastic nature of the RF algorithm, with the best model selected based on the highest coefficient of determination (R2) value achieved on the test data. Two RF models trained using CM and PSM were designated as CM-RF and PSM-RF, respectively.
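The repeated grid search described above can be sketched generically as follows. This is not the MATLAB code used in the study: `grid_search` and `dummy_score` are hypothetical names, and the scorer is a stand-in that returns a synthetic score so the example runs without a modelling library. In practice the scorer would train one forest with the given settings and return its R² value.

```python
from itertools import product
import random

PARAM_GRID = {
    "num_trees": [50, 100, 200],
    "num_features": [3, 5, 7],
    "min_leaf_size": [1, 5, 10],
}


def grid_search(train_and_score, param_grid, repeats=10):
    """Return (best_score, best_params) over all grid points and repeats.

    `train_and_score(params, seed)` is a caller-supplied function that trains
    one model and returns its score (higher is better, e.g. R^2).
    """
    names = list(param_grid)
    best_score, best_params = float("-inf"), None
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        for seed in range(repeats):      # repeats absorb run-to-run randomness
            score = train_and_score(params, seed)
            if score > best_score:
                best_score, best_params = score, params
    return best_score, best_params


def dummy_score(params, seed):
    """Synthetic stand-in scorer; replace with real training and evaluation."""
    rng = random.Random(seed)
    base = 1.0 - abs(params["num_trees"] - 100) / 1000
    return base - 0.01 * params["min_leaf_size"] - 0.001 * rng.random()


best_score, best_params = grid_search(dummy_score, PARAM_GRID)
```

The grid above has 3 × 3 × 3 = 27 configurations, so ten repeats correspond to 270 training runs per method.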

Overall, model optimization was performed through an exhaustive search over architectures for the backpropagation-trained ANN and through grid search-based hyperparameter tuning for XGBoost and Random Forest.

Model evaluation

This study evaluates the accuracy, robustness, and generalizability of the models using five widely accepted performance metrics. The coefficient of determination (R2) indicates the proportion of variance in the target variable explained by the model, providing a measure of goodness-of-fit. Mean Absolute Error (MAE) quantifies the average magnitude of errors without considering their direction, offering an intuitive measure of prediction accuracy. Mean Squared Error (MSE) penalizes larger errors more than smaller ones by squaring the differences, making it sensitive to outliers. Root Mean Squared Error (RMSE), the square root of MSE, expresses the error in the same unit as the target variable, facilitating better interpretability. Mean Absolute Percentage Error (MAPE) measures the percentage error relative to the actual values, making it useful for assessing relative prediction accuracy across different scales. The mathematical formulas of these metrics48 are given by Eqs. (7)–(11) below:

$${R}^{2}\,=1-\,\frac{{\sum }_{i=1}^{n}{\left({y}_{i}-{\hat{y}}_{i}\right)}^{2}}{{\sum }_{i=1}^{n}{\left({y}_{i}-\bar{y}\right)}^{2}}$$
(7)
$${MAE}=\,\frac{1}{n}\mathop{\sum }\limits_{i=1}^{n}\left|{y}_{i}-{\hat{y}}_{i}\right|$$
(8)
$${MSE}=\,\frac{1}{n}\mathop{\sum }\limits_{i=1}^{n}{\left({y}_{i}-{\hat{y}}_{i}\right)}^{2}$$
(9)
$${RMSE}=\,\sqrt{{MSE}}$$
(10)
$${MAPE}=\,\frac{100}{n}\mathop{\sum }\limits_{i=1}^{n}\left|\frac{{y}_{i}-{\hat{y}}_{i}}{{y}_{i}}\right|$$
(11)

where \({y}_{i}\) represents the actual observed values, \({\hat{y}}_{i}\) denotes the predicted values, \(\bar{y}\) is the mean of the observed values, and n is the total number of observations. Table 3 provides the range and ideal value of these statistical parameters.
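For reference, the five metrics can be computed directly from their definitions, as in the following minimal sketch. The function name `regression_metrics` and the sample values are illustrative only and are not data from this study.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R2, MAE, MSE, RMSE and MAPE (%) following Eqs. 7-11."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [yt - yp for yt, yp in zip(y_true, y_pred)]
    mean_y = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)   # total sum of squares
    mse = ss_res / n
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "MAE": sum(abs(e) for e in errors) / n,
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        # MAPE is undefined when any observed value is zero.
        "MAPE": 100.0 / n * sum(abs(e / yt) for e, yt in zip(errors, y_true)),
    }


observed = [1.0, 2.0, 3.0, 4.0]     # made-up values for illustration
predicted = [1.1, 1.9, 3.2, 3.8]
metrics = regression_metrics(observed, predicted)
```

Note that MAPE divides by the observed values, so it becomes unstable when observations are close to zero, which is relevant for small-magnitude targets such as ILCR.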

Table 3 Ideal value and range of statistical parameters used for evaluation.

Sensitivity calculation

To assess the sensitivity of input variables in predicting ILCR, the study employed the cosine amplitude method for sensitivity analysis using the correlation strength equation. The relationship between an input variable (xi) and the target variable (xj) is quantified by the sensitivity coefficient (Rij), calculated using the equation:

$${R}_{{ij}}=\,\frac{{\sum}_{k=1}^{m}{x}_{{ik}}{x}_{{jk}}}{\sqrt{{\sum}_{k=1}^{m}{x}_{{ik}}^{2}{\sum}_{k=1}^{m}{x}_{{jk}}^{2}}}$$
(12)

where Rij represents the strength of the relationship between the input and output variables across m observations. A higher Rij value indicates a stronger influence of the respective input variable on ILCR predictions. This analysis was conducted separately for both the conventional method (CM) and the proposed pollution source method (PSM) to evaluate how the transformation of meteorological inputs affects model sensitivity.
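Equation 12 can be evaluated with a few lines of code, as in the sketch below. The function name `cosine_amplitude`, the variable names, and the numbers are hypothetical and serve only to show the calculation for normalized inputs.

```python
import math


def cosine_amplitude(x_i, x_j):
    """Eq. 12: sensitivity coefficient R_ij between two equal-length series."""
    if len(x_i) != len(x_j) or not x_i:
        raise ValueError("inputs must be non-empty and of equal length")
    numerator = sum(a * b for a, b in zip(x_i, x_j))
    denominator = math.sqrt(sum(a * a for a in x_i) * sum(b * b for b in x_j))
    return numerator / denominator


# Made-up normalized inputs and target, for illustration only.
inputs = {
    "temperature": [0.2, 0.4, 0.6, 0.8],
    "source_factor_road": [0.9, 0.1, 0.5, 0.3],
}
target_ilcr = [0.1, 0.2, 0.3, 0.4]

sensitivity = {name: cosine_amplitude(v, target_ilcr) for name, v in inputs.items()}
```

In this toy example the first input is exactly proportional to the target, so its coefficient equals 1, while the second input yields a smaller value. For non-negative normalized data the coefficient lies between 0 and 1.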