Introduction

Urban air pollution poses unprecedented challenges to public health and environmental sustainability in the 21st century1,2. With over 90% of the global population exposed to air pollution levels exceeding WHO guidelines3, the need for precise, actionable air quality assessment has never been more critical. Traditional monitoring approaches, while valuable for regulatory compliance, fall short of capturing the hyperlocal variability that characterizes urban air pollution landscapes4,5.

The complexity of urban air quality arises from the intricate interplay of emission sources, meteorological conditions, and urban morphology, creating pollution gradients that can vary by orders of magnitude within single city blocks6,7. This spatial heterogeneity, combined with dynamic temporal patterns, necessitates modeling approaches that can simultaneously address multiple scales and provide actionable insights for urban planning and public health protection8,9.

Current air quality modeling paradigms broadly fall into three categories: physics-based models, data-driven approaches, and hybrid methodologies. Physics-based models, such as dispersion models and chemical transport models (CTMs), provide a mechanistic understanding but require substantial computational resources and detailed input data10,11,12,13. Data-driven approaches, including land-use regression (LUR) and machine learning methods, offer computational efficiency but often lack physical interpretability and struggle with temporal dynamics14,15,16,17. Hybrid approaches attempt to combine the strengths of both paradigms but frequently result in increased complexity without proportional gains in practical utility18,19,20,21.

Despite significant advances in air quality modeling, a fundamental gap persists between the granular information needs of urban decision-makers and the capabilities of existing modeling frameworks22,23. Decision-makers require tools that not only describe and predict air quality patterns but also prescribe actionable interventions24. This necessitates a modeling paradigm that is simultaneously descriptive, predictive, and prescriptive: it should identify pollution sources, transport pathways, and exposure patterns; forecast air quality; and enable scenario analysis and optimization for policy intervention.

Agent-based modeling (ABM) emerges as a promising paradigm to address these requirements25. Originally developed for complex systems analysis, ABM has found applications across diverse domains, including urban planning26, epidemiology27, and ecology28. However, its application to urban air quality modeling remains largely unexplored, despite its inherent advantages for capturing emergent behaviors and spatial interactions.

Here, we introduce a novel agent-based modeling framework specifically designed for urban air quality at high spatio-temporal resolution. While classical ABM involves autonomous entities with rule-based decision-making, the agent-based paradigm more broadly encompasses three core principles: (1) a bottom-up approach where system-level patterns emerge from local interactions, (2) discrete entities that interact through defined rules, and (3) emergent behavior arising from individual entity actions25. We adopt these principles to model urban air quality as pollution exchange among spatially distributed agents governed by physics-based interaction rules.

Our framework discretizes the study area into spatial agents (discrete computational units representing geographic regions) that exchange pollutants, governed by parameterized mass-balance equations on a directed graph network. Equation (1) describes the governing mass balance for pollutant concentration at each agent ‘i’, while equations (2) and (3) describe the outflow and inflow of pollutant. The mass balance is solved simultaneously across all agents to obtain a spatio-temporally resolved pollutant concentration map for the entire study area.

$${P}_{i}(t)={P}_{i}(t-1)+B{P}_{i}(t)-{P}_{i\to }(t)+{P}_{i\leftarrow }(t)$$
(1)

where Pi(t) represents the pollutant concentration at agent i and time t, BPi(t) denotes the net base pollution (sourcing and sinking attribute), and \({P}_{i\to }(t)\) and \({P}_{i\leftarrow }(t)\) represent the outgoing and incoming pollutant fluxes, respectively. The inter-agent transport terms are formulated as:

$${P}_{i\to }(t)=\mathop{\sum }\limits_{j\in {{\mathcal{N}}}_{i}}{K}_{ij}{D}_{ij}{P}_{i}(t-1)$$
(2)
$${P}_{i\leftarrow }(t)=\mathop{\sum }\limits_{j\in {{\mathcal{N}}}_{i}}{K}_{ji}{D}_{ji}{P}_{j}(t-1)$$
(3)
$${D}_{ij}=\frac{{w}_{ij}}{{\sum }_{k\in {{\mathcal{N}}}_{i}}{w}_{ik}}$$
(4)
$${D}_{ji}=\frac{{w}_{ji}}{{\sum }_{k\in {{\mathcal{N}}}_{j}}{w}_{jk}}$$
(5)

where Dij and Dji are the distance-based parameters which dictate the fractional distribution of pollutant as a function of distance, \({{\mathcal{N}}}_{i}\) represents the neighborhood set of agent i, Kij is the convective transfer coefficient, and wij = 1/dij represents an inverse distance-based weighting with dij being the Euclidean distance between agents.
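As a minimal sketch of how Eqs. (1) to (5) might be evaluated for one time step, the following Python snippet builds the distance-based fractions D from a distance matrix and applies the mass balance to all agents at once. The function names, the three-agent chain, and the uniform value K = 0.1 are illustrative assumptions and are not taken from the study; in the actual framework, BP and K are parameterized from real-world features.

```python
import numpy as np

def inverse_distance_fractions(dist, adjacency):
    """D[i, j] = w_ij / sum_k w_ik over neighbours k of i, with w_ij = 1 / d_ij (Eqs. 4-5)."""
    w = np.zeros_like(dist, dtype=float)
    w[adjacency] = 1.0 / dist[adjacency]          # only connected pairs get a weight
    row_sums = w.sum(axis=1, keepdims=True)
    # Agents without neighbours keep a zero row instead of dividing by zero.
    return np.divide(w, row_sums, out=np.zeros_like(w), where=row_sums > 0)

def step(P_prev, BP, K, D):
    """One mass-balance update (Eq. 1) for all agents simultaneously."""
    T = K * D                                     # T[i, j]: fraction of P_i sent from i to j
    outflow = T.sum(axis=1) * P_prev              # Eq. 2: sum_j K_ij D_ij P_i(t-1)
    inflow = T.T @ P_prev                         # Eq. 3: sum_j K_ji D_ji P_j(t-1)
    return P_prev + BP - outflow + inflow

# Hypothetical three-agent chain 0 <-> 1 <-> 2 (values are illustrative only).
dist = np.array([[0.0, 1.0, 2.0],
                 [1.0, 0.0, 1.0],
                 [2.0, 1.0, 0.0]])
adjacency = np.array([[False, True, False],
                      [True, False, True],
                      [False, True, False]])
D = inverse_distance_fractions(dist, adjacency)
K = np.full((3, 3), 0.1)                          # assumed uniform transfer coefficient
P_prev = np.array([10.0, 20.0, 30.0])
BP = np.zeros(3)                                  # no local sources or sinks in this toy case
P_next = step(P_prev, BP, K, D)
print(P_next)
```

With BP set to zero, whatever leaves one agent arrives at a neighbor, so the sum over agents is unchanged between time steps; any net change in the total therefore comes from the BP term.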

Implementation of this model requires defining the agent network, parameterization of BP and K, followed by the estimation of these parameters. These model parameters incorporate real-world geospatial features, enabling the integration of urban characteristics into the model’s knowledge base. Additional details of the model and parameterization strategy can be found in the “Methods” section.

Our approach balances physical interpretability with data-driven learning to create a computationally efficient platform for assessing complex urban pollution dynamics. We demonstrate the framework’s viability across three application domains, viz., descriptive, predictive, and prescriptive analysis, using high-resolution data from a mobile monitoring campaign in Chennai, India.

Results

We demonstrate the proposed approach using data from a mobile monitoring campaign, which took place in Chennai, India, over a 37-day period across April and May of 2019. The study area is divided into 22 nodes (agents) (Fig. 1). Real-world features including traffic data, land-use data, emissions estimates, and weather information (wind speed, temperature, and relative humidity) are used to parameterize the sourcing effect (BP) and the transport effect (K), allowing the model to capture the pollution dynamics of the region under study as well as fundamental physics in the form of the mass balance.

Fig. 1

Spatial aggregation from 100 measurement points (blue circles) to 22 agents (orange circles) for computational efficiency while preserving spatial representativeness.

Descriptive capabilities: understanding pollution dynamics

The framework’s descriptive power emerges through the decomposition of pollution variations into source contributions and transport effects. The factor responsible for a change in the pollution level can be determined by analyzing BP and K. BP, which represents various sources and sinks, is associated with different anthropogenic activities. If the analysis suggests that BP is responsible for the change in the pollution level, then the related anthropogenic activities are further analyzed and validated by assessing the influence of real-world data such as vehicular emissions and land-use data. Figure 2b shows the correlation between PM2.5 concentration and the average vehicular count at different agents. The study region comprises three road types, as shown in Fig. 2c. Agents 1–5, which contain road type 1 (expressways), show a low correlation between PM2.5 and vehicular count. This could be attributed to two factors: (1) industrial emissions could be a significant contributor since these locations are in the industrial zone, as seen in Fig. 2d, and (2) the expressway has minimal stagnation of vehicles and more open area than interior roads. Agents in commercial and residential areas show a higher correlation with traffic, indicating a higher contribution from vehicular sources. Taken together, the plots in Fig. 2 provide descriptive insights into the association between PM2.5, traffic, and land-use.

Fig. 2: Relationship between PM2.5, BP, traffic, and land-use.

a Correlation between BP and average vehicle count at different agents. Agents in commercial, residential, and mixed land-use generally have a higher correlation between vehicle count and PM2.5 than those in industrial land-use. A similar behavior is observed in the correlation trends between BP and vehicle count. b Correlation between PM2.5 and average vehicle count at different agents. c Classification of road types (1, 2, and 3) in the study area. d Land-use profile of the study area.

Furthermore, concurrent analysis of BP and K can shed light on the dynamics of pollutant generation across agents.

Interplay between BP and K

In urban regions, localized anthropogenic factors contribute to the sourcing and sinking ability of each agent. These factors either add pollutant to the agent or remove it. This effect is captured through BP. K, on the other hand, ensures spatial continuity across the region and captures the effect of pollutant transport. Figure 3a shows the change in PM2.5 concentration from 7 AM to 8 AM, while the arrows depict the direction and magnitude of net transport across agents in that time step. Figure 3b shows the value of BP at each agent at the time step corresponding to 8 AM. There is a large increase in the measured PM2.5 concentration at 8 AM at agent 19. The BP value at agent 19 indicates the addition of PM2.5 at this location. Furthermore, neighboring agents also appear to transport PM2.5 to agent 19, as seen in Fig. 3a. Agents 19 and 20 are school zones. At 8 AM, coinciding with school start times29, high PM2.5 concentrations are observed alongside elevated traffic volumes, as indicated by positive BP values. The model’s BP term suggests these factors are correlated, consistent with traffic-related emissions as a contributing source.

Fig. 3: Understanding variation in PM2.5 concentration across agents at 8 AM in terms of transport (K) and sourcing (BP) effects.

(a) shows the change in measured PM2.5 concentration from 7 to 8 AM. Arrows depict the net transfer of PM2.5 across agents. (b) shows the sourcing and sinking agents in the region at 8 AM.

There are also cases where the net increase in concentration could be attributed to transport rather than BP. For instance, agent 1 sinks PM2.5 at 8 AM; however, there is a net increase in absolute concentration due to a significant influx from agent 2 to 1 as seen in Fig. 3a.

Dominance of sourcing vs transport effect at each agent

The PM2.5 concentration at an agent is affected by transport as well as localized addition or removal. For certain nodes, the change in concentration is predominantly driven by convection, while for others it is driven by BP. Two metrics, the BP Ratio (BPR) and the Convection Ratio (CR), are introduced to determine the relative influence of BP and K, respectively, on the overall PM2.5 concentration at each agent. The ratios are calculated for each node as discussed in the “Identifying the relative dominance of BP and K on the agent” subsection under the “Methods” section and are depicted in Fig. 4. The value at the top of each box is the BPR, and the value at the bottom is the CR. Based on the chosen threshold, agents 1, 2, 9, 11, 13, 14, 15, and 20 are identified as BP dominant, while agents 4, 5, 6, 7, 8, 12, 18, 19, and 22 are K dominant. Since the difference in ratios is not significant for the remaining 5 agents, they are classified as agents with equal contribution from BP and K.

Fig. 4: Dominant influence for each agent based on BPR (top value in the box) and CR (bottom value in the box).

Based on the chosen threshold (>30% difference), agents 1, 2, 9, 11, 13, 14, 15, and 20 are identified as BP dominant, while agents 4, 5, 6, 7, 8, 12, 18, 19, and 22 are K dominant. The remaining agents are classified as equally dominant.

Through this analysis, it is possible to ascertain whether the pollutant concentration at a given agent is influenced by local effects or neighboring agents.
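The thresholding step can be illustrated with a short sketch. The definitions of BPR and CR are given in the “Methods” section and are not reproduced here; the snippet below simply assumes that both ratios are available as fractions between 0 and 1 and that the “>30% difference” threshold applies to their absolute difference. Both of these are assumptions made for illustration, as are the function name and the example values.

```python
def classify_agent(bpr, cr, threshold=0.30):
    """Label an agent from its BP Ratio and Convection Ratio.

    Assumption: bpr and cr are fractions, and an agent is 'dominant' in one
    effect when the corresponding ratio exceeds the other by more than
    `threshold` in absolute terms.
    """
    diff = bpr - cr
    if diff > threshold:
        return "BP dominant"
    if diff < -threshold:
        return "K dominant"
    return "equal contribution"

# Hypothetical ratio pairs, not values from Fig. 4.
examples = {"agent A": (0.80, 0.20), "agent B": (0.20, 0.80), "agent C": (0.55, 0.45)}
labels = {name: classify_agent(bpr, cr) for name, (bpr, cr) in examples.items()}
print(labels)
```

If the threshold is instead meant as a relative difference, only the computation of `diff` would need to change.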

Figures 5a and b show the temporal decomposition of the total PM2.5 concentration into its constituents, BPi, \({P}_{i\to }\), and \({P}_{i\leftarrow }\), for agents 1 and 8, respectively. The Pi curve for agent 1, which is identified as BP dominant, follows BPi more closely than \({P}_{i\to }\) or \({P}_{i\leftarrow }\). However, for agent 8, BPi has a relatively lower influence on the Pi curve.

Fig. 5: Temporal decomposition of pollutant concentration for BP Dominant and K Dominant agents.

a The Pi curve for agent 1, which is identified as BP dominant, follows BPi more closely than \({P}_{i\to }\) or \({P}_{i\leftarrow }\). b However, for agent 8, BPi has a relatively lower influence on the Pi curve.

Predictive capabilities: Imputation and Forecasting

The agent-based framework can be used as a spatio-temporal predictive model. Conventional methods for prediction consider either the spatial or the temporal dimension alone while making predictions. However, the dynamics of air quality are affected in space and time in a coupled manner. The framework can be used in two modes: (1) as an imputation model to predict the missing values in the data and (2) as a forecasting model to predict PM2.5 values outside the observed period. The performance of the model in predictive analysis is discussed below.

Data imputation

The agent-based framework can be used to impute missing values, a common issue in environmental monitoring. The model is compared with other imputation methods, namely mean imputation, linear interpolation, cubic spline interpolation, and kriging. Details on the baseline methods are provided in the “Benchmarking spatio-temporal imputation capability” subsection. A Monte Carlo cross-validation was performed over 30 tests. In each test, 50% of the measurements in the original dataset are randomly removed, and the trained model is then used to predict the removed values; the same values are also predicted with each baseline method. Table 1 lists the RMSE values for the various models.
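The cross-validation protocol can be sketched as follows for two of the simpler baselines (mean imputation and linear interpolation). This is an illustration under stated assumptions, not the benchmarking code of the study: the synthetic sinusoidal diurnal series, the function names, and the random seed are all hypothetical, and the agent-based model, cubic spline, and kriging are omitted.

```python
import numpy as np

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mean_impute(series, mask):
    """Fill hidden entries (mask == True) with the mean of the retained ones."""
    filled = series.copy()
    filled[mask] = series[~mask].mean()
    return filled

def linear_impute(series, mask):
    """Fill hidden entries by linear interpolation between retained neighbours."""
    t = np.arange(series.size)
    filled = series.copy()
    filled[mask] = np.interp(t[mask], t[~mask], series[~mask])
    return filled

def monte_carlo_cv(series, imputer, n_tests=30, frac=0.5, seed=0):
    """Average RMSE on randomly hidden values over n_tests random splits."""
    rng = np.random.default_rng(seed)
    n_hide = int(round(frac * series.size))
    scores = []
    for _ in range(n_tests):
        mask = np.zeros(series.size, dtype=bool)
        mask[rng.choice(series.size, size=n_hide, replace=False)] = True
        scores.append(rmse(series[mask], imputer(series, mask)[mask]))
    return float(np.mean(scores))

# Synthetic 24-hour profile (illustrative values, not campaign data).
hours = np.arange(24)
series = 40.0 + 15.0 * np.sin(2 * np.pi * hours / 24)
score_mean = monte_carlo_cv(series, mean_impute)
score_linear = monte_carlo_cv(series, linear_impute)
print(f"mean imputation RMSE: {score_mean:.2f}, linear interpolation RMSE: {score_linear:.2f}")
```

Using the same seed for every imputer means each method is scored on identical hidden subsets, which keeps the comparison between methods paired.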

Table 1 Comparison between various imputation models for 30 Monte Carlo tests

Mean imputation, linear, and cubic interpolation have considerably higher RMSE values. The proposed agent-based model performs better compared to the models evaluated in this work. Figure 6 compares the temporal profiles of observed PM2.5 and the profiles predicted by the agent-based model for three agents chosen randomly.

Fig. 6

Comparing diurnal PM2.5 measurements with prediction (Agent-based model) at three randomly selected agents (6, 11, and 16).

Diurnal forecasting

In this mode, we demonstrate the ability of the model to forecast the full diurnal profile across all the agents. The data collection campaign ran for 37 days with a single mobile monitor. We assume that the campaign ran for only 30 days, and the objective is to test the efficacy of the model in forecasting the diurnal PM2.5 profile of the remaining 7 days. Due to a fundamental limitation in the dataset (described in the “Benchmarking diurnal forecasting capability” subsection), we need to aggregate data from the last 7 days of the monitoring campaign to obtain an average diurnal hourly profile against which to test the forecasting ability. We also compare the agent-based model with three different time-series-based forecasting models. While the dataset is the same for all the methods (including the agent-based method), the training data for each model is processed differently because of the unique requirements of each method. Detailed implementation of each baseline forecasting method is described in the “Benchmarking diurnal forecasting capability” subsection.

The agent-based forecasting model is trained with the data from the first 30 days of the data collection campaign. To evaluate the predictions, data from the final 7 days were aggregated into a 24-h diurnal profile. Spatio-temporal kriging was employed to impute missing observations in both the training and test datasets. While the time-series models are trained on the temporal profile of each node individually, the agent-based model is trained across all the nodes simultaneously. This allows the model to capture spatial association in addition to temporal information. This ability to capture spatial trends, along with the incorporation of real-world features, allows the agent-based approach to outperform the time-series methods, as seen in Table 2, which compares the average RMSE, MAE, and Pearson correlation across all nodes for each forecasting model. The ABM achieves the lowest prediction error and the highest correlation with observations. Predictions from the various forecasting methods for select nodes are visualized in Fig. 7, with circular markers indicating observations imputed via kriging.
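For reference, node-averaged RMSE, MAE, and Pearson correlation of the kind reported in Table 2 could be computed along the following lines. The array layout (one row per node, one column per hour), the function names, and the toy numbers are assumptions made for this sketch and do not come from the study.

```python
import numpy as np

def forecast_metrics(obs, pred):
    """RMSE, MAE, and Pearson correlation for one node's diurnal profile."""
    err = pred - obs
    return {
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "mae": float(np.mean(np.abs(err))),
        # Pearson r is undefined (nan) if either profile is constant.
        "pearson": float(np.corrcoef(obs, pred)[0, 1]),
    }

def average_over_nodes(obs, pred):
    """obs, pred: arrays of shape (n_nodes, n_hours); returns node-averaged metrics."""
    per_node = [forecast_metrics(o, p) for o, p in zip(obs, pred)]
    return {key: float(np.mean([m[key] for m in per_node])) for key in per_node[0]}

# Toy example: two nodes, four time steps, forecast biased high by one unit.
obs = np.array([[1.0, 2.0, 3.0, 4.0],
                [2.0, 4.0, 6.0, 8.0]])
pred = obs + 1.0
summary = average_over_nodes(obs, pred)
print(summary)
```

The toy forecast has a constant bias, so it is perfectly correlated with the observations while still carrying a non-zero RMSE and MAE, which is why the error and correlation metrics are reported side by side.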

Fig. 7: Comparing diurnal PM2.5 predictions from different forecasting models.

Here, the hourly measurements for nodes 1, 8, and 13 (panels a, b, and c, respectively) are compared against the predictions from the agent-based model and the TVAR, Fourier, and persistence forecasting methods. The red circles represent observations that were imputed using spatio-temporal kriging.

Table 2 Performance metrics of different forecasting models

While the forecast shows the aggregate diurnal profile, day-to-day trends are masked in the forecast due to aggregation. Rather than day-to-day forecasts over 7 days, this forecast represents a “typical day” composite. To demonstrate multi-day forecast capabilities, continuous observations are required at each agent, which could be obtained with a static monitoring paradigm. Since this dataset spans a little over one month, the ability of the model to capture seasonal behavior is not demonstrated in this work.

Prescriptive capabilities: Applications in policy and intervention design

Urban air quality improvement requires multi-stage intervention strategies at both local and regional scales30. Recent work has demonstrated the utility of linking urban form indicators with air quality modeling tools to optimize intervention design at the locality level31. Given its detailed descriptive and predictive capabilities, the proposed agent-based framework naturally extends to prescriptive applications, enabling quantitative evaluation of intervention scenarios and optimization of mitigation strategies.

We discuss two prescriptive applications: (1) identification of minimum-exposure routes for personal mobility, and (2) source attribution for targeted intervention design.

Identifying the least-exposure route

Conventional routing algorithms optimize for travel time or distance. However, personal exposure to air pollution during travel depends on both pollutant concentration and travel duration, with substantial spatial and temporal variability32. We therefore propose a method to identify the least-exposure route. The agent-based framework simplifies the least-exposure route problem by reducing it to an aggregation over a discrete network of agents. A detailed mathematical formulation for computing travel exposure is provided in the “Identifying travel exposure at each agent” subsection.

Applying Dijkstra’s algorithm33 with ET as the cost function, we identified the minimum-exposure route between agents 5 and 12. Figure 8 compares this route with the time-optimal route. The minimum-exposure (cleanest) route is 37% longer in distance and requires 45% more travel time, yet reduces total exposure by 14% (Table 3). This reduction occurs because the time-optimal route traverses high-concentration zones, particularly high-traffic corridors where PM2.5 concentrations are elevated.
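The routing idea can be sketched with a standard heap-based Dijkstra search in which the edge cost is switched between travel time and exposure. The four-node graph, the attribute names `time` and `exposure`, and all numeric values below are hypothetical; the exposure cost actually used in the study is defined in the “Identifying travel exposure at each agent” subsection and is not reproduced here.

```python
import heapq

def dijkstra(graph, source, target, weight):
    """Least-cost path on a directed graph.

    graph: {node: {neighbour: {"time": ..., "exposure": ...}}}
    weight: name of the non-negative edge attribute to minimise.
    """
    best = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    done = set()
    while heap:
        cost, u = heapq.heappop(heap)
        if u in done:
            continue                      # stale heap entry
        done.add(u)
        if u == target:
            break
        for v, attrs in graph[u].items():
            new_cost = cost + attrs[weight]
            if new_cost < best.get(v, float("inf")):
                best[v] = new_cost
                prev[v] = u
                heapq.heappush(heap, (new_cost, v))
    if target not in best:
        return None, float("inf")         # target unreachable
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1], best[target]

# Hypothetical network: A-B-D is quicker but more polluted than A-C-D.
graph = {
    "A": {"B": {"time": 5, "exposure": 400}, "C": {"time": 8, "exposure": 240}},
    "B": {"D": {"time": 5, "exposure": 450}},
    "C": {"D": {"time": 7, "exposure": 210}},
    "D": {},
}
fastest, fastest_time = dijkstra(graph, "A", "D", weight="time")
cleanest, cleanest_exposure = dijkstra(graph, "A", "D", weight="exposure")
print(fastest, fastest_time)
print(cleanest, cleanest_exposure)
```

Because only the edge attribute changes between the two calls, the same search yields both the fastest and the cleanest route, which makes the trade-off between travel time and exposure straightforward to tabulate.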

Fig. 8: Fastest route vs least-exposure route (cleanest route) from agent 5 to agent 12.

The fastest route is the route with the least travel time, while the cleanest route is the route with the least ambient exposure.

Table 3 Comparing the fastest and cleanest route

Source attribution and intervention targeting

The framework enables decomposition of pollution contributions by source category, facilitating targeted intervention design. For instance, identification of elevated concentrations near schools (e.g., Agent 19) during peak traffic hours (Fig. 3b) coupled with the high correlation between PM2.5, BP, and traffic (Fig. 2), suggests temporal traffic management measures such as restricted entry zones or staggered school schedules.

The parameters (BP and K) encode relationships between urban form features and pollutant transport. By formulating parameter sensitivity as a feature importance problem, the relative impact of different urban form features (traffic, land-use, building density, vegetation coverage) on air quality can be quantified. This information directly informs intervention prioritization, as demonstrated by Li et al.31. The discrete, linear structure of the agent-based system facilitates integration with standard optimization algorithms such as linear programming, gradient descent, and genetic algorithms for automated intervention design and policy evaluation.

Discussion

We have presented a highly interpretable agent-based modeling framework for the analysis and prediction of urban air quality. Through the case study undertaken in this work, we have demonstrated the capability of this framework to assimilate mobile monitoring data and deliver diurnal insights. Separation of parameters into generation and transport effects allows the model to describe local pollution dynamics. It is also possible to quantify the influence of local sources versus neighborhood transport on pollution at a given location. Unlike conventional temporal-only or spatial-only imputation methods, this framework integrates space-time dynamics for both interpolation and forecasting applications, as we have demonstrated through validation exercises.

Beyond describing the process and making predictions, the framework enables prescriptive applications. The agent-based architecture simplifies the simulation and design of interventions, as demonstrated through the identification of the least-exposure route between locations.

The prescriptive capabilities of the framework we have presented here extend to applications in real-time air quality management, urban development impact assessment, transportation corridor optimization, and green infrastructure planning. The discrete agent-based structure facilitates integration with optimal control algorithms, enabling automated intervention design and policy optimization. Future work will explore advanced deep learning methods and control strategies to fully leverage the framework’s prescriptive potential for sustainable urban development.

The agent-based architecture naturally accommodates extensions. While the current work does not explicitly capture complex chemical transformations, they can be embedded as agent properties, enabling a comprehensive assessment of secondary pollutant formation and multi-species interactions.

The model is trained on a limited dataset from a single urban location. Validation across diverse urban environments with varying morphologies, emission profiles, and meteorological conditions is necessary to establish generalizability. Typically, meteorological interactions and chemical transformation introduce non-linearity in pollution sourcing (BP), whereas street canyon effect and atmospheric turbulence can introduce non-linearity in transport (K). While the governing mass-balance and the parameterization capture only linear relationships in this work, non-linearity could be introduced in the model through the parameterization of BP and K as a function of real-world features. The scope of the current work is a single criterion pollutant over a short time duration and over a relatively small area; hence, a linear model is considered a reasonable approximation. Advanced parameterization techniques, including deep-learning approaches capable of representing non-linear relationships, warrant investigation to enhance model robustness. Furthermore, the incorporation of additional features in the parameterization of BP and K could add richness to the model’s abilities.

The spatial resolution of the agent network presents a computational trade-off: increasing agent density improves spatial resolution but increases computational demand, while decreasing the number of agents would compromise detail. Currently, agent distribution across the modeling domain is governed by the spatial coverage of available data. Systematic investigation of optimal agent number and placement strategies based on spatial autocorrelation of urban features (e.g., road networks, emission sources, building geometry) represents a key research direction. Although this work focuses only on diurnal modeling, the framework is inherently scalable. Extending to different spatial and temporal scales and resolutions is an important direction of study.

This work presents a framework for explainable hyperlocal air quality modeling with descriptive, predictive, and prescriptive capabilities. The key idea of the framework is to discretize the study area into a number of heterogeneous agents, characterized by real-world features, which describe the air quality as a series of exchanges of pollutants among one another. While this work presents a simplified, linear model, tested within a limited domain, there is significant scope to develop the framework in multiple directions to improve robustness and accuracy.

As cities worldwide address air quality challenges, this agent-based approach offers a pathway from developing a hyperlocal understanding to designing scientifically grounded environmental management strategies.

Methods

Agent-based modeling framework

The proposed agent-based framework conceptualizes urban environments as a collection of interacting spatial entities, called agents, which exchange pollutants according to predetermined rules. This approach draws inspiration from cellular automata and agent-based systems, while maintaining adherence to fundamental conservation laws (in the present case, mass balance)34,35.

The system of agents is represented in the form of a directed graph network, with each node being an agent and the edges depicting the transfer pathway for pollutants. Each agent embodies a discrete spatial unit characterized by the geometry (area, connectivity, and distance from other agents), dynamic attributes (meteorology, traffic, etc.), static attributes (land-use, tree cover, water bodies, etc.), and the state (pollutant concentration). These features serve as the properties of each agent, influencing its behavior within the system. The objective of each agent is to update its state (pollutant concentration) through a series of exchanges across other agents as the system evolves over time. The rules of the exchange are governed by a parameterized mass balance. The parameters of the mass balance are derived from the characteristic features defined above, which embody the behavior of each agent. The generation and transport of pollutants across the agents is tracked over discrete time steps as the system evolves.

An agent is capable of increasing or decreasing pollutant levels at a given time. Pollutants may be generated at the agent through local sources such as vehicular traffic, industrial emissions, etc. An agent may also act as a sink for pollutants and remove them from the system. This could be attributed to pollutant deposition on surfaces such as trees36 or water bodies, or removal by any other means within the bounds of the agent. Pollutants removed in this manner leave the system and are not accounted for in the subsequent time step.

Pollutant transfer across agents is primarily convective and depends on the local convective forces, such as wind, vehicular motion, and other meteorological factors. In this work, it is assumed that transfer is only across nodes that are directly connected in the graph network.

An agent-centric perspective allows the framework to embody properties such as spatial heterogeneity, emergent behavior, and individual assessment. Since each agent captures unique local characteristics (land-use, traffic, meteorology), it allows for spatial heterogeneity to be embedded in the framework. Regional pollution patterns emerge from interactions among discrete agents that are not explicitly programmed into the agents, thus exhibiting emergent behavior. Agent-level interventions can be evaluated independently, thus enabling individual assessment. For instance, BP and K parameters have physical meaning at an agent scale.

Data collection

This work aims to establish an agent-based framework for modeling high-resolution spatio-temporal variations in air quality. Recognizing the limitations of sparse fixed monitoring networks, we employed a mobile monitoring paradigm using IoT-enabled sensors deployed on a vehicle traversing a predefined route32. The sensor measures PM2.5, temperature, and relative humidity; see ref. 32 for details of the sensor package and the study. In this study, a monitoring vehicle covered a 25-km-long study route, depicted in Fig. 9, covering an effective area of 13 km². The route was chosen such that the vehicle could capture various land profiles, such as residential, industrial, commercial, and schools. The sensor package was fitted on the top of the vehicle, which was operated continuously over a 37-day period between April and May of 2019. The 25 km route was discretized into 100 locations, each being 250 meters apart. Samples were collected at each location for 3 min. The study was designed so that data points are available at all 100 locations for every hour of the day (although, since mobile monitoring necessitates sequential sampling, generally not on the same day). This approach enabled high spatio-temporal resolution data collection, essential for the model-building process.

Fig. 9: Route covered by the vehicle during the pilot study, highlighting key urban zones including residential, industrial, commercial, and school zones.

The data collection route passes through a diverse land-use aimed at capturing rich spatio-temporal variation in air quality.

Data preprocessing and agent configuration

The original 100 data collection locations (represented by blue circles in Fig. 1) were spatially downsampled to a final set of representative nodes (depicted as orange circles). These nodes represent the agents, and these two terms are used interchangeably in this work. This reduction in spatial resolution to approximately one-fifth of the original (from 100 locations to 22 nodes) was implemented to enhance the signal-to-noise ratio and increase the data density at each agent. The choice of the number of agents and their positioning is an important consideration in this framework. A large number of agents would increase the number of variables to be estimated in the mathematical model and introduce unnecessary computational complexity, whereas a small number of agents would compromise spatial continuity and resolution. In this work, the number and positioning of the agents were determined heuristically to achieve a balance between spatial granularity and statistical robustness. While a mathematically optimal configuration involving multi-objective optimization (e.g., land-use homogeneity, pollutant profile similarity, or end-user resolution requirements) is reserved for future study, the current selection of 22 nodes prioritizes spatial continuity and a temporally equitable distribution of data. For a comprehensive description of the selection and aggregation logic, refer to the “Methodology for representative node selection and data aggregation” section in Supplementary Material. To delineate the spatial catchment of each agent, Voronoi diagrams were employed to define their boundaries37,38. Temporally, PM2.5 concentration within each agent was aggregated to an hourly resolution over a 24-h cycle. This configuration allows the model to effectively capture the characteristic diurnal behavior of the local atmospheric environment.
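The catchment assignment and hourly aggregation described above can be sketched as follows. This is a minimal illustration, not the selection and aggregation logic documented in the Supplementary Material: the coordinates, sampling hours, and PM2.5 values are made up, and the sketch relies only on the fact that a point lies in an agent's Voronoi cell exactly when that agent is its nearest one.

```python
import numpy as np
from scipy.spatial import cKDTree

# Hypothetical agent positions (km) and raw samples; none of these are study data.
agents_xy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
samples_xy = np.array([[0.1, 0.1], [0.9, 0.2], [0.2, 0.8], [0.6, 0.1]])
samples_hour = np.array([0, 0, 1, 0])            # hour of day of each sample
samples_pm = np.array([40.0, 50.0, 30.0, 60.0])  # PM2.5 readings

# Voronoi membership is equivalent to a nearest-agent lookup.
_, owner = cKDTree(agents_xy).query(samples_xy)

# Accumulate sums and counts per (agent, hour), then take the mean.
n_hours = 24
total = np.zeros((len(agents_xy), n_hours))
count = np.zeros((len(agents_xy), n_hours))
np.add.at(total, (owner, samples_hour), samples_pm)
np.add.at(count, (owner, samples_hour), 1)

# Cells without any sample stay NaN (the spatio-temporal gaps).
hourly = np.full_like(total, np.nan)
np.divide(total, count, out=hourly, where=count > 0)
```

Cells of `hourly` that remain NaN correspond to agent-hour combinations the vehicle never sampled.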

Since data were collected with a single mobile monitor, there are spatio-temporal gaps in the original dataset. While temporal aggregation gives average diurnal trends, day-to-day variability gets masked. For this particular study, since the data collection was performed over the course of one month in summer, the influence of meteorology (particularly seasonal meteorology) was minimal. Other than local short-term sourcing, the variability in the dataset is not significant. The dataset lacks seasonality and has stationary statistical properties. Detailed analysis of stationarity of the dataset is available in the “Stationarity” section in Supplementary Material.

Defining agent adjacency

Once the number and position of the agents have been identified, the next step in the framework is to determine the mode of agent interaction. In this study, we assume an immediate neighbor interaction mode. Here, pollution exchange happens only between agents that are immediate neighbors. Agents are considered immediate neighbors only if they meet the following criteria:

Unobstructed line-of-sight: Pollution exchange happens only at street level between agents that have direct line-of-sight with each other. Line-of-sight is determined using OpenStreetMap building footprints. Agents separated by major structures (e.g., trees, building complexes) are considered not connected.

Sequential adjacency: Connectivity is non-transitive. Even if multiple agents share a single unobstructed line-of-sight, they are not considered ‘immediate’ neighbors. Interaction is limited to the immediate predecessor and successor. This prevents “skipping” over intermediate agents.

Contiguous data record: In order to ensure physical continuity of data points, only those nodes between which measurements are recorded along the collection route are considered interacting.

For example, in Fig. 1, agent 21 is a neighbor to agents 18 and 2. However, there is no interaction between them since they do not have an ‘unobstructed’ line-of-sight or contiguous data record between them. Similarly, 21 is not adjacent to 7 even though they have line-of-sight and contiguous data records because agent 8 is in between, breaking sequential adjacency. While other, more complex modes of interaction can also be defined, this paper only considers immediate neighbor interaction. Changing the mode of interaction would impact the system complexity and the interpretation of the results.

The inter-agent interaction is defined in terms of the adjacency matrix \({\bf{A}}\in {{\mathbb{R}}}^{n\times n}\) (where n denotes the number of nodes/agents). The (i, j) entry of the adjacency matrix A is defined as

$${A}_{i,j}=\left\{\begin{array}{l}1\,\,{\mathrm{if}}\,{\mathrm{node}}\,i\,{\mathrm{interacts}}\; {\mathrm{with}}\;{\mathrm{node}}\,j\\ 0\,\,{\mathrm{if}}\; {\mathrm{there}}\; {\mathrm{is}}\; {\mathrm{no}}\; {\mathrm{interaction}}\; {\mathrm{between}}\; {\mathrm{node}}\,i\,{\mathrm{and}}\; {\mathrm{node}}\,j\end{array}\right.$$
(6)

The distance coefficient matrix \({\bf{D}}\in {{\mathbb{R}}}^{n\times n}\) has (i, j) entry

$${D}_{ij}=\left\{\begin{array}{ll}\frac{{w}_{ij}}{{\sum }_{k\in {{\mathcal{N}}}_{i}}{w}_{ik}} & \,{\rm{if}}\,{A}_{i,j}=1\\ 0 & \,{\rm{if}}\,{A}_{i,j}=0\end{array}\right.$$
(7)

and is populated based on the adjacency matrix. The elements of the distance coefficient matrix are populated only for those nodes that are ‘adjacent’. The number of non-zero entries in the adjacency matrix determines the number of Kij to be estimated. The final adjacency matrix has 54 inter-nodal interactions, the corresponding convective transfer coefficients for which need to be estimated. Figure S1 shows a directed graph depicting the inter-agent interactions.
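The construction of A and D in equations (6) and (7) can be illustrated on a toy network. This is a sketch under stated assumptions: the four-agent chain and its distances are hypothetical, and because the weights w_ij are not defined in this section, an inverse-distance weight is assumed purely for illustration.

```python
import numpy as np

# Hypothetical 4-agent chain 0-1-2-3 with illustrative inter-agent distances (km).
n = 4
dist_km = {(0, 1): 1.0, (1, 2): 0.5, (2, 3): 2.0}

A = np.zeros((n, n))   # adjacency matrix, Eq. (6)
W = np.zeros((n, n))   # raw weights w_ij
for (i, j), d in dist_km.items():
    A[i, j] = A[j, i] = 1.0
    # ASSUMPTION: w_ij is taken as inverse distance; the paper's definition may differ.
    W[i, j] = W[j, i] = 1.0 / d

# Eq. (7): normalize each row over the neighbors of agent i.
row_sum = W.sum(axis=1, keepdims=True)
D = np.divide(W, row_sum, out=np.zeros_like(W), where=row_sum > 0)

# Each non-zero entry of A corresponds to one K_ij to be estimated.
n_interactions = int(A.sum())
```

With this normalization every row of D sums to one, so D_ij can be read as the share of agent i's outgoing exchange directed to neighbor j.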

Parameterization strategy for BP

Base pollution (BP) represents the amount of pollutant added or removed at each agent. The primary local sources for PM2.5 in Indian cities include vehicular emission39, industrial emissions, and anthropogenic activities in residential and commercial zones. The potential sinks are deposition onto surfaces, particularly to water bodies or foliage36. Thus, BP should be characterized in terms of land-use features that capture these sourcing and sinking effects.

High-resolution spatio-temporal vehicular emission information was not available for this area, so vehicle count is used as a surrogate for the contribution of vehicular emissions. The data is collected from the TomTom website40, which gives the average vehicular count for a given segment of a road network. Hourly vehicular count information was collected for a typical day during the measurement period. The service also classifies roads into seven functional classes, described in Table 4. These seven classes are grouped into three road types by merging classes with similar functional roles. Figure 2c shows the three road types in the study area that are considered in the parameterization.

Table 4 Road classification system for traffic emission characterization

The study area, although geographically compact, has a diverse land-use profile, with residential, industrial, and commercial areas. Figure 2d shows the diverse land-use profile across the study route. The land-use data for Chennai is available on the Chennai Metropolitan Development Authority (CMDA) website41. Residential and industrial emission information is not readily available for this area. Hence, the fractional area of each land-use type is taken as a proxy for sectoral contribution. For the study area, the unique land-use categories are: Primary Residential, Mixed Residential, Industrial, Commercial, Institutional, and Water Body. Since the mixed residential and institutional areas are minimal, and to reduce the number of land-use categories considered, they are grouped with commercial land-use. The remaining four land-use categories are: Residential, Industrial, Commercial, and Water Body. For each agent, the fractional area of each land-use category is calculated and used as a predictor variable.

Finally, BP is parameterized in terms of land-use and vehicular count. Hence, BP at agent i can be written as:

$$\begin{array}{rcl}B{P}_{i}(t)={a}_{1}(t){R}_{i}+{a}_{2}(t){I}_{i}+{a}_{3}(t){C}_{i} & & +{a}_{4}(t)W{b}_{i}\\ & & +{a}_{5}(t)R{1}_{i}+{a}_{6}(t)R{2}_{i}+{a}_{7}(t)R{3}_{i}\end{array}$$
(8)

where Ri, Ii, Ci, and Wbi represent the fractional areas of residential, industrial, commercial, and water-body land-use at a given agent i. R1i, R2i, and R3i represent the traffic count in road types 1, 2, and 3, respectively, at agent i. This linear parameterization strategy for real-world features is borrowed from conventional land-use regression models42.
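As a worked illustration of equation (8), the sketch below evaluates BP for two agents at a single hour. All numbers are hypothetical (feature values and coefficients alike); in the framework the coefficients a1(t)…a7(t) are estimated from data rather than chosen by hand.

```python
import numpy as np

# Hypothetical features for two agents.
# Land-use fractions: residential, industrial, commercial, water body.
land_use = np.array([[0.6, 0.1, 0.3, 0.0],
                     [0.1, 0.7, 0.1, 0.1]])
# Traffic counts on road types 1, 2, 3.
traffic = np.array([[120.0, 40.0, 10.0],
                    [300.0, 80.0, 0.0]])

# Feature matrix with columns [R, I, C, Wb, R1, R2, R3], one row per agent.
features = np.hstack([land_use, traffic])

# Illustrative coefficients a_1(t)..a_7(t) for one hour (not estimated values).
a_t = np.array([1.0, 4.0, 2.0, -3.0, 0.01, 0.005, 0.002])

# Eq. (8): BP_i(t) is a linear combination of the agent's features.
bp_t = features @ a_t
```

A negative coefficient on the water-body fraction, as assumed here, would correspond to a net sink.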

Parameterization strategy for K

PM2.5 can be transported from one agent to another in two ways: convection or diffusion. It is assumed that diffusion effects are negligible over the inter-agent distances considered in this study, and therefore convective transport is the dominant effect contributing to the movement of PM2.5 between agents. Wind speed, temperature, and humidity affect the convection of PM2.5. Hence, K is expressed as a function of these factors. Transfer across agents depends on the relative position of the agents, the wind speed, and the wind direction.

Kij denotes the convective transfer coefficient of agent i with respect to agent j. Kij is parameterized in terms of temperature, relative humidity, wind speed, and wind direction as:

$${K}_{ij}(t)={b}_{1}(t){T}_{i}(t)+{b}_{2}(t){H}_{i}(t)+{b}_{3}(t)W{S}_{i}(t)* {\phi }_{ij}(t)$$
(9)

In equation (9), Ti represents the temperature at agent i, and Hi represents the relative humidity at agent i. WSi is the wind speed at agent i, and ϕij is the fractional area under agent j along the wind direction through agent i.

Temperature and humidity measurements are gathered from the mobile monitoring data. Wind data is obtained from the European Centre for Medium-Range Weather Forecasts (ECMWF) database. The temporal resolution of the available data is one hour, and the spatial resolution is 25 km. The values of wind speed and wind direction at the desired agent are interpolated using the Clough-Tocher method43, as implemented in the CloughTocher2DInterpolator class of SciPy’s interpolate sub-package44.
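A minimal sketch of the interpolation step is given below. The grid, wind values, and agent positions are hypothetical. As an assumption not stated in the text, the eastward and northward wind components are interpolated and then converted to speed and direction, which avoids the discontinuity between 0° and 360° that arises when a direction angle is interpolated directly.

```python
import numpy as np
from scipy.interpolate import CloughTocher2DInterpolator

# Hypothetical coarse 3x3 grid (degrees) with made-up wind components (m/s).
lon, lat = np.meshgrid([80.0, 80.25, 80.5], [12.75, 13.0, 13.25])
grid_xy = np.column_stack([lon.ravel(), lat.ravel()])
u = 2.0 + 4.0 * (grid_xy[:, 0] - 80.0)     # eastward component
v = -1.0 + 2.0 * (grid_xy[:, 1] - 12.75)   # northward component

# Values of shape (npoints, 2) let one interpolator handle both components.
interp = CloughTocher2DInterpolator(grid_xy, np.column_stack([u, v]))

# Hypothetical agent positions; the first coincides with the central grid node.
agents_xy = np.array([[80.25, 13.0], [80.21, 13.05]])
uv = interp(agents_xy)

wind_speed = np.hypot(uv[:, 0], uv[:, 1])
# Meteorological convention: the direction the wind blows FROM, in degrees.
wind_from_deg = np.degrees(np.arctan2(-uv[:, 0], -uv[:, 1])) % 360.0
```

Points outside the convex hull of the grid return NaN by default, so agents near the edge of the coarse grid would need separate handling.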

Parameter estimation

The base model depicted in the form of the mass balance is shown in equation (1). By incorporating parameterized BP and K into the mass balance, real-world features are introduced in the model. The final agent-based model in terms of the real-world parameters is formulated, after simplification and generalization, as:

$$\begin{array}{rcl}{P}_{i}(t)\!\!\!&=&\!\!\!{P}_{i}(t-1)+{a}_{1}(t){R}_{i}+{a}_{2}(t){I}_{i}+{a}_{3}(t){C}_{i}+{a}_{4}(t)W{b}_{i}+{a}_{5}(t)R{1}_{i}+{a}_{6}(t)R{2}_{i}\\&&\!\!\!+{a}_{7}(t)R{3}_{i}+{b}_{1}(t){T}_{i,new}(t)+{b}_{2}(t){H}_{i,new}(t)+{b}_{3}(t)g({W}_{s},\phi )(t)\end{array}$$
(10)

where

$${T}_{i,new}(t)=-{T}_{i}(t){P}_{i}(t-1)\mathop{\sum }\limits_{j\in {{\mathbb{Z}}}_{i}}{D}_{ij}+\mathop{\sum }\limits_{j\in {{\mathbb{Z}}}_{i}}{T}_{j}(t){D}_{ji}* {P}_{j}(t-1)$$
(11)
$${H}_{i,new}(t)=-{H}_{i}(t){P}_{i}(t-1)\mathop{\sum }\limits_{j\in {{\mathbb{Z}}}_{i}}{D}_{ij}+\mathop{\sum }\limits_{j\in {{\mathbb{Z}}}_{i}}{H}_{j}(t){D}_{ji}* {P}_{j}(t-1)$$
(12)

and

$$g({W}_{s},\phi )(t)=-{W}_{si}(t)\cdot {P}_{i}(t-1)\mathop{\sum }\limits_{j\in {{\mathbb{Z}}}_{i}}{D}_{ij}{\phi }_{ij}(t)+\mathop{\sum }\limits_{j\in {{\mathbb{Z}}}_{i}}{W}_{sj}(t){\phi }_{ji}(t){D}_{ji}{P}_{j}(t-1)$$
(13)
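Equations (11)–(13) share one structure: an outflow term from agent i and an inflow term from its neighbors. The sketch below writes that structure in matrix form. The function name and the two-agent inputs are hypothetical; the sums over neighbors are taken over all j, which is equivalent because D is zero for non-adjacent pairs.

```python
import numpy as np

def transport_feature(x, p_prev, D, phi=None):
    """Evaluate the common form of Eqs. (11)-(13) for all agents at once.

    x      : driver at each agent (T, H, or wind speed), shape (n,)
    p_prev : P(t-1) at each agent, shape (n,)
    D      : distance coefficient matrix, shape (n, n)
    phi    : optional directional fractions phi_ij, shape (n, n); used for Eq. (13)
    """
    M = D if phi is None else D * phi
    flux = x * p_prev
    outflow = flux * M.sum(axis=1)   # x_i P_i(t-1) * sum_j M_ij
    inflow = M.T @ flux              # sum_j x_j M_ji P_j(t-1)
    return inflow - outflow

# Hypothetical two-agent example in which each agent's only neighbor is the other.
D = np.array([[0.0, 1.0],
              [1.0, 0.0]])
T = np.array([30.0, 20.0])        # temperature at each agent
P_prev = np.array([10.0, 5.0])    # PM2.5 at the previous hour
T_new = transport_feature(T, P_prev, D)
```

Passing relative humidity as `x` gives Hi,new, and passing wind speed together with `phi` gives g(Ws, ϕ).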

For step-by-step simplification, please refer to the “Generalized representation of ABM with parameterization” section in Supplementary Material. Ti,new, Hi,new, and g(Ws, ϕ) are obtained through real-world data. The agent-based model, in terms of real-world features, is still linear in parameters. The 22 agents are described by a system of 22 coupled equations, which are solved at an hourly time step. Thus, 10 coefficients are to be estimated from 22 equations at each time step, yielding 12 residual degrees of freedom. The system is overdetermined, so a unique minimum-residual solution exists at each hour provided X[t] has full column rank. The system of equations is described by equations (14)–(17), where y[t], X[t], and θ[t] are as follows:

$${\bf{y}}[{\bf{t}}]=\left[\begin{array}{c}{P}_{1}(t)-{P}_{1}(t-1)\\ {P}_{2}(t)-{P}_{2}(t-1)\\ \vdots \\ {P}_{21}(t)-{P}_{21}(t-1)\\ {P}_{22}(t)-{P}_{22}(t-1)\end{array}\right]$$
(14)
$${\bf{X}}[{\bf{t}}]=\left[\begin{array}{cccccccccc}{R}_{1} & {I}_{1} & {C}_{1} & {W}_{b1} & R{1}_{1} & R{2}_{1} & R{3}_{1} & {T}_{1,new} & {H}_{1,new} & {g}_{1}({W}_{s},\phi )\\ {R}_{2} & {I}_{2} & {C}_{2} & {W}_{b2} & R{1}_{2} & R{2}_{2} & R{3}_{2} & {T}_{2,new} & {H}_{2,new} & {g}_{2}({W}_{s},\phi )\\ & & & & & \vdots & & & & \\ {R}_{21} & {I}_{21} & {C}_{21} & {W}_{b21} & R{1}_{21} & R{2}_{21} & R{3}_{21} & {T}_{21,new} & {H}_{21,new} & {g}_{21}({W}_{s},\phi )\\ {R}_{22} & {I}_{22} & {C}_{22} & {W}_{b22} & R{1}_{22} & R{2}_{22} & R{3}_{22} & {T}_{22,new} & {H}_{22,new} & {g}_{22}({W}_{s},\phi )\end{array}\right]$$
(15)
$${\boldsymbol{\theta }}[t]={\left[\begin{array}{cccccccccc}{a}_{1} & {a}_{2} & {a}_{3} & {a}_{4} & {a}_{5} & {a}_{6} & {a}_{7} & {b}_{1} & {b}_{2} & {b}_{3}\end{array}\right]}^{T}$$
(16)
$${\bf{y}}[t]={\bf{X}}[t]{\boldsymbol{\theta }}[t]$$
(17)
$$\widehat{{\boldsymbol{\theta }}}[t]=\mathrm{argmin}||{\bf{y}}[t]-{\bf{X}}[t]{\boldsymbol{\theta }}[t]|{|}_{2}^{2}$$
(18)

The solution to this linearized agent-based model (Equation (18)) is obtained by minimizing the squared L2 norm of the residual error, which corresponds to the ordinary least squares (OLS) estimation problem. This formulation is well-suited for parameter estimation in linear dynamical systems where observations are available at multiple spatial locations45.
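The per-hour estimation in equation (18) can be sketched with NumPy's least-squares solver. The design matrix and coefficients below are synthetic stand-ins with the same shape as the problem described here (22 agents, 10 unknowns); they are not study data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_agents, n_params = 22, 10

# Synthetic stand-ins for X[t] (Eq. 15) and theta[t] (Eq. 16) at a single hour.
X_t = rng.normal(size=(n_agents, n_params))
theta_true = rng.normal(size=n_params)
y_t = X_t @ theta_true                     # Eq. (17), noise-free for illustration

# Eq. (18): minimize ||y - X theta||_2^2.
theta_hat, _, rank, _ = np.linalg.lstsq(X_t, y_t, rcond=None)

# Residual degrees of freedom; equals 12 when X[t] has full column rank.
residual_dof = n_agents - rank
```

In the framework this solve would be repeated independently for each of the 24 hourly time steps. Checking the returned `rank` is a cheap way to detect collinear predictors at a given hour.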

Validation of model parameters

An ordinary least squares (OLS) formulation is used to learn the coefficients of the proposed model. Hourly average PM2.5 concentration data is available from the data collection endeavor. At each time step t, a system of 22 equations in 10 unknowns is solved independently, constituting an overdetermined problem. Since the coefficient vector θ[t] is estimated from only the 22 spatial observations at each hour, the ratio of observations to unknowns is small (22:10). Consequently, while the least squares solution is useful, an appropriate cross-validation strategy is required. A spatial leave-one-agent-out cross-validation (LOOCV) is performed, wherein one agent is held out in each iteration. Coefficients are estimated from the remaining 21 agents, and the PM2.5 concentration change at the held-out agent is predicted using those coefficients. This is repeated for all 22 agents and all 24 h. The RMSE and Spearman rank correlation are summarized in Table 5. The fit on the test dataset for each agent is shown in Fig. 10, where the model is trained on 21 agents and tested on the one left-out agent. The plot shows the concentration change between the 23rd and 24th hours, i.e., t = 24. The estimated OLS coefficients are then analyzed to verify that BP and K are physically meaningful. For this purpose, traffic, wind speed, and wind direction data are analyzed.
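The leave-one-agent-out procedure for a single hour can be sketched as below. The function name and the synthetic data are illustrative assumptions; only the hold-out logic and the two reported metrics (RMSE and Spearman rank correlation) follow the description above.

```python
import numpy as np
from scipy.stats import spearmanr

def leave_one_agent_out(X, y):
    """Predict each agent's response from coefficients fitted on the other agents."""
    n = X.shape[0]
    pred = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i                        # hold out agent i
        theta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        pred[i] = X[i] @ theta
    return pred

# Synthetic single-hour problem with 22 agents and 10 unknowns (not study data).
rng = np.random.default_rng(1)
X = rng.normal(size=(22, 10))
theta_true = rng.normal(size=10)
y = X @ theta_true + rng.normal(scale=0.1, size=22)

pred = leave_one_agent_out(X, y)
rmse = float(np.sqrt(np.mean((pred - y) ** 2)))
rho, _ = spearmanr(y, pred)
```

Each fold still has 21 equations for 10 unknowns, so every held-out fit remains overdetermined.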

Fig. 10: Model performance on held-out test agent showing agent-specific prediction accuracy across the study domain for the 24th time step (t = 24).

The plot compares the modeled change P(t) − P(t − 1) for each agent at time step t = 24 with the corresponding change measured by the sensor.

Table 5 Leave-one-out cross-validation performance metrics

Model predictions demonstrate strong agreement with observations (Fig. 10).

The convective transfer coefficient K is averaged over all agents for each hour of the day, and the wind speed is likewise averaged over the 22 agents. The resulting temporal profiles of K and wind speed over the region are shown in Fig. 11. It can be observed that the diurnal profile of the convective transfer coefficient matches the wind speed profile qualitatively. A Spearman rank correlation of 0.41 and a Pearson correlation of 0.55 indicate that the convective transfer coefficient and average wind speed are moderately correlated. The convective transfer coefficient dictates the proportion of mass transferred from one agent to another. It is expected that the higher the wind speed, the higher the pollutant movement across agents. Similarly, as the wind speed reduces, the pollutant movement reduces. The same phenomenon is captured in the estimated convective transfer coefficient.

Fig. 11: Temporal profile of wind speed and convective transfer coefficient.


Major sources of the base pollution in urban areas are traffic sources and emissions from industries. Figure 2b shows the correlation between BP and average vehicular count along with the land-use information for each agent. The region comprises four land-use categories, namely, commercial, industrial, residential, and water regions, and is distributed as shown in Fig. 2d. BP for most agents (2,4,6,8,20,21) in the industrial region shows a lower correlation with vehicular count, which indicates that the contribution of industrial emissions to BP is possibly higher than the vehicular contribution. Agents 1, 3, and 5, however, show slightly higher correlation. These higher correlation values can, however, be attributed to the fact that they are major traffic intersections. Residential and commercial agents have a significantly higher correlation with traffic, indicating vehicular traffic as a significant contributor.

Identifying the relative dominance of BP and K on the agent

Each agent, depending on local factors, would have a dominant contributor. Two metrics, BPR (BP Ratio) and CR (Convection Ratio), are proposed to determine the relative strength of BP and K in influencing overall pollution at a given agent across all times (Equations (19)–(21)).

While BPR quantifies the change in total pollution, P(t), relative to the change in BP, CR quantifies the change in P(t) relative to the transport of the pollutant.

$$BPR=\frac{{\sum }_{t=1}^{24}{(B{P}_{i}(t)-{\overline{BP}}_{i})}^{2}}{{\sum }_{t=1}^{24}{({P}_{i}(t)-{\bar{P}}_{i})}^{2}}$$
(19)
$${C}_{i}(t)={P}_{i\leftarrow }(t)-{P}_{i\to }(t)$$
(20)
$$CR=\frac{{\sum }_{t=1}^{24}{({C}_{i}(t)-{\bar{C}}_{i})}^{2}}{{\sum }_{t=1}^{24}{({P}_{i}(t)-{\bar{P}}_{i})}^{2}}$$
(21)

An agent is identified as source dominant (‘BP dominant’) if its BPR is at least 30% larger than its CR. It is identified as transport dominant (‘K dominant’) if its CR is at least 30% larger than its BPR. If BPR and CR differ by less than 30%, the agent is identified as ‘Equal’, implying that sourcing and transport contribute almost equally to the total pollution concentration. Although a simple max(BPR, CR) selection is the theoretical ideal, we adopt a 30% threshold to safeguard against modeling noise and aggregation uncertainties. This threshold is a heuristic that ensures dominance is assigned only when one factor significantly outweighs the other. Users seeking higher sensitivity or different confidence levels may adjust this value to suit specific air quality management objectives.
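The two ratios and the dominance rule can be sketched as follows. The function names are hypothetical, and the 30% rule is interpreted here as a relative margin (one ratio being at least 1.3 times the other), which is one reading of the criterion rather than a confirmed implementation detail.

```python
import numpy as np

def variance_ratio(component, total):
    """Shared form of Eqs. (19) and (21): variation of a component relative to P."""
    num = np.sum((component - component.mean()) ** 2)
    den = np.sum((total - total.mean()) ** 2)
    return num / den

def classify_dominance(bpr, cr, threshold=0.30):
    """ASSUMPTION: 'at least 30% larger' is read as a ratio of at least 1.3."""
    if bpr >= (1.0 + threshold) * cr:
        return "BP dominant"
    if cr >= (1.0 + threshold) * bpr:
        return "K dominant"
    return "Equal"

# Hypothetical 24-h profiles for one agent (not study data).
hours = np.arange(24)
p_total = 40.0 + 10.0 * np.sin(2 * np.pi * hours / 24)
bp = 0.8 * (p_total - p_total.mean())    # source term tracks most of the variation
conv = 0.3 * (p_total - p_total.mean())  # net convective term C_i(t), Eq. (20)

bpr = variance_ratio(bp, p_total)
cr = variance_ratio(conv, p_total)
label = classify_dominance(bpr, cr)
```

Lowering `threshold` makes the classification more sensitive, at the cost of assigning dominance on smaller differences.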

Benchmarking spatio-temporal imputation capability

To evaluate the quality of ABM-based imputation, it is compared with four widely-used alternatives: mean imputation, linear interpolation, cubic spline interpolation, and kriging.

Mean imputation is the simplest gap-filling strategy. Every missing value at a given node is replaced by the arithmetic mean of all available PM2.5 observations at that node across the entire record. Formally, if the set of observed values at node i is {ci(t): t ∈ Tobs}, then every missing time step is filled using equation (22):

$${\widehat{c}}_{i}(t)=\frac{1}{| {T}_{obs}| }\sum _{s\in {T}_{obs}}{c}_{i}(s)$$
(22)

This approach preserves the mean of each node but ignores temporal structure, diurnal patterns, and spatial relationships. It therefore tends to produce flat imputed segments that distort both variability and correlation statistics.

Linear interpolation assumes that the change in PM2.5 between two temporally adjacent known values is constant. For a gap bounded by observations at times ta and tb, the imputed value at any intermediate time t (ta < t < tb) is:

$${\widehat{c}}_{i}(t)={c}_{i}({t}_{a})+[{c}_{i}({t}_{b})-{c}_{i}({t}_{a})]\frac{(t-{t}_{a})}{({t}_{b}-{t}_{a})}$$
(23)

Linear interpolation respects the local level at the gap boundaries and is computationally trivial, but it cannot capture the curvature or diurnal periodicity that typically characterizes PM2.5 profiles. It operates purely in the temporal dimension at each node independently.

Cubic spline interpolation generalizes the linear approach by fitting a piecewise third-order polynomial (cubic spline) through the known data points, requiring that the resulting curve and its first two derivatives are continuous. For both linear and cubic spline interpolation implementations, we use SciPy’s interpolate module46.
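The three purely temporal baselines can be compared on a single synthetic diurnal profile. This is a sketch only: the profile and gap positions are made up, and `interp1d` and `CubicSpline` are used as two readily available routines from SciPy's interpolate module, without implying these are the exact calls used in the study.

```python
import numpy as np
from scipy.interpolate import CubicSpline, interp1d

# Synthetic 24-h PM2.5 profile with three artificial gaps (not study data).
hours = np.arange(24, dtype=float)
truth = 30.0 + 10.0 * np.sin(2 * np.pi * hours / 24)
series = truth.copy()
series[[5, 6, 14]] = np.nan
obs = ~np.isnan(series)

# Mean imputation, Eq. (22): fill every gap with the mean of observed values.
mean_fill = np.where(obs, series, np.nanmean(series))

# Linear interpolation, Eq. (23), between the observations bounding each gap.
linear_fill = np.where(obs, series, interp1d(hours[obs], series[obs])(hours))

# Cubic spline through the observed points.
cubic_fill = np.where(obs, series, CubicSpline(hours[obs], series[obs])(hours))
```

On a smooth profile such as this one the spline tracks the curvature that linear interpolation misses, while mean imputation flattens the gap entirely.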

Kriging is a geo-statistical interpolation technique that fills missing values by modeling the spatial or temporal correlation structure of the data through a variogram47. Two variants are employed in this study. In temporal kriging, the autocorrelation of the PM2.5 time series at each node is modeled and used to interpolate across time gaps at that node. In spatial kriging, the cross-correlation among all nodes at a given time step is modeled and used to interpolate the value at a node that has a missing observation from the values observed at neighboring nodes at the same time step. Ordinary 2D kriging was performed using the PyKrige library in Python48.

Among the imputation baselines, mean imputation and the two interpolation methods operate exclusively in the temporal domain at each node in isolation, whereas kriging can additionally exploit spatial correlations. The ABM is unique in that its imputation leverages both the learned spatial diffusion parameters (edge weights) and the temporal transition structure simultaneously, drawing on the full network topology. Results of the imputation benchmarking are available in the “Data imputation” subsection.

Benchmarking diurnal forecasting capability

Forecasting differs from imputation in that the model must predict concentrations for a future period with no access to observations within that period. As mentioned earlier, the data collection campaign was run over a duration of 37 days. Data collected over the last 7 days of the campaign was used to test the agent-based model’s forecasting capability. However, instead of forecasting over a horizon of 7 days, the model is used to predict the average diurnal profile across the 7 days. This intentional choice was based on the limitations of the dataset. During the data collection campaign, each measurement location is visited only twice in a period of 24 h (the entire 25 km circuit takes roughly 10 h to complete). This limits the number of observations available for each node for each day. Without suitable aggregation, observations for each node for each day are too sparse to infer meaningful insights. Thus, diurnal forecasting is performed to predict the average PM2.5 concentration over the last 7 days of the data collection campaign. The training dataset, on the other hand, uses data only from the first 30 days of the campaign.

Forecast from the agent-based approach is benchmarked against three time-series forecasting methods, each representing a distinct philosophy: persistence (a basic replay of the previous day), a Time-Varying Autoregressive (TVAR) model, and a Fourier-based method (spectral extrapolation of periodic components).

The persistence model is the simplest baseline. It assumes that the average diurnal PM2.5 cycle observed during the training period will repeat unchanged during the test period. For a given node i and forecast hour h on day d, the prediction is:

$${\widehat{c}}_{i}(d,h)={c}_{i}(d-1,h)$$
(24)

In this implementation, the predicted test profile for each node is an exact copy of the average profile aggregated over the initial 30 days. This method involves no parameter estimation, no temporal modeling, and no spatial information. It serves as a critical lower-bound baseline.

Persistence differs from the other two benchmarking methods in that it makes no attempt to extrapolate or modify the training profile. TVAR and Fourier both transform the training cycle through learned parameters to produce a prediction that may diverge from the training average. Persistence, by contrast, assumes perfect stationarity of the diurnal pattern between the training and test periods.

The Time-Varying Autoregressive (TVAR) model is a non-stationary time-series model in which the regression coefficients change with time. It predicts current values using past values while allowing the relationship between data points to evolve, which makes it well suited to modeling dynamic systems49.

In this case, the first 30 days of the data are used to train the TVAR model. For each target hour t in the cycle, a local estimation window of ±12 time steps centered on the middle repetition is extracted. Within this window an AR(2) model with an intercept is fitted by ordinary least squares:

$${y}_{t}={\beta }_{0}(t)+{\beta }_{1}(t)* {y}_{t-1}+{\beta }_{2}(t)* {y}_{t-2}+{\epsilon }_{t}$$
(25)

where yt is the PM2.5 value at position t within the estimation window, and β0(t), β1(t), β2(t) are the hour-specific intercept and AR coefficients. The use of a sliding local window rather than a single global fit is what makes the model “time-varying”: the coefficients are re-estimated for each hour of the day. The forecast is then produced by recursive one-step-ahead prediction. Starting from the previous 2 values (last 2 h of the training data), each subsequent hour uses the AR coefficients estimated for that target hour and feeds its prediction back as a lag for the next step:

$$\widehat{c}(t)={\beta }_{0}(t)+{\beta }_{1}(t)* \widehat{c}(t-1)+{\beta }_{2}(t)* \widehat{c}(t-2)$$
(26)

Compared with persistence, TVAR exploits multi-day history and adapts its coefficients to the most recent regime, making it more responsive to trends and level shifts. However, like persistence, it operates independently at each node and does not incorporate spatial dependencies. It differs from the Fourier approach in that it models temporal structure through autoregressive lags.
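A simplified sketch of the windowed AR(2) fit and the recursive forecast of equations (25) and (26) is shown below. Several details are assumptions made for illustration: the training cycle is tiled three times so that a ±12-step window around the middle repetition is always available, and the function name is hypothetical.

```python
import numpy as np

def tvar_forecast(cycle, half_window=12):
    """Hour-specific AR(2) fits (Eq. 25) followed by recursive prediction (Eq. 26)."""
    n = len(cycle)
    # ASSUMPTION: periodic extension of the training cycle (three repetitions).
    tiled = np.tile(cycle, 3)
    betas = np.empty((n, 3))
    for h in range(n):
        center = n + h                                  # hour h, middle repetition
        idx = np.arange(center - half_window, center + half_window + 1)
        X = np.column_stack([np.ones(idx.size), tiled[idx - 1], tiled[idx - 2]])
        betas[h], *_ = np.linalg.lstsq(X, tiled[idx], rcond=None)

    # Start from the last two training values and feed predictions back as lags.
    lag1, lag2 = cycle[-1], cycle[-2]
    forecast = np.empty(n)
    for h in range(n):
        forecast[h] = betas[h, 0] + betas[h, 1] * lag1 + betas[h, 2] * lag2
        lag1, lag2 = forecast[h], lag1
    return forecast

# Synthetic sinusoidal diurnal cycle (not study data).
hours = np.arange(24)
cycle = 30.0 + 10.0 * np.sin(2 * np.pi * hours / 24)
forecast = tvar_forecast(cycle)
```

A pure sinusoid obeys an exact two-lag recurrence, so the forecast reproduces the cycle in this example; real profiles would not be fitted exactly, and recursive errors could accumulate over the 24 steps.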

Fourier-based forecasting uses a frequency-aware approach wherein a Fourier harmonic regression is applied. The imputed training data is regressed onto an intercept, a linear trend term, and K = 4 pairs of sine and cosine harmonics with periods of 24, 12, 8, and 4 h (here K denotes the number of harmonic pairs, not the convective transfer coefficient). The four harmonic pairs represent the fundamental diurnal cycle (24 h) together with three of its higher harmonics (12, 8, and 4 h). The forecast is then obtained by evaluating the fitted equation at the 24 time indices immediately following the training series. The Fourier regression produces a non-trivial forecast as long as the training profile exhibits any hourly variation, because the harmonics are inherently oscillatory. By retaining only four harmonic pairs, the method acts as a low-pass filter, smoothing out high-frequency noise while preserving the dominant diurnal shape.
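The harmonic regression described above can be sketched as an ordinary least-squares fit on a small design matrix. The function names and the synthetic training series are illustrative assumptions; only the regressors (intercept, linear trend, and sine and cosine pairs at periods of 24, 12, 8, and 4 h) follow the text.

```python
import numpy as np

PERIODS_H = (24.0, 12.0, 8.0, 4.0)   # periods of the four harmonic pairs

def harmonic_design(t):
    """Columns: intercept, linear trend, then a sin/cos pair per period."""
    cols = [np.ones_like(t), t]
    for period in PERIODS_H:
        cols.append(np.sin(2 * np.pi * t / period))
        cols.append(np.cos(2 * np.pi * t / period))
    return np.column_stack(cols)

def fourier_forecast(train, horizon=24):
    """Fit on the training series, then evaluate at the next `horizon` indices."""
    t_train = np.arange(len(train), dtype=float)
    coef, *_ = np.linalg.lstsq(harmonic_design(t_train), train, rcond=None)
    t_future = np.arange(len(train), len(train) + horizon, dtype=float)
    return harmonic_design(t_future) @ coef

# Synthetic three-day hourly series with a trend and two harmonics (not study data).
t = np.arange(72, dtype=float)
train = 20.0 + 0.01 * t + 5.0 * np.sin(2 * np.pi * t / 24) + 2.0 * np.cos(2 * np.pi * t / 8)
forecast = fourier_forecast(train)
```

Because the linear trend is extrapolated along with the harmonics, any drift fitted in the training window carries into the forecast.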

All three benchmarking methods treat each monitoring node independently. Thus, the forecast at one node is not influenced by observations or predictions at any other node. The ABM, by contrast, jointly models all 22 nodes through directed-edge flow parameters weighted by inter-node distances, enabling it to propagate spatial information across the monitoring network at every time step. The forecasting performance of the three time-series methods is compared to the spatio-temporal agent-based framework (this work) in the “Diurnal forecasting” subsection.

Identifying travel exposure at each agent

Each agent represents a finite space of pollutant concentration that changes with time. One of the possible utilities of this property is in determining ambient exposure during travel. We propose a method in this work that simplifies travel exposure estimation as a simple aggregate of pollution concentration across agents. If we consider a route along which an individual travels, the personal exposure along the route is given by equations (27)–(30):

$${E}_{i}={\kappa }_{E}* {P}_{i}* {T}_{i}$$
(27)
$${T}_{i}=\frac{{d}_{i}}{{s}_{i}}$$
(28)
$${E}_{T}=\frac{{\sum }_{i=1}^{24}{E}_{i}}{{\sum }_{i=1}^{24}{T}_{i}}$$
(29)
$${\kappa }_{E}=f(\,{\rm{Inhalation\; dose,\; mode\; of\; transport}})$$
(30)

where Ei is the exposure of an individual at agent i, Pi is the pollution concentration at agent i, Ti is the time spent within agent i, di is the distance covered within agent i, si is the speed at which the individual travels within agent i, ET is the total exposure along the route, normalized by the total travel time (equation (29)), and κE is the dimensionless exposure coefficient, which we define as a function of the intake efficiency of the individual and their mode of transport. Intake efficiency is the fraction of the ambient particulate matter concentration that is inhaled by the individual. These factors can be incorporated based on empirical values. For the sake of simplicity, κE in this simulated case is assumed to be unity.
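Equations (27)–(29) reduce to a travel-time-weighted average of concentration along the route, as the short sketch below shows. The function name and the two-agent route are hypothetical, and κE is set to unity as in the simulated case.

```python
import numpy as np

def route_exposure(pm, dist_km, speed_kmh, kappa_e=1.0):
    """Travel exposure along a route, following Eqs. (27)-(29)."""
    travel_time = dist_km / speed_kmh     # Eq. (28): T_i = d_i / s_i
    exposure = kappa_e * pm * travel_time # Eq. (27): E_i = kappa_E * P_i * T_i
    return exposure.sum() / travel_time.sum()   # Eq. (29)

# Hypothetical route through two agents (not study data).
pm = np.array([40.0, 80.0])           # PM2.5 concentration at each agent
dist_km = np.array([1.0, 1.0])        # distance covered within each agent
speed_kmh = np.array([20.0, 10.0])    # travel speed within each agent

e_total = route_exposure(pm, dist_km, speed_kmh)
```

The slower segment contributes twice the travel time, so the result sits closer to 80 than to 40. The same function could be evaluated over alternative routes to compare their exposure.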

Travel exposure has applications in determining the least-exposure route as described in the “Identifying the least-exposure route” section.