Agent-based framework for modeling hyperlocal urban air quality

Swaminathan, Sathish; Agrawal, Pranav; McNeill, V. Faye; Rengaswamy, Raghunathan

doi:10.1038/s44407-026-00073-6


Article
Open access
Published: 12 May 2026


Sathish Swaminathan^1,2,3,
Pranav Agrawal¹,
V. Faye McNeill^2,3,4 &
…
Raghunathan Rengaswamy¹

npj Clean Air volume 2, Article number: 32 (2026) Cite this article


Abstract

Urban air quality exhibits significant spatial and temporal heterogeneity at hyperlocal scales, necessitating advanced modeling paradigms that can bridge the gap between computationally intensive physics-based models and empirically driven statistical approaches. This paper introduces a novel agent-based modeling framework specifically designed for hyperlocal air quality assessment, capable of providing descriptive, predictive, and prescriptive analyses. The proposed framework discretizes urban environments into interacting agents, with pollutant dynamics governed by a parameterized mass balance that preserves fundamental physics while maintaining computational efficiency. The framework is demonstrated through a mobile monitoring case study in Chennai, India. Geospatial features are encoded into agent properties through physically interpretable parameters enabling attribution of pollution sources and transport pathways, thereby strengthening the framework’s descriptive capabilities. The approach successfully captures complex spatio-temporal pollution dynamics and describes pollution hotspots by attributing them to sourcing and transport influences. Predictive capabilities are demonstrated through spatio-temporal interpolation and forecasting. Spatio-temporal formulation of the framework enables it to outperform regular unidimensional methods. The discrete agent structure facilitates prescriptive applications, demonstrated here through the identification of the least-exposure route between locations. The unification of descriptive, predictive, and prescriptive capabilities within a single interpretable framework makes it a valuable tool for urban environmental management and real-time decision support systems.


Introduction

Urban air pollution poses unprecedented challenges to public health and environmental sustainability in the 21st century^1,2. With over 90% of the global population exposed to air pollution levels exceeding WHO guidelines³, the need for precise, actionable air quality assessment has never been more critical. Traditional monitoring approaches, while valuable for regulatory compliance, fall short of capturing the hyperlocal variability that characterizes urban air pollution landscapes^4,5.

The complexity of urban air quality arises from the intricate interplay of emission sources, meteorological conditions, and urban morphology, creating pollution gradients that can vary by orders of magnitude within single city blocks^6,7. This spatial heterogeneity, combined with dynamic temporal patterns, necessitates modeling approaches that can simultaneously address multiple scales and provide actionable insights for urban planning and public health protection^8,9.

Current air quality modeling paradigms broadly fall into three categories: physics-based models, data-driven approaches, and hybrid methodologies. Physics-based models, such as dispersion models and chemical transport models (CTMs), provide a mechanistic understanding but require substantial computational resources and detailed input data^10,11,12,13. Data-driven approaches, including land-use regression (LUR) and machine learning methods, offer computational efficiency but often lack physical interpretability and struggle with temporal dynamics^14,15,16,17. Hybrid approaches attempt to combine the strengths of both paradigms but frequently result in increased complexity without proportional gains in practical utility^18,19,20,21.

Despite significant advances in air quality modeling, a fundamental gap persists between the granular information needs of urban decision-makers and the capabilities of existing modeling frameworks^22,23. Decision-makers require tools that not only describe and predict air quality patterns but also prescribe actionable interventions²⁴. This necessitates a modeling paradigm that is simultaneously descriptive, predictive, and prescriptive. That is, it should be capable of identifying pollution sources, transport pathways, and exposure patterns, able to forecast air quality, and it should enable scenario analysis and optimization for policy intervention.

Agent-based modeling (ABM) emerges as a promising paradigm to address these requirements²⁵. Originally developed for complex systems analysis, ABM has found applications across diverse domains, including urban planning²⁶, epidemiology²⁷, and ecology²⁸. However, its application to urban air quality modeling remains largely unexplored, despite its inherent advantages for capturing emergent behaviors and spatial interactions.

Here, we introduce a novel agent-based modeling framework specifically designed for urban air quality at high spatio-temporal resolution. While classical ABM involves autonomous entities with rule-based decision-making, the agent-based paradigm more broadly encompasses three core principles: (1) a bottom-up approach where system-level patterns emerge from local interactions, (2) discrete entities that interact through defined rules, and (3) emergent behavior arising from individual entity actions²⁵. We adopt these principles to model urban air quality as pollution exchange among spatially distributed agents governed by physics-based interaction rules.

Our framework discretizes the study area into spatial agents (discrete computational units representing geographic regions) that exchange pollutants, governed by parameterized mass-balance equations on a directed graph network. Equation (1) describes the governing mass balance for pollutant concentration at each agent ‘i’, while equations (2) and (3) describe the outflow and inflow of pollutant. The mass balance is solved simultaneously across all agents to obtain a spatio-temporally resolved pollutant concentration map for the entire study area.

$$P_i(t) = P_i(t-1) + \mathrm{BP}_i(t) - P_{i\to}(t) + P_{i\leftarrow}(t)$$

(1)

where P_i(t) represents the pollutant concentration at agent i and time t, BP_i(t) denotes the net base pollution (sourcing and sinking attribute), and P_i→(t) and P_i←(t) represent outgoing and incoming pollutant fluxes, respectively. The inter-agent transport terms are formulated as:

$$P_{i\to}(t) = \sum_{j\in\mathcal{N}_i} K_{ij} D_{ij} P_i(t-1)$$

(2)

$$P_{i\leftarrow}(t) = \sum_{j\in\mathcal{N}_i} K_{ji} D_{ji} P_j(t-1)$$

(3)

$$D_{ij} = \frac{w_{ij}}{\sum_{k\in\mathcal{N}_i} w_{ik}}$$

(4)

$$D_{ji} = \frac{w_{ji}}{\sum_{k\in\mathcal{N}_j} w_{jk}}$$

(5)

where D_ij and D_ji are distance-based parameters that set the fraction of pollutant distributed to each neighbor as a function of distance, $\mathcal{N}_i$ is the neighborhood set of agent i, K_ij is the convective transfer coefficient, and w_ij = 1/d_ij is an inverse-distance weight, with d_ij being the Euclidean distance between agents i and j.
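As an illustration, one time step of Eqs. (1)–(5) can be written in a few lines of NumPy. This is a minimal sketch rather than the authors' implementation: the three-agent chain, the distances, the uniform K value, and the BP values are hypothetical, the neighborhood is assumed symmetric, and the function names are introduced here for illustration only.

```python
import numpy as np

def distance_weights(adj, dist):
    """D[i, j] = w_ij / sum_k w_ik with w_ij = 1/d_ij over the neighbors of i (Eqs. 4-5)."""
    mask = adj.astype(bool)
    w = np.zeros(dist.shape, dtype=float)
    w[mask] = 1.0 / dist[mask]          # inverse-distance weight, neighbors only
    row_sum = w.sum(axis=1, keepdims=True)
    return np.divide(w, row_sum, out=np.zeros_like(w), where=row_sum > 0)

def step(P_prev, BP, K, D):
    """Advance all agents by one time step using the mass balance of Eq. (1)."""
    T = K * D                            # T[i, j]: fraction of P_i(t-1) sent from i to j
    outflow = T.sum(axis=1) * P_prev     # Eq. (2)
    inflow = T.T @ P_prev                # Eq. (3): sum_j K_ji D_ji P_j(t-1)
    return P_prev + BP - outflow + inflow

# Hypothetical chain of three agents: 0 - 1 - 2 (toy values, not from the study)
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]])
dist = np.array([[0.0, 250.0, 500.0],
                 [250.0, 0.0, 250.0],
                 [500.0, 250.0, 0.0]])   # metres
K = np.full((3, 3), 0.1)                 # assumed uniform transfer coefficient
D = distance_weights(adj, dist)

P_prev = np.array([40.0, 60.0, 50.0])    # PM2.5 at t-1
BP = np.array([2.0, -1.0, 0.0])          # net source (+) or sink (-) at t
P_new = step(P_prev, BP, K, D)
```

In this toy case, the transport terms only move pollutant between agents, so the change in the total across agents equals the sum of BP. That is the mass-balance property the framework relies on.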

Implementation of this model requires defining the agent network, parameterization of BP and K, followed by the estimation of these parameters. These model parameters incorporate real-world geospatial features, enabling the integration of urban characteristics into the model’s knowledge base. Additional details of the model and parameterization strategy can be found in the “Methods” section.

Our approach balances physical interpretability with data-driven learning to create a computationally efficient platform for assessing complex urban pollution dynamics. We demonstrate the framework’s viability across three application domains, viz., descriptive, predictive, and prescriptive analysis, using high-resolution data from a mobile monitoring campaign in Chennai, India.

Results

We demonstrate the proposed approach using data from a mobile monitoring campaign, which took place in Chennai, India, over a 37-day period across April and May of 2019. The study area is divided into 22 nodes (agents) (Fig. 1). Real-world features including traffic data, land-use data, emissions estimates, and weather information (wind speed, temperature, and relative humidity) are used to parameterize the sourcing effect (BP) and the transport effect (K), allowing the model to capture the pollution dynamics of the region under study as well as fundamental physics in the form of the mass balance.

Descriptive capabilities: understanding pollution dynamics

The framework’s descriptive power emerges through the decomposition of pollution variations into source contributions and transport effects. The factor responsible for a change in the pollution level can be determined by analyzing BP and K. BP, which represents the various sources and sinks, is associated with different anthropogenic activities. If the analysis suggests that BP is responsible for the change in the pollution level, the related anthropogenic activities are further analyzed and validated by assessing the influence of real-world data such as vehicular emissions and land-use data. Figure 2b shows the correlation between PM_2.5 concentration and the average vehicular count at different agents. The study region comprises three road types, as shown in Fig. 2c. Agents 1–5, which contain road type 1 (expressways), show low correlation between PM_2.5 and vehicular count. This could be attributed to two factors: (1) industrial emissions could be a significant contributor, since these locations are in the industrial zone, as seen in Fig. 2d, and (2) the expressway has minimal stagnation of vehicles and more open area than interior roads. Agents in commercial and residential areas show higher correlation with traffic, indicating a higher contribution from vehicular sources. Taken together, the plots in Fig. 2 provide descriptive insights into the association between PM_2.5, traffic, and land-use.

**Fig. 2: Relationship between PM_2.5, BP, traffic, and land-use.**

Furthermore, concurrent analysis of BP and K can shed light on the dynamics of pollutant generation across agents.

Interplay between BP and K

In urban regions, localized anthropogenic factors contribute to the sourcing and sinking ability of each agent. These factors either add pollutant to the agent or remove it. This effect is captured through BP. K, on the other hand, ensures spatial continuity across the region and captures the effect of pollutant transport. Figure 3a shows the change in PM_2.5 concentration from 7 AM to 8 AM, while the arrows depict the direction and magnitude of net transport across agents in that time step. Figure 3b shows the value of BP at each agent at the time step corresponding to 8 AM. There is a large increase in the measured PM_2.5 concentration at 8 AM at agent 19. The BP value at agent 19 indicates the addition of PM_2.5 at this location. Furthermore, neighboring agents also appear to transport PM_2.5 to agent 19, as seen in Fig. 3a. Locations 19 and 20 are school zones. At 8 AM, coinciding with school start times²⁹, high PM_2.5 concentrations are observed alongside elevated traffic volumes, as indicated by positive BP values. The model’s BP term suggests these factors are correlated, consistent with traffic-related emissions as a contributing source.

**Fig. 3: Understanding variation in PM_2.5 concentration across agents at 8 AM in terms of transport (K) and sourcing (BP) effects.**

There are also cases where the net increase in concentration could be attributed to transport rather than BP. For instance, agent 1 sinks PM_2.5 at 8 AM; however, there is a net increase in absolute concentration due to a significant influx from agent 2 to 1 as seen in Fig. 3a.

Dominance of sourcing vs transport effect at each agent

Total PM_2.5 contribution is affected by transport as well as localized addition or removal. For certain agents, the change in concentration is predominantly driven by convective transport, while for others it is driven by BP. Two metrics, the BP Ratio (BPR) and the Convection Ratio (CR), are introduced to determine the relative influence of BP and K, respectively, on the overall PM_2.5 concentration at each agent. The ratios are calculated for each agent as discussed in the “Identifying the relative dominance of BP and K on the agent” subsection under the “Methods” section and are depicted in Fig. 4. The value at the top of each box is the BPR, and the value at the bottom is the CR. Based on the chosen threshold, agents 1, 2, 9, 11, 13, 14, 15, and 20 are identified as BP dominant, while agents 4, 5, 6, 7, 8, 12, 18, 19, and 22 are K dominant. Since the difference between the ratios is not significant for the remaining 5 agents, they are classified as agents with equal contribution from BP and K.

**Fig. 4: Dominant influence for each agent based on BPR (top value in the box) and CR (bottom value in the box).**

Through this analysis, it is possible to ascertain whether the pollutant concentration at a given agent is influenced by local effects or neighboring agents.

Figure 5a and b show the temporal decomposition of total PM_2.5 concentration into its constituents, BP_i, P_i→, and P_i←, for agents 1 and 8, respectively. The P_i curve for agent 1, which is identified as BP dominant, follows BP_i more closely than P_i→ or P_i←. For agent 8, however, BP_i has relatively lower influence on the P_i curve.

**Fig. 5: Temporal decomposition of pollutant concentration for BP Dominant and K Dominant agents.**

Predictive capabilities: Imputation and Forecasting

The agent-based framework can be used as a spatio-temporal predictive model. Conventional methods for prediction consider either the spatial or the temporal dimension alone while making predictions. However, the dynamics of air quality are affected in space and time in a coupled manner. The framework can be used in two modes: (1) as an imputation model to predict the missing values in the data and (2) as a forecasting model to predict PM_2.5 values outside the observed period. The performance of the model in predictive analysis is discussed below.

Data imputation

The agent-based framework can be used to impute missing values, a common issue in environmental monitoring. The model is compared with other imputation methods: mean imputation, linear interpolation, cubic spline interpolation, and kriging. Details on the baseline methods are provided in the “Benchmarking spatio-temporal imputation capability” subsection. A Monte Carlo cross-validation was performed over 30 tests. In each test, 50% of the measurements are randomly removed from the original dataset, the model is trained on the remaining measurements, and the trained model is then used to predict the removed values. Table 1 lists the RMSE value for the various models.
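The Monte Carlo protocol described above can be sketched as follows. This is a minimal illustration on a synthetic sinusoidal diurnal series, not the Chennai data, and it covers only two of the simpler baselines (mean imputation and linear interpolation). The function names, the random seed, and the series itself are assumptions made for the example.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-square error between two equally sized sequences."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def mean_impute(values, missing):
    """Fill missing entries with the mean of the observed entries."""
    filled = values.copy()
    filled[missing] = values[~missing].mean()
    return filled

def linear_impute(values, missing):
    """Fill missing entries by linear interpolation over the time index."""
    t = np.arange(values.size)
    filled = values.copy()
    filled[missing] = np.interp(t[missing], t[~missing], values[~missing])
    return filled

def monte_carlo_cv(values, imputer, n_tests=30, frac_missing=0.5, seed=0):
    """Average RMSE on randomly removed entries over n_tests repetitions."""
    rng = np.random.default_rng(seed)
    n_missing = int(round(frac_missing * values.size))
    scores = []
    for _ in range(n_tests):
        missing = np.zeros(values.size, dtype=bool)
        missing[rng.choice(values.size, size=n_missing, replace=False)] = True
        filled = imputer(values, missing)
        # Score only the entries that were held out in this test
        scores.append(rmse(values[missing], filled[missing]))
    return float(np.mean(scores))

# Synthetic 24-hour profile (assumption, for illustration only)
hours = np.arange(24)
series = 50.0 + 15.0 * np.sin(2.0 * np.pi * hours / 24.0)

rmse_mean = monte_carlo_cv(series, mean_impute)
rmse_linear = monte_carlo_cv(series, linear_impute)
```

Any other imputer, including a spatio-temporal model, can be scored under the same protocol by passing a function with the same `(values, missing)` signature.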

Table 1 Comparison between various imputation models for 30 Monte Carlo tests


Mean imputation, linear, and cubic interpolation have considerably higher RMSE values. The proposed agent-based model performs better compared to the models evaluated in this work. Figure 6 compares the temporal profiles of observed PM_2.5 and the profiles predicted by the agent-based model for three agents chosen randomly.

Diurnal forecasting

In this mode, we demonstrate the ability of the model to forecast the full diurnal profile across all the agents. The data collection campaign ran for 37 days with a single mobile monitor. To emulate a forecasting setting, we treat the first 30 days as the available record and test the efficacy of the model in forecasting the diurnal PM_2.5 profile of the remaining 7 days. Due to a fundamental limitation in the dataset (described in the “Benchmarking diurnal forecasting capability” subsection), we need to aggregate data from the last 7 days of the monitoring campaign to obtain an average diurnal hourly profile to test the forecasting ability. We also compare the agent-based model with three different time-series-based forecasting models. While the dataset is the same for all the methods (including the agent-based method), the training data for each model is processed differently because of the unique requirements of each method. Detailed implementation of each baseline forecasting method is described in the “Benchmarking diurnal forecasting capability” subsection.

The agent-based forecasting model is trained with the data from the first 30 days of the data collection campaign. To evaluate the predictions, data from the final 7 days were aggregated into a 24-h diurnal profile. Spatio-temporal kriging was employed to impute missing observations in both the training and test datasets. While the time-series models are trained on the temporal profile of each node individually, the agent-based model gets trained across all the nodes simultaneously. This allows the model to capture spatial association in addition to temporal information. This ability to capture spatial trends, along with the incorporation of real-world features, allows the agent-based approach to outperform the time-series methods as seen in Table 2. This table compares the average RMSE, MAE, and Pearson correlation across all nodes for each forecasting model. The ABM achieves the lowest prediction error and the highest correlation with observations. Predictions from the various forecasting methods for select nodes are visualized in Fig. 7, with circular markers indicating observations imputed via kriging.
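For reference, the three node-averaged metrics reported in Table 2 can be computed as in the sketch below. The arrays are hypothetical toy values with two nodes and four hours, not results from the study, and `forecast_metrics` is a name introduced here for illustration.

```python
import numpy as np

def forecast_metrics(obs, pred):
    """Per-node RMSE, MAE, and Pearson r, averaged across nodes.

    obs, pred: arrays of shape (n_nodes, n_hours).
    """
    obs = np.asarray(obs, dtype=float)
    pred = np.asarray(pred, dtype=float)
    err = pred - obs
    rmse = np.sqrt((err ** 2).mean(axis=1))   # one value per node
    mae = np.abs(err).mean(axis=1)            # one value per node
    r = np.array([np.corrcoef(o, p)[0, 1] for o, p in zip(obs, pred)])
    return {
        "rmse": float(rmse.mean()),
        "mae": float(mae.mean()),
        "pearson": float(r.mean()),
    }

# Hypothetical observed and forecast profiles (toy values)
obs = np.array([[10.0, 20.0, 30.0, 40.0],
                [5.0, 7.0, 9.0, 11.0]])
pred = np.array([[12.0, 18.0, 33.0, 39.0],
                 [6.0, 8.0, 10.0, 12.0]])
metrics = forecast_metrics(obs, pred)
```

The second toy node has a constant bias of one unit, so its Pearson correlation is essentially 1 while its RMSE and MAE are not zero. This is one reason to report error and correlation metrics together.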

**Fig. 7: Comparing diurnal PM_2.5 predictions from different forecasting models.**

Table 2 Performance metrics of different forecasting models


While the forecast shows the aggregate diurnal profile, day-to-day trends are masked in the forecast due to aggregation. Rather than day-to-day forecasts over 7 days, this forecast represents a “typical day” composite. To demonstrate multi-day forecast capabilities, continuous observations are required at each agent, which could be obtained with a static monitoring paradigm. Since this dataset spans a little over one month, the ability of the model to capture seasonal behavior is not demonstrated in this work.

Prescriptive capabilities: Applications in policy and intervention design

Urban air quality improvement requires multi-stage intervention strategies at both local and regional scales³⁰. Recent work has demonstrated the utility of linking urban form indicators with air quality modeling tools to optimize intervention design at the locality level³¹. Given its detailed descriptive and predictive capabilities, the proposed agent-based framework naturally extends to prescriptive applications, enabling quantitative evaluation of intervention scenarios and optimization of mitigation strategies.

We discuss two prescriptive applications: (1) identification of minimum-exposure routes for personal mobility, and (2) source attribution for targeted intervention design.

Identifying the least-exposure route

Traditional routing algorithms optimize for travel time or distance. However, personal exposure to air pollution during travel depends on both pollutant concentration and travel duration, with substantial spatial and temporal variability³². We therefore propose a method to identify the least-exposure route. The agent-based framework simplifies the least-exposure route problem by reducing it to an aggregation over a discrete network of agents. A detailed mathematical formulation for computing travel exposure is provided in the “Identifying travel exposure at each agent” subsection.

Applying Dijkstra’s algorithm³³ with the travel exposure E_T as the cost function, we identified the minimum-exposure route between agents 5 and 12. Figure 8 compares this route with the time-optimal route. The minimum-exposure (cleanest) route is 37% longer in distance and requires 45% more travel time, yet reduces total exposure by 14% (Table 3). This reduction occurs because the time-optimal route traverses high-concentration zones, particularly high-traffic corridors where PM_2.5 concentrations are elevated.
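The routing step can be illustrated with a small, self-contained sketch of Dijkstra's algorithm in which the edge cost is swappable. The exposure cost used here, the mean concentration of the two end agents multiplied by the travel time, is an assumed stand-in for E_T, whose actual formulation is given in the Methods subsection cited above. The four-node graph, travel times, and concentrations are hypothetical.

```python
import heapq

def dijkstra(graph, source, target, edge_cost):
    """Return (path, total_cost) minimizing the sum of edge_cost(u, v, minutes)."""
    best = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        cost, u = heapq.heappop(heap)
        if u == target:
            break
        if cost > best.get(u, float("inf")):
            continue  # stale heap entry
        for v, minutes in graph[u].items():
            new_cost = cost + edge_cost(u, v, minutes)
            if new_cost < best.get(v, float("inf")):
                best[v] = new_cost
                prev[v] = u
                heapq.heappush(heap, (new_cost, v))
    # Walk back from target to source to recover the route
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1], best[target]

# Hypothetical network: travel time in minutes on each edge
graph = {
    "A": {"B": 5.0, "C": 8.0},
    "B": {"A": 5.0, "D": 5.0},
    "C": {"A": 8.0, "D": 8.0},
    "D": {"B": 5.0, "C": 8.0},
}
# Hypothetical PM2.5 concentration at each agent (ug/m3)
conc = {"A": 40.0, "B": 100.0, "C": 30.0, "D": 40.0}

def time_cost(u, v, minutes):
    return minutes

def exposure_cost(u, v, minutes):
    # Assumed proxy for E_T: mean concentration along the edge times travel time
    return 0.5 * (conc[u] + conc[v]) * minutes

fastest = dijkstra(graph, "A", "D", time_cost)
cleanest = dijkstra(graph, "A", "D", exposure_cost)
```

In this toy network the fastest route passes through the high-concentration agent B, while the least-exposure route takes the slower detour through C. This is the same qualitative trade-off reported in Table 3.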

**Fig. 8: Fastest route vs least-exposure route (cleanest route) from agent 5 to agent 12.**

Table 3 Comparing the fastest and cleanest route


Source attribution and intervention targeting

The framework enables decomposition of pollution contributions by source category, facilitating targeted intervention design. For instance, identification of elevated concentrations near schools (e.g., Agent 19) during peak traffic hours (Fig. 3b) coupled with the high correlation between PM_2.5, BP, and traffic (Fig. 2), suggests temporal traffic management measures such as restricted entry zones or staggered school schedules.

The parameters (BP and K) encode relationships between urban form features and pollutant transport. By formulating parameter sensitivity as a feature importance problem, the relative impact of different urban form features (traffic, land-use, building density, vegetation coverage) on air quality can be quantified. This information directly informs intervention prioritization, as demonstrated by Li et al.³¹. The discrete, linear structure of the agent-based system facilitates integration with standard optimization algorithms such as linear programming, gradient descent, and genetic algorithms for automated intervention design and policy evaluation.

Discussion

We have presented a highly interpretable agent-based modeling framework for the analysis and prediction of urban air quality. Through the case study undertaken in this work, we have demonstrated the capability of this framework to assimilate mobile monitoring data and deliver diurnal insights. Separation of parameters into generation and transport effects allows the model to describe local pollution dynamics. It is also possible to quantify the influence of local sources versus neighborhood transport on pollution at a given location. Unlike conventional temporal-only or spatial-only imputation methods, this framework integrates space-time dynamics for both interpolation and forecasting applications, as we have demonstrated through validation exercises.

Beyond describing the process and making predictions, the framework enables prescriptive applications. The agent-based architecture simplifies the simulation and design of interventions, as demonstrated through the identification of the least-exposure route between locations.

The prescriptive capabilities of the framework we have presented here extend to applications in real-time air quality management, urban development impact assessment, transportation corridor optimization, and green infrastructure planning. The discrete agent-based structure facilitates integration with optimal control algorithms, enabling automated intervention design and policy optimization. Future work will explore advanced deep learning methods and control strategies to fully leverage the framework’s prescriptive potential for sustainable urban development.

The agent-based architecture naturally accommodates extensions. While the current work does not explicitly capture complex chemical transformations, they can be embedded as agent properties, enabling a comprehensive assessment of secondary pollutant formation and multi-species interactions.

The model is trained on a limited dataset from a single urban location. Validation across diverse urban environments with varying morphologies, emission profiles, and meteorological conditions is necessary to establish generalizability. Typically, meteorological interactions and chemical transformation introduce non-linearity in pollution sourcing (BP), whereas street canyon effect and atmospheric turbulence can introduce non-linearity in transport (K). While the governing mass-balance and the parameterization capture only linear relationships in this work, non-linearity could be introduced in the model through the parameterization of BP and K as a function of real-world features. The scope of the current work is a single criterion pollutant over a short time duration and over a relatively small area; hence, a linear model is considered a reasonable approximation. Advanced parameterization techniques, including deep-learning approaches capable of representing non-linear relationships, warrant investigation to enhance model robustness. Furthermore, the incorporation of additional features in the parameterization of BP and K could add richness to the model’s abilities.

The spatial resolution of the agent network presents a computational trade-off: increasing agent density improves spatial resolution but increases computational demand, while decreasing the number of agents would compromise detail. Currently, agent distribution across the modeling domain is governed by the spatial coverage of available data. Systematic investigation of optimal agent number and placement strategies based on spatial autocorrelation of urban features (e.g., road networks, emission sources, building geometry) represents a key research direction. Although this work focuses only on diurnal modeling, the framework is inherently scalable. Extending to different spatial and temporal scales and resolutions is an important direction of study.

This work presents a framework for explainable hyperlocal air quality modeling with descriptive, predictive, and prescriptive capabilities. The key idea of the framework is to discretize the study area into a number of heterogeneous agents, characterized by real-world features, which describe the air quality as a series of exchanges of pollutants among one another. While this work presents a simplified, linear model, tested within a limited domain, there is significant scope to develop the framework in multiple directions to improve robustness and accuracy.

As cities worldwide address air quality challenges, this agent-based approach offers a pathway from developing a hyperlocal understanding to designing scientifically grounded environmental management strategies.

Methods

Agent-based modeling framework

The proposed agent-based framework conceptualizes urban environments as a collection of interacting spatial entities, called agents, which exchange pollutants according to predetermined rules. This approach draws inspiration from cellular automata and agent-based systems, while maintaining adherence to fundamental conservation laws (in the present case, mass balance)^34,35.

The system of agents is represented in the form of a directed graph network, with each node being an agent and the edges depicting the transfer pathway for pollutants. Each agent embodies a discrete spatial unit characterized by the geometry (area, connectivity, and distance from other agents), dynamic attributes (meteorology, traffic, etc.), static attributes (land-use, tree cover, water bodies, etc.), and the state (pollutant concentration). These features serve as the properties of each agent, influencing its behavior within the system. The objective of each agent is to update its state (pollutant concentration) through a series of exchanges across other agents as the system evolves over time. The rules of the exchange are governed by a parameterized mass balance. The parameters of the mass balance are derived from the characteristic features defined above, which embody the behavior of each agent. The generation and transport of pollutants across the agents is tracked over discrete time steps as the system evolves.

An agent is capable of increasing or decreasing pollutant levels at a given time. Pollutants may be generated at the agent through local sources such as vehicular traffic, industrial emissions, etc. An agent may also act as a sink for pollutants and remove them from the system. This could be attributed to pollutant deposition on surfaces such as trees³⁶ or water bodies, or removal by any other means within the bounds of the agent. Pollutants removed in this manner leave the system and are not accounted for in the subsequent time step.

Pollutant transfer across agents is primarily convective and depends on the local convective forces, such as wind, vehicular motion, and other meteorological factors. In this work, it is assumed that transfer is only across nodes that are directly connected in the graph network.

An agent-centric perspective allows the framework to embody properties such as spatial heterogeneity, emergent behavior, and individual assessment. Since each agent captures unique local characteristics (land-use, traffic, meteorology), it allows for spatial heterogeneity to be embedded in the framework. Regional pollution patterns emerge from interactions among discrete agents that are not explicitly programmed into the agents, thus exhibiting emergent behavior. Agent-level interventions can be evaluated independently, thus enabling individual assessment. For instance, BP and K parameters have physical meaning at an agent scale.

Data collection

This work aims to establish an agent-based framework for modeling high-resolution spatio-temporal variations in air quality. Recognizing the limitations of sparse fixed monitoring networks, we employed a mobile monitoring paradigm using IoT-enabled sensors deployed on a vehicle traversing a predefined route³². The sensor measures PM_2.5, temperature, and relative humidity; see ref. ³² for details of the sensor package and the study. In this study, a monitoring vehicle covered a 25-km-long study route, depicted in Fig. 9, covering an effective area of 13 km². The route was chosen such that the vehicle could capture various land profiles, such as residential, industrial, commercial, and schools. The sensor package was fitted on the top of the vehicle, which was operated continuously over a 37-day period between April and May of 2019. The 25 km route was discretized into 100 locations, each being 250 meters apart. Samples were collected at each location for 3 min. The study was designed so that data points are available at all 100 locations for every hour of the day (although, since mobile monitoring necessitates sequential sampling, generally not on the same day). This approach enabled high spatio-temporal resolution data collection, essential for the model-building process.

**Fig. 9: Route covered by the vehicle during the pilot study, highlighting key urban zones including residential, industrial, commercial, and school zones.**

Data preprocessing and agent configuration

The original 100 data collection locations (represented by blue circles in Fig. 1) were spatially downsampled to a final set of representative nodes (depicted as orange circles). These nodes represent the agents, and the two terms are used interchangeably in this work. This reduction to approximately one-fifth of the original spatial resolution was implemented to enhance the signal-to-noise ratio and increase the data density at each agent. The choice of the number of agents and their positioning is an important consideration in this framework. A large number of agents would increase the number of variables to be estimated in the mathematical model and introduce unnecessary computational complexity, whereas a small number of agents would compromise spatial continuity and resolution. In this work, the number and positioning of the agents were determined heuristically to achieve a balance between spatial granularity and statistical robustness. While a mathematically optimal configuration involving multi-objective optimization (e.g., land-use homogeneity, pollutant profile similarity, or end-user resolution requirements) is reserved for future study, the current selection of 22 nodes prioritizes spatial continuity and a temporally equitable distribution of data. For a comprehensive description of the selection and aggregation logic, refer to the “Methodology for representative node selection and data aggregation” section in Supplementary Material. To delineate the spatial catchment of each agent, Voronoi diagrams were employed to define their boundaries^37,38. Temporally, PM_2.5 concentration within each agent was aggregated to an hourly resolution over a 24-h cycle. This configuration allows the model to effectively capture the characteristic diurnal behavior of the local atmospheric environment.
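The catchment and aggregation steps can be sketched as below. A point lies in the Voronoi cell of the node it is closest to, so nearest-node assignment reproduces Voronoi membership without building the diagram explicitly. The sketch assumes planar coordinates in metres and uses a hypothetical two-node layout. The function names and sample values are illustrative and do not come from the study.

```python
import numpy as np

def assign_to_agents(locations, nodes):
    """Index of the nearest node for each location (Voronoi cell membership)."""
    d = np.linalg.norm(locations[:, None, :] - nodes[None, :, :], axis=2)
    return d.argmin(axis=1)

def hourly_profile(agent_ids, hours, pm25, n_agents):
    """Mean PM2.5 per agent and hour of day; NaN where no sample exists."""
    total = np.zeros((n_agents, 24))
    count = np.zeros((n_agents, 24))
    np.add.at(total, (agent_ids, hours), pm25)   # accumulate repeated indices
    np.add.at(count, (agent_ids, hours), 1)
    return np.divide(total, count, out=np.full_like(total, np.nan), where=count > 0)

# Hypothetical layout: two representative nodes, three sampling locations
nodes = np.array([[0.0, 0.0], [1000.0, 0.0]])
locations = np.array([[100.0, 0.0], [400.0, 50.0], [900.0, 0.0]])
loc_agent = assign_to_agents(locations, nodes)

# Hypothetical samples: (location index, hour of day, PM2.5)
sample_loc = np.array([0, 1, 2, 0])
sample_hour = np.array([8, 8, 8, 9])
sample_pm25 = np.array([40.0, 60.0, 55.0, 30.0])

profile = hourly_profile(loc_agent[sample_loc], sample_hour, sample_pm25, n_agents=2)
```

Cells of `profile` that remain NaN correspond to the spatio-temporal gaps that arise when a single mobile monitor samples sequentially.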

Since data were collected with a single mobile monitor, there are spatio-temporal gaps in the original dataset. While temporal aggregation gives average diurnal trends, day-to-day variability gets masked. For this particular study, since the data collection was performed over the course of one month in summer, the influence of meteorology (particularly seasonal meteorology) was minimal. Other than local short-term sourcing, the variability in the dataset is not significant. The dataset lacks seasonality and has stationary statistical properties. Detailed analysis of stationarity of the dataset is available in the “Stationarity” section in Supplementary Material.

Defining agent adjacency

Once the number and position of the agents have been identified, the next step in the framework is to determine the mode of agent interaction. In this study, we assume an immediate neighbor interaction mode. Here, pollution exchange happens only between agents that are immediate neighbors. Agents are considered immediate neighbors only if they meet the following criteria:

Unobstructed line-of-sight: Pollution exchange happens only at street level between agents that have direct line-of-sight with each other. Line-of-sight is determined using OpenStreetMap building footprints. Agents separated by major structures (e.g., trees, building complexes) are considered not connected.

Sequential adjacency: Connectivity is non-transitive. Even if multiple agents share a single unobstructed line-of-sight, they are not considered ‘immediate’ neighbors. Interaction is limited to the immediate predecessor and successor. This prevents “skipping” over intermediate agents.

Contiguous data record: In order to ensure physical continuity of data points, only those nodes between which measurements are recorded along the collection route are considered interacting.

For example, in Fig. 1, agent 21 is a neighbor to agents 18 and 2. However, there is no interaction between them since they do not have an ‘unobstructed’ line-of-sight or contiguous data record between them. Similarly, 21 is not adjacent to 7 even though they have line-of-sight and contiguous data records because agent 8 is in between, breaking sequential adjacency. While other, more complex modes of interaction can also be defined, this paper only considers immediate neighbor interaction. Changing the mode of interaction would impact the system complexity and the interpretation of the results.

The inter-agent interaction is defined in terms of the adjacency matrix ${\bf{A}}\in {{\mathbb{R}}}^{n\times n}$ (where n denotes the number of nodes/agents). The (i, j) entry of the adjacency matrix A is defined as

$${A}_{i,j}=\left\{\begin{array}{l}1\,\,{\mathrm{if}}\,{\mathrm{node}}\,i\,{\mathrm{interacts}}\; {\mathrm{with}}\;{\mathrm{node}}\,j\\ 0\,\,{\mathrm{if}}\; {\mathrm{there}}\; {\mathrm{is}}\; {\mathrm{no}}\; {\mathrm{interaction}}\; {\mathrm{between}}\; {\mathrm{node}}\,i\,{\mathrm{and}}\; {\mathrm{node}}\,j\end{array}\right.$$

(6)

The distance coefficient matrix ${\bf{D}}\in {{\mathbb{R}}}^{n\times n}$ has (i, j) entry

$${D}_{ij}=\left\{\begin{array}{ll}\frac{{w}_{ij}}{{\sum }_{k\in {{\mathcal{N}}}_{i}}{w}_{ik}} & \,{\rm{if}}\,{A}_{i,j}=1\\ 0 & \,{\rm{if}}\,{A}_{i,j}=0\end{array}\right.$$

(7)

and is populated based on the adjacency matrix: its elements are non-zero only for pairs of nodes that are ‘adjacent’. The number of non-zero entries in the adjacency matrix determines the number of K_ij to be estimated. The final adjacency matrix has 54 inter-nodal interactions, each with a convective transfer coefficient that needs to be estimated. Figure S1 shows a directed graph depicting the inter-agent interactions.
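The construction of D in equation (7) can be sketched in Python as a row normalization of the weights of adjacent agents. The weights w_ij are taken here to be inverse inter-agent distances purely as an illustrative assumption, and the three-agent chain and all names are hypothetical.

```python
import numpy as np

def distance_coefficients(A, w):
    """Equation (7): D_ij = w_ij / sum_k w_ik over neighbours k of i, else 0."""
    masked = np.where(A == 1, w, 0.0)
    row_sum = masked.sum(axis=1, keepdims=True)
    # Rows without any neighbour are left as zeros instead of dividing by zero.
    return np.divide(masked, row_sum, out=np.zeros_like(masked), where=row_sum > 0)

# Hypothetical chain of three agents: 0 - 1 - 2 (0 and 2 are not adjacent).
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
dist = np.array([[0.0, 100.0, 300.0],
                 [100.0, 0.0, 200.0],
                 [300.0, 200.0, 0.0]])
with np.errstate(divide="ignore"):
    w = np.where(dist > 0, 1.0 / dist, 0.0)   # assumption: inverse-distance weights
D = distance_coefficients(A, w)
n_transfer_coefficients = int(A.sum())        # one K_ij per non-zero entry of A
```

Each row of D sums to one for any agent with at least one neighbour, so the coefficients act as a normalized split across the neighbours of agent i.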

Parameterization strategy for BP

Base pollution (BP) represents the amount of pollutant added or removed at each agent. The primary local sources for PM_2.5 in Indian cities include vehicular emission³⁹, industrial emissions, and anthropogenic activities in residential and commercial zones. The potential sinks are deposition onto surfaces, particularly to water bodies or foliage³⁶. Thus, BP should be characterized in terms of land-use features that capture these sourcing and sinking effects.

High spatial-temporal vehicular emission information was not available for this area. Vehicle count is used as a surrogate for the contribution of vehicular emissions. The data is collected from the TomTom website⁴⁰, which gives the average vehicular count for a given segment of a road network. Hourly vehicular count information was collected for a typical day during the measurement period. The service also classifies roads into seven functional classes, described in Table 4. These seven classes are grouped into three road types by merging classes with similar functional roles. Figure 2c shows the three road types in the study area that are considered in the parameterization.

Table 4 Road classification system for traffic emission characterization


The study area, although geographically compact, has a diverse land-use profile, with residential, industrial, and commercial areas. Figure 2d shows the diverse land-use profile across the study route. The land-use data for Chennai is available on the Chennai Metropolitan Development Authority (CMDA) website⁴¹. Residential and industrial emission information is not readily available for this area. Hence, the fractional area of each land-use type is taken as a proxy for sectoral contribution. For the study area, the unique land-use categories are: Primary Residential, Mixed Residential, Industrial, Commercial, Institutional, and Water Body. Since the mixed residential and institutional areas are minimal, they are grouped with commercial land-use to reduce the number of categories considered. The resulting four land-use categories are: Residential, Industrial, Commercial, and Water Body. For each agent, the fractional area of each land-use category is calculated and used as a predictor variable.

Finally, BP is parameterized in terms of land-use and vehicular count. Hence, BP at agent i can be written as:

$$\begin{array}{rcl}B{P}_{i}(t)={a}_{1}(t){R}_{i}+{a}_{2}(t){I}_{i}+{a}_{3}(t){C}_{i} & & +{a}_{4}(t)W{b}_{i}\\ & & +{a}_{5}(t)R{1}_{i}+{a}_{6}(t)R{2}_{i}+{a}_{7}(t)R{3}_{i}\end{array}$$

(8)

where R_i, I_i, C_i, and Wb_i represent the fractional area of residential, industrial, commercial, and water-body land-use at a given agent i. R1_i, R2_i, and R3_i represent the traffic count in road types 1, 2, and 3, respectively, at agent i. This linear parameterization strategy for real-world features is borrowed from conventional land-use regression models⁴².

Parameterization strategy for K

PM_2.5 can be transported from one agent to another in two ways: convection or diffusion. It is assumed that diffusion effects are negligible over the inter-agent distances considered in this study, and therefore convective transport is the dominant effect contributing to the movement of PM_2.5 between agents. Wind speed, temperature, and humidity affect the convection of PM_2.5. Hence, K is expressed as a function of these factors. Transfer across agents depends on the relative position of the agents, the wind speed, and the wind direction.

K_ij denotes the convective transfer coefficient of agent i with respect to agent j. K_ij is parameterized in terms of temperature, relative humidity, wind speed, and wind direction as:

$${K}_{ij}(t)={b}_{1}(t){T}_{i}(t)+{b}_{2}(t){H}_{i}(t)+{b}_{3}(t)W{S}_{i}(t)* {\phi }_{ij}(t)$$

(9)

In equation (9), T_i represents the temperature at agent i, and H_i represents the relative humidity at agent i. WS_i is the wind speed at agent i, and ϕ_ij is the fractional area under agent j along the wind direction through agent i.

Temperature and humidity measurements are gathered from the mobile monitoring data. Wind data is obtained from the European Centre for Medium-Range Weather Forecasts (ECMWF) database. The temporal resolution of the available data is one hour, and the spatial resolution is 25 km. The values of wind speed and wind direction at each agent are interpolated using the Clough-Tocher method⁴³, as implemented by the CloughTocher2DInterpolator class in the interpolate sub-package of SciPy⁴⁴.
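A minimal Python sketch of this interpolation step is given below. It assumes that wind direction is interpolated through its u and v components, so that the 0°/360° wrap-around does not distort the result, and it treats direction as a generic bearing without fixing a meteorological sign convention; the article does not state how direction was handled. The grid, agent coordinates, and function name are hypothetical.

```python
import numpy as np
from scipy.interpolate import CloughTocher2DInterpolator

def interpolate_wind(grid_lon, grid_lat, speed, direction_deg, agent_lon, agent_lat):
    """Interpolate gridded wind speed and direction to agent locations."""
    theta = np.deg2rad(direction_deg)
    u, v = speed * np.sin(theta), speed * np.cos(theta)   # assumption: u/v split
    points = np.column_stack([grid_lon, grid_lat])
    interp = CloughTocher2DInterpolator(points, np.column_stack([u, v]))
    uv = interp(np.column_stack([agent_lon, agent_lat]))  # NaN outside the grid hull
    ws = np.hypot(uv[:, 0], uv[:, 1])
    wd = np.rad2deg(np.arctan2(uv[:, 0], uv[:, 1])) % 360.0
    return ws, wd

# Hypothetical 3 x 3 grid at 0.25 degree spacing with a uniform wind field.
lon, lat = np.meshgrid(np.linspace(80.0, 80.5, 3), np.linspace(12.75, 13.25, 3))
grid_lon, grid_lat = lon.ravel(), lat.ravel()
speed = np.full(grid_lon.size, 3.0)
direction = np.full(grid_lon.size, 90.0)
ws, wd = interpolate_wind(grid_lon, grid_lat, speed, direction,
                          np.array([80.20, 80.27]), np.array([13.00, 13.05]))
```

Because the study domain is much smaller than one 25 km grid cell, the interpolated wind varies only weakly between agents.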

Parameter estimation

The base model depicted in the form of the mass balance is shown in equation (1). By incorporating parameterized BP and K into the mass balance, real-world features are introduced in the model. The final agent-based model in terms of the real-world parameters is formulated, after simplification and generalization, as:

$$\begin{array}{rcl}{P}_{i}(t)\!\!\!&=&\!\!\!{P}_{i}(t-1)+{a}_{1}(t){R}_{i}+{a}_{2}(t){I}_{i}+{a}_{3}(t){C}_{i}+{a}_{4}(t)W{b}_{i}+{a}_{5}(t)R{1}_{i}+{a}_{6}(t)R{2}_{i}\\&&\!\!\!+{a}_{7}(t)R{3}_{i}+{b}_{1}(t){T}_{i,new}(t)+{b}_{2}(t){H}_{i,new}(t)+{b}_{3}(t)g({W}_{s},\phi )(t)\end{array}$$

(10)

where

$${T}_{i,new}(t)=-{T}_{i}(t){P}_{i}(t-1)\mathop{\sum }\limits_{j\in {{\mathbb{Z}}}_{i}}{D}_{ij}+\mathop{\sum }\limits_{j\in {{\mathbb{Z}}}_{i}}{T}_{j}(t){D}_{ji}* {P}_{j}(t-1)$$

(11)

$${H}_{i,new}(t)=-{H}_{i}(t){P}_{i}(t-1)\mathop{\sum }\limits_{j\in {{\mathbb{Z}}}_{i}}{D}_{ij}+\mathop{\sum }\limits_{j\in {{\mathbb{Z}}}_{i}}{H}_{j}(t){D}_{ji}* {P}_{j}(t-1)$$

(12)

and

$$g({W}_{s},\phi )(t)=-{W}_{si}(t)\cdot {P}_{i}(t-1)\mathop{\sum }\limits_{j\in {{\mathbb{Z}}}_{i}}{D}_{ij}{\phi }_{ij}(t)+\mathop{\sum }\limits_{j\in {{\mathbb{Z}}}_{i}}{W}_{sj}(t){\phi }_{ji}(t){D}_{ji}{P}_{j}(t-1)$$

(13)

For the step-by-step simplification, please refer to the “Generalized representation of ABM with parameterization” section in Supplementary Material. T_i,new, H_i,new, and g(W_s, ϕ) are computed from real-world data. The agent-based model, in terms of real-world features, is still linear in parameters. The 22 agents are described by a system of 22 coupled equations, which are solved at an hourly time step. Thus, 10 coefficients are to be estimated from 22 equations at each time step, yielding 12 residual degrees of freedom. The system is overdetermined and, provided X[t] has full column rank, has a unique minimum-residual solution at each hour. The system is described by equations (14–17), and its least-squares solution by equation (18). The quantities y[t], X[t], and θ[t] are as follows:

$${\bf{y}}[{\bf{t}}]=\left[\begin{array}{c}{P}_{1}(t)-{P}_{1}(t-1)\\ {P}_{2}(t)-{P}_{2}(t-1)\\ \vdots \\ {P}_{21}(t)-{P}_{21}(t-1)\\ {P}_{22}(t)-{P}_{22}(t-1)\end{array}\right]$$

(14)

$${\bf{X}}[{\bf{t}}]=\left[\begin{array}{cccccccccc}{R}_{1} & {I}_{1} & {C}_{1} & {W}_{b1} & R{1}_{1} & R{2}_{1} & R{3}_{1} & {T}_{1,new} & {H}_{1,new} & {g}_{1}({W}_{s},\phi )\\ {R}_{2} & {I}_{2} & {C}_{2} & {W}_{b2} & R{1}_{2} & R{2}_{2} & R{3}_{2} & {T}_{2,new} & {H}_{2,new} & {g}_{2}({W}_{s},\phi )\\ & & & & & \vdots & & & & \\ {R}_{21} & {I}_{21} & {C}_{21} & {W}_{b21} & R{1}_{21} & R{2}_{21} & R{3}_{21} & {T}_{21,new} & {H}_{21,new} & {g}_{21}({W}_{s},\phi )\\ {R}_{22} & {I}_{22} & {C}_{22} & {W}_{b22} & R{1}_{22} & R{2}_{22} & R{3}_{22} & {T}_{22,new} & {H}_{22,new} & {g}_{22}({W}_{s},\phi )\end{array}\right]$$

(15)

$${\boldsymbol{\theta }}[t]={\left[\begin{array}{cccccccccc}{a}_{1} & {a}_{2} & {a}_{3} & {a}_{4} & {a}_{5} & {a}_{6} & {a}_{7} & {b}_{1} & {b}_{2} & {b}_{3}\end{array}\right]}^{T}$$

(16)

$${\bf{y}}[t]={\bf{X}}[t]{\boldsymbol{\theta }}[t]$$

(17)

$$\widehat{{\boldsymbol{\theta }}}[t]=\mathrm{argmin}||{\bf{y}}[t]-{\bf{X}}[t]{\boldsymbol{\theta }}[t]|{|}_{2}^{2}$$

(18)

The solution to this linearized agent-based model (Equation (18)) is obtained by minimizing the squared L2 norm of the residual error, which corresponds to the ordinary least squares (OLS) estimation problem. This formulation is well-suited for parameter estimation in linear dynamical systems where observations are available at multiple spatial locations⁴⁵.
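To make the estimation step concrete, the following Python sketch assembles the regressors of equations (11)–(13), stacks them into X[t] as in equation (15), and solves equation (18) with a standard least-squares routine. All function names and the synthetic ring-shaped network are hypothetical, and NumPy's lstsq is used here only for illustration.

```python
import numpy as np

def exchange_term(F, D, P_prev, phi=None):
    """Net inflow minus outflow for a field F (T, H or wind speed), eqs. (11)-(13)."""
    W = D if phi is None else D * phi
    outflow = F * P_prev * W.sum(axis=1)     # F_i P_i(t-1) sum_j D_ij [phi_ij]
    inflow = W.T @ (F * P_prev)              # sum_j F_j [phi_ji] D_ji P_j(t-1)
    return inflow - outflow

def design_matrix(source, T, H, WS, phi, D, P_prev):
    """Equation (15): seven source columns followed by three transport columns."""
    return np.column_stack([source,
                            exchange_term(T, D, P_prev),
                            exchange_term(H, D, P_prev),
                            exchange_term(WS, D, P_prev, phi)])

def fit_hour(X, P_prev, P_now):
    """Equation (18): ordinary least squares on y = P(t) - P(t-1)."""
    theta, *_ = np.linalg.lstsq(X, P_now - P_prev, rcond=None)
    return theta

# Synthetic check (hypothetical data): 22 agents connected in a ring.
rng = np.random.default_rng(0)
n = 22
idx = np.arange(n)
A = np.zeros((n, n))
A[idx, (idx + 1) % n] = 1
A[(idx + 1) % n, idx] = 1
D = A / A.sum(axis=1, keepdims=True)
source = rng.random((n, 7))                  # R, I, C, Wb, R1, R2, R3
T, H, WS = rng.uniform(28, 35, n), rng.uniform(50, 80, n), rng.uniform(1, 5, n)
phi = rng.random((n, n)) * A
P_prev = rng.uniform(20, 80, n)
theta_true = rng.normal(size=10)
X = design_matrix(source, T, H, WS, phi, D, P_prev)
P_now = P_prev + X @ theta_true              # noise-free response
theta_hat = fit_hour(X, P_prev, P_now)
```

With noise-free synthetic data the ten coefficients are recovered, which is a quick way to check that the regressors have been assembled with the intended signs and indices.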

Validation of model parameters

An ordinary least squares (OLS) formulation is used to learn the coefficients of the proposed model. Hourly average PM_2.5 concentration data is available from the data collection campaign. At each time step t, a system of 22 equations in 10 unknowns is solved independently, constituting an overdetermined problem. Since the coefficient vector θ[t] is estimated from only the 22 spatial observations at each hour, the ratio of the number of observations to the number of unknowns is small (22:10). Consequently, while the least squares solution is useful, an appropriate cross-validation strategy is required. A spatial leave-one-agent-out cross-validation (LOOCV) is performed, wherein one agent is held out in each iteration. Coefficients are estimated from the remaining 21 agents, and the PM_2.5 concentration change at the held-out agent is predicted using those coefficients. This is repeated for all 22 agents and all 24 h. The RMSE and Spearman rank correlation are summarized in Table 5. The fit on the test dataset for each agent is shown in Fig. 10, where the model is trained on 21 agents and tested on the one left-out agent. The plot shows the concentration change between the 23rd and 24th hours, i.e., t = 24. The estimated OLS coefficients are analyzed to verify that BP and K are physically meaningful. For this purpose, traffic, wind speed, and wind direction data are analyzed.
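The leave-one-agent-out procedure for a single hour can be sketched in Python as follows. The sketch assumes that the design matrix X[t] and response y[t] have already been assembled, and it ignores the fact that a held-out agent's previous concentration also enters its neighbours' transport terms; the names and synthetic data are hypothetical.

```python
import numpy as np
from scipy.stats import spearmanr

def leave_one_agent_out(X, y):
    """Predict each agent's concentration change from a fit on the other agents."""
    n = len(y)
    pred = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i                      # hold out agent i
        theta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        pred[i] = X[i] @ theta
    return pred

def rmse(y, pred):
    return float(np.sqrt(np.mean((y - pred) ** 2)))

# Synthetic, noise-free example (hypothetical data) for one hour.
rng = np.random.default_rng(1)
X = rng.random((22, 10))
y = X @ rng.normal(size=10)
pred = leave_one_agent_out(X, y)
rho, _ = spearmanr(y, pred)
error = rmse(y, pred)
```

Repeating this loop for each of the 24 hourly systems yields the per-agent held-out predictions from which RMSE and Spearman rank correlation can be computed.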

**Fig. 10: Model performance on held-out test agent showing agent-specific prediction accuracy across the study domain for the 24th time step (t = 24).**

Table 5 Leave-one-out cross-validation performance metrics


Model predictions demonstrate strong agreement with observations (Fig. 10).

The convective transfer coefficient K is averaged over all agents for each hour of the day, and the wind speed is likewise averaged over the 22 agents. The temporal profiles of K and wind speed over the region are shown in Fig. 11. It can be observed that the diurnal profile of the convective transfer coefficient matches the wind speed profile qualitatively. A Spearman rank correlation of 0.41 and a Pearson correlation of 0.55 indicate that the convective transfer coefficient and average wind speed are moderately correlated. The convective transfer coefficient dictates the proportion of mass transferred from one agent to another. It is expected that the higher the wind speed, the higher the pollutant movement across agents. Similarly, as the wind speed reduces, the pollutant movement reduces. The same phenomenon is captured in the estimated convective transfer coefficient.

**Fig. 11: Temporal profile of wind speed and convective transfer coefficient.**


Major sources of the base pollution in urban areas are traffic and industrial emissions. Figure 2b shows the correlation between BP and average vehicular count along with the land-use information for each agent. The region comprises four land-use categories, namely commercial, industrial, residential, and water regions, which are distributed as shown in Fig. 2d. BP for most agents in the industrial region (2, 4, 6, 8, 20, 21) shows a lower correlation with vehicular count, which indicates that the contribution of industrial emissions to BP is possibly higher than the vehicular contribution. Agents 1, 3, and 5 show slightly higher correlation, which can be attributed to the fact that they are major traffic intersections. Residential and commercial agents have a significantly higher correlation with traffic, indicating vehicular traffic as a significant contributor.

Identifying the relative dominance of BP and K on the agent

Each agent, depending on local factors, would have a dominant contributor. Two metrics, BPR (BP Ratio) and CR (Convection Ratio), are proposed to determine the relative strength of BP and K in influencing overall pollution at a given agent across all times (Equations (19–21)).

While BPR quantifies the change in total pollution, P(t), relative to the change in BP, CR quantifies the change in P(t) relative to the transport of the pollutant.

$$BPR=\frac{{\sum }_{t=1}^{24}{(B{P}_{i}(t)-{\overline{BP}}_{i})}^{2}}{{\sum }_{t=1}^{24}{({P}_{i}(t)-{\bar{P}}_{i})}^{2}}$$

(19)

$${C}_{i}(t)={P}_{i\leftarrow }(t)-{P}_{i\to }(t)$$

(20)

$$CR=\frac{{\sum }_{t=1}^{24}{({C}_{i}(t)-{\bar{C}}_{i})}^{2}}{{\sum }_{t=1}^{24}{({P}_{i}(t)-{\bar{P}}_{i})}^{2}}$$

(21)

An agent is identified as source dominant (‘BP dominant’) if its BPR is at least 30% larger than its CR. It is identified as transport dominant (‘K dominant’) if its CR is at least 30% larger than its BPR. If the difference between BPR and CR is less than 30%, the agent is identified as ‘Equal’, implying that sourcing and transport contribute almost equally to the total pollution concentration. Although a simple max(BPR, CR) selection is the theoretical ideal, we adopt a 30% threshold to safeguard against modeling noise and aggregation uncertainties. This indicative threshold is employed as a heuristic to ensure that dominance is only assigned when one factor significantly outweighs the other. Users seeking higher sensitivity or different confidence levels may adjust this value to suit specific air quality management objectives.
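The classification rule can be sketched in Python as below. The 30% margin is interpreted here as a relative margin (one ratio being at least 1.3 times the other), which is an assumption about the intended reading; the function name and the sinusoidal toy profiles are hypothetical.

```python
import numpy as np

def dominance(bp, conv, p, threshold=0.30):
    """Return (BPR, CR, label) for one agent from its 24 hourly profiles.

    bp   : base pollution BP_i(t)
    conv : net convective term C_i(t) of equation (20)
    p    : total concentration P_i(t)
    """
    denom = np.sum((p - p.mean()) ** 2)
    bpr = np.sum((bp - bp.mean()) ** 2) / denom       # equation (19)
    cr = np.sum((conv - conv.mean()) ** 2) / denom    # equation (21)
    if bpr >= (1 + threshold) * cr:
        label = "BP dominant"
    elif cr >= (1 + threshold) * bpr:
        label = "K dominant"
    else:
        label = "Equal"
    return bpr, cr, label

# Toy diurnal profiles (hypothetical values).
hours = np.arange(24)
wave = np.sin(2 * np.pi * hours / 24)
p = 50 + 10 * wave
bpr, cr, label = dominance(5 * wave, 1 * wave, p)
_, _, tie_label = dominance(5 * wave, 5 * wave, p)
```

In this toy case BPR is 0.25 and CR is 0.01, so the agent is labelled ‘BP dominant’, whereas identical profiles yield ‘Equal’.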

Benchmarking spatio-temporal imputation capability

To evaluate the quality of ABM-based imputation, it is compared with four widely-used alternatives: mean imputation, linear interpolation, cubic spline interpolation, and kriging.

Mean imputation is the simplest gap-filling strategy. Every missing value at a given node is replaced by the arithmetic mean of all available PM_2.5 observations at that node across the entire record. Formally, if the set of observed values at node i is {c_i(t): t ϵ T_obs}, then every missing time step is filled with equation (22):

$${\widehat{c}}_{i}(t)=\frac{1}{| {T}_{obs}| }\sum _{s\in {T}_{obs}}{c}_{i}(s)$$

(22)

This approach preserves the mean of each node but ignores temporal structure, diurnal patterns, and spatial relationships. It therefore tends to produce flat imputed segments that distort both variability and correlation statistics.

Linear Interpolation assumes that the change in PM_2.5 between two temporally adjacent known values is constant. For a gap bounded by observations at times t_a and t_b, the imputed value at any intermediate time t(t_a < t < t_b) is:

$${\widehat{c}}_{i}(t)={c}_{i}({t}_{a})+[{c}_{i}({t}_{b})-{c}_{i}({t}_{a})]\frac{(t-{t}_{a})}{({t}_{b}-{t}_{a})}$$

(23)

Linear interpolation respects the local level at the gap boundaries and is computationally trivial, but it cannot capture the curvature or diurnal periodicity that typically characterizes PM_2.5 profiles. It operates purely in the temporal dimension at each node independently.

Cubic spline interpolation generalizes the linear approach by fitting a piecewise third-order polynomial (cubic spline) through the known data points, requiring that the resulting curve and its first two derivatives are continuous. For both linear and cubic spline interpolation implementations, we use SciPy’s interpolate module⁴⁶.
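The three purely temporal baselines described above (mean imputation, linear interpolation, and cubic spline interpolation) can be sketched for a single node as follows. The function name and toy series are hypothetical. For brevity the linear case uses NumPy's interp rather than SciPy's interpolate module, and the spline uses SciPy's CubicSpline with its default boundary conditions, which may differ from the settings used in the study.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def impute(series, method="linear"):
    """Fill NaNs in a 1-D hourly series at one node."""
    series = np.asarray(series, dtype=float)
    t = np.arange(series.size)
    known = ~np.isnan(series)
    filled = series.copy()
    if method == "mean":
        filled[~known] = series[known].mean()                 # equation (22)
    elif method == "linear":
        filled[~known] = np.interp(t[~known], t[known], series[known])  # eq. (23)
    elif method == "cubic":
        filled[~known] = CubicSpline(t[known], series[known])(t[~known])
    else:
        raise ValueError(f"unknown method: {method}")
    return filled

# Toy series with gaps (hypothetical values).
gappy = [10.0, np.nan, 30.0, np.nan, np.nan, 60.0]
by_mean = impute(gappy, "mean")
by_linear = impute(gappy, "linear")
squares = np.arange(8, dtype=float) ** 2
squares_gappy = squares.copy()
squares_gappy[[1, 4, 6]] = np.nan
by_cubic = impute(squares_gappy, "cubic")
```

On the toy series, mean imputation fills every gap with the same value, whereas the linear and spline variants follow the local trend between the bounding observations.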

Kriging is a geo-statistical interpolation technique that fills missing values by modeling the spatial or temporal correlation structure of the data through a variogram⁴⁷. Two variants are employed in this study. In temporal kriging, the autocorrelation of the PM_2.5 time series at each node is modeled and used to interpolate across time gaps at that node. In spatial kriging, the cross-correlation among all nodes at a given time step is modeled and used to interpolate the value at a node that has a missing observation from the values observed at neighboring nodes at the same time step. Ordinary 2D kriging was performed using the PyKrige library in Python⁴⁸.

Among the imputation baselines, mean imputation and the two interpolation methods operate exclusively in the temporal domain at each node in isolation, whereas kriging can additionally exploit spatial correlations. The ABM is unique in that its imputation leverages both the learned spatial diffusion parameters (edge weights) and the temporal transition structure simultaneously, drawing on the full network topology. Results of the imputation benchmarking are available in the “Data imputation” subsection.

Benchmarking diurnal forecasting capability

Forecasting differs from imputation in that the model must predict concentrations for a future period with no access to observations within that period. As mentioned earlier, the data collection campaign was run over a duration of 37 days. Data collected over the last 7 days of the campaign was used to test the agent-based model’s forecasting capability. However, instead of forecasting over a horizon of 7 days, the model is used to predict the average diurnal profile across the 7 days. This intentional choice was based on the limitations of the dataset. During the data collection campaign, each measurement location is visited only twice in a period of 24 h (the entire 25 km circuit takes roughly 10 h to complete). This limits the number of observations available for each node for each day. Without suitable aggregation, observations for each node for each day are too sparse to infer meaningful insights. Thus, diurnal forecasting is performed to predict the average PM_2.5 concentration over the last 7 days of the data collection campaign. The training dataset, on the other hand, uses data only from the first 30 days of the campaign.

Forecast from the agent-based approach is benchmarked against three time-series forecasting methods, each representing a distinct philosophy: persistence (a basic replay of the previous day), a Time-Varying Autoregressive (TVAR) model, and a Fourier-based method (spectral extrapolation of periodic components).

The persistence model is the simplest baseline. It assumes that the average diurnal PM_2.5 cycle observed during the training period will repeat unchanged during the test period. For a given node i and forecast hour h on day d, the prediction is:

$${\widehat{c}}_{i}(d,h)={c}_{i}(d-1,h)$$

(24)

In this implementation, the predicted test profile for each node is an exact copy of the average profile aggregated over the initial 30 days. This method involves no parameter estimation, no temporal modeling, and no spatial information. It serves as a critical lower-bound baseline.

Persistence differs from the other two benchmarking methods in that it makes no attempt to extrapolate or modify the training profile. TVAR and Fourier both transform the training cycle through learned parameters to produce a prediction that may diverge from the training average. Persistence, by contrast, assumes perfect stationarity of the diurnal pattern between the training and test periods.

The Time-Varying Autoregressive (TVAR) model is a non-stationary time-series model in which the regression coefficients change with time. It predicts current values using past values while allowing the relationship between data points to evolve, which makes it well suited to modeling dynamic systems⁴⁹.

In this case, the first 30 days of the data are used to train the TVAR model. For each target hour t in the cycle, a local estimation window of ±12 time steps centered on the middle repetition is extracted. Within this window an AR(2) model with an intercept is fitted by ordinary least squares:

$${y}_{t}={\beta }_{0}(t)+{\beta }_{1}(t)* {y}_{t-1}+{\beta }_{2}(t)* {y}_{t-2}+{\epsilon }_{t}$$

(25)

where y_t is the PM_2.5 value at position t within the estimation window, and β₀(t), β₁(t), β₂(t) are the hour-specific intercept and AR coefficients. The use of a sliding local window rather than a single global fit is what makes the model “time-varying”: the coefficients are re-estimated for each hour of the day. The forecast is then produced by recursive one-step-ahead prediction. Starting from the previous 2 values (last 2 h of the training data), each subsequent hour uses the AR coefficients estimated for that target hour and feeds its prediction back as a lag for the next step:

$$\widehat{c}(t)={\beta }_{0}(t)+{\beta }_{1}(t)* \widehat{c}(t-1)+{\beta }_{2}(t)* \widehat{c}(t-2)$$

(26)

Compared with persistence, TVAR exploits multi-day history and adapts its coefficients to the most recent regime, making it more responsive to trends and level shifts. However, like persistence, it operates independently at each node and does not incorporate spatial dependencies. It differs from the Fourier approach in that it models temporal structure through autoregressive lags.
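One possible reading of the TVAR procedure is sketched below in Python. It assumes that the training diurnal profile is tiled three times so that a ±12-step window can be centered on each hour of the middle repetition, and that the recursion is seeded with the last two training hours; the function name and the sinusoidal test profile are hypothetical, and details of the authors' implementation may differ.

```python
import numpy as np

def tvar_forecast(profile, half_window=12):
    """Hour-specific AR(2) fits (eq. (25)) followed by recursive prediction (eq. (26))."""
    profile = np.asarray(profile, dtype=float)
    n = profile.size
    tiled = np.tile(profile, 3)                  # assumption: three repetitions
    prev1, prev2 = profile[-1], profile[-2]      # last two training hours
    forecast = np.empty(n)
    for h in range(n):
        centre = n + h                           # hour h in the middle repetition
        win = tiled[centre - half_window: centre + half_window + 1]
        # Regress y_t on an intercept, y_{t-1} and y_{t-2} within the window.
        X = np.column_stack([np.ones(win.size - 2), win[1:-1], win[:-2]])
        beta, *_ = np.linalg.lstsq(X, win[2:], rcond=None)
        forecast[h] = beta[0] + beta[1] * prev1 + beta[2] * prev2
        prev1, prev2 = forecast[h], prev1        # feed the prediction back as a lag
    return forecast

# A pure 24-h sinusoid obeys an exact AR(2) recursion, so it should be reproduced.
hours = np.arange(24)
profile = 50 + 10 * np.sin(2 * np.pi * hours / 24)
forecast = tvar_forecast(profile)
flat = tvar_forecast(np.full(24, 40.0))
```

The sinusoid and the constant profile serve only as sanity checks; real PM_2.5 profiles would not satisfy an AR(2) recursion exactly, so the hour-specific coefficients would differ from one window to the next.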

Fourier-based forecasting uses a frequency-aware approach in which a Fourier harmonic regression is applied. The imputed training data is regressed onto an intercept, a linear trend term, and K = 4 pairs of sine and cosine harmonics with periods of 24, 12, 8, and 4 h (here K denotes the number of harmonic pairs, not the convective transfer coefficient). The four harmonic pairs represent the fundamental diurnal cycle (24 h) together with three of its higher harmonics (12, 8, and 4 h). The forecast is then obtained by evaluating the fitted equation at the 24 time indices immediately following the training series. The Fourier regression produces a non-trivial forecast as long as the training profile exhibits any hourly variation, because the harmonics are inherently oscillatory. By retaining only four harmonic pairs, the method acts as a low-pass filter, smoothing out high-frequency noise while preserving the dominant diurnal shape.
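The harmonic regression can be sketched in Python as follows, using the periods stated above. The function name and the synthetic three-day training series are hypothetical, and the fit uses a generic least-squares solver rather than any particular routine from the study.

```python
import numpy as np

def fourier_forecast(series, periods=(24, 12, 8, 4), horizon=24):
    """Fit intercept + linear trend + sine/cosine pairs, then extrapolate."""
    y = np.asarray(series, dtype=float)

    def design(t):
        cols = [np.ones_like(t), t]              # intercept and linear trend
        for p in periods:
            cols += [np.sin(2 * np.pi * t / p), np.cos(2 * np.pi * t / p)]
        return np.column_stack(cols)

    t_train = np.arange(y.size, dtype=float)
    coef, *_ = np.linalg.lstsq(design(t_train), y, rcond=None)
    t_future = np.arange(y.size, y.size + horizon, dtype=float)
    return design(t_future) @ coef

def signal(t):
    # Synthetic series (hypothetical) lying exactly in the span of the regressors.
    return 50 + 0.01 * t + 10 * np.sin(2 * np.pi * t / 24) + 3 * np.cos(2 * np.pi * t / 12)

train = signal(np.arange(72, dtype=float))       # three days of hourly values
forecast = fourier_forecast(train)
expected = signal(np.arange(72, 96, dtype=float))
```

Because only four harmonic pairs are fitted, any variation at periods other than 24, 12, 8, and 4 h is projected onto these components or left in the residual.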

All three benchmarking methods treat each monitoring node independently. Thus, the forecast at one node is not influenced by observations or predictions at any other node. The ABM, by contrast, jointly models all 22 nodes through directed-edge flow parameters weighted by inter-node distances, enabling it to propagate spatial information across the monitoring network at every time step. The forecasting performance of the three time-series methods is compared to the spatio-temporal agent-based framework (this work) in the “Diurnal forecasting” subsection.

Identifying travel exposure at each agent

Each agent represents a finite space of pollutant concentration that changes with time. One of the possible utilities of this property is in determining ambient exposure during travel. We propose a method in this work that simplifies travel exposure estimation as a simple aggregate of pollution concentration across agents. If we consider a route along which an individual travels, the personal exposure along the route is given by equations (27–30):

$${E}_{i}={\kappa }_{E}* {P}_{i}* {T}_{i}$$

(27)

$${T}_{i}=\frac{{d}_{i}}{{s}_{i}}$$

(28)

$${E}_{T}=\frac{{\sum }_{i=1}^{24}{E}_{i}}{{\sum }_{i=1}^{24}{T}_{i}}$$

(29)

$${\kappa }_{E}=f(\,{\rm{Inhalation\; dose,\; mode\; of\; transport}})$$

(30)

where, E_i is the exposure to an individual at agent i, P_i is the pollution concentration at agent i, d_i is the distance covered within agent i, s_i is the speed with which the individual travels within agent i, E_T is the total exposure along the route and κ_E is the dimensionless exposure coefficient which we define as a function of intake efficiency rate of the individual and their mode of transport. Intake efficiency is the fraction of the ambient particulate matter concentration that is inhaled by the individual. These factors can be incorporated based on empirical values. For the sake of simplicity, κ_E in this simulated case is assumed to be unity.
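Equations (27)–(29) can be sketched in Python for a route crossing a sequence of agents. The function name, units, and the two-agent toy route are hypothetical, and κ_E is set to unity as in the text.

```python
import numpy as np

def route_exposure(P, d, s, kappa=1.0):
    """Return (E_T, E) for a route through a sequence of agents.

    P : concentration at each agent on the route
    d : distance travelled within each agent
    s : travel speed within each agent
    """
    P, d, s = (np.asarray(a, dtype=float) for a in (P, d, s))
    T = d / s                          # equation (28): time spent in each agent
    E = kappa * P * T                  # equation (27): exposure in each agent
    return E.sum() / T.sum(), E        # equation (29)

# Toy route (hypothetical): 1 km in each of two agents at 20 and 10 km/h.
E_T, E = route_exposure(P=[40.0, 80.0], d=[1.0, 1.0], s=[20.0, 10.0])
```

In this toy case the traveler spends twice as long in the more polluted agent, so E_T (about 66.7) lies closer to 80 than to 40.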

Travel exposure has applications in determining the least-exposure route as described in the “Identifying the least-exposure route” section.

Data availability

PM_2.5 data used in this study is available through the URL https://doi.org/10.5281/zenodo.18988833. Wind data is available from the ECMWF website, and land-use data for Chennai is available through the CMDA website⁴¹.

Code availability

The basic ABM code is available through the URL https://doi.org/10.5281/zenodo.18988833. The code is written in MATLAB R2019a.

References

World Health Organization. WHO Global Air Quality Guidelines: Particulate Matter (PM_2.5 and PM₁₀), Ozone, Nitrogen Dioxide, Sulfur Dioxide and Carbon Monoxide. Report (WHO, 2021).
Manisalidis, I., Stavropoulou, E., Stavropoulos, A. & Bezirtzoglou, E. Environmental and health impacts of air pollution: a review. Front. Public Health 8, 14 (2020).
Balakrishnan, K., Dey, S., Gupta, T. et al. The impact of air pollution on deaths, disease burden, and life expectancy across the states of India: the Global Burden of Disease Study 2017. Lancet Planet. Health 3, e26–e39 (2019).
Apte, J. S., Messier, K. P., Gani, S. et al. High-resolution air pollution mapping with Google Street View cars: exploiting big data. Environ. Sci. Technol. 51, 6999–7008 (2017).
Messier, K. P., Chambliss, S. E., Gani, S. et al. Mapping air pollution with Google Street View cars: efficient approaches with mobile monitoring and land use regression. Environ. Sci. Technol. 52, 12563–12572 (2018).
Padró-Martínez, L. T., Patton, A. P., Trull, J. B. et al. Socioeconomic disparities in exposure to air pollution: a study of children in Madrid. Environ. Res. 187, 109711 (2020).
Schneider, P., Castell, N., Vogt, M. et al. Mapping urban air quality in near real-time using observations from low-cost sensors and model information. Environ. Int. 106, 234–247 (2017).
Lelieveld, J., Evans, J. S., Fnais, M. et al. The contribution of outdoor air pollution sources to premature mortality on a global scale. Nature 525, 367–371 (2015).
Apte, J. S., Marshall, J. D., Cohen, A. J. & Brauer, M. Addressing global mortality from ambient PM_2.5. Environ. Sci. Technol. 49, 8057–8066 (2015).
Zhong, J., Cai, X. & Bloss, W. J. Computational fluid dynamics simulation of ultrafine particle dispersion in an urban street canyon with different wind directions. Atmos. Chem. Phys. 18, 13253–13270 (2018).
Tominaga, Y. & Stathopoulos, T. CFD modeling of pollution dispersion in a street canyon: Comparison between LES and RANS. J. Wind Eng. Ind. Aerodyn. 99, 340–348 (2015).
Pennington, E. A., Wang, Y., Schulze, B. C. et al. An updated modeling framework to simulate Los Angeles air quality–Part 1: model development, evaluation, and source apportionment. Atmos. Chem. Phys. 24, 2345–2363 (2024).
Wolf, T., Pettersson, L. H. & Esau, I. A very high-resolution assessment and modelling of urban air quality. Atmos. Chem. Phys. 20, 625–647 (2020).
Hoek, G., Beelen, R., De Hoogh, K. et al. A review of land-use regression models to assess spatial variation of outdoor air pollution. Atmos. Environ. 42, 7561–7578 (2008).
Chen, S., Jiang, H., Feng, J. et al. Machine learning reveals impacts of smoking on gene profiles of different cell types in lung. Nat. Commun. 9, 5081 (2018).
Hsu, C.-Y., Wu, C.-D., Zeng, Y.-T. et al. Using a land use regression model with machine learning to estimate ground level PM_2.5. Environ. Pollut. 277, 116846 (2021).
Ma, X., Zou, B., Deng, J. et al. A comprehensive review of the development of land use regression approaches for modeling spatiotemporal variations of ambient air pollution: a perspective from 2011 to 2023. Environ. Int. 183, 108430 (2024).
Jiang, F., He, J. & Tian, T. Hybrid deep learning for air quality prediction with partial missing data. IEEE Access 9, 43530–43543 (2021).
Liu, H. & Chen, C. A hybrid deep learning approach for spatial PM_2.5 prediction. Atmos. Environ. 273, 118971 (2022).
Hettige, K. H. et al. AirPhyNet: Harnessing physics-guided neural networks for air quality prediction. In The Twelfth International Conference on Learning Representations, https://openreview.net/forum?id=JW3jTjaaAB (2024).
Zhang, J., Liu, H., Cheng, Q. et al. Improving air quality assessment using physics-inspired deep graph learning. npj Clim. Atmos. Sci. 6, 150 (2023).
Jerrett, M., Arain, A., Kanaroglou, P. et al. A review and evaluation of intraurban air pollution exposure models. J. Expo. Sci. Environ. Epidemiol. 15, 185–204 (2005).
Xie, X., Semanjski, I., Gautama, S. et al. A review of urban air pollution monitoring and exposure assessment methods. ISPRS Int. J. Geo-Inf. 6, 389 (2017).
Bibbero, R. J. Systems approach toward nationwide air-pollution control I. The problem, the system, the objective. IEEE Spectr. 8, 20–31 (1971).
Bonabeau, E. Agent-based modeling: methods and techniques for simulating human systems. Proc. Natl. Acad. Sci. USA 99, 7280–7287 (2002).
Chen, Z., Guo, Y. & Stuart, A. L. Agent-based modeling in urban and architectural research: a brief literature review. Front. Archit. Res. 5, 166–177 (2016).
Hunter, R. F., Cleland, C., Cleary, A. et al. An agent-based modeling framework for simulating human exposure to environmental stresses in urban areas. Int. J. Environ. Res. Public Health 15, 247 (2018).
Grimm, V. & Railsback, S. F. Individual-based models in ecology after four decades. F1000Prime Rep. 5, 9–15 (2017).
D.A.V. Girls Senior Secondary School. School Timing. https://girlsmgp.davchennai.org/school-timing/ (2021).
Jafari, A. J., Charkhloo, E. & Pasalari, H. Urban air pollution control policies and strategies: a systematic review. J. Environ. Health Sci. Eng. 19, 1911–1940 (2021).
Li, S. et al. Improving air quality through urban form optimization: a review study. Build. Environ. 243, 110685 (2023).
Swaminathan, S., Sankar Guntuku, A. V., S, S., Gupta, A. & Rengaswamy, R. Data science and IoT based mobile monitoring framework for hyper-local PM_2.5 assessment in urban setting. Build. Environ. 225, 109597 (2022).
Dijkstra, E. W. A note on two problems in connexion with graphs. Numer. Math. 1, 269–271 (1959).
Holmes, N. S. & Morawska, L. A review of dispersion modelling and its application to the dispersion of particles: an overview of different dispersion models available. Atmos. Environ. 40, 5902–5928 (2006).
Srivastava, A. & Rao, B. P. Urban air pollution modeling. In Air Quality-Models and Applications 364 (InTech, 2011).
Saebø, A. et al. Plant species differences in particulate matter accumulation on leaf surfaces. Sci. Total Environ. 427–428, 347–354 (2012).
Deligiorgi, D. & Philippopoulos, K. Spatial interpolation methodologies in urban air pollution modeling: application for the greater area of Metropolitan Athens, Greece. In Advanced Air Pollution (ed. Nejadkoorki, F.) Ch. 19 (IntechOpen, London, 2011).
Zhang, B., Zhou, F. & Song, G. Regional delimitation of PM_2.5 pollution using spatial cluster for monitoring sites: a case study of Xianyang, China. Atmosphere https://www.mdpi.com/2073-4433/11/9/972 (2020).
West Bengal Pollution Control Board. WBPCB: A Quinquennial Report, April 1998-March 2003 (West Bengal Pollution Control Board, 2003).
TomTom. TomTom traffic stats. https://www.tomtom.com/products/traffic-stats/ (2021).
CMDA. CMDA land use maps. http://www.cmdachennai.gov.in/LUMaps/Index. (2006).
Ma, X. et al. A comprehensive review of the development of land use regression approaches for modeling spatiotemporal variations of ambient air pollution: a perspective from 2011 to 2023. Environ. Int. 183, 108430 (2024).
Alfeld, P. A trivariate Clough–Tocher scheme for tetrahedral data. Comput. Aided Geom. Des. 1, 169–181 (1984).
Virtanen, P. et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods 17, 261–272 (2020).
Ljung, L. System Identification: Theory for the User 2nd edn (Prentice-Hall PTR, Upper Saddle River, NJ, USA, 1999).
Jones, E., Oliphant, T., Peterson, P. et al. SciPy: Open source scientific tools for Python http://www.scipy.org (2001).
Jack. Easykriging. https://www.mathworks.com/matlabcentral/fileexchange/66136-easykriging (2021). Accessed 17 Feb 2022.
Murphy, B. S. et al. Pykrige: Kriging toolkit for Python. https://github.com/geostat-framework/pykrige (2023).
Lütkepohl, H. New Introduction to Multiple Time Series Analysis (Springer-Verlag, Berlin, 2005).


Acknowledgements

S.S. and V.F.M. acknowledge the Columbia University Climate School and Open Philanthropy for financial support. The authors acknowledge the Chennai Metropolitan Development Authority for providing land-use data and the European Centre for Medium-Range Weather Forecasts for meteorological data access. S.S., R.R. and P.A. acknowledge the Ministry of Education, India, and the Department of Chemical Engineering for financial support in the form of fellowships. S.S. would like to acknowledge Summer Subramanian from Kaatru IIT Madras and Centre for Urbanization, Buildings and Environment (CUBE, IIT Madras) for their assistance in the data collection endeavor.

Author information

Authors and Affiliations

Department of Chemical Engineering, Indian Institute of Technology Madras, Chennai, India
Sathish Swaminathan, Pranav Agrawal & Raghunathan Rengaswamy
Department of Chemical Engineering, Columbia University, New York, NY, USA
Sathish Swaminathan & V. Faye McNeill
Climate School, Columbia University, New York, NY, USA
Sathish Swaminathan & V. Faye McNeill
Department of Earth and Environmental Sciences, Columbia University, New York, NY, USA
V. Faye McNeill

Authors

Sathish Swaminathan
Pranav Agrawal
V. Faye McNeill
Raghunathan Rengaswamy

Contributions

S.S., R.R. and V.F.M. designed the research. S.S. and P.A. conducted the research. S.S., P.A. and V.F.M. wrote and edited the manuscript.

Corresponding author

Correspondence to V. Faye McNeill.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.


About this article

Cite this article

Swaminathan, S., Agrawal, P., McNeill, V.F. et al. Agent-based framework for modeling hyperlocal urban air quality. npj Clean Air 2, 32 (2026). https://doi.org/10.1038/s44407-026-00073-6


Received: 15 December 2025
Accepted: 31 March 2026
Published: 12 May 2026
Version of record: 12 May 2026
DOI: https://doi.org/10.1038/s44407-026-00073-6
