Introduction

Geospatial models are a class of statistical and deterministic methods that account for spatial relationships and/or spatiotemporal correlation in the fixed effects (e.g., covariates, mechanisms) and/or random effects (e.g., error terms). These models have been extensively used in many scientific and engineering disciplines such as forestry [1], mining [2, 3], geology [4], and soil science [5, 6]. More recently, geospatial models have been used in environmental and public health for air and water quality exposure assessments [7,8,9,10].

The geospatial models used in exposure assessment vary significantly in terms of physical properties, statistical methods, implementation requirements, and overall accuracy and precision of predictions. However, the vast literature makes comparisons of these exposure methods and their health applications difficult. Several specific technical assessments have been conducted: Hoek et al. [11] reviewed land-use regression for geospatial exposure assessment applications; VoPham et al. [12] discussed advanced machine learning (ML) and artificial intelligence (AI) approaches for geospatial exposure assessment; and Nieuwenhuijsen [13] provided a book on exposure assessment methods in environmental health, including chapters on measurement and on modeling via geographic information systems and atmospheric dispersion modeling.

These challenges are becoming increasingly important as comprehensive and complex exposure metrics are needed to understand health outcomes. Whereas traditional environmental health studies examine exposures and health outcomes one chemical at a time, modern environmental health includes the exposome, which attempts to understand the totality of exposures across environmental, social, lifestyle, and ecological factors on human health [14]. The primary approach for quantifying the exposome is through advanced analytical chemistry techniques like non-targeted mass spectrometry, which can potentially elucidate novel and unknown chemicals in individuals [15]; however, it currently suffers from low throughput, high cost, difficult reproducibility, and potentially low precision in quantification. Geospatial approaches offer a tractable and complementary approach to quantify the exposome—particularly the external components such as the social and physical-chemical exposome. Therefore, there is a need for a review and vetted resources on approaches for geospatial exposure modeling and linkages to health data.

Applying geospatial models to help quantify the complete exposome involves substantial data engineering challenges. These include integrating large and diverse data streams (e.g., from sensors, models, surveys, and/or electronic health records) with disparate spatial and temporal coverage and scale, while protecting the privacy of personal health information. Additionally, such geospatial exposure methods are rapidly evolving, due to advancements in data sources (such as satellite-based sensors), modeling methods (such as machine learning models), and tools (such as software for handling large geospatial datasets). These data challenges and methods span multiple disciplines, from geographic information science to bioinformatics.

Our objective is to provide a technical introduction to and a compendium of geospatial exposure modeling methods through reviewing information that is otherwise scattered across a vast literature. Specifically, we review: (1) key concepts and terminology for geospatial exposure data and modeling, (2) workflows in geospatial exposure model development and health data integration, (3) types of geospatial exposure models used to assess the complete external exposome, including environmental, climate, and social determinants of health, (4) methods to integrate geospatial exposure data with health data, and (5) open-source tools supporting these workflows.

Background

This section discusses key terminology and concepts for geospatial exposure data and modeling.

Terminology

Here, we outline the terminology and notation used throughout the manuscript. Most approaches are statistical or stochastic by definition, so we assume corresponding notation. We briefly mention mechanistic model notation, which is discussed in section “Mechanistic or chemical transport”. A single random variable is denoted by a bold lowercase letter, \(\mathbf{y}\). A collection of random variables across a spatial and/or temporal domain is denoted by a bold capital letter, \(\mathbf{Y}\). In the geospatial context, \(\mathbf{Y}\) provides a full, probabilistic characterization of an exposure across space and time and is typically called a space/time random field (S/TRF). An S/TRF is referenced to a real-valued domain with spatial and temporal indices, \(\{\mathbf{Y}(s,t); s\in \mathbb{R}^{2}, t\in \mathbb{R}^{1}\}\), where the spatial dimension could be 1 or more than 2. For brevity in equations, let a single spatiotemporal index, \(p = (s, t)\), be equal to the combined spatial and temporal index, \(\{\mathbf{Y}(p); p\in \mathbb{R}^{3}\}\).

When we describe the relationship between two spatiotemporal observations, parameter estimates, or model predictions, we can utilize generic subscripts, \(i\) and \(j\). As is standard in matrix notation, let the first subscript represent row indices and the second represent column indices (and so on for higher dimensions); thus \(A_{ij}\) is the entry in the \(i\)-th row and \(j\)-th column of \(A\).

Full quantification of an S/TRF is the goal of a geospatial exposure assessment; however, the realization is typically noisy. We denote a deterministic value or realization of a random variable of interest with regular, non-bold font, \(Y = (y_1, \ldots, y_n)^{T}\). Thus a given realization, \(y_i\), can be thought of as a single random draw from an underlying random distribution, \(\mathbf{y}_i\). Moreover, all statistical models represent a latent, smooth estimation of the exposure of interest:

$$z_{i} = y_{i} - \varepsilon_{i}$$
(2.1.1)

where \(z_i\) is the latent estimate, \(y_i\) is the observed, noisy data, and \(\varepsilon_i\) is the error. Mechanistic models, or models based on explicit physics and chemistry, do not typically include an error term and thus directly estimate \(y_i\).

Distance metrics

Distance, both spatial and temporal, is the foundation of geospatial exposure metrics, so it is imperative to define the types of distances. The most common distance is the Euclidean distance, or distance “as the crow flies”, a unique value that represents the shortest distance between two points. Let \(A\) be an \([n \times 2]\) matrix of locations (i.e., x coordinate, y coordinate), such as outcome locations or grid locations where an entire exposure field is calculated, and let \(B\) be an \([m \times 2]\) matrix of locations such as pollution sources. The Euclidean distance between the \(i\)-th (\(i = 1, \ldots, n\)) object in \(A\) and the \(j\)-th (\(j = 1, \ldots, m\)) object in \(B\) is:

$$d_{ij} = \sqrt{(A_{i1} - B_{j1})^{2} + (A_{i2} - B_{j2})^{2}}$$
(2.2.1)

where \(d_{ij}\) is the \((i, j)\) entry of the \([n \times m]\) matrix of Euclidean distances between all points in \(A\) and \(B\).
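
For illustration, the full \([n \times m]\) distance matrix of equation (2.2.1) can be computed in a vectorized form; the following is a minimal sketch in Python with NumPy, using hypothetical coordinates in a projected coordinate system.

```python
import numpy as np

# Hypothetical coordinates: n outcome locations (A) and m pollution sources (B),
# each row an (x, y) pair in a projected coordinate system (e.g., meters).
A = np.array([[0.0, 0.0], [100.0, 50.0], [250.0, 300.0]])   # [n x 2]
B = np.array([[10.0, 10.0], [500.0, 500.0]])                # [m x 2]

# Equation (2.2.1) for all pairs at once via broadcasting:
# diff has shape [n, m, 2]; summing squared differences over the last axis
# yields the [n x m] matrix of squared Euclidean distances.
diff = A[:, None, :] - B[None, :, :]
d = np.sqrt((diff ** 2).sum(axis=-1))  # d[i, j] = distance from A_i to B_j

print(d.shape)  # (3, 2)
```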

Geodesic distance is used to calculate distances on the Earth’s surface that account for the Earth’s three-dimensional shape, which is typically modeled as an oblate spheroid. When the coordinates are expressed in longitude (x-coordinate) and latitude (y-coordinate), a geographic coordinate system, then distance calculations must use the geodesic version; otherwise, the results can be inaccurate. Euclidean distances are valid in two-dimensional projections of geographic coordinate systems, called projected coordinate systems.
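
As a sketch of the difference, the example below computes the geodesic distance between two points given in longitude and latitude; it assumes the pyproj package (whose Geod class performs ellipsoidal distance calculations) and hypothetical coordinates, and contrasts the result with a naive Euclidean calculation on the raw coordinates.

```python
import numpy as np
from pyproj import Geod

# Two hypothetical points in geographic coordinates (longitude, latitude).
lon1, lat1 = -79.0, 35.9
lon2, lat2 = -78.6, 35.8

# Geodesic distance on the WGS84 ellipsoid (returned in meters).
geod = Geod(ellps="WGS84")
_, _, dist_geodesic = geod.inv(lon1, lat1, lon2, lat2)

# Naive Euclidean distance in degrees -- NOT a valid physical distance,
# because one degree of longitude shrinks with increasing latitude.
dist_degrees = np.hypot(lon2 - lon1, lat2 - lat1)

print(f"geodesic: {dist_geodesic / 1000:.1f} km; euclidean: {dist_degrees:.3f} degrees")
```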

In hydrological applications, a non-Euclidean river distance is often used that is constrained to the path of a given river system. The river distance incorporates the geometry of the river network and potentially flow direction (see Fig. 1), resulting in better accuracy and precision for exposures predominantly controlled by river hydrology [16,17,18]. For the remainder of this review, unless otherwise specified, we refer to the Euclidean distance.

Fig. 1: Comparison of distance metrics.

The river distance, A to B to C (blue path), is longer than the Euclidean distance from A to C (red path). Moreover, if flow direction is considered, then the river distance from A to D is infinite (dotted yellow path), since they cannot be connected while constrained to the river network and flow.

Time may be viewed as an additional spatial dimension and modeled on a continuous domain. In this case, the distances may be treated equally (i.e., a distance of 1 unit is the same in space or time) or they may be weighted or scaled based on expert knowledge or model parameters. For example, a combined spatiotemporal distance may be calculated as follows:

$$d_{\gamma}\left((\mathbf{s}_i, t_i), (\mathbf{s}_j, t_j)\right) = \sqrt{\frac{\|\mathbf{s}_i - \mathbf{s}_j\|_2^2}{\gamma_s^2} + \frac{(t_i - t_j)^2}{\gamma_t^2}},$$
(2.2.2)

where the spatial distance is scaled by a spatial range parameter, \(\gamma_s\), and the temporal distance is scaled by a temporal range parameter, \(\gamma_t\). Range parameters provide an estimate of, and an interpretation for, the length or duration over which the process varies.
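
A minimal sketch of equation (2.2.2), with hypothetical range parameters, follows.

```python
import numpy as np

def spatiotemporal_distance(s_i, t_i, s_j, t_j, gamma_s, gamma_t):
    """Scaled spatiotemporal distance of equation (2.2.2)."""
    spatial = np.sum((np.asarray(s_i) - np.asarray(s_j)) ** 2) / gamma_s ** 2
    temporal = (t_i - t_j) ** 2 / gamma_t ** 2
    return np.sqrt(spatial + temporal)

# Hypothetical example: points 5 km apart in space and 2 days apart in time,
# with range parameters of 10 km and 7 days.
d = spatiotemporal_distance((0.0, 0.0), 0.0, (3.0, 4.0), 2.0,
                            gamma_s=10.0, gamma_t=7.0)
print(round(d, 3))  # sqrt(25/100 + 4/49) ~= 0.576
```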

In dynamic spatial models, time is explicitly treated as a separate, discrete dimension. For instance, a joint spatiotemporal process is viewed as a discrete time series of spatial processes. For a brief review of dynamic spatiotemporal distances and models see Wikle [19], and for an extensive review see Cressie and Wikle [20].

Workflows

This section provides an overview of workflows for environmental health research incorporating geospatial exposure models and a discussion of key steps in geospatial exposure model development.

Overview

Figure 2 presents an overview of key steps in environmental health research workflows incorporating geospatial model development for exposure assessment. The first step is formulation of a research question on the association between exposures and health outcomes for specific places (i.e., study area) and times (i.e., study period).

Fig. 2: Overview of key steps in environmental health research incorporating geospatial model development for exposure assessment.

Key steps include formulation of a research question on the relationship between exposures and health outcomes (panel 1), preparation of geospatial exposure data (panel 2.1) and health outcome data (panel 2.2), and integration and analysis of the exposure and health outcome data (panel 3).

The research question shapes the workflow in several important ways. The study area and period determine the spatial and temporal range (or coverage) of input data needed. The selected exposure and health outcome determine the spatial and temporal scale (or resolution) of the input data needed: finer scale data is needed for exposures and health outcomes with higher spatial and/or temporal variability. The selected health outcome influences the scale of data integration for analysis: input data are typically spatially and temporally integrated to match the health outcome scale (e.g., to home addresses or ZIP codes during specific days or years).

The second step is preparation of the input data with the spatiotemporal range and scale needed to investigate the research question. Preparation of geospatial exposure data through geospatial model development involves the following steps:

  • Selection of the exposure metric. This metric quantifies an aspect of the external exposome. Example metrics are average annual air-pollutant concentration and daily maximum heat index.

  • Selection of geospatial modeling strategy for the selected exposure metric. This includes selecting the type of model (section “Types of models”), sources of input data, and methods for building and evaluating the model. The spatiotemporal range and scale of analysis are important considerations in selecting the strategy.

  • Collection and integration of geospatial data needed as inputs for the selected model. This may involve collecting new measurements (e.g., sampling local air quality) and/or gathering existing measurements from disparate data sources (e.g., accessing global satellite imagery). The input data can include observations of the exposure metric and/or various geographic covariates (section “Geographic covariate development”) used to predict the exposure metric at unobserved locations or times. Such geographic covariates are selected based on domain expertise to have an interpretable influence on the exposure metric; they typically represent sources of exposure as well as change and transport processes impacting exposure.

  • Building and evaluation of the geospatial model using the selected approaches with the integrated input data. This can involve various model selection and validation methods (section “Spatiotemporal model assessment and selection”).

  • Prediction of the exposure metric. This involves applying the geospatial model with input geospatial data to predict the exposure metric at unmonitored space/time locations such as residential addresses or areal units during specific days or years.

Preparation of health outcome data involves the following steps:

  • Selection of the health outcome metric. This metric describes a health outcome at the individual level (e.g., for a specific patient or research participant) or at the population level (e.g., among all people living in a specific county). Example metrics are asthma emergency department visits and blood pressure.

  • Collection and integration of the health outcome data. This may involve collecting new measurements (e.g., through surveys administered to health cohorts) or gathering information from existing health data sources such as electronic health records, insurance claims, public health surveillance programs, and existing health cohorts. Data catalogs (e.g., the Climate and Health Outcomes Research Data Systems (CHORDS) catalog [21] and others reviewed in section “Open-source tools”) provide examples of health data sources with descriptions of the available data types (e.g., conditions, prescriptions), population characteristics, and privacy-related restrictions on access and use (section “Special health data linkage considerations”).

  • Spatially referencing the health data. This step connects an individual or population to the spatial information needed for linkage to the geospatial exposure data (section “Special health data linkage considerations”). For individual-level data, this may involve geocoding, which is the process of translating addresses (i.e., street address listings) to coordinates (i.e., latitude and longitude). For population-level data, this may involve coding with standard geographic units (e.g., spatial boundaries for counties or postal codes).

The third step is the integration and analysis of the geospatial exposure data and health outcome data. This step involves calculating exposure metrics, using the geospatial exposure model to predict exposures at the specific spatial locations and times linked with each individual or population (section “Geospatial data integration”). Potential confounding factors such as age, social determinants of health, and other types of environmental exposures are then commonly linked with each individual or population. These steps produce an integrated exposure and health dataset. This integrated dataset can then be analyzed using various epidemiological methods to quantify associations between exposures and health while accounting for potential confounding factors.

Model development

This section discusses key steps in model development workflows.

Geographic covariate development

The fundamental drivers of most exposure assessment models are geographic covariates. A design matrix, X(p), consists of a combination of spatial and/or spatiotemporally referenced geographic covariates. Here, we explain the properties of and broadly define types of covariates that are subsets of the full design matrix. There are two domains of classification to consider in the development of covariates: the mechanism that the covariate represents and the geometry of the spatial representation.

Geographic covariates represent mechanistic approximations of the outcome of interest and are intended to be interpretable. The mechanisms can be classified into three types of variables, which helps ensure all relevant variables are considered. Source variables, \(X^{s}(p)\), are direct or indirect sources of the outcome of interest. For instance, in air pollution studies of NO2, internal-combustion vehicles are a direct source since they directly emit NO2. Change variables, \(X^{c}(p)\), are processes that may attenuate or concentrate the dependent variable through physical and chemical changes or transformations. For example, ground and surface water nitrate models include soil variables with properties favorable to denitrification (i.e., decrease) of \({{{\rm{NO}}}}_{3}^{-}\) [22]. In air pollution modeling, solar radiation and urban morphology can inform transformations such as secondary organics formation, aerosol nucleation, and adhesion, as well as physical constraints such as urban structures. Transport variables, \(X^{t}(p)\), are processes that affect the movement of dependent variables (e.g., advection, diffusion) such as wind or water flow. The distinction between \(X^{c}(p)\) and \(X^{t}(p)\) is somewhat arbitrary at this stage; however, it is important for a priori understanding of physical and chemical processes for interpretation and validation. Moreover, many algorithms impose constraints on covariate groups, which enforces physical relevance and can aid in estimation by reducing the overall parameter search space. The geographic covariate design matrix is equal to the set containing the source, change, and transport variables:

$$X(p)=\{{X}^{s}(p),{X}^{c}(p),{X}^{t}(p)\}$$
(3.2.1)

Source covariates can be categorized as point or non-point sources. Point sources originate at an exact geographic location and are easily represented as a point. These can be calculated with many of the proximity variables from section “Proximity”, such as a nearest distance or the sum of exponentially decaying contributions of sources. Conversely, non-point sources are either unknown in exact origin or occur over a given line or area. In water quality exposure assessment, the impact of agriculture is often described as non-point since large agricultural areas may be a source of pollution, but the source cannot be pinned to a precise geographic location. A common metric is the percentage of a given land cover type, such as developed or agricultural land represented by areal raster or polygon data, within a buffer around the outcome locations. Change and transport variables are also typically derived from areal raster or polygon data and are likewise best represented with proximity metrics, such as the percentage of an attenuating land cover type within a buffer (section “Proximity”).

Geographic covariate development is a key step in exposure assessment model development (Fig. 2). Using a combination of subject expertise and literature reviews, exposure modelers decide which geographic covariates to include in the model. Recommended inclusion characteristics are [23]: (1) data quality: accuracy and precision, strengths, and limitations should be documented and understood; (2) spatial and temporal scales: spatial and temporal resolution should be considered within the context of the outcome and the spatial and temporal domains; (3) geographic and temporal coverage: the domain of the data should include the domain of interest, including prediction locations; and (4) scientific relevance: covariates should reasonably represent a source, change, or transport process.

The model predictions described in Section “Types of models” are also examples of geographic covariates. For example, proximity models (section “Proximity”) and chemical-transport models (section “Mechanistic or chemical transport”) are common geographic covariates in land-use regression (section “Land-Use Regression”), kriging/Gaussian Process (section “Geostatistical Models: Gaussian Processes, Kriging, and BME”), and machine learning models (section “Machine learning”). Model predictions can serve as covariates or be combined with a post-hoc learning method to produce a hybrid model (section “Hybrid”). Creativity is the only limit to developing helpful and meaningful geographic covariates.

Spatiotemporal model estimation

Parameter estimation is an important part of statistical model development regardless of the scientific discipline. The benchmark goal is typically the unbiased estimation of parameters as close to the unobserved “truth” as possible. In exposure modeling, parameter estimation is often important for identifying and quantifying the contribution from sources or reduction from attenuation factors. There are three main approaches for estimating or fitting statistical models: maximum likelihood estimation (MLE), Bayesian inference, and least-squares fitting of empirical estimates. In section “Types of models”, we assume that models are estimated with MLE unless otherwise specified. We recommend Gelfand et al. [24] for in-depth discussions on estimation methods for geospatial models.

Spatiotemporal model assessment and selection

In many geospatial exposure applications, prediction at unobserved locations is the primary objective, so the model estimation and selection strategy must reflect this objective. Recently, many authors [25,26,27,28] have proposed spatiotemporal-specific cross-validation strategies that account for the spatiotemporal correlation in models and more accurately reflect their out-of-sample or extrapolation prediction capabilities. Figure 3 is a schematic of cross-validation schemes, including versions that produce fairer estimates of prediction errors for spatiotemporal data. Purely random folds (Fig. 3A) and leave-one-out (LOO) (Fig. 3B) cross-validation can result in overly optimistic estimation of the generalization error for spatiotemporal models. Watson et al. [28] recommended a spatial, temporal, or spatiotemporal version based on the space-time sampling strategy and the goals of the prediction model. Options include leave-time-out (LTO) (Fig. 3C) and leave-location-out (LLO) (Fig. 3D), which are appropriate for models with sparse temporality or spatial clustering, respectively. Generalizing LLO to random folds (Fig. 3E) or spatially structured blocks (Fig. 3F) results in k-fold and blocked LLO cross-validation, respectively. Roberts et al. [25] and Valavi et al. [27] proposed strategies for developing spatial and temporal blocks based on regular grids, spatial clusters such as k-means, and structures such as watersheds, ecoregions, or political units. Moreover, they proposed the use of spatial buffers, an additional non-active set between the training and held-out sets. Following Roberts et al. [25] and Valavi et al. [27], we recommend the use of spatial or temporal block sets where the structure and size are based on the correlation in the data and the overall prediction goals. If far-distance extrapolation is not needed, then LLO or LTO is sufficient, whereas block LLO may be overly pessimistic [28].

Fig. 3: Example cross-validation schemes for spatiotemporal data.

A Randomly partitioned k-fold cross-validation scheme (i.e., regular k-fold); (B) Leave-one-out (LOO); (C) Leave-time-out (LTO); (D) Leave-location-out (LLO); (E) Randomly spaced k-fold leave-location-out (Random LLO); (F) Block leave-location-out (Block-LLO). The key difference between random and block LLO is that the latter’s folds are all contiguous geographical sets. Blocks can be constructed via a regular square grid, hexagonal grids, clusters, or geographic and political features (e.g., states) [25, 27]. Each square represents a sample in the space-time domain. In each subfigure, the x-axis is the spatial dimension(s), and the y-axis is the time dimension. Thus, there are 10 unique spatial locations with three time points for each location in this example. The spatial spacing is regular (i.e., gridded) for simplicity, but the schemes apply to randomly spaced data. In this example, the test data (blue rectangle) represents one k-fold test set. The complete cross-validation would proceed such that every training data point (yellow rectangle) eventually serves as a test set exactly once.
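
As an illustrative sketch, k-fold LLO cross-validation (Fig. 3E) can be implemented by grouping all time points from a given location into the same fold; the example below uses scikit-learn’s GroupKFold on simulated data, with a random forest standing in for an arbitrary exposure model.

```python
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Simulated data: 10 spatial locations x 3 time points, 4 geographic covariates.
n_locations, n_times = 10, 3
location_id = np.repeat(np.arange(n_locations), n_times)  # group label per sample
X = rng.normal(size=(n_locations * n_times, 4))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=len(X))

# GroupKFold keeps every sample from one location in the same fold, so
# held-out locations are never seen during training (k-fold LLO, Fig. 3E).
mse_per_fold = []
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups=location_id):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    mse_per_fold.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))

print(np.round(mse_per_fold, 3))
```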

Useful validation statistics for model cross-validation are R2 and the mean-squared error (MSE). R2 is bounded between 0 and 1, with larger values indicating a higher proportion of the overall variance described by the model. MSE is bounded from 0 to infinity (lower is better) and includes bias, variance, and irreducible error. Gneiting and Katzfuss [29], and many others before them, argued for consideration of the joint distribution of predictions and observations. More simply, model assessment and selection methods should consider both prediction point statistics (e.g., mean, median) and the uncertainty quantification (e.g., variance). A proper scoring rule such as the continuous ranked probability score (CRPS) evaluates and penalizes not only the central tendency prediction (e.g., mean) but also under- and over-confident predictions of uncertainty. For example, an exposure assessment that predicts very large variances can trivially claim that observations never fall outside its prediction intervals; CRPS penalizes such poor variance estimates, which encourages exposure models with better usage in downstream risk assessment. Moreover, CRPS reduces to the mean absolute error for point predictions. The usage of proper scoring rules in geospatial exposure science is uncommon, but they have been utilized effectively in recent years as probabilistic models become more common [30,31,32,33,34]. For predictive models that produce uncertainty quantification, we highly recommend the usage of proper scoring rules (see Gneiting and Katzfuss [29]) for model assessment and validation.
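
As a sketch, the closed-form CRPS for a Gaussian predictive distribution illustrates how the score rewards calibrated uncertainty; in the hypothetical example below, both the over-confident (small variance) and under-confident (large variance) predictions score worse than the well-calibrated one.

```python
import numpy as np
from scipy.stats import norm

def crps_gaussian(y, mu, sigma):
    """Closed-form CRPS for a Gaussian predictive distribution N(mu, sigma^2).

    Lower is better; the score rewards both an accurate mean and a
    well-calibrated predictive variance.
    """
    z = (y - mu) / sigma
    return sigma * (z * (2.0 * norm.cdf(z) - 1.0)
                    + 2.0 * norm.pdf(z)
                    - 1.0 / np.sqrt(np.pi))

y_obs = 10.0
# Same mean error in all three cases; only the predictive spread differs.
for mu, sigma in [(11.0, 1.0), (11.0, 0.1), (11.0, 10.0)]:
    print(f"mu={mu}, sigma={sigma:5.1f}, CRPS={crps_gaussian(y_obs, mu, sigma):.3f}")
```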

Typically, a large number of potential covariates are calculated, with the final model determined through model selection or dimension reduction approaches. This avoids relying on a smaller, easier-to-test set of covariates that may miss important covariates and lead to less accurate predictions. The goal is to find \(X^{r} \subset X\), where \(X^{r}\) is a subset of the true or best covariates in the large design matrix, \(X\). Alternatively, dimension reduction may be used to estimate a new design matrix, \(X^{d}\), whose rank (i.e., number of independent columns) is much lower than that of the full design matrix:

$$rank({X}^{d})\ll rank(X)$$
(3.2.2)

Here, we describe common model selection and dimension reduction techniques developed in the statistical methodology literature that have been successfully applied in geospatial exposure science.

Stepwise and Ad-Hoc stepwise

Forward, backward, and stepwise regression are a family of model selection algorithms with a long history of use in many statistical modeling applications [35]. In stepwise regression, variables are added to or removed from the model one at a time. At each step, the variable with the highest correlation or lowest coefficient p-value that is significant (e.g., by F-test) is added. Variables previously in the model are tested for significance and may be dropped.

In geographic exposure assessments, a modified stepwise procedure is often used. In the European Study of Cohorts for Air Pollution Effects (ESCAPE) study [36], whose procedure has subsequently been adopted by the vast majority of land use regression (LUR) exposure science applications [37,38,39], LUR models were developed using the following modified procedure: (1) Every covariate is regressed against the outcome in univariable models. (2) The variable that results in the largest increase in R2 that is also greater than 0.01, is significant at p ≤ 0.05, abides by directional constraints determined a priori, does not change the direction of previous variables, and does not increase the p-value of previous covariates to greater than 0.05, is added to the model. The directional constraints are determined based on the expected mechanistic interpretation of the covariate: source variables are expected to increase pollution levels only and are thus constrained to be positive, whereas change and transport variables can increase or decrease pollution levels and so are typically not constrained. (3) With a given set of covariates, the process continues, adding one variable at a time, until the minimum increase in R2 is no longer met. An additional constraint for dealing with distance hyperparameters is often included: if a variable such as forest land cover percentage within a buffer is added, subsequent forest land cover buffer variables are either excluded or forced to have a much different distance hyperparameter (i.e., short- versus long-range mechanisms).

Modified stepwise procedures are the most frequently used approach for model selection and fitting in geospatial exposure assessments. Other algorithms include a distance decay regression selection strategy [40] and constrained forward nonlinear regression with hyperparameter optimization [41]. While these algorithms have been used for developing exposure assessments, the stepwise family of model selection strategies has many well-known limitations. Stepwise algorithms are known as “greedy” algorithms because they update a model one variable at a time, placing high importance on the local choice of one variable over another [42]. Additionally, they use hypothesis tests such as t- or F-tests that were designed for a small number of model tests and are thus not optimal for multiple test comparisons [43]. Lastly, the algorithms do not scale well to a large number of covariates unless additional steps are taken to reduce the candidate set of variables, which can inflate estimates of out-of-sample prediction accuracy. For these reasons, researchers have utilized model selection and reduction methods, such as penalization and dimension reduction, that have better statistical properties.

Penalized regression

Penalization (also known as regularization or shrinkage) is a model-fitting and selection technique that places a constraint (i.e., the penalty) on the model coefficients to shrink them towards zero. The underlying principle of penalization is that the model introduces bias (i.e., deviates from the minimum-variance least-squares estimate) into the coefficients in order to reduce the variance of each coefficient estimate. Moreover, the algorithms estimate penalized models along “regularization paths,” which shrink covariate coefficients towards zero in a continuous manner rather than all at once [44, 45]. This democratic version of model selection reduces the impact of a single covariate on the model selection process. Lastly, penalization approaches effectively perform model selection and coefficient estimation simultaneously, avoiding the multiple hypothesis testing violations of stepwise approaches.

A penalization can be written as a constrained optimization problem:

$${\min }_{\beta \in {{\mathbb{R}}}^{p}}\quad \{{(Y-f(X;\beta ))}^{2}\}\quad s.t.\quad P(\beta )\le t$$
(3.2.3)

in which the sum of squared loss on a generic model with input \(X\) and parameters \(\beta\), \(f(X; \beta)\), is constrained subject to (s.t.) a function, \(P(\cdot)\), being less than or equal to some value, \(t\); \(X\) is unit-normal standardized so that the solution does not depend on the scale of the covariates. Penalization approaches are widely used in LUR (section “Land-use regression”), Gaussian process (section “Geostatistical Models: Gaussian Processes, Kriging, and BME”), and machine learning (section “Machine learning”) models. For simplicity, we assume that \(f(X; \beta)\) is a simple linear model. The L2-norm penalty results in ridge regression, which can handle high multicollinearity and provides stable solutions:

$${\min }_{\beta \in {{\mathbb{R}}}^{p}}\quad \{{(Y-X\beta )}^{2}\}\quad s.t.\quad | | \beta | {| }_{2}^{2}\le t$$
(3.2.4)

However, ridge regression cannot reduce coefficients exactly to zero due to the geometry of the L2-norm. The least absolute shrinkage and selection operator (lasso) [46] is a popular penalty that performs simultaneous model selection and fitting, using the L1-norm as the penalty function:

$${\min }_{\beta \in {{\mathbb{R}}}^{p}}\quad \{{(Y-X\beta )}^{2}\}\quad s.t.\quad | | \beta | {| }_{1}\le t$$
(3.2.5)

There are many more penalties that can be used to perform model fitting and selection, including the elastic net [47], a combination of the lasso and ridge penalties, and non-concave penalties such as the smoothly clipped absolute deviation [45]. Penalization is often written in the so-called Lagrangian form, which leads to a classical linear regression with the Lagrange multiplier or penalty, λ:

$${\min }_{\beta \in {{\mathbb{R}}}^{p}}\quad \{{(Y-X\beta )}^{2}\}+\lambda P(\beta )$$
(3.2.6)

Equation (3.2.6) allows for faster, tailored unconstrained optimization algorithms such as Newton–Raphson, Nelder–Mead, or coordinate descent. Penalization approaches have been implemented in geospatial environmental exposure assessment applications [48,49,50]; however, they remain less common than stepwise and ad-hoc approaches. The authors recommend the adoption and use of penalization over stepwise and ad-hoc selection approaches.
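
A minimal sketch of lasso selection along a cross-validated regularization path, using scikit-learn on simulated covariates (where only three of fifty candidates truly matter), follows; note that for spatiotemporal data the folds should ideally follow the blocked schemes of section “Spatiotemporal model assessment and selection” rather than the random folds used here.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Simulated design matrix: 200 samples, 50 candidate geographic covariates,
# only 3 of which truly influence the exposure.
X = rng.normal(size=(200, 50))
beta_true = np.zeros(50)
beta_true[:3] = [2.0, -1.0, 0.5]
y = X @ beta_true + rng.normal(scale=0.5, size=200)

# Standardize covariates (so the penalty does not depend on covariate scale),
# then fit the lasso along a regularization path, choosing lambda by CV.
model = make_pipeline(StandardScaler(), LassoCV(cv=5))
model.fit(X, y)

coefs = model.named_steps["lassocv"].coef_
print("nonzero coefficients:", np.flatnonzero(coefs))
```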

Dimension reduction

Dimension reduction (DR) is another popular approach for reducing the complexity of the model covariate space. Briefly, DR is a general framework for reducing a high-dimensional covariate space to a representative, low-dimensional subspace. The approaches can be described as either linear or non-linear: the former are usually more interpretable, while the latter can capture non-linearities in complex, high-dimensional data. DR is a large area of statistical and applied research, so we highlight the most common methods used in environmental exposure assessment.

Principal components analysis (PCA) is the most basic and well-known dimension reduction technique [51]. Principal components are linear combinations of a normalized, high-dimensional covariate matrix such that each component maximizes the sample variance. The first principal component has the largest sample variance among all normalized linear combinations. Subsequent principal components maximize the variance and are orthogonal to the previous principal components. PCA has been used in geospatial exposure assessments and is particularly useful in Bayesian model fitting techniques where model selection may be difficult or infeasible. Recently, the partial least squares (PLS) linear DR technique was used for big-data land-use regression models of fine particulate matter [52] and nitrogen dioxide [53]. PLS is similar to PCA; however, it includes a dependent variable, Y, and maximizes the covariance between Y and the high-dimensional covariate set [52]. It is preferred over PCA since it accounts for relationships across dependent and independent variables [52].
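
The contrast between unsupervised and supervised linear DR can be sketched as follows, using scikit-learn’s PCA and PLSRegression on simulated data in which a high-variance covariate is irrelevant to the outcome.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)

# Simulated high-dimensional covariates; y depends on low-variance directions.
X = rng.normal(size=(300, 30))
X[:, 0] *= 10.0                       # high-variance but irrelevant covariate
y = X[:, 5] - X[:, 6] + rng.normal(scale=0.2, size=300)

# PCA: components chosen to maximize the variance of X alone, so the
# irrelevant high-variance covariate dominates the first component.
pca_scores = PCA(n_components=3).fit_transform(X)

# PLS: components chosen to maximize the covariance between X and y,
# so the supervised directions are more predictive of the outcome.
pls = PLSRegression(n_components=3).fit(X, y)
pls_scores = pls.transform(X)

print(pca_scores.shape, pls_scores.shape)  # (300, 3) (300, 3)
```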

Non-linear DR employs various techniques such as kernels, manifolds, auto-encoders [54], and other projections to transform and cluster high-dimensional data into low-dimensional subspaces. The methods are diverse and complex, and their full description is beyond the scope of this work. While non-linear DR methods have been employed in related fields such as pattern recognition and genomics, their implementation is sparse in geospatial exposure science. The authors are aware of only a few exposure assessment studies that have implemented the non-linear DR method of uniform manifold approximation and projection [55] as part of a larger machine learning approach for the prediction of PM2.5 and PM10 [56, 57]. In adjacent scientific fields, auto-encoders are being used for dimension reduction of complex spatiotemporal processes in climate and weather modeling [58, 59].

In practice, DR is used as a pre-processing step to reduce a high-dimensional geographic covariate set into a tractable, low-dimensional covariate set that can then be used in subsequent model estimation and prediction (see equation (3.2.2)). In linear DR, care and domain knowledge are critical for developing and interpreting reduced-dimension covariates, and interpretations are typically not as straightforward as for the raw covariates. Non-linear DR is advisable when prediction accuracy is the primary goal and clear interpretations are not necessary, as it typically captures more complexity in a smaller subspace than is possible with linear methods.

Types of models

This section describes the diverse landscape of models used in geospatial exposure science. Table 1 provides an overview of the models and their basic functional forms, such as Y = f(x) + ε.

Table 1 Summary of the model names, general formulation, section number, and equation number.

Proximity

Proximity exposure metrics are the most basic form of an exposure assessment because they rely only on the distance between a pollution source and the observed outcome location. Proximity exposure metrics have been used to elucidate the impacts of environmental exposures on human health, including asthma [60], cardiovascular disease [61], and reproductive fertility [62]. From a linear perspective, a proximity model is simply a deterministic covariate:

$$Y(p)=X(p)$$
(4.1.1)

where \(X\) is the deterministic quantity calculated at location \(p\) (note the lack of an error term). Here, we describe the most common and easily calculated proximity metrics. Given a distance matrix, \(d_{ij}\), the minimum distance is:

$${X}_{i}^{min}=\min ({d}_{i,\cdot })$$
(4.1.2)

where \(d_{i,\cdot}\) is the \(i\)-th row, indicating the distance between outcome \(i\) and every pollution source. The average distance is:

$$\overline{{X}_{i}}=\frac{1}{{n}_{j}}\mathop{\sum }_{j=1}^{{n}_{j}}{d}_{ij}$$
(4.1.3)

where \(n_j\) is the number of pollution sources. Buffer variables are a useful class of proximity metrics and can be calculated for areal, point, and line sources. Summary statistics, such as the mean or fraction within a given area around the location of interest, can be calculated. Figure 4 illustrates the most common buffer variables.

Fig. 4: Illustration of common buffer variables applied to land cover classification.

A Within four isotropic, circular buffers with radii \(R_j\), \(\{j = 1, \ldots, J\}\); (B) within an anisotropic, wind-rose-based buffer common in air quality studies; (C) within two upstream contributing areas, as commonly used in water pollution studies.

A proximity metric that incorporates the distance, density, and potential emissions is the sum of the exponentially decaying contributions of point sources [37]:

$${X}_{i}={\sum }_{j=1}^{J}{C}_{0j}\exp \left(\!-\frac{{d}_{ij}}{r}\right)$$
(4.1.4)

where \(X_i\) is the quantity at location \(i\), \(C_{0j}\) is an initial value such as the concentration or emissions at source \(j\), \(d_{ij}\) is the distance between site \(i\) and source \(j\), \(r\) is the exponential decay range, and \(J\) is the number of sources.
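
A minimal sketch of equation (4.1.4), with hypothetical distances, initial values, and decay range, follows.

```python
import numpy as np

def decay_contribution(d, c0, r):
    """Sum of exponentially decaying source contributions, equation (4.1.4).

    d  : [n x m] distance matrix between outcome locations and sources
    c0 : length-m vector of initial values (e.g., emissions) at each source
    r  : exponential decay range (same units as d)
    """
    return (c0[None, :] * np.exp(-d / r)).sum(axis=1)

# Hypothetical example: 3 outcome locations, 2 sources, distances in meters.
d = np.array([[100.0, 2000.0],
              [500.0, 1500.0],
              [5000.0, 300.0]])
c0 = np.array([10.0, 50.0])
print(np.round(decay_contribution(d, c0, r=1000.0), 2))
```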

Proximity metrics provide a simple exposure assessment model for point, line, and grid data. From an exposure model validation perspective, however, they are limited because there is no observed data to validate the model. For epidemiological studies, proximity metrics are tested directly against health outcome data with model selection or evaluation of multiple models [63]. If monitoring data are available for an exposure of interest, such as a chemical exposure, developing a model that directly predicts the chemical concentration is recommended. Proximity metrics are routinely developed and applied as geographic covariates in other types of exposure models, such as land use regression models.

Land-use regression

Briggs et al. [64] are largely credited with developing land-use regression (LUR) as a method for estimating air pollution exposure. Coincidentally, a similar method for estimating nutrient loads in river reaches was introduced in the same year [65]. Land-use regression is simply a linear or nonlinear regression model with spatially referenced geographic covariates. The most common linear LUR is:

$${{\bf{Y}}}(p)=X(p)\beta +\varepsilon$$
(4.2.1)

where \(\mathbf{Y}(p)\) are the n × 1 observations of the variable of interest (e.g., PM\({}_{2.5}\), \({{\rm{NO}}}_{3}^{-}\)) with space-time locations s and t; \(X(p)\) is an n × k design matrix of k spatial and/or spatiotemporal geographic covariates; β is a k × 1 vector of linear regression coefficients; and ε is the n × 1 vector of independent and identically distributed (i.i.d.) errors typical of a classical linear regression. Surface water [65] and similar groundwater models [22, 41] use nonlinear regression with linear source terms multiplied by exponential attenuation and transport terms.

A strength of LUR models is their flexibility to include other models and data sources as interpretable geographic covariates. The exposure models discussed in subsequent sections can be included as covariates in LUR models. Another strength of LUR geographic covariates is their flexibility with respect to distance parameters (i.e., distance hyperparameter). The distance hyperparameter is typically unknown, and thus calculating many of the same variables with varying distance hyperparameters is recommended [37, 40]. Model selection or dimension reduction is used to determine the best covariates and corresponding distance hyperparameters and provide insight into the spatial and temporal scales of the process of interest.

LUR models can be used to make exposure predictions at any location where geographic covariates exist. LUR model prediction mean and variance follow the same formulation as standard linear regression models. The prediction mean, \(\hat{Y}\), at new location, p*, is:

$$\hat{Y}({p}_{* })=X({p}_{* })\hat{\beta }$$
(4.2.2)

where \(\hat{\beta }\) is the estimated coefficient vector from equation (4.2.1). The LUR prediction variance at a new location, p*, depends on the variance of the residuals and the variance from estimating the true mean with the predictions [66]. It follows that the prediction variance is:

$$Var(\hat{Y}({p}_{* }))=Var(X({p}_{* })\hat{\beta })+Var(\varepsilon )$$
(4.2.3)

The variance of the residuals is the sample variance, \({S}_{Y}^{2}\). Expanding each component of equation (4.2.3), the general formulation can be written as [66]:

$$Var(\hat{Y}({p}_{* }))={S}_{Y}^{2}[1+X({p}_{* }){[X{(p)}^{T}X(p)]}^{-1}X{({p}_{* })}^{T}]$$
(4.2.4)

This general formulation for prediction variance is also seen in other regression-based models (Sections “Geographically Weighted Regression” and “Geostatistical Models: Gaussian Processes, Kriging, and BME”).
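
For illustration, equations (4.2.2) and (4.2.4) can be computed directly from the design matrix; the sketch below uses simulated data and ordinary least squares.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated LUR setup: n observations, k geographic covariates (with intercept).
n, k = 100, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# OLS estimate of beta for the linear LUR of equation (4.2.1).
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
s2 = np.sum((y - X @ beta_hat) ** 2) / (n - k)  # residual sample variance

# Prediction at a new location p* with covariate values x_star.
x_star = np.array([1.0, 0.5, -0.2, 1.0])
y_hat = x_star @ beta_hat                              # equation (4.2.2)
var_hat = s2 * (1.0 + x_star @ XtX_inv @ x_star)       # equation (4.2.4)

print(f"prediction: {y_hat:.2f} +/- {np.sqrt(var_hat):.2f} (1 SD)")
```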

Geographically weighted regression

Geographically weighted regression (GWR) is an extension of linear regression and LUR models that allows for spatially and/or spatiotemporally varying coefficients [67]. The main principle behind GWR is that the coefficients in spatial models are non-stationary; that is, the properties or values of model parameters vary depending on local conditions. The GWR extension of linear regression can be written as [68]:

$${{\bf{Y}}}(p)=X(p)\beta (p)+\varepsilon$$
(4.3.1)

where Y(p), X(p), and ε are the spatiotemporally varying outcome, the spatiotemporal covariates, and the i.i.d. error, respectively, as in equation (4.2.1). The coefficients β(p) are now spatially and temporally referenced. Mathematically, the coefficient estimates can be considered a version of generalized least squares [68] or a random effects model [69]. For the former, the coefficients are:

$$\beta (p)={\left(X{(p)}^{T}W{(p)}^{-1}X(p)\right)}^{-1}X{(p)}^{T}W{(p)}^{-1}Y(p)$$
(4.3.2)

where W(p) is a spatiotemporal weight matrix. Gelfand et al. [69] provide the general specification of the random effects approach for spatially varying coefficients as:

$$\tilde{{\beta }_{k}}(p)={\beta }_{k}+{\beta }_{k}(p)$$
(4.3.3)

which can be interpreted as a spatially varying random adjustment, βk(p), at locations p to the overall slope βk.

While GWR is not as popular as LUR in environmental exposure assessment, it is gaining favor and has been successfully implemented in several cases. Hu et al. [70] and van Donkelaar et al. [71] both implemented GWR for PM2.5 models that integrated a variety of geospatial covariates. van Donkelaar et al. [72] utilized GWR to estimate PM2.5 chemical composition, such as nitrates, sulfate, and organic matter. Kloog et al. [73] and Kloog et al. [74] developed random-effects models for prediction of PM2.5.

Brunsdon et al. [67] and Fotheringham et al. [68] provided frameworks for estimating the GLS-style GWR models, including a spatiotemporally varying weight matrix, W(p). Briefly, choices must be made about the bandwidth, or the distance over which coefficients are smoothed, and about the smoothing function (e.g., inverse-distance, Gaussian kernel). Algorithms are available to estimate these parameters systematically through a cross-validation procedure; however, more flexibility in estimation comes at the cost of increased computational burden. To simplify choices and computation time, van Donkelaar et al. [71] weighted coefficient estimates according to inverse distance from observations. Gelfand et al. [69] provide the details, such as the likelihood derivations and Bayesian estimation approaches, for the spatially varying random effects approach, including the posterior predictive estimates.
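
A minimal sketch of a local GWR fit at a single location follows, using a diagonal Gaussian kernel weight matrix in a weighted least-squares solve (one common instantiation of the weighting in equation (4.3.2)), applied to simulated data with an eastward-increasing slope.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data with a spatially varying slope: beta1 increases eastward.
n = 200
coords = rng.uniform(0, 100, size=(n, 2))            # (x, y) locations
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta1 = 1.0 + 0.02 * coords[:, 0]                    # spatially varying coefficient
y = X[:, 0] + beta1 * X[:, 1] + rng.normal(scale=0.2, size=n)

def gwr_coefficients(p_star, bandwidth):
    """Local weighted least-squares fit at p_star with Gaussian kernel weights."""
    d = np.linalg.norm(coords - p_star, axis=1)
    w = np.exp(-(d / bandwidth) ** 2)                # diagonal kernel weights
    XtW = X.T * w                                    # X^T W without forming diag(w)
    return np.linalg.solve(XtW @ X, XtW @ y)

# The local slope estimate tracks the true eastward increase in beta1.
for x_east in [10.0, 50.0, 90.0]:
    b = gwr_coefficients(np.array([x_east, 50.0]), bandwidth=20.0)
    print(f"x={x_east:4.0f}: local slope ~ {b[1]:.2f} (true {1.0 + 0.02 * x_east:.2f})")
```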

GLS-style GWR models also have a straightforward approach for making exposure predictions at any location where the geographic covariates exist. GWR model prediction mean and variance are similar to LUR, but with modifications for the spatially varying coefficients and weights matrix. The prediction mean, \(\hat{Y}\), at new location, p*, is:

$$\hat{Y}({p}_{* })=X({p}_{* })\hat{\beta }({p}_{* })$$
(4.3.4)

where \(\hat{\beta }({p}_{* })\) is the coefficient vector of equation (4.3.1) at location p*, estimated as in equation (4.3.2). The GWR prediction variance also follows the general formulation of equation (4.2.4). The prediction variance for GWR, which accounts for both the uncertainty in estimating the mean and the point prediction uncertainty, is [75]:

$$Var(\hat{{Y}_{\!\!* }})={S}_{Y}^{2}\left[1+{X}_{\!* }{[{X}^{T}{W}_{\!* }X]}^{-1}[{X}^{T}{W}_{\!* }^{2}X]{[{X}^{T}{W}_{\!* }X]}^{-1}{X}_{\!* }^{T}\right]$$
(4.3.5)

where the spatiotemporal index, (\({{p}_{\!* }}\)), is implied for brevity (i.e., \(\hat{Y}({p}_{* })\) in equation (4.3.4) is equivalent to \(\hat{{Y}_{\!* }}\) in equation (4.3.5)).

Geostatistical models: Gaussian Processes, Kriging, and BME

Geostatistical models contain explicit error terms that model spatial, temporal, or spatiotemporal auto-correlation in the data. In other words, they provide a function to interpolate, extrapolate, or smooth the dependent variable. They have a rich history across many scientific and computational fields, including forestry [1], geology [3], engineering [76], statistics [4, 24], machine learning [77], and, most recently, environmental health [78]. For this reason, there is often confusion in terminology, as nominally equivalent methods were developed in parallel among siloed disciplines. For example, the term “Kriging” is most popular in the engineering and public health literature, whereas “Gaussian process” is more often used in the spatial statistics and machine learning literature.

By definition, a Gaussian process (GP) is a collection of random variables, any finite number of which have a joint Gaussian distribution [77]. A GP is defined by a mean, μ(p), and a covariance between locations, Σ(p, p*):

$${{\bf{Y}}}(p)=GP(\mu (p),{\Sigma }_{\theta }(p,{p}_{* }))$$
(4.4.1)

Each location is defined by a marginal Gaussian distribution, and thus the number of parameters in the model increases with the sample size. Hence, a GP theoretically has an infinite parameter space and is considered non-parametric. Σθ is a covariance matrix that is modeled with kernel functions with parameters θ.

Geostatistical models can also be written as a mixed-effect model where the covariance between points is contained in the random effects term:

$${{\bf{Y}}}(p)=\mu (p)+\eta (p).$$
(4.4.2)

μ(p) can take many forms, such as linear, nonlinear, or even ML models such as random forest [79]. Here, μ(p) takes the form of a simple linear model, Xβ, and η(p) is an error term, which can be decomposed into independent and identically distributed error and spatiotemporally correlated error represented as a GP, η ~ GP(0, Σθ + τ2I). Σθ is a covariance matrix with parameters, θ, that accounts for correlation between spatial and temporal locations. Given that each \(\mathbf{y}_i\) has a Gaussian distribution, the vector of space-time observations, \(\mathbf{Y}\), has a multivariate Gaussian distribution. Thus, we can utilize the probability density function of a multivariate Gaussian to define the likelihood of equation (4.4.2) as [24]:

$$L(\beta ,\theta ;{{\bf{Y}}})={(2\pi )}^{-n/2}| {\Sigma }_{\theta }{| }^{-1/2}\exp \{-{({{\bf{Y}}}-X\beta )}^{T}{\Sigma }_{\theta }^{-1}({{\bf{Y}}}-X\beta )/2\}$$
(4.4.3)

where \(| \Sigma_{\theta} |\) is the determinant of the covariance matrix, a positive-definite matrix parameterized by the covariance or kernel function. The choice of covariance or kernel function is an active area of research, but, in exposure science, stationary, symmetric kernel functions such as the exponential, Gaussian (squared-exponential), and Matérn are the most common and are recommended. The squared-exponential or Gaussian covariance with variance σ2, length-scale (i.e., decay range) parameter r, and distance between locations, d, is one of the simplest and most common choices:

$$K(d \mid \sigma^{2}, r) = \sigma^{2}\exp(-d^{2}/r)$$
(4.4.4)

Letting σ = 1, note that as d moves toward 0, the correlation moves toward 1: the covariance and correlation between locations increase as points become closer together, with the rate and overall distance determined by the estimated covariance parameters. The squared exponential also has a special property that ensures the resulting functions are very smooth, or infinitely differentiable.

The Matérn kernel is a generalization of the Gaussian function that introduces a smoothness parameter to control how many times the sample paths can be differentiated. As the smoothness parameter approaches infinity, the squared-exponential covariance is recovered. Natural phenomena tend to have finite rather than infinite differentiability, and the Matérn’s theoretical properties are good [80], so the Matérn covariance is considered an appropriate choice for exposure and health applications.

Bayesian maximum entropy (BME) is a popular geostatistical approach that can be considered an extension of classical geostatistics methods. As in classical geostatistical methods, covariance parameters are estimated from an empirical covariance estimate, and the approach does not rely on distributional assumptions. The key aspect differentiating BME from classical geostatistics methods is that predictions can include non-Gaussian uncertainty at new locations. He and Kolovos [81] extensively reviewed BME, including its successful applications in geospatial exposure modeling.

MLE and Bayesian estimation can simultaneously estimate the mean and variance components of equation (4.4.1), which is statistically more optimal than the empirical approach but introduces computational challenges. With a large number of locations, the likelihood in equation (4.4.3) becomes computationally difficult to evaluate because the inverse of the covariance matrix Σ is dense. Nearest-neighbor approximations such as predictive processes [82] and general Vecchia approximations [83] are among the simplest and most effective techniques: correlations between points that are far away from each other are essentially ignored. Nearest-neighbor approximations are suitable for MLE or Bayesian inference. A popular approach for efficient Bayesian inference is the integrated nested Laplace approximation (INLA), which uses a stochastic partial differential equation approximation of a multivariate Gaussian random field and the Laplace approximation for posterior distributions [84]. Moran and Wheeler [85] developed a Gibbs sampling algorithm for rapid Bayesian inference utilizing hierarchical matrix approximations. Combined mean and GP estimation is further discussed in section “Hybrid” since the distinction between some GP and hybrid methods is blurry.

Kriging often refers to the explicit step of using a geostatistical model for prediction at new locations. For the covariance matrix Σ, we notate the dimension representing new prediction locations with the subscript * and observations otherwise. The Kriging prediction mean, \(\hat{Y}\), assuming a linear mean, is:

$$\hat{Y}({p}_{* })=X({p}_{* })\hat{\beta }+\left[K({p}_{* },p){[K(p,p)+{\tau }^{2}I]}^{-1}[Y(p)-X(p)\hat{\beta }]\right]$$
(4.4.5)

where \(X({p}_{* })\hat{\beta }\) is the linear mean at the prediction locations, K(p*, p) (e.g., equation (4.4.4)) is the estimated covariance matrix between prediction locations and observations, K(p, p) is the estimated covariance matrix between observations, τ2I is the independent error added to the K(p, p) diagonal, and [\(Y(p)-X(p)\hat{\beta }\)] is the residual at the observations. The Kriging prediction mean can be described as the linear regression mean at the prediction locations plus the interpolated residuals of the exposure data observations. The strength of the residual interpolation is based on the covariance parameters. The Kriging/GP prediction variance is:

$$Var(\hat{Y}({p}_{* })) = K({p}_{* },{p}_{* }) - K({p}_{* },p){[K(p,p)+{\tau }^{2}I]}^{-1}K(p,{p}_{* })$$
(4.4.6)

where K(p*, p*) is the covariance matrix between prediction locations, and \(K(p,{p}_{* })=K{({p}_{* },p)}^{T}\) is the transpose of the covariance between prediction locations and observations. The Kriging/GP variance can be described as the total estimated variance at the prediction locations minus the variance from the additional information contributed by the observation residual interpolations.
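
For illustration, equations (4.4.4)–(4.4.6) can be assembled directly; the sketch below performs simple kriging (a zero mean, dropping the \(X\hat{\beta}\) terms of equation (4.4.5)) with fixed, rather than estimated, covariance parameters on simulated one-dimensional data.

```python
import numpy as np

rng = np.random.default_rng(0)

def sq_exp_kernel(d, sigma2=1.0, r=10.0):
    """Squared-exponential covariance, equation (4.4.4)."""
    return sigma2 * np.exp(-d ** 2 / r)

def dist(a, b):
    """Pairwise distance matrix between two sets of locations."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)

# Simulated noisy observations of a smooth 1-D spatial process.
p = rng.uniform(0, 10, size=(30, 1))                 # observation locations
y = np.sin(p[:, 0]) + rng.normal(scale=0.1, size=30)
p_star = np.linspace(0, 10, 5)[:, None]              # prediction locations

tau2 = 0.01                                          # nugget (i.i.d. error) variance
K_oo = sq_exp_kernel(dist(p, p)) + tau2 * np.eye(len(p))
K_so = sq_exp_kernel(dist(p_star, p))
K_ss = sq_exp_kernel(dist(p_star, p_star))

# Zero-mean versions of equations (4.4.5) and (4.4.6).
mean = K_so @ np.linalg.solve(K_oo, y)
cov = K_ss - K_so @ np.linalg.solve(K_oo, K_so.T)

for loc, m, v in zip(p_star[:, 0], mean, np.diag(cov)):
    print(f"p*={loc:5.2f}: {m:6.3f} +/- {np.sqrt(v):.3f}")
```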

Machine learning

Machine learning (ML) describes predictive modeling focused on a learning algorithm and out-of-sample prediction generalization [86]. ML methods have fewer assumptions and are highly parameterized, and are thus more flexible for capturing complex non-linearity. ML methods for geospatial exposure assessment utilize the same geographic covariates as predictor variables as LUR (section “Land-use regression”), GWR (section “Geographically weighted regression”), and geostatistical models (section “Geostatistical Models: Gaussian Processes, Kriging, and BME”), and can capture non-linear relationships within and across covariates. Nonetheless, great care is needed to estimate an ML model properly, as they are often susceptible to over-fitting. Their success in a wide variety of computational applications has led to their adoption in geospatial exposure modeling. A general ML equation is [87]:

$${{\bf{Y}}}(p)=f(X(p))$$
(4.5.1)

where \(f(\cdot)\) is a difficult-to-express function that encompasses a wide variety of forms in ML. Prediction for ML models varies but, in general, utilizes parameters from the estimation process and geographic covariates at prediction locations:

$$\hat{Y}({p}_{* })=f(X({p}_{* }))$$
(4.5.2)

ML models typically estimate a central tendency and do not have an explicit prediction variance equation; however, bootstrapping techniques provide a straightforward way to determine approximate prediction variance. For details on the methodologies of ML models and algorithms including generalized additive models, tree methods, boosted and additive trees, support vector machines, and neural networks, we refer readers to Hastie et al. [88]. Additionally, Yan [87] summarized and provided examples of ML for chemical safety applications, including geospatial exposure assessments. Here, we discuss the basic equations and properties of two classes of ML methods that have been successfully applied to geospatial exposure modeling: neural networks and ensemble models.

Neural networks

Neural networks, or artificial neural networks (ANN), are inspired by the structure and function of the human brain. Neural networks consist of layers of interconnected nodes, known as artificial neurons, that process and transmit information through the network. The connections between neurons are weighted, and the weights are updated during the training process to improve the accuracy of the network’s predictions. At their simplest, ANN are essentially repeated logistic regression models. However, ANN represent a modern frontier in statistics and ML where improvements and new models abound. More complicated ANN, referred to as deep learning, allow computational models composed of multiple processing layers to learn representations of data with multiple levels of non-linearity and abstraction. For comprehensive explanations of neural networks and deep learning, we refer the readers to Bishop [89], LeCun et al. [90], and Goodfellow et al. [91]. Additionally, Yan [87] provided an overview of ANN in exposure science applications.

ANN have been successfully applied in geospatial exposure modeling. Di et al. [92] and Di et al. [93] developed highly accurate, annual average PM2.5 and ozone predictions, respectively. They noted that convolutional layers have the attractive property of accounting for spatial autocorrelation and spatial scale hierarchies (i.e., long range vs. short range). Pyo et al. [94] used ANN to predict cyanobacteria algae blooms in surface water, which are important for human and ecological health applications. Müller et al. [95] and Azimi et al. [96] used ANN to predict groundwater quantity and quality, respectively. ANN have improved predictions of social determinants and their associations with health outcomes [97]. Lastly, Weichenthal et al. [98] discussed future research directions for ANN in exposure science applications.

Ensemble methods

Individual geospatial exposure models, no matter how sophisticated, have strengths and weaknesses compared to alternative model choices. For example, one model may capture low concentrations better than high concentrations while another may capture regional variability better than fine, local-scale variability. Ensemble models are a class of ML algorithms based on the simple concept that a large committee of models is better than an individual model. Here, we describe ML methods based on an ensemble of base or weak models. This differs from ensemble models that are combinations of multiple other full models, often referred to as meta-learners, super-learners, or hybrid models. These are discussed in section “Hybrid”.

In an ensemble model, the final prediction is a weighted average of multiple models:

$$Y(p)={\sum }_{m=1}^{M}{w}_{m}{f}_{m}(X(p)),\quad {\sum }_{m=1}^{M}{w}_{m}=1$$
(4.5.3)

where fm is an individual model, m = 1, …, M, and the weights, wm, sum to 1 and are typically estimated through an optimization procedure. If the weights are equal, wm = 1/M, then a simple average of base models is used. If a weighted average is desired, a simple linear model or GAM can serve as the meta-learner.
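A minimal sketch of equation (4.5.3), assuming the weights are estimated by non-negative least squares on held-out predictions and then renormalized to sum to 1; the base-model predictions F and validation observations y_val are synthetic placeholders.

```python
import numpy as np
from scipy.optimize import nnls

# Sketch of Eq. (4.5.3): estimate ensemble weights w_m from held-out
# base-model predictions; F and y_val are hypothetical placeholders.
rng = np.random.default_rng(2)
y_val = rng.normal(size=100)                      # held-out observations
F = np.column_stack([y_val + rng.normal(scale=s, size=100)
                     for s in (0.2, 0.5, 1.0)])   # M = 3 base-model predictions

w, _ = nnls(F, y_val)   # non-negative least-squares weights
w = w / w.sum()         # renormalize so the weights sum to 1
y_ens = F @ w           # final ensemble prediction
```

Non-negativity is a convenience here, not a requirement of equation (4.5.3); an unconstrained linear meta-learner or a GAM, as noted above, is equally common.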

Hastie et al. [88] extensively explained tree-based ensembles for regression and classification. Briefly, a tree-based model splits data into hierarchical subsets based on certain features; at each split, it applies a decision rule to partition the data and fits a simple model, such as a constant. The prediction for a new sample is determined by traversing the tree, applying the decision rules at each node until a terminal node is reached, and using the average target value associated with that terminal node as the prediction. Breiman [99] introduced bootstrap aggregating and random forests, in which a given data sample is bootstrapped M times (i.e., randomly sampled with replacement). A tree-based model is then fit on each bootstrap sample, and the final prediction is the average of all bootstrap model predictions. Random forest and bootstrap-aggregated models can be efficient because the individual models are easily parallelizable.

An alternative approach to developing ensemble models is to build the final model sequentially with base or weak-learner models, where each model attempts to improve slightly over the previous aggregated models. Informally, in gradient boosting [100], simple models are added in a stage-wise manner optimized via gradient descent on the current model’s residuals:

$${Y}_{m}={Y}_{m-1}+\nu {y}_{m}$$
(4.5.4)

where Ym is the gradient-boosted model at iteration m, Ym−1 is the previous iteration’s full model, ym is the current base or weak-learner model, and ν is a penalization (learning-rate) parameter between 0 and 1 that prevents the algorithm from fitting too aggressively at each stage, which would reduce effectiveness.
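The following hand-rolled sketch of equation (4.5.4) uses squared-error loss, for which the negative gradient is simply the current residual, and shallow regression trees as the weak learners; the data and settings are synthetic and illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Sketch of Eq. (4.5.4): stage-wise boosting on residuals with shallow trees.
rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=(300, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + 0.1 * rng.normal(size=300)

nu, M = 0.1, 200                    # penalization (learning rate), stages
Y_m = np.full_like(y, y.mean())     # Y_0: initialize at the mean
learners = []
for m in range(M):
    resid = y - Y_m                                  # negative gradient
    tree = DecisionTreeRegressor(max_depth=2).fit(X, resid)
    Y_m = Y_m + nu * tree.predict(X)                 # Y_m = Y_{m-1} + nu * y_m
    learners.append(tree)

def predict(X_new):
    return y.mean() + nu * sum(t.predict(X_new) for t in learners)
```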

Ensemble models have been used with great success in geospatial exposure modeling. Random forest has been used in multiple studies to predict groundwater nitrate vulnerability [10, 101,102,103]. Ransom et al. [104] utilized boosted regression trees to predict groundwater nitrate concentrations in the Central Valley aquifer in California, USA. Gradient and extreme gradient boosting have also been used extensively to model spatiotemporal concentrations of air pollutants such as PM2.5 [105,106,107]. Zhan et al. [108] added a spatial weighting to gradient boosting and reported better results for the spatiotemporal prediction of PM2.5 than without geographic weighting. Lastly, new approaches have added Gaussian processes to random forest [109] and gradient boosting [109] to improve ensemble models with spatial data.

Mechanistic or chemical transport

The exposure models discussed in previous sections are considered statistical models. Conversely, mechanistic or chemical-transport models (CTM) represent a class of models derived from basic principles of physics and chemistry, such as conservation of energy, resulting in a system of partial differential equations [110]. Many CTM used in exposure science are derived from a general advection-dispersion model based on the principle of conservation of mass [111, 112]:

$$Y(p):= \frac{\partial {C}_{i}}{\partial t}=-{{\boldsymbol{\nabla }}}\cdot \left({{\bf{v}}}{C}_{i}\right)+{{\boldsymbol{\nabla }}}\cdot \left(D{{\boldsymbol{\nabla }}}{C}_{i}\right)+{\sum }_{j=1}^{n}{r}_{i,j}+{G}_{i}$$
(4.6.1)

where Y(p), the exposure measure of interest at locations p = (s, t), is defined as \(\partial {C}_{i}/\partial t\), the rate of change of the concentration of species i; \({{\boldsymbol{\nabla }}}\cdot \left({{\bf{v}}}{C}_{i}\right)\) is the advective transport out of a defined domain due to a velocity v such as wind or water flow; \({{\boldsymbol{\nabla }}}\cdot \left(D{{\boldsymbol{\nabla }}}{C}_{i}\right)\) is the diffusive transport with diffusion coefficient D; \({\sum }_{j=1}^{n}{r}_{i,j}\) is the net formation of species i from all species j; and Gi is the net internal generation of species i, such as emissions and deposition losses. \({{\boldsymbol{\nabla }}}\cdot\) and \({{\boldsymbol{\nabla }}}\) are the spatial three-dimensional divergence and gradient operators, respectively. The general form of equation (4.6.1) provides a starting point for most CTM derivations. The exact derivation depends on factors such as the extent of the property, the number of phases (e.g., gas, particle), the conservation law (e.g., mass, momentum), closure relations, and numerical approximations. For example, for air quality, the entire diffusion term is often ignored since it is minimal compared to advective transport. However, in groundwater transport, diffusion is typically an important term.
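As a toy illustration of how equation (4.6.1) is solved numerically, the sketch below integrates a one-dimensional special case (constant velocity and diffusion coefficient, no chemical reactions, one steady emission source, periodic boundaries) with an explicit upwind finite-difference scheme; all parameters are illustrative, and operational CTM use far more elaborate numerics.

```python
import numpy as np

# Toy sketch of a 1-D special case of Eq. (4.6.1):
# dC/dt = -v dC/dx + D d2C/dx2 + G, constant v and D, no reactions,
# periodic boundaries via np.roll. Parameters are illustrative only.
nx, dx, dt = 200, 50.0, 1.0    # grid cells, cell size (m), time step (s)
v, D = 2.0, 10.0               # advection velocity (m/s), diffusivity (m2/s)
assert v * dt / dx <= 1.0      # Courant condition for the upwind scheme
assert D * dt / dx**2 <= 0.5   # diffusion stability condition

C = np.zeros(nx)               # concentration field C_i
G = np.zeros(nx)
G[20] = 1.0                    # steady emission at one grid cell

for _ in range(500):           # march forward in time
    adv = -v * (C - np.roll(C, 1)) / dx                          # upwind advection
    diff = D * (np.roll(C, -1) - 2 * C + np.roll(C, 1)) / dx**2  # diffusion
    C = C + dt * (adv + diff + G)
```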

A large community of researchers is focused on elucidating physical and chemical transport outside the human health context using CTM, but their use as a geospatial exposure assessment tool is also a common objective. In the air quality and health field, the Community Multiscale Air Quality (CMAQ) model [113], the Comprehensive Air Quality Model with Extensions (CAMx) [114], the Weather Research and Forecasting model coupled with Chemistry (WRF-Chem) [115], and the Modern-Era Retrospective analysis for Research and Applications Version 2 (MERRA-2) [116] are examples of actively developed CTM. Since CTM are large systems of partial differential equations, they are computationally demanding and often require specialized training. To circumvent some of the computational issues and support geospatial exposure modeling scenarios, reduced-complexity CTM have been developed. Tessum et al. [112] developed the Intervention Model for Air Pollution (InMAP), which utilizes variable grid sizes and simplified physics and chemistry to provide computational scalability. Reduced-complexity models excel at testing “what-if?” exposure assessment scenarios by simulating different parameterizations, such as zeroing out emissions by location or industry. Tessum et al. [117] utilized InMAP to evaluate multiple exposure scenarios of air pollution emissions, industries, and the exposed racial and ethnic populations.

Dispersion models are a class of mechanistic models commonly used for air quality exposure assessments that focus on transport and neglect or simplify chemistry. For example, R-LINE is a steady-state Gaussian plume model designed to simulate line-type source emissions (e.g., mobile sources along roadways) by numerically integrating point source emissions [118]. AERMOD is a USEPA regulatory dispersion model for industrial and point sources. Dispersion models are excellent tools for geospatial exposure assessment in small-scale (i.e., city-block) applications when in-situ monitoring data are not available or when the focus is on a small set of point or line sources.

For groundwater assessment, the USGS developed and maintains MODFLOW, a three-dimensional mechanistic groundwater flow and transport model [119] comparable to CMAQ and MERRA-2 for air quality. Gallagher et al. [120] utilized MODFLOW to estimate the historical impact of effluent on drinking water wells for calculating exposures to examine the association between wastewater effluent in drinking water and breast cancer. Reduced-complexity, mechanistically based hydrological models are available that simplify the transport or chemistry components of the full conservation equation. Beven and Kirkby [121] developed a topography-based hydrological model, a conceptual tool based on the topographic wetness index that allows the simulation of hydrological processes, particularly the dynamics of surface or subsurface contributing areas, in a simplified mechanistic approach.

Since CTM carry a high expertise and computational burden, many governmental agencies and consortia make model output across varying spatial and temporal domains available online. CMAQ-based output of air toxics is available for the conterminous US for 2014 through the National Air Toxics Assessment. The NASA Global Modeling and Assimilation Office provides regular MERRA-2 output, including daily and sub-daily simulations of many air quality parameters such as ozone, aerosols, and gas-phase pollutants.

Hybrid

The final class of geospatial exposure models brings together many aspects of the previous models into a hybrid framework. The fundamental principle of hybrid models is similar to the motivation of ensemble models in machine learning: a consensus of multiple models offers advantages in terms of robustness and ability to handle complex data. In geospatial exposure applications, hybrid development is difficult and time-consuming since it requires fitting multiple types of models, but hybrid models consistently outperform single-model methods in terms of prediction accuracy. The types of hybrid models used in exposure modeling include (1) model output as a direct input to another model, (2) sequential application of models on model residuals, and (3) ensemble models.

The first type of hybrid model involves using the output of one geospatial model for another geospatial model. A common implementation is the integration of proximity models (section “Proximity”), CTM (section “Mechanistic or chemical transport”), or satellite-derived data as covariates in models such as a GP [122] or ANN [92]. This is a straightforward approach to integrating models with varying spatial or temporal scales since there is no strict requirement that covariate scales match that of the outcome.

The second type of hybrid model is the sequential application of models on the previous model’s residuals. The most common example is a LUR (i.e., linear regression) followed by a geostatistical model such as GP/kriging [123, 124] or BME [125]. This two-stage approach applies geostatistical modeling to the residual correlation of the LUR model, which improves overall prediction accuracy. However, the two-stage approach is sub-optimal since the LUR stage assumes independent errors, effectively admitting that the LUR model assumption was violated. Consequences include inflated LUR coefficient variances, reduced model selection sensitivity and specificity, and an overall reduction in prediction accuracy due to mis-specified models in the LUR stage. Messier and Katzfuss [33] developed an LUR-kriging approach that simultaneously selects and estimates LUR coefficients and GP covariance parameters with a scalable penalized likelihood approach. This approach to hybrid sequential model fitting is closer to optimal, as evidenced by improved prediction and model selection.

The third type of hybrid model is the ensemble model approach, also known as meta-learners or super-learners [126]. In this approach, multiple geospatial models are fit, and the final prediction is then derived from those models, using, for example, the weighted average of all the models, or using another model with those models as inputs. Requia et al. [9], Danesh Yazdi et al. [127], and Yu et al. [128] fit multiple ML models followed by a cross-validated meta-model that weights the final ensemble predictions of air pollution. Murray et al. [129] utilized a Bayesian model averaging method that provides full uncertainty quantification in base and ensemble model predictions.

Hybrid models also serve multiple purposes when combining mechanistic models (e.g., CTM) and observation-based statistical models. First, for CTM, calibration to observations reduces known biases due to errors in emissions inventories, chemical mechanisms, and coarse spatial resolution. Second, hybrid models downscale, or increase the spatial resolution of, CTM and satellite imagery while retaining benefits such as detailed emissions, meteorology, and large spatial and temporal domains.

Geospatial data integration

Geospatial data linkages are required for calculating geographic covariates and connecting exposure data and models to health data. To facilitate discussion of data linkages, some prerequisite definitions are required.

Geometry refers to the spatial representation of objects including the shape, size, and relative position. There are three types of spatial geometry: point, line, and area (e.g., polygons, grids).

The scale of geospatial data refers to the extent over which a process varies; in other words, the minimal distinguishable size, length, or extent of spatial variability. Scale is both a spatial and a temporal property.

The support refers to the geometry, volume, shape, and orientation for geospatial data [130]. It is rare that multivariable exposure or health datasets are exactly aligned in their support, thus assumptions and methods are needed to ensure clarity and validity in linking geospatial data. Linking datasets may include transforming data to another support (e.g., point to area, area A to area B), which is known as the change-of-support problem [130, 131].

For non-uniform data, spatial geometries of points, lines, and polygons are referred to as vector data. Vector data are defined by spatially referenced points (e.g., latitude/longitude pairs) or sequences of spatially referenced points for lines and polygons. For uniform geometry data, the data may be represented as a grid where each grid cell represents an area value; this is referred to as raster data. Raster data can be defined simply by a geographic bounding box (i.e., 4 corners), a coordinate reference system, and a grid cell size. Geospatial data can be converted from raster-to-vector or from vector-to-raster to facilitate linkages (section “Spatial linkage considerations”); however, these conversions can also introduce error to exposure assessment [8].

Spatial linkage considerations

Here, we describe spatial linkages based on the source and receptor concept. These linkages apply to exposure-to-health data connections, exposure-to-exposure transformations, and exposure metric calculations. Available linkage methods depend on the combination of spatial geometry data types, as illustrated in Fig. 5. Table 2 provides examples of source and receptor data for each spatial geometry type represented in Fig. 5.

Fig. 5: Illustration of exposure and health data by spatial geometry relationships.
figure 5

Point sources (e.g., industrial plants) can be linked via distance relationships and areal summarization to point receptors (e.g., home addresses), lines (e.g., roads), and areas (e.g., census tracts) as shown in a, d, g, respectively. Line sources impacting point, line, and area receptors are shown in b, e, h, respectively. Line receptors (i.e., d–f) are possible, such as a range of uncertain addresses, but are much less common than point and area receptors. c demonstrates how areal source data, such as gridded satellite data, can be assigned to point receptor locations. Note that repeated values are possible if multiple point receptors lie within the same source grid cell or polygon. i shows area-to-area statistics or summarizations.

Table 2 Example source and receptor data by spatial geometry type.

Figure 5a shows point-to-point connections. These connections are common for calculating exposure metrics at pollutant monitoring locations. Geographic covariates, such as the sum of exponentially decaying contributions, are calculated for the exposure metric of interest (e.g., fine particulate matter (PM2.5) air pollution concentration) at the receptor location with weighted contributions from the point source data. Point-to-point linkages are also possible for connecting point source exposure data to health receptor locations. For instance, the receptor data may be a patient residential location and the point source data may be exposure measurements at nearby monitors. An average of the point source data within a given circular buffer is a reasonable exposure metric. However, we recommend developing exposure models based on the point source data rather than observation statistics, which may be biased due to differences in observation (here, point source) density around each receptor.
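A minimal sketch of a sum-of-exponentially-decaying-contributions covariate for point-to-point linkages; the coordinates, source strengths, and decay range are hypothetical placeholders.

```python
import numpy as np

# Sketch: "sum of exponentially decaying contributions" covariate at point
# receptors from point sources; all values are hypothetical.
rng = np.random.default_rng(4)
sources = rng.uniform(0, 10_000, size=(100, 2))   # point-source coords (m)
strength = rng.uniform(1, 5, size=100)            # e.g., emission rates
receptors = rng.uniform(0, 10_000, size=(25, 2))  # e.g., monitor locations
decay = 1_000.0                                   # e-folding distance (m)

d = np.linalg.norm(receptors[:, None, :] - sources[None, :, :], axis=-1)
covariate = (strength[None, :] * np.exp(-d / decay)).sum(axis=1)
```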

Figure 5b shows line source data contributing to a point receptor. This is most common in exposure geographic covariate calculations, such as estimating the contributions of road emissions to a monitoring location [118]. The line is typically discretized into a sequence of points followed by point-to-point calculations. Similar to point-to-point connections, a line source exposure can be linked to point receptor health data; however, this is likely a crude proxy, such as a proximity model (section “Proximity”). We recommend calculating a covariate that accounts for the line source data and then connecting that covariate with the point receptor health data.

Figure 5c shows the areal (raster) source data contributing to a point receptor. This linkage is common for both exposure geographic covariate calculations and exposure to health receptor connections. Raster data sources are linked to point measurement data by either (1) extracting the exact grid cell value of the raster that intersects the point or (2) calculating a weighted average of nearby raster grid cells, which smooths spatial variability in the raster source data.
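A sketch of option (1), exact grid-cell extraction, using only the bounding-box and cell-size definition of a raster from the previous section; the raster values and point coordinates are hypothetical, and the points are assumed to share the raster's coordinate reference system.

```python
import numpy as np

# Sketch of option (1): extract the exact raster grid-cell value
# intersecting each point receptor; all values are hypothetical.
raster = np.random.default_rng(5).random((100, 100))  # areal (raster) source
x_min, y_max, cell = 0.0, 100.0, 1.0                  # upper-left corner, cell size

points = np.array([[10.3, 42.7],                      # receptor (x, y) locations
                   [55.0, 81.2]])
cols = ((points[:, 0] - x_min) / cell).astype(int)
rows = ((y_max - points[:, 1]) / cell).astype(int)    # row index counts downward
values = raster[rows, cols]                           # exact grid-cell values
```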

Figure 5d–f represent line geometry receptor data, which are shown for completeness; however, it is rare that exposure or health data are represented with this geometry.

Figure 5g, point source data to areal (polygon) receptor data, is a change-of-support problem that commonly arises when connecting exposure data or models to health data and when calculating exposure geographic covariates. The point-level support is up-scaled to areal support by calculating the average of the point-source data within the polygon receptor. Equation (5.1.1) shows the calculation from point support, s, to areal support with area B:

$$X(B)=\frac{1}{| B| }{\int }_{B}X(s)ds$$
(5.1.1)
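In practice, equation (5.1.1) is applied in its discrete form by averaging the point-source values falling within each receptor polygon. A sketch with GeoPandas, assuming hypothetical file names and a hypothetical "value" column:

```python
import geopandas as gpd

# Sketch of Eq. (5.1.1) in discrete form: average the point-source values
# within each receptor polygon via a spatial join. File names and the
# "value" column are hypothetical placeholders.
points = gpd.read_file("point_sources.gpkg")     # point support, X(s)
polys = gpd.read_file("receptor_polygons.gpkg")  # areal support, B

joined = gpd.sjoin(points, polys, how="inner", predicate="within")
x_b = joined.groupby("index_right")["value"].mean()  # X(B) per polygon
polys["X_B"] = polys.index.map(x_b)                  # attach back to receptors
```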

Figure 5h shows the connection between line source data and polygon (areal) receptor data. This linkage is common for linking exposure metrics directly to areal-level health data, such as disease rates within census boundaries. Areal statistics such as total line length and line density can be calculated for each receptor polygon, as in the sketch below. If a pollutant source is associated with the line source, it is more appropriate to first calculate an exposure covariate on a gridded receptor and then use areal-source-to-areal-receptor linkages.
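A sketch of the line-length and line-density areal statistics, assuming hypothetical file names and a projected (metric) coordinate reference system so that lengths and areas are in meters:

```python
import geopandas as gpd

# Sketch: total line length and line density per receptor polygon,
# e.g., road length within census boundaries. File names are hypothetical.
lines = gpd.read_file("roads.gpkg")
polys = gpd.read_file("tracts.gpkg")

roads = lines.geometry.unary_union  # merge line sources (union_all() in newer GeoPandas)
polys["line_length"] = polys.geometry.apply(lambda b: roads.intersection(b).length)
polys["line_density"] = polys["line_length"] / polys.geometry.area
```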

Figure 5i shows the areal source to areal receptor linkage. This is a change-of-support problem that involves transforming the scale of the source data. Equation (5.1.2) shows this linkage is a weighted average of the source data, where the weights are the proportion/fraction, pi, of grid cell i in the receptor boundary B:

$$X(B)=\mathop{\sum }_{i=1}^{k(B)}{p}_{i}{X}_{i}$$
(5.1.2)

where there are k(B) total overlapping units in B.
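A small worked sketch of equation (5.1.2) with Shapely, computing the overlap fractions pi between a hypothetical 3 × 3 grid of source values and a receptor polygon B:

```python
import numpy as np
from shapely.geometry import box

# Sketch of Eq. (5.1.2): area-weighted average of gridded source data
# within a receptor boundary B; grid and values are hypothetical.
cell = 1.0
values = np.arange(9.0).reshape(3, 3)   # X_i on a 3x3 raster
B = box(0.5, 0.5, 2.5, 2.5)             # receptor polygon

num, den = 0.0, 0.0
for r in range(3):
    for c in range(3):
        cell_geom = box(c * cell, r * cell, (c + 1) * cell, (r + 1) * cell)
        overlap = cell_geom.intersection(B).area
        if overlap > 0:
            p_i = overlap / B.area      # fraction of B covered by cell i
            num += p_i * values[r, c]
            den += p_i
X_B = num / den                          # weighted average X(B)
```

When B is fully covered by the grid, the weights pi sum to 1 and the denominator is unnecessary; it is retained here to handle partial coverage.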

Temporal linkage considerations

Linking geospatial data can involve accounting for differences in temporal support. Temporal scale (i.e., frequency) can vary from seconds to decades among geospatial data sources. For example, meteorological data from satellites are often available at the hourly scale, whereas social data from surveys are often available at the annual or decadal scale. Similarly, the temporal range (i.e., time period covered) can vary from shorter-range, such as study-specific data collection campaigns (e.g., covering a single season), to longer-range, such as routine data collection from government agencies (e.g., covering multiple decades).

Several approaches are available to link data with disparate temporal support. Finer-scale temporal data can be linked to coarser-scale temporal data using aggregate summary metrics (e.g., by calculating the annual mean of hourly measurements). Likewise, coarser-scale temporal data can be linked to finer-scale temporal data using various temporal downscaling methods (e.g., statistical downscaling methods applied to meteorological data [132]). Data with disparate temporal coverage can be linked by nearest available time point or using temporal interpolation methods [133]. In the limit, purely spatial covariates (e.g., built environment, social determinants of health) can be used alongside temporally varying covariates in models, with the spatial covariate value repeated across time.
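A brief pandas sketch of both directions of temporal alignment, aggregating a hypothetical hourly series to annual means and repeating a hypothetical annual covariate across hours:

```python
import numpy as np
import pandas as pd

# Sketch: aligning disparate temporal support; both series are hypothetical.
hours = pd.date_range("2020-01-01", "2021-12-31 23:00", freq="h")
rng = np.random.default_rng(6)
pm25 = pd.Series(rng.gamma(2, 4, len(hours)), index=hours)

# Finer -> coarser: aggregate hourly measurements to annual means.
annual_pm25 = pm25.groupby(pm25.index.year).mean()

# Coarser -> finer: repeat an annual covariate value across all hours.
svi = pd.Series([0.41, 0.43],
                index=pd.to_datetime(["2020-01-01", "2021-01-01"]))
svi_hourly = svi.reindex(hours, method="ffill")
```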

Additionally, there is growing interest in longer-range geospatial exposure models that can help characterize the exposome throughout the life course (e.g., covering multiple decades in the past as well as into the future). Considerations for linking data for such longer-range models include changes in data collection methods over time (e.g., in sensor technology or survey design), sparse availability of historic environmental data, and uncertain sustainability (i.e., future availability) of data sources.

Special health data linkage considerations

Here, we review special considerations for linkages involving health data for individuals. Sources of health data for individuals include clinical data (e.g., patient electronic health records (EHR)), research data (e.g., health cohort participant data), and health insurance data (e.g., payer claims data). Geospatial exposure data, such as air quality, noise, and greenness data, are typically linked to health data for individuals by calculating exposure metrics that account for the specific spatial locations and time periods of exposure for each individual. Important considerations in calculating and analyzing these exposure metrics include preparing geospatial information for individuals, accounting for time-activity patterns in exposure metrics, protecting the privacy of individuals, and interpreting uncertainty in exposure metrics.

Geospatial information for individuals is needed for linking health and geospatial exposure data. Home addresses, which are routinely collected by health data providers for administrative purposes at specific time-points, are a common source of geospatial information used for linkages. Geocoding, the process of translating addresses from text (e.g., street address format) to coordinates (i.e., latitude and longitude), can be technically challenging. Available geocoding methods have varying spatial accuracy, match rates, automation, and privacy protection strategies [134, 135]. Addresses collected for administrative purposes may be missing information needed for geocoding (e.g., incomplete street number). Population-level geographic units, such as ZIP codes or counties, are also commonly used as geocodes for individuals; however, these represent coarser-scale spatial information. These are used when address information is not available, or when studying exposure-health relationships at coarser spatial scales, such as neighborhood-level social determinants of health for individuals.

Accounting for time-activity patterns, which describe how individuals move through time and space (e.g., from home to work and other locations), is an important consideration for calculating geospatial exposure metrics, particularly for exposures with higher spatial and/or temporal variability (e.g., traffic-related air pollution exposure) [136, 137]. Time-activity data can range in detail from current home address, to home and work address histories, to personal geolocation data (e.g., from wearable GPS). Several approaches are available to account for time-activity patterns in the exposure metrics used for linkages [138]. For example, a time-activity weighted average exposure metric can account for exposures at different point locations (e.g., home, school, and work addresses) by weighting exposures estimated at each location in proportion to the amount of time an individual spends at each location [139]. An activity space-based exposure metric can also account for exposures during travel between the different point locations by reflecting time-integrated average conditions within a spatial area enclosing the different point locations and/or travel paths between locations [140]. Where fine-scale time-activity data as well as fine-scale geospatial exposure data are available, various time- and space-integrated average exposure metrics can account for exposures at each specific spatial location at each specific time-step (e.g., at 1-min intervals) [141].
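A minimal sketch of the time-activity weighted average exposure metric, with hypothetical location-specific exposures and time fractions:

```python
import numpy as np

# Sketch: time-activity weighted average exposure for one individual;
# the modeled exposures and time fractions are hypothetical placeholders.
exposure = np.array([8.2, 12.5, 10.1])    # e.g., PM2.5 at home, work, school
time_frac = np.array([0.65, 0.25, 0.10])  # fraction of time at each location
assert np.isclose(time_frac.sum(), 1.0)   # weights must sum to 1

weighted_exposure = float(exposure @ time_frac)  # sum_k w_k * X_k, about 9.5
```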

Importantly, geospatial information for individuals (e.g., addresses, geocodes, time-activity patterns) is sensitive, potentially identifying information. Laws and institutional policies (e.g., the Health Insurance Portability and Accountability Act (HIPAA) in the US [142]) require protecting individuals’ geospatial information. Thus, privacy protecting strategies are needed throughout the data processing pipelines used in geospatial exposure assessment. For example, many available geocoding tools require sharing addresses with geocoding companies over the internet, which risks exposing those addresses. Geocoding strategies for protecting privacy include using offline geocoding tools [143] and developing privacy-aware APIs for accessing online geocoding tools [144]. There are also privacy concerns associated with developing integrated datasets of geospatial exposures for individuals, due to re-identification risks. Approaches for reducing this risk include geographic masking approaches, such as applying spatial blurring (or, introducing noise) to geocoded addresses before linking individual exposure estimates [145].
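A sketch of geographic masking via random spatial blurring in a "donut" style, displacing each geocode by a random bearing and a bounded random distance; the radii and coordinates are hypothetical, and this sketch is not a substitute for a vetted de-identification protocol:

```python
import numpy as np

# Sketch: geographic masking by random spatial blurring. Each geocode is
# displaced by a random bearing and a distance between r_min and r_max.
# Coordinates are assumed projected (meters); radii are hypothetical.
rng = np.random.default_rng(7)
coords = rng.uniform(0, 10_000, size=(20, 2))  # geocoded addresses (m)

r_min, r_max = 100.0, 500.0                    # displacement bounds (m)
theta = rng.uniform(0, 2 * np.pi, size=len(coords))
r = rng.uniform(r_min, r_max, size=len(coords))
masked = coords + np.column_stack([r * np.cos(theta), r * np.sin(theta)])
```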

Interpreting potential sources of error in exposure metrics is an important consideration in analysis of exposure and health outcomes. There are several potential sources of error in exposure metrics estimated from geospatial models compared to, for example, direct personal exposure measurements. These sources include geocoding errors (i.e., uncertainty in spatial location of address), exposure model errors (i.e., differences between model predictions and measurements), and other exposure misclassification errors (e.g., errors owing to limited time-activity information).

Open-source tools

Linking geospatial exposure data with health data involves a range of geospatial data engineering challenges [143, 146,147,148]. Various open-source tools have been developed to address these challenges. Supplementary Table A lists examples of available open-source tools, such as code, software, and web applications, and categorizes them by the specific steps they address in Fig. 2. Data sources and tools are rapidly evolving; Supplementary Table A represents a snapshot of the current landscape. Catalogs and repositories (such as the CHORDS catalog [21], NASA EarthData [149], CAFE repository [150], and others in Supplementary Table A) aim to provide continuously updated information about available data and tools.

Available tools include open-source geographic information system (GIS) software (e.g., QGIS [151]), R packages (e.g., sf [152]), and Python libraries (e.g., GeoPandas [153]) that broadly support geospatial data engineering and analysis. Other specialized tools help find open geospatial data (e.g., National Environmental Public Health Tracking Network data catalog [154]), access subsets of large geospatial datasets (e.g., OPeNDAP software [155]), integrate disparate data (e.g., GriddingMachine software [156]), develop geospatial exposure models (e.g., terra R package [157]), share data (e.g., NetCDF [158]), link addresses with geospatial exposure estimates (e.g., DeGAUSS software [159]), and calculate geospatial exposure metrics (e.g., hurricaneexposure R package [160]). Continued development of open-source tools (such as those in Supplementary Table A) can further reduce technical barriers to geospatial exposure assessment.

Conclusion

Looking forward, technological advancements in ML algorithms and remote sensing data could dramatically improve the accuracy and spatial resolution of exposure models. For example, approaches have explored the integration of image recognition algorithms to calculate high-resolution geographic covariates [161]. The dizzying pace of advancements in large-language and computer vision models [162] offers exciting opportunities for transformative improvements in geospatial exposure assessment. An example near-term advancement is AI-assisted code development that reduces the modeling expertise burden. Nonetheless, these advancements must be made with domain knowledge in mind and with careful ethical considerations for their connections with health data. Moreover, efforts should be made to increase data sharing and language harmonization among fields that touch the geospatial realm, such as environmental science, epidemiology, toxicology, and public health. By addressing these challenges and leveraging the potential of geospatial exposure modeling, we can advance our understanding of the environmental determinants of health and promote evidence-based interventions to improve public health.