Introduction

Environmental epidemiologic studies commonly estimate exposures by linking geospatial-based exposure information to participants’ residential locations. In the absence of detailed residential histories, which can be burdensome or costly to collect, researchers often rely on the address at study enrollment. However, reliance on exposure at enrollment or another single time point as a proxy for long-term exposure can result in misclassification due to residential mobility [1, 2]. When exposure misclassification is non-differential, effect estimates are typically biased toward the null, making it difficult to identify associations. Further, exposure assessment at the time of study entry may not capture the most etiologically relevant period for diseases with a long latency and precludes the evaluation of delayed effects of exposures during potentially sensitive periods.

Commercial databases that compile public records data have proven to be a useful source of residential history information for epidemiologic studies [3,4,5]. The commercial database LexisNexis® Accurint® provides address information that covers multiple decades and has been shown to correspond well with available study addresses [6,7,8,9,10,11,12]. Here, we used LexisNexis to reconstruct residential histories for participants in the Sister Study, supplementing existing study address data with complete, longitudinal address information prior to enrollment. We described the accuracy of LexisNexis-derived residential histories among a subset of participants and evaluated whether the accuracy varied by sociodemographic characteristics, across geographic regions, and over time.

Methods

Study population

The Sister Study is an ongoing prospective cohort study designed to investigate environmental risk factors for breast cancer [13]. Between 2003 and 2009, a total of 50,884 women across the United States and Puerto Rico were enrolled into the study. All participants provided written informed consent prior to enrollment. The collection of data in the Sister Study and linkage with commercial data sources has been approved by the institutional review board of the National Institute of Environmental Health Sciences. We excluded participants who had withdrawn from the study (n = 9; Data Release 11.1).

Sources of address data

LexisNexis is a commercial vendor of data products that aggregates information from public records including credit reporting data, real estate and tax records, property deed transfers and mortgages, driver’s license records, court filings, and state death registries. We requested up to the 20 most recent addresses for all Sister Study participants and provided LexisNexis with the following information: full names, gender, date of birth, enrollment address, phone number, and date of death (when applicable). We also utilized the United States Postal Service (USPS) Residential Delivery Indicator product (RDI; https://postalpro.usps.com/address-quality-solutions/residential-delivery-indicator-rdi) to identify business addresses. Self-reported residence history at baseline included the street address and dates of residence for participants’ primary residence at study enrollment and for the residence where they lived longest as an adult.

Cleaning and processing of address data

LexisNexis provided a set of addresses for each participant that included the street address, city, state, zip code, and start and stop dates “seen” for each address. To create continuous residential histories from the set of LexisNexis addresses, we adapted a published algorithm to clean the address data [6]. After excluding addresses with missing information, we used the USPS RDI to identify and exclude business addresses. We also excluded addresses with timeframes not within or overlapping the period from 1980 through the study enrollment year and truncated address stop years at the enrollment year.

We performed several steps to reconcile incongruities in timeframes of the cleaned address data. First, to ensure the residential histories reflected time that can be linked to meaningful exposure durations relative to chronic disease outcomes, we excluded short duration addresses (≤31 days) [6]. Next, when self-reported baseline or longest adult address locations matched the LexisNexis records, we substituted the LexisNexis dates with self-reported dates of residence. Then, we sorted addresses by their start dates. When there were matching street addresses, we combined the time frames (which also reconciled duplicate addresses). When gaps or overlaps existed, we assigned the start date of the following address as the end date of the preceding address. We followed this procedure for resolving gaps and overlaps in address histories given evidence that start dates in LexisNexis are more accurate than end dates [6].
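The cleaning and reconciliation steps described above can be illustrated with a minimal Python sketch. It is not the study's actual code: the `Address` record, the `build_history` function, the use of a normalized street string as the match key, truncation at December 31 of the enrollment year, and merging all records that share a street address into one interval are assumptions made for illustration. Removal of records with missing information and of business addresses (via RDI) is assumed to have happened upstream.

```python
from dataclasses import dataclass, replace
from datetime import date


@dataclass(frozen=True)
class Address:
    street: str  # normalized street address, used here as the match key (assumption)
    start: date
    stop: date


def build_history(records, enrollment_year, self_reported=None):
    """Return a continuous, start-sorted history from raw address records.

    self_reported maps a street address to (start, stop) dates (hypothetical input shape).
    """
    self_reported = self_reported or {}
    window_start = date(1980, 1, 1)
    window_stop = date(enrollment_year, 12, 31)  # assumption: truncate at year end

    merged = {}
    for rec in records:
        # Drop records entirely outside the 1980-to-enrollment window.
        if rec.stop < window_start or rec.start > window_stop:
            continue
        # Truncate the stop date at the enrollment year.
        rec = replace(rec, stop=min(rec.stop, window_stop))
        # Drop short stays (31 days or fewer).
        if (rec.stop - rec.start).days <= 31:
            continue
        # Prefer self-reported dates when the location matches.
        if rec.street in self_reported:
            start, stop = self_reported[rec.street]
            rec = replace(rec, start=start, stop=stop)
        # Combine time frames of records that share a street address.
        prev = merged.get(rec.street)
        if prev is not None:
            rec = replace(rec, start=min(prev.start, rec.start),
                          stop=max(prev.stop, rec.stop))
        merged[rec.street] = rec

    history = sorted(merged.values(), key=lambda r: r.start)
    # Close gaps and overlaps: each address ends when the next one starts.
    for i in range(len(history) - 1):
        history[i] = replace(history[i], stop=history[i + 1].start)
    return history
```

Using the next address's start date as the preceding address's end date mirrors the rationale given in the text, namely that start dates are treated as the more reliable anchor.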

Address validation sample

Of the participants with LexisNexis residential history data, we selected 1000 women to participate in an address validation study. To ensure the sample included participants across sociodemographic groups to facilitate comparisons, we drew a weighted random sample based on baseline age and self-reported race and ethnicity: 25% non-Hispanic White (NHW) and ≤55 years, 25% NHW and >55 years, 25% non-NHW and ≤55 years, and 25% non-NHW and >55 years.
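As an illustration of the sampling design, the following Python sketch draws an equal number of participants from each of the four age-by-race/ethnicity strata (250 per stratum for a sample of 1000). The record layout (dictionaries with `age` and `race_ethnicity` keys), the function names, and the fixed random seed are hypothetical and are not taken from the study.

```python
import random


def stratum(age, race_ethnicity):
    """Assign one of the four sampling strata described in the text."""
    group = "NHW" if race_ethnicity == "NHW" else "non-NHW"
    band = "<=55" if age <= 55 else ">55"
    return (group, band)


def draw_stratified_sample(participants, per_stratum=250, seed=0):
    """Draw up to per_stratum participants at random from each stratum."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    by_stratum = {}
    for p in participants:
        key = stratum(p["age"], p["race_ethnicity"])
        by_stratum.setdefault(key, []).append(p)
    sample = []
    for key in sorted(by_stratum):
        members = by_stratum[key]
        # Sample without replacement; take everyone if the stratum is small.
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample
```

Because each stratum contributes the same number of participants regardless of its size in the cohort, estimates from such a sample generally need weighting to represent the full cohort.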

Study participants selected into the validation sample were asked to complete an address validation questionnaire. The form was personalized for each participant and included a list of their LexisNexis addresses (street name and number, city, state, zip code, and corresponding years of residence) up to their year of study enrollment. For each address, participants had the option to select “No updates” if the address was correct, or “Yes, updates” if any of the provided address information was incorrect. If participants selected “Yes, updates,” they were instructed to provide the correct address information. Separately, participants were asked to provide address information for any residences at which they lived between 1980 and their enrollment year that were not included on the form.

Descriptive summaries and analysis

Among all participants with available residential history data, we summarized the distribution of the number of addresses per participant, duration of residence (years) at each address, total years of address history, and age at the start of the earliest address overall and by sociodemographic characteristics.

To evaluate the accuracy of LexisNexis address locations among the validation sample, we calculated the proportion of addresses confirmed at the detailed street (street name and number), street name, zip code, city, and state level, as well as the proportion of LexisNexis addresses that were assigned to the same census tract as the verified/corrected address. To evaluate the accuracy of timing, we calculated the proportion with confirmed dates of residence (start and/or stop year) and the percent of time (years) correctly covered by the LexisNexis addresses.
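One way to make the "percent of time correctly covered" metric concrete is to count address-years. The Python sketch below is an illustrative reading of that metric, not the study's implementation: it assumes histories are lists of (street, start year, stop year) tuples, treats each interval as half-open (the stop year is not counted), and counts a verified year as covered when the LexisNexis history places the participant at the same street address in that year.

```python
def years_at(history):
    """Map each calendar year to the set of addresses occupied in that year.

    Intervals are treated as half-open: [start_year, stop_year).
    """
    placed = {}
    for street, start_year, stop_year in history:
        for year in range(start_year, stop_year):
            placed.setdefault(year, set()).add(street)
    return placed


def percent_time_covered(lexisnexis, verified):
    """Percent of verified address-years matched by the LexisNexis history."""
    ln_years = years_at(lexisnexis)
    ver_years = years_at(verified)
    total = sum(len(streets) for streets in ver_years.values())
    if total == 0:
        return float("nan")
    matched = sum(
        len(streets & ln_years.get(year, set()))
        for year, streets in ver_years.items()
    )
    return 100.0 * matched / total
```

For example, if a participant reports moving in 1998 but LexisNexis records the move in 2000, the two intervening years count as uncovered even though both address locations are correct.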

Address validation metrics were calculated overall and by self-reported sociodemographic characteristics collected at baseline: age (≤45, 46 to 55, 56 to 65, and >65 years), race and ethnicity [Hispanic, non-Hispanic Black, NHW, additional groups (including American Indian or Alaska Native, Asian, Native Hawaiian or other Pacific Islander, and unknown or not specified)], educational attainment (high school graduate or lower, some college, four-year degree or higher), and household income (<$50,000, $50,000–$99,999, and ≥$100,000). We also calculated proportions by address urbanicity [based on 2003 USDA Rural-Urban Continuum Codes [14]: urban/metro (codes 1–3), urban/non-metro (codes 4–7), or rural county (codes 8–9)] and census region (Midwest, Northeast, South, West, Puerto Rico). To make the results representative of the overall population with address history data, we calculated a weighted mean proportion for the overall sample, age groups, and race and ethnicity groups to account for sampling proportions; for all other groups, we calculated a simple proportion. Weights were calculated as the ratio of each sampling group’s proportion among analyzed respondents to their proportion in the full cohort with address history data. To assess differences between subgroups, we obtained p-values from chi-square tests of independence (α = 0.05; tests for age and race/ethnicity groups were performed on weighted counts).
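The distinction between a weighted mean proportion and a simple proportion can be shown with a short Python sketch. The function name and input layout are hypothetical, and the weights are taken as given inputs (one per address, inherited from the participant's sampling group); the sketch does not reproduce how the study derived them.

```python
def weighted_proportion(confirmed, weights):
    """Weighted share of confirmed addresses.

    confirmed: iterable of booleans, one per address.
    weights: iterable of positive numbers, one per address.
    """
    confirmed = list(confirmed)
    weights = list(weights)
    if len(confirmed) != len(weights):
        raise ValueError("confirmed and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    # Sum the weights of confirmed addresses and normalize by the total weight.
    return sum(w for c, w in zip(confirmed, weights) if c) / total
```

With equal weights the result reduces to the simple proportion used for the other subgroups.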

For addresses with a correction to location or dates of residence, we summarized the distance (km) and difference in years, respectively, between the geocoded LexisNexis and updated addresses.

Results

Residential history data

LexisNexis identified 47,557 (93.5%) study participants in its database and returned a total of 497,323 addresses, ranging from 1 to 20 addresses per participant. The sociodemographic characteristics of women identified in the LexisNexis database were similar to the entire cohort (Table 1). After data cleaning and processing, we generated residential histories for all 47,557 participants, with a mean of 3.5 addresses per participant [standard deviation (SD): 2.3] and a mean duration of 7.7 years (SD: 8.9) at each study address (Table 2). There was a median of 25 years [interquartile range (IQR): 21.6–30.5] of residential history, with a median age of 28 years (IQR: 23.1–33.6) at the start year of the earliest address. The number of addresses, duration lived at each address, total years of residential history, and start age of earliest address are summarized by participants’ sociodemographic characteristics in Table 2.

Table 1 Baseline characteristics of participants in the Sister Study cohort and the address verification study.
Table 2 Summary of LexisNexis residential history data among Sister Study participants.

Address validation study

A total of 823 (82.3%) participants completed the address validation questionnaire. Table 3 shows the proportion of address locations verified at various spatial levels among the entire sample and by sociodemographic and geographic characteristics. The reconstructed residence histories accurately captured most address locations (94.6% at the detailed street-level). This proportion was higher among participants older than 65 years at baseline (97.6%) and lower among addresses with a start year before 1980 (89.1%). Otherwise, proportions were similar across participant sociodemographic characteristics and geographic characteristics of addresses. For all addresses with updates to location information (n = 272, 6.2%), the median distance between the LexisNexis and corrected addresses was 5.4 km (IQR: 0.1–41.7).

Table 3 Proportion (%) of addresses in LexisNexis-derived residence histories with verified locations and timeframes.

While 71.7% of addresses had both verified start and stop dates of residence, the residential histories accurately covered 81.5% of verified/corrected address timeframes. The percent of time covered varied by race and ethnicity, educational attainment, household income, and start year of address (P-value < 0.01, see Table 3). A greater proportion of start dates of residence (79.8%) were verified than end dates (75.4%). For addresses with updates to dates of residence (n = 546, 15%), the median difference in duration of residence between LexisNexis and corrected addresses was ±3 years (IQR: 1–5).

On the address validation questionnaire, participants separately reported a total of 338 additional addresses at which they lived during or overlapping the period between 1980 and study enrollment that were not included in the LexisNexis residential history. For over half (58.9%) of these addresses, participants indicated that the start year of residence was before 1985, a period for which LexisNexis data is less complete.

Discussion

Using address records from the commercial database LexisNexis, we reconstructed retrospective residential histories for 93.5% of participants in the Sister Study cohort. This effort added a median of 25 years of address history prior to study enrollment. For a subset of participants, 95% of address locations and 82% of the time spent at addresses were verified, demonstrating the accuracy of the reconstructed residential histories. The residential data produced in this analysis provides a valuable resource for future studies leveraging geospatial data to understand the health impacts of social and environmental exposures across the lifecourse.

Our results align with prior literature that finds LexisNexis records correspond well with study address locations [7,8,9,10,11,12]. Most of these studies evaluated agreement between study and LexisNexis addresses at the year of study baseline or completion of a follow-up questionnaire. For example, the overall proportion of study baseline addresses that matched LexisNexis address records was 86% in both the California Teachers Study [10] and the Los Angeles Ultrafines Study [7] and 92% in the nationwide REGARDS cohort [11]. These proportions are comparable to our findings where 95% of LexisNexis address locations were confirmed in the address validation sample. However, because of the limited extent of self-reported address data, these prior studies were unable to evaluate the ability of LexisNexis to capture all residential moves continuously across time.

In our study, we evaluated longitudinal address histories, allowing us to describe temporal mismatches across multiple addresses. We found residential histories accurately covered 82% of the time spent at addresses, and this proportion improved after 1985 when LexisNexis data is more complete. Our results are comparable to two other studies that also evaluated address location and timing [8, 9]. Among participants in a Michigan case-control study, Jacquez et al. [9] compared recalled lifetime residential histories against LexisNexis and found that LexisNexis addresses covered 72% of the time spent at lifetime addresses. However, the authors obtained only the 3 most recent LexisNexis addresses for each individual, which limited the length of residential history available for comparison. In another analysis among 1000 participants in the NIH-AARP Diet and Health Study, Wheeler and Wang [8] found that 86% of follow-up study addresses matched LexisNexis records at the detailed street-level, and for those matched addresses, LexisNexis records covered 89% of the time spent at them. Despite differences between this analysis and our study in the age of study participants and benchmark for comparison (study addresses in the NIH-AARP cohort were identified using a combination of self-report and the USPS National Change of Address product, whereas we used participant recall to verify LexisNexis addresses), the overall findings are similar.

Prior studies have reported that for certain sociodemographic groups or geographic regions, the completeness and accuracy of LexisNexis data can vary [10,11,12]. We observed few significant differences in the accuracy of address locations but some differences in the accuracy of timeframes across socioeconomic groups. The mechanisms that may lead to biases in the accuracy of residence histories include historic segregation that influences contemporary housing insecurity among Black and other racially minoritized groups [15]. For groups with high socioeconomic position, it can be challenging to distinguish primary residences from simultaneously owned properties or businesses. Consistent with prior studies, Sister Study participants who were not identified in the LexisNexis database differed from the overall cohort: they were older and had less education and lower incomes. This likely reflects bias inherent to commercial address data whereby people with a lower socioeconomic position are less likely to engage in activities (e.g., purchasing property, voting, or registering vehicles) that generate residential information in administrative databases [15]. It is reassuring that the proportion of Sister Study participants identified in LexisNexis was high (93.5%) and consistent with other studies at the same time period [7, 11, 12], although careful attention should be paid to potential selection or information biases introduced by linking commercial address data to cohort studies, such as greater levels of exposure misclassification among groups with less accurate residential histories [16]. Additionally, our findings from the Sister Study, which has an overall higher socioeconomic position than the general US population, may not be generalizable to other cohorts with a different age distribution, sociodemographic makeup, or study period. Future studies exploring the use of LexisNexis or other sources of commercial address data should be mindful of the time period and population characteristics that may impact data completeness and accuracy.

This study contributes to the knowledge of best practices for cleaning and processing commercial address data for research use. Xu et al. [12] provided the first in-depth description of a “standard” procedure for generating residential histories using raw data from LexisNexis. We followed a similar procedure in cleaning LexisNexis address data with the additional step of integrating self-reported study addresses to help resolve timing incongruities and improve our confidence in the final residential histories. Thus, our study complements the blueprint described in Xu et al. [12] by demonstrating how additional address data sources (i.e., self-reported study addresses and RDI) can be used in generating LexisNexis residence histories.

In this study, we described the utility of LexisNexis for generating retrospective detailed residential histories in a nationwide cohort of women. Our findings confirmed that the address histories accurately represented a high proportion of locations and time frames. These results highlight the value of this address information for use in future epidemiologic studies.