Abstract
Background
Commercial address data can help reconstruct detailed residential histories, which are crucial for accurate assessment of geospatial-based environmental exposures in epidemiologic studies.
Objective
To reconstruct and assess the accuracy of pre-baseline residential histories for the Sister Study, an ongoing United States-wide prospective cohort.
Methods
We used LexisNexis® Accurint® to construct pre-baseline residential histories for 47,557 participants. A subset (N = 823) validated their LexisNexis-derived addresses via a supplemental questionnaire. We assessed the proportion of addresses with verified locations and timeframes by sociodemographic and geographic characteristics.
Results
Residential histories were reconstructed for 93.5% of participants, adding a median of 25 years of data. The histories accurately captured 95% of address locations and 82% of residence durations, with improved accuracy after 1990.
Impact
This study leverages LexisNexis to reconstruct detailed residential histories before cohort enrollment for nearly all Sister Study participants, creating a valuable resource for investigating the health effects of past environmental exposures. A subset of participants verified the locations and timeframes of a high proportion of addresses in the LexisNexis-derived histories, reinforcing confidence in their accuracy for the full cohort.
Introduction
Environmental epidemiologic studies commonly estimate exposures by linking geospatial-based exposure information to participants’ residential locations. In the absence of detailed residential histories, which can be burdensome or costly to collect, researchers often rely on the address at study enrollment. However, reliance on exposure at enrollment or another single time point as a proxy for long-term exposure can result in misclassification due to residential mobility [1, 2]. When exposure misclassification is non-differential, effect estimates are typically biased toward the null, making it difficult to identify associations. Further, exposure assessment at the time of study entry may not capture the most etiologically relevant period for diseases with a long latency and precludes the evaluation of delayed effects of exposures during potentially sensitive periods.
Commercial databases that compile public records data have proven to be a useful source of residential history information for epidemiologic studies [3,4,5]. The commercial database LexisNexis® Accurint® provides address information that covers multiple decades and has been shown to correspond well with available study addresses [6,7,8,9,10,11,12]. Here, we used LexisNexis to reconstruct residential histories for participants in the Sister Study, supplementing existing study address data with complete, longitudinal address information prior to enrollment. We described the accuracy of LexisNexis-derived residential histories among a subset of participants and evaluated whether the accuracy varied by sociodemographic characteristics, across geographic regions, and over time.
Methods
Study population
The Sister Study is an ongoing prospective cohort study designed to investigate environmental risk factors for breast cancer [13]. Between 2003 and 2009, a total of 50,884 women across the United States and Puerto Rico were enrolled into the study. All participants provided written informed consent prior to enrollment. The collection of data in the Sister Study and linkage with commercial data sources has been approved by the institutional review board of the National Institute of Environmental Health Sciences. We excluded participants who had withdrawn from the study (n = 9; Data Release 11.1).
Sources of address data
LexisNexis is a commercial vendor of data products that aggregates information from public records including credit reporting data, real estate and tax records, property deed transfers and mortgages, driver’s license records, court filings, and state death registries. We requested up to the 20 most recent addresses for all Sister Study participants and provided LexisNexis with the following information: full names, gender, date of birth, enrollment address, phone number, and date of death (when applicable). We also utilized the United States Postal Service (USPS) Residential Delivery Indicator product (RDI; https://postalpro.usps.com/address-quality-solutions/residential-delivery-indicator-rdi) to identify business addresses. Self-reported residence history at baseline included the street address and dates of residence of participants’ primary residence at study enrollment and where they lived longest as an adult.
Cleaning and processing of address data
LexisNexis provided a set of addresses for each participant that included the street address, city, state, zip code, and start and stop dates “seen” for each address. To create continuous residential histories from the set of LexisNexis addresses, we adapted a published algorithm to clean the address data [6]. After excluding addresses with missing information, we used the USPS RDI to identify and exclude business addresses. We also excluded addresses with timeframes not within or overlapping the period from 1980 through the study enrollment year and truncated address stop years at the enrollment year.
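The exclusion and truncation rules described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the study's code: the `Address` record, its field names, and the function `clean_addresses` are hypothetical, and the `is_business` flag stands in for the result of a USPS RDI lookup.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Address:
    # Hypothetical record layout; field names are illustrative only.
    street: Optional[str]
    start_year: Optional[int]
    stop_year: Optional[int]
    is_business: bool = False  # assumed to come from a USPS RDI lookup


def clean_addresses(records: List[Address], enroll_year: int,
                    first_year: int = 1980) -> List[Address]:
    """Apply the exclusion rules and truncate stop years at enrollment."""
    cleaned = []
    for rec in records:
        # Exclude addresses with missing information.
        if rec.street is None or rec.start_year is None or rec.stop_year is None:
            continue
        # Exclude business addresses.
        if rec.is_business:
            continue
        # Exclude timeframes that do not overlap [first_year, enroll_year].
        if rec.stop_year < first_year or rec.start_year > enroll_year:
            continue
        # Truncate the stop year at the enrollment year.
        cleaned.append(replace(rec, stop_year=min(rec.stop_year, enroll_year)))
    return cleaned
```

Keeping each rule as a separate early exit makes it straightforward to count how many addresses each exclusion step removes.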
We performed several steps to reconcile incongruities in timeframes of the cleaned address data. First, to ensure the residential histories reflected time that can be linked to meaningful exposure durations relative to chronic disease outcomes, we excluded short duration addresses (≤31 days) [6]. Next, when self-reported baseline or longest adult address locations matched the LexisNexis records, we substituted the LexisNexis dates with self-reported dates of residence. Then, we sorted addresses by their start dates. When there were matching street addresses, we combined the time frames (which also reconciled duplicate addresses). When gaps or overlaps existed, we assigned the start date of the following address as the end date of the preceding address. We followed this procedure for resolving gaps and overlaps in address histories given evidence that start dates in LexisNexis are more accurate than end dates [6].
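As a rough sketch of these reconciliation steps, the Python function below drops short stays, combines records with the same street address, sorts by start date, and closes gaps and overlaps by ending each address at the start of the next. The tuple layout and the name `reconcile` are hypothetical. Two simplifications are assumed: the substitution of self-reported dates is omitted, and all records sharing a street address are combined into one span regardless of whether they are adjacent in time.

```python
from datetime import date
from typing import Dict, List, Tuple

Residence = Tuple[str, date, date]  # (street address, start date, end date)


def reconcile(residences: List[Residence], max_short_days: int = 31) -> List[Residence]:
    """Return a continuous, non-overlapping history sorted by start date."""
    # 1. Exclude short-duration addresses (31 days or fewer).
    kept = [r for r in residences if (r[2] - r[1]).days > max_short_days]

    # 2. Combine time frames of matching street addresses (also removes duplicates).
    merged: Dict[str, Tuple[date, date]] = {}
    for street, start, end in kept:
        if street in merged:
            prev_start, prev_end = merged[street]
            merged[street] = (min(prev_start, start), max(prev_end, end))
        else:
            merged[street] = (start, end)

    # 3. Sort addresses by start date.
    ordered = sorted(((st, s, e) for st, (s, e) in merged.items()),
                     key=lambda r: r[1])

    # 4. Resolve gaps and overlaps: the next start date becomes the prior end date.
    result = []
    for i, (street, start, end) in enumerate(ordered):
        if i + 1 < len(ordered):
            end = ordered[i + 1][1]
        result.append((street, start, end))
    return result
```

Anchoring each boundary on the following start date reflects the rationale given above, that start dates are treated as more reliable than end dates.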
Address validation sample
Of the participants with LexisNexis residential history data, we selected 1000 women to participate in an address validation study. To ensure the sample included participants across sociodemographic groups to facilitate comparisons, we drew a weighted random sample based on baseline age and self-reported race and ethnicity: 25% non-Hispanic White (NHW) and ≤55 years, 25% NHW and >55 years, 25% non-NHW and ≤55 years, and 25% non-NHW and >55 years.
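With 1000 women and four equally weighted strata, this design implies 250 participants per stratum. The sketch below shows one way such a draw could be implemented; the dictionary keys (`age`, `nhw`), the function names, and the fixed seed are hypothetical and are not taken from the study.

```python
import random
from typing import Dict, List


def stratum(age: float, nhw: bool) -> str:
    """Label one of the four sampling strata (race/ethnicity by baseline age)."""
    race = "NHW" if nhw else "non-NHW"
    age_group = "<=55" if age <= 55 else ">55"
    return f"{race}, {age_group}"


def draw_sample(participants: List[dict], per_stratum: int = 250,
                seed: int = 0) -> List[dict]:
    """Draw an equal-sized simple random sample from each stratum."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    groups: Dict[str, List[dict]] = {}
    for person in participants:
        groups.setdefault(stratum(person["age"], person["nhw"]), []).append(person)

    sample: List[dict] = []
    for members in groups.values():
        # Sample without replacement; take everyone if the stratum is small.
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample
```

Because the strata are sampled in equal numbers rather than in proportion to their size, estimates for the overall cohort need to be reweighted afterwards.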
Study participants selected into the validation sample were asked to complete an address validation questionnaire. The form was personalized for each participant and included a list of their LexisNexis addresses (street name and number, city, state, zip code, and corresponding years of residence) up to their year of study enrollment. For each address, participants had the option to select “No updates” if the address was correct, or “Yes, updates” if any of the provided address information was incorrect. If participants selected “Yes, updates,” they were instructed to provide the correct address information. Separately, participants were asked to provide address information for any residences at which they lived between 1980 and their enrollment year that were not included on the form.
Descriptive summaries and analysis
Among all participants with available residential history data, we summarized the distribution of the number of addresses per participant, duration of residence (years) at each address, total years of address history, and age at the start of the earliest address overall and by sociodemographic characteristics.
To evaluate the accuracy of LexisNexis address locations among the validation sample, we calculated the proportion of addresses confirmed at the detailed street (street name and number), street name, zip code, city, and state level, as well as the proportion of LexisNexis addresses that were assigned to the same census tract as the verified/corrected address. To evaluate the accuracy of timing, we calculated the proportion with confirmed dates of residence (start and/or stop year) and the percent of time (years) correctly covered by the LexisNexis addresses.
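One plausible way to operationalize the percent of time covered is to compare each LexisNexis timeframe with the participant-verified or corrected timeframe for the same address and sum the overlapping years. The sketch below assumes that interpretation; the function names and the year-level (start, stop) representation are hypothetical, and the study's exact definition may differ.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start year, stop year)


def overlap_years(a: Span, b: Span) -> int:
    """Number of years shared by two timeframes (0 if they do not overlap)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def percent_time_covered(pairs: List[Tuple[Span, Span]]) -> float:
    """pairs holds (LexisNexis span, verified/corrected span) for each address."""
    covered = sum(overlap_years(ln, truth) for ln, truth in pairs)
    total = sum(truth[1] - truth[0] for _, truth in pairs)
    return 100.0 * covered / total if total else float("nan")
```

For example, an address recorded as 2000 to 2005 but corrected to 2002 to 2010 contributes 3 covered years out of 8 verified years.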
Address validation metrics were calculated overall and by self-reported sociodemographic characteristics collected at baseline: age (≤45, 46 to 55, 56 to 65, and >65 years), race and ethnicity [Hispanic, non-Hispanic Black, NHW, additional groups (including American Indian or Alaska Native, Asian, Native Hawaiian or other Pacific Islander, and unknown or not specified)], educational attainment (high school graduate or lower, some college, four-year degree or higher), and household income (<$50,000, $50,000–$99,999, and ≥$100,000). We also calculated proportions by address urbanicity [based on 2003 USDA Rural-Urban Continuum Codes [14]: urban/metro (codes 1–3), urban/non-metro (codes 4–7), or rural county (codes 8–9)] and census region (Midwest, Northeast, South, West, Puerto Rico). To make the results representative of the overall population with address history data, we calculated a weighted mean proportion for the overall sample, age groups, and race and ethnicity groups to account for sampling proportions; for all other groups, we calculated a simple proportion. Weights were calculated as the ratio of each sampling group’s proportion among analyzed respondents to their proportion in the full cohort with address history data. To assess differences between subgroups, we obtained p-values from chi-square tests of independence (α = 0.05; tests for age and race/ethnicity groups were performed on weighted counts).
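To illustrate why weighting matters after stratified sampling, the sketch below computes a weighted mean proportion by averaging stratum-level proportions according to each stratum's share of the target population. This is a generic post-stratification calculation with made-up numbers; the function name and input layout are hypothetical, and it is not necessarily the exact weighting scheme used in the study.

```python
from typing import Dict, Tuple


def weighted_proportion(strata: Dict[str, Tuple[int, int, float]]) -> float:
    """strata maps a stratum name to (n_verified, n_total, population_share).

    Each stratum's observed proportion is weighted by its share of the
    target population; shares are renormalized in case they do not sum to 1.
    """
    total_share = sum(share for _, _, share in strata.values())
    weighted_sum = sum(share * (verified / total)
                       for verified, total, share in strata.values())
    return weighted_sum / total_share
```

With illustrative inputs of 90/100 verified in a stratum that makes up 80% of the population and 50/100 in a stratum that makes up 20%, the weighted proportion is 0.82, whereas the simple pooled proportion would be 0.70.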
For addresses with a correction to location or dates of residence, we summarized the distance (km) and difference in years, respectively, between the geocoded LexisNexis and updated addresses.
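The distance between two geocoded points can be approximated with the haversine great-circle formula, sketched below. The text does not state how distances were computed, so the choice of formula, the function name, and the mean Earth radius value are assumptions made for illustration.

```python
import math


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float,
                 radius_km: float = 6371.0088) -> float:
    """Great-circle distance in km between two (latitude, longitude) points
    given in decimal degrees, assuming a spherical Earth."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))
```

Applying the function to each pair of LexisNexis and corrected coordinates yields the distances that can then be summarized with a median and interquartile range, for instance using `statistics.median` and `statistics.quantiles`.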
Results
Residential history data
LexisNexis identified 47,557 (93.5%) study participants in its database and returned a total of 497,323 addresses, ranging from 1 to 20 addresses per participant. The sociodemographic characteristics of women identified in the LexisNexis database were similar to the entire cohort (Table 1). After data cleaning and processing, we generated residential histories for all 47,557 participants, with a mean of 3.5 addresses per participant [standard deviation (SD): 2.3] and a mean duration of 7.7 years (SD: 8.9) at each study address (Table 2). There was a median of 25 years [interquartile range (IQR): 21.6–30.5] of residential history, with a median age of 28 years (IQR: 23.1–33.6) at the start year of the earliest address. The number of addresses, duration lived at each address, total years of residential history, and start age of earliest address are summarized by participants’ sociodemographic characteristics in Table 2.
Address validation study
A total of 823 (82.3%) participants completed the address validation questionnaire. Table 3 shows the proportion of address locations verified at various spatial levels among the entire sample and by sociodemographic and geographic characteristics. The reconstructed residence histories accurately captured most address locations (94.6% at the detailed street-level). This proportion was higher among participants older than 65 years at baseline (97.6%) and lower among addresses with a start year before 1980 (89.1%). Otherwise, proportions were similar across participant sociodemographic characteristics and geographic characteristics of addresses. For all addresses with updates to location information (n = 272, 6.2%), the median distance between the LexisNexis and corrected addresses was 5.4 km (IQR: 0.1–41.7).
While 71.7% of addresses had both verified start and stop dates of residence, the residential histories accurately covered 81.5% of verified/corrected address timeframes. The percent of time covered varied by race and ethnicity, educational attainment, household income, and start year of address (P-value < 0.01, see Table 3). A greater proportion of start dates of residence (79.8%) were verified than end dates (75.4%). For addresses with updates to dates of residence (n = 546, 15%), the median difference in duration of residence between LexisNexis and corrected addresses was ±3 years (IQR: 1–5).
On the address validation questionnaire, participants separately reported a total of 338 additional addresses at which they lived during or overlapping the period between 1980 and study enrollment that were not included in the LexisNexis residential history. For over half (58.9%) of these addresses, participants indicated that the start year of residence was before 1985, a period for which LexisNexis data is less complete.
Discussion
Using address records from the commercial database LexisNexis, we reconstructed retrospective residential histories for 93.5% of participants in the Sister Study cohort. This effort added a median of 25 years of address history prior to study enrollment. For a subset of participants, 95% of address locations and 82% of the time spent at addresses were verified, demonstrating the accuracy of the reconstructed residential histories. The residential data produced in this analysis provides a valuable resource for future studies leveraging geospatial data to understand the health impacts of social and environmental exposures across the lifecourse.
Our results align with prior literature that finds LexisNexis records correspond well with study address locations [7,8,9,10,11,12]. Most of these studies evaluated agreement between study and LexisNexis addresses at the year of study baseline or completion of a follow-up questionnaire. For example, the overall proportion of study baseline addresses that matched LexisNexis address records was 86% in both the California Teachers Study [10] and the Los Angeles Ultrafines Study [7] and 92% in the nationwide REGARDS cohort [11]. These proportions are comparable to our findings where 95% of LexisNexis address locations were confirmed in the address validation sample. However, because of the limited extent of self-reported address data, these prior studies were unable to evaluate the ability of LexisNexis to capture all residential moves continuously across time.
In our study, we evaluated longitudinal address histories, allowing us to describe temporal mismatches across multiple addresses. We found residential histories accurately covered 82% of the time spent at addresses, and this proportion improved after 1985 when LexisNexis data is more complete. Our results are comparable to two other studies that also evaluated address location and timing [8, 9]. Among participants in a Michigan case-control study, Jacquez et al. [9] compared recalled lifetime residential histories against LexisNexis and found that LexisNexis addresses covered 72% of the time spent at lifetime addresses. However, the authors obtained only the 3 most recent LexisNexis addresses for each individual, which limited the length of residential history available for comparison. In another analysis among 1000 participants in the NIH-AARP Diet and Health Study, Wheeler and Wang [8] found that 86% of follow-up study addresses matched LexisNexis records at the detailed street-level, and for those matched addresses, LexisNexis records covered 89% of the time spent at them. Despite differences between this analysis and our study in the age of study participants and benchmark for comparison (study addresses in the NIH-AARP cohort were identified using a combination of self-report and the USPS National Change of Address product, whereas we used participant recall to verify LexisNexis addresses), the overall findings are similar.
Prior studies have reported that for certain sociodemographic groups or geographic regions, the completeness and accuracy of LexisNexis data can vary [10,11,12]. We observed few significant differences in the accuracy of address locations but some differences in the accuracy of timeframes across socioeconomic groups. The mechanisms that may lead to biases in the accuracy of residence histories include historic segregation that influences contemporary housing insecurity among Black and other racially minoritized groups [15]. For groups with high socioeconomic position, it can be challenging to distinguish primary residences from simultaneously owned properties or businesses. Consistent with prior studies, Sister Study participants who were not identified in the LexisNexis database differed from the overall cohort: they were older and had less education and lower incomes. This likely reflects bias inherent to commercial address data whereby people with a lower socioeconomic position are less likely to engage in activities (e.g., purchasing property, voting, or registering vehicles) that generate residential information in administrative databases [15]. It is reassuring that the proportion of Sister Study participants identified in LexisNexis was high (93.5%) and consistent with other studies at the same time period [7, 11, 12], although careful attention should be paid to potential selection or information biases introduced by linking commercial address data to cohort studies, such as greater levels of exposure misclassification among groups with less accurate residence histories [16]. Additionally, our findings from the Sister Study, which has an overall higher socioeconomic position than the general US population, may not be generalizable to other cohorts with a different age distribution, sociodemographic makeup, or study period.
Future studies exploring the use of LexisNexis or other sources of commercial address data should be mindful of the time period and population characteristics that may impact data completeness and accuracy.
This study contributes to the knowledge of best practices for cleaning and processing commercial address data for research use. Xu et al. [12] provided the first in-depth description of a “standard” procedure for generating residential histories using raw data from LexisNexis. We followed a similar procedure in cleaning LexisNexis address data with the additional step of integrating self-reported study addresses to help resolve timing incongruities and improve our confidence in the final residential histories. Thus, our study complements the blueprint described in Xu et al. [12] by demonstrating how additional address data sources (i.e., self-reported study addresses and RDI) can be used in generating LexisNexis residence histories.
In this study, we described the utility of LexisNexis for generating retrospective detailed residential histories in a nationwide cohort of women. Our findings confirmed that the address histories accurately represented a high proportion of locations and time frames. These results highlight the value of this address information for use in future epidemiologic studies.
Data availability
Requests for data, including the data and code used in this manuscript, are welcome. De-identified data are made available upon request as described on the public study website (https://sisterstudy.niehs.nih.gov/English/data-requests.htm). The data sharing policy was developed to protect the privacy of study participants and is consistent with study informed consent documents as approved by the NIEHS institutional review board.
References
1. Medgyesi DN, Fisher JA, Cervi MM, Weyer PJ, Patel DM, Sampson JN, et al. Impact of residential mobility on estimated environmental exposures in a prospective cohort of older women. Environ Epidemiol. 2020;4:e110.
2. Ling C, Heck JE, Cockburn M, Liew Z, Marcotte E, Ritz B. Residential mobility in early childhood and the impact on misclassification in pesticide exposures. Environ Res. 2019;173:212–20.
3. Christian WJ, Walker CJ, Huang B, Levy JE, Durbin E, Arnold S. Using residential histories in case-control analysis of lung cancer and mountaintop removal coal mining in Central Appalachia. Spatial Spatio-temporal Epidemiol. 2020;35:100364.
4. Liu B, Niu L, Lee FF. Utilizing residential histories to assess environmental exposure and socioeconomic status over the life course among mesothelioma patients. J Thorac Dis. 2023;15:6126–39.
5. Semmens EO, Leary CS, Fitzpatrick AL, Ilango SD, Park C, Adam CE, et al. Air pollution and dementia in older adults in the Ginkgo Evaluation of Memory Study. Alzheimer’s Dement. 2023;19:549–59.
6. Stinchcomb DG, Roeser A. NCI/SEER residential history project. Rockville, MD; Westat, Inc. 2016.
7. Medgyesi DN, Fisher JA, Flory AR, Hayes RB, Thurston GD, Liao LM, et al. Evaluation of a commercial database to estimate residence histories in the Los Angeles ultrafines study. Environ Res. 2021;197:110986.
8. Wheeler DC, Wang A. Assessment of residential history generation using a public-record database. Int J Environ Res Public Health. 2015;12:11670–82.
9. Jacquez GM, Slotnick MJ, Meliker JR, AvRuskin G, Copeland G, Nriagu J. Accuracy of commercially available residential histories for epidemiologic studies. Am J Epidemiol. 2011;173:236–43.
10. Hurley S, Hertz A, Nelson DO, Layefsky M, Von Behren J, Bernstein L, et al. Tracing a path to the past: exploring the use of commercial credit reporting data to construct residential histories for epidemiologic studies of environmental exposures. Am J Epidemiol. 2017;185:238–46.
11. Brooks MS, Bennett A, Lovasi GS, Hurvitz PM, Colabianchi N, Howard VJ, et al. Matching participant address with public records database in a US national longitudinal cohort study. SSM Popul Health. 2021;15:100887.
12. Xu W, Agnew M, Kamis C, Schultz A, Salas S, Malecki K, et al. Constructing residential histories in a general population-based representative sample. Am J Epidemiol. 2024;193:348–59.
13. Sandler DP, Hodgson ME, Deming-Halverson SL, Juras PS, D’Aloisio AA, Suarez LM, et al. The Sister Study cohort: baseline methods and participant characteristics. Environ Health Perspect. 2017;125:127003.
14. U.S. Department of Agriculture ERS. Rural-Urban Continuum Codes. https://www.ers.usda.gov/data-products/rural-urban-continuum-codes.
15. Sims KD, Glymour MM, Ncube CN, Willis MD. Invited commentary: Improving spatial exposure data for everyone-life-course social context and ascertaining residential history. Am J Epidemiol. 2025;194:573–7.
16. Freeman VL, Boylan EE, Tilahun NY, Basu S, Kwan M-P. Sources of selection and information biases when using commercial database–derived residential histories for cancer research. Ann Epidemiol. 2020;51:35–40.e1.
Acknowledgements
We sincerely thank all Sister Study participants, especially those who took part in the address validation sub-study, for their time and commitment. We also acknowledge the efforts of the study staff and coordinators who facilitated data collection and participant engagement.
Funding
This work was supported by the Intramural Research Program of the National Institutes of Health, National Institute of Environmental Health Sciences (Z01-ES103332, Z01-ES044005). The contributions of the NIH authors were made as part of their official duties as NIH federal employees, are in compliance with agency policy requirements, and are considered works of the United States Government. However, the findings and conclusions presented in this paper are those of the authors and do not necessarily reflect the views of the NIH or the U.S. Department of Health and Human Services. Open access funding provided by the National Institutes of Health.
Author information
Contributions
AJW and NMN conceptualized the study, and with JLI, designed the methodology. PR processed and curated the data for analysis. JLI and MD conducted the formal analysis. JLI wrote the original draft. AJW supervised the project. JLI, MD, PR, NMN, RJR, and AJW contributed to the interpretation of results, reviewed and edited the manuscript, and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Ish, J.L., Daniel, M., Ringwald, P. et al. Accuracy of LexisNexis-derived retrospective address histories in the Sister Study cohort. J Expo Sci Environ Epidemiol 36, 244–250 (2026). https://doi.org/10.1038/s41370-025-00802-1