Abstract
The Personalized Environment and Genes Study (PEGS) is a unique resource comprising genetic and environmental exposure data linked to geospatial data. The PEGS cohort contains 19,445 demographically diverse participants who provided phenotype and exposure data by completing three surveys. Whole-genome sequencing was performed for a subset of 4,737 participants to interrogate common and rare variants and structural variations, including high-resolution human leukocyte antigen (HLA) variants. Geographic coordinates were assigned to participant addresses, enabling the use of distance to contaminant sources and area-level air-pollutant concentrations as surrogates for exposure. Several tools are available to explore these data and the results of exposome-wide association studies (ExWAS) conducted with them. The i2b2 Query and Analysis Tool enables approved users to build customizable queries for exploring basic statistics from de-identified and aggregated PEGS data. PEGS Explorer allows users to explore published ExWAS results and rigorously calculated exposure correlations. Globe visualizations in this tool reflect the complex mixtures involved in the exposome and allow users to visualize correlations between exposures and common, complex diseases.
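As a rough illustration of how a distance-based exposure surrogate of the kind mentioned above could be derived from geographic coordinates, the following minimal sketch computes the great-circle (haversine) distance from a residence to the nearest of several sources. It is not the PEGS pipeline: the function names, the choice of the haversine formula, and the coordinates are all assumptions made for illustration only.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; a spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_source_km(residence, sources):
    """Distance from one (lat, lon) residence to the closest source in a non-empty list."""
    return min(haversine_km(*residence, *src) for src in sources)


# Hypothetical coordinates, made up for this example (not study data).
residence = (35.90, -78.86)
sources = [(35.00, -78.00), (36.10, -79.00)]
print(round(nearest_source_km(residence, sources), 1))
```

A spherical-Earth distance is usually adequate for ranking sources by proximity; an analysis that needs metre-level accuracy would more likely use a projected coordinate system or an ellipsoidal geodesic.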
Data availability
PEGS Data Freeze 3.1 is deposited in a controlled-access data repository hosted by the NIEHS. The repository supports secure, independent download and local analysis of approved data by authorized users.
The deposited datasets include de-identified survey data, derived exposure and geospatial metrics, whole-genome sequencing variant files, DNA methylation beta values, polygenic scores, and associated metadata and documentation. Direct identifiers, protected health information, and precise residential address data are not included in shared datasets.
Access to PEGS data is available to qualified academic, governmental, and commercial researchers through an application process. Applications are reviewed by the PEGS Executive Leadership Committee using predefined criteria, including scientific validity, consistency with participant consent and NIH IRB approvals, and the applicant’s ability to comply with data security requirements. Approval does not require collaboration with PEGS investigators, and approved users may analyze the data independently.
Data access is granted following execution of a Data Use Agreement that specifies permitted uses, data protection requirements, and reporting obligations. Information on the application process and a public copy of the PEGS Data Use Agreement are available at: https://www.niehs.nih.gov/research/atniehs/labs/crb/studies/pegs/collaboration/guidelines.
Code availability
Standard open-source packages were used to generate the PEGS data as described in the Methods section. The code pipelines for data generation are available at https://github.com/fsakhtari/PEGS_common/blob/master/pegs_common_utils.R.
Acknowledgements
We thank the PEGS participants for their contributions to this work. We also thank the Office of Communications and Public Liaison at NIEHS for their support in creating and maintaining the PEGS website, and Donna Jeanne Corcoran for crafting the exceptionally elegant graphics for the PEGS website and this manuscript. We express our sincere appreciation to Sharon Soucek in the Office of Technology Transfer at NIEHS for support and expertise regarding the data use agreements that enable collaborative research projects with PEGS. Finally, we thank Hannah Collins Cakar for assistance with manuscript preparation.
Author information
Contributions
A.A.M.-R., J.E.H., C.P.S., D.C.F., F.S.A. and S.H.S. conceived the study design. A.B., D.C.F., J.E.H. and F.S.A. developed the methodology used in the study. A.B., F.S.A., C.P.S. and J.E.H. were responsible for the software used. A.B., F.S.A. and J.M. coordinated validation of the study data. A.B., F.S.A., J.M. and J.S.H. conducted formal analysis. D.D., F.S.A., N.M., S.S., D.J.S., S.K.M., R.R., Z.X., J.M. and J.E.H. were responsible for data curation. F.S.A. was responsible for visualization. A.A.M.-R., J.E.H., C.P.S. and D.C.F. were responsible for project administration. A.A.M.-R. and J.E.H. coordinated funding acquisition. D.J.S. and S.K.M. conducted PRS analyses. F.S.A. and A.A.M.-R. wrote the initial draft of the manuscript. All authors read, edited and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Akhtari, F.S., Madenspacher, J., D’Agostin, D. et al. Personalized Environment and Genes Study (PEGS) Dataset-a resource for genomic, exposomic, and geospatial data. Sci Data (2026). https://doi.org/10.1038/s41597-026-07011-x
DOI: https://doi.org/10.1038/s41597-026-07011-x