Scientific Data
Personalized Environment and Genes Study (PEGS) Dataset-a resource for genomic, exposomic, and geospatial data
  • Data Descriptor
  • Open access
  • Published: 02 April 2026

  • Farida S. Akhtari1,2,
  • Jennifer Madenspacher2,
  • Diane D’Agostin2,
  • Samantha Shuptrine3,
  • Nathaniel MacNell3,
  • Rebecca Ritter3,
  • Adam Burkholder4,
  • Zongli Xu1,
  • Jasmine A. Mack1,5,
  • John S. House1 (ORCID: orcid.org/0000-0002-8447-7871),
  • Daniel J. Schaid6,
  • Shannon K. McDonnell6,
  • Shepherd H. Schurman7 (ORCID: orcid.org/0000-0002-9133-7906),
  • David C. Fargo4,
  • Charles P. Schmitt8 (ORCID: orcid.org/0000-0002-3148-2263),
  • Janet E. Hall2 &
  • Alison A. Motsinger-Reif1 (ORCID: orcid.org/0000-0003-1346-2493)

Scientific Data (2026)

We are providing an unedited version of this manuscript to give early access to its findings. Before final publication, the manuscript will undergo further editing. Please note there may be errors present which affect the content, and all legal disclaimers apply.

Subjects

  • Data acquisition
  • Data integration
  • Genetic interaction

Abstract

The Personalized Environment and Genes Study (PEGS) is a unique resource comprising genetic and environmental exposure data linked to geospatial data. The PEGS cohort contains 19,445 demographically diverse participants who provided phenotype and exposure data by completing three surveys. Whole-genome sequencing was performed for a subset of 4,737 participants to interrogate common and rare variants and structural variations, including high-resolution human leukocyte antigen (HLA) variants. Geographic coordinates were assigned to participant addresses, enabling the use of distance to contaminant sources and area-level air-pollutant concentrations as surrogates for exposure. Several tools are available to explore these data and the results of exposome-wide association studies (ExWAS) conducted on them. The i2b2 Query and Analysis Tool enables approved users to build customizable queries for exploring basic statistics from de-identified and aggregated PEGS data. PEGS Explorer allows users to explore published ExWAS results and rigorously calculated exposure correlations. Globe visualizations in this tool reflect the complex mixtures involved in the exposome and allow users to visualize correlations between exposures and common, complex diseases.
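
As a rough illustration of the distance-based exposure surrogate described above, the great-circle distance between a geocoded address and a contaminant source can be approximated with the haversine formula. The sketch below is not the PEGS pipeline: the function names, the spherical-Earth radius, and all coordinates are illustrative assumptions, and precise participant addresses are not part of the shared data.

```python
import math

# Mean Earth radius in km; assumes a spherical Earth (an approximation).
EARTH_RADIUS_KM = 6371.0088


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def nearest_source_km(address, sources):
    """Distance from one (lat, lon) address to the closest (lat, lon) source.

    `sources` must be a non-empty iterable; both names are hypothetical.
    """
    return min(haversine_km(address[0], address[1], s[0], s[1]) for s in sources)


if __name__ == "__main__":
    # Made-up coordinates, for demonstration only.
    home = (35.90, -78.90)
    facilities = [(35.95, -78.80), (36.10, -79.20)]
    print(f"nearest source: {nearest_source_km(home, facilities):.2f} km")
```

A per-participant value such as the output of `nearest_source_km` could then serve as one exposure variable in downstream analyses; a production workflow would more likely use a geodesic library and projected coordinates.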

Data availability

PEGS Data Freeze 3.1 is deposited in a controlled-access data repository hosted by the NIEHS. The repository supports secure, independent download and local analysis of approved data by authorized users.

The deposited datasets include de-identified survey data, derived exposure and geospatial metrics, whole-genome sequencing variant files, DNA methylation beta values, polygenic scores, and associated metadata and documentation. Direct identifiers, protected health information, and precise residential address data are not included in shared datasets.

Access to PEGS data is available to qualified academic, governmental, and commercial researchers through an application process. Applications are reviewed by the PEGS Executive Leadership Committee using predefined criteria, including scientific validity, consistency with participant consent and NIH IRB approvals, and the applicant’s ability to comply with data security requirements. Approval does not require collaboration with PEGS investigators, and approved users may analyze the data independently.

Data access is granted following execution of a Data Use Agreement that specifies permitted uses, data protection requirements, and reporting obligations. Information on the application process and a public copy of the PEGS Data Use Agreement are available at: https://www.niehs.nih.gov/research/atniehs/labs/crb/studies/pegs/collaboration/guidelines.

Code availability

Standard open-source packages were used to generate the PEGS data as described in the Methods section. The code pipelines for data generation are available at https://github.com/fsakhtari/PEGS_common/blob/master/pegs_common_utils.R.

Acknowledgements

We would like to thank the PEGS participants for their contributions to this work. We would also like to thank the Office of Communications and Public Liaison at NIEHS for their support in creating and maintaining the PEGS website and Donna Jeanne Corcoran for crafting the exceptionally elegant graphics for the PEGS website and this manuscript. We would also like to express our sincere appreciation to Sharon Soucek in the Office of Technology Transfer at NIEHS for support and expertise regarding the data use agreements that enable collaborative research projects with PEGS. We would also like to thank Hannah Collins Cakar for assistance with manuscript preparation.

Author information

Authors and Affiliations

  1. Biostatistics and Computational Biology Branch, National Institute of Environmental Health Sciences, Durham, NC, USA

    Farida S. Akhtari, Zongli Xu, Jasmine A. Mack, John S. House & Alison A. Motsinger-Reif

  2. Clinical Research Branch, National Institute of Environmental Health Sciences, Durham, NC, USA

    Farida S. Akhtari, Jennifer Madenspacher, Diane D’Agostin & Janet E. Hall

  3. DLH Corporation, Bethesda, MD, USA

    Samantha Shuptrine, Nathaniel MacNell & Rebecca Ritter

  4. Office of the Director, National Institute of Environmental Health Sciences, Durham, NC, USA

    Adam Burkholder & David C. Fargo

  5. Department of Obstetrics and Gynaecology, University of Cambridge, Cambridge, UK

    Jasmine A. Mack

  6. Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN, USA

    Daniel J. Schaid & Shannon K. McDonnell

  7. Clinical Research Core, National Institute on Aging, Bethesda, MD, USA

    Shepherd H. Schurman

  8. Office of Data Science, National Institute of Environmental Health Sciences, Durham, NC, USA

    Charles P. Schmitt

Contributions

A.A.M.-R., J.E.H., C.P.S., D.C.F., F.S.A. and S.H.S. conceived the study design. A.B., D.C.F., J.E.H. and F.S.A. developed the methodology used in the study. A.B., F.S.A., C.P.S. and J.E.H. were responsible for the software used. A.B., F.S.A., and J.M. coordinated validation of the study data. A.B., F.S.A., J.M. and J.S.H. conducted formal analysis. D.D., F.S.A., N.M., S.S., D.J.S., S.K.M., R.R., Z.X., J.M. and J.E.H. were responsible for data curation. F.S.A. was responsible for visualization. A.A.M.-R., J.E.H., C.P.S. and D.C.F. were responsible for project administration. A.A.M.-R. and J.E.H. coordinated funding acquisition. D.J.S. and S.K.M. conducted PRS analyses. F.S.A. and A.A.M.-R. wrote the initial draft of the manuscript. All authors read, edited, and approved the final manuscript.

Corresponding author

Correspondence to Alison A. Motsinger-Reif.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Akhtari, F.S., Madenspacher, J., D’Agostin, D. et al. Personalized Environment and Genes Study (PEGS) Dataset-a resource for genomic, exposomic, and geospatial data. Sci Data (2026). https://doi.org/10.1038/s41597-026-07011-x

  • Received: 23 October 2024

  • Accepted: 02 March 2026

  • Published: 02 April 2026

  • DOI: https://doi.org/10.1038/s41597-026-07011-x
