Insights from the Biorepository and Integrative Genomics pediatric resource

Buonaiuto, Silvia; Marsico, Franco; Mohammed, Akram; Chinthala, Lokesh K.; Amos-Abanyie, Ernestine K.; Prins, Pjotr; Mozhui, Khyobeni; Rooney, Robert J.; Williams, Robert W.; Davis, Robert L.; Finkel, Terri H.; Brown, Chester W.; Colonna, Vincenza

doi:10.1038/s41467-025-59375-0

Article
Open access
Published: 22 May 2025

Nature Communications volume 16, Article number: 4750 (2025)

Abstract

The Biorepository and Integrative Genomics (BIG) Initiative in Tennessee has developed a pioneering resource to address gaps in genomic research by linking genomic, phenotypic, and environmental data from a diverse Mid-South population, including underrepresented groups. We analyzed 13,152 exomes from BIG and found significant genetic diversity, with 50% of participants inferred to have non-European or several types of admixed ancestry. Ancestry within the BIG cohort is stratified, with distinct geographic and demographic patterns, as African ancestry is more common in urban areas, while European ancestry is more common in suburban regions. We observe ancestry-specific rates of novel genetic variants, which are enriched for functional or clinical relevance. Disease prevalence analysis linked ancestry and environmental factors, showing higher odds ratios for asthma and obesity in minority groups, particularly in the urban area. Finally, we observe discrepancies between self-reported race and genetic ancestry, with related individuals self-identifying in differing racial categories. These findings underscore the limitations of race as a biomedical variable. BIG has proven to be an effective model for community-centered precision medicine. We integrated genomics education and fostered great trust among the contributing communities. Future goals include cohort expansion and enhanced genomic analysis to ensure equitable healthcare outcomes.

Introduction

To date, most genetic data available for human research have originated from European populations, introducing a bias in medical research and healthcare that fails to accurately represent the genetic diversity of the global human population^1,2,3,4,5. Systemic inequities were aggravated by historical technological limitations: early SNP arrays^6,7,8, for example, were primarily designed based on data from European populations. Recent breakthroughs^{9,10,11,12,13,14}, culminating in the development of human pangenome assemblies^15,16, have finally begun dismantling the technological barriers that reinforced genetic research disparities across populations. Genetic risk assessments based on European ancestry cohorts yield less accurate outcomes for non-European populations, as seen with CYP2C19 gene variants, which affect drug metabolism and increase risks of misdiagnosis or delayed treatment^17,18,19. While the importance of including ethnically diverse populations in studies of quantitative trait evolution is well known²⁰, the underrepresentation of diverse populations in genetic research exacerbates health inequities and limits understanding of disease genetics across ancestries, further deepening existing treatment disparities. This underrepresentation underscores the urgent need for more inclusive and diverse genetic studies to improve global health outcomes, and it has led to a surge of initiatives aimed at addressing these disparities (e.g., Refs. ^14,21,22,23).

The Biorepository and Integrative Genomics (BIG) Initiative of Tennessee (US) is a multi-institute initiative that has developed a biorepository resource from a diverse Mid-South population in the US. The resource includes African Americans from Memphis, a population previously shown to have among the highest and most diverse proportions of African ancestry in the United States, which makes it particularly valuable for studying African genetic diversity in admixed populations^24,25, and rural populations in Appalachia, which are disproportionately impacted by chronic diseases and the associated costs of healthcare^26,27. The BIG biospecimens and their genomic data are linked to de-identified electronic health records, with the purpose of creating a platform for genomics-based research that includes underrepresented populations and supports future personalized healthcare delivery platforms²⁸. The initial focus of BIG on building a large and diverse cohort for genetically informed treatment and prevention of pediatric conditions has now been expanded to a state-wide program that enrolls participants of any age, with the goal of building genome-phenome-environment data for 100,000 Tennesseans.

Here we report on the analysis of 13,152 genomes from the BIG collection. We demonstrate that BIG is a genetically diverse and ethnically rich study population, representing a unique and valuable resource for inclusive genomics. Our findings highlight ancestry-specific diversity and genetic burden, underscoring the critical need for inclusive datasets. Finally, we show that self-reported race does not accurately reflect genetic ancestry and should be applied cautiously as a covariate in genetic analyses.

Results

A robust foundation for inclusive genomics studies

To date, the BIG initiative has consented over 42,000 participants with electronic health records and collected more than 15,000 biosamples from five collection sites (Fig. 1a). The BIG cohort is predominantly pediatric, with 87% of participants under 18 years old. At the time of sample collection, participant ages ranged from infancy to 90 years, with an average age of 8.4 years and a median age of 6.2 years (Supplementary Fig. 1). BIG stands out as one of the largest cohorts focused on diverse ancestries, providing a substantial representation of different ethnic backgrounds^{29,30,31,32,33,34,35} compared to cohorts with predominantly one ancestry^36,37 (Supplementary Table 1). Notably, it is among the few cohorts specifically enriched for children with diseases, unlike most pediatric cohorts that typically recruit healthy mother-child pairs during pregnancy^{30,31,33,35,38,39,40}.

**Fig. 1: Geographic distribution and global ancestry deconvolution of individuals from the BIG initiative.**

Since 2017, the BIG initiative has developed the Memphis Genomics Educational Network (MEMGEN) to engage the Memphis Shelby County public school district community in genomics education. MEMGEN has reached students in seven public high schools (with plans to expand to 25), providing hands-on genomic experiences and ethical discussions that inspire STEM careers and academic growth in underserved communities. Community engagement is strengthened through advisory boards like the Le Bonheur Family Partners Council, supporting the BIG initiative since 2015, and the UTHSC Community Advisory Board, representing seventeen grassroots organizations. These boards ensure research and educational efforts align with community needs, fostering a community-centered approach to precision medicine and addressing health disparities.

Capturing broad diversity and several types of admixture

Within the BIG cohort, we identified and phased 6.8 million high-confidence variable sites, evenly distributed across the genome (Supplementary Fig. 2), through exome sequencing and genotype-by-sequencing data from 13,152 individuals. We used this genetic information to understand the ancestry composition of BIG by performing supervised ancestry deconvolution⁴¹, with 1000 Genomes and HGDP as reference populations^42,43. While we observe a clear, uninterrupted cline of ancestry, we subdivided the data set into seven ancestry groups to account for admixture and further characterize our cohort (Fig. 1b). In practice, individuals were classified as not-admixed if more than 85% of their global ancestry corresponded to a single group. The choice of an 85% threshold reflects the understanding that genetic ancestry exists on a continuum; defining discrete categories therefore implies setting thresholds and making arbitrary decisions (ref. ²²; see Methods section). Furthermore, ancestral contributions over 10–15% are generally considered accurate and significant, while lower proportions are often linked to shorter ancestral segments and higher error rates⁴⁴.

According to this ancestry-based grouping, 50% of participants relate to individuals of non-European origin in the reference data sets. In particular, 20% of the BIG individuals are similar to Africans in the reference sets, and 30% present admixed origins, with two-way and multiple-admixture patterns (Fig. 1b). The group of individuals presenting more than two ancestry components is heterogeneous (Supplementary Fig. 3), consistent with previous observations⁴⁵. These figures, projected on all consented individuals, indicate that over 20,000 consented samples are likely of non-European or admixed origin, placing BIG among the largest pediatric cohorts with many admixed children (Supplementary Table 1).

The distribution of inferred ancestry groups by zip code shows ancestry stratification, with prevalence of European ancestry in the suburbs and areas surrounding Memphis (Figs. 1c, 4). Stratification appears even more marked when visualized by single ancestry (Fig. 1d). A high dissimilarity index⁴⁶ between EUR and AFR (0.67) is observed, highlighting relevant geographic difference, while AFR and EUR-AFR (0.24) are the most evenly distributed pair, indicating much closer spatial overlap (Supplementary Fig. 4c). This evidence indicates that BIG individuals with similar ancestry often share a similar environment, implying that geography could act as a confounding factor if not accounted for in association analyses.
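To make the dissimilarity index concrete, the sketch below assumes the standard definition, D = ½ Σᵢ |aᵢ/A − bᵢ/B|, where aᵢ and bᵢ are the numbers of individuals from each group in spatial unit i and A and B are the group totals. Whether this is the exact formulation used in ref. ⁴⁶ is an assumption, and the ZIP-code labels and counts are invented for illustration; they are not BIG data.

```python
def dissimilarity_index(counts_a, counts_b):
    """Dissimilarity index between two groups over the same spatial units.

    counts_a, counts_b: dicts mapping a unit (e.g. a ZIP code) to a head count.
    Returns a value in [0, 1]: 0 means identical spatial distributions,
    1 means the two groups never share a unit.
    """
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    if total_a == 0 or total_b == 0:
        raise ValueError("both groups need at least one individual")
    units = set(counts_a) | set(counts_b)
    return 0.5 * sum(
        abs(counts_a.get(u, 0) / total_a - counts_b.get(u, 0) / total_b)
        for u in units
    )


# Hypothetical counts per ZIP code (illustrative only, not BIG data).
eur = {"zip_A": 80, "zip_B": 15, "zip_C": 5}
afr = {"zip_A": 10, "zip_B": 30, "zip_C": 60}
d = dissimilarity_index(eur, afr)
```

Under this definition, a value such as the 0.67 reported for EUR and AFR can be read as the share of either group that would have to change ZIP code for the two spatial distributions to match.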

Integrating genetic, phenotypic, and environmental information

Electronic health records are an integral part of the BIG cohort, covering a range of Phecode categories⁴⁷, with gastrointestinal and respiratory medical conditions among the most represented (Supplementary Fig. 5). We examined the prevalence of obesity, hypertension, diabetes and asthma, four health conditions commonly associated with minority groups and local environmental influences⁴⁸. BIG children have a high incidence of diabetes and asthma (363 and 697 cases, respectively, Fig. 2a), while adults have a more balanced incidence across these same four diseases (Supplementary Fig. 6). Ancestry categories such as AFR and EUR-AFR are major contributors across conditions, and we observed higher odds ratios for obesity and asthma in minority groups (all individuals self-identified as belonging to non-White racial groups) compared to 200 randomly selected conditions (Fig. 2b).
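As a reminder of what an odds ratio measures, the sketch below computes one from a 2×2 table together with a Wald 95% confidence interval. The text does not state how the odds ratios in Fig. 2b were estimated, so this is only the textbook calculation; the function name and the counts are hypothetical.

```python
import math


def odds_ratio(group_cases, group_controls, other_cases, other_controls):
    """Odds ratio and Wald 95% confidence interval from a 2x2 table."""
    a, b, c, d = group_cases, group_controls, other_cases, other_controls
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive for the Wald interval")
    estimate = (a * d) / (b * c)
    # Standard error of log(OR) under the usual large-sample approximation.
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(estimate) - 1.96 * se_log)
    upper = math.exp(math.log(estimate) + 1.96 * se_log)
    return estimate, lower, upper


# Hypothetical counts (not BIG data): cases and non-cases of one condition
# in a group of interest versus everyone else.
estimate, lower, upper = odds_ratio(120, 880, 60, 940)
```

An interval that excludes 1 indicates that the condition is over- or under-represented in the group of interest at the chosen confidence level.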

**Fig. 2: Prevalence of diseases common in health disparities populations.**

Analysis of disease prevalence by zip code suggests a notable environmental component for obesity and asthma. In particular, three suburban areas around Memphis exhibit above-average prevalence for both conditions, with asthma being 1.7 times more prevalent in these zones compared to the overall prevalence in BIG (≈20% versus 12.8%, 95% CI [12.51-13.19]; Fig. 2c). While these analyses are only preliminary, the resulting observations underscore the value of the BIG dataset in linking genetic, phenotypic, and environmental information, enabling a multidimensional understanding of health disparities.

Ancestry-specific diversity and genetic burden

Our joint principal component analysis (PCA) of the BIG and 1000 Genomes datasets (Fig. 3a, Supplementary Fig. 7) reveals significant genetic diversity in the BIG dataset, with mixed ancestry groups contributing to the spread and overlap between clusters corresponding to African, American, East Asian, and European individuals in the 1000 Genomes. In contrast, the populations of the 1000 Genomes dataset that we used as reference for ancestry deconvolution exhibit more distinct clustering with minimal overlap, reflecting more clearly defined ancestral groups. These results underscore the BIG dataset’s value in capturing admixture and genetic diversity not represented in the 1000 Genomes, highlighting the importance of including diverse and admixed populations in genetic studies to better capture the full spectrum of human variation.

**Fig. 3: Genetic variability and genetic burden in the BIG cohort.**

As expected, the average number of genetic differences from the reference human genome varies by ancestry⁴². Individuals with African or admixed African ancestry typically have, on average, ~85k more variable sites compared to other ancestry groups (Fig. 3b). This observation underscores the risk of bias in using a single reference sequence and its associated genomic annotations. The genetic diversity represented within BIG would be more accurately modeled by a pangenomic approach¹⁵.

Our dataset includes 771,717 novel single nucleotide variants (11.2% of the total), which are absent from major databases such as gnomAD, 1000 Genomes Project, Human Genome Diversity Project, or dbSNP^42,43,49,50. Novel variants are mostly rare and private to ancestries, as expected (Supplementary Fig. 9). The rough number of novel variants per individual is higher within inferred admixed ancestries, Americans, and Asians (Fig. 3c). This is especially true for rare novel variants, suggesting that admixture may expose previously undetected rare variation (Fig. 3d, Supplementary Fig. 8). Some novel variants have important functional consequences on the gene product (Supplementary Fig. 9, VEP classification⁵¹: 2.8% high impact, including frameshift variants, stop/start gain/loss and splicing affecting variants; 19.7%: missense) and potential implications for disease association (11.0% predicted to be deleterious by SIFT⁵²; 7.9% considered probably or possibly damaging by PolyPhen⁵³). Notably, the rate of high impact annotation in novel variants is double compared to known variants (logistic regression coefficient β = 0.95, p-value < 0.001, Supplementary Table 3, Fig. 3e).

Genetic burden by ancestry was evaluated as the distribution of rare deleterious (alternate allele frequency <1% in the total BIG samples, predicted to have high impact or missense with SIFT<0.05 and PolyPhen>0.85) versus rare synonymous genetic variants across different ancestral groups. Among non-admixed groups, African individuals display the lowest deleterious/synonymous ratio, whereas European individuals exhibit the highest (Fig. 3f). Admixed populations show broader distributions in deleterious/synonymous ratios, with the European-American group demonstrating the highest ratios. In the EUR-AMR group, the average number of rare deleterious variants per Gb is significantly higher in the AMR tracts than in the EUR ones (Fig. 3g, Supplementary Fig. 10), as shown in other studies⁵⁴, likely due to demography and founder effects^55,56.
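The burden definition above can be sketched as a small filter over annotated variants. The thresholds (allele frequency <1%, SIFT<0.05, PolyPhen>0.85) come from the text; the record layout, field names, and toy variants are assumptions made for illustration, using VEP-style `HIGH` impact and `missense_variant`/`synonymous_variant` consequence terms. This is not the authors' pipeline.

```python
def is_rare(variant, max_af=0.01):
    """Rare = alternate allele frequency below 1% in the whole cohort."""
    return variant["af"] < max_af


def is_deleterious(variant):
    """High impact, or missense with SIFT < 0.05 and PolyPhen > 0.85."""
    if variant["impact"] == "HIGH":
        return True
    return (
        variant["consequence"] == "missense_variant"
        and variant["sift"] is not None
        and variant["sift"] < 0.05
        and variant["polyphen"] is not None
        and variant["polyphen"] > 0.85
    )


def del_syn_ratio(variants):
    """Ratio of rare deleterious to rare synonymous variants."""
    rare = [v for v in variants if is_rare(v)]
    n_del = sum(is_deleterious(v) for v in rare)
    n_syn = sum(v["consequence"] == "synonymous_variant" for v in rare)
    return n_del / n_syn if n_syn else float("nan")


def record(af, impact, consequence, sift=None, polyphen=None):
    """Build one toy variant record (hypothetical layout)."""
    return {"af": af, "impact": impact, "consequence": consequence,
            "sift": sift, "polyphen": polyphen}


# Toy variants carried by one hypothetical individual (not BIG data).
toy_variants = [
    record(0.002, "HIGH", "stop_gained"),
    record(0.004, "MODERATE", "missense_variant", sift=0.01, polyphen=0.95),
    record(0.003, "MODERATE", "missense_variant", sift=0.20, polyphen=0.10),
    record(0.005, "LOW", "synonymous_variant"),
    record(0.008, "LOW", "synonymous_variant"),
    record(0.009, "LOW", "synonymous_variant"),
    record(0.001, "LOW", "synonymous_variant"),
    record(0.300, "HIGH", "frameshift_variant"),  # common, so excluded
]
ratio = del_syn_ratio(toy_variants)
```

Computing this ratio per individual and then summarizing it by ancestry group would give a distribution of the kind described for Fig. 3f.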

Overall, the remarkable breadth of genetic diversity observed underscores BIG’s value as a comprehensive resource for exploring genetic variation, enhancing disease association studies, and promoting equitable genomic research in underrepresented populations.

Discrepancies between self-reported race and inferred genetic ancestry

We compared counts of individuals in self-reported racial categories with those in inferred genetic ancestry categories, with some racial categories aggregated for simplicity (Supplementary Table 2). The number of self-reported White individuals aligns closely with those inferred as Europeans, while participants identifying as Black or African American appear distributed between two genetic ancestry categories: Africans and admixed African-Europeans. For other racial groups, the patterns are more diverse and complex (Fig. 4a).

**Fig. 4: Poor alignment between self-reported race and genetic ancestry.**

We evaluated the fraction of the genome shared identical by descent (IBD) among all possible pairs of individuals and compared it with self-reported race. Predictably, IBD genome sharing was higher among individuals within the same self-reported race. However, we also detected IBD sharing compatible with 2nd and 3rd degree relationships (half-siblings and first cousins, respectively) between individuals of different self-reported races (Fig. 4b). This observation suggests that genetically related individuals may self-identify differently with respect to socially constructed categories like race.
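To illustrate how an IBD-sharing fraction maps onto a degree of relatedness, the sketch below relies on the expectation that sharing halves with each degree (about 50% for 1st, 25% for 2nd, and 12.5% for 3rd degree) and places the boundaries at the geometric midpoints between those expectations. These cutoffs are an assumption chosen for illustration; the text does not state which thresholds were applied.

```python
def relationship_degree(ibd_fraction):
    """Map the fraction of the genome shared IBD to a coarse degree label.

    Boundaries sit at geometric midpoints between the expected sharing of
    consecutive degrees (1, 1/2, 1/4, 1/8, ...). Illustrative cutoffs only.
    """
    if not 0.0 <= ibd_fraction <= 1.0:
        raise ValueError("ibd_fraction must be between 0 and 1")
    bounds = [
        (2 ** -0.5, "duplicate or monozygotic twin"),
        (2 ** -1.5, "1st degree"),
        (2 ** -2.5, "2nd degree"),
        (2 ** -3.5, "3rd degree"),
    ]
    for lower, label in bounds:
        if ibd_fraction > lower:
            return label
    return "unrelated or more distant"


# Expected sharing for full siblings or parent-child, half-siblings,
# and first cousins, respectively.
labels = [relationship_degree(x) for x in (0.5, 0.25, 0.125)]
```

Real pairs scatter around the expected values, so labels close to a boundary should be treated as tentative.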

The relationship between self-reported race and inferred ancestry was further examined among pairs of individuals who identified as belonging to the same race. In some instances, the self-reported race of a pair differed from that of other pairs within the same ancestry category (Fig. 4c). For example, one pair of first-degree relatives (sharing ~50% of their genome) who both self-reported as White were found to have differing inferred ancestries: one individual was classified as having African ancestry, while the other showed a mixture of African and European ancestries (represented by the orange triangle in the AFR; EUR-AFR category in Fig. 4c). Similarly, among three pairs of individuals self-reporting as Black or African American, one member of each pair was inferred to have European ancestry (represented by the purple triangle in the EUR; EUR-AFR category in Fig. 4c). These findings highlight the limitations of using self-reported race as a category for analyzing genetic variation.

Discussion

The BIG cohort is a genetically diverse and ethnically inclusive pediatric resource, addressing the historic underrepresentation of non-European populations in genomics research. With 87% of participants under 18 and 50% of non-European ancestry (including 20% closely aligning with African reference populations and 30% exhibiting complex admixture patterns), it offers broad genetic variability and significant potential to represent human genomic diversity. Previous comparative studies have shown that admixed African populations from Tennessee rank among those with the highest proportion of African ancestry in the United States²⁵. Notably, individuals from Memphis exhibit the greatest genetic diversity within their African ancestry component compared to thirteen other similar populations²⁴. Although our study is not explicitly comparative, these findings position the African and admixed African individuals in the BIG cohort as being among the most genetically diverse populations globally, similar to what was observed in the highly diverse multi-ethnic biobank BioMe⁵⁷. The high genetic diversity observed in BIG may be associated with the demographic and genealogical history of the African component in Memphis, as evidenced by a recent bottleneck followed by strong population growth²⁴, a line of inquiry that can be further explored in future analyses.

This diversity facilitated the discovery of new genetic variants, many of which may have clinical relevance. We have indications of ancestry-specific burden in admixed individuals. While this is an intriguing observation, it certainly deserves further investigation before any definitive conclusions can be reached. We believe that several factors, including sample size, stratification effects, and demography, must be carefully considered to achieve a more solid conclusion. This again underscores the importance of ensuring that relevant populations are well represented, as failing to do so risks leading to erroneous conclusions.

The higher number of novel variants observed in admixed individuals also deserves attention. This pattern could reflect several phenomena: First, admixture can create novel combinations of variants that were previously private to distinct ancestral populations. Second, the genetic recombination that occurs during admixture might expose previously masked deleterious variants or create new functional combinations. Third, the current reference databases may underrepresent admixed populations, making variants common in these groups appear novel in our analysis. These findings underscore both the importance of studying admixed populations and the need for more diverse reference panels in genomic research.

As a model for studying health disparities, the BIG cohort reveals higher odds ratios for obesity and asthma among minority groups, driven by genetic and environmental factors, as reflected in zip-code-specific disease patterns. We show that the BIG cohort has the potential to integrate genomic data, electronic health records, and environmental information to thoroughly analyze these and other common diseases⁵⁸. With relevance to disease mapping, our study highlights how self-identified racial categories often fail to align with genetic ancestry, as seen in other studies⁵⁹. The value of using race in biomedical research has been a longstanding topic of debate^60,61. Race is predominantly a socio-cultural construct, reflecting identity and social experiences rather than genetic heritage⁶². Nevertheless, race can serve as a useful framework for describing health disparities in societies where racial categories are deeply embedded in social structures⁵⁹, and there have been increasing calls for greater inclusion of underrepresented individuals in genetic and biomedical research to help clarify the relationship between race and ancestry^63,64.

A peculiar feature of the BIG cohort is the inclusion of many admixed individuals, encompassing four distinct patterns of admixture. Admixed populations constitute a significant part of global genetic diversity and present unique statistical challenges in the analysis of genetic variation, leading to their frequent exclusion from genomics and medical research. Admixture can be used to map quantitative traits and to detect positive selection^65,66, requiring smaller sample sizes compared to other mapping techniques⁶⁷. Admixture mapping leverages local ancestry inference to associate traits with an unusually high proportion of ancestry from one of the parental populations around the disease-causing locus^68,69,70 and it has been successfully used—as an example—to map Alzheimer’s disease⁷¹.

The findings from the BIG study hold significant implications for the scientific community. Most importantly, BIG pioneers a model for inclusive genomic studies, emphasizing community engagement to align research efforts with the needs of the contributing communities (Supplementary Fig. 11). Clinically, the insights gained from BIG can inform precision medicine initiatives for historically underserved populations, particularly in regions of Tennessee where African Americans and others face a disproportionate burden of chronic disease. Through MEMGEN, local students and families engage with hands-on genomics education and the ethical aspects of genetic research, which demystifies the science and inspires interest in STEM fields, promoting inclusivity by respecting cultural contexts and building trust.

A future key priority for the BIG initiative is to expand its participant base to include adults, allowing for a comprehensive study across all age groups and an even broader spectrum of genetic diversity. Continued community education is also a priority to sustain engagement and participation in the BIG initiative. Another important priority is to adopt a pangenomic approach in genetic data analysis to better represent the genetic diversity within the cohort. Moving toward an inclusive genome model that integrates multiple ancestries and population-specific variants will enhance the accuracy of variant identification and genetic association studies for individuals in the BIG cohort.

By embracing this pangenomic approach, the BIG initiative can establish a benchmark for inclusive genomics, ensuring that research benefits all participants by reflecting their unique genetic backgrounds.

In conclusion, the BIG initiative can continue to lead in inclusive genomics, creating a resource that supports equitable health outcomes and advances the field toward a truly representative model of precision medicine.

Methods

Ethics

This study adhered to the ethical principles outlined in the Declaration of Helsinki for medical research involving human subjects. This study was conducted in accordance with ethical standards and is approved by the Institutional Review Board (IRB) of UTHSC (IRB number: 23-09204-NHSR). Written informed consent was obtained from all participants; for pediatric subjects, consent was provided by their legal guardians or next of kin. To ensure confidentiality, all data were de-identified prior to analysis.

Sample collection sites

Le Bonheur Children’s Hospital (LBCH, Memphis, TN) - LBCH is the primary pediatric care center in Memphis, and serves a predominantly African American population in an area marked by significant health disparities. Recruitment at this site was launched in October 2015 and spans inpatient rooms, ICUs, outpatient clinics, and the emergency department. The geographical provenance of enrolled individuals follows, more or less, a gradient that reflects distance from the hospital (Supplementary Fig. 4). Information from genomic DNA extracted from leftover blood collected during routine care is linked to de-identified electronic health record data. Leftover samples are not always available for collection, although they can be collected on a subsequent visit. This explains the discrepancy between the number of consented participants and collected biosamples.

Regional One Health (ROH, Memphis, TN)—ROH is a leading healthcare provider in Memphis, providing comprehensive care to underserved and vulnerable communities in the same geographical area of LBCH. In May 2022, the BIG Initiative extended its reach to ROH, focusing on adult genomic research. Participants are recruited across hospital settings, with DNA collected from leftover blood during standard care and linked to de-identified EHR data. This expansion complements BIG’s pediatric focus at LBCH by including a diverse adult population.

East Tennessee State University (ETSU, Johnson City, TN)—The BIG Initiative expanded to ETSU in May 2023 to include the Appalachian region, emphasizing adult participant recruitment. DNA samples are collected through dedicated blood draws and linked to de-identified EHR data. ETSU’s inclusion aligns with BIG’s commitment to engaging rural and underserved populations, complementing efforts at LBCH and ROH to create a robust, diverse genomic database for advancing precision medicine across the Mid-South and Appalachia.

Family Resilience Initiative (FRI, Memphis, TN)—Launched in January 2019, the Family Resilience Initiative (FRI) examines the impact of adverse childhood experiences (ACEs) and social determinants of health on long-term outcomes. The program enrolls mother-child dyads from the Memphis region, collecting sputum and/or blood samples at four visits spaced 6 months apart. Samples are processed through BIG’s operational pipeline for DNA isolation, cortisol measurements, and clinical assessments. By linking biological and environmental data, FRI aims to understand ACEs’ physiological and epigenetic effects, providing insights to guide tailored interventions and improve family health in vulnerable communities.

DNA sequencing

The 13,152 samples were processed with NEB/Kapa reagents, captured with the Twist Comprehensive Exome Capture design, enhanced by Regeneron-designed spikes targeting sequencing genotyping sites. Among the sequenced samples, 95.2% achieved an average sequencing depth of at least 20X, and 99.3% of the samples had >90% of their bases covered at 20X or greater, highlighting the overall quality of the data. The genotyping spike targets an additional ≈ 1.4 M variants in the human genome. The genotyping call rate (the percentage of targeted SNP/indel genotyping sites at which a call can be made) is 99.0%. All samples were sequenced on an Illumina NovaSeq 6000 system with S4 flow cells using 2 × 75 paired-end sequencing.

Variant identification

Sequence reads were aligned by the Burrows-Wheeler Aligner (BWA) MEM⁷² to the GRCh38 assembly of the human reference genome in an alt-aware manner. Duplicates were marked using Picard, and mapped reads were sorted using sambamba⁷³. DeepVariant v0.10.0 with a custom exome model was used for variant calling⁷⁴, and the GLnexus v1.2.6 tool was used for joint variant calling⁷⁵. The variants were annotated using a Variant Effect Predictor (VEP 110)⁵¹. Phasing was performed using ShapeIT v5⁷⁶. Our dataset comprised 6,886,631 variable sites after quality control, combining both exome capture and targeted sequencing data. From these sites: 135,652 variants overlapping with reference populations were used for Principal Component Analysis; 2,482,155 variants meeting RFMix filter criteria were used for Global and Local ancestry inference.

Global and local ancestry inference

To characterize the genetic admixture within the BIG cohort, we performed a global and local ancestry inference (LAI) analysis using RFMix v.2.0 (https://github.com/slowkoni/rfmix)⁴¹. Reference samples included those of the 1000 Genomes Project and the Human Genome Diversity Project (HGDP), using the recently developed joint call⁷⁷. The merged genotyping dataset, which combined BIG participants with reference samples, consisted of autosomal variants. To select the reference samples, we followed a quality control previously used in other studies⁴⁵. To exclude reference samples with extensive admixture, we performed an unsupervised cluster analysis using ADMIXTURE⁷⁸. We selected 4 groups (k = 4), and reference samples with a major group proportion >0.99 were considered for the analysis. Four-way LAI was performed with the number of terminal nodes for the random forest classifier set to 5 (-n 5), the average number of generations since the expected admixture set to 12 (-G 12), and ten rounds of the expectation maximization algorithm (EM) (-e 10). The motivation behind the selection of k = 4 was our aim to characterize continental level ancestry, with four major groups: African, American, European and Asian. This aligns with the expectation for larger cities in the Americas, with the addition of the Asian group⁴⁵. This addition was considered based on self-reported race and ethnicity categories. Reference superpopulations selected at the continental level were African (AFR), American (AMR), European (EUR), and Asian (ASN). For the ASN group, we introduced two reference populations: East Asian (EAS) and Central South Asian (CSA). CSA ancestry was negligible, with 99% of the BIG cohort showing values close to 0 and a few cases below 0.075. As low global ancestry proportions are associated with inaccurate estimates, we excluded CSA from further analysis. Instead, we retained EAS, which showed a significant signal in a small proportion of cases, consistent with the low number of individuals self-reported as Asians. Specifically, AFR is represented by YRI (101), LWK (30), MSL (16), Mbuti (10), GWD (48), ESN (64), Bantu South Africa (3), Bantu Kenya (10) and Biaka (21) groups. EUR contains Tuscan (6), Sardinian (12), Orcadian (13), IBS (117), GBR (103), French (24), Bergamo Italian (9), Basque (17) and CEU (114). AMR is represented by Surui (6), Pima (10), PEL (10), Maya (16), Karitiana (7), and CLM (7). Finally, EAS is represented by CHS (106) and CHB (39). Local ancestry inference with RFMix2 was used to classify rare alleles (AF < 0.01), both synonymous and deleterious, by ancestry. A custom script was developed to process phased VCFs with local ancestry calls, assigning each allele to an ancestral population and generating ancestry-specific haplotype counts. This approach enables the precise tracking of allelic ancestry in samples.
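The custom script itself is not shown, so the following is only a minimal sketch of the idea it describes: pair each phased allele with the local ancestry label of the haplotype that carries it, and tally alternate alleles per ancestry. The in-memory record layout (`gt`, `anc`) is hypothetical and stands in for values parsed from a phased VCF and the corresponding local ancestry calls.

```python
from collections import Counter


def ancestry_allele_counts(sites):
    """Count alternate alleles per local ancestry for one individual.

    sites: iterable of dicts with
      "gt":  phased genotype string such as "0|1" (haplotype 0 | haplotype 1)
      "anc": pair of ancestry labels, one per haplotype, at that position
    """
    counts = Counter()
    for site in sites:
        alleles = site["gt"].split("|")
        if len(alleles) != 2:
            raise ValueError(f"genotype is not phased diploid: {site['gt']!r}")
        for allele, ancestry in zip(alleles, site["anc"]):
            if allele == "1":  # reference and missing alleles are not counted
                counts[ancestry] += 1
    return counts


# Toy input for one hypothetical EUR-AMR individual (not BIG data).
toy_sites = [
    {"gt": "0|1", "anc": ("EUR", "AMR")},
    {"gt": "1|1", "anc": ("EUR", "AMR")},
    {"gt": "1|0", "anc": ("AFR", "AFR")},
    {"gt": "0|0", "anc": ("EUR", "EUR")},
]
counts = ancestry_allele_counts(toy_sites)
```

Dividing such counts by the total length of the tracts assigned to each ancestry would give a per-Gb rate of the kind compared between AMR and EUR tracts.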

Discrete ancestry categories (AMR, AFR, EUR, EAS, EUR-AMR, EUR-AFR, and Multiway) were defined based on the following criteria: (i) individuals with >85% of a single ancestry were categorized into single-ancestry groups; (ii) individuals with at least 15% contribution from each of two ancestries, and a combined total of over 85%, were classified as two-way admixed; (iii) individuals with significant contributions (>15%) from three or more ancestries were classified as Multiway. The 85% threshold was chosen because genetic ancestry proportion is a continuous variable, requiring arbitrary decisions when defining discrete categories (see the About inferred population labels sub-section), and ancestral contributions above 10–15% are generally considered accurate and significant, while lower proportions are often associated with shorter ancestral segments and higher error rates²²,⁴⁴. The number of individuals per ancestry group by ZIP code (based on ZIP Code Tabulation Areas, ZCTA5, from the 2020 U.S. Census) was used to map the proportion of each ancestry within each location. The dissimilarity index⁴⁶ was calculated for ancestry categories with populations exceeding 500 individuals. To ensure reliable calculations, ZIP codes with fewer than 100 total individuals were excluded from the analysis. All geographic visualizations presented in this work were created using R. Maps were produced with the leaflet package (v. 2.2.1) using publicly available GeoJSON data for state ZIP-code boundaries.
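The categorization rules and the dissimilarity index can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, the "Unclassified" fallback for individuals matching none of the three criteria is an assumption, and two-way labels other than EUR-AMR and EUR-AFR are produced generically even though the text lists only those two. The index follows the standard Duncan and Duncan form, D = ½ Σᵢ |aᵢ/A − bᵢ/B|, computed here between two groups across areal units; which pairs of groups were compared in the study is not specified in this section.

```python
ORDER = ["EUR", "AFR", "AMR", "EAS"]  # fixed order so labels read EUR-AFR, EUR-AMR


def classify_ancestry(props, major=0.85, minor=0.15):
    """Map global ancestry proportions (dict name -> fraction) to a discrete label."""
    # (i) a single ancestry above 85%
    for anc in ORDER:
        if props.get(anc, 0.0) > major:
            return anc
    # (iii) more than 15% from three or more ancestries
    if sum(props.get(a, 0.0) > minor for a in ORDER) >= 3:
        return "Multiway"
    # (ii) at least 15% from each of two ancestries, together above 85%
    pair = [a for a in ORDER if props.get(a, 0.0) >= minor]
    if len(pair) == 2 and sum(props[a] for a in pair) > major:
        return "-".join(pair)
    return "Unclassified"  # assumption: not defined in the text


def dissimilarity_index(group_a, group_b):
    """Duncan dissimilarity index between two groups of per-area counts (dicts)."""
    total_a = sum(group_a.values())
    total_b = sum(group_b.values())
    areas = set(group_a) | set(group_b)
    return 0.5 * sum(
        abs(group_a.get(z, 0) / total_a - group_b.get(z, 0) / total_b) for z in areas
    )


print(classify_ancestry({"EUR": 0.50, "AFR": 0.45, "AMR": 0.05}))
print(dissimilarity_index({"zipA": 30, "zipB": 10}, {"zipA": 10, "zipB": 30}))
```

An index of 0 means the two groups are distributed identically across areas, and 1 means they never share an area.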

About inferred population labels

In this study, we use self-reported race and ethnicity, which are socially constructed and categorical, alongside genetic ancestry proxies derived from methods like RFMix⁴¹. Although race and ethnicity are discrete categories that reflect social and historical contexts, genetic ancestry arises from continuous biological processes that capture paths through the ancestral recombination graph⁷⁹. To facilitate our analysis, we categorize genetic ancestry into regional groupings such as AMR (ancestries from the Americas) or EUR (ancestries from Europe), but it is important to clarify that these labels are not fixed or essentialized categories⁸⁰. This grouping is useful only because it helps us explore the demographic and environmental histories that shape the variation of complex genetic traits. This discretization is merely one arbitrary scale, and in several analyses, we examine finer ancestral variation within these groupings using dimensionality reduction techniques (PCA), unsupervised clustering (ADMIXTURE) and relatedness (e.g., IBD segment analyses). We emphasize that such proxies cannot be equated with historical racial categories that have been used to justify inequality⁸¹. In fact, part of the results section focuses on showing the discrepancies between the two kinds of categories.

About self-reported race

Race is self-reported by enrolled patients at the time of admission to the hospital. The admission staff select the race code from a drop-down list of possible race categories according to HL7 standards for race and ethnicity (https://hl7-definition.caristix.com/v2/HL7v2.5/Tables/0005). Multiple race codes can be selected from the drop-down list when people associate themselves with multiple races. Nevertheless, due to the lack of standardization in historical record collection, some of the self-reported race classifications were inaccurate or inappropriate⁸². We therefore refined the data to reflect a more reliable classification system. The criteria for refinement are detailed in Supplementary Table 2.

Clinical data

The clinical data associated with BIG participants are extracted from the electronic health record (EHR) system in flat files and shared with UTHSC through a secure file transfer protocol. These data include demographics, visits, diagnoses, procedures, prescribed and administered medications, labs, and vital signs. These data elements are converted to a limited data set (LDS) and mapped to a common data model, the OMOP (Observational Medical Outcomes Partnership) CDM. To support the analysis, the ICD9/10 diagnosis codes are assigned to PheCodes. Disease phenotypes were defined using these PheCodes: asthma was identified using PheCode RE_475; obesity using PheCodes beginning with EM_236, which include obesity, overweight and obesity, morbid obesity, and localized adiposity; type 1 diabetes using PheCode EM_202.1; and hypertension using PheCode CV_401.
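The phenotype definitions above amount to a lookup over a participant's PheCodes, which a short Python sketch can make concrete. The function name and data layout are hypothetical. The text specifies prefix matching only for obesity (codes beginning with EM_236), so exact matching for the other three phenotypes is an assumption, and the sub-code EM_236.1 in the example is used purely for illustration.

```python
# PheCodes named in the text; exact matching for these three is an assumption.
EXACT_PHECODES = {
    "RE_475": "asthma",
    "EM_202.1": "type 1 diabetes",
    "CV_401": "hypertension",
}
OBESITY_PREFIX = "EM_236"  # the text says: PheCodes beginning with EM_236


def phenotypes_from_phecodes(phecodes):
    """Return the set of study phenotypes implied by a participant's PheCodes."""
    found = set()
    for code in phecodes:
        if code in EXACT_PHECODES:
            found.add(EXACT_PHECODES[code])
        if code.startswith(OBESITY_PREFIX):
            found.add("obesity")
    return found


# Illustrative participant record (codes other than those in the text are hypothetical).
print(phenotypes_from_phecodes(["RE_475", "EM_236.1"]))
```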

Diversity and population structure analyses

Joint PCA of the BIG and 1000GP cohorts was performed to compare genetic diversity. We used the bigsnpr R package protocol for PCA (https://privefl.github.io/bigsnpr)⁸³. Briefly, this involved using KING software⁸⁴ to estimate kinship coefficients and remove first- and second-degree relatives (cutoff < 0.0884). LD clumping (r < 0.2) and exclusion of long-range LD regions were based on Mahalanobis distances. Outliers were identified with a K-nearest-neighbor approach. The first 20 PCs were computed using truncated SVD. After excluding outliers, we projected related individuals onto the PC space. Variants with MAF < 0.01 were excluded. For ADMIXTURE analyses, we performed unsupervised clustering with k = 3, 4, 5, and 6. We applied standard quality control filters, including LD pruning and removal of variants with MAF < 0.01. Logistic regression was performed in R.
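The relatedness filter that precedes PCA can be illustrated with a small Python sketch. It is not the procedure implemented inside KING or bigsnpr: it is a simple greedy pruning written for illustration, the function name is hypothetical, and it reads the 0.0884 cutoff as "pairs with kinship above 0.0884 are treated as second-degree or closer", which is the conventional KING interpretation and an assumption about how the cutoff was applied here.

```python
def prune_relatives(pairs, threshold=0.0884):
    """Greedy pruning: return IDs to drop so that no related pair remains.

    pairs : iterable of (id_a, id_b, kinship) tuples, e.g. parsed from kinship output
    """
    # Build an undirected graph of pairs whose kinship exceeds the threshold.
    related = {}
    for a, b, kinship in pairs:
        if kinship > threshold:
            related.setdefault(a, set()).add(b)
            related.setdefault(b, set()).add(a)

    removed = set()
    while True:
        # Count, for each retained individual, how many retained relatives remain.
        degrees = {
            ind: len(neigh - removed)
            for ind, neigh in related.items()
            if ind not in removed
        }
        degrees = {ind: d for ind, d in degrees.items() if d > 0}
        if not degrees:
            break
        # Drop the individual with the most relatives (ties broken alphabetically).
        worst = max(sorted(degrees), key=degrees.get)
        removed.add(worst)
    return removed


pairs = [("A", "B", 0.25), ("B", "C", 0.12), ("C", "D", 0.02)]
print(prune_relatives(pairs))
```

Removing the most connected individual first tends to keep more samples than dropping one member of every pair independently. The individuals dropped here correspond to those the text describes as projected onto the PC space afterwards.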

Relatedness and identical by descent analysis

We analyzed relatedness and inferred family relationships using different approaches. To detect close relationships, we first used KING software to calculate kinship coefficients and determine the probability of sharing zero IBD (identity by descent)⁸⁴. We also performed kinship inference using REAP, to account for potential biases due to admixture⁸⁵. Quality control for kinship inference included removing variants with high missingness, filtering by MAF > 0.01, and performing LD pruning.

To identify IBD segments, we used hap-ibd in the phased data set comprising 13,152 genomes, focusing on autosomal loci⁸⁶. Hap-ibd was executed with a minimum seed parameter of 2 cM to detect IBD segments of at least this length. The inferred IBD segments were post-processed using the protocol developed by Browning et al.⁸⁷, particularly the merge-ibd-segments tool, with default parameters. Gaps with at most one discordant homozygote and <0.6 cM were removed. Total IBD between pairs of individuals was computed as the sum of the segments.
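The final step, summing segment lengths per pair, can be sketched in Python as below. The function name and the input layout are assumptions: segments are taken as already parsed into (first sample ID, second sample ID, length in cM) tuples after merging, and reading the hap-ibd output files is left out.

```python
from collections import defaultdict


def total_ibd_per_pair(segments):
    """Sum IBD segment lengths (cM) for each unordered pair of individuals.

    segments : iterable of (id1, id2, length_cM) tuples
    """
    totals = defaultdict(float)
    for id1, id2, length_cm in segments:
        pair = tuple(sorted((id1, id2)))  # treat (a, b) and (b, a) as the same pair
        totals[pair] += length_cm
    return dict(totals)


# Toy segments; sample IDs and lengths are illustrative only.
segments = [("s1", "s2", 2.5), ("s2", "s1", 3.0), ("s1", "s3", 4.0)]
print(total_ibd_per_pair(segments))
```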

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

The data supporting this study’s findings are sourced from the Biorepository and Integrative Genomics (BIG) and are not publicly available due to privacy and ethical restrictions. Access to the data is restricted to protect participant confidentiality and comply with institutional and regulatory requirements. Researchers may request access to the data after obtaining approval from the University of Tennessee Health Science Center (UTHSC) Institutional Review Board (IRB) and the BIG Research Oversight Committee. Requests should be submitted via the BIG portal at https://uthsc.edu/cbmi/big/. For further assistance, please contact biglist@uthsc.edu. Data access is granted only for legitimate research purposes, and approved requestors must comply with data use agreements. Requests will typically be processed within 4 weeks of submission. Once access is granted, the data will remain available for the duration of the approved research project. We support responsible data sharing and encourage interested researchers to contact the authors or BIG for additional details.

Code availability

The scripts used for QC, PCA, local and global ancestry deconvolution, and IBD analysis are available on https://github.com/SilviaBuonaiuto/BIG⁸⁸.

References

1. Van Hout, C. V. et al. Whole exome sequencing and characterization of coding variation in 49,960 individuals in the uk biobank. Nat. Commun. 11, 1–11 (2020).
2. Kyriazis, C. C. et al. Human genetic diversity and disease: from outside africa to within europe. Commun. Biol. 6, 353 (2023).
3. Sabeti, P. C. & Reich, D. Genetic and archeological evidence for early human population structure. Cell 179, 1462–1474 (2019).
4. Sirugo, G., Williams, S. M. & Tishkoff, S. A. The missing diversity in human genetic studies. Cell 177, 26–31 (2019).
5. Fatumo, S. et al. A roadmap to increase diversity in genomic studies. Nat. Med. 28, 243–250 (2022).
6. Risch, N. & Merikangas, K. The future of genetic studies of complex human diseases. Science 273, 1516–1517 (1996).
7. Nicolae, D. L., Wen, X., Voight, B. F. & Cox, N. J. Coverage and characteristics of the affymetrix genechip human mapping 100k snp set. PLoS Genetics 2, e67 (2006).
8. Cardon, L. R. & Abecasis, G. R. Using haplotype blocks to map human complex trait loci. Trends Genetics 19, 135–140 (2003).
9. Altshuler, D. et al. International hapmap 3 consortium: Integrating common and rare genetic variation in diverse human populations. Nature 467, 52 (2010).
10. Chen, G. et al. Development of admixture mapping panels for african americans from commercial high-density snp arrays. BMC Genomics 11, 1–12 (2010).
11. Tandon, A., Patterson, N. & Reich, D. Ancestry informative marker panels for african americans based on subsets of commercially available snp arrays. Genetic Epidemiol. 35, 80–83 (2011).
12. Consortium, H. et al. Enabling the genomic revolution in africa: H3africa is developing capacity for health-related genomics research in africa. Science (New York, NY) 344, 1346 (2014).
13. Mallick, S. et al. The simons genome diversity project: 300 genomes from 142 diverse populations. Nature 538, 201–206 (2016).
14. The All of Us Research Program Genomics Investigators. Genomic data in the all of us research program. Nature 627, 340–346 (2024).
15. Liao, W.-W. et al. A draft human pangenome reference. Nature 617, 312–324 (2023).
16. Garrison, E. et al. Building pangenome graphs. bioRxiv 2, 2023.04.05.535718 (2024).
17. Moreno-Grau, S. et al. Polygenic risk score portability for common diseases across genetically diverse populations. Hum. Genomics 18, 93 (2024).
18. AlAzzeh, O. & M Roman, Y. The frequency of rs2231142 in abcg2 among native hawaiian and pacific islander subgroups: implications for personalized rosuvastatin dosing. Pharmacogenomics 24, 173–182 (2023).
19. Twesigomwe, D. et al. Characterization of cyp2d6 pharmacogenetic variation in sub-saharan african populations. Clin. Pharmacol. Therap. 113, 643–659 (2023).
20. McQuillan, M. A., Zhang, C., Tishkoff, S. A. & Platt, A. The importance of including ethnically diverse populations in studies of quantitative trait evolution. Curr. Opin. Genetic Dev. 62, 30–35 (2020).
21. Ha, E. K. et al. Native hawaiian and pacific islander populations in genomic research. NPJ Genomic Med. 9, 45 (2024).
22. Sohail, M. et al. Mexican biobank advances population and medical genomics of diverse ancestries. Nature 622, 775–783 (2023).
23. Pereira, L., Mutesa, L., Tindana, P. & Ramsay, M. African genetic diversity and adaptation inform a precision medicine agenda. Nat. Rev. Genetics 22, 284–306 (2021).
24. Browning, S. R. et al. Ancestry-specific recent effective population size in the americas. PLoS Genetics 14, e1007385 (2018).
25. Baharian, S. et al. The great migration and african-american genomic diversity. PLoS Genetics 12, e1006059 (2016).
26. Crespo, R., Christiansen, M., Tieman, K. & Wittberg, R. An emerging model for community health worker–based chronic care management for patients with high health care costs in rural appalachia. Prevent. Chronic Dis. 17, E13 (2020).
27. Beatty, K., Egen, O., Dreyzehner, J. & Wykoff, R. Poverty and health in tennessee. South Med. J. 113, 1–7 (2020).
28. Jose, R., Rooney, R., Nagisetty, N., Davis, R. & Hains, D. Biorepository and integrative genomics initiative: designing and implementing a preliminary platform for predictive, preventive and personalized medicine at a pediatric hospital in a historically disadvantaged community in the usa. EPMA J. 9, 225–234 (2018).
29. Robison, L. L. et al. The childhood cancer survivor study: a national cancer institute–supported resource for outcome and intervention research. J. Clin. Oncol. 27, 2308–2318 (2009).
30. Pearson, C. et al. Boston birth cohort profile: rationale and study design. Precision Nutrition 1, e00011 (2022).
31. Kooijman, M. N. et al. The generation r study: design and cohort update 2017. Eur. J. Epidemiol. 31, 1243–1264 (2016).
32. Jernigan, T. L. et al. The pediatric imaging, neurocognition, and genetics (ping) data repository. Neuroimage 124, 1149–1154 (2016).
33. Louis, G. M. B. et al. Racial/ethnic standards for fetal growth: the nichd fetal growth studies. Am. J. Obstetrics Gynecol. 213, 449–e1 (2015).
34. Alexander, L. M. et al. An open resource for transdiagnostic research in pediatric mental health and learning disorders. Sci. Data 4, 1–26 (2017).
35. Park, C. H. et al. How the environmental influences on child health outcome (echo) cohort can spur discoveries in environmental epidemiology. Am. J. Epidemiol. 193, 1219–122 (2024).
36. Bisgaard, H. The copenhagen prospective study on asthma in childhood (copsac): design, rationale, and baseline data from a longitudinal birth cohort study. Ann. Allergy Asthma Immunol. 93, 381–389 (2004).
37. Gray, M. & Smart, D. Growing up in australia: the longitudinal study of australian children: a valuable new data source for economists. Aust. Econ. Rev. 42, 367–376 (2009).
38. Lawlor, D. A. et al. The second generation of the avon longitudinal study of parents and children (alspac-g2): a cohort profile. Wellcome Open Res. 4, 36 (2019).
39. Magnus, P. et al. Cohort profile: the norwegian mother and child cohort study (moba). Int. J. Epidemiol. 35, 1146–1150 (2006).
40. Tough, S. C. et al. Cohort profile: the all our babies pregnancy cohort (aob). Int. J. Epidemiol. 46, 1389–1390k (2017).
41. Maples, B. K., Gravel, S., Kenny, E. E. & Bustamante, C. D. Rfmix: a discriminative modeling approach for rapid and robust local-ancestry inference. Am. J. Hum. Genetics 93, 278–288 (2013).
42. Auton, A. et al. A global reference for human genetic variation. Nature 526, 68–74 (2015).
43. Bergström, A. et al. Insights into human genetic variation and population history from 929 diverse genomes. Science 367, eaay5012 (2020).
44. Gravel, S. Population genetics models of local ancestry. Genetics 191, 607–619 (2012).
45. Mas-Sandoval, A., Mathieson, S. & Fumagalli, M. The genomic footprint of social stratification in admixing american populations. Elife 12, e84429 (2023).
46. Duncan, O. D. & Duncan, B. A methodological analysis of segregation indexes. Am. Sociol. Rev. 20, 210–217 (1955).
47. Bastarache, L. Using phecodes for research with the electronic health record: from phewas to phers. Annu. Rev. Biomed. Data Sci. 4, 1–19 (2021).
48. Johnson, C. C. et al. Us childhood asthma incidence rate patterns from the echo consortium to identify high-risk groups for primary prevention. JAMA Pediatrics 175, 919–927 (2021).
49. Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).
50. Sherry, S. T., Ward, M. & Sirotkin, K. dbsnp: database for single nucleotide polymorphisms and other classes of minor genetic variation. Genome Res. 9, 677–679 (1999).
51. McLaren, W. et al. The ensembl variant effect predictor. Genome Biol. 17, 1–14 (2016).
52. Vaser, R., Adusumalli, S., Leng, S. N., Sikic, M. & Ng, P. C. Sift missense predictions for genomes. Nat. Protocols 11, 1–9 (2016).
53. Adzhubei, I., Jordan, D. M. & Sunyaev, S. R. Predicting functional effect of human missense mutations using polyphen-2. Curr. Protocols Hum. Genetics 76, 7–20 (2013).
54. Szpiech, Z. A. et al. Ancestry-dependent enrichment of deleterious homozygotes in runs of homozygosity. Am. J. Hum. Genetics 105, 747–762 (2019).
55. Castro e Silva, M. A. et al. Population histories and genomic diversity of south american natives. Mol. Biol. Evol. 39, msab339 (2022).
56. Niedbalski, S. D. & Long, J. C. Novel alleles gained during the beringian isolation period. Sci. Rep. 12, 4289 (2022).
57. Belbin, G. M. et al. Toward a fine-scale population health monitoring system. Cell 184, 2068–2083 (2021).
58. Voorhies, K. et al. Gsdmb/ormdl3 rare/common variants are associated with inhaled corticosteroid response among children with asthma. Genes 15, 420 (2024).
59. Mersha, T. B. & Abebe, T. Self-reported race/ethnicity in the age of genomic research: its potential impact on understanding health disparities. Hum. Genomics 9, 1 (2015).
60. Burchard, E. G. et al. The importance of race and ethnic background in biomedical research and clinical practice. N. Engl. J. Med. 348, 1170-5 (2003).
61. Cooper, R. S. Race and genomics. N. Engl. J. Med. 348, 1166 (2003).
62. White, K., Lawrence, J. A., Tchangalova, N., Huang, S. J. & Cummings, J. L. Socially-assigned race and health: a scoping review with global implications for population health equity. Int. J. Equity Health 19, 1–14 (2020).
63. Martschenko, D. O., Wand, H., Young, J. L. & Wojcik, G. L. Including multiracial individuals is crucial for race, ethnicity and ancestry frameworks in genetics and genomics. Nat. Genetics 55, 895–900 (2023).
64. Sirugo, G. et al. The quagmire of race, genetic ancestry, and health disparities. J. Clin. Invest. 131, e150255 (2021).
65. Hamid, I., Korunes, K. L., Beleza, S. & Goldberg, A. Rapid adaptation to malaria facilitated by admixture in the human population of cabo verde. Elife 10, e63177 (2021).
66. Hamid, I., Korunes, K. L., Schrider, D. R. & Goldberg, A. Localizing post-admixture adaptive variants with object detection on ancestry-painted chromosomes. Mol. Biol. Evol. 40, msad074 (2023).
67. Lin, M., Park, D. S., Zaitlen, N. A., Henn, B. M. & Gignoux, C. R. Admixed populations improve power for variant discovery and portability in genome-wide association studies. Front. Genetics 12, 673167 (2021).
68. Patterson, N. et al. Methods for high-density admixture mapping of disease genes. Am. J. Hum. Genetics 74, 979–1000 (2004).
69. Suarez-Pajes, E., Díaz-de Usera, A., Marcelino-Rodríguez, I., Guillen-Guio, B. & Flores, C. Genetic ancestry inference and its application for the genetic mapping of human diseases. Int. J. Mol. Sci. 22, 6962 (2021).
70. Smith, M. W. et al. A high-density admixture map for disease gene discovery in african americans. Am. J. Hum. Genetics 74, 1001–1013 (2004).
71. Horimoto, A. R., Xue, D., Thornton, T. A. & Blue, E. E. Admixture mapping reveals the association between native american ancestry at 3q13.11 and reduced risk of alzheimer’s disease in caribbean hispanics. Alzheimer’s Res. Ther. 13, 1–14 (2021).
72. Li, H. & Durbin, R. Fast and accurate short read alignment with burrows–wheeler transform. Bioinformatics 25, 1754–1760 (2009).
73. Tarasov, A., Vilella, A. J., Cuppen, E., Nijman, I. J. & Prins, P. Sambamba: fast processing of ngs alignment formats. Bioinformatics 31, 2032–2034 (2015).
74. Poplin, R. et al. A universal snp and small-indel variant caller using deep neural networks. Nature Biotechnology 36, 983–987 (2018).
75. Lin, M. F. et al. Glnexus: joint variant calling for large cohort sequencing. BioRxiv https://doi.org/10.1101/343970 (2018).
76. Delaneau, O., Howie, B., Cox, A. J., Zagury, J.-F. & Marchini, J. Haplotype estimation using sequencing reads. Am. J. Hum. Genetics 93, 687–696 (2013).
77. Koenig, Z. et al. A harmonized public resource of deeply sequenced diverse human genomes. bioRxiv 28, 2023.01.23.525248 (2024).
78. Alexander, D. H. & Lange, K. Enhancements to the admixture algorithm for individual ancestry estimation. BMC Bioinformatics 12, 1–6 (2011).
79. Lewis, A. C. et al. Getting genetic ancestry right for science and society. Science 376, 250–252 (2022).
80. Committee on the Use of Race, E., National Academies of Sciences, E. & Medicine. Using Population Descriptors In Genetics And Genomics Research: A New Framework For An Evolving Field. (National Academies Press, 2023).
81. Machado, H. & Granja, R. Emerging DNA Technologies And Stigmatization. Forensic Genetics in the Governance of Crime 1st edn, Vol. 114 (Palgrave Pivot Singapore, 2020).
82. Popejoy, A. B. Too many scientists still say caucasian. Nature 596, 463–463 (2021).
83. Privé, F., Aschard, H., Ziyatdinov, A. & Blum, M. G. Efficient analysis of large-scale genome-wide data with two R packages: bigstatsr and bigsnpr. Bioinformatics 34, 2781–2787 (2018).
84. Manichaikul, A. et al. Robust relationship inference in genome-wide association studies. Bioinformatics 26, 2867–2873 (2010).
85. Thornton, T. et al. Estimating kinship in admixed populations. Am. J. Hum. Genetics 91, 122–138 (2012).
86. Zhou, Y., Browning, S. R. & Browning, B. L. A fast and simple method for detecting identity-by-descent segments in large-scale data. Am. J. Hum. Genetics 106, 426–437 (2020).
87. Browning, B. L. & Browning, S. R. Improving the accuracy and efficiency of identity-by-descent detection in population data. Genetics 194, 459–471 (2013).
88. Buonaiuto, S. et al. Insights from the biorepository and integrative genomics pediatric resource. https://github.com/SilviaBuonaiuto/BIG (2025).

Acknowledgements

We extend our gratitude to all the individuals and their families who generously contributed to the BIG initiative. We would like to thank Carol Hendrix and the consent teams in Memphis and in Johnson City for oversight of recruitment and sample collection; Kito Lord, from ROH; James Adkins and Jonathan Patrick Moorman from ETSU; Jason Yaun and Sandra Arnold from FRI; Marcella Vacca; Scott Strome; Jon McCullers; David Haines; Peter Buckley, G. Nicholas Verne, and Pamela Beckley from UTHSC; Trey Eubanks from Le Bonheur Children’s Hospital; and the BIG Community Advisory Board. The authors gratefully acknowledge support from the Center for Integrative and Translational Genomics at UTHSC (SB, FM, RWW, PP, VC); NIH/NIGMS (R01GM123489 to PP); NSF (PPoSS Award 2118709 to PP); the NIH/NHLBI (RO1 HL170151 to THF); The Rady Children’s Institute for Genomic Medicine (THF); the Children’s Foundation of Memphis (THF); the Urban Child Institute; the Children’s Foundation Research Institute, Children’s Foundation of Memphis; the Assisi Foundation (CWB). The Children’s Research Foundation Institute, Le Bonheur Children’s Hospital.

Author information

These authors contributed equally: Silvia Buonaiuto, Franco Marsico.

Authors and Affiliations

Dept of Genetics, Genomics and Informatics, UTHSC, Memphis, TN, USA
Silvia Buonaiuto, Franco Marsico, Ernestine K. Amos-Abanyie, Pjotr Prins, Khyobeni Mozhui, Robert J. Rooney, Robert W. Williams, Chester W. Brown & Vincenza Colonna
Institute of Genetics and Biophysics, National Research Council, Naples, 80111, Italy
Franco Marsico & Vincenza Colonna
Center for Biomedical Informatics, UTHSC, Memphis, TN, USA
Akram Mohammed, Lokesh K. Chinthala & Robert L. Davis
Department of Preventive Medicine, Division of Preventive Medicine, UTHSC, Memphis, TN, USA
Khyobeni Mozhui
Dept of Pediatrics, UTHSC, Memphis, TN, USA
Robert J. Rooney & Vincenza Colonna
Center for Integrative and Translational Genomics, UTHSC, Memphis, TN, USA
Robert W. Williams
Dept of Pediatrics, Division of Rheumatology, UTHSC, Memphis, TN, USA
Terri H. Finkel
Dept of Pediatrics, Division of Genetics, UTHSC, Memphis, TN, USA
Chester W. Brown
Regeneron Genetics Center, Tarrytown, NY, USA
Aris Baras, Goncalo Abecasis, Adolfo Ferrando, Giovanni Coppola, Andrew Deubler, Aris Economides, Luca A. Lotta, John D. Overton, Jeffrey G. Reid, Alan Shuldiner, Katherine Siminovitch, Jason Portnoy, Marcus B. Jones, Lyndon Mitnaul, Alison Fenney, Jonathan Marchini, Manuel Allen Revez Ferreira, Maya Ghoussaini, Mona Nafde, William Salerno, Christina Beechert, Erin Fuller, Laura M. Cremona, Eugene Kalyuskin, Hang Du, Caitlin Forsythe, Zhenhua Gu, Kristy Guevara, Michael Lattari, Alexander Lopez, Kia Manoochehri, Prathyusha Challa, Manasi Pradhan, Raymond Reynoso, Ricardo Schiavo, Maria Sotiropoulos Padilla, Chenggu Wang, Sarah E. Wolf, Amelia Averitt, Nilanjana Banerjee, Dadong Li, Sameer Malhotra, Justin Mower, Mudasar Sarwar, Deepika Sharma, Sean Yu, Aaron Zhang, Muhammad Aqeel, Manan Goyal, George Mitra, Sanjay Sreeram, Rouel Lanche, Vrushali Mahajan, Sai Lakshmi Vasireddy, Gisu Eom, Krishna Pawan Punuru, Sujit Gokhale, Benjamin Sultan, Pooja Mule, Eliot Austin, Xiaodong Bai, Lance Zhang, Sean O’Keeffe, Razvan Panea, Evan Edelstein, Ayesha Rasool, Evan K. Maxwell, Boris Boutkov, Alexander Gorovits, Ju Guan, Lukas Habegger, Alicia Hawes, Olga Krasheninina, Samantha Zarate, Adam J. 
Mansfield, Joshua Backman, Kathy Burch, Adrian Campos, Liron Ganel, Sheila Gaynor, Benjamin Geraghty, Arkopravo Ghosh, Salvador Romero Martinez, Christopher Gillies, Lauren Gurski, Joseph Herman, Eric Jorgenson, Tyler Joseph, Michael Kessler, Jack Kosmicki, Adam Locke, Priyanka Nakka, Karl Landheer, Olivier Delaneau, Anthony Marcketta, Joelle Mbatchou, Arden Moscati, Aditeya Pandey, Anita Pandit, Jonathan Ross, Carlo Sidore, Eli Stahl, Timothy Thornton, Sailaja Vedantam, Rujin Wang, Kuan-Han Wu, Bin Ye, Blair Zhang, Andrey Ziyatdinov, Yuxin Zou, Jingning Zhang, Kyoko Watanabe, Mira Tang, Frank Wendt, Suganthi Balasubramanian, Suying Bao, Kathie Sun, Chuanyi Zhang, Brian Hobbs, Jon Silver, William Palmer, Rita Guerreiro, Amit Joshi, Antoine Baldassari, Cristen Willer, Sarah Graham, Ernst Mayerhofer, Erola Pairo Castineira, Mary Haas, Niek Verweij, George Hindy, Jonas Bovijn, Tanima De, Parsa Akbari, Luanluan Sun, Olukayode Sosina, Arthur Gilly, Peter Dornbos, Juan Rodriguez-Flores, Moeen Riaz, Manav Kapoor, Gannie Tzoneva, Momodou W. Jallow, Anna Alkelai, Ariane Ayer, Veera Rajagopal, Sahar Gelfman, Vijay Kumar, Jacqueline Otto, Neelroop Parikshak, Aysegul Guvenek, Jose Bras, Silvia Alvarez, Jessie Brown, Jing He, Hossein Khiabanian, Joana Revez, Kimberly Skead, Valentina Zavala, Jae Soon Sul, Lei Chen, Sam Choi, Amy Damask, Nan Lin, Charles Paulding, Esteban Chen, Michelle G. LeBlanc, Jason Mighty, Jennifer Rico-Varela, Nirupama Nishtala, Nadia Rana, Jaimee Hernandez, Randi Schwartz, Jody Hankins, Anna Han, Samuel Hart, Ann Perez-Beals, Gina Solari, Johannie Rivera-Picart, Michelle Pagan & Sunilbe Siceron

Contributions

SB and FM contributed equally to the study. Conceived the analyses: SB, FM, PP, AM, KM, RWW, RLD, CWB, VC; Sample sequencing: RGC; Data curation: SB, FM, AM, LKC; Formal analysis: SB, FM, AM, EKA, VC; Funding acquisition: PP, RJR, RWW, RLD, THF, CWB, VC; Investigation: SB, FM, VC; Methodology: SB, FM, VC; Resources: PP, RWW, CWB; Software: PP; Supervision: PP, RJR, RWW, RLD, THF, CWB, VC; Visualization: SB, FM, VC; Writing - original draft: SB, FM, EKA, RWW, RLD, THF, CWB, VC; Writing - review & editing: SB, FM, AM, LKC, RGC, PP, KM, EKA, RJR, RWW, RLD, THF, CWB, VC.

Corresponding author

Correspondence to Vincenza Colonna.

Ethics declarations

Competing interests

The Regeneron Genetics Center is a subsidiary of Regeneron Pharmaceuticals, Inc. All other authors declare no competing interests.

Peer review

Peer review information

Nature Communications thanks the anonymous reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Reporting Summary

Transparent Peer Review file

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Buonaiuto, S., Marsico, F., Mohammed, A. et al. Insights from the Biorepository and Integrative Genomics pediatric resource. Nat Commun 16, 4750 (2025). https://doi.org/10.1038/s41467-025-59375-0

Received: 06 December 2024
Accepted: 22 April 2025
Published: 22 May 2025
DOI: https://doi.org/10.1038/s41467-025-59375-0