Introduction

The biomedical research community has become increasingly aware of the genomics research gap, whereby the vast majority of participants in genetics research cohorts are of European ancestry1,2,3. The Eurocentric bias in genomics research threatens to exacerbate health disparities, since discoveries made with European ancestry cohorts may not transfer to diverse ancestry groups4. The NIH All of Us Research Program (All of Us) is a large cohort study of people who live in the US that combines participant genomic, phenotypic, and environmental data, with health-related outcome data gleaned from surveys and electronic health records5,6. All of Us has emphasized the recruitment of participants from population groups that are underrepresented in biomedical research in an effort to close the genomics research gap and to ensure that the benefits of precision medicine are shared equitably among all people7,8.

All of Us demonstration projects are being used to describe and validate the initial genomic data release and the cloud-based Researcher Workbench, where registered users can access and analyze participant data9. The aim of this demonstration project was to characterize the patterns of population structure and genetic ancestry among All of Us participants. Population structure refers to differences in the frequencies of genetic variants (alleles) among different groups or populations within a species, and population structure can be revealed by the presence of clusters of genetically similar individuals10. Genetic ancestry is closely related to the concept of population structure, and it can be defined mechanistically and operationally11,12,13,14. Mechanistically, genetic ancestry has been defined as the subset of genealogical paths through which an individual’s DNA has been inherited from their ancestors15. For any individual, only a subset of their genealogical ancestors contributes DNA to their genome. Operationally, genetic ancestry is typically characterized via genetic similarity between query individuals (e.g., All of Us participants) and individuals from global reference populations, which are taken as surrogates for ancestral populations16,17,18,19.

For this demonstration study of the All of Us cohort, we analyzed participant genomic variant data to (1) assess the extent of population structure in the cohort, (2) characterize the patterns of participant genetic ancestry at continental and subcontinental levels, and (3) explore how participants’ genetic ancestry changes over space and time in the US. Our results reveal substantial population structure and heterogeneous patterns of genetic ancestry among All of Us participants, consistent with the consortium’s efforts to recruit a diverse participant cohort.

Results

Unsupervised: population structure

A cohort of 297,549 All of Us participants, for whom genomic data are available, was created using the All of Us Researcher Workbench (Supplementary Fig. 1). All of Us participant genetic diversity was analyzed using PCA of genomic variant data followed by unsupervised clustering to assess the extent of population structure in the cohort. The clustering tendency of participant genomic PCA data was evaluated using the Hopkins statistic, nearest neighbors, and kernel density estimation. The PCA data yield a Hopkins statistic value of ~1, indicating that the genomic PCA data are highly clustered rather than uniformly or randomly distributed. The number of close neighbors per participant is highly variable across PC space, and kernel density estimation shows a multimodal distribution with distinct peaks separated in PC space (Fig. 1a, b). All three of these metrics reveal highly clustered participant genomic data, with dense groups of genetically similar individuals interspersed among less dense regions, indicative of substantial population structure in the All of Us cohort.
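To make the interpretation of the Hopkins statistic concrete, the following is a minimal sketch in Python, not the Hopkins R package used for the study. It assumes one common formulation, in which nearest-neighbor distances are raised to the power of the number of dimensions and null points are drawn uniformly from the bounding box of the data. The function name, the toy two-dimensional data, and the sample size m are illustrative assumptions.

```python
import math
import random


def hopkins_statistic(points, m, rng):
    """Hopkins statistic: about 0.5 for uniform random data, near 1 for clustered data."""
    d = len(points[0])
    # Bounding box of the data, used to draw uniform "null" points.
    lows = [min(p[k] for p in points) for k in range(d)]
    highs = [max(p[k] for p in points) for k in range(d)]

    # u: distance from each uniform random point to its nearest data point.
    u = []
    for _ in range(m):
        q = [rng.uniform(lows[k], highs[k]) for k in range(d)]
        u.append(min(math.dist(q, p) for p in points))

    # w: distance from each sampled data point to its nearest *other* data point.
    w = []
    for i in rng.sample(range(len(points)), m):
        w.append(min(math.dist(points[i], p) for j, p in enumerate(points) if j != i))

    su = sum(x ** d for x in u)
    sw = sum(x ** d for x in w)
    return su / (su + sw)


rng = random.Random(0)
# Toy data (assumption): two tight, well-separated clusters in a 2-D "PC space".
clustered = [(rng.gauss(0, 0.01), rng.gauss(0, 0.01)) for _ in range(100)]
clustered += [(rng.gauss(1, 0.01), rng.gauss(1, 0.01)) for _ in range(100)]
# Uniformly scattered points for comparison.
uniform = [(rng.random(), rng.random()) for _ in range(200)]

h_clustered = hopkins_statistic(clustered, 20, rng)
h_uniform = hopkins_statistic(uniform, 20, rng)
```

In this toy setting the clustered data give a value close to 1 and the uniform data a value near 0.5, which is the contrast that the reported value of ~1 relies on.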

Fig. 1: Population structure.

Genomic PCA for All of Us participants. Left panels show PC1 versus PC2 comparisons, and right panels show PC1 versus PC3 comparisons, with the percent of variance explained by each PC shown. a Participants color-coded by the number of close neighbors as defined by Euclidean distance < 0.1 in PCs 1–5. b Kernel density estimation with peaks showing high-density clusters of participants in PC space. c High-density clusters of genetically similar participants are shown as groups 1–7.

Density-based clustering of the genomic PCA data yields an optimal number of K = 7 genetic diversity clusters (Fig. 1c). Similar clustering was performed using a Uniform Manifold Approximation and Projection (UMAP) analysis of the genomic PCA data (Supplementary Methods). Density-based clustering of UMAP data reveals almost twice as many clusters (K = 13) as seen for the PCA data, but there is broad concordance between the two methods with high percentages of participant overlap for each PCA cluster within one or two corresponding UMAP clusters (Supplementary Fig. 2). The number of All of Us genetic diversity clusters could change with future participant data releases.

Supervised: genetic ancestry

All of Us participants’ genetic ancestry was inferred using genomic PCA data analyzed with the Rye (Rapid Ancestry Estimation) program20. Participant PCA data were compared with PCA data from global reference populations, taken from the 1KGP and the HGDP, to infer individual ancestry proportions from seven continental-level ancestry groups: African, American, East Asian, South Asian, West Asian, European, and Oceanian (Supplementary Table 1 and Supplementary Fig. 3). All of Us participants are broadly distributed in PC space, whereas global reference samples from different ancestry groups are tightly clustered in PC space (Fig. 2a, b). Rye infers All of Us participant genetic ancestry proportions as linear combinations of reference population ancestries. Overall, the All of Us participant cohort shows 19.51% African, 6.33% American, 2.57% East Asian, 3.05% South Asian, 1.95% West Asian, 66.37% European, and 0.21% Oceanian ancestry. The All of Us participant genetic similarity groups inferred with density-based clustering show group-specific patterns of ancestry proportions, with a continuum of ancestry proportions within and between groups (Fig. 2c). Groups 1, 3, 4, and 7 show the most uniform patterns of ancestry within groups, whereas groups 2, 5, 6, and the remaining participants that did not fall into any density-based cluster show more diverse patterns of ancestry and admixture. All groups show evidence of admixture with multiple ancestry components present in different proportions.
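The idea of expressing a participant's position in PC space as a linear combination of reference population ancestries can be illustrated with a small sketch. This is not Rye's implementation. It assumes that each reference group is summarized by a single centroid in PC space and that non-negative weights are found by projected gradient descent on a least-squares objective, then rescaled to sum to 1. The function name, the centroid coordinates, and the three-group setup are hypothetical.

```python
def ancestry_fractions(query, centroids, iters=20000):
    """Non-negative weights on reference centroids that best reproduce `query`,
    found by projected gradient descent and rescaled to sum to 1."""
    groups = list(centroids)
    n = len(groups)
    n_pcs = len(query)
    # Gram matrix G[a][b] = centroid_a . centroid_b, and c[a] = centroid_a . query.
    G = [[sum(centroids[a][k] * centroids[b][k] for k in range(n_pcs))
          for b in groups] for a in groups]
    c = [sum(centroids[a][k] * query[k] for k in range(n_pcs)) for a in groups]
    # The trace of G bounds its largest eigenvalue, so this step size is safe.
    step = 1.0 / sum(G[i][i] for i in range(n))
    w = [1.0 / n] * n
    for _ in range(iters):
        grad = [sum(G[i][j] * w[j] for j in range(n)) - c[i] for i in range(n)]
        # Gradient step followed by projection onto the non-negative orthant.
        w = [max(0.0, w[i] - step * grad[i]) for i in range(n)]
    total = sum(w)
    return {g: w[i] / total for i, g in enumerate(groups)}


# Hypothetical centroids for three reference groups in the first three PCs.
centroids = {
    "African": (0.9, 0.1, 0.0),
    "European": (-0.5, 0.6, 0.1),
    "East Asian": (-0.4, -0.7, 0.2),
}
# A toy participant placed exactly 70% / 30% between two centroids.
query = [0.7 * a + 0.3 * e
         for a, e in zip(centroids["African"], centroids["European"])]
fractions = ancestry_fractions(query, centroids)
```

Because the toy participant lies on the line between two centroids, the recovered fractions are approximately 0.7 and 0.3 with a near-zero third component. Real participants do not sit exactly within the span of the reference centroids, which is one reason the estimates depend on the reference panel.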

Fig. 2: Continental genetic ancestry.

a Genomic PCA with All of Us participants shown in gray and global reference population samples color-coded as shown in the key. Left panels show PC1 versus PC2 comparisons, and right panels show PC1 versus PC3 comparisons, with the percent of variance explained by each PC shown. b Genetic ancestry proportions for All of Us participants stratified by the genetic similarity groups shown in Fig. 1c. Average ancestry proportions are shown above each group, and numbers of participants are shown below each group. The remaining participants are individuals who did not fall into a dense PCA cluster.

The All of Us Researcher Workbench predicts participant membership among six continental ancestry groups, using a PCA-based machine learning method that is distinct from the continuous ancestry inference approach used here21. We compared the participant continental ancestry percentages inferred here to the Researcher Workbench assigned categorical ancestry groups (Supplementary Fig. 4). Five of the six categorical ancestry groups correspond exactly with the reference population groups we use: African, East Asian, South Asian, Middle Eastern (West Asian here), and European. For these five groups, there is high correspondence between participants’ PCA-based machine learning predicted group membership and averages for the ancestry percentages that we inferred (83.02–97.71% matching ancestry). The Admixed American ancestry category from the Researcher Workbench includes modern, admixed reference samples from Latin America, whereas our American reference population group includes Indigenous American samples only (Supplementary Table 1). The Admixed American group shows 51.01% European ancestry and 35.84% American ancestry, consistent with what is expected for modern Latin American populations22,23.

We also used Rye to infer subcontinental ancestry for All of Us participants with high levels of African (n = 9291), East Asian (n = 2457), South Asian (n = 2484), and European ancestry (n = 24,730; Fig. 3 and Supplementary Table 3). The relationships among the reference populations used for subcontinental ancestry inference with Rye and All of Us participants are shown in Supplementary Figs. 5–7. African subcontinental ancestry is characterized by a predominant West Central African component, followed by West African and Bantu components. East Asian subcontinental ancestry is highly diverse, with predominant Han (Chinese), Japanese, and Southeast Asian components. South Asian subcontinental ancestry is mainly South Indian, followed by North Indian and a small Central Asian component. European subcontinental ancestry is made up primarily of British ancestry, followed by Italian and Iberian components.

Fig. 3: Subcontinental genetic ancestry.

Subcontinental genetic ancestry proportions for All of Us participants from (a) African, (b) East Asian, (c) South Asian, and (d) European continental ancestry groups. Subcontinental groups (regions) for each continental ancestry group are color-coded as shown.

The continental and subcontinental ancestry estimates presented here are dependent on the reference samples used for the analysis, since Rye assigns ancestry percentages for All of Us participants based on relative genetic similarity to a set of reference populations. Accordingly, incomplete sampling of reference populations, coupled with spatial population structure as seen for the All of Us participants, could introduce biases to the ancestry estimates. We performed sensitivity analyses to test for such biases by sequentially adding and removing reference populations and observing how continental or subcontinental ancestry estimates change.

For the continental ancestry sensitivity analysis, we focused on cluster 5, which shows a combination of European, South Asian, and West Asian ancestry components that may not correspond to known historical events (Fig. 2). Adding a Central Asian reference population to the analysis does not noticeably change the ancestry estimates for this group, whereas removing either South or West Asian components does change the results appreciably (Supplementary Fig. 8). This could point to ancestral origins for these participants in the Arabian Peninsula, Iraq, or Iran, geographically in between the reference populations used here. Nevertheless, the challenge of incomplete reference populations only applies to a small percentage of All of Us participants (~3%), the majority of which show ancestral origins from continental regions that are well-covered by the reference populations used here.

For the subcontinental ancestry sensitivity analysis, we focused on the African subcontinental ancestry given the 7.7% average East Bantu ancestry component estimated for these participants (Fig. 3a and Supplementary Table 3). This ancestry component was not observed for African Americans in the US in a recent comprehensive analysis of genetic ancestry in the Americas24. Given that many of the participants selected for this analysis are admixed with European ancestry, the East Bantu component could be accounted for by missing European or related reference populations. However, adding European and North African reference populations does not change the results appreciably (Supplementary Fig. 9). The relatively small East Bantu component (7.7%) most likely corresponds to Bantu populations that are not well represented in the reference populations used here, rather than non-Bantu East African ancestry.

Genetic ancestry by geography and age

All of Us participant continental ancestry percentages were visualized across fifty states and Puerto Rico to evaluate the geographic distribution of ancestry across the US (Fig. 4). African ancestry is concentrated primarily in the southeast part of the country, whereas American ancestry is found primarily in the southwest and California. European ancestry is more uniformly distributed across the country, with the highest concentrations found in the north, along the Canadian border. Relatively high levels of admixture are seen in the northeast, Florida, and Hawaii.

Fig. 4: Genetic ancestry by geography.

Genetic ancestry proportions are shown for All of Us participants sampled from the fifty US states and Puerto Rico. a All participants and ancestry components. b Non-European genetic ancestry proportions for all individuals with <90% European ancestry. The results for states shaded in gray are suppressed owing to <20 participants with <90% European ancestry.

The relationship between All of Us participants’ age and genetic ancestry was assessed using genetic admixture entropy, where higher values indicate a more diverse combination of ancestry components within individual genomes and lower values indicate more homogeneous ancestry (Fig. 5). Genetic admixture entropy is negatively correlated with participant age, indicating that younger participants have more diverse ancestry combinations than older participants.

Fig. 5: Genetic admixture by age.

Genetic admixture entropy (y-axis) against participant age (x-axis). Ages shown in single year bins, where each bin had at least 1000 participants (24–89 years), with average and 95% CI values shown. Linear regression trend line (black) shown with 95% CI shaded (gray). The linear regression adjusted R² and its P value are shown for n = 66 bins.
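The binning and trend-line procedure described in the caption can be sketched as follows. This is an illustrative Python outline rather than the analysis code used for the study. It averages a per-participant value within single-year age bins, drops bins below a minimum size, and fits an ordinary least-squares line to the bin means. It reports the plain R² rather than the adjusted R², and the function name and toy values are assumptions.

```python
from collections import defaultdict


def binned_trend(ages, values, min_per_bin):
    """Average `values` within single-year age bins, then fit mean ~ age by least squares."""
    bins = defaultdict(list)
    for age, v in zip(ages, values):
        bins[age].append(v)
    # Keep only bins with enough participants.
    xs = sorted(a for a, vs in bins.items() if len(vs) >= min_per_bin)
    ys = [sum(bins[a]) / len(bins[a]) for a in xs]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    syy = sum((y - my) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = my - slope * mx
    r2 = (sxy * sxy) / (sxx * syy)  # plain (unadjusted) R squared
    return xs, slope, intercept, r2


# Toy data (assumption): entropy falls by 0.01 per year of age;
# the single participant aged 90 falls below the bin-size cutoff.
ages = [24, 24, 25, 25, 26, 26, 90]
entropy = [0.76, 0.76, 0.75, 0.75, 0.74, 0.74, 0.0]
kept_ages, slope, intercept, r2 = binned_trend(ages, entropy, min_per_bin=2)
```

In the study the minimum bin size was 1000 participants; the toy cutoff of 2 is used here only to keep the example small.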

Discussion

Our analysis demonstrates the genomic and ancestral diversity of the All of Us cohort, consistent with the project’s goals to recruit participants from population groups that are underrepresented in biomedical research in support of health equity. Indeed, All of Us is one of the most diverse population biomedical datasets in the world, and this represents an important step towards making precision medicine more widely available and more applicable to diverse communities in the US7,8,25. The promise of population biomedical datasets like All of Us rests on the integration of genetic, social, environmental, and health outcome data for many thousands of diverse participants. Given that genetic ancestry is derived from the genome, it should be possible to use genetic ancestry inference, together with population biomedical datasets, to help elucidate genetic and socioenvironmental contributions to health outcomes and disparities.

One challenge is that current methods for genetic ancestry inference, while accurate, are slow and do not scale to biobank-sized datasets like All of Us. We developed the Rye algorithm as a fast and computationally efficient genetic ancestry inference method that can scale to biobank-sized genomic data sets20. Application of Rye to genome-wide genetic data for 297,549 All of Us participants underscores its utility for this purpose. Using Rye, we found the All of Us cohort to be ancestrally diverse with distinct patterns of genetic ancestry and admixture among genetic similarity groups and geographic regions (Figs. 2–4). The geographic patterns of genetic ancestry seen for the All of Us cohort are consistent with previous studies but could also reflect differences in participant recruitment across the country26,27,28.

Supervised genetic ancestry inference, using a program like Rye or comparable methods, relies on genetic similarity between query individuals (e.g., All of Us participants) and global reference population samples16,17. Accordingly, ancestry results are very much dependent on the choice of reference samples and may be biased by incomplete sampling of reference populations in the face of spatial genetic structure. If participants trace genetic ancestry to populations or geographic regions that are not well-represented by the reference populations, then they may appear to be admixed with ancestry components from nearby populations. We demonstrate this possibility using sensitivity analyses for both continental and subcontinental ancestry inferences, with results suggesting that minor Asian and African ancestry components seen for All of Us participants may be mis-assigned owing to incomplete reference samples. Thus, the ancestry estimates reported here are best interpreted as the relative genetic similarity between All of Us participants and the reference populations used for the study, and as such, they are likely to change if different reference samples are used for the analysis.

The extent to which human genetic diversity is characterized by clusters of closely related individuals, i.e., population structure, versus clines of continuous genetic variation has long been a subject of interest29,30,31,32,33. The All of Us cohort allows for an assessment of the extent of population structure in the US, given the large size of the cohort, the extensive sampling of participants across the country, and the demographic diversity of the participants. The application of several different cluster analysis methods to participants’ genomic PCA data revealed evidence for substantial population structure in the cohort, with dense clusters of relatively closely related participants interspersed among less dense regions in PC space (Fig. 1). The population structure and genetic clusters that can be gleaned from clustering analysis of genomic PCA data are not readily apparent from visual inspection of these same data, owing to the large size of the cohort and over-plotting of participants in dense regions of PC space (Fig. 2a).

Finally, we show that genetic diversity in the US is increasing over time. Younger All of Us participants are far more ancestrally diverse than older participants, and this trend is evident across the entire age range of the cohort. This finding suggests that genetic ancestry categories and group designations will become increasingly obsolete over time34.

Methods

All of Us participant cohort, consent, and IRB review

This study was performed as an All of Us genomic data demonstration project5. All of Us demonstration projects are intended to describe and validate data and analysis tools for the participant cohort. Details on the initial All of Us data release and Researcher Workbench used for this study were previously published6. The genomic data demonstration project and experimental protocols were approved by the All of Us Institutional Review Board (#2016–05-TN-Master), and informed consent was obtained from all participants. All of Us inclusion criteria include adults 18 and older, with the legal authority and decisional capacity to consent, and currently residing in the US or a territory of the US. All of Us exclusion criteria exclude minors under the age of 18 and vulnerable populations (prisoners and individuals without the capacity to give consent). Details on participant recruitment, informed consent, inclusion, and exclusion criteria are available online at https://allofus.nih.gov/sites/default/files/All of Us_operational_protocol_v1.7_mar_2018.pdf. Results reported here comply with the All of Us Data and Statistics Dissemination Policy, disallowing disclosure of group counts under 20.

The All of Us Researcher Workbench was used to build the participant cohort for this study (Supplementary Fig. 1). The cohort was built from the All of Us Controlled Tier dataset v7 (curated version C2022Q4R9), which includes participants enrolled from 2018 to 2022, with a data cutoff date of 7 January 2022. Participants who self-identified as American Indian or Alaska Native were not included in the analysis.

Unsupervised genetic clustering analysis

Participant genomic data were accessed from the Controlled Tier dataset. Genome-wide genotypes for All of Us participants were characterized using the Illumina Global Diversity Array with variants called for 1,824,517 genomic positions on the GRCh38/hg38 reference genome build. All of Us participant variants were merged and harmonized with whole genome sequence variant data from 3433 global reference samples characterized as part of the 1000 Genomes Project (1KGP; phase 3) and the Human Genome Diversity Project (HGDP; Supplementary Table 1)35,36. Variant merging, harmonization, and LD pruning were performed using PLINK version 1.9 (ref. 37) and custom scripts38,39,40. Biallelic variants common to the All of Us and reference data sets were merged, with strand flips and variant identifier inconsistencies harmonized as needed. Variants with >1% missingness or <1% minor allele frequency were removed from the merged and harmonized dataset. Linkage disequilibrium (LD) pruning was done using PLINK with window size = 50, step size = 10, and pairwise threshold r² < 0.1, yielding a final All of Us and global reference sample dataset of 187,795 variants. The final dataset of All of Us participant genomic variants was used for unsupervised clustering analysis. PCA was run on the variant dataset using the FastPCA program implemented in PLINK version 2.0. The clustering tendency of the resulting genomic PCA data was analyzed using the Hopkins statistic with the Hopkins R package41 and nearest neighbor search with the FNN R package version 1.1.4 (ref. 42). Kernel density estimation was performed with the MASS R package using PCs 1–3, and contour lines were extracted from the estimated density distribution43. Density-based clustering was performed using the HDBSCAN algorithm44. HDBSCAN was run on the first 5 PCs of the PCA data with parameters min_samples = 2000 and min_cluster_size = 2500. Cluster boundaries were visualized using the ggforce R package.
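The variant filtering thresholds described above can be expressed as a short sketch. This is an illustration in plain Python, not the PLINK workflow used for the study. It assumes genotypes are coded as alternate-allele counts (0, 1, or 2) with None for missing calls, and the function name and toy variants are hypothetical.

```python
def passes_qc(genotypes, max_missing=0.01, min_maf=0.01):
    """Return True if a variant passes the missingness and minor allele frequency filters."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return False
    # Fraction of samples with no genotype call for this variant.
    missing = 1 - len(called) / len(genotypes)
    # Alternate allele frequency among called diploid genotypes.
    alt_freq = sum(called) / (2 * len(called))
    maf = min(alt_freq, 1 - alt_freq)
    return missing <= max_missing and maf >= min_maf


# Toy variants across 200 samples (assumption).
common = [0] * 100 + [1] * 80 + [2] * 20   # alternate allele frequency 0.30
rare = [0] * 199 + [1]                     # minor allele frequency 0.0025
sparse = [1] * 190 + [None] * 10           # 5% missing calls

kept = [name for name, g in [("common", common), ("rare", rare), ("sparse", sparse)]
        if passes_qc(g)]
```

Only the common, well-genotyped variant is retained. LD pruning is a separate step that operates on pairs of variants within a sliding window and is not shown here.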

Supervised genetic ancestry inference

Genomic variants from All of Us participants and a set of global reference populations were merged and harmonized as described in the previous section to perform continental and subcontinental genetic ancestry inference. Kinship analysis was performed with the KING program to eliminate related (or duplicated) reference samples from the global reference populations45. Continental genetic ancestry inference was performed using a subset of 1572 global reference samples from the 1KGP and the HGDP, which were selected as non-admixed representatives of seven ancestry groups: African, American, East Asian, South Asian, West Asian, European, and Oceanian (Supplementary Table 1). K-nearest neighbor clustering of genomic PCA data was used to identify All of Us participants that cluster together with African, East Asian, South Asian, and European reference populations, and these participants were used for subcontinental ancestry inference46. West Asian and Oceanian reference populations were not used for this purpose owing to the relatively low number of participants that clustered with these groups. Asian and European reference populations for subcontinental ancestry inference were taken from the 1KGP and HGDP (Supplementary Table 2). 1KGP and HGDP reference populations were used together with additional reference populations to provide broader geographic coverage for African subcontinental ancestry inference (Supplementary Table 2). African reference samples were taken from a study of Bantu-speaking populations in Africa that included samples from 53 populations from east, central, south, and west Africa47. The merged and harmonized African subcontinental ancestry inference panel included 1659 reference samples and 228,033 variants.
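One plausible reading of the K-nearest neighbor step is a majority vote among the reference samples closest to each participant in PC space. The sketch below illustrates that reading only; the source does not specify the value of K, the number of PCs, or the implementation used. The function name and the toy coordinates are assumptions.

```python
import math
from collections import Counter


def knn_group(query, reference, k):
    """Majority label among the k reference samples closest to `query` in PC space."""
    nearest = sorted(reference, key=lambda item: math.dist(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical reference samples: (PC coordinates, continental group label).
reference = [
    ((0.90, 0.10), "African"),
    ((1.00, 0.00), "African"),
    ((0.95, 0.05), "African"),
    ((-0.50, 0.60), "European"),
    ((-0.45, 0.55), "European"),
    ((-0.55, 0.65), "European"),
]

group_a = knn_group((0.85, 0.10), reference, k=3)
group_b = knn_group((-0.40, 0.50), reference, k=3)
```

Participants assigned to a continental group in this way would then be carried forward to subcontinental inference against that group's reference populations.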

Continental and subcontinental ancestry inference was performed via analysis of merged All of Us participant and global reference population genomic variant sets with the program Rye (Rapid Ancestry Estimation)20. Rye performs rapid and accurate genetic ancestry inference based on principal component analysis (PCA) of genomic variant data. PCA was run on the merged variant datasets using the FastPCA program implemented in PLINK version 2.0, and Rye was then run on the first 25 PCs, using the defined reference ancestry groups to assign ancestry group fractions to individual All of Us participant samples. The continuous ancestry fractions that we report here were calculated independently of the categorical ancestry predictions currently provided by the All of Us Researcher Workbench21.

All of Us participant continental ancestry fractions were visualized as admixture-style plots at the state (or territory) level using the geofacet R package48,49. Admixture entropy (\(AE\)) was used to quantify the amount of genetic admixture for All of Us participants as previously described in refs. 40,50: \(AE_{i} = -\sum_{j=1}^{7} p_{j} \log(p_{j})\), where \(p_{j}\) is the fraction of ancestry group \(j\) for individual \(i\).
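The admixture entropy formula can be computed directly, as in the following minimal sketch. It assumes the natural logarithm, since the base is not stated here, and it treats ancestry components with zero fraction as contributing zero. The function name and the example fraction vectors are illustrative.

```python
import math


def admixture_entropy(fractions):
    """AE = -sum_j p_j * log(p_j), with 0 * log(0) treated as 0."""
    return -sum(p * math.log(p) for p in fractions if p > 0)


# Hypothetical individuals with seven continental ancestry fractions each.
single_ancestry = [1.0, 0, 0, 0, 0, 0, 0]   # one component only
two_way = [0.5, 0.5, 0, 0, 0, 0, 0]         # equal two-way admixture
seven_way = [1 / 7] * 7                     # equal contribution from all seven groups

ae_single = admixture_entropy(single_ancestry)
ae_two = admixture_entropy(two_way)
ae_seven = admixture_entropy(seven_way)
```

Under these assumptions the entropy is 0 for a single ancestry component, log(2) for an equal two-way mixture, and reaches its maximum of log(7) when all seven components contribute equally.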

Note on genetic ancestry inference

As discussed in the introduction, genetic ancestry can be defined mechanistically and operationally. We use an operational definition of genetic ancestry for All of Us participants in this study, as measured by their levels of genetic similarity with global reference population samples16,17. Accordingly, the phrase “African ancestry” is used here as shorthand for similarity to African reference population samples, “European ancestry” is used for similarity to European reference population samples, and so on. “American ancestry” refers to genetic similarity to Indigenous American reference population samples. The relative levels of similarity to different reference population groups allow us to infer percent ancestry components for All of Us participants20. The genetic ancestry results reported here are contingent upon the choice of reference populations, how these reference populations are delineated, and the method used to infer genetic similarity between All of Us participants and the reference population samples. Although reference populations are taken as surrogates for ancestral populations, it should be stressed that human populations are an idealized concept, and discrete ancestral populations did not exist, just as modern populations are not discrete. Rather, population boundaries past and present are fuzzy, and genetic ancestry does not map neatly onto clusters defined by PCA or labeled reference populations. Finally, it should be noted that reference population labels themselves, such as Bantu or Han, can convey ethno-linguistic in addition to geographic information, underscoring the fact that reference populations are often culturally delineated.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.