Main

Ancestry (or genetic background) information is imperative for proper genetic association study design and for control of population stratification.1–3 If case and control samples are drawn from dissimilar ancestral populations, significant associations may actually represent underlying genetic differentiation among samples and not associations with the phenotype under study.4–7 Hundreds to thousands of markers across the genome can be used to estimate genetic ancestry and to adjust for population stratification in downstream tests of association. In today's climate of high-throughput, cost-effective genotyping, these markers commonly come from genome-wide association studies (GWAS) or standard panels of ancestry informative markers (AIMs).8,9 Despite the wide availability of these data and genotyping assays, many large genetic association studies do not yet have either GWAS or AIMs data available for all or most DNA samples in the dataset.10–12 These studies therefore still rely on self-reported race/ethnicity as a proxy for genetic ancestry, because most data support the assumption that self-reported race/ethnicity approximates genetic ancestry, particularly for samples with substantial genetic differentiation.13–15

Although self-reported race/ethnicity is common in genetic association studies, many clinic and hospital-based studies use observer or interviewer reported ancestry rather than self-report. The concordance between race/ethnicity recorded in the medical charts and clinical databases and self-reported race/ethnicity has been examined in recent years.16–18 In one small exploratory study, 22–33% of respondents with diverse racial/ethnic backgrounds viewed themselves differently than how they were categorized in a community health center database.17 In contrast, another report of Veterans Affairs health care users found that observer-reported race/ethnicity agreed with self-reported race/ethnicity for most users (95%).18 Thus, the concordance between observer-reported and self-described race/ethnicity varies, and this variation is most likely dependent on the methods of interview, site of study, and the racial/ethnic composition of the population under study.

The Vanderbilt DNA Databank (BioVU) is a biorepository of >80,000 DNA samples linked to electronic medical records (EMRs) in Nashville, TN. BioVU uses discarded blood samples collected during routine patient care.19 DNA is extracted and linked to de-identified data obtained and routinely updated from the EMRs and other administrative databases. This approach has the advantage of scale, enabling genotype-phenotype associations across a variety of clinical outcomes represented in the patient population.10,19 Although BioVU is a valuable source of DNA samples and phenotypes for genetic association studies, it is unclear whether race/ethnicity, which is administratively assigned in BioVU, can be used as a proxy for ancestry in future genetic association studies in the absence of high density genotype data across the genome (such as GWAS data) or incurring the cost of additional genotyping of AIMs.

BioVU provides observer-reported race/ethnicity, but self-identified race/ethnicity is not available for direct comparison; thus, an alternative approach was necessary to explore possible differences between the two reporting methods. To assess the use of observer-reported race/ethnicity as a proxy for ancestry in BioVU-based genetic association studies, we genotyped the 360 markers on the Illumina DNA Test Panel, which includes AIMs, in a subset of BioVU samples (observer-reported race/ethnicity data) and in a sample ascertained by the Multiple Sclerosis Genetics Group (MSGG) (self-reported race/ethnicity data) to infer genetic ancestry. The percent concordance of reported and inferred genetic ancestry was calculated in each group separately. We then tested for differences between the concordance of observer-reported race/ethnicity with inferred genetic ancestry and the concordance of self-reported race/ethnicity with inferred genetic ancestry. Results of these comparisons demonstrate that observer-reported race/ethnicity in BioVU approximates inferred genetic ancestry as well as self-reported race/ethnicity, suggesting that observer-reported race/ethnicity recorded in BioVU is an acceptable proxy for genetic ancestry for most DNA samples studied.

MATERIALS AND METHODS

Study populations

The BioVU resource and its ethical protections have been described in full elsewhere.19 A subset of BioVU samples was used in these analyses (n = 1910). The MSGG was founded in 1989 to study the role heredity plays in multiple sclerosis; consent and ascertainment are detailed elsewhere.20 A random subset of unrelated controls from the MSGG was used for these analyses (n = 384).

Genotyping

Both BioVU and MSGG samples were genotyped using the Illumina DNA Test Panel, which contains 360 validated single nucleotide polymorphisms (SNPs) distributed across the genome (Supplemental Digital Content 1, Table, http://links.lww.com/GIM/A120). All genotyping was performed on the Illumina BeadXpress.21 For the majority of these SNPs, allele frequencies differ greatly between the major HapMap populations and thus can be used as AIMs.

Before analysis, SNPs were filtered to exclude those with low minor allele frequency (<1%), deviations from Hardy-Weinberg expectations (P < 10⁻⁴), or low genotyping efficiency (<95%). After filtering, 294 SNPs were analyzed in the samples with self-reported ancestry (MSGG) and 341 SNPs in the samples with observer-reported ancestry (BioVU).
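The three exclusion criteria above can be sketched as a per-SNP check. This is a minimal illustration, not the pipeline used in the study: the function name `snp_qc` and its arguments are hypothetical, and the Hardy-Weinberg test shown is a 1-degree-of-freedom chi-square test, which is an assumption (the text does not state which test was applied).

```python
import math


def snp_qc(n_aa, n_ab, n_bb, n_missing,
           min_maf=0.01, min_call_rate=0.95, hwe_alpha=1e-4):
    """Apply MAF, call-rate, and HWE filters to one biallelic SNP.

    Inputs are genotype counts (AA, AB, BB) plus the number of failed calls.
    Returns (passes, stats). Thresholds default to those quoted in the text.
    """
    n_called = n_aa + n_ab + n_bb
    n_total = n_called + n_missing
    call_rate = n_called / n_total if n_total else 0.0
    if n_called == 0:
        return False, {"maf": 0.0, "call_rate": call_rate, "hwe_p": float("nan")}

    # Allele frequency of A; the minor allele frequency is the smaller of p and q.
    p = (2 * n_aa + n_ab) / (2 * n_called)
    q = 1.0 - p
    maf = min(p, q)

    # Chi-square goodness of fit against Hardy-Weinberg expected counts
    # (assumed test; an exact test is a common alternative).
    expected = (p * p * n_called, 2 * p * q * n_called, q * q * n_called)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with 1 degree of freedom.
    hwe_p = math.erfc(math.sqrt(chi2 / 2.0))

    passes = (maf >= min_maf
              and call_rate >= min_call_rate
              and hwe_p >= hwe_alpha)
    return passes, {"maf": maf, "call_rate": call_rate, "hwe_p": hwe_p}


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(snp_qc(250, 500, 250, 0))   # in equilibrium, fully called
    print(snp_qc(995, 5, 0, 0))       # minor allele too rare
```

A SNP is kept only if it clears all three thresholds, so a marker with good call rate but an extreme heterozygote deficit is still removed.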

Statistical methods

For this analysis, we focused on two racial/ethnic groups, European Americans (EA) and African Americans (AA), because these two groups represent the majority of BioVU samples (78.7% and 10.5%, respectively). For each study population (BioVU and MSGG) and each racial/ethnic group (EA and AA), the proportion of samples whose genetic ancestry (inferred by Structure 2.27) matched their reported ancestry was calculated. A two-sample test of proportion was used to test for differences between the concordance of observer-reported race/ethnicity with inferred genetic ancestry and the concordance of self-reported race/ethnicity with inferred genetic ancestry. Statistical significance was defined as P < 0.05.
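The concordance calculation and the comparison between reporting methods can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, the test shown is the pooled two-sample z-test of proportions with a two-sided P value (the text does not specify the variant used), and the ancestry proportions and counts in the usage section are invented for demonstration, not study data.

```python
import math


def concordance(ancestry_props, threshold):
    """Count samples whose inferred proportion of the reported ancestry
    exceeds the classification threshold. Returns (matches, total)."""
    matches = sum(1 for prop in ancestry_props if prop > threshold)
    return matches, len(ancestry_props)


def two_proportion_z_test(x1, n1, x2, n2):
    """Pooled two-sample z-test for equality of two proportions.

    x1/n1 and x2/n2 are concordant counts over group sizes.
    Returns (z, two_sided_p).
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    if se == 0.0:
        # Both groups are entirely concordant or entirely discordant.
        return 0.0, 1.0
    z = (p1 - p2) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


if __name__ == "__main__":
    # Hypothetical inferred European ancestry proportions for two small groups.
    observer_reported = [0.97, 0.93, 0.88, 0.71, 0.55, 0.99]
    self_reported = [0.96, 0.91, 0.79, 0.94, 0.98]
    for threshold in (0.50, 0.60, 0.75, 0.90):
        x1, n1 = concordance(observer_reported, threshold)
        x2, n2 = concordance(self_reported, threshold)
        z, p = two_proportion_z_test(x1, n1, x2, n2)
        print(f">{threshold:.0%}: {x1}/{n1} vs {x2}/{n2}, z={z:.2f}, P={p:.3f}")
```

Repeating the test at each threshold mirrors the design described here, in which concordance is compared at progressively stricter ancestry cutoffs.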

RESULTS

Observer-reported EA represent the majority of BioVU DNA samples in this study (78.7%). Of these 1503 samples, 1481 (98.5%) were inferred as having predominantly (>60%) European ancestry, including 1439 (95.7%) participants with >90% European ancestry (Table 1). Of the self-reported EA, all samples were inferred to have at least 75% European ancestry. This concordance rate was not significantly different from that calculated for observer-reported EA when lower ancestry proportion thresholds were used (P = 0.10, 50% threshold; P = 0.09, 60% threshold). When the threshold for classification was increased to >75%, the difference in concordance with inferred genetic ancestry between self- and observer-report became statistically significant (P = 0.04). At the strictest threshold of 90%, the percent concordance of observer-reported race was significantly higher than that of self-report (95.7 vs. 88.0%, respectively; P < 0.001).

Table 1 Comparison of the concordance rate of self-reported race/ethnicity and observer-reported race/ethnicity in EA and AA

The second most prominent racial/ethnic group in BioVU is AA (10.5%). Observer-reported race/ethnicity was able to distinguish samples with inferred African genetic ancestry from those of non-African genetic ancestry; however, the resulting inferred African ancestry samples were not as homogeneous as the inferred European ancestry samples. That is, of the 201 observer-reported AA, 187 (93.0%) were of predominantly African genetic ancestry, but only 44 (21.9%) were inferred to have >90% African ancestry (Table 1). This distribution is not unexpected given the amount of admixture inherent in AA populations. Moreover, the concordance of observer-reported race/ethnicity with inferred genetic ancestry was not significantly different from the concordance of self-reported race/ethnicity in AA, regardless of the stringency of the threshold (P > 0.34 at all four thresholds).
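As a quick arithmetic check, the percentages reported for the observer-reported groups can be recomputed from the counts given in the text. The sketch below uses only those counts and the BioVU subset size of 1910; the helper name `percent` is hypothetical, and the self-reported (MSGG) group sizes are not stated in the text, so they are not included.

```python
def percent(count, total):
    """Percentage rounded to one decimal place, as reported in the text."""
    return round(100 * count / total, 1)


BIOVU_SUBSET = 1910  # BioVU samples analyzed

# Group sizes implied by the reported shares of the BioVU subset.
n_ea = round(0.787 * BIOVU_SUBSET)  # observer-reported EA
n_aa = round(0.105 * BIOVU_SUBSET)  # observer-reported AA

if __name__ == "__main__":
    print("EA samples:", n_ea)
    print("AA samples:", n_aa)
    print("EA predominantly European:", percent(1481, 1503))
    print("EA >90% European:", percent(1439, 1503))
    print("AA predominantly African:", percent(187, 201))
    print("AA >90% African:", percent(44, 201))
```

The gap between the two AA figures illustrates the point made above: most observer-reported AA samples are predominantly of African ancestry, yet few exceed the 90% cutoff.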

DISCUSSION

Our data indicate that observer-reported race/ethnicity in BioVU can be used as a proxy for genetic ancestry. We found a high concordance between observer-reported race/ethnicity and genetic ancestry, especially in EA. Furthermore, we determined that the percent concordance of observer-reported race/ethnicity with genetically defined ancestry is similar to that of self-reported race/ethnicity, which is widely used and accepted as a proxy for ancestry in genetic epidemiology.

We acknowledge, however, that this proxy is imperfect and that this imperfection may, in part, reflect variability in observer reports, a variable we are not able to quantify easily in BioVU. Also, we must remain cautious in our interpretation when considering observer-reported AA, a historically admixed population. Although the majority of observer-reported EA fell into one genetic cluster, observer-reported AA had a broader distribution of percent genetically inferred ancestry. This distribution has been observed in many studies of admixture in AA.14,22–25 Therefore, although reported race/ethnicity is able to categorize individuals as having a majority of European or African genetic ancestry, it cannot estimate or account for the admixture inherent in populations of African descent in the United States.

Our study focused on EA and AA because they are the predominant racial/ethnic groups in BioVU. However, it would be beneficial to expand our analysis to a larger representation of individuals with other observer-reported races/ethnicities (e.g., Hispanics, Asians, Native Americans, and others). To date, >1300 (∼2%) of BioVU samples fall into this category. However, use of HapMap samples as pseudoancestrals to infer genetic ancestry may prove problematic for these groups given that none of the major HapMap populations is a perfect proxy for Native Americans, Hispanics, or Asians (not of eastern Asian descent). The recent expansion of populations available in HapMap may alleviate this problem, but further studies are needed to assess the utility of these additional HapMap populations compared with an outbred, diverse population that is characteristic of the United States. Also, this study did not address the small percentage of records for which observer-reported race/ethnicity is absent. Our previous studies suggest that the proportion of records with missing race/ethnicity is small, varying between 3 and 9%.10,26 Use of automated methods to extract references to race/ethnicity from clinical notes may prove beneficial to fill these gaps in data.

Determining the feasibility of using observer-reported race/ethnicity as a proxy for genetic ancestry is crucial for all future genetic association studies using biorepositories linked to EMRs such as BioVU. In support of our conclusions, recent tests of association using observer-reported European-descent cases and controls in BioVU suggest that this proxy for genetic ancestry is sufficient to replicate well-known GWAS and candidate gene associations.10 Further studies, however, are needed to determine if the observations reported here are true for other biorepositories linked to EMRs given differences among administrative and demographic data collections in clinical settings across the United States. Nevertheless, we demonstrate that observer-reported race/ethnicity for EA and AA approximates genetic ancestry as well as self-reported race/ethnicity, suggesting biorepositories based on EMRs such as BioVU may be a viable source of DNA samples for future large-scale genetic association studies.