Abstract
Modern association studies often involve a large number of markers and therefore face the problem of testing multiple hypotheses. Traditional procedures are usually over-conservative and have low power to detect mild genetic effects. From the design perspective, we propose a two-stage selection procedure to address this concern. Our main principle is to reduce the total number of tests by removing clearly unassociated markers in the first-stage test. Next, conditional on the findings of the first stage, which uses a less stringent nominal level, a more conservative test is conducted in the second stage using the augmented data together with the data from the first stage. Previous studies have suggested using independent samples to avoid inflated errors. However, we found that, after accounting for the dependence between these two samples, the true discovery rate increases substantially. In addition, the cost of genotyping can be greatly reduced via this approach. Results from a study of hypertriglyceridemia and from simulations suggest that the two-stage method has a higher overall true positive rate (TPR) with a controlled overall false positive rate (FPR) when compared with single-stage approaches. We also report the analytical form of its overall FPR, which may be useful in guiding study design to achieve a high TPR while retaining the desired FPR.
Appendices
Appendix 1
TPR is defined as
Similarly, by the same argument for conditional probability, the overall FPR is
Define T(X) as the test statistic used in the first stage, where X denotes the data from the MN1 genotypings, and T(X*) as the statistic for the second stage, where X* contains the information from the R(N1+N2) genotypings. The test statistic can be the difference in allele frequency between the case and control groups, or a chi-square statistic derived from contingency tables. The two statistics are correlated because the first-stage data enter both X and X*. Without loss of generality, we assume here that both statistics are asymptotically normal, provided N1 is large, and that the allele frequencies obtained from N1 and N1+N2 are roughly equal. Let ρ denote the correlation between T(X) and T(X*). The overall FPR at α1 and α2 is then:
where \(\Sigma = \left(\begin{array}{cc} 1 & \rho\\ \rho & 1\\ \end{array}\right)\) and \(\Sigma^{-1/2} = \left(\begin{array}{cc} s & t\\ t & s\\ \end{array}\right)\), whose elements depend on ρ. Applying their asymptotic normality with zero mean vector and covariance matrix Σ, the above becomes
where \(a=z_{1-\alpha_{1}/2}\) and \(b=z_{1-\alpha_{2}/2}=z_{1-\alpha_{1}/(2R)}\) are quantiles of the standard normal distribution, and Z1 and Z2 are independent standard normal random variables. If ρ is large, this FPR is bounded closely from above by \( \Pr (Z_{2}\ge t \cdot a + s \cdot b)\). This bound is determined by α1, α2, R, and ρ, and is smaller than 0.001 when R ≥ 10; a quick guess for R would be Mα1. If ρ=0, the bound becomes Pr(Z2 ≥ b), which is simply α2.
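To get a feel for how ρ drives the overall FPR, the joint rejection probability can be evaluated numerically. The sketch below is not the authors' derivation: it assumes two-sided tests, treats (T(X), T(X*)) as exactly standard bivariate normal under the null with correlation ρ, and integrates the conditional tail probability of the second statistic with a simple trapezoidal rule. The function name `two_stage_fpr` and its parameters are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD = NormalDist()  # standard normal, mean 0 and sd 1


def two_stage_fpr(alpha1, alpha2, rho, grid=20000, upper=10.0):
    """Approximate Pr(|T1| >= a, |T2| >= b) under the null.

    Assumption: (T1, T2) is standard bivariate normal with correlation rho,
    a = z_{1-alpha1/2} and b = z_{1-alpha2/2} (two-sided tests).
    """
    if not 0.0 <= rho < 1.0:
        raise ValueError("rho must lie in [0, 1)")
    a = STD.inv_cdf(1 - alpha1 / 2)
    b = STD.inv_cdf(1 - alpha2 / 2)
    cond_sd = sqrt(1 - rho * rho)

    def integrand(x):
        # Given T1 = x, T2 is normal with mean rho*x and sd sqrt(1 - rho^2).
        upper_tail = 1 - STD.cdf((b - rho * x) / cond_sd)
        lower_tail = STD.cdf((-b - rho * x) / cond_sd)
        return STD.pdf(x) * (upper_tail + lower_tail)

    # Trapezoidal rule on [a, upper]; the region T1 <= -a contributes the
    # same amount by symmetry, hence the factor 2 in the return value.
    h = (upper - a) / grid
    total = 0.5 * (integrand(a) + integrand(upper))
    for i in range(1, grid):
        total += integrand(a + i * h)
    return 2 * h * total
```

Under these assumptions the value equals α1·α2 when ρ = 0 and approaches α2 as ρ approaches 1, so calling `two_stage_fpr(0.05, 0.005, rho)` for several values of `rho` shows how much of the second-stage level is actually spent.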
To investigate the effect of sample size, note that T(X) is close to \(\sqrt{N_{1}/(N_{1}+N_{2})}\times T({\mathbf{X}}^{*})\) if the average allele frequencies in X and X* are similar, therefore,
where Φ is the cumulative distribution function of the standard normal distribution. If the proportion N1/(N1+N2) is larger than \((z_{1-{\alpha_{1}}/2}/z_{1-{\alpha_{2}}/2})^{2}\), the overall FPR is approximately α2. Otherwise, it becomes \(2 \times \left[ 1-\Phi \left( \sqrt{N/N_{1}}\times z_{1-\alpha_{1}/2} \right) \right]\), where N = N1+N2. Similarly, the TPR at fixed α1 and α2 is approximately (1−β2).
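The two regimes described above can be written as a small helper, which may be convenient when exploring how the split between N1 and N2 affects the overall FPR. This is a minimal sketch under the approximation stated in this appendix; it assumes that N in the second expression means N1+N2, and the function name `approx_overall_fpr` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD = NormalDist()  # standard normal, mean 0 and sd 1


def approx_overall_fpr(alpha1, alpha2, n1, n2):
    """Piecewise approximation of the overall two-stage FPR.

    Assumption: T(X) is close to sqrt(N1/(N1+N2)) * T(X*), so both tests
    reject exactly when |T(X*)| exceeds the larger of the two rescaled
    critical values.
    """
    a = STD.inv_cdf(1 - alpha1 / 2)  # first-stage critical value
    b = STD.inv_cdf(1 - alpha2 / 2)  # second-stage critical value
    fraction = n1 / (n1 + n2)
    if fraction >= (a / b) ** 2:
        # The second-stage threshold is the binding one.
        return alpha2
    # The rescaled first-stage threshold is the binding one.
    return 2 * (1 - STD.cdf(sqrt(1 / fraction) * a))
```

With α1 = 0.05 and α2 = 0.005, for example, the switch between the two regimes occurs when N1 is a little under half of the total sample; below that proportion the approximate FPR falls under α2.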
To derive the expected numbers of successes and errors, it suffices to calculate the path probability Q2 in the second stage. With α2=α1/R, its upper bound is α2/α1, which is approximately 1/[E(R0)+E(Ra)]. Similarly, the conditional probability U2, at fixed α1 and α2, is approximately (1−β2)/(1−β1).
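One way to see how these quantities fit together is sketched below. It rests on an assumption about notation: Q2 and U2 are read as the probabilities of rejecting in the second stage given rejection in the first stage, under the null and the alternative hypothesis respectively. Under that reading the overall rates factor into a first-stage term and a conditional second-stage term:

\[
\mathrm{FPR} = \alpha_{1}\, Q_{2} \le \alpha_{2}
\quad\Longrightarrow\quad
Q_{2} \le \frac{\alpha_{2}}{\alpha_{1}} = \frac{1}{R},
\qquad
\mathrm{TPR} = (1-\beta_{1})\, U_{2} \approx 1-\beta_{2}
\quad\Longrightarrow\quad
U_{2} \approx \frac{1-\beta_{2}}{1-\beta_{1}} .
\]

The inequality uses only the fact that the probability of rejecting in both stages cannot exceed the marginal probability α2 of rejecting in the second stage.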
Appendix 2
Let pc and pn represent the allele frequencies for the case and control groups, respectively, and δ their absolute difference. Let \(\bar{p}\) be their average, weighted by corresponding sample sizes Nc and Nn. Following the assumption that the allele frequency ranges from 10% to 90%, it can be derived that
Therefore, the relationship between \(\bar{p}\) and δ can be formulated. In the general form, when p0 stands for the detectable minimal allele frequency, the above equation becomes
Cite this article
Wen, SH., Tzeng, JY., Kao, JT. et al. A two-stage design for multiple testing in large-scale association studies. J Hum Genet 51, 523–532 (2006). https://doi.org/10.1007/s10038-006-0393-6
DOI: https://doi.org/10.1007/s10038-006-0393-6