Abstract
Multiple testing arises routinely in genome-wide association studies with dense SNP maps. With numerous SNPs, genotyping cost and time increase dramatically, and many methods that control the family-wise error rate (FWER) become too conservative and lose power to detect disease-associated SNPs. Recently, several powerful two-stage strategies for multiple testing have received considerable attention. In this paper, we propose a grid-search algorithm for the optimal allocation of sample size in these two-stage procedures. Two types of constraints are considered: a fixed overall cost and a limited total sample size. With the proposed optimal allocation, the false-positive rate is kept at a tolerable level and power is increased within these constraints. The simulations indicate that, as a general rule, allocating at least 80% of the total cost to stage one provides maximum power, in contrast to other methods. If the per-genotype cost in stage two differs from that in stage one, allocating a smaller proportion of the total cost to stage one still maintains good power. For a limited total sample size, evaluating all markers on 55% of the subjects in the first stage provides maximum power while reducing the cost by approximately 43%.
References
Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B Methodol 57:289–300
Botstein D, Risch N (2003) Discovering genotypes underlying human phenotypes: past successes for Mendelian disease, future approaches for complex disease. Nat Genet 33(suppl):228–237
Cardon LR, Bell JI (2001) Association study designs for complex diseases. Nat Rev Genet 2:91–99
Hirschhorn JN, Daly MJ (2005) Genome-wide association studies for common diseases and complex traits. Nat Rev Genet 6:95–108
Klein RJ, Zeiss C, Chew EY, Tsai JY, Sackler RS, Haynes C, Henning AK, SanGiovanni JP, Mane SM, Mayne ST, Bracken MB, Ferris FL, Ott J, Barnstable C, Hoh J (2005) Complement factor H polymorphism in age-related macular degeneration. Science 308:385–389
Kuchiba A, Tanaka NY, Ohashi Y (2006) Optimum two-stage design in case-control association studies using false discovery rate. J Hum Genet 51:1046–1054
Long AD, Langley CH (1999) The power of association studies to detect the contribution of candidate genetic loci to variation in complex traits. Genome Res 9:720–731
Miller RA, Galecki A, Shmookler-Reis RJ (2001) Interpretation, design, and analysis of gene array expression experiments. J Gerontol A Biol Sci 56A(2):B52–B57
Ohashi J, Clark AG (2005) Application of the stepwise focusing method to optimize the cost-effectiveness of genome-wide association studies with limited research budgets for genotyping and phenotyping. Ann Hum Genet 69:323–328
Risch N, Merikangas K (1996) The future of genetic studies of complex human diseases. Science 273:1516–1517
Saito A, Kamatani N (2002) Strategies for genome-wide association studies: optimization of study designs by the stepwise focusing method. J Hum Genet 47:360–365
Satagopan JM, Verbal DA, Venkatraman ES, Begg CB (2002) Two-stage designs for gene-disease association studies. Biometrics 58:163–170
Satagopan JM, Venkatraman ES, Begg CB (2004) Two-stage designs for gene-disease association studies with sample size constraints. Biometrics 60:589–597
Skol AD, Scott LJ, Abecasis GR, Boehnke M (2006) Joint analysis is more efficient than replication-based analysis for two-stage genome-wide association studies. Nat Genet 38:209–213
Thomas DC (2006) Are we ready for the genome-wide association studies? Cancer Epidemiol Biomarkers Prev 15(4):595–598
Thomas DC, Haile RW, Duggan D (2005) Recent developments in genomewide association scans: a workshop summary and review. Am J Hum Genet 77:337–345
van den Oord EJ, Sullivan PF (2003) A framework for controlling false discovery rates and minimizing the amount of genotyping in the search for disease mutations. Hum Heredity 56:188–199
Wang H, Thomas DC, Pe’er I, Stram DO (2006) Optimal two-stage genotyping designs for genome-wide association scans. Genet Epidemiol 30:356–368
Wen SH, Tzeng JY, Kao JT, Hsiao CK (2006) A two-stage design for multiple testing in large-scale association studies. J Hum Genet 51:523–532
Zehetmayer S, Bauer P, Posch M (2005) Two-stage designs for experiments with a large number of hypotheses. Bioinformatics 21:3771–3777
Appendix
Following the work of Wen et al. (2006), the overall FPR, the probability of declaring unassociated SNPs significant in both stages, is approximated by \(\hat{\alpha}_{2} (N_{1})\):
\[\hat{\alpha}_{2} (N_{1}) = \min\left(0.05/E(R),\,2 \times \left[1 - \Phi\left({\sqrt {1 + N_{2}/N_{1}}} \times z_{{\alpha_{1} /2}}\right)\right]\right),\]
where the expected value of R, E(R), can be estimated by \(Mw\alpha_{1} + M(1 - w)(1 - \beta_{1} (N_{1}))\), with \(1 - \beta_{1} (N_{1}) = \Phi\left(\frac{{{\sqrt {N_{1}}}\delta - \sigma_{0} z_{{\alpha_{1} /2}}}}{{\sigma_{1}}}\right)\) as the power in the first stage. Here Φ is the cumulative distribution function of the standard normal distribution, and \(z_{{\alpha_{1} /2}}\) is the \(100(1 - \alpha_{1}/2)\)th percentile of the standard normal distribution. Similarly, the other index, the TPR, is the probability of declaring truly associated SNPs significant in both stages. It is approximately \(1 - \beta_{2}\), evaluated at the significance level \(\hat{\alpha}_{2} (N_{1})\) and the total sample size \(N_{1} + N_{2}\): \({\hbox{TPR}} \cong 1 - \beta_{2} = \Phi\left(\frac{{{\sqrt {N_{1} + N_{2}}}\delta - \sigma_{0} z_{{\hat{\alpha}_{2} (N_{1})/2}}}}{{\sigma_{1}}}\right).\) Here \(\sigma_{0}\) and \(\sigma_{1}\) denote the standard deviation of the difference in mean allele frequencies under the null and alternative hypotheses, respectively.
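The first-stage power and E(R) can be computed directly from these expressions. The following is a minimal sketch, not the authors' implementation: the function names are hypothetical, and \(\sigma_{0}\) and \(\sigma_{1}\) are passed in as plain numbers because their formulas are not reproduced in this appendix.

```python
from math import sqrt
from statistics import NormalDist

# Standard normal distribution: .cdf is Phi, .inv_cdf gives quantiles.
STD_NORMAL = NormalDist()


def first_stage_power(n1, delta, sigma0, sigma1, alpha1):
    """1 - beta_1(N_1) = Phi((sqrt(N_1)*delta - sigma0*z_{alpha1/2}) / sigma1)."""
    z = STD_NORMAL.inv_cdf(1 - alpha1 / 2)  # upper alpha1/2 critical value
    return STD_NORMAL.cdf((sqrt(n1) * delta - sigma0 * z) / sigma1)


def expected_selected(m, w, alpha1, power1):
    """E(R) = M*w*alpha1 + M*(1 - w)*(1 - beta_1(N_1))."""
    return m * w * alpha1 + m * (1 - w) * power1


# Illustrative inputs only; none of these values come from the paper.
power1 = first_stage_power(n1=200, delta=0.1, sigma0=0.5, sigma1=0.5, alpha1=0.05)
exp_r = expected_selected(m=1000, w=0.99, alpha1=0.05, power1=power1)
```

With zero effect size and \(\sigma_{0} = \sigma_{1}\), the power expression reduces to \(\alpha_{1}/2\), which gives a quick sanity check on the one-sided form of the formula.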
Consider a fixed total genotyping cost \(T = MN_{1} + RN_{2}\), with the same per-genotype cost in both stages and with M and w fixed. The optimal design allocates \(N_{1}\) and \(N_{2}\) so as to maximize the TPR. For simplicity, let \(N_{2} = kN_{1}\) and replace R with E(R). Then T can be rewritten as \(T = MN_{1} + [Mw\alpha_{1} + M(1 - w)(1 - \beta_{1} (N_{1}))]kN_{1}\). In other words, \(k = (T - N_{1}M)/(N_{1}E(R))\) is a function of \(N_{1}\). The goal is to find the value of \((N_{1}, k)\) that gives the two-stage method maximum power. The TPR is defined as
\[{\hbox{TPR}} = 1 - \beta_{2} (N_{1}) = \Phi\left(\frac{{{\sqrt {(1 + k)N_{1}}}\delta - \sigma_{0} z_{{\hat{\alpha}_{2} (N_{1})/2}}}}{{\sigma_{1}}}\right),\]
where \(\hat{\alpha}_{2} (N_{1}) = \min(0.05/E(R),\,2 \times [1 - \Phi ({\sqrt {1 + k}} \times z_{{\alpha_{1} /2}})]),\) and Φ, \(z_{{\alpha_{1} /2}}\), \(\sigma_{0}\) and \(\sigma_{1}\) are as defined above.
It is not straightforward to derive an analytical form for k, but the maximum TPR can be found by searching over the range of \(N_{1}\). The upper bound for \(N_{1}\) is T/M, which corresponds to allocating all resources to the first stage. Requiring the first-stage power to be at least 0.8 gives a lower bound \(N_{{1(0)}}\), so that a reasonable range for \(N_{1}\) runs from \(N_{{1(0)}}\) to T/M, where \(N_{{1(0)}}\) satisfies the equation \(1 - \beta_{1} (N_{{1(0)}}) = \Phi\left(\frac{{{\sqrt {N_{{1(0)}}}}\delta - \sigma_{0} z_{{\alpha_{1} /2}}}}{{\sigma_{1}}}\right) = 0.8.\) Given T, M, w, the allele frequency \(\bar{p}\) and the effect size δ, we perform a grid search over \((N_{1}, k)\) for the optimal design.
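The fixed-cost search described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the function names are hypothetical, \(\sigma_{0}\) and \(\sigma_{1}\) are supplied as numbers (in the paper they depend on \(\bar{p}\) and δ), the grid is taken over integer values of \(N_{1}\), and the two-sided tail \(2[1 - \Phi(x)]\) is evaluated as `erfc(x / sqrt(2))` for numerical stability.

```python
from math import ceil, erfc, floor, sqrt
from statistics import NormalDist

PHI = NormalDist()  # standard normal


def tpr_fixed_cost(n1, total_cost, m, w, delta, sigma0, sigma1, alpha1):
    """Return (TPR, k) for a given N_1 under total cost T = M*N_1 + E(R)*k*N_1."""
    z1 = PHI.inv_cdf(1 - alpha1 / 2)
    power1 = PHI.cdf((sqrt(n1) * delta - sigma0 * z1) / sigma1)
    exp_r = m * w * alpha1 + m * (1 - w) * power1        # E(R)
    k = (total_cost - n1 * m) / (n1 * exp_r)             # N_2 = k * N_1
    # alpha2_hat = min(0.05/E(R), 2*[1 - Phi(sqrt(1+k) * z1)])
    alpha2 = min(0.05 / exp_r, erfc(sqrt(1 + k) * z1 / sqrt(2)))
    if alpha2 <= 0.0:
        # The threshold underflowed to zero, so nothing can be declared significant.
        return 0.0, k
    z2 = -PHI.inv_cdf(alpha2 / 2)                        # z_{alpha2_hat/2}
    tpr = PHI.cdf((sqrt((1 + k) * n1) * delta - sigma0 * z2) / sigma1)
    return tpr, k


def grid_search_fixed_cost(total_cost, m, w, delta, sigma0, sigma1, alpha1):
    """Search integer N_1 from N_1(0) to T/M; return (TPR, N_1, k) or None."""
    z1 = PHI.inv_cdf(1 - alpha1 / 2)
    # N_1(0): smallest N_1 with first-stage power 0.8 (closed-form inversion).
    n1_lower = ceil(((sigma1 * PHI.inv_cdf(0.8) + sigma0 * z1) / delta) ** 2)
    n1_upper = floor(total_cost / m)
    best = None
    for n1 in range(n1_lower, n1_upper + 1):
        tpr, k = tpr_fixed_cost(n1, total_cost, m, w, delta, sigma0, sigma1, alpha1)
        if best is None or tpr > best[0]:
            best = (tpr, n1, k)
    return best  # None when N_1(0) exceeds T/M


# Illustrative inputs only; none of these values come from the paper.
best = grid_search_fixed_cost(
    total_cost=400_000, m=1000, w=0.99, delta=0.1, sigma0=0.5, sigma1=0.5, alpha1=0.05
)
```

Because k is determined by \(N_{1}\) through the cost constraint, the search is one-dimensional even though the design is reported as the pair \((N_{1}, k)\).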
When the total number of participants N is limited and \(N_{1} = \pi N\), the TPR is defined as \({\hbox{TPR}} = 1 - \beta_{2} (N_{1}) = \Phi\left(\frac{{{\sqrt N}\delta - \sigma_{0} z_{{\hat{\alpha}_{2} (N_{1})/2}}}}{{\sigma_{1}}}\right) \leq 1 - \beta_{1} (N_{1}),\) where \(\hat{\alpha}_{2} (N_{1}) = \min (0.05/E(R),\,2 \times [1 - \Phi (z_{{\alpha_{1} /2}} /{\sqrt \pi})]).\) Given N, M, and w, the TPR depends only on \(\hat{\alpha}_{2} (N_{1})\) and is smaller than or equal to \(1 - \beta_{1} (N_{1})\). If π is small, the upper bound \(1 - \beta_{1} (N_{1})\) is small, and so is the TPR. On the other hand, if π is large, the TPR is likely to be large, but the cost increases dramatically. Hence, one needs to strike a balance between genotyping cost and the TPR. With \(N_{2} = (1 - \pi)N\) and R replaced by E(R), the cost \(T = MN_{1} + RN_{2}\) can be written as
\[T(\pi) = M\pi N + E(R)(1 - \pi)N,\]
where T(π) is also a function of π given N, M, and w. Similarly, a grid search over π can be set up to find numerically the maximum TPR at an affordable cost.
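A grid over π for the limited-sample-size case can be sketched in the same way. This is a hedged illustration, not the authors' code: the function names and the grid step are hypothetical, \(\sigma_{0}\) and \(\sigma_{1}\) are supplied as numbers, and the cost is computed on the assumption that the remaining \((1 - \pi)N\) subjects are genotyped on E(R) markers in stage two, following \(T = MN_{1} + RN_{2}\).

```python
from math import erfc, sqrt
from statistics import NormalDist

PHI = NormalDist()  # standard normal


def evaluate_pi(pi, n, m, w, delta, sigma0, sigma1, alpha1):
    """Return (TPR, cost) when a fraction pi of the N subjects is used in stage one."""
    n1 = pi * n
    z1 = PHI.inv_cdf(1 - alpha1 / 2)
    power1 = PHI.cdf((sqrt(n1) * delta - sigma0 * z1) / sigma1)
    exp_r = m * w * alpha1 + m * (1 - w) * power1        # E(R)
    # alpha2_hat = min(0.05/E(R), 2*[1 - Phi(z1 / sqrt(pi))]), tail via erfc
    alpha2 = min(0.05 / exp_r, erfc(z1 / sqrt(pi) / sqrt(2)))
    if alpha2 <= 0.0:
        tpr = 0.0  # threshold underflowed to zero
    else:
        z2 = -PHI.inv_cdf(alpha2 / 2)                    # z_{alpha2_hat/2}
        tpr = PHI.cdf((sqrt(n) * delta - sigma0 * z2) / sigma1)
    # Assumed cost: all M markers on N_1 subjects, E(R) markers on N - N_1 subjects.
    cost = m * pi * n + exp_r * (1 - pi) * n
    return tpr, cost


def grid_search_pi(n, m, w, delta, sigma0, sigma1, alpha1, steps=20):
    """Evaluate pi = 1/steps, 2/steps, ..., 1 and return rows of (pi, TPR, cost)."""
    rows = []
    for i in range(1, steps + 1):
        pi = i / steps
        tpr, cost = evaluate_pi(pi, n, m, w, delta, sigma0, sigma1, alpha1)
        rows.append((pi, tpr, cost))
    return rows


# Illustrative inputs only; none of these values come from the paper.
rows = grid_search_pi(n=1000, m=1000, w=0.99, delta=0.1, sigma0=0.5, sigma1=0.5, alpha1=0.05)
one_stage_cost = 1000 * 1000  # M * N: every marker typed on every subject
for pi, tpr, cost in rows:
    saving = 1 - cost / one_stage_cost
    print(f"pi={pi:.2f}  TPR={tpr:.4f}  cost saving={saving:.1%}")
```

Returning the whole table rather than a single maximum leaves the trade-off between TPR and cost visible, so a value of π can be chosen against whatever budget applies.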
Cite this article
Wen, S.H., Hsiao, C.K. A grid-search algorithm for optimal allocation of sample size in two-stage association studies. J Hum Genet 52, 650–658 (2007). https://doi.org/10.1007/s10038-007-0159-9