A grid-search algorithm for optimal allocation of sample size in two-stage association studies

Wen, S. H.; Hsiao, C. K.

doi:10.1007/s10038-007-0159-9

Original Article
Published: 30 June 2007

S. H. Wen¹ &
C. K. Hsiao²

Journal of Human Genetics volume 52, pages 650–658 (2007)

Abstract

Multiple testing arises routinely in genome-wide association studies with dense SNP maps. With numerous SNPs, genotyping cost and time increase dramatically, and many methods that control the family-wise error rate (FWER) become too conservative and lose power to detect disease-associated SNPs. Recently, several powerful two-stage strategies for multiple testing have received great attention. In this paper, we propose a grid-search algorithm for the optimal allocation of sample size in these two-stage procedures. Two types of constraints are considered: a fixed overall cost and a limited sample size. With the proposed optimal allocation, an acceptable number of false-positive results and greater power can be achieved within these constraints. The simulations indicate that, as a general rule, allocating at least 80% of the total cost to stage one provides maximum power, in contrast to other methods. If the per-genotype cost in stage two differs from that in stage one, allocating a smaller proportion of the total cost to the first stage maintains good power. For a limited total sample size, evaluating all the markers on 55% of the subjects in the first stage provides maximum power while reducing the cost by approximately 43%.

References

Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B Methodol 57:289–300
Botstein D, Risch N (2003) Discovering genotypes underlying human phenotypes: past successes for Mendelian disease, future approaches for complex disease. Nat Genet 33(suppl):228–237
Cardon LR, Bell JI (2001) Association study designs for complex diseases. Nat Rev Genet 2:91–99
Hirschhorn JN, Daly MJ (2005) Genome-wide association studies for common diseases and complex traits. Nat Rev Genet 6:95–108
Klein RJ, Zeiss C, Chew EY, Tsai JY, Sackler RS, Haynes C, Henning AK, SanGiovanni JP, Mane SM, Mayne ST, Bracken MB, Ferris FL, Ott J, Barnstable C, Hoh J (2005) Complement factor H polymorphism in age-related macular degeneration. Science 308:385–389
Kuchiba A, Tanaka NY, Ohashi Y (2006) Optimum two-stage design in case-control association studies using false discovery rate. J Hum Genet 51:1046–1054
Long AD, Langley CH (1999) The power of association studies to detect the contribution of candidate genetic loci to variation in complex traits. Genome Res 9:720–731
Miller RA, Galecki A, Shmookler-Reis RJ (2001) Interpretation, design, and analysis of gene array expression experiments. J Gerontol A Biol Sci 56A(2):B52–B57
Ohashi J, Clark AG (2005) Application of the stepwise focusing method to optimize the cost-effectiveness of genome-wide association studies with limited research budgets for genotyping and phenotyping. Ann Hum Genet 69:323–328
Risch N, Merikangas K (1996) The future of genetic studies of complex human diseases. Science 273:1516–1517
Saito A, Kamatani N (2002) Strategies for genome-wide association studies: optimization of study designs by the stepwise focusing method. J Hum Genet 47:360–365
Satagopan JM, Verbal DA, Venkatraman ES, Begg CB (2002) Two-stage designs for gene-disease association studies. Biometrics 58:163–170
Satagopan JM, Venkatraman ES, Begg CB (2004) Two-stage designs for gene-disease association studies with sample size constraints. Biometrics 60:589–597
Skol AD, Scott LJ, Abecasis GR, Boehnke M (2006) Joint analysis is more efficient than replication-based analysis for two-stage genome-wide association studies. Nat Genet 38:209–213
Thomas DC (2006) Are we ready for the genome-wide association studies? Cancer Epidemiol Biomarkers Prev 15(4):595–598
Thomas DC, Haile RW, Duggan D (2005) Recent developments in genomewide association scans: a workshop summary and review. Am J Hum Genet 77:337–345
van den Oord EJ, Sullivan PF (2003) A framework for controlling false discovery rates and minimizing the amount of genotyping in the search for disease mutations. Hum Heredity 56:188–199
Wang H, Thomas DC, Pe’er I, Stram DO (2006) Optimal two-stage genotyping designs for genome-wide association scans. Genet Epidemiol 30:356–368
Wen SH, Tzeng JY, Kao JT, Hsiao CK (2006) A two-stage design for multiple testing in large-scale association studies. J Hum Genet 51:523–532
Zehetmayer S, Bauer P, Posch M (2005) Two-stage designs for experiments with a large number of hypotheses. Bioinformatics 21:3771–3777

Author information

Authors and Affiliations

Department of Public Health, College of Medicine, Tzu-Chi University, Hualien, 97004, Taiwan
S. H. Wen
Department of Public Health and Institute of Epidemiology, College of Public Health, National Taiwan University, Taipei, 100, Taiwan
C. K. Hsiao

Corresponding author

Correspondence to S. H. Wen.

Appendix

Following Wen et al. (2006), the overall false-positive rate (FPR), i.e., the probability of declaring an unassociated SNP significant in both stages, is approximated by $\hat{\alpha}_{2}(N_{1})$:

$$\hat{\alpha}_{2}(N_{1}) = \begin{cases} 0.05/E(R), & \text{if}\; N_{1}/(N_{1} + N_{2}) \geqslant (z_{\alpha_{1}}/z_{\alpha_{2}})^{2} \\ 2 \times \left[1 - \Phi\left(\sqrt{\frac{N_{1} + N_{2}}{N_{1}}} \times z_{\alpha_{1}/2}\right)\right], & \text{if}\; N_{1}/(N_{1} + N_{2}) < (z_{\alpha_{1}}/z_{\alpha_{2}})^{2} \end{cases}$$

where the expected value of $R$, $E(R)$, can be estimated by $Mw\alpha_{1} + M(1-w)(1-\beta_{1}(N_{1}))$, with $1 - \beta_{1}(N_{1}) = \Phi\left(\frac{\sqrt{N_{1}}\,\delta - \sigma_{0} z_{\alpha_{1}/2}}{\sigma_{1}}\right)$ as the power in the first stage. Here $\Phi$ is the cumulative distribution function of the standard normal distribution, and $z_{\alpha_{1}/2}$ is its $100(1-\alpha_{1}/2)$-th percentile. Similarly, the other index, the TPR, i.e., the probability of declaring a truly associated SNP significant in both stages, is approximately $1-\beta_{2}$ evaluated at $\hat{\alpha}_{2}(N_{1})$ and $N_{1}+N_{2}$: $\mathrm{TPR} \cong 1 - \beta_{2} = \Phi\left(\frac{\sqrt{N_{1} + N_{2}}\,\delta - \sigma_{0} z_{\hat{\alpha}_{2}(N_{1})/2}}{\sigma_{1}}\right)$. Here $\sigma_{0}$ and $\sigma_{1}$ denote the standard deviation of the difference in mean allele frequencies under the null and alternative hypotheses, respectively.
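The quantities defined above can be evaluated numerically. The following Python sketch is one possible transcription, not the authors' code. It assumes that $M$ is the number of SNPs tested in stage one, that $w$ is the proportion of unassociated SNPs, and that $R$ is the number of SNPs carried into stage two (a reading inferred from the expression for $E(R)$). It also assumes that $\sigma_{0}$ and $\sigma_{1}$ are supplied by the user, and that the two branches of $\hat{\alpha}_{2}(N_{1})$ are combined as a minimum, the form used later in this appendix. All function names and numerical inputs are hypothetical.

```python
from math import erfc, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def upper_tail(x):
    """P(Z > x) for a standard normal Z; erfc stays accurate far in the tail."""
    return 0.5 * erfc(x / sqrt(2.0))


def z_upper(p):
    """Upper-tail quantile z_p, i.e. the value with P(Z > z_p) = p."""
    return -STD_NORMAL.inv_cdf(p)


def stage1_power(n1, delta, alpha1, sigma0, sigma1):
    """1 - beta_1(N_1): power of the first-stage test."""
    z1 = z_upper(alpha1 / 2)
    return STD_NORMAL.cdf((sqrt(n1) * delta - sigma0 * z1) / sigma1)


def expected_r(n1, m, w, delta, alpha1, sigma0, sigma1):
    """E(R) = M w alpha_1 + M (1 - w) (1 - beta_1(N_1))."""
    power1 = stage1_power(n1, delta, alpha1, sigma0, sigma1)
    return m * w * alpha1 + m * (1 - w) * power1


def alpha2_hat(n1, n2, m, w, delta, alpha1, sigma0, sigma1):
    """Approximate overall FPR, taken as the minimum of the two branches."""
    bonferroni = 0.05 / expected_r(n1, m, w, delta, alpha1, sigma0, sigma1)
    bound = 2 * upper_tail(sqrt((n1 + n2) / n1) * z_upper(alpha1 / 2))
    return min(bonferroni, bound)


def tpr(n1, n2, m, w, delta, alpha1, sigma0, sigma1):
    """TPR ~ 1 - beta_2, evaluated at alpha2_hat and N_1 + N_2."""
    a2 = alpha2_hat(n1, n2, m, w, delta, alpha1, sigma0, sigma1)
    z2 = z_upper(a2 / 2)
    return STD_NORMAL.cdf((sqrt(n1 + n2) * delta - sigma0 * z2) / sigma1)


# Hypothetical inputs, chosen only to exercise the formulas.
params = dict(m=10_000, w=0.99, delta=0.1, alpha1=0.05, sigma0=0.5, sigma1=0.5)
print("E(R)       =", expected_r(300, **params))
print("alpha2_hat =", alpha2_hat(300, 200, **params))
print("TPR        =", tpr(300, 200, **params))
```

The upper tail is computed with `erfc` rather than as `1 - cdf(x)` because the second-stage threshold can be very small, and the subtraction would round to zero.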

Consider a fixed total genotyping cost $T = MN_{1} + RN_{2}$, with the same per-genotype cost in both stages and fixed $M$ and $w$. The optimal design allocates $N_{1}$ and $N_{2}$ so as to maximize the TPR. For simplicity, let $N_{2} = kN_{1}$ and replace $R$ with $E(R)$. Then $T$ can be rewritten as $T = MN_{1} + (Mw\alpha_{1} + M(1-w)(1-\beta_{1}(N_{1})))kN_{1}$. In other words, $k = (T - N_{1}M)/(N_{1}E(R))$ is a function of $N_{1}$. The goal is to find the value of $(N_{1}, k)$ at which the two-stage method has maximum power. The TPR is defined as

$$ \mathrm{TPR} = 1 - \beta_{2}(N_{1}) = \Phi\left(\frac{\sqrt{\left(1 + \frac{T - N_{1}M}{N_{1}E(R)}\right)N_{1}}\,\delta - \sigma_{0} z_{\hat{\alpha}_{2}(N_{1})/2}}{\sigma_{1}}\right) $$

where $\hat{\alpha}_{2}(N_{1}) = \min(0.05/E(R),\,2 \times [1 - \Phi(\sqrt{1 + k} \times z_{\alpha_{1}/2})])$, and $\Phi$, $z_{\alpha_{1}/2}$, $\sigma_{0}$ and $\sigma_{1}$ are as defined above.

It is not straightforward to derive an analytical form for the optimal $k$, but the maximum TPR can be found by searching over the range of $N_{1}$. The upper bound for $N_{1}$ is $T/M$, which corresponds to allocating all resources to the first stage. By requiring the power in the first stage to be at least 0.8, we obtain a reasonable range for $N_{1}$ between $N_{1(0)}$ and $T/M$, where $N_{1(0)}$ satisfies the equation $1 - \beta_{1}(N_{1(0)}) = \Phi\left(\frac{\sqrt{N_{1(0)}}\,\delta - \sigma_{0} z_{\alpha_{1}/2}}{\sigma_{1}}\right) = 0.8$. Given $T$, $M$, $w$, the allele frequency $\bar{p}$ and the effect size $\delta$, we perform a grid search over $(N_{1}, k)$ for the optimal design.
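The fixed-cost search described above can be sketched in a few lines of Python. This is an illustrative reading of the appendix, not the authors' implementation. Because $k$ is determined by $N_{1}$ once $T$ is fixed, the sketch scans integer values of $N_{1}$ between $N_{1(0)}$ and $T/M$ and computes $k$ for each. The closed form used for $N_{1(0)}$ is obtained by inverting the first-stage power equation. The names `Design`, `evaluate` and `grid_search_fixed_cost`, the integer grid, and all numerical inputs (including $\sigma_{0}$ and $\sigma_{1}$, which are taken as given) are assumptions made for the example.

```python
from math import ceil, erfc, floor, sqrt
from statistics import NormalDist
from typing import NamedTuple

STD_NORMAL = NormalDist()


class Design(NamedTuple):
    n1: int            # first-stage sample size N_1
    k: float           # ratio N_2 / N_1 implied by the budget
    expected_r: float  # E(R), expected number of SNPs carried forward
    tpr: float         # approximate true-positive rate


def upper_tail(x):
    """P(Z > x) for a standard normal Z, accurate far in the tail."""
    return 0.5 * erfc(x / sqrt(2.0))


def z_upper(p):
    """Upper-tail quantile z_p with P(Z > z_p) = p."""
    return -STD_NORMAL.inv_cdf(p)


def evaluate(n1, total_cost, m, w, delta, alpha1, sigma0, sigma1):
    """Evaluate k, E(R) and the TPR for one candidate N_1 under budget T."""
    z1 = z_upper(alpha1 / 2)
    power1 = STD_NORMAL.cdf((sqrt(n1) * delta - sigma0 * z1) / sigma1)
    e_r = m * w * alpha1 + m * (1 - w) * power1
    k = (total_cost - n1 * m) / (n1 * e_r)
    a2 = min(0.05 / e_r, 2 * upper_tail(sqrt(1 + k) * z1))
    a2 = max(a2, 1e-300)  # guard against underflow so the quantile is defined
    z2 = z_upper(a2 / 2)
    power2 = STD_NORMAL.cdf((sqrt((1 + k) * n1) * delta - sigma0 * z2) / sigma1)
    return Design(n1, k, e_r, power2)


def grid_search_fixed_cost(total_cost, m, w, delta, alpha1, sigma0, sigma1,
                           min_power1=0.8):
    """Scan N_1 from N_1(0) up to T / M and return the design with largest TPR."""
    z1 = z_upper(alpha1 / 2)
    # N_1(0): solves 1 - beta_1(N_1(0)) = min_power1 in closed form.
    n1_lower = ((sigma1 * STD_NORMAL.inv_cdf(min_power1) + sigma0 * z1) / delta) ** 2
    lo, hi = ceil(n1_lower), floor(total_cost / m)
    if lo > hi:
        raise ValueError("budget too small for the required first-stage power")
    designs = (evaluate(n1, total_cost, m, w, delta, alpha1, sigma0, sigma1)
               for n1 in range(lo, hi + 1))
    return max(designs, key=lambda d: d.tpr)


# Hypothetical inputs: 10,000 SNPs, 99% unassociated, budget of 5 million genotypes.
best = grid_search_fixed_cost(5_000_000, 10_000, 0.99, 0.1, 0.05, 0.5, 0.5)
print(best)
print("share of cost in stage one:", 10_000 * best.n1 / 5_000_000)
```

A finer or coarser grid, or a non-integer $N_{1}$, would only change the `range` call. The rest of the evaluation is unaffected.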

When the total number of participants $N$ is limited and $N_{1} = \pi N$, the TPR is defined as $\mathrm{TPR} = 1 - \beta_{2}(N_{1}) = \Phi\left(\frac{\sqrt{N}\,\delta - \sigma_{0} z_{\hat{\alpha}_{2}(N_{1})/2}}{\sigma_{1}}\right) \leq 1 - \beta_{1}(N_{1})$, where $\hat{\alpha}_{2}(N_{1}) = \min(0.05/E(R),\,2 \times [1 - \Phi(z_{\alpha_{1}/2}/\sqrt{\pi})])$. Given $N$, $M$, and $w$, the TPR is affected only by $\hat{\alpha}_{2}(N_{1})$ and is less than or equal to $1-\beta_{1}(N_{1})$. If $\pi$ is small, the TPR is bounded from above by $1-\beta_{1}(N_{1})$, which is small as well. On the other hand, if $\pi$ is large, the TPR is likely to be large, but the cost increases dramatically. Hence, one needs to strike a balance between the genotyping cost and the TPR. Denote the cost by

$$ \hbox{T}(\pi)=(M\pi +\hbox{E}(R)(1-\pi))N=(\pi + (w\alpha_{1} + (1 - w)(1 - \beta_{1} (N_{1})))(1 - \pi))MN, $$

where $T(\pi)$ is also a function of $\pi$ given $N$, $M$, and $w$. Similarly, a grid search over $\pi$ can be set up to find the maximum TPR at an affordable cost.
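For the limited-sample-size case, a search over $\pi$ can tabulate the TPR and the cost side by side so that the trade-off is visible. The Python sketch below is a hedged illustration rather than the authors' procedure. It evaluates the TPR and $T(\pi)$ formulas as written above on an evenly spaced grid, and it leaves the choice of an "affordable" cost to the reader. The function names, the grid spacing, and the numerical inputs (including $\sigma_{0}$ and $\sigma_{1}$) are assumptions.

```python
from math import erfc, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def upper_tail(x):
    """P(Z > x) for a standard normal Z, accurate far in the tail."""
    return 0.5 * erfc(x / sqrt(2.0))


def z_upper(p):
    """Upper-tail quantile z_p with P(Z > z_p) = p."""
    return -STD_NORMAL.inv_cdf(p)


def evaluate_fraction(pi_frac, n, m, w, delta, alpha1, sigma0, sigma1):
    """Return (pi, TPR, cost T(pi)) when N_1 = pi * N subjects enter stage one."""
    z1 = z_upper(alpha1 / 2)
    power1 = STD_NORMAL.cdf((sqrt(pi_frac * n) * delta - sigma0 * z1) / sigma1)
    stage2_share = w * alpha1 + (1 - w) * power1  # equals E(R) / M
    a2 = min(0.05 / (m * stage2_share), 2 * upper_tail(z1 / sqrt(pi_frac)))
    z2 = z_upper(a2 / 2)
    tpr = STD_NORMAL.cdf((sqrt(n) * delta - sigma0 * z2) / sigma1)
    cost = (pi_frac + stage2_share * (1 - pi_frac)) * m * n
    return pi_frac, tpr, cost


def grid_search_fraction(n, m, w, delta, alpha1, sigma0, sigma1, steps=20):
    """Evaluate pi = 1/steps, 2/steps, ..., 1 and return one row per grid point."""
    grid = [i / steps for i in range(1, steps + 1)]
    return [evaluate_fraction(pi_frac, n, m, w, delta, alpha1, sigma0, sigma1)
            for pi_frac in grid]


# Hypothetical inputs: 500 subjects, 10,000 SNPs, 99% of them unassociated.
n_total, n_snps = 500, 10_000
rows = grid_search_fraction(n_total, n_snps, 0.99, 0.1, 0.05, 0.5, 0.5)
for pi_frac, tpr, cost in rows:
    saving = 1 - cost / (n_snps * n_total)  # relative to genotyping everyone
    print(f"pi={pi_frac:.2f}  TPR={tpr:.3f}  cost saving={saving:.1%}")

best = max(rows, key=lambda row: row[1])
print("grid point with the largest TPR:", best)
```

Reporting the saving relative to $MN$, the cost of genotyping every marker on every subject, is one convenient way to read the table. Other baselines are equally possible.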

About this article

Cite this article

Wen, S.H., Hsiao, C.K. A grid-search algorithm for optimal allocation of sample size in two-stage association studies. J Hum Genet 52, 650–658 (2007). https://doi.org/10.1007/s10038-007-0159-9

Received: 29 December 2006
Accepted: 10 May 2007
Published: 30 June 2007
Issue date: August 2007
DOI: https://doi.org/10.1007/s10038-007-0159-9
