A two-stage design for multiple testing in large-scale association studies

Wen, Shu-Hui; Tzeng, Jung-Ying; Kao, Jau-Tsuen; Hsiao, Chuhsing Kate

doi:10.1007/s10038-006-0393-6

Original Article
Published: 01 June 2006

Journal of Human Genetics volume 51, pages 523–532 (2006)

Abstract

Modern association studies often involve a large number of markers and hence face the problem of testing multiple hypotheses. Traditional procedures are usually over-conservative and have low power to detect mild genetic effects. From the design perspective, we propose a two-stage selection procedure to address this concern. Our main principle is to reduce the total number of tests by removing clearly unassociated markers in the first-stage test, which uses a less stringent nominal level. Next, conditional on the findings of the first stage, a more conservative test is conducted in the second stage using the first-stage data together with the augmented data. Previous studies have suggested using independent samples to avoid inflated errors. However, we found that, after accounting for the dependence between these two samples, the true discovery rate increases substantially. In addition, the cost of genotyping can be greatly reduced via this approach. Results from a study of hypertriglyceridemia and from simulations suggest that the two-stage method has a higher overall true positive rate (TPR) with a controlled overall false positive rate (FPR) when compared with single-stage approaches. We also report the analytical form of its overall FPR, which may be useful in guiding study design to achieve a high TPR while retaining the desired FPR.

Author information

Authors and Affiliations

Department of Public Health, College of Medicine, Tzu-Chi University, Hua-Lien, 97004, Taiwan
Shu-Hui Wen
Department of Statistics and Bioinformatics Research Center, North Carolina State University, Raleigh, NC, 27606, USA
Jung-Ying Tzeng
Department of Clinical Laboratory Sciences and Medical Biotechnology, College of Medicine, National Taiwan University, Taipei, 100, Taiwan
Jau-Tsuen Kao
Division of Biostatistics, Institute of Epidemiology, National Taiwan University, Taipei, 100, Taiwan
Chuhsing Kate Hsiao

Corresponding author

Correspondence to Chuhsing Kate Hsiao.

Appendices

Appendix 1

TPR is defined as

$$\begin{aligned} \hbox{TPR} & \equiv \hbox{Pr} (\hbox{reject null} | \hbox{truly associated}) \\ & = \hbox{Pr} (\hbox{significant in first and second stage} | \hbox{truly associated})\\ & = \hbox{Pr} (\hbox{significant in first} | \hbox{truly associated})\\ & \quad \times \hbox{Pr} (\hbox{significant in second} | \hbox{significant in first, truly associated})\\ & = U_{1} \times U_{2}.\\ \end{aligned}$$

Similarly, by the same argument for conditional probability, the overall FPR is

$$\begin{aligned} \hbox{FPR} & \equiv \hbox{Pr} (\hbox{significant} | \hbox{no association})\\ & = \hbox{Pr} (\hbox{significant in first} | \hbox{no association})\\ & \quad \times \hbox{Pr} (\hbox{significant in second} | \hbox{significant in first, no association}) \\ & = Q_{1} \times Q_{2}. \\ \end{aligned}$$
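The product form FPR = Q₁ × Q₂ can be checked numerically. The Python sketch below is an illustration only, not the authors' code. It assumes that each statistic is a standardized sum of independent N(0, 1) scores under no association, that the second stage pools the first-stage sample of size `n1` with `n2` new observations, and that both tests are two-sided. The function name and all parameter values are hypothetical.

```python
import random
from statistics import NormalDist


def simulate_two_stage_null(n1, n2, alpha1, alpha2, n_markers=100_000, seed=0):
    """Monte Carlo estimates of Q1, Q2 and the overall FPR for null markers.

    Assumption (illustrative): a marker's statistic is a standardized sum of
    i.i.d. N(0, 1) scores, and stage 2 reuses the n1 stage-1 observations.
    """
    rng = random.Random(seed)
    a = NormalDist().inv_cdf(1 - alpha1 / 2)  # two-sided stage-1 cutoff
    b = NormalDist().inv_cdf(1 - alpha2 / 2)  # two-sided stage-2 cutoff
    pass_first = 0
    pass_both = 0
    for _ in range(n_markers):
        s1 = rng.gauss(0.0, n1 ** 0.5)  # sum over the stage-1 sample
        s2 = rng.gauss(0.0, n2 ** 0.5)  # sum over the added sample
        t_first = s1 / n1 ** 0.5
        t_pooled = (s1 + s2) / (n1 + n2) ** 0.5
        if abs(t_first) >= a:
            pass_first += 1
            if abs(t_pooled) >= b:
                pass_both += 1
    q1 = pass_first / n_markers                        # Pr(significant in first)
    q2 = pass_both / pass_first if pass_first else 0.0  # Pr(second | first)
    fpr = pass_both / n_markers                        # Pr(significant in both)
    return q1, q2, fpr


if __name__ == "__main__":
    q1, q2, fpr = simulate_two_stage_null(200, 200, alpha1=0.05, alpha2=0.005)
    print(f"Q1={q1:.4f}  Q2={q2:.4f}  FPR={fpr:.5f}  Q1*Q2={q1 * q2:.5f}")
```

Because Q₂ is estimated only among the markers that pass the first stage, the product of the two estimates reproduces the joint rejection frequency exactly, which mirrors the conditional-probability argument above.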

Define T(X) as the test statistic used in the first stage, where X denotes the data from the MN₁ genotypings, and T(X*) as the statistic used in the second stage, where X* contains the information from the R(N₁+N₂) genotypings. The test statistic can be the difference in allele frequency between the case and control groups, or a chi-square statistic derived from contingency tables. The two statistics are correlated because the first-stage data are part of both X and X*. Without loss of generality, we assume here that both statistics are asymptotically normal, provided N₁ is large, and that the allele frequencies obtained from N₁ and N₁+N₂ are roughly equal. Let ρ denote the correlation between T(X) and T(X*). With critical values a and b corresponding to the nominal levels α₁ and α₂, the overall FPR is then:

$$\begin{aligned} \Pr (T({\mathbf{X}}) \geq a, T({\mathbf{X}}^{*}) \geq b) & = \Pr {\left( {\Sigma ^{{ -1/2}} \cdot {\left( {\begin{array}{*{20}c} {{T({\mathbf{X}})}} \\ {{T({\mathbf{X}}^{*})}} \\ \end{array} } \right)} \geq \Sigma ^{{-1/2}} \cdot {\left( {\begin{array}{*{20}c} {a} \\ {b} \\ \end{array} } \right)}} \right)} \\ & = \Pr {\left( {\Sigma ^{{ -1/2}} \cdot {\left( {\begin{array}{*{20}c} {{T({\mathbf{X}})}} \\ {{T({\mathbf{X}}^{*})}} \\ \end{array} } \right)} \geq {\left( {\begin{array}{*{20}c} {{s \cdot a + t \cdot b}} \\ {{t \cdot a + s \cdot b}} \\ \end{array} } \right)}} \right)},\\ \end{aligned} $$

where $\Sigma = \left(\begin{array}{*{20}c} 1 & \rho\\ \rho & 1\\ \end{array}\right)$ and $\Sigma^{-1/2} = \left(\begin{array}{*{20}c} s & t\\ t & s\\ \end{array}\right)$, whose elements depend on ρ. Applying their asymptotic normality with zero mean vector and covariance matrix Σ, the above becomes

$$\begin{aligned} & \Pr {\left( {\Sigma ^{{ - 1/2}} \cdot {\left( {\begin{array}{*{20}c} {{T({\mathbf{X}})}} \\ {{T({\mathbf{X}}^{*})}} \\ \end{array} } \right)} \geq {\left( {\begin{array}{*{20}c} {{s \cdot a + t \cdot b}} \\ {{t \cdot a + s \cdot b}} \\ \end{array} } \right)}} \right)} \approx \Pr {\left( {{\left( {\begin{array}{*{20}c} {{Z_{1} }} \\ {{Z_{2} }} \\ \end{array} } \right)} \geq {\left( {\begin{array}{*{20}c} {{s \cdot a + t \cdot b}} \\ {{t \cdot a + s \cdot b}} \\ \end{array} } \right)}} \right)} \\ & \quad = \Pr (Z_{1} \geq s \cdot a + t \cdot b) \cdot \Pr (Z_{2} \geq t \cdot a + s \cdot b),\\ \end{aligned} $$

where $a=z_{1-\alpha_{1}/2}$ and $b=z_{1-\alpha_{2}/2}=z_{1-\alpha_{1}/(2R)}$ are percentiles of the standard normal distribution, and Z₁ and Z₂ are independent standard normal random variables. If ρ is large, this FPR is bounded closely from above by $\Pr (Z_{2}\ge t \cdot a + s \cdot b)$. This bound is determined by α₁, α₂, R, and ρ, and is smaller than 0.001 when R ≥ 10; a quick guess for R would be Mα₁. If ρ = 0, the bound becomes Pr(Z₂ ≥ b), which corresponds to the second-stage level α₂.
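For readers who want to evaluate this approximation, the elements of the inverse square root have a closed form that follows from the eigendecomposition of the 2×2 correlation matrix: s = [(1+ρ)^(−1/2) + (1−ρ)^(−1/2)]/2 and t = [(1+ρ)^(−1/2) − (1−ρ)^(−1/2)]/2. The Python sketch below uses these expressions; the function names are hypothetical, and the tails are one-sided exactly as written in the display, so at ρ = 0 the second factor evaluates to α₂/2, i.e. α₂ on the two-sided scale.

```python
from statistics import NormalDist


def inv_sqrt_elements(rho):
    """Return (s, t), the diagonal and off-diagonal entries of the inverse
    square root of [[1, rho], [rho, 1]], valid for -1 < rho < 1."""
    plus = (1 + rho) ** -0.5   # inverse root of the eigenvalue 1 + rho
    minus = (1 - rho) ** -0.5  # inverse root of the eigenvalue 1 - rho
    return (plus + minus) / 2, (plus - minus) / 2


def fpr_product_approx(alpha1, R, rho):
    """Evaluate Pr(Z1 >= s*a + t*b) * Pr(Z2 >= t*a + s*b) with alpha2 = alpha1/R.

    Returns (product, second_factor); the second factor is the upper bound
    discussed in the text. One-sided tails are used, as in the display.
    """
    std = NormalDist()
    a = std.inv_cdf(1 - alpha1 / 2)
    b = std.inv_cdf(1 - alpha1 / (2 * R))
    s, t = inv_sqrt_elements(rho)
    first = 1 - std.cdf(s * a + t * b)
    second = 1 - std.cdf(t * a + s * b)
    return first * second, second


if __name__ == "__main__":
    for rho in (0.0, 0.5, 0.7):
        product, bound = fpr_product_approx(alpha1=0.05, R=10, rho=rho)
        print(f"rho={rho:.1f}  product={product:.3e}  bound={bound:.3e}")
```

The off-diagonal element t is negative for positive ρ, which is why the first factor approaches 1 as ρ grows and the product is then governed by the second factor.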

To investigate the effect of sample size, note that T(X) is close to $\sqrt{N_{1}/(N_{1}+N_{2})}\times T({\mathbf{X}}^{*})$ if the average allele frequencies in X and X* are similar, therefore,

$$\begin{aligned} {\hbox{FPR}} & = \Pr \left( T({\mathbf{X}}^{*}) \geq z_{{1 - \alpha _{2} /2}} ,T({\mathbf{X}}) \geq z_{{1 - \alpha _{1} /2}} |{\hbox{no association}} \right) \\ & \cong \Pr {\left( {T({\mathbf{X}}^{*}) \geq \max {\left( {z_{{1 - \alpha_{2} /2}}, {\sqrt {\frac{{N_{1} + N_{2} }} {{N_{1} }}} } \times z_{{1 - \alpha_{1} /2}} } \right)}|{\hbox{no association}}} \right)} \\ & = \min {\left( {\alpha _{2}, 2 \times {\left[ {1 - \Phi {\left( {{\sqrt {\frac{{N }} {{N_{1}}}} } \times z_{{1 - \alpha_{1} /2}} } \right)}} \right]}} \right)},\\ \end{aligned} $$

where Φ is the cumulative distribution function of the standard normal and N = N₁+N₂. If the proportion N₁/(N₁+N₂) is larger than $(z_{1-{\alpha_{1}}/2}/z_{1-{\alpha_{2}}/2})^{2}$, the overall FPR is approximately α₂. Otherwise, it becomes $2 \times \left[ 1-\Phi \left( \sqrt{N/N_{1}}\times z_{1-\alpha_{1}/2} \right) \right]$. Similarly, the TPR at fixed α₁ and α₂ is approximately (1−β₂).
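The closed form is simple to evaluate when planning the split between the two stages. The following Python sketch computes the approximate overall FPR and the first-stage sample fraction above which the FPR reduces to α₂. The function names and the numerical inputs are hypothetical choices for illustration; only the formula itself comes from the derivation above.

```python
from statistics import NormalDist


def two_stage_fpr(alpha1, alpha2, n1, n2):
    """Approximate overall FPR: min(alpha2, 2 * [1 - Phi(sqrt(N/N1) * z1)]),
    where N = n1 + n2 and z1 is the two-sided stage-1 critical value."""
    std = NormalDist()
    z1 = std.inv_cdf(1 - alpha1 / 2)
    scaled = ((n1 + n2) / n1) ** 0.5 * z1
    return min(alpha2, 2 * (1 - std.cdf(scaled)))


def stage1_fraction_threshold(alpha1, alpha2):
    """Value of (z_{1-alpha1/2} / z_{1-alpha2/2})^2; when N1/(N1+N2) exceeds
    it, the approximate overall FPR equals alpha2."""
    std = NormalDist()
    z1 = std.inv_cdf(1 - alpha1 / 2)
    z2 = std.inv_cdf(1 - alpha2 / 2)
    return (z1 / z2) ** 2


if __name__ == "__main__":
    alpha1, alpha2 = 0.05, 0.005  # illustrative levels, alpha2 = alpha1 / 10
    print("threshold:", round(stage1_fraction_threshold(alpha1, alpha2), 4))
    for n1, n2 in ((600, 400), (200, 800)):
        print(n1, n2, two_stage_fpr(alpha1, alpha2, n1, n2))
```

With these illustrative levels, a first-stage fraction of 0.6 lies above the threshold, so the FPR is α₂, whereas a fraction of 0.2 lies below it and the FPR is governed by the first-stage cutoff instead.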

To derive the expected numbers of successes and errors, it suffices to calculate the path probability Q₂ in the second stage. Given α₂=α₁/R, its upper bound is α₂/α₁, which is approximately 1/[E(R₀)+E(R_a)]. Similarly, the conditional probability U₂, at fixed α₁ and α₂, is approximately (1−β₂)/(1−β₁).

Appendix 2

Let p_c and p_n represent the allele frequencies for the case and control groups, respectively, and δ their absolute difference. Let $\bar{p}$ be their average, weighted by corresponding sample sizes N_c and N_n. Following the assumption that the allele frequency ranges from 10% to 90%, it can be derived that

$$\left\{\begin{array}{*{20}l} 0.9 \ge \max (p_{{\rm c}}, p_{{\rm n}}) = \bar{p} + \delta /2 \ge 0.1, \\ 0.9 \ge \min (p_{{\rm c}}, p_{{\rm n}}) = \bar{p} - \delta /2 \ge 0.1, \\ 0.1 \le \bar{p} \le 0.9, \\ - 0.8 \le p_{{\rm c}} - p_{{\rm n}} \le 0.8, \\ 0 \le \delta \le \min (0.8, 2 \bar{p} - 0.2).\\ \end{array} \right.$$

Therefore, the relationship between $\bar{p}$ and δ can be formulated. In general form, where p₀ denotes the minimal detectable allele frequency, the above system becomes

$$\left\{ {\begin{array}{*{20}l} {{\forall \bar{p} \in (p_{0},0.5),}} & {{\delta \in (0, 2 \bar{p} - 2 \times p_{0})}}, \\ {{\forall \bar{p} \in (0.5,1 - p_{0}),}} & {{\delta \in (0, 2 \times (1 - p_{0}) - 2 \bar{p})}}.\\ \end{array}} \right.$$
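As a small worked illustration of the general form, the Python sketch below returns the upper limit of δ for a given average allele frequency. The function name is hypothetical, and the default p₀ = 0.1 simply mirrors the 10% to 90% range assumed above.

```python
def delta_upper_bound(p_bar, p0=0.1):
    """Upper limit of the allele-frequency difference delta for an average
    allele frequency p_bar, given a minimal detectable frequency p0."""
    if not 0 < p0 < 0.5:
        raise ValueError("p0 must lie strictly between 0 and 0.5")
    if not p0 < p_bar < 1 - p0:
        raise ValueError("p_bar must lie strictly between p0 and 1 - p0")
    if p_bar <= 0.5:
        return 2 * p_bar - 2 * p0        # lower branch of the general form
    return 2 * (1 - p0) - 2 * p_bar      # upper branch of the general form


if __name__ == "__main__":
    for p_bar in (0.2, 0.3, 0.5, 0.7, 0.8):
        print(p_bar, round(delta_upper_bound(p_bar), 3))
```

The two branches meet at an average frequency of 0.5, where both give 1 − 2p₀, so assigning that boundary point to the first branch does not change the result.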

About this article

Cite this article

Wen, SH., Tzeng, JY., Kao, JT. et al. A two-stage design for multiple testing in large-scale association studies. J Hum Genet 51, 523–532 (2006). https://doi.org/10.1007/s10038-006-0393-6

Received: 09 November 2005
Accepted: 14 February 2006
Published: 01 June 2006
Issue date: June 2006
DOI: https://doi.org/10.1007/s10038-006-0393-6

This article is cited by

A simple Bayesian mixture model with a hybrid procedure for genome-wide association studies
- Yu-Chung Wei
- Shu-Hui Wen
- Chuhsing K Hsiao
European Journal of Human Genetics (2010)
A grid-search algorithm for optimal allocation of sample size in two-stage association studies
- S. H. Wen
- C. K. Hsiao
Journal of Human Genetics (2007)
Optimum two-stage designs in case–control association studies using false discovery rate
- Aya Kuchiba
- Noriko Y. Tanaka
- Yasuo Ohashi
Journal of Human Genetics (2006)
