Abstract
Underestimation of the sample size needed to detect genetic association may occur as a result of deviations from the ‘fundamental theorem of the HapMap’. A biologically plausible mechanism that might cause this deviation is ‘cryptic’ tagging of multiple susceptibility loci by the same neutral marker. For complex disorders, the existence of multiple susceptibility loci on the same chromosome is probably the rule rather than the exception. Our results show that conditional on the known haplotype structure of the genome the probability that a tagging SNP that is in linkage disequilibrium (LD) with a susceptibility gene is also in LD with another susceptibility gene is not negligible. Consequently, we were able to estimate the extent and the prevalence of the bias in the necessary sample size to find association induced by ‘cryptic’ tagging. In general, the underestimation of the necessary sample size is modest: 5% of all association studies will underestimate the sample size by 5–30%. On the basis of our results, a safe bet is to use a sample that is 10% larger than otherwise deemed necessary.
References
Gabriel SB, Schaffner SF, Nguyen H et al: The structure of haplotype blocks in the human genome. Science 2002; 296: 2225–2229.
Terwilliger JD, Hiekkalinna T: An utter refutation of the ‘fundamental theorem of the HapMap’. Eur J Hum Genet 2006; 14: 426–437.
Thomas DC, Stram DO: An utter refutation of the ‘fundamental theorem of the HapMap’ by Terwilliger and Hiekkalinna. Eur J Hum Genet 2006; 14: 1238–1239.
Appendix
Derivation of formula (4)
For two events A and B, let pA, pB and pAB be the probabilities of A, B and A∩B, and let rAB be the correlation between the indicators 1A and 1B (each equal to 1 if the event occurs and 0 otherwise), that is,

$$ r_{AB} = \frac{p_{AB} - p_A\, p_B}{\sqrt{p_A (1 - p_A)\, p_B (1 - p_B)}}. $$
We assume that the allele at the tagging locus T of a randomly selected individual is conditionally independent of the case–control status Ca given the haplotypes (or genotypes) at the disease loci. Let 𝒟 denote the set of all possible haplotypic alleles at the joint disease loci, and for D∈𝒟 let D also denote the event that a random individual possesses allele D (possibly a multi-locus haplotype). Then, it follows from the general result, listed as Lemma 1 at the end of this section, that

$$ r_{T,Ca} = \sum_{D \in \mathcal{D}} (1 - p_D)\, r_{T,D}\, r_{D,Ca}. \tag{A.1} $$

This exhibits the root-noncentrality parameter as a linear combination of the root-noncentrality parameters rD,Ca of the tests of the 2 × 2 tables that would score case–control status versus causal haplotype, for each haplotypic allele D∈𝒟 in turn. In its generality, formula (A.1) is only mildly interesting. However, under special assumptions, it reduces to easily interpretable formulas.
As a first application, if there are only two possible alleles at the disease loci, say D and d, then sum (A.1) has two terms, and the products of correlations in the two terms are equal (rT,DrD,Ca=rT,drd,Ca), because both correlations change by a minus sign upon replacing D by d. Then, the formula reduces to the multiplicity (1).
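The two-allele claim can be checked numerically. The following is a minimal sketch, not part of the original derivation: the allele frequency and the conditional probabilities are hypothetical values chosen only for illustration, and the helper `phi` simply implements the indicator correlation defined at the start of this appendix. It confirms that the two products of correlations coincide and that, under conditional independence of T and Ca given the disease allele, the correlation rT,Ca equals the product rT,D rD,Ca.

```python
import math


def phi(p_a, p_b, p_ab):
    """Correlation between the indicators of events A and B."""
    return (p_ab - p_a * p_b) / math.sqrt(p_a * (1 - p_a) * p_b * (1 - p_b))


# Hypothetical inputs (assumptions for illustration, not values from the paper).
p_allele = {"D": 0.2, "d": 0.8}       # frequencies of the two disease-locus alleles
p_T_given = {"D": 0.9, "d": 0.3}      # P(tag allele T | disease-locus allele)
p_Ca_given = {"D": 0.6, "d": 0.4}     # P(case | disease-locus allele)

# Marginal probabilities of T and Ca.
p_T = sum(p_allele[a] * p_T_given[a] for a in p_allele)
p_Ca = sum(p_allele[a] * p_Ca_given[a] for a in p_allele)

# Joint probability of T and Ca, using conditional independence given the allele.
p_T_Ca = sum(p_allele[a] * p_T_given[a] * p_Ca_given[a] for a in p_allele)

r_T_Ca = phi(p_T, p_Ca, p_T_Ca)

# Correlations of T and Ca with each of the two alleles in turn.
prod = {}
for a in p_allele:
    r_T_a = phi(p_T, p_allele[a], p_allele[a] * p_T_given[a])
    r_a_Ca = phi(p_allele[a], p_Ca, p_allele[a] * p_Ca_given[a])
    prod[a] = r_T_a * r_a_Ca

# Both sign flips cancel, so the two products agree ...
assert math.isclose(prod["D"], prod["d"])
# ... and each equals the correlation between tag and case-control status.
assert math.isclose(r_T_Ca, prod["D"])
```

Changing the hypothetical inputs to any other probabilities strictly between 0 and 1 should leave both assertions satisfied, since the identity is algebraic rather than specific to these numbers.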
Secondly, we derive (4) from (A.1) under assumptions (2) and (3). It follows from the latter pair of assumptions that

Substitution in formula (A.1) yields

Here E(#D) can be deleted, because it is a constant and hence is uncorrelated with any variable, and
can be rewritten as
for Di the event that an individual has the disease allele at the disease locus i (with the other loci unspecified). Thus, we obtain

Next we eliminate β from this formula by expressing this in the correlations . We have

for E(#D∣Di) the expected total number of disease alleles in an arbitrary individual carrying the disease allele at locus i. Combining this with formula (3) for the prevalence, we see

Solving for β and substituting the solution in (A.2), we find that

The total number of disease alleles in a random individual can be written (The curious first equality is a consequence of our abuse of notation: as a random variable the total number of disease alleles #D in an arbitrary individual is denoted by #D if the event D occurs.) This gives


Thus,

We conclude the derivation of (4) by substituting this in (A.3).
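Lemma 1. If events A and B are conditionally independent given a partition 𝒟 of the outcome space, then

$$ r_{AB} = \sum_{D \in \mathcal{D}} (1 - p_D)\, r_{AD}\, r_{DB}. $$

Proof. Because A and B are conditionally independent, they are conditionally uncorrelated, that is, cov(1A,1B∣𝒟)=0 almost surely. Therefore, the usual conditioning rule for covariances gives

$$ \operatorname{cov}(1_A, 1_B) = \operatorname{E}\,\operatorname{cov}(1_A, 1_B \mid \mathcal{D}) + \operatorname{cov}\bigl(\operatorname{E}(1_A \mid \mathcal{D}),\, \operatorname{E}(1_B \mid \mathcal{D})\bigr) = \sum_{D \in \mathcal{D}} p_D \bigl(P(A \mid D) - P(A)\bigr)\bigl(P(B \mid D) - P(B)\bigr). $$

Here on the event D the variable E(1A∣𝒟)−E1A is equal to

$$ P(A \mid D) - P(A) = \frac{p_{AD} - p_A\, p_D}{p_D} = r_{AD}\, \frac{\sqrt{p_A (1 - p_A)\, p_D (1 - p_D)}}{p_D}. $$

Substituting this and the corresponding formula for P(B∣D)−P(B) in the preceding display gives

$$ \operatorname{cov}(1_A, 1_B) = \sqrt{p_A (1 - p_A)\, p_B (1 - p_B)} \sum_{D \in \mathcal{D}} (1 - p_D)\, r_{AD}\, r_{DB}. $$

Dividing both sides by the square root on the right gives the assertion.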
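As a sanity check on the lemma, the sketch below compares the directly computed correlation rAB with the weighted sum over the cells of the partition. It assumes the assertion takes the form rAB = Σ (1−pD) rAD rDB, which is what the steps of the proof yield. The four-cell partition (for instance, the four haplotypes at two biallelic disease loci), the random seed and the function name `check_lemma` are illustrative choices, not part of the original article.

```python
import math
import random


def phi(p_a, p_b, p_ab):
    """Correlation between the indicators of events A and B."""
    return (p_ab - p_a * p_b) / math.sqrt(p_a * (1 - p_a) * p_b * (1 - p_b))


def check_lemma(p_cell, p_a_given, p_b_given):
    """Return (direct r_AB, sum over cells of (1 - p_D) * r_AD * r_DB)."""
    p_a = sum(p * a for p, a in zip(p_cell, p_a_given))
    p_b = sum(p * b for p, b in zip(p_cell, p_b_given))
    # Conditional independence: P(A and B | cell) = P(A | cell) * P(B | cell).
    p_ab = sum(p * a * b for p, a, b in zip(p_cell, p_a_given, p_b_given))
    direct = phi(p_a, p_b, p_ab)
    weighted = sum(
        (1 - p) * phi(p_a, p, p * a) * phi(p, p_b, p * b)
        for p, a, b in zip(p_cell, p_a_given, p_b_given)
    )
    return direct, weighted


# Hypothetical partition with four cells and arbitrary conditional probabilities.
rng = random.Random(0)
weights = [rng.uniform(0.1, 1.0) for _ in range(4)]
p_cell = [w / sum(weights) for w in weights]
p_a_given = [rng.uniform(0.05, 0.95) for _ in p_cell]
p_b_given = [rng.uniform(0.05, 0.95) for _ in p_cell]

direct, weighted = check_lemma(p_cell, p_a_given, p_b_given)
assert math.isclose(direct, weighted, rel_tol=1e-9, abs_tol=1e-12)
```

With only two cells the weights (1−pD) sum to one and the two products of correlations coincide, so the same function reproduces the two-allele special case discussed earlier in the appendix.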
Cite this article
Bochdanovits, Z., Heutink, P. & van der Vaart, A. Empirical assessment of the validity of the ‘fundamental theorem of the HapMap’ in the light of ‘cryptic’ tagging of multiple susceptibility loci. Eur J Hum Genet 16, 525–529 (2008). https://doi.org/10.1038/sj.ejhg.5201984
DOI: https://doi.org/10.1038/sj.ejhg.5201984