Extended Data Fig. 1: The relationship between BF, p, and effect sizes values. | Nature Neuroscience

From: Using Bayes factor hypothesis testing in neuroscience to establish evidence of absence

a, This log-log plot shows the BF+0 values corresponding to familiar critical p values for a one-tailed one-sample t-test at different sample sizes (n). The curves show the BF+0 values obtained in a Bayesian t-test based on the critical t-value that provides P=0.05 (yellow), P=0.01 (green), P=0.005 (black) and P=0.001 (black). The yellow dashed horizontal line indicates the BF+0=3 bound for moderate evidence, considered by Jeffreys (ref. 9) to be similar to P=0.05; the green one indicates the BF+0=10 bound for strong evidence, considered similar to P=0.01. The two black dashed lines mark BF+0=1, i.e. the line of no evidence, and BF+0=1/3, the bound for moderate evidence of absence. The background gradient reminds the reader that the BF reference values of 3 and 10 should not be considered hard bounds. Instead, the BF should be interpreted as a continuous value, with values diverging more from 1 supporting stronger conclusions. This panel makes two points. First, there is no simple equivalence between p and BF that holds over all sample sizes. This is because, in a frequentist t-test, the observed effect size (d) sufficient to generate a specific p value decreases with \(\sqrt{n}\) more rapidly than the effect size needed to reach a specific BF. As a result, at large n, very small effect sizes yield a 'significant' t-test: at n=1000, the critical t-value for a one-tailed P=0.05 is 1.65, corresponding to \(d = 1.65/\sqrt{n} \approx 0.05\). For the BF, data showing such a minuscule effect are about 4 times more likely under H0 than under H+ (BF+0=0.26). Hence, for small sample sizes p and BF support similar conclusions (e.g., P=0.05 at n=4 corresponds to BF+0>3, supporting the same conclusion of evidence for an effect), but for large sample sizes the frequentist and Bayesian conclusions can diverge in the presence of very small effect sizes (e.g., P=0.05 at n=1000 corresponds to BF+0<1/3; see Jeffreys, H. Some Tests of Significance, Treated by the Theory of Probability. Proc. Cambridge Philos. Soc. 31, 203–222 (1935)). Considering confidence or credible intervals of the effect size in addition to p or BF values helps interpret such cases. Second, the dashed lines lie above the curve of the same color for all n>4, which shows that BF+0=3 and BF+0=10 indeed protect against Type I errors in a frequentist sense at least at the P=0.05 and P=0.01 levels, respectively. In other words, if BF+0>3 then p<0.05, and if BF+0>10 then p<0.01, but how much lower than 0.05 or 0.01 the exact P value is depends on n.
b, BF+0 (left) and p (right) values as a function of measured effect size and sample size. These panels illustrate the measured effect sizes necessary to provide evidence for an effect at different sample sizes in a one-sample one-tailed t-test, using the BF vs. traditional p values. Each curve connects the results at different sample sizes for the specified value of d. The logarithmic BF and p scales are aligned so as to place BF=3 next to P=0.05, and BF=10 next to P=0.01.
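
The mapping in panel a, from a t-value and sample size to BF+0, can be sketched numerically. The following is a minimal illustration only, not the authors' code. It assumes a half-Cauchy prior on the standardized effect size with scale r = √2/2, a common default that the caption itself does not specify. The function names (`effect_size`, `log_inner_integral`, `bf_plus0`) and the step counts are hypothetical choices made for this sketch. The Bayes factor is the prior-weighted average of the likelihood ratio between a noncentral and a central t distribution, evaluated with plain midpoint integration from the Python standard library.

```python
import math


def effect_size(t, n):
    """Observed standardized effect size for a one-sample t-test: d = t / sqrt(n)."""
    return t / math.sqrt(n)


def log_inner_integral(c, nu, steps=300):
    """log of I(c), the integral over y > 0 of y**nu * exp(-(y - c)**2 / 2)."""
    s = math.sqrt(c * c + 4.0 * nu)
    # Peak of the integrand; the second form avoids cancellation when c < 0.
    y_star = (c + s) / 2.0 if c >= 0 else 2.0 * nu / (s - c)
    g_star = nu * math.log(y_star) - 0.5 * (y_star - c) ** 2
    # Integration window scaled by the curvature of the log-integrand at the peak.
    width = 12.0 / math.sqrt(nu / y_star ** 2 + 1.0)
    lo = max(0.0, y_star - width)
    h = (y_star + width - lo) / steps
    total = 0.0
    for j in range(steps):
        y = lo + (j + 0.5) * h  # midpoint rule, so y is never exactly 0
        total += math.exp(nu * math.log(y) - 0.5 * (y - c) ** 2 - g_star)
    return g_star + math.log(total * h)


def bf_plus0(t, n, r=math.sqrt(2) / 2, steps=1000):
    """One-sided Bayes factor BF+0 for a one-sample t-test.

    ASSUMPTION: half-Cauchy prior with scale r on the effect size under H+.
    The substitution delta = r * tan(theta) turns the prior into a uniform
    weight on theta in (0, pi/2), so BF+0 is the mean likelihood ratio.
    """
    nu = n - 1  # degrees of freedom
    a = t / math.sqrt(t * t + nu)
    log_i0 = log_inner_integral(0.0, nu)
    total = 0.0
    for k in range(steps):
        theta = (k + 0.5) * (math.pi / 2) / steps
        mu = r * math.tan(theta) * math.sqrt(n)  # noncentrality delta * sqrt(n)
        # log of the ratio: noncentral-t density at t over central-t density at t
        log_lr = (-0.5 * nu * mu * mu / (t * t + nu)
                  + log_inner_integral(a * mu, nu) - log_i0)
        total += math.exp(log_lr)
    return total / steps


# Values quoted in the caption: n = 1000, critical t = 1.65 for one-tailed P = 0.05.
print(round(effect_size(1.65, 1000), 3), round(bf_plus0(1.65, 1000), 2))
```

Under these assumptions, the n=1000 case should come out close to the d of about 0.05 and the BF+0 of about 0.26 quoted in the caption. A different prior scale would shift the BF+0 value. The fixed step counts are adequate for sample sizes in the range discussed here, but would need to be increased for much larger n, where the likelihood becomes very narrow.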
