Many researchers naively follow the concept that a p-value below a certain threshold (traditionally .05) is to be considered significant and can therefore be taken as evidence that H1 is true, while a p-value above that level is considered not significant and therefore evidence that H0 is true. This Neyman-Pearson approach to statistical testing is based on properties of the whole procedure, not of its results in an individual experiment [1], i.e. it minimizes the error rate if you were to repeat the experiment over and over again. The evidential interpretation of a significance test therefore depends on the statistical power of the study design (see page 56f in [1]). Put shortly: if the prior probability that your hypothesis is true is 0, the posterior probability is zero as well, regardless of the p-value you got. This has given rise to a lot of debate about replicability and underpowered studies (e.g. [2]).
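
To make this concrete, here is a small back-of-the-envelope sketch (my own illustration, with hypothetical numbers for power and prior, in the spirit of [2]): the probability that a significant result reflects a true effect depends on the power of the test and on the prior probability of H1, and it is zero whenever that prior is zero.

```python
# Positive predictive value of a significant result: P(H1 | p < alpha).
# Hypothetical numbers for alpha, power and prior; only meant to illustrate the point above.
def prob_true_given_significant(prior, power, alpha=0.05):
    true_positives = power * prior          # P(significant and H1 true)
    false_positives = alpha * (1 - prior)   # P(significant and H0 true)
    return true_positives / (true_positives + false_positives)

for prior in (0.0, 0.1, 0.5):
    print(prior, round(prob_true_given_significant(prior, power=0.8), 2))
# prior 0.0 -> 0.0, prior 0.1 -> 0.64, prior 0.5 -> 0.94
```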

For novel research (and aren’t we all doing that?) we face a complicated, albeit central issue: we usually don’t know the prior probability that our study will show a true effect. We might estimate the distribution of effect sizes in other studies in our field [3], consider where the effect we are researching lies in a range of other hypotheses [4], or just follow Jeffreys’ suggestion and the principle of sufficient reason: every outcome is equally likely [5]. More often than not, researchers take the bold option: they believe (or publish as if they believe) that the effect they found is true [6], regardless of its prior or posterior probability of being true. For many critics, this is linked to null-hypothesis significance testing (NHST) and post-hoc maximum-likelihood hypothesis construction (MLHC). Indeed, many papers implicitly follow such a Fisherian approach: a lower p-value is supposed to imply stronger evidence, and the reported effect is the maximum likelihood estimate based on the measured data. This appears to relieve one from the burden of specifying power and prior and all the quagmire involved in it. But it opens up the question whether such an approach is actually a good way to deal with the actual evidence.

Prima facie, such an evidential interpretation of the p-value follows the law of improbability [1], and there are some issues with that. First, p-values are based on the cumulative distribution function, i.e. we take into account the probability of getting this result and/or any more extreme one. But we did not actually measure these more extreme results, so our conclusion is not based only on the measured data. Second, just because a specific result is improbable under H0, this does not imply that it is evidence against H0. Throw a fair coin often enough and you will, with certainty, get 5 heads in a row at some point, i.e. an improbable (significant) result. The argument is therefore that the weight of evidence can only be expressed in a useful manner by contrasting two alternative hypotheses and asking which one is more likely to have produced the result. This is what gave rise to the likelihood ratio as a measure of statistical evidence [1,9]. Although the pure weight of evidence (i.e. P(X|H1) / P(X|H0)) does not require speculation about the prior probability of the two hypotheses, some might consider it a Bayesian concept. But all too often we do not know the prior distribution, or, just as bad, we might argue over it. Still, for the reasons described, if you do NHST and post-hoc MLHC, the p-value is not a good measure of evidence.
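
As a minimal sketch of this contrast (my own example, with a hypothetical biased-coin alternative, not taken from [1] or [9]): five heads in five throws is improbable under a fair coin, but the weight of evidence P(X|H1) / P(X|H0) only becomes meaningful once an alternative is named, here a coin with P(heads) = 0.8.

```python
from scipy.stats import binom

heads, throws = 5, 5
p_h0 = binom.pmf(heads, throws, 0.5)   # P(X | H0): fair coin, = 1/32
p_h1 = binom.pmf(heads, throws, 0.8)   # P(X | H1): hypothetical coin biased towards heads
print(p_h0, p_h1, p_h1 / p_h0)         # weight of evidence approx. 10.5 in favour of H1
```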

Even if the p-value is not a good measure of evidence, we might ask: is it at least a kind-of-okay measure of evidence? To find out (at least for the t-test), I followed this rationale: whenever we perform a t-test, we will receive a t-value, regardless of whether H0 or H1 is true. So what is interesting is not the distribution of the t-values themselves (that depends on whether the effect is true or not), but their relation to each other. First, let’s assume that H1 is true, i.e. the t-value comes from a noncentral t-distribution: what is the probability of receiving exactly that t-value? Second, let’s assume that H0 is true, i.e. the t-value comes from a central t-distribution: what is the probability of receiving exactly such a t-value? Additionally, what is the p-value of such a t-value? Then let’s calculate the likelihood ratio and the p-value over a range of t-values for different sample sizes and compare.
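
A minimal sketch of this rationale (my own reconstruction, not the original analysis code): for a one-sample t-test with N observations, take the density of the observed t-value under a noncentral t-distribution (H1) and under the central t-distribution (H0), plus the two-sided p-value. As a simplifying assumption, the noncentrality parameter of H1 is set to the observed t-value itself, roughly the post-hoc maximum-likelihood alternative described above.

```python
import numpy as np
from scipy.stats import t as t_dist, nct

def evidence_for_t(t_obs, n):
    """Likelihood ratio P(t|H1)/P(t|H0) and two-sided p-value for a one-sample t-test."""
    df = n - 1                               # degrees of freedom
    like_h1 = nct.pdf(t_obs, df, t_obs)      # density under H1: noncentral t, nc set to the observed t
    like_h0 = t_dist.pdf(t_obs, df)          # density under H0: central t
    p_value = 2 * t_dist.sf(abs(t_obs), df)  # two-sided p-value
    return like_h1 / like_h0, p_value

# Likelihood ratio and p-value over a range of t-values, for several sample sizes
t_values = np.linspace(0.1, 6, 60)
for n in (5, 20, 100):
    pairs = [evidence_for_t(tv, n) for tv in t_values]
    log_lr = np.log10([lr for lr, _ in pairs])
    log_p = np.log10([p for _, p in pairs])
    # plotting log_p against log_lr reproduces the kind of comparison described below
```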

[Figures: log10(p-value) against log10(likelihood ratio) for sample sizes 2 to 100; likelihood ratio and p-value as a function of effect size for three sample sizes]

First of all, the p-value seems to scale almost linearly with the likelihood ratio. Therefore, a p-value can be used as a proxy for the weight of evidence. Yet, the slope between p-value and LR depends on the sample size: it is rather shallow for low sample sizes and becomes steeper with higher sample sizes (see the upper left figure, with log10(LR) on the x-axis and log10(p) on the y-axis, shown for sample sizes of 2 to 100 in different colors). This means that you cannot compare the p-value from one study to another, as identical p-values do not map to the same weight of evidence. Additionally, consider that t-values scale with the effect size depending on the sample size: Cohen’s d = t/sqrt(N). The evidence you get for the same effect therefore depends on the sample size (see figure 2, exemplified for three sample sizes). Because of this, you will also get almost the same p-value for almost all effect sizes when the sample size is low (see figure 3, exemplified for three sample sizes). Studies with a low sample size therefore have a very poor resolution over different effect sizes, compared to studies with a higher sample size.
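
Reusing the evidence_for_t sketch from above (same assumptions, with a hypothetical effect size of d = 0.5), the dependence on sample size can be made explicit: since d = t/sqrt(N), the same effect size produces different t-values, and hence different p-values and likelihood ratios, for different N.

```python
d = 0.5                               # hypothetical fixed effect size (Cohen's d)
for n in (5, 20, 100):
    t_obs = d * np.sqrt(n)            # t scales with sqrt(N), since d = t / sqrt(N)
    lr, p = evidence_for_t(t_obs, n)
    print(f"N={n:3d}  t={t_obs:5.2f}  p={p:.3f}  log10(LR)={np.log10(lr):5.2f}")
```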

Put shortly: the exact p-value is a kind-of-okay measure of the statistical evidence in a simple t-test, but you have to take the sample size into account. On another note, a low sample size means poor resolution of the evidence.

References

1. Royall, R. M. (1997). Statistical evidence: a likelihood paradigm (1st ed.). London; New York: Chapman & Hall.
2. What’s the probability that a significant p-value indicates a true effect? http://www.nicebread.de/whats-the-probability-that-a-significant-p-value-indicates-a-true-effect/
3. Button, K. S., Ioannidis, J. P. A., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S. J., & Munafò, M. R. (2013). Power failure: why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14(5), 365–376. http://doi.org/10.1038/nrn3475
4. Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2013). Life after P-Hacking. SSRN Electronic Journal. http://doi.org/10.2139/ssrn.2205186
5. Jeffreys, H. (1931). Scientific inference. Cambridge: Cambridge University Press.
6. http://chronicle.com/blogs/percolator/daniel-kahneman-sees-train-wreck-looming-for-social-psychology/31338
7. http://www.theatlantic.com/science/archive/2015/11/gambling-on-the-reliability-on-science-literally/414834/
8. Clark, L., Lawrence, A. J., Astley-Jones, F., & Gray, N. (2009). Gambling near-misses enhance motivation to gamble and recruit win-related brain circuitry. Neuron, 61(3), 481–490. http://doi.org/10.1016/j.neuron.2008.12.031
9. Good, I. J. (1985). Weight of Evidence: A Brief Survey. Bayesian Statistics, 2, 249–270.
10. Royall, R. (2000). On the Probability of Observing Misleading Statistical Evidence. Journal of the American Statistical Association, 95(451), 760. http://doi.org/10.2307/2669456
Written by Robert Bauer

