A number of years ago, John Ioannidis published a paper claiming that most published research findings are in fact false. That is, the findings of many, perhaps most, research papers cannot be reproduced by other, independent teams.

According to Ioannidis, one reason for this poor state of affairs in the world of science and research is the traditional reliance on p-values.

Basically, a p-value indicates how likely it is that results at least as extreme as yours would show up through randomness alone, assuming there is no real effect.

A while ago, I wrote about traditional (“frequentist”) hypothesis testing, and although I may not have explicitly mentioned p-values in that article, they are in fact there, hidden in the confidence intervals and alpha values.

So, let’s take another look at the same example from the previous post:

Briefly, the scenario for the previous post was that we have a test, and we would like to determine whether our test subjects are registered as “positive” by the test, that is, whether they show a test value that differs from the expected value for the parameter, obtained from the control group. So, we pull a number of samples, 20 in our case, record the mean and standard deviation of the sample, plug those numbers into some statistical formulas, look up a few numbers in a Z-score table, and voilà, we can state with impressive-looking certainty that “our test shows with 96% confidence that this test subject is ‘positive’; the probability that these results were caused purely by randomness is less than four in 100”.
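The “plug the numbers into some formulas” step can be sketched roughly as below. This is only a minimal illustration: the function name `one_sample_z_test` and all the input numbers are my own assumptions, not the values from the previous post. Note also that with only 20 samples and an estimated standard deviation, a t-test would normally be the more careful choice; the z-version is shown because it matches the Z-score table mentioned above.

```python
from math import sqrt
from statistics import NormalDist

def one_sample_z_test(sample_mean, sample_sd, n, expected_mean):
    """Return (z, two-sided p-value) for a sample mean against an expected value."""
    standard_error = sample_sd / sqrt(n)           # spread of the sample mean
    z = (sample_mean - expected_mean) / standard_error
    # Probability of a |z| at least this large if there were no real difference
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_two_sided

# Hypothetical numbers, chosen only for illustration
z, p = one_sample_z_test(sample_mean=38.3, sample_sd=0.5, n=20, expected_mean=38.0)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value here only says that such a sample mean would be rare if the test subject really belonged to the control population; it says nothing about how far any single measurement may stray.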

Sounds really good and impressive, right…? We can illustrate this type of finding with a graph:

Here we have pulled 20 samples from both the control group (green) and the test group (red), 100000 times, and plotted the frequency of the mean values of those 20 samples. As can be seen, the two sets of sample means are quite well separated, with little overlap: the Alpha (Type I error rate) is 4%, the Beta (Type II error rate) is 4%, and the Power is 96%. So, by looking at repeated sample means, we are quite likely to draw the conclusion that there is a measurable difference between the two groups of test subjects.
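A simulation of this kind can be sketched in a few lines. The population parameters below (`CONTROL_MU`, `TEST_MU`, `SIGMA`) are assumptions I picked so that the error rates land near 4%; they are not the actual values behind the graph. The one-sided cutoff is also an assumption about how “positive” is decided.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

# All parameters are assumptions for illustration, not the values from the post.
CONTROL_MU = 38.0   # expected value in the control group
TEST_MU = 38.39     # expected value in the test group
SIGMA = 0.5         # common standard deviation of individual values
N = 20              # sample size
ALPHA = 0.04        # target Type I error rate

def simulate_sample_means(mu, sigma, n, draws, rng):
    """Draw `draws` samples of size n and return the mean of each sample."""
    return [fmean(rng.gauss(mu, sigma) for _ in range(n)) for _ in range(draws)]

rng = random.Random(0)
draws = 20_000  # the post uses 100000; fewer draws keep the sketch fast
control_means = simulate_sample_means(CONTROL_MU, SIGMA, N, draws, rng)
test_means = simulate_sample_means(TEST_MU, SIGMA, N, draws, rng)

# One-sided cutoff: a sample counts as "positive" if its mean exceeds this value.
cutoff = CONTROL_MU + NormalDist().inv_cdf(1 - ALPHA) * SIGMA / sqrt(N)

alpha_hat = sum(m > cutoff for m in control_means) / draws  # false positives
power_hat = sum(m > cutoff for m in test_means) / draws     # true positives
beta_hat = 1 - power_hat                                    # false negatives
print(f"alpha = {alpha_hat:.3f}, beta = {beta_hat:.3f}, power = {power_hat:.3f}")
```

Plotting histograms of `control_means` and `test_means` would give two narrow, well-separated humps similar in spirit to the graph described above.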

However, if we look at the individual values within each sample of 20, the situation becomes different. The graph below presents the cumulative frequency over the 100000 draws, for each individual sample value, for both groups:

Now, when looking at all the samples, the picture is far from clear. Sure, we can see that the test group sits more towards the higher parameter values, but the overlap is quite large, meaning that each individual sample we draw from the populations has a non-negligible likelihood of being very different from what we get when we base our calculations on the means of many, many sets of samples. That is, there is a real chance that the unique sample you just drew is an outlier, nowhere close to the nice, clean-cut results that the statistical algebra you performed above indicates.

We can combine these two plots into a single one to see the big picture:

Here we have the information from the two graphs above combined into one: the histogram shows the sample means, obtained by repeated draws, while the plot lines show the cumulative individual samples. The difference in spread (variance or standard deviation) is huge.
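The size of that difference follows from the standard error: the standard deviation of a mean of n values is the standard deviation of the individual values divided by the square root of n. A quick check, again with assumed population parameters rather than the ones behind the graphs:

```python
import random
from math import sqrt
from statistics import fmean, stdev

rng = random.Random(1)
mu, sigma, n, draws = 38.0, 0.5, 20, 5_000  # hypothetical values

# Each inner list is one sample of n individual values
samples = [[rng.gauss(mu, sigma) for _ in range(n)] for _ in range(draws)]
individual_values = [x for sample in samples for x in sample]
sample_means = [fmean(sample) for sample in samples]

sd_individual = stdev(individual_values)  # spread of single values
sd_means = stdev(sample_means)            # spread of the means of n values
ratio = sd_individual / sd_means

print(f"sd of individual values: {sd_individual:.3f}")
print(f"sd of sample means:      {sd_means:.3f}")
print(f"ratio: {ratio:.2f} (theory: sqrt({n}) = {sqrt(n):.2f})")
```

With 20 values per sample, the means are roughly four and a half times more tightly packed than the individual values, which is why the histogram looks so much cleaner than the plot lines.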

So, as far as I can understand it, the traditional statistical approach for significance/hypothesis testing is based on a theory that really only deals with “the long run”, i.e. the notion is that “should you repeat the experiment 100 times, you can expect to get the results you calculated 96 times”. And btw, you’d better have a large enough sample size too.
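That long-run reading can be checked by simulation. The sketch below repeats a hypothetical experiment many times, builds a 96% interval around the true mean each time, and counts how often the sample mean lands inside it. The function name `coverage` and all parameter values are assumptions for illustration, and the standard deviation is treated as known to keep things simple.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

def coverage(mu=38.0, sigma=0.5, n=20, level=0.96, repeats=10_000, seed=3):
    """Fraction of repeated experiments whose sample mean falls within the
    `level` interval around the true mean (sigma assumed known)."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # two-sided critical value
    half_width = z * sigma / sqrt(n)
    hits = 0
    for _ in range(repeats):
        sample_mean = fmean(rng.gauss(mu, sigma) for _ in range(n))
        if abs(sample_mean - mu) <= half_width:
            hits += 1
    return hits / repeats

print(f"long-run coverage: {coverage():.3f}")
```

The fraction settles close to 0.96 over many repeats, but that is a statement about the procedure as a whole, not about the one sample you happen to hold in your hand.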

Now, that’s not necessarily a bad thing – it’s probably as good as it can get in fact – after all, we are dealing with randomness and variability, and prediction is difficult, particularly about the future. But I believe that very many people, seeing a statement such as “with 95% confidence”, give far too much weight to that statement in their decision making. After all, there is always a chance that your result was produced by a freak random event, a test result that is far from the beloved mean of means.

*One way to think about it: say you have a gun with a 100-round barrel, loaded with 4 live rounds and 96 blanks. Would you aim at your head and pull the trigger, because the confidence level is 96%...?*

Perhaps a more informative, and more honest, way to present statistical findings is to explicitly show the uncertainty in the results. The graph below, obtained by Bayesian inference on the same problem, is a two-dimensional “heatmap”, showing the uncertainty in both the expected value and its standard deviation.

The heatmap clearly shows how much uncertainty there is in our results – the stronger the red color, the more likely it is that our true parameter value resides there. But as can be seen, the uncertainty in expected value from our 20 samples ranges from 38.0 to 38.7, and similarly the standard deviation ranges from 0.3 to 0.8.
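One simple way to produce such a two-dimensional posterior is a grid approximation, sketched below. To be clear, this is not necessarily how the heatmap above was computed: the flat prior, the grid ranges, the helper names and the simulated data are all assumptions made for illustration.

```python
import random
from math import exp, log

def grid_posterior(data, mu_grid, sigma_grid):
    """Posterior over (mu, sigma) on a grid: normal likelihood, flat prior."""
    n = len(data)
    log_post = {}
    for mu in mu_grid:
        sum_sq = sum((x - mu) ** 2 for x in data)
        for sigma in sigma_grid:
            # Log-likelihood of the data, up to an additive constant
            log_post[(mu, sigma)] = -n * log(sigma) - sum_sq / (2 * sigma ** 2)
    peak = max(log_post.values())  # subtract the peak to avoid underflow
    weights = {cell: exp(v - peak) for cell, v in log_post.items()}
    total = sum(weights.values())
    return {cell: w / total for cell, w in weights.items()}

def marginal(posterior, axis):
    """Sum the grid over the other parameter (axis 0 = mu, axis 1 = sigma)."""
    out = {}
    for cell, prob in posterior.items():
        out[cell[axis]] = out.get(cell[axis], 0.0) + prob
    return out

def credible_interval(marg, mass=0.95):
    """Equal-tailed interval read off a marginal distribution."""
    tail = (1 - mass) / 2
    cumulative, lower, upper = 0.0, None, None
    for value in sorted(marg):
        cumulative += marg[value]
        if lower is None and cumulative >= tail:
            lower = value
        if cumulative >= 1 - tail:
            upper = value
            break
    return lower, upper

# Hypothetical data: 20 values simulated from an assumed population
rng = random.Random(2)
data = [rng.gauss(38.35, 0.5) for _ in range(20)]

mu_grid = [37.5 + 0.01 * i for i in range(171)]     # 37.5 .. 39.2
sigma_grid = [0.1 + 0.01 * j for j in range(141)]   # 0.1 .. 1.5
posterior = grid_posterior(data, mu_grid, sigma_grid)

mu_lo, mu_hi = credible_interval(marginal(posterior, 0))
sd_lo, sd_hi = credible_interval(marginal(posterior, 1))
print(f"95% interval for the expected value:     {mu_lo:.2f} to {mu_hi:.2f}")
print(f"95% interval for the standard deviation: {sd_lo:.2f} to {sd_hi:.2f}")
```

The `posterior` dictionary holds one probability per grid cell, so colouring each cell by its value gives a heatmap of the same kind; the two intervals are just one-dimensional summaries of it.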

To me, this presentation is more honest, since it actually shows that our test has a significant uncertainty, and that perhaps we should increase our sample size before we take any action on the results.