A further exploration into the world of hypothesis testing, following up from my previous post on the topic.

(For simplicity, I’ll use the same example as in the previous post: we have a control population with a mean of 38.0 and a test population with a mean of 38.4, both with a standard deviation of 0.5.)

This first graph shows our two population distributions: the top plot in actual values, and the bottom plot in the corresponding standardized values. That is, all control group values have been standardized to the standard normal distribution, N(0, 1). The test group values have been standardized against the same control parameters, so their distribution is offset by (mean(test) - mean(control)) / std(control) = (38.4 - 38.0) / 0.5 = 0.8.
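As a minimal sketch of the standardization described above: the helper name `standardize` is my own for illustration, and I assume both groups are standardized against the control mean and the shared standard deviation.

```python
import numpy as np

# Values from the example in the text.
mu_control, mu_test, sigma = 38.0, 38.4, 0.5

def standardize(x, mu_ref=mu_control, sd=sigma):
    """Express x as a number of standard deviations from the control mean."""
    return (np.asarray(x, dtype=float) - mu_ref) / sd

# The control mean maps to 0; the test mean lands at the offset, about 0.8.
z_control_mean = float(standardize(mu_control))
z_test_mean = float(standardize(mu_test))

# Works on whole arrays of raw values too.
z_values = standardize([37.5, 38.0, 38.5])
```

Standardizing both groups against the same reference is what keeps the two curves comparable on a single z axis.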

It’s clear from the plots that the two populations differ, but there’s also considerable overlap between the two groups.

Now, assume we take a sample of 20 from one of these groups, without knowing a priori which group the sample comes from. We want to use sampling and inferential statistics to determine which of the two groups it was drawn from. What we’d like to figure out is how likely we are to make that determination correctly.

This is where the concept of statistical power comes in. Basically, statistical power is the probability that a test procedure detects an “effect” when there is in fact an effect to be detected. In our case, the “effect” is the difference in parameter value (the mean) between the two groups.

Thus, our objective is to decide, given our sample size of 20, how “confident” we can be that our test result correctly reflects the truth. That is, if our test shows that our sample comes from group A or group B, how much should we believe that finding?

So, first we define our alpha level, which basically is the rate of “false positives” (Type I errors) we are prepared to accept. The alpha level, together with the standard error of the sample mean, gives us our Margin of Error, and thus our Acceptance and Rejection Regions, that is, the ranges of values we will use to decide whether the value obtained from our sample indicates that the sample comes from group A or group B.

In this example, we set alpha to 4%, which gives us a Confidence Level of 96%. That means that should we repeat this experiment over and over with samples drawn from the control population, in 96% of the trials the sample mean will fall within our acceptance region.

In the plots above (actual data values on the left, standardized values on the right), the vertical black bars mark the acceptance region. If the mean of our sample falls between these bars, our decision is that the sample was taken from the control population (group A). If it falls to the right of the rightmost bar, our decision is that the sample comes from group B, the test population.
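The acceptance region and the decision rule can be sketched as follows. This assumes a two-tailed z-test with a known population standard deviation; the function name `decide` and its return labels are hypothetical, and the text does not say what to conclude when the sample mean falls below the lower limit, so that case is only reported as such.

```python
import numpy as np
from scipy import stats

# Values from the example in the text.
mu_control, sigma, n, alpha = 38.0, 0.5, 20, 0.04

se = sigma / np.sqrt(n)                  # standard error of the sample mean
z_crit = stats.norm.ppf(1 - alpha / 2)   # two-tailed critical z value
lower = mu_control - z_crit * se         # left black bar
upper = mu_control + z_crit * se         # right black bar

def decide(sample):
    """Classify a sample by where its mean falls relative to the bars."""
    m = np.mean(sample)
    if lower <= m <= upper:
        return "control"
    if m > upper:
        return "test"
    return "below acceptance region"     # case not covered in the text
```

With these numbers the critical z value is roughly 2.05, which puts the bars at roughly 37.77 and 38.23 in actual values.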

The above graph illustrates the alpha we’ve chosen (4%) by the greenish shaded areas (this is from a two-tailed test, so 2% sits in each tail). Sample means from the control group that fall within these areas are false positives, making us erroneously draw the conclusion that our sample does not come from the control group.

The yellow shaded area represents our beta, or false negatives (Type II errors): the part of the test group’s distribution that lies inside the acceptance region, causing us to erroneously draw the conclusion that our sample does not come from the test population.

The alpha is defined by us when setting up the experiment. The beta then follows from our acceptance region: it is the cumulative probability, under the test group’s distribution of sample means, at our upper acceptance region limit (the part below the lower limit is negligible here).
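A minimal sketch of how beta follows from the acceptance limits, again assuming a two-tailed z-test with known standard deviation. The variable names are mine, not taken from the original implementation.

```python
from math import sqrt
from scipy import stats

# Values from the example in the text.
mu_control, mu_test, sigma, n, alpha = 38.0, 38.4, 0.5, 20, 0.04

se = sigma / sqrt(n)
z_crit = stats.norm.ppf(1 - alpha / 2)
lower = mu_control - z_crit * se
upper = mu_control + z_crit * se

# Distribution of the sample mean when the sample comes from the test group.
test_means = stats.norm(loc=mu_test, scale=se)

# Beta: probability that such a sample mean still lands in the acceptance region.
beta = test_means.cdf(upper) - test_means.cdf(lower)

# The lower-limit term is tiny, so the CDF at the upper limit alone is a
# good approximation.
beta_upper_only = test_means.cdf(upper)
```

Subtracting the CDF at the lower limit keeps the calculation exact, although in this example it changes the result only far beyond the decimals we care about.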

Finally, power is defined as 1 - beta, and gives us the probability of our test detecting an existing effect, that is, a difference in the parameter between the two groups.

The above graph illustrates the results from our example: the magenta curve shows the expected distribution of the mean of a sample of 20 drawn from the control population, and the orange curve shows the expected distribution of the mean of a sample of 20 drawn from the test population. The vertical black bars show the acceptance region, that is, if our sample mean falls within these values, our decision is that the sample comes from the control group.

As can be seen, there is a small probability (about 6%, in fact) that our sample, despite being drawn from the test group, falls within the acceptance region. This is our beta, the rate of Type II errors or false negatives: the values under the orange curve inside the acceptance region cause us to erroneously decide that the sample comes from the control population. This gives our test a power of about 94%, which is the probability that our test detects an existing effect.
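As a sanity check on these figures, here is a sketch that computes the power analytically in standardized units and compares it with a Monte Carlo estimate. The seed and the number of simulated trials are arbitrary choices of mine, and a z-test with known standard deviation is assumed.

```python
import numpy as np
from scipy import stats

# Values from the example in the text.
mu_control, mu_test, sigma, n, alpha = 38.0, 38.4, 0.5, 20, 0.04

se = sigma / np.sqrt(n)
z_crit = stats.norm.ppf(1 - alpha / 2)

# Analytic power: the test group's sample mean is shifted by `shift`
# standard errors relative to the control group.
shift = (mu_test - mu_control) / se
beta = stats.norm.cdf(z_crit - shift) - stats.norm.cdf(-z_crit - shift)
power = 1 - beta

# Monte Carlo: draw many samples of size n from the test population and
# count how often the sample mean falls outside the acceptance region.
rng = np.random.default_rng(0)           # arbitrary seed
means = rng.normal(mu_test, sigma, size=(100_000, n)).mean(axis=1)
z = (means - mu_control) / se
simulated_power = np.mean(np.abs(z) > z_crit)
```

The two estimates should agree closely, with both landing near the 94% quoted above.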

For details, a Python/NumPy/SciPy implementation of the example follows below.