## Hypothesis testing in Python – alpha,beta,Effect & Power

Suppose you’d like to conduct an experiment determining whether a new drug is effective to combat some disease. Let’s say that you have a control group and a test group of test subjects, where the test group receives your drug under test, while the control group receive a placebo.

In your experiment, you are going to measure a parameter, let’s call it ‘C2H5OH-saturation’, where the mean value within the population is 38, with a standard deviation of 0.5.
So you set up your null hypothesis & alternate hypothesis for your experient:

• H(0) := mean of C2H5OH-saturation in test group == 38 (that is, our drug has no detectable, effect on our patients)
• H(a) := mean of C2H5OH-saturation in test group != 38 (that is, our drug does have a detectable effect)

Then you collect data from your test group, with samplesize = 20, and find that the mean value is 38.4

The question now becomes: can you from the experiment draw any conclusions regarding whether your drug under development actually works…? That is, does the experiment give us reason to reject our null hypothesis – thus accepting the alternate hypothesis, meaning that the drug indeed works?  And if so, to what degree ?

This is where the concept of power comes in handy: basically, statistical power is probability that the test will find a statistically significant difference between the control group and the test group.

To do the power analysis, we start with selecting the alpha level, which is the probability of False Positives, i.e. test subjects where our test indicate a detectable effect, but where in fact there is none – the positive outcome is caused not by our drug, but by randomness. Typical alpha levels are 5% or 10 %, but for this experiment, we chose 4%. (In reality, we would start our experiment by first of all figuring out what samplesize we would need in order to get significant results, but in our case that’s already been set to 20).

Basically, what the alpha level does, is to set the bar for the magnitude of values we are willing to accept as supporting our null hypothesis – any values falling outside the alpha region(s) will be taken as reasons to reject our null hypothesis, despite the reality being that these values represent ‘outliers’, that is, rare but not impossible values in fact supporting our null hypothesis. The values falling outside of the alpha limit are our False Positives, also called Type I-errors.

The other type of error outcomes from our test we could run into is called beta. These are the False Negatives, i.e. cases where we erroneously fail to reject our null hypothesis, that is, our test indicate that the drug is not effective, while it in fact is. False Negatives are also called Type II-errors.

Perhaps an image can make this a bit clearer: The blue distribution represents our null hypothesis, that is, our drug has no effect, while the red distribution represents our alternate hypothesis, that is our drug does indeed have an effect. The green vertical bars represent our acceptance region for our null hypothesis, that is, any test values falling outside this region  – illustrated by the green shaded areas –  will be seen as false positives, or type I errors, causing us to falsely reject our null hypothesis.

The yellow shaded area is our beta, that is, our probability for False Negatives, or type II errors, causing us to erroneously accept our null hypothesis, despite the reality being that the alternate hypothesis is true, that is, the drug works.

In the image below, we can see how a simulation of the two different distributions, the one for the expected mean = 38, and the other with expected mean = 38.8 (both normalized to standard normal distribution) plays out: As can be seen, the two distributions are quite well separated, meaning that there is indeed a significant difference between the two groups, the control group vs the test group. There is some overlap in terms of alpha and beta, but both of those are fairly small, indicating that we have a significant difference between the two groups, which in turn is an indication that our drug does work. But the question is, can we somehow quantify our belief that the groups are different ?

That’s where the concept of power comes in: power is formally defined as 1 – beta,  that is, 1 – P(Type-II-errors), and gives us the probability that the test will find a statistically significant difference in our test subjects.

In this example, our beta turns out to be about 6%, which gives us a power of about 94%.

As always, with stochastic processes, there are few if any absolutes, so there are no iron law rules for what levels of alpha, beta and power constitute “the best” values – it depends. But the stats folks claim that a power of 80% and above constitutes strong evidence for statistical significance, i.e. that our results are valid.

That’s all for now. Those of you who want to look at the details can have a look at the Python code below, which runs this example and produces the graphs above.

(as usual, I haven’t figured out how to preserve code formatting in !#¤% WordPress – without paying for it – so the code might be a bit weird to read…)

```import numpy as np
from math import erf
import scipy.stats as sprand
import matplotlib.pyplot as plt

def normpdf(x, mu, sigma):
factor = np.array(x-mu).astype(float)
u = ((factor))/np.abs(sigma)
y = (1/(np.sqrt(2*np.pi)*np.abs(sigma)))*np.exp(-u*u/2.)
return y

def normcdf(x,mu,sigma):
a = np.array(x-mu).astype(float)
b = a/(sigma*2**0.5)
elist = []
for i in b:
elist.append((1. + erf(i))/2)
cdf = np.array(elist)
return cdf

mu = 38
std = .5
n = 20
alpha = 0.04
a = alpha
twoTail = True

if twoTail:
a = alpha / 2
sampleMean = 38.4
z = sprand.norm.ppf(1-a,0,1)

print ('H0: mu =',mu)
print ('H0: std =',std)
print ('Ha:mean != mu (two tailed test)')
print ('alpha =',alpha,'==> z =',z)
print ('test samplesize:',n)
print ('test sampleMean:',sampleMean)print ('Effect size:',(sampleMean-mu) / std)
se = std/np.sqrt(n) #also std of sample!!!
E = z * se
accept_low = mu - E
accept_high = mu + E
print ('population accept region',accept_low,accept_high)
accept_low = (accept_low-mu)/se
accept_high =(accept_high-mu)/se
print ('normalized accept region',accept_low,accept_high)
Z = (sampleMean - mu) / se
print ('normalized sample mean (Z)',Z)
beta = normcdf(accept_high,np.array([Z]),1)
print ('beta:',beta)
print ('Power:',1.-beta)
population = np.random.randn(10000)
sample = np.random.normal(Z,1,10000)
plt.title('Hypothesis testing, Alpha, Beta, Power')
plt.plot(np.sort(population),normpdf(np.sort(population),0,1),color='b',label='Std Normal Dist')
plt.axvline(accept_low,label='accept_region',c='g')
plt.axvline(accept_high,label='accept_region',c='g')
plt.plot(np.sort(sample),normpdf(np.sort(sample),Z,1),color='r',label='test outcome dist.')
plt.plot(np.sort(population),normcdf(np.sort(population),0,1),color='b',ls='dashed')
plt.plot(np.sort(sample),normcdf(np.sort(sample),Z,1),color='r',ls='dashed')
plt.axvline(0,color='b',lw=3)plt.axvline(Z,color='r',lw=3)
plt.legend(loc='upper left')
bins = 25plt.figure()
ax = plt.gca()
ax2 = plt.twinx()
hist = ax.hist(sample,color='r',bins=bins,rwidth=0.5,label='normalized sample dist.',alpha=0.8)
ax2.plot(np.sort(sample),normpdf(np.sort(sample),Z,1),color='orange',ls='dashed',lw=3)
ax.hist(population,color='b',bins=bins,rwidth=0.5,label='normalized pop. dist',alpha=0.7)
plt.title('normalized sample frequency')plt.axvline(Z,label='normalized sample mean: '+ str(Z),color='k',lw=3,ls='dashed')
ax2.plot(np.sort(population),normpdf(np.sort(population),0,1),color='c',ls='dashed',lw=3)plt.axvline(accept_low,label='accept_region',c='g')
plt.axvline(accept_high,label='accept_region',c='g')ax2.plot(np.sort(sample),normcdf(np.sort(sample),Z,1),label='cdf',color='orange',ls='dashed')
ax2.legend(loc='upper left',frameon=False)plt.show()``` 