Statistical power, alpha, beta & effect size

A further exploration into the world of hypothesis testing, following up from my previous post on the topic.

(For simplicity, I’ll use the same example as in the previous post, that is, we have a control population with mean of 38.0, and a test population with a mean of 38.4, both with a standard deviation of 0.5)


This first graph shows our two population distributions: the top plot in actual values, the bottom plot in the corresponding standardized values, i.e. all control-group values have been standardized to the standard normal distribution, N(0, 1). The test-group values have been standardized the same way, which leaves them offset by (mean(test) – mean(control)) / std(control) = (38.4 – 38.0) / 0.5 = 0.8.

It’s clear from the plots that the two populations differ, but there’s also considerable overlap between the two groups.

Now, assume we take a sample of 20 from one of these groups, without knowing a priori which group it came from. That is, we want to use sampling and inferential statistics to determine which of the two groups our sample comes from – and, given our sample, to quantify the probability that this determination is correct.

This is where the concept of statistical power comes in. Statistical power is the probability that a test procedure detects an “effect” when there in fact is an effect to be detected. In our case, the “effect” is the difference in parameter value (the mean) between the two groups.

Thus, our objective is to decide, given our sample size of 20, how confident we can be that our test result correctly reflects the truth: if our test indicates that our sample comes from group A or group B, how much should we believe that finding?

So, first we define our alpha level, which is the rate of “false positives” (Type I errors) we are prepared to accept. The alpha level gives us our margin of error, and thus our acceptance and rejection regions: the range of values that determines whether the value obtained from our sample should make us decide that the sample comes from group A or group B.

In this example, we set alpha to 4%, which gives us a Confidence Level of 96%.  That means that should we repeat this experiment over and over, in 96% of the trials the value obtained in the trial will reside within our acceptance region.

In the plots above (the one on the left with actual data values, the one on the right with standardized values), the vertical black bars mark the acceptance region: if the result of our sample falls within these bars, our decision is that the sample was taken from the control population (group A). If the result falls to the right of the rightmost bar, our decision is that the sample comes from group B, the test population.


The above graph illustrates the alpha we’ve chosen (4%) by the greenish shaded areas (this is a two-tailed test): values falling within these areas will be false positives, making us erroneously conclude that our sample does not come from the control group.

The yellow shaded area represents our beta, or false negatives (Type II errors), causing us to erroneously draw the conclusion that our sample does not come from the test population.

The alpha is defined by us when setting up the experiment; the beta then follows from the acceptance region: it is the cumulative probability, under the test-group distribution, of values up to the upper acceptance-region limit.

Finally, power is defined as 1 – beta, and gives us the probability of our test detecting an existing effect, that is, a difference in the parameter between the two groups.


The above graph illustrates the results from our example: the magenta curve shows the expected distribution of sample means for samples of 20 drawn from the control population, and the orange curve shows the corresponding distribution for samples of 20 drawn from the test population. The vertical black bars mark the acceptance region: if our sample mean falls within these values, our decision is that the sample comes from the control group.

As can be seen, there is a small probability – 6%, in fact – that our sample, despite being drawn from the test group, falls within the acceptance region. This is our beta, or Type II error rate, or false negatives: the values under the orange curve inside the acceptance region cause us to erroneously decide that the sample comes from the control population. This gives our test a power of 94%, the probability of our test detecting an existing effect.

For details, see the Python/Numpy/Scipy implementation of the example below.
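As a quick preview, the beta and power above can be computed directly from the sampling distribution of the mean. A minimal sketch, assuming the parameters stated above (control mean 38.0, test mean 38.4, σ = 0.5, n = 20, two-tailed α = 4%) – variable names are my own:

```python
import numpy as np
from scipy import stats

mu0, mu1, sigma, n, alpha = 38.0, 38.4, 0.5, 20, 0.04

se = sigma / np.sqrt(n)                  # standard error of the sample mean
z_crit = stats.norm.ppf(1 - alpha / 2)   # two-tailed critical z value
upper = mu0 + z_crit * se                # upper acceptance-region limit

# beta: probability that a sample mean drawn from the test group
# nevertheless falls inside the acceptance region (a Type II error)
beta = stats.norm.cdf(upper, loc=mu1, scale=se)
power = 1 - beta

print(f"upper acceptance limit: {upper:.3f}")   # ~38.23
print(f"beta: {beta:.3f}, power: {power:.3f}")  # ~0.064 and ~0.936
```

This reproduces the roughly 6% beta and 94% power quoted above.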

Posted in Data Analytics, Data Driven Management, Math, Numpy, Probability, Python, Statistics

World Economic Forum Global Competitiveness rankings 2017-2018

Interesting statistics on many categories of data regarding the performance of various nations. Includes rankings and trends.

Posted in Business, Culture, Data Analytics, Finance, Organization, Politik

How do casinos make money? Expected Value explained

Ever played roulette? Or any of the other games on offer in casinos, such as blackjack? If not, good for you, because from a mathematical point of view, all these games are set up for you to lose money.

Let’s consider the roulette situation where you bet 1 [enter your favorite currency here] on a single number, that is, one of the 37 possible choices 0..36 (on European roulette).

What’s the probability for you to win? With 1 out of 37, it’s about 2.70% (1/37). What’s the probability for you to lose? It’s 36/37, or about 97.3%.

If you win, the casino pays you 35:1: on your bet of 1, a win returns your stake plus 35, i.e. 36 in total.

The casino folks are not stupid, so the odds are carefully calculated to ensure that the casino makes money in the long run. Thus, even if you get lucky, and leave the casino with more money than you had on arrival, in the long run, you, as well as everybody else playing in the casino are going to lose money.  But how much ?

The expected value of betting 1 unit on a single number can be calculated as follows:

EV = p_win × (win – cost) – p_loss × cost

= (1/37) × (36 – 1) – (36/37) × 1

≈ 0.0270 × 35 – 0.973

≈ –0.027

So, the expected value tells you that for every USD (or whatever currency) you bet, in the long run you are going to lose a little less than 3 cents. Not such a big deal, is it…? Well, in the long run, those cents add up to a very nice profit for the casino (and less nice losses for the gamblers):

Below are the results of a simulation where 10,000 players each play 10,000 games of roulette, betting on a single number – 100,000,000 games in total:

after 10000 sessions and 10000 games in each session:
with win:36.0 P(win):0.02702703 cost:1.0
simulated EV mean: -0.0277822
simulated balance mean: -277.822
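The simulation can be sketched along these lines (scaled down to 1,000 players for speed; the seed, player count and variable names are my own choices, not taken from the original code):

```python
import numpy as np

rng = np.random.default_rng(42)

p_win = 1 / 37     # single-number bet on a European wheel
payout = 35        # net win on a 1-unit bet (stake returned separately)
players, games = 1_000, 10_000   # scaled down from 10,000 x 10,000

# each game: +35 on a win, -1 on a loss
wins = rng.random((players, games)) < p_win
outcomes = np.where(wins, payout, -1)

ev_simulated = outcomes.mean()    # mean result per game
balances = outcomes.sum(axis=1)   # final balance per player

print(f"theoretical EV: {-1 / 37:.4f}")
print(f"simulated EV:   {ev_simulated:.4f}")
print(f"mean final balance after {games} games: {balances.mean():.1f}")
```

With enough games, the simulated EV converges on the theoretical −1/37 ≈ −0.027.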


Looking at the top graph, which shows the outcomes of 10,000 plays for each of the 10,000 players, the outcomes look pretty even: about as many players end their 10,000 games ahead, i.e. winning, as end their game day losing. From that top graph, it’s difficult to see that the casino actually makes money, because the “house advantage” – an expected value of just −2.7% per bet – is very small.

But that small house edge, taken over the long run, gives the casino solid profits, as can be seen in the bottom graph, which shows the cumulative player losses – or, from the casino’s perspective, the cumulative wins. Over the 100,000,000 games of players betting 1 USD on a specific number, the casino has made a profit of about 2.7 million USD, which of course is what the expected value tells us: the casino can expect to gain about 2.7% on every bet. In other words: if you enter the casino with 10,000 USD, and bet each of those dollars on a single roulette number, on average you can expect to leave the casino about 270 dollars poorer.

Posted in Math, Numpy, Probability, Python

Mild Randomness – obesity, Expected Value & thin-tailed distributions

In a previous post, I linked to an article illustrating the fundamental difference between distributions with thin vs fat tails, that is, between “mild” and “wild” randomness.

Here, to illustrate mild randomness, let’s consider the following challenge:

You are given a mission where you are to randomly select two male adults in your city, one after the other. If you are lucky and manage to select the two men so that the second weighs at least twice as much as the first, you will receive an award of 1,000,000 USD.

On the other hand, should you fail in your random pick of the two men, you will have to pay 100 USD as a penalty for failing the mission.

The question now is: should you take on this task? What are the odds of success vs failure? Take a moment to figure out a ballpark probability of success…


In order to figure out the odds/probabilities for this game, I wrote a simulation, where I pick the two subjects from a normally distributed population with mean weight 80 kg and standard deviation 11 kg (numbers I found somewhere on the net).

Below is a graph of the simulation run with 1,000,000 iterations:


The top graph shows the population’s weight distribution, while the bottom graph maps the pairs that meet the success criterion, i.e. where the second person weighs at least twice as much as the first: in 1M trials there were 592 matches.

That gives a probability of success of about 0.0006, or roughly 1 in 1,700.
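A minimal version of such a simulation might look like this (the population parameters are those stated above; the seed and structure are my own assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, trials = 80.0, 11.0, 1_000_000   # mean and std of adult male weight, kg

first = rng.normal(mu, sigma, trials)    # weight of the first pick
second = rng.normal(mu, sigma, trials)   # weight of the second pick

# success: the second person weighs at least twice as much as the first
hits = np.count_nonzero(second >= 2 * first)
print(f"matches: {hits} of {trials}  (p = {hits / trials:.5f})")
```

Across runs, the hit rate lands close to the ~0.0006 reported above.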

The bottom line is that for any phenomenon governed by the normal distribution – “mild” randomness, or “thin tails” – randomly finding a sample twice the size of another random sample is very unlikely. The reason is the symmetry and thin tails of the normal distribution: to succeed in this mission, you must first be lucky enough to pick your first person from the low-weight left side of the distribution, and then be lucky enough to pick your second person from the high-weight right side. This can be clearly seen in the second graph, where all the first persons come from the area below -2 standard deviations, while most of the second persons come from the area above the mean.
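The simulated odds also agree with a quick analytic check: if both weights are drawn independently from N(80, 11), the quantity second − 2 × first is itself normally distributed, and success means that quantity is non-negative. A sketch:

```python
from math import sqrt
from scipy.stats import norm

mu, sigma = 80.0, 11.0

# second - 2*first is normal with mean mu - 2*mu = -80
# and std sqrt(sigma**2 + (2*sigma)**2) = sigma*sqrt(5), about 24.6
diff_mean = mu - 2 * mu
diff_std = sigma * sqrt(5)

p = norm.sf(0, loc=diff_mean, scale=diff_std)  # P(second >= 2*first)
print(f"analytic probability: {p:.6f}")        # ~0.00057, i.e. roughly 1 in 1,700
```

That analytic value is within sampling noise of the 592 matches per million trials found by the simulation.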

So, with respect to my previous post regarding journalists who conflate “mild” risks – bathtubs, traffic, lawnmowers etc – with the “wild”, long-tailed risks of terrorism: since your bathtub, lawnmower or traffic is not actively trying to kill you, those risks are a very different beast from the “wild”, fat-tailed risks of terrorism, where the perpetrators actively try to kill you, and where a doubling (or whatever factor) of the number of casualties from one act of terror to the next is fully possible.

Posted in Math, Numpy, Probability, Python, Statistics

Are lawnmowers a greater risk than terrorists…?

Paper debunking the media’s common misunderstanding of different risks. Systemic risk vs random risk – two extremely different beasts. Your bathtub, lawnmower or swimming pool is a random risk with narrow tails, while terrorists constitute a systemic risk with fat tails.


Posted in development, Probability, Statistics

Hypothesis testing in Python – alpha, beta, Effect & Power

Suppose you’d like to conduct an experiment to determine whether a new drug is effective against some disease. Let’s say that you have a control group and a test group of test subjects, where the test group receives your drug under test, while the control group receives a placebo.

In your experiment, you are going to measure a parameter – let’s call it ‘C2H5OH-saturation’ – whose mean value in the population is 38, with a standard deviation of 0.5.
So you set up your null hypothesis & alternate hypothesis for your experiment:

  • H(0) := mean of C2H5OH-saturation in test group == 38 (that is, our drug has no detectable effect on our patients)
  • H(a) := mean of C2H5OH-saturation in test group != 38 (that is, our drug does have a detectable effect)

Then you collect data from your test group, with sample size = 20, and find that the mean value is 38.4.

The question now becomes: can you draw any conclusions from the experiment about whether your drug under development actually works…? That is, does the experiment give us reason to reject our null hypothesis – thus accepting the alternate hypothesis, meaning that the drug indeed works? And if so, with what degree of confidence?

This is where the concept of power comes in handy: statistical power is the probability that the test will find a statistically significant difference between the control group and the test group, given that such a difference exists.

To do the power analysis, we start by selecting the alpha level, which is the probability of false positives: cases where our test indicates a detectable effect, but where in fact there is none – the positive outcome is caused not by our drug, but by randomness. Typical alpha levels are 5% or 10%, but for this experiment, we chose 4%. (In reality, we would start our experiment by first figuring out what sample size we would need in order to get significant results, but in our case that’s already been set to 20.)

Basically, what the alpha level does is set the bar for the values we are willing to accept as supporting our null hypothesis: any values falling outside the acceptance region will be taken as reason to reject the null hypothesis, even when, in reality, they are ‘outliers’ – rare but not impossible values that in fact occur under the null hypothesis. Those cases are our false positives, also called Type I errors.

The other type of error outcome we could run into is called beta. These are the false negatives: cases where we erroneously fail to reject our null hypothesis, that is, our test indicates that the drug is not effective, while in fact it is. False negatives are also called Type II errors.

Perhaps an image can make this a bit clearer:


The blue distribution represents our null hypothesis (our drug has no effect), while the red distribution represents our alternate hypothesis (our drug does indeed have an effect). The green vertical bars mark the acceptance region for our null hypothesis: any test values falling outside this region – illustrated by the green shaded areas – will be false positives, or Type I errors, causing us to falsely reject our null hypothesis.

The yellow shaded area is our beta, that is, our probability for False Negatives, or type II errors, causing us to erroneously accept our null hypothesis, despite the reality being that the alternate hypothesis is true, that is, the drug works.

In the image below, we can see how a simulation of the two different distributions – the one with expected mean = 38, and the other with expected mean = 38.4 (both normalized to the standard normal distribution) – plays out:


As can be seen, the two distributions are quite well separated, meaning that there is indeed a significant difference between the two groups, the control group vs the test group. There is some overlap in terms of alpha and beta, but both of those are fairly small, indicating that we have a significant difference between the two groups, which in turn is an indication that our drug does work. But the question is, can we somehow quantify our belief that the groups are different ?

That’s where the concept of power comes in: power is formally defined as 1 – beta, that is, 1 – P(Type II error), and gives us the probability that our test will detect a difference that actually exists.

In this example, our beta turns out to be about 6%, which gives us a power of about 94%.

As always with stochastic processes, there are few if any absolutes, so there are no ironclad rules for what levels of alpha, beta and power constitute “the best” values – it depends. But by convention, a power of 80% or above is generally considered adequate for trusting a test to detect a real effect.

That’s all for now. Those of you who want to look at the details can have a look at the Python code below, which runs this example and produces the graphs above.

(as usual, I haven’t figured out how to preserve code formatting in !#¤% WordPress – without paying for it – so the code might be a bit weird to read…)
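The power can also be estimated by straight simulation rather than analytically – a compact sketch, using the parameters above (the seed and trial count are arbitrary choices of mine, not from the original code):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
mu0, mu1, sigma, n, alpha = 38.0, 38.4, 0.5, 20, 0.04
trials = 100_000

se = sigma / np.sqrt(n)
z_crit = stats.norm.ppf(1 - alpha / 2)
lo, hi = mu0 - z_crit * se, mu0 + z_crit * se   # acceptance region

# draw many samples of size n from the test population and count how
# often their mean escapes the acceptance region (a correct rejection)
means = rng.normal(mu1, sigma, size=(trials, n)).mean(axis=1)
power = np.mean((means < lo) | (means > hi))

print(f"simulated power: {power:.3f}")   # the analytic value is ~0.94
```

The simulated power agrees with the roughly 94% quoted above.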

Posted in Data Driven Management, Probability, Python, Simulation, Statistics

Trigger-Happy, Autonomous, and Disobedient: Nordbat 2 and Mission Command in Bosnia

Leadership, when it really matters…!

Posted in development, Leadership

Statistics – lies, damn lies, part II: “Cancer rates in your county are more than a factor 18 above the national average!”

My previous post, on how statistics can give misleading conclusions if not carefully presented and analyzed, can be further illustrated by a variant of the problem presented there. In the previous simulation, the overall population of 1,000,000 was uniformly allocated to a grid of geographical areas of equal size, so each area had more or less the same number of people. Now, that’s probably not a very likely scenario for most countries/states/regions/cities, so let’s change the simulation so that the populations of the different areas differ: I now allocate people to the areas by a normal distribution, i.e. the area in the “middle” – area 1000 of the 2000 areas – is the mean area, which receives the most inhabitants, while the areas towards the two tails get fewer and fewer people. The point of this exercise is to ensure that the populations of our 2000 areas differ.

So, what do you think will happen to the maximum cancer rate found in our 2000 areas….?

The simulation reveals that instead of a factor 4 above the national average, which we had in the previous simulation with all areas at more or less equal population, we now get a maximum cancer rate of a factor 18 above the national average. That would make headlines in our newspapers – maybe even a Pulitzer Prize for the journo…! 🙂

Below some graphs to illustrate this:

The first graph, with 4 subplots, shows the same info as in the previous post. It’s clear from the first subplot that in this simulation the population is not uniformly distributed over the 2000 areas – instead, the distribution is normal, with a mean at area 1000.

Below the subplots are two bar charts illustrating the cancer/population ratios: the one on the left shows the data for this second simulation, with area populations varying by a normal distribution, while the one on the right shows the data for a uniform population distribution (pay attention: the y-scales are different!).

As can be seen by carefully examining these two bar charts, there is a huge difference in maximum cancer rates between these two simulations.
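A sketch of how such a skewed allocation might be simulated (the post does not state the exact spread used, so the σ of 250 areas below is my own assumption, and the maximum factor will vary with it, with smaller areas producing more extreme ratios):

```python
import numpy as np

rng = np.random.default_rng(7)
n_areas, population, rate = 2000, 1_000_000, 338 / 100_000
n_cases = int(population * rate)   # 3380 cases nationwide

# allocate people to areas by a normal over the area index,
# centred on area 1000 (sigma = 250 is an assumption)
idx = np.clip(rng.normal(n_areas / 2, 250, population).astype(int),
              0, n_areas - 1)
area_pop = np.bincount(idx, minlength=n_areas)

# each case strikes a person chosen uniformly at random, so an area's
# expected case count is proportional to its population
area_cases = np.bincount(rng.choice(idx, size=n_cases), minlength=n_areas)

mask = area_pop > 0
ratio = (area_cases[mask] / area_pop[mask]) / rate
print(f"max area rate: factor {ratio.max():.1f} above the national average")
```

Because the thinly populated tail areas can register a case or two against a tiny population, the maximum ratio comes out far above the uniform-allocation case.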

Posted in Data Analytics, Data Driven Management, Math, Numpy, Probability, Simulation, Statistics

Lies, damn lies, statistics – community cancer rates

Statistics is an interesting topic – it can illuminate as well as confuse. Sometimes statistics mislead unintentionally, i.e. the presenter has made an (honest) mistake. Other times, less honest presenters take advantage of the fact that most people are not very comfortable with numbers, math or statistics, and twist the statistics to reflect their point of view.

To illustrate this, let’s consider a (hypothetical) newspaper article stating that some particular city, or county, or in general, any specific geographical area close to you, has a rate of cancer that is 4 times greater than the national average.

The question is: is there reason for alarm, i.e. should you immediately start planning to move out of the vicinity…? After all, four times the national cancer rate sounds like there’s something in your area causing all these “way-over-the-bar” cancer occurrences… Could there be area-specific factors – environmental, social, genetic, economic etc – causing the difference in cancer rates? Or could the difference be due to pure random variability…?

To illustrate that, despite the alarmist article, you should perhaps calm down and not worry overly about what the probably statistically illiterate newspaper journalist has written, I wrote a simulation to examine whether pure random variability could explain such a large difference in cancer rates.

The scenario thus is: imagine a state/country/region divided into a number of equal areas. Furthermore, imagine that this “grid” of areas is populated more or less uniformly, i.e. the number of citizens in each area is roughly the same. Now, the question is: with a single, nationwide cancer rate (which in the simulation I’ve set to 338/100,000, based on cancer statistics for the developed world that I found on the net), will randomness alone suffice to explain a difference of a factor 4 in cancer rates between the national average and a specific area?

Turns out the answer to this question is YES – that is, the alarming cancer rate could very well be caused by pure random variability: the graphs below illustrate a grid of 2000 areas, onto which a total population of 1,000,000 has been randomly assigned by a uniform distribution, resulting in each area having roughly the same number of people, ~500. This is illustrated by the first subplot.

The second subplot shows the resulting population frequency, nicely following the Normal Distribution, with a mean of 500, and a standard deviation of 22.

On top of the area grid, we allocate a number of cancer cases: population × cancer rate, which in the simulation I ran was 1,000,000 × 338/100,000 = 3380 cancer cases, randomly assigned to the people in the 2000 areas.

The third subplot shows the distribution of the cancer cases over the 2000 areas. The mean number of cancer cases per area is 1.69, with a standard deviation of 1.27. But in the same subplot you can also see a number of spikes with cancer occurrences way above the mean; in fact, the maximum number of cases in a single area is 8, a factor 4.7 above the national average.

Subplot 4 illustrates this more clearly: most areas have 1 case of cancer among their roughly 500 inhabitants, others have 0 cases, but on the far right of the graph there are a handful of areas with up to 8 cases.

Thus, we can conclude that before we decide to pack up and leave an area with a seemingly exceptionally high occurrence of cancer, we need more information about the causes – after all, the high frequency might be fully explained by pure random variation.
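A compact sketch of the uniform-allocation simulation described above (the seed and variable names are my own; assigning cases uniformly over areas is a fair shortcut here, since the area populations are nearly equal):

```python
import numpy as np

rng = np.random.default_rng(3)
n_areas, population, rate = 2000, 1_000_000, 338 / 100_000
n_cases = int(population * rate)   # 1,000,000 * 338/100,000 = 3380

# uniform allocation: each person lands in a random area (~500 per area)
area_pop = np.bincount(rng.integers(0, n_areas, population),
                       minlength=n_areas)

# cancer cases assigned uniformly over the areas
area_cases = np.bincount(rng.integers(0, n_areas, n_cases),
                         minlength=n_areas)

print(f"mean cases per area: {area_cases.mean():.2f}")   # 1.69
print(f"max cases in one area: {area_cases.max()}, "
      f"factor {area_cases.max() / area_cases.mean():.1f} above average")
```

Run after run, some area ends up several times above the national average through randomness alone, which is exactly the point of the post.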


Posted in Big Data, Data Analytics, Data Driven Management, Math, Numpy, Probability, Python, Simulation, Statistics

Winter solstice

Posted in development