As per 20180224.
[Update 20180224: updated per capita, plus a graph showing “Total Points”, where Gold gives 3p, Silver 2p, Bronze 1p]
Just a quick add-on to my previous post on yet another way to present multidimensional data:
To recap, we have a “distribution of distributions”, where each distribution has two dimensions, mu and sigma.
In the previous post, I chose to present the data as a heatmap, as below:
But there are other ways such multidimensional data could be presented, below a 3D version:
Although this 3D presentation looks quite fancy, I’m not sure it’s easier to understand than the heatmap above. One way to think about the data: the heatmap presents a top view, that is, we are looking down at the data from above, while the 3D view lets us see the data from the side. To illustrate that, below are plots showing the “profiles” of the “mountain”, that is, mu and sigma separately. You can imagine that the mu-plot is the projection of the 3D data when standing on its mu-side, and the sigma-plot is the projection when standing on the sigma-side:
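The two “profile” plots can be thought of as projections of the surface onto each axis. A minimal sketch of that idea, using a hypothetical Gaussian bump over (mu, sigma) as a stand-in for the real data:

```python
import numpy as np

# hypothetical 2D density over (mu, sigma): a Gaussian bump standing in
# for the real "distribution of distributions" data
mus = np.linspace(35, 41, 100)
sigmas = np.linspace(0.1, 1.5, 80)
M, S = np.meshgrid(mus, sigmas)
density = np.exp(-((M - 38) ** 2) / 0.1 - ((S - 0.5) ** 2) / 0.02)

# the heatmap is the "top view" of this surface; the two profiles are
# its projections onto the mu- and sigma-axes (max over the other axis)
mu_profile = density.max(axis=0)      # profile along mu, projected over sigma
sigma_profile = density.max(axis=1)   # profile along sigma, projected over mu

print(mus[mu_profile.argmax()], sigmas[sigma_profile.argmax()])
```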
I wanted to experiment a bit with a Python library previously unknown to me, “Beautiful Soup”, which is an HTML and XML parser. So I used the Soup to grab data from a Wikipedia page, and massaged it a bit with Pandas:
The graph contains stats on about 40 different countries, in terms of homicides by firearm and gun ownership. In the first two subplots, the USA clearly sticks out, both in terms of the number of guns – according to the wiki page, there are about as many guns as people in the US…! – and in homicides by gun.
The third subplot shows homicides per gun, and now the rankings hold some surprises, e.g. that the Netherlands (NL) ranks very high, the obvious reason being that there are very few guns, but, relatively speaking, many homicides.
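The scraping itself follows the usual Beautiful Soup pattern: find the table, walk its rows, pull out cell text. Below is a minimal sketch of that idea, run on a tiny inline stand-in table – the country figures here are illustrative placeholders, not the actual Wikipedia data:

```python
from bs4 import BeautifulSoup

# a tiny inline stand-in for the Wikipedia table the post scraped;
# the numbers are illustrative placeholders only
html = """
<table>
  <tr><th>Country</th><th>Guns per 100</th><th>Homicides per 100k</th></tr>
  <tr><td>US</td><td>120.5</td><td>4.46</td></tr>
  <tr><td>NL</td><td>2.6</td><td>0.34</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for tr in soup.find_all("tr")[1:]:            # skip the header row
    cells = [td.get_text() for td in tr.find_all("td")]
    rows.append((cells[0], float(cells[1]), float(cells[2])))

# "homicides per gun" ranking, as in the third subplot
per_gun = sorted(rows, key=lambda r: r[2] / r[1], reverse=True)
print(per_gun[0][0])
```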
I found this very illuminating short tutorial video on Approximate Bayesian Computation, by Rasmus Bååth, on YouTube, and since Rasmus’s example uses R as the implementation language, I decided to implement the example in Python.
The problem at hand is as follows: you’ve done laundry, and started to pull out socks from the laundry machine. After having pulled out 11 socks, you notice that none of them constitutes a pair: all 11 pulled socks are singletons.
The question is: how many socks were there in the machine from the start?
Think about it for a while….
Well, one thing we perhaps can conclude is that there must have been at least 11 socks to start with, since we managed to pull out the first 11… But can we do any better…?
Turns out that with Bayesian Inference, we can. But in order to do so, we need to provide Bayes with some initial information about our sock management procedures, namely a couple of priors.
First, let’s tell the Bayesian machine that our typical laundry consists of somewhere between a couple and up to some 80 or so individual socks (your mileage may vary!). We can model this e.g. as a normal distribution, centered at 40, with a standard deviation of 10.
Second, we need to tell the Bayesian machine how “orderly” we are, that is, what ratio of our socks subject to laundry will be true pairs, and what ratio will be odd socks. We can model this as a beta distribution, and since we are very orderly, we supply an alpha = 15, and a beta = 2, which will result in the beta distribution centering around 90% of our laundry socks belonging to pairs of socks.
Here are graphical illustrations of our two priors:
So, given these two priors, we can run a Bayesian inference simulation that shows us the posterior distribution for the number of socks in the machine, given that our data is that we have pulled 11 singleton socks out of the machine:
Turns out the most likely number of socks in the machine to produce 11 singletons, is about 44.
In order for you to verify this finding next time you’re doing laundry, the code for the simulation is supplied below.
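The simulation below is a sketch of the Approximate Bayesian Computation approach described above: draw the number of socks and the pair ratio from the two priors, simulate pulling 11 socks, and keep only the prior draws that reproduce the observed data (11 singletons). The priors match the ones stated above; the number of simulated laundries is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(42)

n_picked = 11      # observed data: 11 socks pulled, all singletons
n_sims = 100_000   # arbitrary number of simulated laundries
accepted = []

for _ in range(n_sims):
    # prior on total socks: Normal(40, 10), rounded, at least n_picked
    n_socks = max(n_picked, int(round(rng.normal(40, 10))))
    # prior on the ratio of paired socks: Beta(15, 2)
    prop_pairs = rng.beta(15, 2)
    n_pairs = int(round(n_socks // 2 * prop_pairs))
    n_odd = n_socks - 2 * n_pairs
    # label the pairs 0..n_pairs-1 (two copies each); odd socks get unique labels
    socks = np.repeat(np.arange(n_pairs + n_odd), [2] * n_pairs + [1] * n_odd)
    picked = rng.choice(socks, size=n_picked, replace=False)
    # accept the prior draw only if it reproduces the data: no pairs picked
    if np.unique(picked).size == n_picked:
        accepted.append(n_socks)

posterior = np.array(accepted)
print(np.median(posterior))
```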
A follow-up on my previous post on statistical significance and hypothesis-testing:
Let’s say we pull a number of samples, as in the previous post, from both a control group and a test group. Let’s also say that for the samples from the control group, the sample mean is 38, standard deviation is 0.5. For the test group samples, the corresponding parameters are 38.4 and 0.5.
Let’s then use Bayes to fit these two sets of data independently to their likelihood functions, and to produce two posterior distributions that we can then sample from, so that we can hopefully see whether there is enough statistical significance here to base any decision on:
Below a heat map with sample size 20, for both groups of sampled data:
As can be seen, the overlap between the two sets of data is quite large, meaning that we should perhaps not use this small dataset as basis of a decision.
Let’s increase the sample_size to 200:
Now with 10 times more data, the two datasets are almost perfectly separated, with the control data centered around 38.0 and 0.5, and the test data around 38.4 and 0.5, as our original sampling data indicated.
Let’s increase the sample size to 2000:
This time, with 100 times the initial sample size, our posterior sampling is very well clustered around the respective mean and standard deviation.
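The post doesn’t show the fitting code, but the idea can be sketched with a simple grid approximation over (mu, sigma) with flat priors; the sample data below is synthetic, generated to match the stated sample statistics:

```python
import numpy as np

rng = np.random.default_rng(0)

def posterior_samples(data, n_draws=2000):
    # grid approximation over candidate (mu, sigma), with flat priors
    mus = np.linspace(data.mean() - 1, data.mean() + 1, 200)
    sigmas = np.linspace(0.1, 1.5, 200)
    M, S = np.meshgrid(mus, sigmas)
    # Gaussian log-likelihood of the data at every grid point
    ll = -data.size * np.log(S) - ((data[:, None, None] - M) ** 2).sum(0) / (2 * S**2)
    p = np.exp(ll - ll.max())
    p /= p.sum()
    # sample grid points in proportion to their posterior probability
    idx = rng.choice(p.size, size=n_draws, p=p.ravel())
    return M.ravel()[idx], S.ravel()[idx]

# synthetic samples matching the stated statistics: means 38.0/38.4, sd 0.5
control = rng.normal(38.0, 0.5, size=20)
treat = rng.normal(38.4, 0.5, size=20)

mu_c, sd_c = posterior_samples(control)
mu_t, sd_t = posterior_samples(treat)
print(mu_c.mean(), mu_t.mean())
```

Plotting `(mu, sigma)` draws from the two groups in one scatter or 2D histogram gives exactly the kind of heatmaps discussed above; increasing the sample size from 20 to 200 or 2000 tightens the clusters.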
Some years ago, John Ioannidis published a paper arguing that most research papers are in fact wrong. That is, the findings of many or most research papers can not actually be reproduced by other, independent teams.
According to Ioannidis, one reason for this poor state of affairs in the world of science and research is the traditional reliance on p-values.
Basically, a p-value is an indication of how likely (or not) it would be to see results at least as extreme as yours purely by chance.
A while ago, I wrote about traditional (“frequentist”) hypothesis testing, and although I might not explicitly mention p-values in that article, they are in fact there, hidden in the confidence intervals and alpha values.
So, let’s take another look at the same example from the previous post:
Briefly, the scenario of the previous post was that we have a test, and we would like to determine whether our test subjects are registered as “positive” by the test, by showing a test value not equal to the expected value for the parameter, obtained from the control group. So, we pull a number of samples, 20 in our case, record the mean and standard deviation of the sample, plug those numbers into some statistical formulas, look up a few numbers in a Z-score table, and voilà, we can state with impressive-looking certainty that “our test shows with 96% confidence that this test subject is ‘positive’; the probability of these results having been caused purely by randomness is less than four in 100”.
Sounds really good and impressive, right…? We can illustrate this type of finding with a graph:
Here we have pulled 20 samples from both the control group (green) and the test group (red), 100000 times, and plotted the frequency of the mean values of those 20 samples. As can be seen, the two sets of samples are quite well separated, with little overlap: the Alpha (Type I error) is 4%, the Beta (Type II error) is 4%, and the Power is 96%. So, by looking at repeated sample means, we are quite likely to draw the conclusion that there is a measurable difference between the two groups of test subjects.
However, if we look at each individual sample of 20, the situation becomes different. The graph below presents the cumulative frequency over the 100000 draws, for each individual sample value, for both groups:
Now, when looking at all the samples, the picture is far from clear. Sure, we can see that the test group tends towards the higher parameter values, but the overlap is quite large, meaning that each individual sample we draw from the populations has a non-negligible likelihood of being very different from what we get by making our calculations using the means of many, many sets of samples. That is, chances are that the unique sample you just drew is actually an outlier, nowhere close to the nice, clean-cut results indicated by the statistical algebra you performed above.
We can combine these two plots into a single one to see the big picture:
Here we have the information from both the above graphs combined into one: as can be seen, the histogram shows the sample means, obtained by repeated draws, while the plot lines show the cumulative individual samples. The difference in spread (variance or standard deviation) is huge.
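This difference in spread is just the central limit theorem at work: the standard deviation of a mean of 20 samples is the population standard deviation divided by sqrt(20). A quick simulation, using the population parameters stated earlier (means 38.0 and 38.4, sd 0.5):

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 20, 100_000

control = rng.normal(38.0, 0.5, size=(reps, n))
treat = rng.normal(38.4, 0.5, size=(reps, n))

# spread of the means of each 20-sample draw, vs spread of the raw samples
means_c = control.mean(axis=1)
means_t = treat.mean(axis=1)
print(means_c.std(), control.std())  # approx 0.5 / sqrt(20) vs approx 0.5
print(means_t.std(), treat.std())
```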
So, as far as I can understand it, the traditional statistical approach for significance/hypothesis testing is based on a theory that really only deals with “the long run”, i.e. the notion is that “should you repeat the experiment 100 times, you can expect to get the results you calculated 96 times”. And btw, you’d better have a large enough sample size too.
Now, that’s not necessarily a bad thing – it’s probably as good as it can get in fact – after all, we are dealing with randomness and variability, and prediction is difficult, particularly about the future. But I believe that very many people, seeing a statement such as “with 95% confidence”, give far too much weight to that statement in their decision making. After all, chances are your result was produced by a freak random event, a test result that is far from the beloved mean of means.
One way to think about it: say you have a gun with a 100-round barrel, holding 4 live rounds, the rest of them blanks. Would you aim at your head and pull the trigger, because the confidence interval is 96%…?
Perhaps a more informative, and honest, way to present statistical findings is to explicitly show the uncertainty in the results. The graph below, obtained by Bayesian inference on the same problem, is a two-dimensional “heatmap”, showing the uncertainty in both the expected value and its standard deviation.
The heatmap clearly shows how much uncertainty there is in our results – the stronger the red color, the more likely it is that our true parameter value resides there. But as can be seen, the uncertainty in expected value from our 20 samples ranges from 38.0 to 38.7, and similarly the standard deviation ranges from 0.3 to 0.8.
To me, this presentation is more honest, since it actually shows that our test has a significant uncertainty, and that perhaps we should increase our sample size before we take any action on the results.
I have now updated both the data and my Bayesian inference model: the data by adding the numbers from all the other polling institutes, and the model now runs in Stan, a domain-specific language for statistical analysis.
After a run over all published opinion polls since the 2014 election, in total 290 polls per party, the election forecast is ready:
Kristdemokraterna is the only party that, at a 95% confidence interval, will fall out of the Riksdag; both MP and L currently look set to stay in, with a vote share of about 5%.
As for the blocs, Alliansen will get 35% (with KD outside the Riksdag), and the Red-Greens will get 39%. If KD should nonetheless manage to stay in the Riksdag, which they very well might with the help of “Comrade 4%” tactical voting, it would be a dead heat between the blocs.
But as said: the polling data to date clearly indicates that KD will not get above 4% of the votes, and thus Alliansen will have a very hard time winning the election.
Sverigedemokraterna get around 18-19%.
Just saw that a Facebook group is running a poll, “Which party would you vote for in the municipal election today?” Since the political situation in my home municipality is “slightly” turbulent, I thought it would be fun to run a Bayesian inference analysis given the poll results.
As priors I use the various parties’ results in the 2014 election, and I keep the priors relatively “loose”. In addition, Sverigedemokraterna have appeared on the scene since 2014, and as they had no (published) results in the 2014 municipal election, I have set their prior to a default value of 1%.
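For multi-party vote shares, this kind of update has a simple conjugate form: a Dirichlet prior (whose concentration controls how “loose” it is) updated with the multinomial poll counts. A sketch of that idea; all party shares and poll counts below are hypothetical placeholders, not the actual 2014 results or the actual Facebook poll:

```python
import numpy as np

rng = np.random.default_rng(7)

# hypothetical 2014 municipal vote shares -- illustrative numbers only
prior_share = {"S": 0.35, "M": 0.25, "C": 0.10, "L": 0.08,
               "KD": 0.07, "V": 0.08, "MP": 0.06, "SD": 0.01}
# a "loose" prior: a small concentration lets the poll data dominate
concentration = 20
alpha = {p: s * concentration for p, s in prior_share.items()}

# hypothetical Facebook poll counts -- stand-ins for the real poll
poll = {"S": 30, "M": 20, "C": 5, "L": 4, "KD": 3, "V": 10, "MP": 4, "SD": 24}

# conjugate update: Dirichlet prior + multinomial counts -> Dirichlet posterior
post_alpha = np.array([alpha[p] + poll[p] for p in prior_share])
draws = rng.dirichlet(post_alpha, size=10_000)
for party, mean_share in zip(prior_share, draws.mean(axis=0)):
    print(f"{party}: {mean_share:.1%}")
```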
Assume we are in the process of doing evidence-based testing of two competing strategies, “A” and “B”, and we want to evaluate these competing strategies over several days, weeks, months, or whatever timespan we have decided upon. That is, we want to do our analysis by gathering multiple data points from our test population over a period of time, one data point per unit of time. For instance, we might record each day how many calls or site visits we get, or we might be interested in opinion poll results. We then also keep track of how many of the calls/visits/polls etc. result in a successful business outcome, whatever that is, e.g. a closed sale, a successful conversion of a web visit to a signup, or a vote for a specific party.
Below a trivial example of such a record of data points for our two strategies:
data_A = [(100,55),(50,30),(75,25),(132,30),(40,10),(70,10),(130,30)]
data_B = [(100,25),(32,8),(87,12),(10,2),(57,20),(88,32),(140,88)]
The first number in each tuple is the number of calls/visits etc, the second number is the number of successful outcomes. So, looking at the first data point for strategy A, we can see that 55 of 100 “visits” resulted in a success outcome, so we have a rate of success of 55% for that day (or whatever unit of time we are working with).
Let’s say the 7 data points above, for our two strategies, represent test data gathered during a week.
Now, given that data: can we make an informed decision regarding which of our two strategies looks most promising?
One thing we might try, is to compare the percentages of successes:
In the topmost graph, we have the daily success ratios for both strategies. Looking at the two solid red and green lines, it looks like strategy B is a clear winner, right…? Not necessarily.
Still looking at the top graph, the dashed lines represent the mean, or average, of the success rates for the entire week. Looking at those lines, we see that the red strategy, strategy A, is performing a bit better, overall.
The reason the two trendlines and the mean lines give us contradictory results is that strategy A started at a high success rate but then fell, while strategy B started low but then caught up.
So, which strategy is better…? Frankly, I would not be able to tell from the data presented thus far. Since we are primarily interested not in the (temporary?) trend, but in the overall outcome of a full week of testing, we should perhaps lean more towards using the mean values, and in that case, strategy A looks like a winner, despite the negative trend.
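The week-level means are easy to verify directly from the data above:

```python
data_A = [(100, 55), (50, 30), (75, 25), (132, 30), (40, 10), (70, 10), (130, 30)]
data_B = [(100, 25), (32, 8), (87, 12), (10, 2), (57, 20), (88, 32), (140, 88)]

# daily success ratios, and their plain average over the week
ratios_A = [s / n for n, s in data_A]
ratios_B = [s / n for n, s in data_B]
mean_A = sum(ratios_A) / len(ratios_A)
mean_B = sum(ratios_B) / len(ratios_B)
print(round(mean_A, 3), round(mean_B, 3))  # A edges out B on this measure
```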
Perhaps Bayesian inference can bring some clarity into the decision. The bottom graph shows the results after Bayesian updating of the same data. This time, strategy B is the winner. Why the difference between the two approaches…?
The difference is that in our first method, using the means of the two strategies, very little information is pushed forward from one day’s measurements to the next day’s: the data from each individual day stands pretty much isolated from what happened before, and is not much impacted by the previous measurements. On the other hand, using Bayesian updating, the information gained from each previous set of data, i.e. each day of measurement, is automatically “merged” into subsequent data. In other words, using Bayes, we use the information gained in previous steps to “nudge” any new data in the direction indicated by the “old” data.
Another way to express this is that the process of averaging the daily data discards available information, while the Bayesian approach preserves previously gained information, and takes it into account when processing new information.
Below the resulting distributions obtained by Bayesian Inference:
Here we can see that strategy B is clearly more likely to perform better than A.
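The post doesn’t include the updating code, but with a Beta prior on the success rate, the day-by-day updating is a one-line update per day, and the resulting posteriors can be sampled to estimate how likely it is that B beats A. A minimal sketch (the Beta(1, 1) starting prior is an assumption):

```python
import numpy as np

rng = np.random.default_rng(3)

data_A = [(100, 55), (50, 30), (75, 25), (132, 30), (40, 10), (70, 10), (130, 30)]
data_B = [(100, 25), (32, 8), (87, 12), (10, 2), (57, 20), (88, 32), (140, 88)]

def posterior(data, a=1, b=1):
    # Beta(1, 1) prior; each day's (trials, successes) updates the posterior,
    # so information is carried forward from day to day
    for trials, successes in data:
        a += successes
        b += trials - successes
    return a, b

a_A, b_A = posterior(data_A)
a_B, b_B = posterior(data_B)

# sample the two posteriors and estimate P(B beats A)
s_A = rng.beta(a_A, b_A, size=100_000)
s_B = rng.beta(a_B, b_B, size=100_000)
print((s_B > s_A).mean())
```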