Bayesian Inference – the dangers of “small data” combined with a mis-informed Prior

Continuing my previous example of trying to figure out the true proportion of blue marbles in a bag:

Previously, we used a non-informative, uniform prior for our inference. This time, let’s compare that non-informed prior with an informed one – here, a Beta-distribution-based prior, albeit centered slightly wrong. That is, we will use a “mis-informed” prior to illustrate the effect it has on our inference.

The true proportion of blue marbles is 25% (which of course our inference engine does not know).

Let’s start with a very limited amount of data, 4 data points, i.e. we have pulled 4 marbles from our bag:

[Figure: priors and posteriors, 4 data points]

Topmost, we have our old non-informed, uniform prior. Below it, the (mis-)informed prior, centered around a ratio of about 0.35. At the bottom are the resulting posterior beliefs.

We can immediately see that while the non-informed, uniform prior performs quite well, its posterior peaking at about 0.20, the (mis-)informed prior forces the posterior belief towards higher values, that posterior peaking around 0.30. It’s the “gravitational pull” of the mis-informed prior, centered at about 0.35, that drags its posterior towards values that are too high; the non-informed prior exercises no such pull.
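For the curious, here’s a minimal sketch of the idea using conjugate Beta-Binomial updating instead of the MCMC run behind the plots. The Beta(7, 13) shape is an assumed stand-in for the mis-informed prior – only its ~0.35 center is taken from the figure above:

```python
# A minimal sketch via conjugate Beta-Binomial updating (the plots above
# come from a PyMC model, so exact numbers will differ a little).
# Data: 1 blue marble in 4 draws (the true ratio is 25%).
# The Beta(7, 13) prior shape is my assumption; only its ~0.35 mean
# matches the mis-informed prior in the figure.
import numpy as np
from scipy.stats import beta

n, blues = 4, 1
uniform_post = beta(1 + blues, 1 + n - blues)   # Beta(1,1) prior -> Beta(2,4)
misinf_post = beta(7 + blues, 13 + n - blues)   # Beta(7,13) prior -> Beta(8,16)

grid = np.linspace(0, 1, 501)
print("uniform posterior mode     :", grid[np.argmax(uniform_post.pdf(grid))])
print("mis-informed posterior mode:", grid[np.argmax(misinf_post.pdf(grid))])
# ~0.25 vs ~0.32: the mis-informed prior drags the posterior upwards
```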

But what happens if we increase the number of samples…?

[Figure: priors and posteriors, 64 data points]

Here, the number of samples is 64. Now we can see that the “erroneous pull” from our mis-informed prior diminishes with increasing sample size: the two posteriors are almost identical. The informed posterior is still pulled slightly towards higher values, but negligibly so for any practical purposes.
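The same conjugate shortcut (again assuming a Beta(7, 13) shape for the mis-informed prior) shows the gap between the two posterior modes closing as the sample grows:

```python
# The gap between the two posterior modes shrinks as data accumulates.
# Beta(7, 13) for the mis-informed prior is an assumed shape (mean 0.35).
import numpy as np
from scipy.stats import beta

grid = np.linspace(0, 1, 501)
for n in (4, 16, 64, 256):
    blues = n // 4                          # the data keeps a 25% blue ratio
    uni_mode = grid[np.argmax(beta(1 + blues, 1 + n - blues).pdf(grid))]
    mis_mode = grid[np.argmax(beta(7 + blues, 13 + n - blues).pdf(grid))]
    print(f"n={n:3d}  uniform: {uni_mode:.2f}  mis-informed: {mis_mode:.2f}")
```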

Lessons learned:

  1. Avoid using MIS-informed priors – i.e. don’t let your “it should be so” prejudices ruin otherwise sound inferences. It’s better to use a non-informed (e.g. uniform) prior than to jump to conclusions with a “sharper” prior that happens to be wrong.
  2. By adding data you can reduce the impact any prior has on your inference.

Why (sample) size matters – Bayesian Inference

[The following example is adapted from Richard McElreath’s excellent “Statistical Rethinking”]

Let’s say you have bought a bag of marbles. The marbles in the bag are either blue or white. But you don’t know the proportions, that is, the ratio of blue to white.

So you would like to get an estimate on that proportion.

Since there are very many marbles in the bag, we don’t want to pull out each of them; we want to be efficient and take a sample. The question then becomes: how large a sample is “good enough” for us to be able to make a statement about the proportion with some confidence…?

So, let’s create a bag with marbles where we know the exact proportion – in this case 1 blue for every 3 white, i.e. 25% of the marbles are blue, 75% white, and see how many samples we need to pull before we can be “fairly certain” about the true proportion.

Let’s also (for simplicity) say that we pull out the samples in the order [W, B, W, W], repeatedly.

Let’s check it out by using Bayesian Inference, first with a Uniform Prior, then with an informed Prior:

Let’s first pull out 4 marbles (after having shaken the bag for a while) and see what Bayes says using a uniform prior:

[Figure: uniform prior and posterior after 4 data points]

The top graph shows our uniform prior belief – we assign equal probability to all possible blue/white proportions. The bottom graph shows the posterior, the updated belief after Bayes has seen our 4 data points, i.e. the color of each of the 4 marbles pulled. Already with 4 data points, Bayes is able to figure out that the most likely proportion is around 25%, but the uncertainty is quite large: the vertical red dashed lines show the 25th, 50th and 75th percentiles, meaning there is a 50% chance that the proportion is below about 0.18 or above about 0.44 – or, put the other way, a 50% probability that the true proportion lies between 0.18 and 0.44.
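For reference, here’s a minimal grid-approximation sketch of this inference, in the spirit of Statistical Rethinking (the plots here come from PyMC, so the exact percentiles will differ slightly):

```python
# Minimal grid-approximation sketch: uniform prior, 1 blue in 4 draws.
import numpy as np

p = np.linspace(0, 1, 1000)                  # candidate blue proportions
prior = np.ones_like(p)                      # uniform prior
blues, n = 1, 4                              # [W, B, W, W]

posterior = prior * p**blues * (1 - p)**(n - blues)
posterior /= posterior.sum()

# Percentiles: sample proportions from the posterior, take quantiles.
samples = np.random.choice(p, size=10_000, p=posterior)
print(np.percentile(samples, [25, 50, 75]))  # roughly [0.19, 0.31, 0.45] here
```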

Let’s increase the sample size to 12 data points:

[Figure: posterior after 12 data points]

Already with 12 data points, the posterior belief has become sharper – the area of uncertainty has decreased, and 50% of the probability mass now lies within 0.20–0.35.

Let’s increase the number of samples a bit more, to 40:

[Figure: posterior after 40 data points]

Now the uncertainty has been reduced further: with 50% probability, the true proportion lies between 0.22 and 0.31.

Let’s increase the number of samples even more, to 400:

[Figure: posterior after 400 data points]

Now, with 400 samples, the inference is pretty much certain that the true proportion lies between 0.24 and 0.26; the area of uncertainty has been greatly reduced. All of this starting from an “uninformed”, uniform prior.

So, the more samples, the more “precise” the inference, and the less uncertainty remains.
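A quick numerical check of that shrinkage, using the conjugate Beta posterior as a shortcut (the post itself uses PyMC):

```python
# Width of the 50% interval (25th-75th percentile) vs sample size,
# via the conjugate Beta posterior under a uniform Beta(1,1) prior.
from scipy.stats import beta

for n in (4, 12, 40, 400):
    blues = n // 4                           # [W,B,W,W] repeated: 25% blue
    q25, q75 = beta(1 + blues, 1 + n - blues).ppf([0.25, 0.75])
    print(f"n={n:3d}  50% interval: {q25:.2f}-{q75:.2f}  width: {q75 - q25:.2f}")
```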

Next time, let’s look at what happens when using an informed prior.


Global Warming – soon in a place near you

Just downloaded hourly temperature data from SMHI, the Swedish Meteorological & Hydrological Institute, taken at Svenska Högarna, one of the islands in the remote Stockholm archipelago, for the years 1949–2018.

The plots below present max/min/mean daily temps for the summer months of June-August.

From the data, it’s clear that temperatures are rising – e.g. for the max temp, the regression slope (beta) is 0.06 °C/year, meaning an increase of about 3 °C over 50 years.
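For illustration, a rough sketch of such a trend fit – note that the arrays below are synthetic stand-ins, not the actual SMHI series:

```python
# Hedged sketch of the trend estimate: OLS slope of yearly max temps vs year.
# `max_temps` here is synthetic stand-in data, not the actual SMHI series.
import numpy as np

years = np.arange(1949, 2019)                # 1949-2018
rng = np.random.RandomState(1)
max_temps = 22 + 0.06 * (years - 1949) + rng.normal(0, 1.5, years.size)

slope, intercept = np.polyfit(years, max_temps, deg=1)
print(f"beta = {slope:.3f} C/year  ->  ~{slope * 50:.1f} C per 50 years")
```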

[Figures: Svenska Högarna summer temperatures; regression fit]


Using Bayesian Inference to predict and bet on Italian Serie A Football

As my old-timer readers know, I’ve been using Bayesian inference to predict and bet on various sporting events, such as the FIFA World Cup and the IIHF World Championships. With some success.

When the Italian premier division started about a month ago, I wanted to see whether my prediction engine and betting strategies would also work for a league season, not just for tournaments limited in time and space.

So, below are some stats and prediction + betting outcome results over the first 4 rounds of the season:

First, some raw series statistics:

[Figure: stats_by_ascending – raw series statistics]

Next, betting strategy success/failure status:

[Figure: betting_strategy – strategy outcomes]

Finally, a Bayesian regression plot of the predictive engine’s accuracy:

[Figure: regression_dn_ranking_vs_points]

For those of you who would like to learn more about this project – have a look at this Facebook Group where updates are posted after each round of games.


Bayesian Multi-predictor Regression – Valet2018

[Continuing my exploration of the Swedish election results – but I thought this might be of interest also to those of you not particularly interested in Swedish elections, simply because of the potential statistical insights. Thus, the text is in English…]

The data presented here is the official preliminary election result data from Valmyndigheten, combined with data from SCB on general population characteristics, and covers all 290 Swedish municipalities (“kommun”).

So, let’s start by plotting a couple of potentially interesting parameters from the results of the recent elections. In this post, to get us started, I’ll focus on the results for a single political party, “Moderaterna”, but I have the results for all the other parties and might publish them at some later point.

[Heads-up: you are about to see a couple of very busy graphs, but stay with me, because these busy graphs actually reveal quite a bit of interesting info…]

A plot of the share of votes for M, from 3 different perspectives: median income, education level, and ratio of foreign-born inhabitants:

[Figure: M vote share vs education, income, and foreign-born ratio]

So, what do we actually see here…? First, there are 3 different plots in the graph; each dot represents a specific municipality:

  1. Ratio of votes for M (y-axis) plotted over the ratio of inhabitants with a high level of education (x-axis) – red dots and regression lines.
  2. Ratio of votes for M plotted over median income – green dots and regression lines.
  3. Ratio of votes for M plotted over the ratio of foreign-born inhabitants – magenta dots and regression lines.

Secondly, the axes are not in absolute values, but scaled to center the values on both axes. That means that on either axis, the 0 point represents the mean (average) value for the parameter; thus, any point sitting at (0,0) has the mean value in both dimensions presented.

If we first focus on the green dots, representing share of votes over median income within the municipalities, we can see that the dots are fairly tightly clustered around 0 in the x-dimension, revealing that there are no major differences in median income levels between Swedish municipalities. If you compare the x-dimension clustering of the green dots vs the red (ratio of inhabitants with a high education level) or the magenta (ratio of foreign-born inhabitants), you see that the green dots sit within about -0.25 to +0.30 on the x-scale, meaning that income varies in the range of -25% to +30% from the average.
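If the scaling is relative deviation from the mean – which the -25%/+30% reading suggests, though the exact transform isn’t spelled out here – it would look something like this:

```python
# Hedged guess at the scaling: each value expressed as its relative
# deviation from the mean, so 0 = average and +0.30 = 30% above average.
import numpy as np

def center(x):
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.mean()

# e.g. center(median_income_per_kommun) would give the green dots' x-values
```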

From the three corresponding regression lines, we can see that all three parameters have a positive slope, the first two significantly so, meaning that an increase in x should result in an increase in y. That is: the higher the income, the more votes for M; the higher the ratio of highly educated folks, the more votes for M. From the slopes we might suspect that the economic (income) factor is a key determinant of whether folks vote for M or not. But: perhaps some of these parameters are inter-related…?

Let’s run a multi-predictor regression to find out:

Multi-predictor regression:

[Figure: multi-predictor Bayesian regression for M]

Here, we are still dealing with the same party and the same data, but now we have combined the 3 parameters (education, income, foreign-born) into a single multi-predictor regression, represented by the orange area, with the black dashed line representing the mean regression line. A couple of things to note: here I’ve run a Bayesian regression, while the regression lines in the previous graph were non-Bayesian, just standard linear least squares. Since Bayesian methods deal with probability distributions, while more traditional (“frequentist”) methods deal with point estimates, we can explicitly show the level of uncertainty in the data and analysis – the orange area is in fact a whole bunch of regression lines, clustered more or less on top of each other, illustrating the region of uncertainty. Furthermore, the “baby blue” area below is the 89% CI (“Credible Interval”), further illustrating the uncertainty in the result.
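For those curious about what such a model might look like in code, here’s a hedged PyMC3-style sketch – the priors, variable names, and synthetic stand-in data are all my assumptions, not the actual model behind the figure:

```python
# A hedged PyMC3 sketch of a 3-predictor Bayesian regression; priors, names,
# and the synthetic stand-in data are assumptions, not the post's own model.
import numpy as np
import pymc3 as pm

rng = np.random.RandomState(0)
n = 290                                          # the Swedish municipalities
edu, income, foreign = rng.normal(size=(3, n))   # stand-ins for centered predictors
votes_m = 0.20 + 0.05 * income + 0.02 * edu + rng.normal(0, 0.03, size=n)

with pm.Model():
    alpha = pm.Normal("alpha", mu=0, sd=1)
    b_edu = pm.Normal("b_edu", mu=0, sd=1)
    b_inc = pm.Normal("b_inc", mu=0, sd=1)
    b_for = pm.Normal("b_for", mu=0, sd=1)
    sigma = pm.HalfNormal("sigma", sd=1)

    mu = alpha + b_edu * edu + b_inc * income + b_for * foreign
    pm.Normal("votes", mu=mu, sd=sigma, observed=votes_m)

    trace = pm.sample(2000, tune=1000)

print(pm.summary(trace))   # posterior for each coefficient: the "orange area"
```

Drawing many regression lines from `trace` and shading an 89% interval would, roughly, reproduce the orange band and the “baby blue” CI in the figure.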

What we see in the graph is that the income parameter is in fact the dominant force in the regression. Another way to state that: of the three parameters measured, income is the most important one for determining whether people vote for M or not.

There’s a whole bunch of other, more technically oriented info in the graph, but let’s just stop here for now, and contemplate the major finding: economy is the prime factor determining whether to vote for M or not… 🙂


Val2018 – top-50 & bottom-50 municipalities per party

“I want more elections
Give me more elections
I want more elections
Give me more elections
A thousand stars twinkling
Glitter as far as I can see
Of election lights gleaming
I want more..”

Wasn’t that how they sang it, Adolphson & Falk…? 🙂

Top-50:

[Figures: top-50 municipalities per party – C, FI, KD, L, M, MP, S, SD, V]

Bottom-50:

[Figures: bottom-50 municipalities per party – C, FI, KD, L, M, MP, S, SD, V]


Val2018 – correlations between voting and income/education level

This blog is accumulating quite a few analyses of the election results, so here comes another one:

A regression analysis of the election data (from Valmyndigheten) and the population data (from SCB): Bayesian linear regression on vote share per party vs the ratio of highly educated inhabitants (at least 3 years of academic education), and on vote share per party vs median income.

In both graphs below, the CI (“Credible Interval”) is set to 89% and shown as the “baby blue” band.

In both cases, the data covers all Swedish municipalities.

Vote share vs ratio of highly educated:

[Figure: regression, education vs votes]

Vote share vs median income:

[Figure: regression, income vs votes]


Val2018 – top-50 & bottom-50 of all electoral districts in Sweden, for all parties

As a complement to the two earlier posts [1, 2], which covered the electoral districts within Stockholm municipality, here are the top-50 & bottom-50 of all the 6004 electoral districts included in the preliminary election-night results.

Top-50:

[Figures: top-50 districts per party – C, FI, KD, L, M, MP, S, SD, V]

Bottom-50:

[Figures: bottom-50 districts per party – C, FI, KD, L, M, MP, S, SD, V]


Val2018 – each party’s worst electoral districts in Stockholm municipality

The previous post showed each party’s top-50; here are each party’s bottom-50.

[Figures: bottom-50 districts per party – C, FI, KD, L, M, MP, S, SD, V]


Val2018 – each party’s top-50 electoral districts in Stockholm municipality

[And the bottom-50 can be found here]

[Figures: top-50 districts per party – S, V, MP, M, L, KD, FI, C, SD]
