Using Bayesian Inference to predict and bet on Italian Serie A Football

As my long-time readers know, I've been using Bayesian inference to predict and bet on various sporting events, such as the FIFA World Cup and the IIHF World Championships, with some success.

When the Italian top division started about a month ago, I wanted to see whether my prediction engine and betting strategies would also work for a league season, not just for tournaments limited in time and space.

So, below are some stats, plus prediction and betting results, from the first 4 rounds of the league:

First, some raw league statistics:

 

stats_by_ascending.jpg

Next, the success/failure status of the betting strategy:

betting_strategy

Finally, a Bayesian regression plot of the predictive engine's accuracy:

regression_dn_ranking_vs_points

For those of you who would like to learn more about this project, have a look at this Facebook group, where updates are posted after each round of games.

Posted in Bayes, Data Analytics, Gambling, Machine Learning, Numpy, Probability, PYMC, Python, Simulation, Statistics

Bayesian Multi-predictor Regression – Valet2018

[Continuing my exploration of the Swedish election results. I thought this might also be of interest to those of you who are not very interested in the Swedish elections, simply because of the potential mathematical-statistics insights, so the text is in English…]

The data presented here is the official preliminary election result data from Valmyndigheten, combined with data from SCB on general population characteristics, and covers all 290 Swedish municipalities ("kommuner").

So, let's start by plotting a couple of potentially interesting parameters from the results of the recent elections. In this post, to get us started, I'll focus on the results for a single political party, "Moderaterna" (M), but I have the results for all the other parties and might publish them at some later point.

[Heads-up: you are about to see a couple of very busy graphs, but stay with me, because these busy graphs actually reveal quite a bit of interesting info…]

Plot of the share of votes for M from 3 different perspectives: median income, education level, and ratio of foreign-born inhabitants:

M_data_point_plot

So, what do we actually see here…? First, there are 3 different plots on the graph, and each dot represents a specific municipality:

  1. Ratio of votes for M (y-axis) plotted over ratio of inhabitants with a high level of education (x-axis): red dots and regression line.
  2. Ratio of votes for M plotted over median income: green dots and regression line.
  3. Ratio of votes for M plotted over ratio of foreign-born inhabitants: magenta dots and regression line.

Secondly, the axes do not show absolute values; they are scaled so that the values are centered on both axes. That means that on either axis the 0 point represents the mean (average) value of the parameter, so any point sitting at (0,0) has the mean value in both dimensions.

If we first focus on the green dots, representing share of votes over median income within the municipalities, we can see that they are fairly tightly clustered around 0 in the x-dimension, revealing that there are no major differences in median income between Swedish municipalities. Compared with the red dots (ratio of inhabitants with a high level of education) or the magenta dots (ratio of foreign-born inhabitants), the green dots span a narrower range: roughly -0.25 to 0.30 on the x-scale, meaning that median income varies from about 25% below to 30% above the average.
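To make the scaling concrete, here is a minimal sketch of one transformation that would be consistent with reading -0.25 as "25% below the average": each value's deviation from the mean, divided by the mean. This is an assumption on my part, since the exact scaling is not spelled out here; the function name `center_relative` and the income numbers are made up for illustration.

```python
import numpy as np

def center_relative(values):
    """Return each value's deviation from the mean, as a fraction of the mean."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    return (values - mean) / mean

# Hypothetical median incomes (made-up numbers, not the SCB data)
median_income = np.array([240.0, 300.0, 320.0, 340.0])
scaled = center_relative(median_income)

# After scaling, 0 marks the mean and -0.2 means "20% below the mean"
print(scaled)
```

With this kind of scaling, parameters measured in very different units (kronor, percentages) can share one pair of axes.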

From the three corresponding regression lines, we can see that all three parameters have a positive slope, the first two markedly so, meaning that an increase in the x-value goes with an increase in the y-value: the higher the income, the more votes for M; the higher the ratio of people with a high level of education, the more votes for M. From the slopes we might suspect that the economic (income) factor is a key determinant of whether people vote for M or not. But perhaps some of these parameters are inter-related…?
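Why inter-related parameters matter can be shown with a small sketch on synthetic data (all numbers below are made up, not the election data). Here the vote share is constructed to depend on income only, yet a separate least-squares fit against education still shows a clearly positive slope, simply because education is correlated with income.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 290  # same count as the number of municipalities

# Synthetic predictors: education is built to correlate with income
income = rng.normal(0.0, 0.1, n)
education = 0.8 * income + rng.normal(0.0, 0.05, n)

# Synthetic outcome: votes depend on income only, by construction
votes = 1.5 * income + rng.normal(0.0, 0.05, n)

# Separate single-predictor least-squares fits (polyfit returns slope, intercept)
slope_income, _ = np.polyfit(income, votes, 1)
slope_education, _ = np.polyfit(education, votes, 1)

# Correlation between the two predictors
corr = np.corrcoef(income, education)[0, 1]

print(slope_income, slope_education, corr)
```

A single-predictor slope therefore cannot tell us, on its own, which parameter is doing the work.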

Let’s run a multi-predictor regression to find out:

Multi-predictor regression:

val2018_multi_reg_M

Here, we are still dealing with the same party and the same data, but now we have combined the 3 parameters (education, income, foreign-born) into a single multi-predictor regression, represented by the orange area, with the black dashed line representing the mean regression line. A couple of things to note: here I've run a Bayesian regression, while the regression lines in the previous graph were non-Bayesian, just standard linear least squares. Since Bayesian methods deal with probability distributions, while more traditional ("Frequentist") methods deal with point estimates, we can here explicitly show the level of uncertainty in the data and analysis: the orange area is in fact a whole bunch of regression lines, clustered more or less on top of each other, thereby illustrating the area of uncertainty. Furthermore, the "baby blue" area is the 89% CI ("Credible Interval"), further illustrating the level of uncertainty in the result.

What we see here is that the income parameter is in fact the dominant force in the regression. Put another way: of the three parameters measured, income is the most important one for determining whether people vote for M or not.
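For readers who want to see the idea of "a whole bunch of regression lines" in code, here is a minimal sketch. Note that it is not the MCMC model behind the graph: to stay dependency-free it swaps in the closed-form (conjugate Gaussian) posterior for linear regression, which assumes a known noise level and a zero-mean Gaussian prior on the weights. The data, `true_w`, `sigma` and `tau` are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 290

# Synthetic centered predictors; columns: education, income, foreign-born
X = rng.normal(0.0, 1.0, size=(n, 3))
true_w = np.array([0.1, 0.6, 0.05])   # hypothetical: income dominates
sigma = 0.5                           # noise std, assumed known in this sketch
y = X @ true_w + rng.normal(0.0, sigma, n)

tau = 10.0                            # prior std of each weight (weak prior)

# Conjugate Gaussian posterior over the regression weights
precision = X.T @ X / sigma**2 + np.eye(3) / tau**2
cov = np.linalg.inv(precision)
cov = (cov + cov.T) / 2               # remove tiny numerical asymmetry
mean = cov @ X.T @ y / sigma**2

# Each posterior draw is one plausible set of regression coefficients
draws = rng.multivariate_normal(mean, cov, size=2000)

# 89% credible interval per coefficient: 5.5th to 94.5th percentile
lower, upper = np.percentile(draws, [5.5, 94.5], axis=0)

for name, lo, hi in zip(["education", "income", "foreign-born"], lower, upper):
    print(f"{name}: [{lo:.3f}, {hi:.3f}]")
```

Plotting one line per row of `draws` would give a band of overlapping regression lines, and the interval bounds give the credible band around it.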

There’s a whole bunch of other, more technically oriented info in the graph, but let’s just stop here for now, and contemplate the major finding: economy is the prime factor determining whether to vote for M or not… 🙂

Posted in Bayes, Big Data, Data Analytics, Data Driven Management, Numpy, Politik, Probability, PYMC, Python, Research, Society, Statistics, Sverige

Val2018 – top 50 & bottom 50 municipalities per party

“I want more elections
Give me more elections
I want more elections
Give me more elections
A thousand stars twinkling
Glitter as far as I can see
Of election lights that gleam
I want more..”

Wasn't that how they sang, Adolphson & Falk…? 🙂

Top-50:

val2018_riks_top_50_kommun_C

val2018_riks_top_50_kommun_FI

val2018_riks_top_50_kommun_KD

val2018_riks_top_50_kommun_L

val2018_riks_top_50_kommun_M

val2018_riks_top_50_kommun_MP

val2018_riks_top_50_kommun_S

val2018_riks_top_50_kommun_SD

val2018_riks_top_50_kommun_V

Bottom-50:

val2018_riks_bottom_50_kommun_C

val2018_riks_bottom_50_kommun_FI

val2018_riks_bottom_50_kommun_KD

val2018_riks_bottom_50_kommun_L

val2018_riks_bottom_50_kommun_M

val2018_riks_bottom_50_kommun_MP

val2018_riks_bottom_50_kommun_S

val2018_riks_bottom_50_kommun_SD

val2018_riks_bottom_50_kommun_V

Posted in Data Analytics, Politik, Statistics, Sverige

Val2018 – relationship between voting and income/education level

There are starting to be quite a few analyses of the election result on this blog, so here comes yet another one:

A regression analysis of the election data (from Valmyndigheten) and the population data (from SCB): Bayesian linear regression of vote share per party vs. the share of highly educated inhabitants (at least 3 years of academic education), and of vote share per party vs. median income.

In both graphs below, the CI ("Credible Interval") is set to 89% and is shown by the "baby blue" band.

In both cases the data covers all of Sweden's municipalities.

Vote share vs. share with a high level of education:

regression_utb_rost

Vote share vs. median income:

regression_inkomst_roster

Posted in Bayes, Data Analytics, Politik, Probability, PYMC, Python, Society, Statistics, Sverige

Val2018 – top 50 & bottom 50 across all electoral districts in Sweden, for all parties

As a complement to the two earlier posts [1,2], which covered the electoral districts in Stockholm municipality, here are the top 50 and bottom 50 across all 6004 electoral districts included in the election night's preliminary results.
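A ranking like this can be produced with very little code. The following is a minimal sketch only: the helper `top_and_bottom`, the district names and the vote shares are all hypothetical, and it says nothing about how the actual charts were generated.

```python
import heapq

def top_and_bottom(shares, n=50):
    """Return the n highest and n lowest (district, share) pairs."""
    items = shares.items()
    top = heapq.nlargest(n, items, key=lambda kv: kv[1])
    bottom = heapq.nsmallest(n, items, key=lambda kv: kv[1])
    return top, bottom

# Hypothetical district names and vote shares for one party
shares = {"A": 0.31, "B": 0.12, "C": 0.45, "D": 0.08, "E": 0.27}
top, bottom = top_and_bottom(shares, n=2)

print(top)
print(bottom)
```

Using `heapq` avoids sorting all 6004 districts when only the 50 extremes at each end are needed, although a plain `sorted` call would work just as well at this size.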

Top-50:

val2018_riks_topp_50_C

val2018_riks_topp_50_FI

val2018_riks_topp_50_KD

val2018_riks_topp_50_L

val2018_riks_topp_50_M

val2018_riks_topp_50_MP

val2018_riks_topp_50_S

val2018_riks_topp_50_SD

val2018_riks_topp_50_V

Bottom-50:

val2018_riks_bottom_50_C

val2018_riks_bottom_50_FI

val2018_riks_bottom_50_KD

val2018_riks_bottom_50_L

val2018_riks_bottom_50_M

val2018_riks_bottom_50_MP

val2018_riks_bottom_50_S

val2018_riks_bottom_50_SD

val2018_riks_bottom_50_V

Posted in Data Analytics, Politik, Society, Statistics, Sverige

Val2018 – the parties' worst electoral districts within Stockholm municipality

The previous post showed the parties' top 50; this one shows their bottom 50.

val2018_bottom50_C

val2018_bottom50_FI

val2018_bottom50_KD

val2018_bottom50_L

val2018_bottom50_M

val2018_bottom50_MP

val2018_bottom50_S

val2018_bottom50_SD

val2018_bottom50_V

Posted in Data Analytics, Society, Statistics

Val2018 – the parties' top 50 electoral districts in Stockholm municipality

[And the bottom 50 can be found here]

val2018_top50_S

val2018_top50_V

val2018_top50_MP

val2018_top50_M

val2018_top50_L

val2018_top50_KD

val2018_top50_FI

val2018_top50_C

val2018_top50_SD

Posted in development

PYMC – Markov Chain Monte Carlo regression – canonical example

pymc_regression_canonical_data_gen

pymc_regression_canonical_reg_lines

pymc_regression_canonical_param_posteriors

Posted in Bayes, Data Analytics, Numpy, Probability, PYMC, Python, Statistics

Val2018 – voting patterns in Sweden's counties and the municipalities of Stockholm County

In earlier posts I described how my Bayesian inference election prediction fared (really well, thanks for asking, better than many professional pundits, in fact!) 🙂

This post presents some unprocessed "raw data" on the election outcome and the population in, first, all Swedish counties and, second, the municipalities of Stockholm County.

It also presents a regression analysis of the parties' election results in relation to the share of foreign-born inhabitants in each electoral area, since whether this has affected the election result has been, and still is, a topical subject of debate.

Since I consider the graphs below self-explanatory (at least for the primary audience of this blog), it is enough for me to point out an interesting difference in the regression plots between the Stockholm municipalities on the one hand and the country's counties on the other… That difference ought to be of interest to political scientists and other election researchers…

val2018_rostandel_per_lan

val2018_andel_utfodda_lan

val2018_regression_lan

 

val2018_andel_sth

Val2018_utlandsfodda_sth_kommuner

 

val2018_lan_vinnare

val2018_regression_Sth_kommun

val2018_kommun_vinnare

[Addendum]

After thinking for a while about why the regression lines go in opposite directions for the Stockholm municipalities and for the aggregation of the counties, I now believe I have the answer: quite simply, one or a couple of "outliers" at the municipality level have no great effect when the municipalities are aggregated into a larger unit. To verify this I ran a simulation:

val2018_regession_sim

The smaller red dots represent the municipalities within one county; the larger orange dots represent the counties of the country. The large red-orange dot is the mean of the municipalities shown as small dots, and thereby one of the values at the county level. In other words, it "belongs" to the set of orange dots in the lower part of the graph, as an "outlier".

Looking at the orange regression line, one can see that the outlier has some influence (the line runs above most of the dots because of the gravity of the outlier), but not enough to change the direction of the counties' regression.
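The same effect can be reproduced with a few lines of NumPy. This is a minimal sketch on synthetic data, not the simulation behind the graph: the slopes, noise levels and the split into 20 counties plus one extra county with 26 municipalities are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical county-level data: 20 counties with a negative trend
county_x = rng.uniform(0.05, 0.25, 20)
county_y = 0.30 - 0.5 * county_x + rng.normal(0.0, 0.01, 20)

# Municipalities inside one extra county, with a positive trend of their own
muni_x = rng.uniform(0.10, 0.40, 26)
muni_y = 0.15 + 0.6 * muni_x + rng.normal(0.0, 0.02, 26)

# That county enters the county-level data as a single point: its mean
all_x = np.append(county_x, muni_x.mean())
all_y = np.append(county_y, muni_y.mean())

# Least-squares slopes at each level (polyfit returns slope first)
muni_slope = np.polyfit(muni_x, muni_y, 1)[0]
slope_without = np.polyfit(county_x, county_y, 1)[0]
slope_with = np.polyfit(all_x, all_y, 1)[0]

print(muni_slope, slope_without, slope_with)
```

The municipality-level slope is positive, while the county-level slope stays negative even after the outlying county mean is added; the outlier only pulls the line part of the way.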

Posted in Bayes, Data Analytics, Numpy, Politik, Probability, PYMC, Python, Society, Statistics, Sverige

Val2018 – the parties' vote share per county

val2018_rostandel_per_lan
