Bayesian updating with PYMC

I’ve been looking for a neat way to update a Bayesian prior from a posterior sample for a while, and just the other day I found what I was looking for: a code example that shows how to turn a posterior sample into a distribution known to PYMC. The code for doing that, by jcrudy, can be found here. The key insight in that code is the “known to PYMC” part: it’s no problem to turn a posterior sample into a distribution, but in order to make it compatible with PYMC, it needs to be wrapped as a PYMC object. And that’s what jcrudy’s code does.
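I’m not reproducing jcrudy’s code here, but the general idea – estimating a density from posterior samples so it can serve as the next prior – can be sketched roughly as below. The Beta parameters and the use of scipy’s gaussian_kde are my stand-ins for illustration, not part of jcrudy’s actual implementation:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# stand-in for an MCMC posterior sample (e.g. a water/land ratio)
posterior_samples = rng.beta(14, 8, size=5000)

# estimate a density from the samples; jcrudy's code goes one step further
# and wraps the result as a proper PYMC stochastic, so it can act as the
# next prior in a PYMC model
density = stats.gaussian_kde(posterior_samples)
print(float(density(0.64)))  # density near the sample mean
```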

Anyway, below is a small example, borrowed from Richard McElreath’s excellent book “Statistical Rethinking”.

Let’s pretend that we want to know the ratio of water to land on our Earth, and that the way we’ll go about it is to toss an inflatable globe into the air a number of times, keeping track of whether we first touch land or water when we catch it.

Let’s say that after the first 9 tosses, our results look like below:

[1,0,1,0,1,0,1,1,1] where ‘1’ indicates water, and ‘0’ land.

The basic logic of our Bayesian inference is to update the prior each time we get a new data point, i.e. either a ‘1’ or a ‘0’.

Let’s furthermore say that we have an initial idea of the correct ratio being somewhere around 50%, but we are far from certain, so we will set a fairly “loose” initial prior: a Beta distribution centered on 50%, as below. This prior allows the calculated ratio to fall almost anywhere in the range 0% to 100%, with the highest probability around 50%.
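For this particular Beta–Binomial case, the sequential updating can also be done analytically, which makes a handy sanity check on the PYMC-based updating. A minimal sketch, assuming Beta(2, 2) as the loose prior (pick other shape parameters if you like):

```python
# conjugate Beta-Binomial updating: water ('1') bumps a, land ('0') bumps b
tosses = [1, 0, 1, 0, 1, 0, 1, 1, 1] * 2   # the 9 results, concatenated to 18

a, b = 2.0, 2.0                             # assumed loose prior: Beta(2, 2)
for t in tosses:                            # one update per data point
    a, b = a + t, b + (1 - t)

print(a / (a + b))                          # posterior mean: 14/22, about 0.64
```

With these assumed prior parameters, the final posterior mean lands at roughly 64%.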

[Figure: the initial Beta prior, centered on 50%]

Now, what we want to do is to update our prior belief – the one initially being centered around 50% – based on each subsequent data point.

Below are the results of 18 such updates (the vector of 9 points above is concatenated with itself to provide the 18 data points):

[Figure: the updated prior after each of the 18 data points]

This graph is read from left to right, and shows the updated prior, or current belief about the correct ratio, after each subsequent data point has been processed.

From the topmost left image, we can see that when the first data point, which happens to be a ‘1’, has been processed, the Bayesian model has adjusted its belief about the correct ratio somewhat to the right, from 50% to 55%, and after having processed the ‘0’ of the second data point, reduced the current belief back to 50%. And so on, as the remaining data points get processed.

Finally, after having processed all 18 data points, updating the prior for each point, the Bayesian model settles on a ratio of 64% (mean value). But the real benefit of Bayesian analysis is that we actually get a probability distribution as a result, not just a point estimate. Thus, the Bayesian approach preserves the underlying uncertainty of the model in a very neat, visually obvious way.

If you look closely at the 18 different plots above, you will see that the distribution gets narrower as each data point is processed. This means the Bayesian model is gradually getting more and more certain about the sought-after ratio, as it gets more data to base its estimate on.

Code below.



Poor Man’s Betting Shop – Using Bayesian Inference to set up your own Betting Shop

[Image: IIHF 2018 logo]

Further exploration of Bayesian inference, applied to the upcoming 2018 Ice Hockey World Championships. This time, I’m trying to understand how the professional betting shops set their odds, and how they make a profit. It took some ‘research’ into the gaming/betting vocabulary; for instance, the notion of “odds” is not unique: there are several types, including fractional, e.g. 3:1, and decimal, e.g. 2.0. In this article, I’m going to stick with the decimal form of odds, so below, whenever you see a number like 2.0 or 1.5, those numbers are decimal gambling odds, and the way to read them is that for every dollar you bet at those odds, you will get that number back, if you happen to win. So basically, your ‘return’, if you happen to win the bet, is your stake multiplied by the odds number. To make it bleeding obvious, let’s take an example:

You decide to bet 1$ on an outcome with 1.0 in odds. If you win, you’ll get your money back, nothing more. If instead the odds were set to 2.0, you would get 2$ back for your 1$ stake, i.e. a nice winning of 1$ (or 100%).
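The ‘stake times odds’ rule is trivial to express in code; a tiny helper (the names here are mine):

```python
def payout(stake, decimal_odds):
    """Total return on a winning bet: the stake multiplied by the decimal odds."""
    return stake * decimal_odds

print(payout(1.0, 1.0))  # 1.0 -> you just get your money back
print(payout(1.0, 2.0))  # 2.0 -> a winning of 1$, i.e. 100%
```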

So, with this now hopefully bleedingly clear and obvious, let’s see what my Bayesian inference engine – which at this point has been fed with nearly 1100 historical game results, as well as the current IIHF Rankings table – thinks about the upcoming World Championship group game outcomes:

Below are two tables, one for group A and one for group B. In each table, every game is listed with two lines: the first line shows the ‘raw’ odds, based on the actual probabilities for the game; the second line includes the ‘markup’ that any professional bookie must add to ensure a healthy business. So, to compare these odds with the ones from any professional betting shop, you should use line #2 for each game.
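Judging from the numbers, the second line of each game appears to be the first line divided by roughly 1.1, i.e. about a 9–10% margin. The 1.1 factor below is my inference from the tables, not a value taken from the program:

```python
MARGIN = 1.1  # assumed bookmaker 'markup' factor, inferred from the tables

def with_markup(raw_odds):
    """Scale raw odds down by the bookie's margin."""
    return round(raw_odds / MARGIN, 2)

print(with_markup(2.0))  # 1.82
```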

I compared the outcomes from my odds compilator, i.e. the Bayesian inference engine I’ve written in Python and PYMC, with the odds compiled by a betting broker on the net, and was pretty surprised to see that in many cases, the odds from my program and the ones on that site are not that far from each other… That’s surprising, because my program – at this point – only uses historical results and the current rankings to determine the probabilities of the game outcomes, and I’m pretty sure that professional betting shops put much more data into their odds compilation engines, e.g. the current team roster, form, home vs away game, venue, umpire, etc.

The most obvious differences I’ve been able to notice are two: first, my program appears to place more emphasis on the historical results than the betting shops do. Second, for teams with no previous world cup participation, like Korea, my program seems to put less emphasis on the rankings table than the professional shops do.

So, anyway: below are the results. Oh, by the way, a few games have a rightmost column in square brackets. Those are the odds obtained from the betting broker mentioned above. The reason there aren’t more of them is that the betting folks apparently don’t publish all their odds at once, but piece by piece.

BETTING ODDS PER GAME, GROUP A:
AUT-BLR  : HOME WIN: 545.45 DRAW: 130.43 AWAY WIN: 1.01
         : HOME    : 495.87 DRAW: 118.58 AWAY    : 0.92

AUT-CZE  : HOME WIN: 7500.00 DRAW: 2142.86 AWAY WIN: 1.00
         : HOME    : 6818.18 DRAW: 1948.05 AWAY    : 0.91

AUT-FRA  : HOME WIN: 5.02 DRAW: 5.95 AWAY WIN: 1.58
         : HOME    : 4.56 DRAW: 5.41 AWAY    : 1.44

AUT-RUS  : HOME WIN: 188.68 DRAW: 60.85 AWAY WIN: 1.02
         : HOME    : 171.53 DRAW: 55.32 AWAY    : 0.93

AUT-SUI  : HOME WIN: 9.25 DRAW: 10.31 AWAY WIN: 1.26
         : HOME    : 8.41 DRAW: 9.38 AWAY    : 1.14	[7.20 6.30 1.39]

AUT-SVK  : HOME WIN: 3000.00 DRAW: 389.61 AWAY WIN: 1.00
         : HOME    : 2727.27 DRAW: 354.19 AWAY    : 0.91

AUT-SWE  : HOME WIN: inf DRAW: 3750.00 AWAY WIN: 1.00
         : HOME    : inf DRAW: 3409.09 AWAY    : 0.91

BLR-CZE  : HOME WIN: 267.86 DRAW: 50.85 AWAY WIN: 1.02
         : HOME    : 243.51 DRAW: 46.22 AWAY    : 0.93

BLR-FRA  : HOME WIN: 1.80 DRAW: 5.90 AWAY WIN: 3.63
         : HOME    : 1.64 DRAW: 5.36 AWAY    : 3.30	[2.10 4.34 3.16]

BLR-RUS  : HOME WIN: 17.03 DRAW: 8.90 AWAY WIN: 1.21
         : HOME    : 15.48 DRAW: 8.09 AWAY    : 1.10

BLR-SUI  : HOME WIN: 9.57 DRAW: 5.16 AWAY WIN: 1.43
         : HOME    : 8.70 DRAW: 4.69 AWAY    : 1.30

BLR-SVK  : HOME WIN: 5.44 DRAW: 4.92 AWAY WIN: 1.63
         : HOME    : 4.95 DRAW: 4.48 AWAY    : 1.48

BLR-SWE  : HOME WIN: 24.15 DRAW: 12.18 AWAY WIN: 1.14
         : HOME    : 21.96 DRAW: 11.07 AWAY    : 1.04	[14.25,9.20,1.19]

CZE-FRA  : HOME WIN: 1.13 DRAW: 15.93 AWAY WIN: 19.96
         : HOME    : 1.02 DRAW: 14.48 AWAY    : 18.15

CZE-RUS  : HOME WIN: 3.54 DRAW: 3.46 AWAY WIN: 2.34
         : HOME    : 3.21 DRAW: 3.14 AWAY    : 2.12

CZE-SUI  : HOME WIN: 1.34 DRAW: 5.71 AWAY WIN: 12.85
         : HOME    : 1.22 DRAW: 5.19 AWAY    : 11.68

CZE-SVK  : HOME WIN: 1.27 DRAW: 6.89 AWAY WIN: 14.61
         : HOME    : 1.16 DRAW: 6.27 AWAY    : 13.28	[1.47 6.00 6.70]

CZE-SWE  : HOME WIN: 7.53 DRAW: 3.78 AWAY WIN: 1.66
         : HOME    : 6.85 DRAW: 3.44 AWAY    : 1.51

FRA-RUS  : HOME WIN: 63.56 DRAW: 28.01 AWAY WIN: 1.05
         : HOME    : 57.78 DRAW: 25.46 AWAY    : 0.96	[21.25 13.00 1.12]

FRA-SUI  : HOME WIN: 6.32 DRAW: 5.42 AWAY WIN: 1.52
         : HOME    : 5.75 DRAW: 4.93 AWAY    : 1.38

FRA-SVK  : HOME WIN: 7.36 DRAW: 7.70 AWAY WIN: 1.36
         : HOME    : 6.69 DRAW: 7.00 AWAY    : 1.24

FRA-SWE  : HOME WIN: 123.97 DRAW: 36.36 AWAY WIN: 1.04
         : HOME    : 112.70 DRAW: 33.06 AWAY    : 0.94

RUS-SUI  : HOME WIN: 1.07 DRAW: 18.82 AWAY WIN: 76.73
         : HOME    : 0.97 DRAW: 17.11 AWAY    : 69.75

RUS-SVK  : HOME WIN: 1.42 DRAW: 5.21 AWAY WIN: 9.55
         : HOME    : 1.29 DRAW: 4.74 AWAY    : 8.68

RUS-SWE  : HOME WIN: 1.59 DRAW: 3.89 AWAY WIN: 8.75
         : HOME    : 1.45 DRAW: 3.53 AWAY    : 7.95

SUI-SVK  : HOME WIN: 5.31 DRAW: 5.83 AWAY WIN: 1.56
         : HOME    : 4.83 DRAW: 5.30 AWAY    : 1.42

SUI-SWE  : HOME WIN: 37.59 DRAW: 9.00 AWAY WIN: 1.16
         : HOME    : 34.18 DRAW: 8.19 AWAY    : 1.05

SVK-SWE  : HOME WIN: 6.89 DRAW: 4.91 AWAY WIN: 1.54
         : HOME    : 6.26 DRAW: 4.46 AWAY    : 1.40

BETTING ODDS PER GAME, GROUP B:
CAN-DEN  : HOME WIN: 1.14 DRAW: 14.73 AWAY WIN: 18.69
         : HOME    : 1.03 DRAW: 13.39 AWAY    : 16.99

CAN-FIN  : HOME WIN: 1.97 DRAW: 3.50 AWAY WIN: 4.83
         : HOME    : 1.79 DRAW: 3.18 AWAY    : 4.40

CAN-GER  : HOME WIN: 1.03 DRAW: 40.65 AWAY WIN: 157.89
         : HOME    : 0.94 DRAW: 36.95 AWAY    : 143.54

CAN-KOR  : HOME WIN: 1.21 DRAW: 7.08 AWAY WIN: 28.25
         : HOME    : 1.10 DRAW: 6.43 AWAY    : 25.68

CAN-LAT  : HOME WIN: 1.00 DRAW: 750.00 AWAY WIN: 6000.00
         : HOME    : 0.91 DRAW: 681.82 AWAY    : 5454.55

CAN-NOR  : HOME WIN: 1.01 DRAW: 150.00 AWAY WIN: 681.82
         : HOME    : 0.92 DRAW: 136.36 AWAY    : 619.83

CAN-USA  : HOME WIN: 1.45 DRAW: 5.37 AWAY WIN: 8.08
         : HOME    : 1.32 DRAW: 4.88 AWAY    : 7.35	[1.55 5.50 5.80]

DEN-FIN  : HOME WIN: 184.05 DRAW: 32.64 AWAY WIN: 1.04
         : HOME    : 167.32 DRAW: 29.68 AWAY    : 0.94

DEN-GER  : HOME WIN: 2.98 DRAW: 4.63 AWAY WIN: 2.23
         : HOME    : 2.71 DRAW: 4.21 AWAY    : 2.03	[3.50 4.40 2.05]

DEN-KOR  : HOME WIN: 1.82 DRAW: 2.81 AWAY WIN: 10.41
         : HOME    : 1.66 DRAW: 2.56 AWAY    : 9.46

DEN-LAT  : HOME WIN: 3.05 DRAW: 4.83 AWAY WIN: 2.15
         : HOME    : 2.78 DRAW: 4.39 AWAY    : 1.95

DEN-NOR  : HOME WIN: 6.55 DRAW: 5.03 AWAY WIN: 1.54
         : HOME    : 5.95 DRAW: 4.57 AWAY    : 1.40

DEN-USA  : HOME WIN: 15.22 DRAW: 9.29 AWAY WIN: 1.21
         : HOME    : 13.84 DRAW: 8.45 AWAY    : 1.10	[9.00 7.70 1.35]

FIN-GER  : HOME WIN: 1.17 DRAW: 10.22 AWAY WIN: 20.98
         : HOME    : 1.06 DRAW: 9.29 AWAY    : 19.07

FIN-KOR  : HOME WIN: 1.28 DRAW: 5.87 AWAY WIN: 21.69
         : HOME    : 1.16 DRAW: 5.34 AWAY    : 19.72	[1.04 21.00 42.50]

FIN-LAT  : HOME WIN: 1.38 DRAW: 6.78 AWAY WIN: 7.82
         : HOME    : 1.25 DRAW: 6.16 AWAY    : 7.11

FIN-NOR  : HOME WIN: 1.08 DRAW: 18.58 AWAY WIN: 48.15
         : HOME    : 0.98 DRAW: 16.89 AWAY    : 43.78

FIN-USA  : HOME WIN: 3.28 DRAW: 2.75 AWAY WIN: 3.02
         : HOME    : 2.98 DRAW: 2.50 AWAY    : 2.75

GER-KOR  : HOME WIN: 1.34 DRAW: 4.94 AWAY WIN: 19.60
         : HOME    : 1.22 DRAW: 4.49 AWAY    : 17.81

GER-LAT  : HOME WIN: 1.52 DRAW: 5.32 AWAY WIN: 6.52
         : HOME    : 1.38 DRAW: 4.84 AWAY    : 5.93

GER-NOR  : HOME WIN: 6.90 DRAW: 8.60 AWAY WIN: 1.35
         : HOME    : 6.27 DRAW: 7.81 AWAY    : 1.23

GER-USA  : HOME WIN: 4.73 DRAW: 4.58 AWAY WIN: 1.75
         : HOME    : 4.30 DRAW: 4.17 AWAY    : 1.59

KOR-LAT  : HOME WIN: 12.10 DRAW: 3.15 AWAY WIN: 1.67
         : HOME    : 11.00 DRAW: 2.86 AWAY    : 1.52

KOR-NOR  : HOME WIN: 15.13 DRAW: 4.45 AWAY WIN: 1.41
         : HOME    : 13.75 DRAW: 4.04 AWAY    : 1.28

KOR-USA  : HOME WIN: 21.57 DRAW: 5.47 AWAY WIN: 1.30
         : HOME    : 19.61 DRAW: 4.97 AWAY    : 1.18

LAT-NOR  : HOME WIN: 2.10 DRAW: 4.50 AWAY WIN: 3.31
         : HOME    : 1.91 DRAW: 4.09 AWAY    : 3.01	[2.81 4.40 2.25]

LAT-USA  : HOME WIN: 15.29 DRAW: 5.96 AWAY WIN: 1.30
         : HOME    : 13.90 DRAW: 5.42 AWAY    : 1.19

NOR-USA  : HOME WIN: 30.58 DRAW: 17.50 AWAY WIN: 1.10
         : HOME    : 27.80 DRAW: 15.91 AWAY    : 1.00


Below are the prediction graphs for the expected goal spread in the two groups:

[Figures: goal-spread prediction graphs for the two groups]


Bayesian Inference 2018 Ice Hockey World Cup outcomes

I’ve tuned my Bayesian model a bit. Previously, it used the cumulative sum of the historical point spreads as its data input; now it uses each individual game’s spread. Perhaps an example can make this clearer:

Consider two teams, A and B, who have met 5 times in the past, with following results:

10-0,0-2,0-2,0-2,0-2

The above series gives a cumulative spread of +2, which normalized by the number of games becomes 2/5 = 0.4 for team A, indicating that team A is more likely to win. However, that result is heavily influenced by the 10-0 result of the first game, while not putting enough weight on the fact that team A actually lost the four last games.

To cope with that, I changed the model to consider each historical outcome individually instead of the cumulative spread, which in the above example results in team B being regarded as the likely winner.
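A minimal illustration of the difference between the two aggregation schemes (this is not the model code, just the example above in a few lines):

```python
results = [(10, 0), (0, 2), (0, 2), (0, 2), (0, 2)]   # team A's goals vs team B's
spreads = [a - b for a, b in results]                  # [10, -2, -2, -2, -2]

# old input: cumulative spread, normalized by the number of games
cumulative = sum(spreads) / len(spreads)               # 2/5 = 0.4 -> favors team A

# new input: each game's spread considered individually
wins_a = sum(s > 0 for s in spreads)                   # 1 win for A
wins_b = sum(s < 0 for s in spreads)                   # 4 wins for B -> favors B
print(cumulative, wins_a, wins_b)
```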

Below are the new predicted outcomes and corresponding odds.

[Image: IIHF 2018 logo]

 

[Figures: odds tables for groups A and B]

[Figures: prediction graphs for groups A and B]


2018 Ice Hockey World Championships – ‘Raw’ Odds Qualifying Round

[Image: IIHF 2018 logo, copyright IIHF]

Below are the probability distributions from the previous post converted to odds, or more specifically, “raw” odds: odds based purely on the underlying posterior distributions, not taking into account other aspects, such as the betting distribution, or the risk management that I’m sure commercial betting shops employ to safeguard their business from going bankrupt. Thus, some odds are very high, reflecting the fact that the inference model is very certain about which way the outcome will go.

The way to read the tables below is to look in the home_win column for True/False. If that boolean value is True, then the corresponding odds are ‘home_odds’ : 1. If home_win is False, then the odds are read the other way around, i.e. 1 : ‘home_odds’.

One interesting observation in group B is Korea, having very narrow odds. The reason is that Korea has no previous game history at the world championship level, and therefore the model has only the ranking at its disposal when making the predictions and setting the corresponding odds.
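The reading rule above in code form (a helper of my own, not part of the program’s output):

```python
def row_odds(home_odds, home_win):
    """Return the odds in home : away order, per the reading rule above."""
    return (home_odds, 1) if home_win else (1, home_odds)

print(row_odds(3749, False))  # AUT-SWE row -> (1, 3749): Sweden a huge favorite
print(row_odds(5, True))      # BLR-FRA row -> (5, 1): Belarus favored
```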

Group A Game Odds:

   home away  home_odds  home_win
0   AUT  BLR        112     False
1   AUT  CZE        229     False
2   AUT  FRA         29     False
3   AUT  RUS        809     False
4   AUT  SUI          1     False
5   AUT  SVK          1     False
6   AUT  SWE       3749     False
7   BLR  CZE        205     False
8   BLR  FRA          5      True
9   BLR  RUS        809     False
10  BLR  SUI         62     False
11  BLR  SVK         19     False
12  BLR  SWE         23     False
13  CZE  FRA        378      True
14  CZE  RUS          2     False
15  CZE  SUI          1     False
16  CZE  SVK          3      True
17  CZE  SWE          7     False
18  FRA  RUS         34     False
19  FRA  SUI          1     False
20  FRA  SVK         33     False
21  FRA  SWE        433     False
22  RUS  SUI        318      True
23  RUS  SVK       4284      True
24  RUS  SWE         49      True
25  SUI  SVK          2     False
26  SUI  SWE         87     False
27  SVK  SWE         51     False

Group B Game Odds:

   home away  home_odds  home_win
0   CAN  DEN       2999      True
1   CAN  FIN          3      True
2   CAN  GER       5999      True
3   CAN  KOR         10      True
4   CAN  LAT        554      True
5   CAN  NOR       2999      True
6   CAN  USA         15      True
7   DEN  FIN       9999     False
8   DEN  GER          3     False
9   DEN  KOR          2      True
10  DEN  LAT          1      True
11  DEN  NOR          5     False
12  DEN  USA        279     False
13  FIN  GER         98      True
14  FIN  KOR          7      True
15  FIN  LAT         15      True
16  FIN  NOR        216      True
17  FIN  USA          6     False
18  GER  KOR          5      True
19  GER  LAT        170      True
20  GER  NOR        293     False
21  GER  USA          3     False
22  KOR  LAT          2     False
23  KOR  NOR          4     False
24  KOR  USA          6     False
25  LAT  NOR         46     False
26  LAT  USA        205     False
27  NOR  USA         14     False

 


2018 Ice Hockey World Championships – predicted outcomes

Here’s the first real prediction from my Bayesian inference engine, on results for the upcoming Ice Hockey World Championships, starting about 3 weeks from now.

As usual, there are 16 teams, split into two groups:

Group A : ['RUS','SWE','CZE','SUI','BLR','SVK','FRA','AUT']
Group B : ['CAN','FIN','USA','GER','NOR','LAT','DEN','KOR']

By “real”, I mean that the model now incorporates the actual teams, and the actual games for each group, and the latest IIHF rankings, as well as real historical game data.

The historical game result data comprises about 400 previous world championship games, from the years 2012-2017.

Below the predicted game outcomes, for both groups:

Group A:

hockey_2018_prediction_A

Group B:

hockey_2018_prediction_B

Group A Game Probability Distributions:

[Figures: probability distributions for each Group A game]

 

Group B Game Probability Distributions:

[Figures: probability distributions for each Group B game]


Bayesian Inference – what good is the Prior, anyway…?

A brief example on the effect of Bayesian priors (I’m going to use my Ice Hockey Championship Prediction hack under development for this example):

Assume you would like to bet on the outcome of some particular game, for instance, Sweden – Finland (assuming these two teams are actually going to meet) in the upcoming championships.  How would you decide upon your bet…? Solely by gut feeling, or by browsing previous game results, or perhaps by having a look at the current IIHF Ranking table…?

Scenario I – without a prior

Let’s say you settle for the ranking table. Currently, Sweden is ranked #3 (after Canada and Russia), and Finland is ranked #4. So, on the basis of this ranking, you should put your money on Sweden winning. Let’s see how that plays out:

The graph below displays the expected point spread, from the perspective of the ‘home’ team, i.e. the team on the left. In this graph, the model operates in non-Bayesian mode, that is, without any prior; the only thing impacting the expected result here is the difference in ranking between the two teams. If you look at row #7, you can see that the model predicts Finland to lose to Sweden by a narrow margin, which is fully in accordance with their rankings.

(Perhaps I should also mention how to read the graph: the blue dot shows the mean of the posterior probability for winning/losing. If it is to the left of the dashed orange line, the ‘home’ team is predicted to lose; if it is to the right of the line, the ‘home’ team is predicted to win. The blue horizontal line is the 95% confidence interval. The red dot shows the historical point spread obtained from data, but in this first scenario, we don’t have that data included in the model, since we are pretending to be non-Bayesian. Finally, the number between the square brackets indicates the number of times these two teams have met before, 0 for all games in this scenario.)

You should also observe that the confidence interval for each game is pretty much equal, which makes sense, since the model currently does not take into account the magnitude of the ranking difference; it only looks at the ordinal ranking positions.

[Figure: predictions without a prior, rankings only]

Scenario II – adding a prior

Let’s now add a Bayesian prior; let’s use the results from all previous games between the two teams, Sweden vs Finland, as the basis for our prior. Basically, this means that instead of letting our betting decision rely solely upon the ranking, we will also let the historical game outcomes influence how the Bayesian inference engine decides upon the prediction for the outcome of the game. Here, for simplicity, I’m using faked historical results, where I’ve initially set all results to a point spread of zero; that is, regardless of the number of times two teams have played each other, the cumulative point spread is zero:

[Figure: predictions with zero-spread historical results added]

The first thing you should notice now, with previous results added to the model, is that for those teams that have met each other several times, the confidence interval has shrunk; it’s narrower. Again, this makes sense, since the model now has more data on which to base a prediction. You should also notice that for teams that have met each other several times, the expected value, represented by the blue dot, has moved around a bit, reflecting the addition of the historical results. For instance, looking again at Sweden – Finland, now with 6 historical games in the model, the new expected result is pretty much zero, i.e. a draw. So, despite Sweden’s higher ranking, the zero historical point spread between these two teams has pulled the expected result towards zero, i.e. a draw.

Scenario III – predictions with non-zero historical spread

Let’s modify the historical results from a zero cumulative spread to something different, say random values, for the sake of an example. What you should first observe in the graph below is that for many teams the red dot has moved away from zero, reflecting a non-zero historical point spread.

[Figure: predictions with random non-zero historical spreads]

You should also see that for some games, those with a large number of previous games combined with a large +/- spread, the confidence interval has shrunk considerably, reflecting the model being more confident in its predictions. Looking again at Finland – Sweden, the predicted outcome now is almost 4 standard deviations on the plus side for Finland, which is caused by the historical results between the two teams giving Finland a large positive point spread. The confidence interval is very narrow, caused both by the number of historical games being large (10), and by the point spread being large.

So, how would we use these posterior predictions to make a bet…? Well, according to this model – and remember that currently, the historical result data is faked – among the “safest” bets are:

  • Russia – Sweden
  • Finland – Sweden
  • Czech – Finland
  • Canada – Russia

On the other hand, the most uncertain bets (that will have the best odds at your favorite gambling house) include:

  • Canada – Czech
  • Canada – Hungary
  • Canada – Sweden
  • and several others

There is an easy way to immediately determine from the graph how ‘safe’ or ‘unsafe’ a bet is, but I thought I’d leave it to the reader to figure out…. 🙂


Ice Hockey World Championships 2018 – Bayesian inference for predicting results, part II

Just a brief update on my Bayesian model for predicting the upcoming hockey championships. [For info on how to read the graphs below, have a look at part I].

I wrote some scripts to pull in more game results, so for the time being the game data consists of about 200 games from the previous 3 world championships. I might add more data later.

The graphs below predict outcomes based on the game data above, and a number of Bayesian Priors, among them the current world ranking of the teams.

For the moment, I haven’t bothered looking at which teams will play this year’s world cup; that really doesn’t matter, as my model now has data for almost all of the teams that might end up playing. Eventually, I’ll restrict the model to only the teams actually participating this year.

Below are predictions for 171 different team meetings (that might or might not occur in this year’s world champ’s). I haven’t had time to examine the predictions closely yet, but I happened to notice one interesting thing: despite Russia being higher ranked than Finland, Finland actually has a positive spread, at least in the games I have in my data. That’s SISU, folks! 🙂

[Figures: prediction graphs 1–5]


Predicting the outcome of World Championships in Ice Hockey using Bayesian Inference

Just for fun, I thought I’d implement a Bayesian “statistical inference engine” for some sports tournament. For whatever reason, I came to choose ice hockey world championships, training my inference engine on data from the 2016 tournament, and testing its predictions against the results of the 2017 tournament.

A problem with this approach is that most of the time, the set of participating teams is not the same year over year, and since this is just a small experiment, I did not want to go through year after year of match data to find at least one game result for each combination of the teams playing the 2017 championships. Secondly, for the same reason, to avoid lots and lots of boring data entry, I settled for having just one game result per team pair in my training data. Thus, from the 8 teams in Group A 2016 that I chose as my results data, where each team met every other team once, I got 28 results in total, which clearly is not much to make predictions upon. But the purpose of this experiment was not to get high accuracy in the predictions, but to experiment a bit with a slightly more complex inference scenario.

Thus, with the small amount of data, and without having put much thought about my various priors, I wasn’t expecting much accuracy in the predictions.

A bit about the model (this time, I’m not going to publish the code, for reasons that will become clear in a minute… 😉)

The data the model fits consists of 28 game results from the 2016 world championships. From those matches, I collect the point spread for each game, not the game result.

Why not the results? Because the point spread carries more information than the binary win/lose result. If I just know that team A beat team B, that’s one bit of information, while a result like 9-4, with a delta of 5, conveys much more.
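A back-of-the-envelope version of that information argument (the spread range is my illustrative assumption):

```python
import math

# a binary win/lose outcome has 2 possible values -> at most 1 bit
binary_bits = math.log2(2)

# a goal spread ranging over, say, -10..+10 has 21 possible values
# -> up to log2(21) bits, if all values were equally likely
spread_bits = math.log2(21)

print(binary_bits, round(spread_bits, 1))  # 1.0 4.4
```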

Furthermore, I use the International Ice Hockey Federation’s ranking of teams for 2016 as one of the priors.

Here’s the prediction result, coming from a few hundred lines of Python and PYMC code:

[Figure: predicted outcomes for the 28 games]

On the vertical axis are the 28 games from 2016. The vertical red dashed line is “point zero”: if the blue dot is to the right of that line, the “home team” – the team on the left – is predicted to win; if the blue dot is to the left of the red line, the home team is predicted to lose. Each blue dot has a 95% confidence interval, illustrated by the blue lines.

The red dots represent the training data, that is, the game outcomes from the 2016 tournament. They are there just for easy control of the predicted, blue results, and would normally, in a “real” prediction scenario, not be there.

So, how did the model do…? In fact, much better than I expected, given the very limited amount of data, the minimal thought I’ve given to the priors, and last but not least: as anyone interested in betting knows, it is notoriously hard to predict sports events – even the world’s best team can occasionally lose against a team ranked around 50…

To check the prediction power of the model, I compared its predictions with the results from the 2017 world championships. As noted above, not all teams from 2016 were present, nor did all those present in 2016 play each other in 2017, so out of my 28 training games from 2016 there were only 11 matching games in 2017.

Nevertheless, the model correctly predicted 7 of the 11 possible outcomes, that is, the model got it right in 64% of the cases. And as any gambler knows, that kind of “house advantage” is enormous…!

So, the reason I’m not publishing any code this time is that I might consider putting some real thought into enhancing the model, e.g. to predict the upcoming 2018 hockey championships, or perhaps even more interesting: the 2018 FIFA world cup… 🙂


Bayesian Linear Regression with PYMC, part II

In the previous post we looked at a simple linear regression between (simulated) human heights and weights. In that example, the regression was truly ‘linear’, in that the predictor variable only occurred in its first power, giving a regression equation of the form mu = beta * x + alpha.

From a purely mathematical point of view though, there is nothing preventing us from using a higher power for the predictor, in a polynomial form. The only things that change (except for some of the ‘fancy’ plotting code) are the regression equation, and the addition of a new prior (beta2);

for instance, for a quadratic regression (still called ‘linear’ regression, by the way, since the model remains linear in its coefficients):

mu = beta2 * x*x   + beta1 * x  + alpha

In fact, we can add arbitrarily many terms with increasing powers, each with its own prior stochastic coefficient.
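The claim that more terms always fit at least as well is easy to verify with plain least squares; here numpy’s polyfit stands in for the PYMC model, on fake data of my own making:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(60, 10, 50)            # fake weights
y = 0.9 * x + rng.normal(0, 5, 50)    # fake heights, roughly linear in weight

# residual sum of squares for increasing polynomial degree; since each
# higher-degree model nests the lower one, the RSS can only go down
rss = []
for degree in (1, 2, 3):
    coeffs = np.polyfit(x, y, degree)
    rss.append(float(np.sum((y - np.polyval(coeffs, x)) ** 2)))

print(rss)  # non-increasing sequence
```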

Below are results for the first three powers, this time on real data from [Howell]:

power = 1:

[Figure: linear fit]

power = 2:

[Figure: quadratic fit]

power = 3:

[Figure: cubic fit]

As can be seen, the more terms with higher powers we add, the better the fit. In fact, by adding enough terms we can create a perfect fit, where the regression line passes exactly through each data point.

Now, the question becomes: which of these three models is “better”…? Perhaps non-intuitively, from a statistical inference point of view, it’s rarely the one with the “best” fit… I might return to elaborate on that topic in a later post; meanwhile, I encourage the curious reader to think a bit about why this might be so….


Bayesian Linear Regression with PYMC

Is there a relationship between human height and weight…? Probably. But what does that relationship look like ?

Those kinds of questions can be answered by linear regression. Below is an example, using simulated data for weights, heights etc.:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import pymc as pm

def create_fake_data():
    ### create simulated data, 200 rows, with columns ->
    ### height,weight,male,age ### 
    entries = 200
    height_series = pd.Series(np.random.normal(170,10,entries))
    weight_series = pd.Series(np.random.normal(60,10,entries))
    male_series = pd.Series(np.random.uniform(0,2,entries)).astype(int)
    age_series = pd.Series(np.random.normal(40,10,entries)).astype(int)

    df = pd.DataFrame({'height':height_series,'weight':weight_series,
                       'male':male_series,'age':age_series})

    mean_height = np.mean(df['height'])
    std_height = np.std(df['height'])
    mean_weight = np.mean(df['weight'])
    std_weight = np.std(df['weight'])

    ### increase weight of tall persons, reduce of short ### 
    df.loc[df['height'] > mean_height * 1.01,'weight'] = df['weight'].apply(
        lambda x: float(np.random.normal(mean_weight + 30,10,1)))

    df.loc[df['height'] < mean_height * 0.995,'weight'] = df['weight'].apply(
        lambda x: float(np.random.normal(mean_weight - 10,5,1)))

    ### increase weight and height of men ###
    df.loc[df['male'] == 1,'weight'] = df['weight'] * 1.1
    df.loc[df['male'] == 1,'height'] = df['height'] * 1.1

    df = df[df['age'] >= 18] # unnecessary here, but left for selection syntax
    return df
    
df = create_fake_data()

x = df['weight']

### define priors ###
alpha = pm.Normal('alpha',mu=178,tau=1/(100*100))
beta = pm.Normal('beta',mu=0,tau=1/(10*10))
sigma = pm.Uniform('sigma',lower=0,upper=50)

### sample the priors for plotting ###
alpha_prior_samples = [alpha.random() for i in range(10000)]
beta_prior_samples = [beta.random() for i in range(10000)]
sigma_prior_samples = [sigma.random() for i in range(10000)]

### define regression ###
@pm.deterministic()
def mu(x=x,beta=beta,alpha=alpha):
    return x*beta+alpha

### define data generation scheme, connect to data ###
height = pm.Normal('height',
                   mu=mu,tau=1 / (sigma * sigma),
                   value=df['height'],observed=True)

### define PYMC model ###
model = pm.Model([alpha,beta,sigma,height,mu])
mcmc = pm.MCMC(model)

mcmc.sample(iter=100000,burn=25000,thin=2)

### get traces ###
alpha_samples = mcmc.trace('alpha')[:]
beta_samples = mcmc.trace('beta')[:]
sigma_samples = mcmc.trace('sigma')[:]
mu_samples = np.array(mcmc.trace('mu')[:,0])

result = pd.DataFrame({'alpha':alpha_samples,'beta':beta_samples,
                       'sigma':sigma_samples,'mu':mu_samples})

print (result.head())
print (result.describe())

print ('height mean:',np.mean(df['height']))
print ('height std:',np.std(df['height']))
print ('height women mean:',np.mean(df[df['male'] == 0]['height']))
print ('height men mean:',np.mean(df[df['male'] == 1]['height']))
       
print ('posterior mu mean:',np.mean(mu_samples))
print ('posterior sigma mean:',np.mean(sigma_samples))
print ('posterior alpha mean:',np.mean(alpha_samples))
print ('posterior beta mean:',np.mean(beta_samples))

nr_rows = 1000
rows = np.random.choice(result.index,replace=True,size=nr_rows)

posterior_samples = result.iloc[rows,:]

X = np.linspace(min(df['weight']),max(df['weight']),nr_rows)

### for each weight, generate random heights ->
### with mu and sigma from the posterior ###
posterior_mus = np.array([np.random.normal(X[i] * posterior_samples[
    'beta'] + posterior_samples['alpha'],posterior_samples[
        'sigma']) for i in range(len(posterior_samples))])

### 89% confidence interval ### 
low,high = np.percentile(posterior_mus,[5.5,94.5],axis=1)

#### plotting ####
sub=10
lines = np.array([X[i] * result['beta'][::sub] + result[
        'alpha'][::sub] for i in range(len(X))])

plt.figure(figsize=(18,12))
plt.subplot(3,1,1)
plt.title(r'Prior $\alpha$')
plt.hist(alpha_prior_samples,bins=30)

plt.subplot(3,1,2)
plt.title(r'Prior $\beta$')
plt.hist(beta_prior_samples,bins=30)

plt.subplot(3,1,3)
plt.title(r'Prior $\sigma$')
plt.hist(sigma_prior_samples,bins=30)

plt.tight_layout()
plt.savefig('priors.jpg',format='jpg')

plt.figure(figsize=(18,12))
plt.subplot(4,1,1)
plt.title(r'Posterior $\alpha$')
plt.hist(alpha_samples,bins=30)

plt.subplot(4,1,2)
plt.title(r'Posterior $\beta$')
plt.hist(beta_samples,bins=30)

plt.subplot(4,1,3)
plt.title(r'Posterior $\sigma$')
plt.hist(sigma_samples,bins=30)

plt.subplot(4,1,4)
plt.title(r'Posterior $\mu$')
plt.hist(mu_samples,bins=30)
plt.tight_layout()
plt.savefig('posteriors.jpg',format='jpg')

plt.figure(figsize=(18,12))
plt.title(r'Bivariate Linear Regression $\alpha$:{:.2f} $\beta$:{:.2f}, CI=89%'.format(np.round(np.mean(alpha_samples),2),np.round(np.mean(beta_samples),2)))

plt.scatter(df['weight'],df['height'])
plt.plot(X,lines,alpha=0.005,color='r')
plt.plot(X,X*np.mean(beta_samples)+np.mean(alpha_samples),color='k',lw=3)
plt.fill_between(np.linspace(min(df['weight']),
                             max(df['weight']),high.size),
                 low,high,color='c',alpha=0.1)

plt.xlabel('Weight [kg]')
plt.ylabel('Height [cm]')
plt.savefig('regression.jpg',format='jpg')
plt.show()

Running that code gives the below graphs:

First the priors for Alpha, Beta & Sigma:

priors

Next, the Posteriors:

posteriors

And finally, the regression itself: the blue dots are the actual (simulated) data points, the black line is the maximum a posteriori estimate of mu, the red area around it is the uncertainty in mu, and the baby blue area represents the 89% credible interval for what the model believes the heights of people are for each weight.

regression
