Fooled by Averages and Ignorant of Uncertainty – Bayesian Inference to assistance

As a follow-up to my previous posts [1,2,3] on the dangers of relying on averages, and on Simpson’s paradox, which is a consequence of misused averaging, here’s yet another angle on the same topic.

Let’s first resume with the baseball batting average example from this post:

In the example, we had two players, A and B, with half-season batting average stats as below:

[Figure: joe_baseball, table of half-season batting averages for players A and B]

While the entries for the first and second half seasons are correct, the row with season averages, calculated as a straight mean of the two values, is not.

As my old friend Joe Marasco says (aptly called “TWR” – Three Word Rule): NEVER AVERAGE AVERAGES!

The problem with these season averages is that they are computed as the mean of the first- and second-half batting averages, but the “evidence”, that is, the raw number of hits versus the number of bats in each half season, is very different, as can be seen here:

[Figure: joe_batting, table of hits and bats per half season for players A and B]

So, a naive mean of the two half season averages is not what we want. NEVER AVERAGE AVERAGES!

To compute the true season mean we have to take into account the overall number of bats and hits, i.e. what we want is the weighted average, which is most easily computed by dividing each player’s total hits over both half seasons by that player’s total bats:

[Figure: joe_baseball_true_average, table of true season batting averages for players A and B]
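The arithmetic can be checked in a few lines of plain Python. This is a minimal sketch that reuses the hit and bat counts from the example; the variable and function names are mine and purely illustrative.

```python
# Hits and bats per half season, taken from the example data.
hits = {"A": [1, 15], "B": [3, 2]}
bats = {"A": [4, 40], "B": [10, 5]}

def naive_mean(player):
    """Erroneous: the plain mean of the two half-season averages."""
    halves = [h / b for h, b in zip(hits[player], bats[player])]
    return sum(halves) / len(halves)

def pooled_average(player):
    """Correct: total hits divided by total bats (a bats-weighted mean)."""
    return sum(hits[player]) / sum(bats[player])

for player in ("A", "B"):
    print(player, round(naive_mean(player), 4), round(pooled_average(player), 4))
# A 0.3125 0.3636
# B 0.35 0.3333
```

Note the reversal: B has the higher average in each half season (0.300 vs 0.250, and 0.400 vs 0.375), yet A has the higher season average. That is Simpson’s paradox in miniature.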

But these point estimates, that is, the averages, leave a lot of information out: surely there is a fair amount of uncertainty around them…?

One way to make that uncertainty clearly visible is to use Bayesian inference.

In more specific terms, we are not satisfied with those point estimates; instead, we want the full probability distribution for each batting average.

To do that in the Bayesian framework, we start by declaring our prior belief, that is, what we believe before seeing the data. To keep this example simple, I’ll use an uninformative, that is, ‘flat’, prior. Code below:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import pymc as pm  # PyMC 2.x API (pm.MCMC, observed=True, value=...)

sns.set()

df = pd.DataFrame({'A_bats' : [4,40],
                  'A_hits' : [1,15],
                  'B_bats' : [10,5],
                  'B_hits' : [3,2]})

df.index = ['1st_half','second_half']
df['A_batting_average'] = df['A_hits'] / df['A_bats']
df['B_batting_average'] = df['B_hits'] / df['B_bats']
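# 'All' row: season totals of bats and hits
df.loc['All',:] = df.loc[:,'A_bats' : 'B_hits'].sum()
# naive (erroneous) season average: plain mean of the two half-season averages
df.loc['All','A_batting_average' : 'B_batting_average'] = df.loc[:,'A_batting_average' : 'B_batting_average'].mean()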

true_averages = pd.DataFrame({'A' : [df.loc['All','A_hits'] / df.loc['All','A_bats']]})
true_averages['B'] = df.loc['All','B_hits'] / df.loc['All','B_bats']
true_averages.index = ['all_season_batting_average']
# weighted means:
# wm = (nr_events_1 * value_event_1 + nr_events_2 * value_event_2) / (nr_events_1 + nr_events_2)

wm_A = (df.loc['1st_half','A_batting_average'] * df.loc['1st_half','A_bats'] + df.loc['second_half','A_bats'] * \
df.loc['second_half','A_batting_average']) / (df.loc['1st_half','A_bats'] + df.loc['second_half','A_bats'])

wm_B = (df.loc['1st_half','B_batting_average'] * df.loc['1st_half','B_bats'] + df.loc['second_half','B_bats'] * \
df.loc['second_half','B_batting_average']) / (df.loc['1st_half','B_bats'] + df.loc['second_half','B_bats'])

# take a copy, so that the .at assignments below do not write into a slice of df
averages = df.loc[:'second_half','A_batting_average':].copy()
averages.at['erroneous_mean','A_batting_average'] = averages['A_batting_average'].mean()
averages.at['erroneous_mean','B_batting_average'] = averages['B_batting_average'].mean()
averages.at['All','A_batting_average'] = wm_A
averages.at['All','B_batting_average'] = wm_B

# Bayesian inference for the most likely batting averages

season = 'All'
predictor = '1st_half' #for informed prior

flat_prior = pm.Uniform('prior',lower=0,upper=1,size=2)

informed_prior = pm.Beta('prior',
alpha=[100 * averages.loc[predictor,'A_batting_average'],
100 * averages.loc[predictor,'B_batting_average']],
beta=[100-averages.loc[predictor,'A_batting_average']*100 ,
100 - averages.loc[predictor,'B_batting_average'] * 100],size=2)

prior = flat_prior

lkh_A = pm.Binomial('lkh_A',n=df.loc[season,'A_bats'],p=prior[0],observed=True,value=df.loc[season,'A_hits'])
lkh_B = pm.Binomial('lkh_B',n=df.loc[season,'B_bats'],p=prior[1],observed=True,value=df.loc[season,'B_hits'])

model = pm.Model([prior,lkh_A,lkh_B])

mcmc = pm.MCMC(model)

# 50,000 iterations, discard the first 10,000 as burn-in, keep every 2nd sample
mcmc.sample(iter=50000, burn=10000, thin=2)

post_A = mcmc.trace('prior')[:,0]
post_B = mcmc.trace('prior')[:,1]

result = pd.DataFrame({'post_A' : post_A,
'post_B' : post_B})

Running this code gives the posterior results below:

[Figure: joe-results]

We can see that the posterior mean batting average is 0.37 for player A and 0.35 for player B, that is, close to but not identical to the analytically obtained numbers of 0.364 and 0.333. However, the whole point of doing Bayesian inference is to incorporate the inherent uncertainty, and looking just at the mean values does not buy us anything new.
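The small gap between the posterior means and the analytic averages is expected. With a flat prior and a binomial likelihood, the posterior is known in closed form: a Beta distribution with parameters hits + 1 and bats − hits + 1, whose mean is (hits + 1) / (bats + 2). A minimal sketch of that check, using the closed form rather than the MCMC run (the helper name is hypothetical):

```python
def flat_prior_posterior_mean(hits, bats):
    """Mean of Beta(hits + 1, bats - hits + 1), the posterior under a Uniform(0, 1) prior."""
    return (hits + 1) / (bats + 2)

# Season totals from the example: A has 16 hits in 44 bats, B has 5 hits in 15 bats.
print(round(flat_prior_posterior_mean(16, 44), 3))  # 0.37
print(round(flat_prior_posterior_mean(5, 15), 3))   # 0.353
```

In effect, the flat prior acts like one extra hit and one extra miss, which pulls both estimates slightly toward 0.5.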

Instead, let’s plot the distributions:

[Figure: joe_baseball_distributions_All, posterior distributions of the season batting averages for players A and B]

Now we can see that there’s a lot more uncertainty about the performance of player B, whose posterior batting average ranges from less than 0.100 all the way up to almost 0.800. The reason is that player B has had only 15 bats in the season, while player A has had 44. Because player B has so few bats, the Bayesian inference engine duly reports that “there’s a lot of uncertainty about these numbers for player B mate, be careful how you use them!”

Player A on the other hand, with 44 bats, has a much narrower posterior distribution, reflecting that the inference engine is much more certain about player A’s ‘true’ batting average.
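To put numbers on “narrower”, one option is to compare central 95% credible intervals. The sketch below is not the PyMC run from the post; it assumes the closed-form Beta posteriors that a flat prior implies, and uses `scipy.stats.beta` to read off the intervals.

```python
from scipy.stats import beta

# Closed-form posteriors under a flat prior: Beta(hits + 1, misses + 1).
posterior_A = beta(16 + 1, 28 + 1)   # 16 hits, 28 misses in 44 bats
posterior_B = beta(5 + 1, 10 + 1)    # 5 hits, 10 misses in 15 bats

for name, post in (("A", posterior_A), ("B", posterior_B)):
    low, high = post.interval(0.95)  # central 95% credible interval
    print(f"{name}: {low:.3f} to {high:.3f} (width {high - low:.3f})")
```

The interval for player B should come out clearly wider than the one for player A, which is the same message the plot conveys.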

Working with distributions also allows us to ask and answer various questions, such as ‘what’s the probability of player B having a better batting average than player A?’

We can obtain the answer very easily: subtract the posterior samples for A from the posterior samples for B, and take the fraction of differences that are greater than zero. The result is 0.44, that is, the probability of B having a better batting average than A is 44%. In case we’d want to make a bet, that corresponds to fair decimal odds of 1/0.44 ≈ 2.27, that is, for each dollar staked on player B, you’d get 2.27 back if it turns out he’s the better batter of the two.
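That calculation can be sketched as follows. As a stand-in for the MCMC traces `post_A` and `post_B`, this example draws directly from the Beta posteriors implied by a flat prior (an assumption on my part, made so the snippet runs without PyMC); the sample count and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n_samples = 200_000

# Stand-ins for the MCMC traces: direct draws from the flat-prior posteriors.
post_A = rng.beta(16 + 1, 28 + 1, size=n_samples)  # 16 hits, 28 misses
post_B = rng.beta(5 + 1, 10 + 1, size=n_samples)   # 5 hits, 10 misses

# Probability that B is the better batter: share of samples where B - A > 0.
diff = post_B - post_A
p_B_better = (diff > 0).mean()

# Fair decimal odds are the reciprocal of the probability.
fair_decimal_odds = 1 / p_B_better
print(f"P(B > A) = {p_B_better:.2f}, fair decimal odds = {fair_decimal_odds:.2f}")
```

With the real traces, the same two lines (`diff = post_B - post_A` and `(diff > 0).mean()`) apply unchanged.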

Bottom line thus far: more data provides (typically) better – less uncertain – results.

As a second example, consider the typical scenario of product or service ratings, e.g. when you look at restaurants and see that one has gotten 4.75 stars out of 5 on average, while the other has an average rating of “only” 4.55.

Clearly restaurant A, with an average rating of 4.75 is better than restaurant B with an average rating of 4.55…?

Not necessarily. Again, as in all the examples in this thread on the dangers with averages, it depends – in order to find out, we need the raw data.

Looking at the raw data in this example, it turns out that restaurant A had only 4 reviews, while restaurant B had 90 reviews.

Which one appears more likely to provide good food and service now…?

Again, the easiest way to see where to put our money is to look at the posterior distributions:

[Figure: product_rating_example, posterior distributions of the ratings for restaurants A and B]

Here we can see that for restaurant B, the posterior distribution for the rating is tightly concentrated around 0.9, that is, an average of about 4.5 stars, while the distribution for restaurant A is much flatter, indicating much more uncertainty about the “true” number of stars. The maximum-likelihood estimate is clearly higher for restaurant A, but you might also run into a surprise, with food and service not to your liking, despite the impressive average rating. In this example, my bet would be on restaurant B, which appears to have a consistently high level of service, with few surprises.
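The model behind this plot is not shown here, so the following is only a rough sketch under an assumption of mine: each average rating is rescaled to a fraction of the 5-star maximum and treated as if it were a success rate over the number of reviews, with a flat prior, which again gives a Beta posterior. The function name and the 0.8 threshold (4 stars) are likewise just illustrative choices.

```python
from scipy.stats import beta

def rating_posterior(avg_stars, n_reviews, max_stars=5):
    """Beta posterior for the rating fraction under a flat prior (illustrative model only)."""
    fraction = avg_stars / max_stars
    return beta(1 + n_reviews * fraction, 1 + n_reviews * (1 - fraction))

restaurant_A = rating_posterior(4.75, 4)    # 4 reviews
restaurant_B = rating_posterior(4.55, 90)   # 90 reviews

for name, post in (("A", restaurant_A), ("B", restaurant_B)):
    low, high = post.interval(0.95)  # central 95% credible interval
    below_4_stars = post.cdf(0.8)    # posterior probability that the "true" rating is under 4 stars
    print(f"{name}: 95% interval {low:.2f} to {high:.2f}, P(< 4 stars) = {below_4_stars:.3f}")
```

Under these assumptions, restaurant A has the wider interval and the larger chance of a “true” rating below 4 stars, despite its higher average.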

About swdevperestroika

High tech industry veteran, avid hacker reluctantly transformed to mgmt consultant.
3 Responses to Fooled by Averages and Ignorant of Uncertainty – Bayesian Inference to assistance

  1. Joe Marasco says:

    Distributions tell us the whole story because they are lossless. Think of it this way: bitmaps have all the information, jpegs have most of it. Compression algorithms gain space and time, but if you want all of the detail at any magnification, you have to go back to the source, the bitmap.

    Think of an average as a very lossy estimator. It is a measure of central tendency, but some information is lost in the process. Thinking that if you have several measures, like the mean, the median, the standard deviation, etc. you are OK is fallacious, especially when you have a skewed distribution and/or one that is asymmetrical with a long tail. Not all distributions are normal!

    Your restaurant ratings example illustrates this very nicely, but I disagree with the conclusion. Restaurant B, with the yellow bars distribution, would be my choice. The odds of getting a bad meal are clearly worse for Restaurant A, even if you often get a very good meal. It depends on which situation you are trying to avoid.
