Deeper look into Simpson’s Paradox – be very careful with averages!

A while ago, I did a quick post on Simpson’s Paradox, in the context of Corona and R0 values.

Here we are going to look a bit deeper into the paradox. Why? Because it has some very important implications in any area where we use averages or means to describe something: basically, unless careful thinking is applied, using averages can often lead to shooting yourself in the foot…

The bottom line: before using averages in your arguments, make sure you really understand what those averages represent – very often, they do not represent what you think. 

Let’s start by illustrating the problem with averages/means with an example provided by my old friend Joe Marasco:

Imagine two baseball players, A & B. During the first half of the season A has a batting average of 0.250, while B has 0.300. During the second half of the season A has 0.375 and B has 0.400.

Which of these two players is the better batter?

By looking at the half-season averages, it seems clear that B, with 0.300 and 0.400, is better than A with 0.250 and 0.375. Let’s put the data into a table:

[Figure: table of half-season batting averages for players A and B]

So, the season average – calculated this ‘naive’ way, by taking the average of the two half-season scores – results in player B clearly appearing to be the better hitter.

But is that really a correct reflection of the skills of A vs B…?

The simple answer is: “we don’t know”.

In order to know whether B indeed has a better performance, we need the raw numbers, i.e. the actual data behind the given average numbers.

Why…? Because in statistics, 1/2 is way different from 10/20, even though both ratios result in 50%.

Why…? Because when the only value we get is the 50% (or 0.5), we have eliminated a lot of information from the original, raw data. In the case of 1/2 vs 10/20: imagine those were ratings (‘likes’) on some product or service; in the first case 1 out of 2 people liked the service, in the second 10 out of 20. Which one would you trust more…? What about 1000/2000…? All these ratios compress down to 0.5, but their semantics are very different.
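To make this concrete, here’s a minimal Python sketch (my own, not from the original post) computing the standard error of an observed proportion, sqrt(p(1-p)/n), for each of these ratios – the same 0.5, but with very different uncertainty:

```python
import math

def proportion_se(hits: int, trials: int) -> float:
    """Standard error of an observed proportion hits/trials."""
    p = hits / trials
    return math.sqrt(p * (1 - p) / trials)

for hits, trials in [(1, 2), (10, 20), (1000, 2000)]:
    # All three ratios equal 0.5, but the standard error shrinks with sample size
    print(f"{hits}/{trials}: p = {hits / trials:.2f}, SE = {proportion_se(hits, trials):.3f}")
```

The 1/2 rating has a standard error of about 0.35 – the estimate is almost meaningless – while 1000/2000 pins the rate down to within about a percent.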

Anyway, let’s get back to the baseball example and construct one possible set of raw data that conforms to these averages:

[Figure: table of raw data – hits and bats per half-season – consistent with the averages above]

So, with these raw data, we can better see what’s going on, and which of the two players is the better hitter. Looking at the data split into half-seasons, as we started with, shows that player B indeed had a better batting average for both half-seasons.

But the devil is in the detail: look at the number of bats. Firstly, player B has only 15 bats in total, 10 for the first half and 5 for the second half, while player A has 4 and 40 respectively. So, the impressive 0.400 batting average of the second half for player B was based on only 5 bats, and the less impressive 0.250 batting average for player A during the first half was based on only 4 bats. These numbers are hardly indicative of the players’ overall skill, because such results could very well be the product of luck, or bad luck, i.e. randomness.

This is yet another example of the “Law of Small Numbers” that I wrote about in a previous post.

So, the correct way to assess the performance, the batting average, of these two players is to look at the overall statistics, that is, for the full season:

player A has 44 bats, of which 16 were hits, giving a batting average of 0.364, while player B has 15 bats and 5 hits, for a batting average of 0.333.
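These full-season numbers can be checked with a few lines of Python. The raw counts below are one possible reconstruction consistent with the half-season averages in the text (the variable names are mine):

```python
# One possible set of raw counts (hits, bats) matching the half-season averages
A = {"first": (1, 4),  "second": (15, 40)}   # 1/4 = 0.250, 15/40 = 0.375
B = {"first": (3, 10), "second": (2, 5)}     # 3/10 = 0.300, 2/5 = 0.400

def season_avg(player):
    """Full-season batting average: total hits divided by total bats."""
    hits = sum(h for h, _ in player.values())
    bats = sum(b for _, b in player.values())
    return hits / bats

# B wins both halves, yet A wins the season -- Simpson's Paradox
print(round(season_avg(A), 3))  # 16/44 -> 0.364
print(round(season_avg(B), 3))  # 5/15  -> 0.333
```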

[Figure: table of full-season totals and true batting averages]

So, what happens here is that during the second half of the season, player A not only has a large number of bats, but those bats are more often successful, creating a total batting average that eventually overwhelms the very high averages of B, which were based on very few bats.

This result can also be obtained by taking the weighted means:

for A: (4 * 0.250 + 40 * 0.375) / (4 + 40) = 0.364

for B: (10 * 0.300 + 5 * 0.400) / (10 + 5) = 0.333
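In code, the weighted mean is a one-liner; here’s a small sketch (function and variable names are my own):

```python
def weighted_mean(values, weights):
    """Average of `values`, each value weighted by its corresponding weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Weight each half-season average by its number of bats
avg_A = weighted_mean([0.250, 0.375], [4, 40])   # = 16/44, about 0.364
avg_B = weighted_mean([0.300, 0.400], [10, 5])   # = 5/15, about 0.333
```

Note that this is exactly the same as recomputing total hits over total bats – weighting by the sample sizes restores the information the plain average threw away.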

It’s perhaps easier to see what happens if we put the sequences of bats on a graph:

[Figure: cumulative batting average over the sequence of bats for each player]

Here I’ve arranged the bats from both half-seasons in order, so player A has 44 bats in total and player B has 15. Furthermore, I’ve arranged the series of bats so that each player starts each half-season with misses, and then ends the half-season with hits.

So the first peak of both plots marks the end of the first half-season, and we can see that player A has an overall half-season score of 0.250, while player B has 0.300.

Then the second half starts, and both players start off terribly, with misses. Player B starts with 3 consecutive misses, followed by 2 hits, for a second-half average of 2/5, i.e. 0.400.

Player A starts really terribly, with 25 misses, before picking up speed and hitting the next 15, for a second-half average of 0.375, which, thanks to the relatively many successful bats, brings the season total over and above that of player B.

Another observation from the graph above is that when the number of data points is small, each individual data point can impact the average a lot, while when the number of data points is large, the impact of any single point is small.

This is known as “The Law of Small Numbers” or “The Funnel”, and illustrated in the graph below:

[Figure: funnel plot of observed rate vs. population size]

This graph is based on the example of epidemics or cancer rates in different geographical areas given in this post, where the idea was that we have a number of areas, such as cities/counties/regions/countries, and a survey on some parameter, such as the rate of cancer, was done (I’ve moved the average to 0.5 here for a more ‘beautiful’ funnel…).

And it turned out that some of the areas had much higher rates than the average, which is enough for the media to put up headlines such as “Cancer Rates 7 times national average in area X!”

Well, it turns out that this type of outlier is often due to differences in population size, i.e. in areas with small populations, even a few random cases can skew the rate very badly.

So, what the Funnel above illustrates is that for areas with small populations, the parameter of interest, i.e. the ratio of infected or cancer cases, can vary wildly, purely by randomness.

In the example above, I’ve simulated a number of areas, randomly assigned a population to each area, and then with a probability of 0.5 hit each member of that population with a virus or cancer. Thus, the mean should be 50%, but for the areas with very small populations the ratio ranges from 0 to 100%, purely by randomness, while for the more populous areas the ratio converges to the true value of 50%.
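The original simulation isn’t shown in the post, but a minimal sketch along the same lines (my own code, with arbitrary area counts and sizes) reproduces the funnel effect:

```python
import random

random.seed(0)  # for reproducibility

def simulate_area(population, p=0.5):
    """Observed rate in one area when each inhabitant is 'hit' with probability p."""
    hits = sum(random.random() < p for _ in range(population))
    return hits / population

# 100 small areas (10 people each) vs 100 large areas (10,000 people each)
small_rates = [simulate_area(10) for _ in range(100)]
large_rates = [simulate_area(10_000) for _ in range(100)]

# Small areas scatter wildly around 0.5; large areas cluster tightly near it
print("small areas: min", min(small_rates), "max", max(small_rates))
print("large areas: min", round(min(large_rates), 3), "max", round(max(large_rates), 3))
```

The spread of rates in the small areas dwarfs that of the large ones, even though every individual was hit with exactly the same probability – the wide mouth of the funnel is pure sampling noise.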

About swdevperestroika

High tech industry veteran, avid hacker reluctantly transformed to mgmt consultant.
This entry was posted in Data Analytics, Statistics. Bookmark the permalink.
