Simpson’s Paradox

As noted previously, e.g. in this post, care must be taken with statistics…

From Unherd.com:

Simpsons Paradox & R0

Thanks to Richard McElreath’s Bayesian inference class, I happened to have the data for the university admissions example of Simpson’s Paradox discussed in the article.

It’s a very illustrative example of what’s called “confounding” or “confounds” in statistics: basically, a confound is some unobserved variable that influences the variables in the model, and it can often wreak havoc with causal inferences, e.g. by reversing them.

In the UCB example, the hypothesis was that since only 30% of women got admitted to PhD programmes, while the acceptance rate for men was over 40%, there must be gender discrimination at play, i.e. that the admission board discriminated against women.

Only by identifying the confounder, in this case the department applied to, did it become clear that there was no discrimination by the admission board. Women had a lower overall acceptance rate because they, on average, applied to departments that accept very few students, while men on average applied to other departments with much higher acceptance rates.
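To see how this reversal can happen at all, here is a minimal sketch with made-up numbers. The two departments (`easy` and `hard`) and all counts are hypothetical, chosen only to show the mechanism; they are not the UCB data.

```python
# Hypothetical toy numbers, NOT the UCB data: (applications, admitted)
toy = {
    'easy': {'male': (80, 48), 'female': (20, 13)},
    'hard': {'male': (20, 2),  'female': (80, 12)},
}

def rate(applications, admitted):
    """Acceptance rate as a fraction of applications."""
    return admitted / applications

# Acceptance rate per department and gender
dept_rates = {
    dept: {g: rate(*counts) for g, counts in by_gender.items()}
    for dept, by_gender in toy.items()
}

# Pooled acceptance rate per gender, ignoring the department
pooled_rates = {}
for g in ('male', 'female'):
    apps = sum(toy[d][g][0] for d in toy)
    adm = sum(toy[d][g][1] for d in toy)
    pooled_rates[g] = rate(apps, adm)

for dept, rates in dept_rates.items():
    print(dept, rates)
print('pooled', pooled_rates)
```

In this toy setup women have the higher acceptance rate in each department, yet the lower pooled rate, simply because most of them applied to the department that admits few applicants.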

So, by “controlling for” the confounder, that is, the department, the hypothesis that the admission board discriminated against women could be refuted.
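One simple way to “control for” the department is direct standardization: ask what women’s overall acceptance rate would have been if they had applied to departments in the same proportions as men. The sketch below does this on hypothetical toy counts (not the UCB data); the function name `standardized_rate` is my own and is not taken from any library.

```python
# Hypothetical toy numbers, NOT the UCB data.
applications = {'easy': {'male': 80, 'female': 20},
                'hard': {'male': 20, 'female': 80}}
admitted = {'easy': {'male': 48, 'female': 13},
            'hard': {'male': 2, 'female': 12}}

def standardized_rate(gender, reference):
    """Rate `gender` would have had with `reference`'s department mix."""
    ref_total = sum(applications[d][reference] for d in applications)
    total = 0.0
    for d in applications:
        # Acceptance rate of `gender` within this department
        dept_rate = admitted[d][gender] / applications[d][gender]
        # Share of the reference group's applications going to this department
        weight = applications[d][reference] / ref_total
        total += dept_rate * weight
    return total

male_rate = standardized_rate('male', 'male')     # men's actual pooled rate
female_std = standardized_rate('female', 'male')  # women, reweighted to men's mix
print(male_rate, female_std)
```

With the department mix held equal, the apparent disadvantage for women in these toy numbers disappears, which is the same logic as comparing the genders department by department.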

Below is a quick Python hack to illustrate this.

[before you run this code, make sure you don’t have anything named ‘share’ in your current directory!]

 

simpsons_paradox_gender_ratios

import os

os.system('rm -rf share')
os.system('git clone https://github.com/tolex3/share.git')

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

sns.set()


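# Load the UCB admissions data (semicolon-separated)
df = pd.read_csv('./share/UCBadmit.csv',sep=';')

# Admissions and applications per department and gender;
# margins=True adds the 'All' totals row and column
pivot = pd.pivot_table(df,index='dept',
                       columns='applicant.gender',
                       values=['admit','applications'],
                       aggfunc='sum',margins=True)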

pivot['applications_f'] = pivot[('applications','female')] / \
    pivot[('applications','All')]

pivot['applications_m'] = pivot[('applications','male')] / \
    pivot[('applications','All')]

pivot['admit_pct_f'] = pivot[('admit','female')] / \
    pivot[('applications','female')]

pivot['admit_pct_m'] = pivot[('admit','male')] / \
    pivot[('applications','male')]

pivot['admit_pct_tot'] = pivot[('admit','All')] / \
    pivot[('applications','All')]

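# Flag the rows where women were admitted at a lower rate than men
pivot['female_less'] = pivot['admit_pct_f'] < pivot['admit_pct_m']

print(pivot.head(20))

# Application shares and acceptance rates per department (plus the 'All' row)
pivot[['applications_f','applications_m','admit_pct_f','admit_pct_m']].plot.bar(
    figsize=(18,12),color=['red','blue','pink','cyan'])
plt.show()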


 

About swdevperestroika

High tech industry veteran, avid hacker reluctantly transformed to mgmt consultant.
This entry was posted in Epidemics, Statistics.