Python, Pandas, Statsmodels: Linear Regression – dealing with categorical data

Sometimes your explanatory variables are not numeric but categorical. A typical example is “Gender” – which used to be binary, and since I’m an old man, I will continue regarding it as binary… 😉

To handle categorical data in Linear Regression, we must convert it to numerical values. Luckily, Pandas has a method that does just that for us:

pd.get_dummies()
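As a minimal sketch (the values here are made up for illustration), pd.get_dummies() turns a column of labels into one indicator column per category:

```python
import pandas as pd

s = pd.Series(['green', 'green', 'red', 'blue'])
# newer pandas versions return booleans, so cast to 0/1 for display
dummies = pd.get_dummies(s).astype(int)
print(dummies)
```

Note that the indicator columns come out in alphabetical order (blue, green, red), and each row has exactly one 1.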

So, in our example dataframe below, we have two categorical data columns: ‘male’, which has already been coded to binary, and ‘color’, which has not:

 age height male weight color
0 31 171.518577 1 64.866907 green
1 47 177.820986 0 58.922829 green
2 53 169.038830 1 50.096169 green
3 21 176.534921 1 65.554427 green
4 41 138.613215 1 51.425359 green
5 40 166.366936 0 68.101234 green
6 28 179.422860 1 50.794746 green
7 36 162.395712 0 40.132260 green
8 44 159.035290 0 54.100919 green
9 45 166.739483 0 62.111734 green

In order to include the ‘color’ column in our regression, we must convert it to numerical values. The easiest way is to let pd.get_dummies() do that:

 age height male weight color blue green red
0 31 188.670434 1 71.353597 red 0 0 1
1 47 177.820986 0 89.161939 green 0 1 0
2 53 185.942713 1 50.216582 red 0 0 1
3 21 194.188414 1 110.273308 red 0 0 1
4 41 152.474536 1 55.532181 blue 1 0 0
5 40 166.366936 0 54.056843 blue 1 0 0
6 28 197.365146 1 87.518089 red 0 0 1
7 36 162.395712 0 52.907817 blue 1 0 0
8 44 159.035290 0 56.937504 blue 1 0 0
9 45 166.739483 0 48.671899 blue 1 0 0
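The conversion above boils down to two lines – generate the dummy columns, then glue them onto the original frame (a sketch with made-up values, mirroring the concat approach used in the full listing below):

```python
import pandas as pd

df = pd.DataFrame({'height': [171.5, 177.8, 169.0],
                   'color': ['green', 'red', 'green']})
dummy = pd.get_dummies(df['color']).astype(int)  # one 0/1 column per color
df = pd.concat([df, dummy], axis=1)              # append alongside originals
print(df.columns.tolist())
```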

Let’s also add a column ‘female’, despite it being redundant, to be able to illustrate multicollinearity later:

 height weight age male female color blue green red
0 188.670434 71.353597 31 1 0 red 0 0 1
1 177.820986 89.161939 47 0 1 green 0 1 0
2 185.942713 50.216582 53 1 0 red 0 0 1
3 194.188414 110.273308 21 1 0 red 0 0 1
4 152.474536 55.532181 41 1 0 blue 1 0 0
5 166.366936 54.056843 40 0 1 blue 1 0 0
6 197.365146 87.518089 28 1 0 red 0 0 1
7 162.395712 52.907817 36 0 1 blue 1 0 0
8 159.035290 56.937504 44 0 1 blue 1 0 0
9 166.739483 48.671899 45 0 1 blue 1 0 0
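Incidentally, adding a redundant column like ‘female’ is exactly what pd.get_dummies(..., drop_first=True) is designed to prevent: it drops one level per categorical variable, so the remaining dummies stay linearly independent (a sketch):

```python
import pandas as pd

colors = pd.Series(['red', 'green', 'blue', 'blue'])
# 'blue' (the alphabetically first level) is dropped and becomes the
# implicit baseline category of the regression
dummies = pd.get_dummies(colors, drop_first=True).astype(int)
print(dummies.columns.tolist())
```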

Running the full dataset (20,000 entries) through Statsmodels OLS yields:

                            OLS Regression Results
==============================================================================
Dep. Variable:                 height   R-squared:                       0.823
Model:                            OLS   Adj. R-squared:                  0.823
Method:                 Least Squares   F-statistic:                 2.297e+04
Date:                Sun, 01 Apr 2018   Prob (F-statistic):               0.00
Time:                        21:15:26   Log-Likelihood:                -62369.
No. Observations:               19713   AIC:                         1.247e+05
Df Residuals:                   19708   BIC:                         1.248e+05
Df Model:                           4
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        103.6469      0.149    696.190      0.000     103.355     103.939
weight         0.2629      0.002    121.022      0.000       0.259       0.267
male          56.4231      0.090    625.937      0.000      56.246      56.600
age           -0.0029      0.004     -0.683      0.495      -0.011       0.005
red           10.8265      0.119     90.601      0.000      10.592      11.061
female        47.2238      0.090    522.004      0.000      47.046      47.401
==============================================================================
Omnibus:                     1070.226   Durbin-Watson:                   2.005
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             3061.090
Skew:                          -0.270   Prob(JB):                         0.00
Kurtosis:                       4.853   Cond. No.                     3.81e+16
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 9.72e-26. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

As can be seen, all predictors except age are statistically significant (at the 0.05 level). But note also the warning [2] at the bottom of the results summary about multicollinearity: it is caused by the fact that the categories ‘male’ and ‘female’ correlate perfectly – one is the inverse of the other. By removing one of those categories, the warning disappears.
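The perfect correlation is easy to verify directly – with gender coded as two complementary dummies, ‘female’ is exactly 1 − ‘male’ (a sketch with made-up values):

```python
import pandas as pd

male = pd.Series([1, 0, 1, 0, 0, 1])
female = 1 - male  # exact complement of 'male'
print(male.corr(female))  # perfectly negatively correlated
```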

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm

### create dummy data ###
entries = 20000
height_series = pd.Series(np.random.normal(170,10,entries))
weight_series = pd.Series(np.random.normal(60,10,entries))
male_series = pd.Series(np.random.uniform(0,2,entries)).astype(int) # truncates to 0 or 1
age_series = pd.Series(np.random.normal(40,10,entries)).astype(int)

df = pd.DataFrame({'height':height_series,'weight':weight_series,
                   'male':male_series,'age':age_series})

df = df[df['age'] >= 18] # unnecessary here, but left to show selection syntax

df['color'] = 'green' # default categorical value, to be modified below

print (df.head(10))
       
mean_height = np.mean(df['height'])
std_height = np.std(df['height'])
mean_weight = np.mean(df['weight'])
std_weight = np.std(df['weight'])
### ###

### increase weight of tall persons, reduce that of short ###
df.loc[df['height'] > mean_height * 1.01,'weight'] = df['weight'].apply(
    lambda x: float(np.random.normal(mean_weight + 30,10,1)))

df.loc[df['height'] < mean_height * 0.995,'weight'] = df['weight'].apply(
    lambda x: float(np.random.normal(mean_weight - 10,5,1)))
### ###

### increase weight and height of men ###
df.loc[df['male'] == 1,'weight'] = df['weight'] * 1.1
df.loc[df['male'] == 1,'height'] = df['height'] * 1.1
### ###

### assign categorical data values (based on height here) ###
df.loc[df['height'] > mean_height * 1.07,'color'] = 'red'
df.loc[df['height'] < mean_height * 0.99,'color'] = 'blue'
### ###

### create numerical representation for categorical variable color ###
dummy = pd.get_dummies(df['color'])
df = pd.concat([df,dummy],axis=1)
### ###

print (df.head(10))

### create female column by toggling male column ### 
df['male'] = df['male'].astype('bool')
df['female'] = ~df['male']
df[['male','female']] = df[['male','female']].astype('int')
### ###

### reorder columns ###
df = df[['height','weight','age','male','female','color','blue','green','red']]

female_mean_height = np.mean(df['height'][df['female'] == 1])
male_mean_height = np.mean(df['height'][df['male'] == 1])

### setup Statsmodels Linear Regression ###
y = df['height']
X = df[['weight','male','age','red','female']]
X = sm.add_constant(X)

model = sm.OLS(y,X)
fitted = model.fit()
print (fitted.summary())
### ### 

cols = X.columns.tolist()

for col in cols:
    sm.graphics.plot_fit(fitted,col)

# there appears to be a bug in plot_fit - it doesn't plot the last regressor...
sm.graphics.plot_fit(fitted,cols[-1])

plt.figure()
plt.subplot(2,1,1)
plt.title ('Height men')
plt.hist(df['height'][df['male'] == 1])

plt.subplot(2,1,2)
plt.title('Height women')
plt.hist(df['height'][df['female'] == 1])
plt.tight_layout()
plt.show()

 

About swdevperestroika

High tech industry veteran, avid hacker reluctantly transformed to mgmt consultant.
This entry was posted in Data Analytics, Numpy, Probability, Python, Simulation, Statistics. Bookmark the permalink.
