Python, Pandas, Statsmodels: Linear Regression – dealing with categorical data

Sometimes your explanatory variables are not numeric, but categorical. Typical example: “Gender” – which used to be binary, and since I’m an old man, I will continue regarding it as binary… 😉

To handle categorical data in linear regression, we must convert it to numerical values. Luckily, Pandas has a function that does just that for us effortlessly:

pd.get_dummies()

So, in our example dataframe below, we have two categorical columns: ‘male’, which has already been encoded as 0/1, and ‘color’, which has not:

```
   age      height  male     weight  color
0   31  171.518577     1  64.866907  green
1   47  177.820986     0  58.922829  green
2   53  169.038830     1  50.096169  green
3   21  176.534921     1  65.554427  green
4   41  138.613215     1  51.425359  green
5   40  166.366936     0  68.101234  green
6   28  179.422860     1  50.794746  green
7   36  162.395712     0  40.132260  green
8   44  159.035290     0  54.100919  green
9   45  166.739483     0  62.111734  green
```

In order to include the ‘color’ column in our regression, we must convert it to numerical values. The easiest way is to let pd.get_dummies() do it for us.
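These are the same two lines used in the full script at the end of the post (df is the example dataframe from above):

```
# one 0/1 column per category value found in 'color'
dummy = pd.get_dummies(df['color'])
df = pd.concat([df, dummy], axis=1)
```

The dataframe then looks like this: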

```
   age      height  male      weight  color  blue  green  red
0   31  188.670434     1   71.353597    red     0      0    1
1   47  177.820986     0   89.161939  green     0      1    0
2   53  185.942713     1   50.216582    red     0      0    1
3   21  194.188414     1  110.273308    red     0      0    1
4   41  152.474536     1   55.532181   blue     1      0    0
5   40  166.366936     0   54.056843   blue     1      0    0
6   28  197.365146     1   87.518089    red     0      0    1
7   36  162.395712     0   52.907817   blue     1      0    0
8   44  159.035290     0   56.937504   blue     1      0    0
9   45  166.739483     0   48.671899   blue     1      0    0
```

Let’s also add a ‘female’ column. It is redundant, but it will let us illustrate multicollinearity later.
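Toggling the ‘male’ column is the quickest way to create it (these lines appear in the full script below):

```
# 'female' is simply the logical complement of 'male'
df['male'] = df['male'].astype('bool')
df['female'] = ~df['male']
df[['male','female']] = df[['male','female']].astype('int')
```

The dataframe now looks like this: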

```
       height      weight  age  male  female  color  blue  green  red
0  188.670434   71.353597   31     1       0    red     0      0    1
1  177.820986   89.161939   47     0       1  green     0      1    0
2  185.942713   50.216582   53     1       0    red     0      0    1
3  194.188414  110.273308   21     1       0    red     0      0    1
4  152.474536   55.532181   41     1       0   blue     1      0    0
5  166.366936   54.056843   40     0       1   blue     1      0    0
6  197.365146   87.518089   28     1       0    red     0      0    1
7  162.395712   52.907817   36     0       1   blue     1      0    0
8  159.035290   56.937504   44     0       1   blue     1      0    0
9  166.739483   48.671899   45     0       1   blue     1      0    0
```

Now let’s run the full dataset through Statsmodels OLS (of the 20000 generated entries, 19713 remain after the adults-only filter in the script below).
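The relevant lines look like this (taken from the full script at the end of the post; note that sm.OLS() does not add an intercept on its own, hence the sm.add_constant() call):

```
y = df['height']
X = sm.add_constant(df[['weight','male','age','red','female']])

fitted = sm.OLS(y, X).fit()
print(fitted.summary())
```

This prints: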

```
                            OLS Regression Results
==============================================================================
Dep. Variable:                 height   R-squared:                       0.823
Method:                 Least Squares   F-statistic:                 2.297e+04
Date:                Sun, 01 Apr 2018   Prob (F-statistic):               0.00
Time:                        21:15:26   Log-Likelihood:                -62369.
No. Observations:               19713   AIC:                         1.247e+05
Df Residuals:                   19708   BIC:                         1.248e+05
Df Model:                           4
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        103.6469      0.149    696.190      0.000     103.355     103.939
weight         0.2629      0.002    121.022      0.000       0.259       0.267
male          56.4231      0.090    625.937      0.000      56.246      56.600
age           -0.0029      0.004     -0.683      0.495      -0.011       0.005
red           10.8265      0.119     90.601      0.000      10.592      11.061
female        47.2238      0.090    522.004      0.000      47.046      47.401
==============================================================================
Omnibus:                     1070.226   Durbin-Watson:                   2.005
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             3061.090
Skew:                          -0.270   Prob(JB):                         0.00
Kurtosis:                       4.853   Cond. No.                     3.81e+16
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 9.72e-26. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

```

As can be seen, all predictors except age are statistically significant (at the 0.05 significance level). But note also warning [2] at the bottom of the summary about multicollinearity: it is caused by the fact that ‘male’ and ‘female’ are perfectly correlated – each is the complement of the other (female = 1 − male), so together with the constant the design matrix becomes singular. Removing either of the two columns makes the warning disappear. (Including all three color dummies alongside the constant would cause the same problem, which is also why only ‘red’ went into the model.)
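By the way, if you prefer to avoid the redundant column from the start, pd.get_dummies() accepts a drop_first argument – a minimal sketch, not part of the script below:

```
# drop_first=True drops the first category level ('blue' here), so the
# remaining dummies ('green', 'red') can safely coexist with a constant
dummy = pd.get_dummies(df['color'], drop_first=True)
df = pd.concat([df, dummy], axis=1)
```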

```
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm

### create dummy data ###
entries = 20000
height_series = pd.Series(np.random.normal(170,10,entries))
weight_series = pd.Series(np.random.normal(60,10,entries))
male_series = pd.Series(np.random.uniform(0,2,entries)).astype(int) # 0 or 1, roughly 50/50
age_series = pd.Series(np.random.normal(40,10,entries)).astype(int)

df = pd.DataFrame({'height':height_series,'weight':weight_series,
                   'male':male_series,'age':age_series})

df = df[df['age'] >= 18] # unnecessary here, but left to show row-selection syntax

df['color'] = 'green' # default categorical value, to be modified below

mean_height = np.mean(df['height'])
std_height = np.std(df['height'])
mean_weight = np.mean(df['weight'])
std_weight = np.std(df['weight'])
### ###

### increase weight of tall persons, reduce of short ###
df.loc[df['height'] > mean_height * 1.01,'weight'] = df['weight'].apply(
    lambda x: float(np.random.normal(mean_weight + 30,10,1)))

df.loc[df['height'] < mean_height * 0.995,'weight'] = df['weight'].apply(
    lambda x: float(np.random.normal(mean_weight - 10,5,1)))
### ###

### increase weight and height of men ###
df.loc[df['male'] == 1,'weight'] = df['weight'] * 1.1
df.loc[df['male'] == 1,'height'] = df['height'] * 1.1
### ###

### assign categorical data values (based on height here) ###
df.loc[df['height'] > mean_height * 1.07,'color'] = 'red'
df.loc[df['height'] < mean_height * 0.99,'color'] = 'blue'
### ###

### create numerical representation for categorical variable color ###
dummy = pd.get_dummies(df['color'])
df = pd.concat([df,dummy],axis=1)
### ###

### create female column by toggling male column ###
df['male'] = df['male'].astype('bool')
df['female'] = ~df['male']
df[['male','female']] = df[['male','female']].astype('int')
### ###

### reorder columns ###
df = df[['height','weight','age','male','female','color','blue','green','red']]

female_mean_height = np.mean(df['height'][df['female'] == 1])
male_mean_height = np.mean(df['height'][df['male'] == 1])

### setup Statsmodels Linear Regression ###
y = df['height']
X = df[['weight','male','age','red','female']]
X = sm.add_constant(X) # sm.OLS does not add an intercept by itself; the 'const' row in the summary comes from this

model = sm.OLS(y,X)
fitted = model.fit()
print(fitted.summary())
### ###

cols = X.columns.tolist()[1:] # regressor names, skipping the added constant

for col in cols:
    sm.graphics.plot_fit(fitted,col)

# there appears to be a bug in plot_fit - it doesn't plot the last regressor...
sm.graphics.plot_fit(fitted,cols[-1])

plt.figure()
plt.subplot(2,1,1)
plt.title('Height men')
plt.hist(df['height'][df['male'] == 1])

plt.subplot(2,1,2)
plt.title('Height women')
plt.hist(df['height'][df['female'] == 1])
plt.tight_layout()
plt.show()

```