Sometimes your explanatory variables are not numeric, but categorical. Typical example: “Gender” – which used to be binary, and since I’m an old man, I will continue regarding it as binary… 😉

To handle categorical data in Linear Regression, we must convert it to numerical values. Luckily, Pandas has a method that does just that for us:

pd.get_dummies()
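To see what the method does in isolation, here is a minimal sketch on a toy column (the column and category names are just illustrative):

```python
import pandas as pd

# A toy categorical column
df = pd.DataFrame({'color': ['green', 'red', 'blue', 'green']})

# get_dummies() creates one 0/1 indicator column per category,
# sorted alphabetically: blue, green, red
dummies = pd.get_dummies(df['color'])
print(dummies)
```

Each row has exactly one 1, marking that row's category.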

So, in our example dataframe below, we have two categorical data columns: ‘male’, which has already been coded to binary, and ‘color’, which has not:

   age      height  male     weight  color
0   31  171.518577     1  64.866907  green
1   47  177.820986     0  58.922829  green
2   53  169.038830     1  50.096169  green
3   21  176.534921     1  65.554427  green
4   41  138.613215     1  51.425359  green
5   40  166.366936     0  68.101234  green
6   28  179.422860     1  50.794746  green
7   36  162.395712     0  40.132260  green
8   44  159.035290     0  54.100919  green
9   45  166.739483     0  62.111734  green

In order to include the ‘color’ column in our regression, we must convert it to numerical values. The easiest way is to let pd.get_dummies() do that, and append the resulting indicator columns to the dataframe:

   age      height  male      weight  color  blue  green  red
0   31  188.670434     1   71.353597    red     0      0    1
1   47  177.820986     0   89.161939  green     0      1    0
2   53  185.942713     1   50.216582    red     0      0    1
3   21  194.188414     1  110.273308    red     0      0    1
4   41  152.474536     1   55.532181   blue     1      0    0
5   40  166.366936     0   54.056843   blue     1      0    0
6   28  197.365146     1   87.518089    red     0      0    1
7   36  162.395712     0   52.907817   blue     1      0    0
8   44  159.035290     0   56.937504   blue     1      0    0
9   45  166.739483     0   48.671899   blue     1      0    0

Let’s also add a column ‘female’. It is redundant, but it will let us illustrate multicollinearity later:

       height      weight  age  male  female  color  blue  green  red
0  188.670434   71.353597   31     1       0    red     0      0    1
1  177.820986   89.161939   47     0       1  green     0      1    0
2  185.942713   50.216582   53     1       0    red     0      0    1
3  194.188414  110.273308   21     1       0    red     0      0    1
4  152.474536   55.532181   41     1       0   blue     1      0    0
5  166.366936   54.056843   40     0       1   blue     1      0    0
6  197.365146   87.518089   28     1       0    red     0      0    1
7  162.395712   52.907817   36     0       1   blue     1      0    0
8  159.035290   56.937504   44     0       1   blue     1      0    0
9  166.739483   48.671899   45     0       1   blue     1      0    0

Running the full dataset (20,000 entries, minus the rows dropped by the age filter) through Statsmodels OLS results in:

                            OLS Regression Results
==============================================================================
Dep. Variable:                 height   R-squared:                       0.823
Model:                            OLS   Adj. R-squared:                  0.823
Method:                 Least Squares   F-statistic:                 2.297e+04
Date:                Sun, 01 Apr 2018   Prob (F-statistic):               0.00
Time:                        21:15:26   Log-Likelihood:                -62369.
No. Observations:               19713   AIC:                         1.247e+05
Df Residuals:                   19708   BIC:                         1.248e+05
Df Model:                           4
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        103.6469      0.149    696.190      0.000     103.355     103.939
weight         0.2629      0.002    121.022      0.000       0.259       0.267
male          56.4231      0.090    625.937      0.000      56.246      56.600
age           -0.0029      0.004     -0.683      0.495      -0.011       0.005
red           10.8265      0.119     90.601      0.000      10.592      11.061
female        47.2238      0.090    522.004      0.000      47.046      47.401
==============================================================================
Omnibus:                     1070.226   Durbin-Watson:                   2.005
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             3061.090
Skew:                          -0.270   Prob(JB):                         0.00
Kurtosis:                       4.853   Cond. No.                     3.81e+16
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 9.72e-26. This might indicate that there are strong multicollinearity problems or that the design matrix is singular.

As can be seen, all predictors except age are statistically significant (at significance level 0.05). But note also the warning [2] at the bottom of the results summary about multicollinearity: it is caused by the fact that the categories ‘male’ and ‘female’ correlate perfectly, since one is the inverse of the other. By removing one of those categories, the warning disappears. The same trap lurks in the color dummies: ‘blue’, ‘green’ and ‘red’ always sum to 1, so together with the constant they are linearly dependent as well, which is why only ‘red’ was included among the regressors below.
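One way to avoid this dummy variable trap from the start is the drop_first flag of pd.get_dummies(), which drops the first category so the remaining indicators are no longer perfectly collinear with the intercept. A minimal sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({'color': ['green', 'red', 'blue', 'green']})

# drop_first=True omits the first category ('blue' here); a row with
# all zeros in the remaining columns then means 'blue'
dummies = pd.get_dummies(df['color'], drop_first=True)
print(dummies.columns.tolist())
```

The dropped category becomes the baseline that the remaining dummy coefficients are measured against.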

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm

### create dummy data ###
entries = 20000
height_series = pd.Series(np.random.normal(170, 10, entries))
weight_series = pd.Series(np.random.normal(60, 10, entries))
male_series = pd.Series(np.random.uniform(0, 2, entries)).astype(int)
age_series = pd.Series(np.random.normal(40, 10, entries)).astype(int)
df = pd.DataFrame({'height': height_series, 'weight': weight_series,
                   'male': male_series, 'age': age_series})
df = df[df['age'] >= 18]  # unnecessary here, but left in to show selection syntax
df['color'] = 'green'     # default categorical value, to be modified below
print(df.head(10))
mean_height = np.mean(df['height'])
std_height = np.std(df['height'])
mean_weight = np.mean(df['weight'])
std_weight = np.std(df['weight'])
### ###

### increase weight of tall persons, reduce weight of short ones ###
df.loc[df['height'] > mean_height * 1.01, 'weight'] = df['weight'].apply(
    lambda x: float(np.random.normal(mean_weight + 30, 10, 1)))
df.loc[df['height'] < mean_height * 0.995, 'weight'] = df['weight'].apply(
    lambda x: float(np.random.normal(mean_weight - 10, 5, 1)))
### ###

### increase weight and height of men ###
df.loc[df['male'] == 1, 'weight'] = df['weight'] * 1.1
df.loc[df['male'] == 1, 'height'] = df['height'] * 1.1
### ###

### assign categorical data values (based on height here) ###
df.loc[df['height'] > mean_height * 1.07, 'color'] = 'red'
df.loc[df['height'] < mean_height * 0.99, 'color'] = 'blue'
### ###

### create numerical representation for categorical variable color ###
dummy = pd.get_dummies(df['color'])
df = pd.concat([df, dummy], axis=1)
### ###

print(df.head(10))

### create female column by toggling male column ###
df['male'] = df['male'].astype('bool')
df['female'] = ~df['male']
df[['male', 'female']] = df[['male', 'female']].astype('int')
### ###

### reorder columns ###
df = df[['height', 'weight', 'age', 'male', 'female', 'color', 'blue', 'green', 'red']]

female_mean_height = np.mean(df['height'][df['female'] == 1])
male_mean_height = np.mean(df['height'][df['male'] == 1])

### set up Statsmodels Linear Regression ###
y = df['height']
X = df[['weight', 'male', 'age', 'red', 'female']]
X = sm.add_constant(X)
model = sm.OLS(y, X)
fitted = model.fit()
print(fitted.summary())
### ###

cols = X.columns.tolist()
for col in cols:
    sm.graphics.plot_fit(fitted, col)
# there appears to be a bug in plot_fit - it doesn't plot the last regressor...
sm.graphics.plot_fit(fitted, cols[-1])

plt.figure()
plt.subplot(2, 1, 1)
plt.title('Height men')
plt.hist(df['height'][df['male'] == 1])
plt.subplot(2, 1, 2)
plt.title('Height women')
plt.hist(df['height'][df['female'] == 1])
plt.tight_layout()
plt.show()