Linear Regression – comparing Bayesian with Frequentist approach

[This post was inspired by Richard McElreath's Statistical Rethinking]

One of the nice characteristics of Bayesian methods is that you are always dealing with full distributions. In a Frequentist (traditional) approach, on the other hand, you mostly deal with point estimates, such as mean values.

In the Bayesian world, since we are dealing with distributions as opposed to point estimates, we preserve the underlying uncertainty in our models, which is a good thing, not least because it helps avoid overfitting the model.
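As a toy illustration (with made-up samples, not the Howell data used below), the difference shows up already in how results are summarized: a point estimate is a single number, while a set of posterior samples lets you read off any interval you like:

import numpy as np

# Hypothetical posterior samples for some parameter (illustration only).
samples = np.random.normal(loc=0.9, scale=0.05, size=10000)

point_estimate = samples.mean()               # a single number
lo, hi = np.percentile(samples, [2.5, 97.5])  # intervals come for free

print('point estimate: %.3f' % point_estimate)
print('95%% interval: [%.3f, %.3f]' % (lo, hi))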

In the example below, I've run a linear regression on Nancy Howell's data on the Dobe !Kung people, using both Bayesian and Frequentist methods.

The data at hand concerns whether, and how, human height relates to human weight.

First, let's run the Frequentist approach:

[Figure: statsmodels_fit – OLS regression line with confidence intervals over the height/weight data]

Above are the results from the frequentist approach, where we asked Statsmodels to regress the heights on the weights. Thanks to Statsmodels, we get not only the nice straight line in red, but also margin-of-error bars, overlaid on the actual data points. Note, however, that the plot uses only the data at hand (352 data points).
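If you want those margins of error as numbers rather than a picture, statsmodels exposes them directly. A minimal sketch, using the fitted OLS results object ('fitted') from the full listing further down:

print(fitted.params)      # point estimates: intercept ('const') and slope ('weight')
print(fitted.conf_int())  # 95% confidence interval per parameter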

Here’s the Bayesian version:

[Figure: linreg_comparison – posterior regression lines (red) and frequentist fit (black) over the data]

In the graph above, I've sampled from the resulting regression posterior and plotted all the red lines on top of the data points. The straight black line is not part of the Bayesian analysis; it comes from the frequentist analysis above. As can be seen, with Bayes the uncertainty is directly visible thanks to the many red lines, while in the frequentist approach we must explicitly add error bars to illustrate it.

Furthermore, since we are now doing Bayes, we can draw as many posterior samples as we like. In this case the MCMC run comprises 100,000 iterations; after burn-in (the first 10,000) and thinning (keeping every 10th), 9,000 posterior samples remain, from which 10,000 (alpha, beta) pairs are resampled with replacement to draw the red lines.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import pymc as pm              # PyMC2 – the decorator/MCMC API below is PyMC2-specific
import statsmodels.api as sm

# Howell's !Kung demography data; keep adults only.
df = pd.read_csv('Stat_Rethink/Howell1.csv', header=0, sep=';')
df = df.loc[df['age'] >= 18]

# Priors. PyMC2 parameterizes the Normal by precision tau = 1/std^2.
std = 100
tau = 1. / (std * std)

alpha = pm.Normal('alpha', 178, tau)            # intercept: mean height around 178 cm
beta = pm.Normal('beta', 0, 1. / (10 * 10))     # slope: Normal(0, std=10)
sigma = pm.Uniform('sigma', lower=0, upper=50)

# The weights are fixed observed data, so a plain array is all we need.
x = df['weight'].values

@pm.deterministic
def linear_regress(x=x, alpha=alpha, beta=beta):
    return alpha + beta * x   # mu = intercept + slope * weight

y = pm.Normal('y', mu=linear_regress, tau=1. / (sigma * sigma),
              value=df['height'], observed=True)

model = pm.Model([alpha, beta, sigma, y])

mcmc = pm.MCMC(model)
# 100,000 iterations; after burn-in (10,000) and thinning (every 10th),
# 9,000 posterior samples remain in each trace.
mcmc.sample(iter=100000, burn=10000, thin=10)

alpha_samples = mcmc.trace('alpha')[:]
beta_samples = mcmc.trace('beta')[:]

nr_of_points = 352   # resolution of the x-axis grid for the plotted lines

# Resample 10,000 (alpha, beta) pairs from the trace, with replacement.
rows = np.random.choice(len(alpha_samples), replace=True, size=10000)

alphas = alpha_samples[rows]
betas = beta_samples[rows]

X = np.linspace(30, 70, nr_of_points)

# One regression line per posterior draw: height = alpha + beta * weight.
lines = np.array([alphas + betas * X[i] for i in range(len(X))])

print('PYMC: intercept mean:', np.mean(alphas))
print('PYMC: slope mean:', np.mean(betas))

##### statsmodels #####

sm_x = df['weight']
sm_x = sm.add_constant(sm_x)
sm_y = df['height']

sm_model = sm.OLS(sm_y,sm_x)
fitted = sm_model.fit()
print (fitted.summary())
sm_intercept = fitted.params['const']
sm_slope = fitted.params['weight']

# plot_fit draws the fitted values with confidence intervals over the data;
# pass the axes explicitly so the figsize takes effect.
fig, ax = plt.subplots(figsize=(18, 12))
sm.graphics.plot_fit(fitted, 1, ax=ax)
plt.savefig('statsmodels_fit.jpg', format='jpg')
           
plt.figure()
plt.subplot(2, 1, 1)
plt.title(r'$\alpha$-posterior (intercept)')
plt.hist(alpha_samples, bins=50)

plt.subplot(2, 1, 2)
plt.title(r'$\beta$-posterior (slope)')
plt.hist(beta_samples, bins=50)
plt.tight_layout()

plt.figure(figsize=(18,12))
plt.title('Linear Regression Posterior')
plt.plot(X, lines[:, 0], color='r', alpha=1, label='Bayesian Posterior Regression')
plt.plot(X, lines[:, 1:], color='r', alpha=0.005)
plt.scatter(df['weight'],df['height'],label='Data to be fitted')

# print statsmodels based regression 
plt.plot(X,sm_slope * X + sm_intercept,color='k',
         lw=3,label='Frequentist Regression')

plt.ylabel('height [cm]')
plt.xlabel('weight [kg]')
plt.ylim(130,180)
plt.legend(loc='upper left')
plt.savefig('linreg_comparison.jpg', format='jpg')
plt.show()
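Finally, since the full traces are available, the uncertainty that the red lines show visually can also be quantified numerically. A minimal sketch, reusing the alpha_samples and beta_samples arrays from the listing above:

# 95% posterior (credible) intervals, read straight off the samples.
a_lo, a_hi = np.percentile(alpha_samples, [2.5, 97.5])
b_lo, b_hi = np.percentile(beta_samples, [2.5, 97.5])

print('intercept: mean %.2f, 95%% interval [%.2f, %.2f]'
      % (np.mean(alpha_samples), a_lo, a_hi))
print('slope: mean %.3f, 95%% interval [%.3f, %.3f]'
      % (np.mean(beta_samples), b_lo, b_hi))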
