## Linear Regression – comparing Bayesian with Frequentist approach

[This post was inspired by Richard McElreath]

One of the nice characteristics of Bayesian methods is that you are always dealing with full distributions. On the other hand, in a Frequentist (tratitional) approach, you will mostly deal with point estimators, such as mean values.

In Bayesian world, since we are dealing with distributions asop to point estimators, we are able to preserve the underlying uncertainty in our models, which is a good thing, not least to avoid overfitting the model.

In the below example, I’ve done a Linear Regression on Nancy Howell’s data on the Dobe !Kung people,  using both Bayesian as well as Frequentist methods.

The data at hand deals with if and how human height relates to human weight.

First, lets run the Frequentist approach: So above we have the results from a frequentist approach, where we asked Statsmodels to correlate the weights with the heights. Thanks to Statsmodels, we not only get the nice, straight line in red, but also the margin-of-error bars, overlayed upon the actual data points.  However, the plot uses exclusively only the data at hand (352 data points).

Here’s the Bayesian version: In the graph above, I’ve sampled from the resulting regression posterior, and plotted all the red lines on top of the data points. The black straight line is not part of Bayes, but comes from the above Frequentist analysis. As can be seen, in Bayes the uncertainty is directily visible thanks to the many red lines, while in the frequentist approach we must explicitly add the error bars to illustrate the uncertainty.

Furthermore, as we are now doing Bayes, we are sampling from a posterior distribution that can be made arbritarily large, in this case, the posterior contains more that 100.000 entries that have been fitted to the underlying real data.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import pymc as pm
import statsmodels.api as sm

df = df.loc[df['age'] >= 18]

std = 100
tau = 1. / (std * std)

alpha = pm.Normal('alpha',178,tau)
beta = pm.Normal('beta',0,1/(10*10))
sigma = pm.Uniform('sigma',lower=0,upper=50)

x = pm.Normal('x',value=df['weight'],observed=True)

@pm.deterministic()
def linear_regress(x=x,alpha=alpha,beta=beta):
return x*alpha+beta

y = pm.Normal('y',mu=linear_regress,tau=1/(sigma*sigma),
value=df['height'],observed=True)

model = pm.Model([alpha,beta,sigma,x,y])

mcmc = pm.MCMC(model)
mcmc.sample(iter=100000,burn=10000,thin=10)

alpha_samples = mcmc.trace('alpha')[:]
beta_samples = mcmc.trace('beta')[:]

nr_of_lines = 352

rows = np.random.choice(range(len(alpha_samples)),
replace=True,size=10000)

alphas = alpha_samples[rows]
betas = beta_samples[rows]

X = np.linspace(30,70,nr_of_lines)

lines = np.array([X[i] * alphas + betas for i in range(len(X))])

print ('PYMC: slope mean:',np.mean(alphas))
print ('PYMC: intercept mean:',np.mean(betas))

##### statsmodels #####

sm_x = df['weight']
sm_y = df['height']

sm_model = sm.OLS(sm_y,sm_x)
fitted = sm_model.fit()
print (fitted.summary())
sm_intercept = fitted.params
sm_slope = fitted.params

plt.figure(figsize=(18,12))
sm.graphics.plot_fit(fitted,1)
plt.savefig('statsmodels_fit.jpg', format='jpg')

plt.figure()
plt.subplot(2,1,1)
plt.title(r'$\alpha$-posterior (slope)')
plt.hist(alpha_samples,bins=50)

plt.subplot(2,1,2)
plt.title(r'$\beta$-posterior (intercept)')
plt.hist(beta_samples,bins=50)
plt.tight_layout()

plt.figure(figsize=(18,12))
plt.title('Linear Regression Posterior')
plt.plot(X,lines[:,0],color='r',alpha=1,label='Baysian Posterior Regression')
plt.plot(X,lines[:,1:],color='r',alpha=0.005)
plt.scatter(df['weight'],df['height'],label='Data to be fitted')

# print statsmodels based regression
plt.plot(X,sm_slope * X + sm_intercept,color='k',
lw=3,label='Frequentist Regression')

plt.ylabel('length [cm]')
plt.xlabel('weight [kg]')
plt.ylim(130,180)
plt.legend(loc='upper left')
plt.savefig('linreg_comparision.jpg',format='jpg')
plt.show() 