[This post was inspired by Richard McElreath]

One of the nice characteristics of Bayesian methods is that you are always dealing with full distributions. In a Frequentist (traditional) approach, on the other hand, you will mostly deal with point estimators, such as mean values.

In the Bayesian world, since we are dealing with distributions as opposed to point estimators, we are able to preserve the underlying uncertainty in our models, which is a good thing, not least because it helps avoid overfitting the model.
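As a toy sketch of the difference (hypothetical numbers, not the !Kung data): a Frequentist fit hands back one number per parameter, while a posterior hands back an array of plausible values that can be summarized any way you like.

import numpy as np

# hypothetical posterior draws for some parameter, e.g. a regression slope
posterior = np.random.normal(loc=0.9, scale=0.05, size=9000)

point_estimate = posterior.mean()                      # the single number a point estimator reports
credible_int = np.percentile(posterior, [2.5, 97.5])   # the full distribution also yields intervals,
                                                       # quantiles, predictive draws, ...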

In the example below, I've run a linear regression on Nancy Howell's data on the Dobe !Kung people, using both Bayesian and Frequentist methods.

The data at hand deals with whether and how human height relates to human weight.

First, let's run the Frequentist approach:
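For reference, these are the Statsmodels lines from the full listing at the end of the post (they assume Howell1.csv has already been loaded into the DataFrame df, adults only):

import statsmodels.api as sm

sm_x = sm.add_constant(df['weight'])       # add the intercept column
fitted = sm.OLS(df['height'], sm_x).fit()
print(fitted.summary())                    # point estimates plus standard errors
sm.graphics.plot_fit(fitted, 1)            # fitted line with error bounds over the data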

So above we have the results from the Frequentist approach, where we asked Statsmodels to regress the heights on the weights. Thanks to Statsmodels, we not only get the nice, straight line in red, but also the margin-of-error bars, overlaid upon the actual data points. However, the plot uses only the data at hand (352 data points).

Here’s the Bayesian version:
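The model itself is only a few lines of PyMC (the PyMC2 API, where Normal takes a precision tau = 1/std² rather than a standard deviation); here is the relevant excerpt from the full listing at the end of the post:

alpha = pm.Normal('alpha', 178, 1. / (100 * 100))  # intercept prior: mean height in cm
beta = pm.Normal('beta', 0, 1. / (10 * 10))        # slope prior
sigma = pm.Uniform('sigma', lower=0, upper=50)     # residual standard deviation

@pm.deterministic
def linear_regress(x=df['weight'].values, alpha=alpha, beta=beta):
    return alpha + beta * x                        # the regression line

y = pm.Normal('y', mu=linear_regress, tau=1 / (sigma * sigma),
              value=df['height'], observed=True)

mcmc = pm.MCMC(pm.Model([alpha, beta, sigma, y]))
mcmc.sample(iter=100000, burn=10000, thin=10)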

In the graph above, I've sampled from the resulting regression posterior and plotted all the red lines on top of the data points. The black straight line is not part of the Bayesian analysis; it comes from the Frequentist analysis above. As can be seen, with Bayes the uncertainty is directly visible thanks to the many red lines, while in the Frequentist approach we must explicitly add the error bars to illustrate the uncertainty.
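Concretely, the red band is drawn by picking (alpha, beta) pairs from the trace and plotting one almost-transparent line per pair, as in the full listing:

rows = np.random.choice(range(len(alpha_samples)), replace=True, size=10000)
alphas, betas = alpha_samples[rows], beta_samples[rows]
X = np.linspace(30, 70, 352)                       # weight range of the adult !Kung data
lines = np.array([alphas + betas * x_i for x_i in X])
plt.plot(X, lines, color='r', alpha=0.005)         # each column is one posterior regression line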

Furthermore, as we are now doing Bayes, we are sampling from a posterior distribution that can be made arbitrarily large; in this case the MCMC run comprises 100,000 iterations fitted to the underlying real data.
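Note that burn-in and thinning reduce what is actually kept. With the settings from the listing below, the retained trace looks like this:

mcmc.sample(iter=100000, burn=10000, thin=10)
alpha_samples = mcmc.trace('alpha')[:]
print(len(alpha_samples))    # (100000 - 10000) / 10 = 9000 retained posterior draws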

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import pymc as pm
import statsmodels.api as sm

# Load Howell's !Kung census data and keep adults only
df = pd.read_csv('Stat_Rethink/Howell1.csv', header=0, sep=';')
df = df.loc[df['age'] >= 18]

# Priors: alpha is the intercept (mean height in cm), beta the slope.
# PyMC2 parametrizes the Normal with precision tau = 1/std^2.
std = 100
tau = 1. / (std * std)
alpha = pm.Normal('alpha', 178, tau)
beta = pm.Normal('beta', 0, 1. / (10 * 10))
sigma = pm.Uniform('sigma', lower=0, upper=50)

x = df['weight'].values

@pm.deterministic
def linear_regress(x=x, alpha=alpha, beta=beta):
    return alpha + beta * x

y = pm.Normal('y', mu=linear_regress, tau=1 / (sigma * sigma),
              value=df['height'], observed=True)

model = pm.Model([alpha, beta, sigma, y])
mcmc = pm.MCMC(model)
mcmc.sample(iter=100000, burn=10000, thin=10)

alpha_samples = mcmc.trace('alpha')[:]
beta_samples = mcmc.trace('beta')[:]

# Draw 10000 (alpha, beta) pairs from the posterior and turn each
# pair into a regression line over the observed weight range.
nr_of_lines = 352
rows = np.random.choice(range(len(alpha_samples)), replace=True, size=10000)
alphas = alpha_samples[rows]
betas = beta_samples[rows]
X = np.linspace(30, 70, nr_of_lines)
lines = np.array([alphas + betas * X[i] for i in range(len(X))])

print('PYMC: intercept mean:', np.mean(alphas))
print('PYMC: slope mean:', np.mean(betas))

##### statsmodels #####
sm_x = sm.add_constant(df['weight'])
sm_y = df['height']
sm_model = sm.OLS(sm_y, sm_x)
fitted = sm_model.fit()
print(fitted.summary())
sm_intercept = fitted.params['const']
sm_slope = fitted.params['weight']

plt.figure(figsize=(18, 12))
sm.graphics.plot_fit(fitted, 1)
plt.savefig('statsmodels_fit.jpg', format='jpg')

# Posterior histograms for the two regression parameters
plt.figure()
plt.subplot(2, 1, 1)
plt.title(r'$\alpha$-posterior (intercept)')
plt.hist(alpha_samples, bins=50)
plt.subplot(2, 1, 2)
plt.title(r'$\beta$-posterior (slope)')
plt.hist(beta_samples, bins=50)
plt.tight_layout()

# Spaghetti plot: one near-transparent red line per posterior draw,
# with the Frequentist fit overlaid in black for comparison.
plt.figure(figsize=(18, 12))
plt.title('Linear Regression Posterior')
plt.plot(X, lines[:, 0], color='r', alpha=1, label='Bayesian Posterior Regression')
plt.plot(X, lines[:, 1:], color='r', alpha=0.005)
plt.scatter(df['weight'], df['height'], label='Data to be fitted')
plt.plot(X, sm_slope * X + sm_intercept, color='k',
         lw=3, label='Frequentist Regression')
plt.ylabel('height [cm]')
plt.xlabel('weight [kg]')
plt.ylim(130, 180)
plt.legend(loc='upper left')
plt.savefig('linreg_comparison.jpg', format='jpg')
plt.show()