This is part 2 of my exploration of PyMC, and I will continue using the example from part 1, where I concluded that the data of my experiment, a list of daily visitor counts to one of my blogs, appears to exhibit ‘wild randomness’ and thus is not governed by the Normal distribution. Normally distributed web visit data would look like this:
So, let’s look at a probability distribution that does exhibit wild randomness, the Cauchy distribution. Before doing so, let’s first explain the characteristics of the Normal (or Gaussian) distribution, which was under examination in the previous post.
Assume a skilled archer, standing with his bow some distance from a marked target. Let this archer fire a number of shots. Measure the error, i.e. the distance from the bullseye, for each shot, and plot the frequencies of those errors. With enough shots fired, the frequency histogram will trace out the well-known ‘Bell curve’, i.e. the errors follow a Normal distribution. Depending on the accuracy and precision of the archer, the bell curve can take different shapes: its spread reflects his precision (poor precision gives a wider curve), and its offset from the bullseye reflects his accuracy, i.e. how close to the bullseye his shots land on average, as shown in the graph below:
Now, instead of the skilled archer, let’s imagine a less skilled archer who is furthermore blindfolded and has no idea in which direction the target lies. We let this archer fire many, many shots, measure the distances from the target, and plot their frequencies as above. What would that distribution look like?
It turns out that the distribution looks deceptively similar to the Gaussian or Normal distribution:
Despite the similarity of the plots, there’s a world of difference in what these two distributions generate, and that difference comes down to the tail ends of the respective curves: the Normal distribution’s tails drop to practically 0 within a few standard deviations, while the Cauchy’s tails decay far more slowly and remain visibly above 0. This is the “Long Tail” or “Fat Tail” characteristic of wild randomness.
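To put rough numbers on that tail difference, here is a minimal sketch using only the Python standard library (not PyMC). It compares the probability of landing more than k units from the centre for a standard Normal (mean 0, standard deviation 1) and a standard Cauchy (location 0, scale 1). The choice of these standard parameters and the function names are my own assumptions for illustration; the two scales are not directly comparable, so treat the output as an order-of-magnitude picture.

```python
import math

def normal_tail(k):
    """P(|X| > k) for a standard Normal (mean 0, standard deviation 1)."""
    return math.erfc(k / math.sqrt(2))

def cauchy_tail(k):
    """P(|X| > k) for a standard Cauchy (location 0, scale 1)."""
    return 1 - (2 / math.pi) * math.atan(k)

# Print both two-sided tail probabilities side by side
for k in (1, 2, 5, 10):
    print(f"k={k:>2}  Normal: {normal_tail(k):.2e}  Cauchy: {cauchy_tail(k):.2e}")
```

Beyond 10 units the Normal tail probability is on the order of 1e-23, while the Cauchy still puts about 6% of its mass out there. That leftover mass is where the outliers come from.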
The graph below shows sample values drawn from three different probability distributions: the Normal (yellow), the Cauchy (red), and the Poisson (blue), which I will return to in a later post:
As you can see, both the Poisson and Normal distributions behave quite mildly, while the Cauchy distribution exhibits the characteristic outliers of wild randomness. The more samples drawn, the clearer this becomes: while the Cauchy series above had only 100 datapoints, the one below has 1000, and now the magnitude of the largest outliers forces the y-scale to stretch to accommodate the very large numbers, leaving the non-outlier values, as well as the Normal and Poisson samples, barely visible.
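The same effect can be reproduced without any plotting. The sketch below draws samples with the standard library only (not PyMC) and reports the largest magnitude seen in each series. The parameters (centre 20, spread 5, Poisson rate 20) and the seed are assumptions picked for illustration, not the values behind the graphs.

```python
import math
import random

def cauchy_sample(rng, loc=0.0, scale=1.0):
    # Inverse-CDF method: tangent of a uniform angle in (-pi/2, pi/2)
    return loc + scale * math.tan(math.pi * (rng.random() - 0.5))

def poisson_sample(rng, lam):
    # Knuth's multiplication method; adequate for small rates like this one
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def largest_magnitudes(n, seed=0):
    """Largest absolute value among n draws from each distribution."""
    rng = random.Random(seed)
    series = {
        "normal": [rng.gauss(20, 5) for _ in range(n)],
        "poisson": [poisson_sample(rng, 20) for _ in range(n)],
        "cauchy": [cauchy_sample(rng, 20, 5) for _ in range(n)],
    }
    return {name: max(abs(x) for x in xs) for name, xs in series.items()}

for n in (100, 1000):
    print(n, largest_magnitudes(n))
```

With 1000 draws, the Cauchy maximum will almost certainly dwarf the other two, which is exactly what squashes the Normal and Poisson series against the axis in a shared plot.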
Data that behaves according to the Cauchy distribution is indeed wild in character. Benoit Mandelbrot, whom I referred to in the previous post, has written an excellent book, The (Mis)Behavior of Markets, where he argues that a significant cause of market bubbles and busts is that most financial analysis assumes a Normal distribution in its models, while the Cauchy distribution describes real-world market behavior much better.
Returning to my own dataset of blog visits, the real, collected data looked like this:
Using PyMC to generate a random Cauchy time series, I got this graph:
There seems to be some similarity: both graphs display a handful of outliers, and both have the median and mode just below 20. Still, there is a difference in that the real, observed data seems to exhibit more clustering, that is, around most of the outliers there are entries that ‘trend’ towards the outlier. That pattern is not present in the Cauchy series. Thus, my blog data does not seem to be wild enough to be truly Cauchy.
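One rough way to turn that visual impression into a number is to compare the values sitting right next to each outlier with the ordinary values overall. The sketch below is my own illustrative heuristic, not an established test: the function names, the outlier threshold, and the parameters of the simulated series (centre 18, scale 3) are all assumptions. For independent draws the ratio should hover around 1; a ratio well above 1 would suggest that neighbours ‘trend’ towards the outliers.

```python
import math
import random
from statistics import median

def neighbours_of_outliers(series, threshold):
    """Non-outlier values directly before or after each outlier."""
    found = []
    for i, x in enumerate(series):
        if x > threshold:
            for j in (i - 1, i + 1):
                if 0 <= j < len(series) and series[j] <= threshold:
                    found.append(series[j])
    return found

def clustering_ratio(series, threshold):
    """Median of outlier neighbours over the median of all non-outliers."""
    neighbours = neighbours_of_outliers(series, threshold)
    ordinary = [x for x in series if x <= threshold]
    if not neighbours or not ordinary:
        return None  # no outliers, or nothing to compare against
    return median(neighbours) / median(ordinary)

# Baseline: independent Cauchy-like 'visitor counts' (hypothetical parameters)
rng = random.Random(1)
simulated = [abs(18 + 3 * math.tan(math.pi * (rng.random() - 0.5)))
             for _ in range(365)]
print(clustering_ratio(simulated, threshold=60))
```

Running the same function on the observed visitor counts, with the same threshold, would give a figure to compare against the simulated baseline. Note that a value lying between two outliers is counted twice, which is acceptable for a rough check like this.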
I’ll have to look further…