## Bayesian analysis of web traffic reveals Black Swans

Continuing my exploration of PyMC, the Python library for Bayesian statistics. From the Introduction section of the PyMC documentation:

“PyMC is a python module that implements Bayesian statistical models and fitting algorithms, including Markov chain Monte Carlo”

I gathered visitor statistics from one of my blogs to use as experimental data. More specifically, I collected the daily visitor count from Jan 1, 2012 until Dec 16, 2013. That turns out to be 716 days.

Below is a timeline of the number of visits each day:

For the moment, don’t worry about the red horizontal line; I’ll explain it later. The graph shows the number of unique visitors (in some images I’ve mistakenly labeled the graphs “pagecount”) over the period of almost two years, 716 days.
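For reference, a timeline like the one above can be produced with matplotlib. This is a minimal sketch; since the raw counts aren’t reproduced in the post, it uses a synthetic Poisson series as a stand-in for the real 716-day data, and the output file name is arbitrary:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")          # render off-screen, no display needed
import matplotlib.pyplot as plt

# Stand-in for the real 716 daily unique-visitor counts (not in the post).
rng = np.random.default_rng(0)
visits = rng.poisson(lam=18.3, size=716)

plt.figure(figsize=(10, 3))
plt.plot(np.arange(len(visits)), visits, lw=0.8)
plt.xlabel("day (Jan 1, 2012 to Dec 16, 2013)")
plt.ylabel("unique visitors")
plt.tight_layout()
plt.savefig("timeline.png")    # arbitrary output file name
```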

Now, one interesting question, at least if you happen to be a nerd, is: can we find some probability distribution that fits this data? That is, is there a ‘pattern’ in the data that would allow us, given the correct statistical distribution and the correct parameters, to predict the future visitor count for this website?

Or is the visitor data totally random, with no pattern in it? In that case, prediction of future values will be very risky, if not impossible (despite the fact that many of today’s business decisions are made this way, looking for patterns in data that have none…).

Intuitively, the presence of a handful of ‘spikes’, that is, the 5 or 6 days on which the visitor count vastly exceeded the mean, suggests that if there is indeed a pattern behind the numbers, it is most likely not a normal distribution.

So, let’s test that hypothesis. The mean of the 716 data points happens to be 18.3, and the standard deviation is 13.0. The maximum value is 136, the minimum 1.
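These summary statistics are a one-liner with NumPy. A minimal sketch, using a synthetic Poisson series as a placeholder since the raw counts aren’t included here; note the placeholder is far tamer than the real data, so its numbers won’t match the ones quoted above:

```python
import numpy as np

# Placeholder for the real 716-day series; the raw counts are not in the post.
# A Poisson draw is far tamer than the real data: its stdev will land near
# sqrt(mean) ~ 4.3, nowhere near the observed 13.0.
rng = np.random.default_rng(0)
visits = rng.poisson(lam=18.3, size=716)

print(f"n={len(visits)}  mean={visits.mean():.1f}  std={visits.std(ddof=1):.1f}")
print(f"min={visits.min()}  max={visits.max()}")
```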

Plugging the mean and standard deviation into Python’s scientific, mathematical and plotting libraries, we can produce the graph below, showing what a normal distribution with these parameters would look like:
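A curve like this can be drawn with scipy.stats.norm and matplotlib. A sketch, using the sample mean and stdev quoted above; the output file name is arbitrary:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")          # render off-screen, no display needed
import matplotlib.pyplot as plt
from scipy.stats import norm

mu, sigma = 18.3, 13.0         # sample mean and stdev of the 716 daily counts
x = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 500)

plt.plot(x, norm.pdf(x, mu, sigma))
plt.axvline(mu, color="gold", label="mean")
for k in (1, 2, 3):            # stdev markers on both sides of the mean
    plt.axvline(mu + k * sigma, color="red", lw=0.8)
    plt.axvline(mu - k * sigma, color="red", lw=0.8)
plt.xlabel("daily visitors")
plt.ylabel("density")
plt.legend()
plt.savefig("normal_fit.png")  # arbitrary output file name
```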

The yellow vertical bar marks the mean, and each red vertical bar marks a standard deviation from the mean. As can be seen, the normal distribution is very unlikely to be the distribution governing our data: already at 60 visits per day we are more than 3 stdevs above the mean, meaning that a daily visitor count greater than 60 should occur only about 0.1% of the time, so in my dataset of some 700 numbers it might occur once.

Examining the data, it turns out that I have 15 data points exceeding 60. Of those 15, 7 exceed 4 stdevs, 6 exceed 5 stdevs, 2 exceed 6 stdevs, and 1 exceeds 9 stdevs.
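Tallies like these take one line each with NumPy. A sketch of the counting step, again with a synthetic Poisson placeholder for the real series; on such a tame stand-in the tallies come out at or near zero, which is precisely the contrast with the real data:

```python
import numpy as np

mu, sigma = 18.3, 13.0                   # sample statistics reported above
rng = np.random.default_rng(1)
visits = rng.poisson(lam=mu, size=716)   # tame stand-in for the real counts

for k in (3, 4, 5, 6, 9):
    n = int((visits > mu + k * sigma).sum())
    print(f"days more than {k} stdev above the mean: {n}")
```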

To put these probabilities in perspective: under a normal distribution, a 3-stdev event should occur about once in every 370 events, a 4-stdev event once in every 16,000 events, a 5-stdev event once in 1.7 million, and a 6-stdev event about once in 500,000,000 days, which corresponds to once in about 1.4 million years, pretty much the entire history of humankind.
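These “once in N events” figures can be checked with scipy.stats.norm.sf, the upper-tail survival function, doubled to count a deviation on either side of the mean:

```python
from scipy.stats import norm

# Expected rarity of a k-stdev event under a normal model.  norm.sf(k) is the
# upper-tail probability P(Z > k); doubling it counts deviations on both sides.
for k in (3, 4, 5, 6):
    one_in = 1.0 / (2.0 * norm.sf(k))
    print(f"{k} stdev: about one event in {one_in:,.0f}")
```

The 4-stdev figure comes out near 15,800, which rounds to the 16,000 quoted above; the 3, 5 and 6 stdev figures match 370, 1.7 million and 500 million.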

As you can see, the probability of events beyond a couple of stdevs diminishes very quickly, and an event exceeding 9 stdevs, of which my data has one, is beyond any imagination.

Thus, we can safely conclude that my blog statistics are not governed by the normal distribution; there are way too many “Black Swan” type events in the data. That is, my data seem to exhibit what my former colleague Benoit Mandelbrot, together with Nassim Nicholas Taleb, referred to as “wild randomness”, while data fitting under the normal distribution exhibit “mild randomness”.

In subsequent posts, I’ll see if I can find some other probability distribution that better fits my data. For that, I’ll turn to Bayes, probabilistic programming and PyMC for help.