## Bayesian A/B-testing, part II

Continuing my example, let's examine how the different assumptions – yes, in any model there are always assumptions, explicit or implicit – impact the end result, that is, the prediction of the sought-after signup rate, a.k.a. our posterior probability distribution.

We have several assumptions within our model. First of all, I’ve chosen the Beta distribution as the prior distribution, and already that choice in itself is an assumption. For the generator function, I’ve chosen the binomial distribution, yet another assumption. But for this problem both of these assumptions make sense, so let’s stick with them.

Further assumptions arise from the parameters of the functions: for the Beta prior, we need to specify alpha and beta, which define the “shape” of the distribution. In our example, we decided that the prior should be centered around 20%, so let’s start by setting alpha to 2 and beta to 8, which results in a Beta distribution with mean 0.2.

Let’s also start with the same values for number of trials (n) and number of successes (k) as in the previous post: our preliminary market research found that for strategy A, 6 out of 16 respondents signed up, while for strategy B, 8 out of 25 respondents signed up.

Let’s first look at the prior distribution and what it looks like with alpha = 2 and beta = 8:

This prior, which represents our existing knowledge or belief about the likelihood of people signing up on our various offers, is centered around 0.2, that is, a mean signup rate of 20%. But as can be seen from the graph, the distribution is pretty wide, with the most likely probabilities ranging from ~15% to ~30%. That is, there’s a fair amount of uncertainty in this prior. That in itself is neither good nor bad; however, if there indeed is much uncertainty in our prior knowledge, then we would want the prior distribution to reflect that fact by being wide. On the other hand, if we have a prior belief we know for certain to be very narrow, then the above prior does not reflect that belief/knowledge.
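As a quick sanity check on the prior’s shape – this is plain Python, not the Stan model used for the actual inference below – the mean and standard deviation of a Beta(alpha, beta) distribution have closed forms:

```python
import math

def beta_summary(a, b):
    """Mean and standard deviation of a Beta(a, b) distribution."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, math.sqrt(var)

mean, sd = beta_summary(2, 8)
print(f"prior mean = {mean:.2f}, sd = {sd:.2f}")  # mean 0.20, sd ~0.12
```

A standard deviation of roughly 0.12 around a mean of 0.20 is exactly the “pretty wide” spread visible in the graph.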

But let’s stick with this fairly wide prior for now, and see what it generates:

The first run of my inference engine used the prior discussed above, and the data input to the binomial generator was (16, 6) for strategy A and (25, 8) for strategy B. After chewing a very short time on these inputs, my Python/PyStan/Stan program came up with the above posterior belief on the signup rate:

For strategy A, the mean probability for signups was found to be about 30%, while for strategy B it was about 28%. That’s a narrow difference of only 2 percentage points, and probably not differentiated enough for us to make a decision on, particularly considering that the amount of data is quite limited: 16 and 25 data points, respectively, for each strategy.
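These numbers can be verified without running the sampler: with a Beta prior and a binomial likelihood, the posterior is again a Beta distribution (conjugacy), namely Beta(alpha + k, beta + n − k), so the posterior mean is available in closed form. A minimal sketch:

```python
def posterior_mean(alpha, beta, n, k):
    """Beta-binomial conjugate update: posterior is Beta(alpha+k, beta+n-k)."""
    return (alpha + k) / (alpha + beta + n)

print(posterior_mean(2, 8, 16, 6))   # ~0.31 for strategy A
print(posterior_mean(2, 8, 25, 8))   # ~0.29 for strategy B
```

The closed-form means, about 31% and 29%, agree with what Stan’s MCMC sampling reports.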

So, after analyzing this result, we decide that we need more data. Thus, we conduct another, larger market poll, where for strategy A we get 160 responses, with 60 of them signing up, and for strategy B we get 250 responses, with 80 of them signing up. Let’s put these new numbers into the program, while keeping the existing prior:

Now, with 10 times as much data to fit our model to, the two alternatives start to move apart from each other: strategy A gets almost 37% signups, while strategy B gets about 32%, a difference of about 5 percentage points.
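The same conjugate shortcut (again a plain-Python check, not the Stan model itself) shows both the new means and how much the posteriors have tightened compared to the first, small poll:

```python
import math

def posterior_summary(alpha, beta, n, k):
    """Closed-form Beta posterior via conjugacy: Beta(alpha+k, beta+n-k)."""
    a, b = alpha + k, beta + n - k
    mean = a / (a + b)
    sd = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    return mean, sd

for label, n, k in [("A", 160, 60), ("B", 250, 80)]:
    mean, sd = posterior_summary(2, 8, n, k)
    print(f"strategy {label}: mean {mean:.3f}, sd {sd:.3f}")
```

The posterior standard deviations shrink to a few percentage points, which is why the two distributions now visibly separate in the graph.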

Does this poll size give us enough confidence to make a decision to go for strategy A? Perhaps. First of all, it depends on whether a 5-percentage-point difference makes a significant difference to your business – that’s up to you to decide. Secondly, how can we be sure that we now have enough data…? Well, one way to answer the second question is to run the program again with even larger n and k and see what happens. So let’s multiply the numbers of trials and successes once again by a factor of 10:

Now, for strategy A, we have 1600 trials, out of which 600 sign up, and for B, we have 2500 trials, with 800 signups. As can be seen from the graph, after increasing our data by a factor of 10, we get almost the same result as before. Thus, perhaps we are now justified in concluding that the results from our previous poll, with 160 vs 250 trials, are good enough to base our decision upon.
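Another way to quantify the decision – a Monte Carlo sketch using Python’s standard library rather than the Stan model – is to draw from both Beta posteriors and estimate the probability that strategy A’s true signup rate exceeds strategy B’s:

```python
import random

def prob_a_beats_b(alpha, beta, nA, kA, nB, kB, draws=100_000, seed=42):
    """Monte Carlo estimate of P(p_A > p_B) from the two Beta posteriors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        pA = rng.betavariate(alpha + kA, beta + nA - kA)
        pB = rng.betavariate(alpha + kB, beta + nB - kB)
        wins += pA > pB
    return wins / draws

print(prob_a_beats_b(2, 8, 1600, 600, 2500, 800))  # close to 1
```

With this much data, the estimated probability that A beats B is very close to 1, which supports reading the two polls as having converged.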

Of course, there’s always a risk that should we indeed decide to expand our preliminary market research by yet another factor-of-10 larger poll, we might get different answers – perhaps our data collection has been biased, in which case a larger data set could smooth out the outliers better than a smaller one. On the other hand, time (and resources spent) are money, and if we are looking for the “perfect” answer, we will never cross the finishing line; after all, “Prediction is difficult, particularly about the future”. So we need to make as informed a decision as we can, with the data we can collect with reasonable time and effort. And with Bayesian methods, we get more than just the point estimates of traditional, frequentist methods, so having found that there’s not really much point in conducting yet another, larger and more expensive market research activity, I myself would be willing to commit to a decision based on the findings of this analysis.