Inspired by Rasmus Bååth’s lectures on Bayesian Inference, I’ve implemented a simple Python example demonstrating how Bayesian Inference can be used for A/B testing, that is, evidence-based testing. A/B testing is useful in most domains, e.g. development of pharmaceuticals, marketing campaigns, medical treatments, Search & Rescue strategies, legal court hearings, sales strategies, or any other domain where there is a need for evidence-based predictions about which potential strategy (A or B) will produce better results.
The scene for this demo is as follows (for simplicity, I’m using one of the examples from Rasmus’ lectures as the basis for the scenario): Assume that you are responsible for taking a decision on which of two different sales strategies your company will take. You have conducted preliminary market research, taking (hopefully!) random samples from your potential customer base, using strategy A for half of the random samples, and strategy B for the rest of the samples. These preliminary results state that for strategy A, 6 out of 16 respondents signed up for your offer (whatever you are trying to sell), while for strategy B, 8 out of 25 respondents signed up (I’m deliberately keeping the number of respondents and successes very low for this simple example, for reasons that will become clear a bit further down).
So, how should we go about deciding which of these two strategies is the better one, i.e. which of them will give us the most sign-ups in the long run? Or indeed, do these results even carry enough weight to base a decision on?
Well, one way to use the data from the poll is simply to figure out the rate of success for each of the two strategies: strategy A had 6 out of 16 successes, which translates to about 38%, while strategy B had 8 successes out of 25, which is about 32%.
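As a quick check of the arithmetic, the two point estimates can be reproduced in a few lines of Python. The variable names are only illustrative.

```python
# Observed results from the preliminary market research: (sign-ups, respondents)
results = {"A": (6, 16), "B": (8, 25)}

# Point estimate of the sign-up rate for each strategy
rates = {
    name: signups / respondents
    for name, (signups, respondents) in results.items()
}

for name, rate in rates.items():
    print(f"Strategy {name}: {rate:.1%}")

gap = (rates["A"] - rates["B"]) * 100
print(f"Difference: {gap:.1f} percentage points")
```

The unrounded rates are 37.5% and 32%, so the gap is 5.5 percentage points before rounding.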
So, clearly, if we just look at these two numbers, strategy A looks better by 6 percentage points. But with the quite limited amount of data (16 and 25 data points), would you be willing to bet your shop on these simple point estimates…? I wouldn’t.
So instead of relying on simple point estimates, let’s use Bayesian inference to produce not just a single number per strategy, but a probability distribution:
This first graph below shows the probability distributions for our two competing strategies, A and B (the “ABC” in the title of the graph tells us that it was produced by Approximate Bayesian Computation, a conceptually great but computationally very heavy implementation of Bayesian Inference). From the graph we can see that strategy A does indeed look better, with a most likely range of about 25-45%, while strategy B has a range of about 20-35%. So, were we forced to choose between these two strategies, strategy A would be the pick. Furthermore, after having done our Bayesian analysis (or more correctly: having let our computer do it), we can probably be a bit more confident about our decision than if we relied only on the two point estimates above.
Now, the above analysis was done without any previous opinion about how likely people are to sign up in response to our sales campaigns. It used something called an “uninformative prior”: we did not supply any prior data, statistics, or opinion on how likely it is for people to sign up for our various marketing and sales campaigns.
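To make the idea concrete, here is a minimal sketch of rejection-style ABC with an uninformative prior. It is not the code behind the graphs. The function name `abc_posterior`, the number of draws, and the choice of a flat prior on the interval 0 to 1 are my assumptions for illustration.

```python
import random
import statistics


def abc_posterior(successes, trials, n_draws=50_000, seed=1):
    """Keep prior draws whose simulated sign-up count equals the observed one."""
    rng = random.Random(seed)
    kept = []
    for _ in range(n_draws):
        # ASSUMPTION: uninformative prior = flat over all rates between 0 and 1
        rate = rng.random()
        # Simulate one survey with this candidate rate
        simulated = sum(rng.random() < rate for _ in range(trials))
        # Accept the candidate only if it reproduces the observed data exactly
        if simulated == successes:
            kept.append(rate)
    return kept


post_a = abc_posterior(successes=6, trials=16, seed=1)
post_b = abc_posterior(successes=8, trials=25, seed=2)

print(f"A: mean {statistics.mean(post_a):.3f} from {len(post_a)} accepted draws")
print(f"B: mean {statistics.mean(post_b):.3f} from {len(post_b)} accepted draws")
```

The accepted draws are samples from the posterior distribution of each sign-up rate, so a histogram of `post_a` and `post_b` gives the kind of picture shown in the graph. Note how few of the draws survive the exact-match filter, even with data this small.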
So, let’s instead say that we in fact know (or believe we know 🙂) that previous similar campaigns tend to result in a 20% sign-up rate. That is, in the past, with similar campaigns, 1 in 5 people addressed by our efforts did indeed sign up. How can we use this information in our analysis?
In Bayesian Inference, instead of using an uninformative prior (as we did above), we can give our inference engine an informative prior: in our case, the knowledge that in the past, 1 in 5 people have signed up.
Running the same model again, this time with an informative prior, the results look a bit different:
This time, with our knowledge about the success rate of previous campaigns, we can see that the two strategies are no longer so very different in terms of their results: the two probability distributions overlap almost fully, and the mean for both is almost identical, about a 22% sign-up rate. The other observation is that now, with the informative prior, i.e. our “a priori belief” based on the results of previous campaigns, the expected success rates for both strategies have been “pulled back”, from 38% and 32%, to 22%. Thus, by adding the informative prior to our model, our expectation has shrunk quite a bit, and furthermore, we no longer see any meaningful difference between our two strategies.
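The “pull” of an informative prior can be worked out by hand for this kind of model. With a Beta prior and binomial data, the posterior is again a Beta distribution, which is a standard closed-form shortcut rather than ABC. Which prior produced the graph is not stated here, so purely as an assumption the sketch below uses Beta(20, 80): its mean is 20%, and it weighs about as much as 100 earlier prospects.

```python
def beta_posterior(prior_a, prior_b, signups, respondents):
    """Conjugate update: Beta(a, b) prior + binomial data -> Beta(a + k, b + n - k)."""
    return prior_a + signups, prior_b + respondents - signups


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


# ASSUMPTION: illustrative prior with mean 20 / (20 + 80) = 0.20
PRIOR = (20, 80)

post_a = beta_posterior(*PRIOR, signups=6, respondents=16)   # Beta(26, 90)
post_b = beta_posterior(*PRIOR, signups=8, respondents=25)   # Beta(28, 97)

print(f"Prior mean:        {beta_mean(*PRIOR):.3f}")
print(f"Posterior mean, A: {beta_mean(*post_a):.3f}")
print(f"Posterior mean, B: {beta_mean(*post_b):.3f}")
```

Under this assumed prior, both posterior means come out at roughly 0.224, which is in line with the “about 22%” read off the graph. A weaker prior (say Beta(2, 8)) would pull the estimates back far less, so the strength of the prior matters as much as its mean.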
We can also illustrate the (in this case non-existent) difference between our two strategies graphically. As can be seen below, the distribution of the difference is clearly centered around zero, indicating that our two strategies are equivalent in terms of success rate, so neither of them is better than the other.
With this information, a reasonable decision would probably be to modify one or both of the strategies in the hope of finding a clear difference, and to set a threshold, e.g. 5 percentage points, that the difference must exceed before we commit to a specific strategy.
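A threshold rule like this is easy to express once we have posterior samples for both strategies. The sketch below is one possible way to do it, under stated assumptions. The posteriors are taken to be Beta(26, 90) for A and Beta(28, 97) for B, which is what a Beta(20, 80) prior would give by the standard conjugate update. The 95% requirement in `REQUIRED_PROB` is a hypothetical choice of mine.

```python
import random

rng = random.Random(42)
N_DRAWS = 20_000
THRESHOLD = 0.05        # the 5 percentage points from the text
REQUIRED_PROB = 0.95    # ASSUMPTION: how sure we want to be before committing

# ASSUMPTION: posteriors implied by an illustrative Beta(20, 80) prior
diffs = [
    rng.betavariate(26, 90) - rng.betavariate(28, 97)
    for _ in range(N_DRAWS)
]

mean_diff = sum(diffs) / N_DRAWS
p_a_better = sum(d > 0 for d in diffs) / N_DRAWS
p_a_clears = sum(d > THRESHOLD for d in diffs) / N_DRAWS
p_b_clears = sum(d < -THRESHOLD for d in diffs) / N_DRAWS

if p_a_clears >= REQUIRED_PROB:
    decision = "choose A"
elif p_b_clears >= REQUIRED_PROB:
    decision = "choose B"
else:
    decision = "no clear winner"

print(f"Mean difference (A - B): {mean_diff:+.3f}")
print(f"P(A better than B):      {p_a_better:.2f}")
print(f"P(A ahead by > 5 pts):   {p_a_clears:.2f}")
print(f"Decision: {decision}")
```

The list `diffs` is the distribution of the difference, so a histogram of it corresponds to the graph above. The decision rule simply asks how much of that distribution lies beyond the threshold on either side.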
To summarize this part: by using Bayesian Inference, we can make informed decisions based not just on point estimates, but on probability distributions. Furthermore, most often we do indeed have some previous (prior) information that we would like to use in our analysis, and Bayesian methods allow us to do so easily. Of course, it still takes careful analysis of the results, and good judgement, to form a decision based on any stochastic process. What I particularly like about Bayesian methods is that the resulting distributions show significantly more information than a single p-value or any other point estimate.
The “ABC” in the title of the graphs above indicates that the results produced thus far were obtained with a Python implementation of “Approximate Bayesian Computation”. ABC is easy to understand and easy to implement: the code that produced the above results is fewer than 100 lines, and there is nothing very complicated in it.
The problem with ABC is that for almost any real-world problem, it can literally take days or weeks of computation time to get the results. This is because of the combinatorial explosion of possibilities that the program must explore as the number of trials and observations grows, the so-called “garden of forking data”, neatly demonstrated by Richard McElreath in the video below:
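One way to see the cost concretely is to count how many candidate draws survive as the survey grows. The sketch below assumes a simple rejection scheme that keeps only exact matches and a flat prior. The function name `acceptance_rate` and the scaled-up survey of 60 sign-ups out of 160 are hypothetical, chosen to keep the same observed rate as strategy A.

```python
import random


def acceptance_rate(successes, trials, n_draws=20_000, seed=1):
    """Fraction of flat-prior draws whose simulated survey matches the data exactly."""
    rng = random.Random(seed)
    accepted = 0
    for _ in range(n_draws):
        rate = rng.random()  # ASSUMPTION: flat prior on the sign-up rate
        simulated = sum(rng.random() < rate for _ in range(trials))
        accepted += simulated == successes
    return accepted / n_draws


small = acceptance_rate(successes=6, trials=16)
large = acceptance_rate(successes=60, trials=160)  # hypothetical 10x survey

print(f" 16 respondents: {small:.4f} of draws accepted")
print(f"160 respondents: {large:.4f} of draws accepted")
```

With a flat prior, every possible sign-up count from 0 to n is equally likely before seeing the data, so roughly 1 draw in n + 1 is accepted. At the same time each draw needs n simulated respondents, so the work per accepted sample grows quickly, and it grows faster still once several quantities have to match at the same time.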
So, for real-world problems, ABC will most likely take way too long to execute. However, as luck would have it, smart people in many different domains have figured out really clever algorithms for performing the necessary computations in a fraction of the time needed by ABC. These algorithms are based on Markov Chain Monte Carlo (MCMC).
Not only does an MCMC-based implementation run many orders of magnitude faster than one based on ABC, but MCMC can also easily handle numbers of trials and observations that are orders of magnitude larger than ABC can cope with.
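To give a feel for how MCMC works, here is a minimal random-walk Metropolis sampler for the sign-up rate of strategy A. It is a teaching sketch under my own assumptions (flat prior, step size 0.1, 2,000 burn-in steps), and it is not the algorithm Stan uses, which relies on the more sophisticated Hamiltonian Monte Carlo.

```python
import math
import random


def log_posterior(rate, successes, trials):
    """Unnormalised log posterior: binomial likelihood with a flat prior on (0, 1)."""
    if not 0.0 < rate < 1.0:
        return -math.inf  # outside the allowed range: zero probability
    return successes * math.log(rate) + (trials - successes) * math.log(1.0 - rate)


def metropolis(successes, trials, n_steps=20_000, burn_in=2_000, step=0.1, seed=1):
    """Random-walk Metropolis: propose a nearby rate, accept it with prob min(1, ratio)."""
    rng = random.Random(seed)
    current = 0.5
    current_lp = log_posterior(current, successes, trials)
    samples = []
    for i in range(n_steps):
        proposal = current + rng.gauss(0.0, step)
        proposal_lp = log_posterior(proposal, successes, trials)
        accept_prob = math.exp(min(0.0, proposal_lp - current_lp))
        if rng.random() < accept_prob:
            current, current_lp = proposal, proposal_lp
        if i >= burn_in:  # discard the warm-up part of the chain
            samples.append(current)
    return samples


samples_a = metropolis(successes=6, trials=16)
mean_a = sum(samples_a) / len(samples_a)
print(f"Posterior mean for A: {mean_a:.3f}")
```

The key difference from ABC is that no survey is simulated and nothing has to match exactly. Each step only evaluates the posterior density at one candidate rate, so the cost per step stays the same whether there are 16 respondents or 16,000 (although the step size would need tuning for a much narrower posterior).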
The graph below was produced using PyStan, a Python interface to Stan, a probabilistic programming language that performs inference with MCMC:
With PyStan we get the same results as we got with ABC, at least in this case with fairly small numbers of trials and observations. However, if I were to increase the numbers, my ABC implementation would choke, not finding any solutions, while Stan and PyStan would happily compute the distributions sought.
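For readers who want to try this themselves, here is a sketch of what such a model might look like. It is not the model behind the graph above. The parameter names, the flat `beta(1, 1)` priors, and the helper `run_stan` are my assumptions. The Python part assumes PyStan 3, where the package is imported as `stan`; older PyStan 2 releases use a different API.

```python
STAN_PROGRAM = """
data {
  int<lower=0> n_a;               // respondents, strategy A
  int<lower=0, upper=n_a> k_a;    // sign-ups, strategy A
  int<lower=0> n_b;               // respondents, strategy B
  int<lower=0, upper=n_b> k_b;    // sign-ups, strategy B
}
parameters {
  real<lower=0, upper=1> rate_a;
  real<lower=0, upper=1> rate_b;
}
model {
  rate_a ~ beta(1, 1);            // ASSUMPTION: flat prior, swap in an informative one
  rate_b ~ beta(1, 1);
  k_a ~ binomial(n_a, rate_a);
  k_b ~ binomial(n_b, rate_b);
}
generated quantities {
  real diff = rate_a - rate_b;    // distribution of the difference
}
"""

DATA = {"n_a": 16, "k_a": 6, "n_b": 25, "k_b": 8}


def run_stan(num_chains=4, num_samples=1000):
    """Compile the model and sample from it (requires PyStan 3 to be installed)."""
    import stan  # imported here so the definitions above load without PyStan

    posterior = stan.build(STAN_PROGRAM, data=DATA, random_seed=1)
    fit = posterior.sample(num_chains=num_chains, num_samples=num_samples)
    # Posterior means of the two rates and of their difference
    return {name: float(fit[name].mean()) for name in ("rate_a", "rate_b", "diff")}
```

Calling `run_stan()` compiles the model on first use, which takes a while, and then returns the posterior means. Scaling up is just a matter of changing the counts in `DATA`.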