Assume we are doing evidence-based testing of two competing strategies, “A” and “B”, and we want to evaluate them over several days, weeks, months, or whatever timespan we have decided upon. That is, we do our analysis by gathering multiple data points from our test population over a period of time, one data point per unit of time. For instance, we might record each day how many calls or site visits we get, or we might track opinion poll results. We also keep track of how many of those calls/visits/polls result in a successful business outcome, whatever that may be: a closed sale, a web visit converted into a signup, a vote for a specific party, and so on.

Below is a trivial example of such a record of data points for our two strategies:

data_A = [(100, 55), (50, 30), (75, 25), (132, 30), (40, 10), (70, 10), (130, 30)]

data_B = [(100, 25), (32, 8), (87, 12), (10, 2), (57, 20), (88, 32), (140, 88)]

The first number in each tuple is the number of calls/visits etc, the second is the number of successful outcomes. Looking at the first data point for strategy A, we see that 55 of 100 “visits” resulted in a successful outcome, giving a success rate of 55% for that day (or whatever unit of time we are working with).

Let’s say the 7 data points above, for our two strategies, represent test data gathered during a week.

Now, given that data: can we make an informed decision about which of our two strategies looks most promising?

One thing we might try is to compare the percentages of successes:
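As a minimal sketch of that comparison, the snippet below (using the `data_A` / `data_B` from above; the helper name `daily_rates` is just illustrative) computes each day's success ratio and the plain average over the week:

```python
data_A = [(100, 55), (50, 30), (75, 25), (132, 30), (40, 10), (70, 10), (130, 30)]
data_B = [(100, 25), (32, 8), (87, 12), (10, 2), (57, 20), (88, 32), (140, 88)]

def daily_rates(data):
    """Success ratio for each day: successes / trials."""
    return [successes / trials for trials, successes in data]

rates_A = daily_rates(data_A)
rates_B = daily_rates(data_B)

# Unweighted mean of the daily percentages: every day counts equally
mean_A = sum(rates_A) / len(rates_A)
mean_B = sum(rates_B) / len(rates_B)

print(f"mean daily rate A: {mean_A:.3f}")  # roughly 0.33
print(f"mean daily rate B: {mean_B:.3f}")  # roughly 0.31
```

Note that each day contributes equally to this mean, whether it contains 10 trials or 140.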

In the topmost graph, we have the daily success ratios for both strategies. Looking at the two solid red and green lines, it looks like strategy B is a clear winner, right…? Not necessarily.

Still looking at the top graph, the dashed lines represent the mean, or average, of the success rates over the entire week. Looking at those lines, we see that the red strategy, strategy A, is performing a bit better overall.

The reason the trendlines and the mean lines give us contradictory results is that strategy A started at a high success rate and then fell, while strategy B started low and then caught up.

So, which strategy is better? Frankly, I would not be able to tell from the data presented thus far. Since we are interested not primarily in the (possibly temporary) trend, but in the overall outcome of a full week of testing, we should perhaps lean more towards the mean values, and in that case strategy A looks like the winner, despite the negative trend.

Perhaps Bayesian inference can bring some clarity to the decision. The bottom graph shows the results after Bayesian updating on the same data. This time, strategy B is the winner. Why the difference between the two approaches?

The difference is that in our first method, using the means of the daily success rates, very little information is carried forward from one day's measurements to the next: the data from each individual day stands pretty much isolated from what happened before, and is not impacted much by the previous measurements. Using Bayesian updating, on the other hand, the information gained from each previous set of data, i.e. each day of measurement, is automatically “merged” into subsequent data. In other words, using Bayes, we use the information gained in previous steps to “nudge” any new data in the direction indicated by the “old” data.

Another way to express this is that averaging the daily percentages removes available information: every day gets equal weight regardless of how many trials it contained. The Bayesian approach preserves previously gained information, and takes it into account when processing new information.
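The updating step can be sketched with the standard Beta-Binomial conjugate model (an assumption on my part; the exact model behind the graphs isn't shown). Starting from a uniform Beta(1, 1) prior, each day's successes are added to the alpha parameter and its failures to the beta parameter, so every day's evidence stays in the posterior:

```python
data_A = [(100, 55), (50, 30), (75, 25), (132, 30), (40, 10), (70, 10), (130, 30)]
data_B = [(100, 25), (32, 8), (87, 12), (10, 2), (57, 20), (88, 32), (140, 88)]

def beta_update(data, alpha=1, beta=1):
    """Fold each day's (trials, successes) into a Beta posterior, one day at a time.
    Yesterday's posterior becomes today's prior, so earlier days keep influencing
    the result."""
    for trials, successes in data:
        alpha += successes
        beta += trials - successes
    return alpha, beta

a_A, b_A = beta_update(data_A)  # Beta(191, 408): 190 successes in 597 trials
a_B, b_B = beta_update(data_B)  # Beta(188, 328): 187 successes in 514 trials

# The posterior mean of Beta(alpha, beta) is alpha / (alpha + beta)
print(f"posterior mean A: {a_A / (a_A + b_A):.3f}")
print(f"posterior mean B: {a_B / (a_B + b_B):.3f}")
```

Because the posterior accumulates raw trial counts rather than daily percentages, days with many trials weigh in more heavily, and under this model B's posterior mean comes out above A's, even though A won on the unweighted daily average.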

Below are the resulting distributions obtained by Bayesian inference:

Here we can see that strategy B is clearly more likely to perform better than A.
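One way to quantify “more likely to perform better” is to draw random samples from the two posterior distributions and count how often B's sampled rate beats A's. The sketch below assumes the Beta(191, 408) and Beta(188, 328) posteriors that result from folding all seven days of data into uniform Beta(1, 1) priors:

```python
import random

# Posterior parameters (assumed Beta-Binomial model, uniform priors):
alpha_A, beta_A = 191, 408   # strategy A: 190 successes in 597 trials
alpha_B, beta_B = 188, 328   # strategy B: 187 successes in 514 trials

random.seed(42)
samples = 100_000
# Count the draws where B's sampled success rate exceeds A's
wins_B = sum(
    random.betavariate(alpha_B, beta_B) > random.betavariate(alpha_A, beta_A)
    for _ in range(samples)
)
p_b_beats_a = wins_B / samples
print(f"P(B beats A) ≈ {p_b_beats_a:.3f}")
```

With these parameters the estimated probability that B outperforms A lands above 90%, which matches the visual impression of the two distributions.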