What is A/B Testing?
A/B testing is a popular method used by websites to drive up conversions, engagement, or revenue. The idea is pretty simple: show users two versions of a page on your website, one without the change and the other with the change that you want to test. Each variation is shown to a different set of users and its performance is measured. The winning variation is then rolled out to all traffic. Test, roll out, repeat, and you have a winning strategy for optimizing conversions on your website.
Even though the idea behind A/B testing is simple and there are now a lot of online tools that allow you to set up and run experiments on the fly, it is important to understand the meaning and limitations of the results before you make any decisions. In this post, I give an overview of the statistics behind A/B testing and what the associated jargon means.
While there are endless things you can test on your website, every A/B test should start out with a well-defined hypothesis. A simple example of a hypothesis: changing the color of a button to red will prompt more users to click, because red provides more visibility and you have seen similar results on other pages of your website. Every A/B test validates or invalidates such a hypothesis, and at the end of the test you want to know if it makes sense to go ahead with your changes. If you do not start out with a well-defined hypothesis and a target metric, chances are that you are going to look for some metric that is affected positively and base your decision on it, which, as I will explain in the next post, is a poor approach to take.
Statistics behind A/B Testing
Conversions as click-through probabilities
Most A/B tests are about getting users to click on something. The click could be on the button of a checkout page, a link to a blog post, or even the title of an up-selling email. In all of these, you want users to convert to the next stage of the journey, like making a payment, reading the blog, or opening the email. The probability of a user making a click is termed the click-through probability and can be calculated by simply taking the ratio of users who clicked to the total number of users who were exposed to the page. All of these actions follow what is known as the binomial distribution.
Any experiment that follows the binomial distribution can generally be characterized by the following properties:
- There are two mutually exclusive outcomes, often referred to as success and failure.
- Independent events. The outcome of one event does not have any effect on the outcome of another.
- Identical distribution. The probability of success remains the same for all events.
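As a rough illustration of these three properties, here is a minimal Python sketch that simulates users who each click independently with the same probability and then computes the click-through probability as clicks over exposed users. The function names, the 10% click probability, and the seed are assumptions made up for this example, not values from any real experiment.

```python
import random


def simulate_clicks(n_users, p_click, seed=0):
    """Count clicks among n_users, each clicking independently with probability p_click."""
    rng = random.Random(seed)  # seeded so the example is repeatable
    return sum(1 for _ in range(n_users) if rng.random() < p_click)


def click_through_probability(clicks, users):
    """Ratio of users who clicked to users who were exposed to the page."""
    return clicks / users


# Hypothetical page: 1000 exposed users, true click probability of 10%.
clicks = simulate_clicks(n_users=1000, p_click=0.10)
print(clicks, click_through_probability(clicks, 1000))
```

The estimated click-through probability will land near, but usually not exactly on, the true value of 0.10. That gap is the sampling variation discussed in the rest of the post.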
Now let’s look at some examples of different events and see if they can be modeled using the binomial distribution:
Users completing the checkout page on an e-commerce website: There are two mutually exclusive outcomes here: either the users complete the checkout page and move to the payments page, or they don’t. Also, it is safe to assume that one user completing the checkout page would not have any impact on another user completing it. Of course, there are exceptions. For example, family members may explore items independently but then complete the process using one account, in which case the outcomes are not independent. For most cases, though, it is safe to assume that completions of the checkout page are independent.
Items purchased on an e-commerce site: Using the same example as above, but now looking at the purchases of individual items. Once again, the outcomes are mutually exclusive: each individual item is either purchased or not. However, the events are not independent, since a user can add multiple items to the shopping cart and then buy them together, in which case the purchases are highly dependent and the binomial distribution would not be the right choice.
Confidence Intervals and Significance Levels
One easy way to think about conversions is in terms of coin tosses, since the underlying distribution for both is binomial. Now let’s run an A/B test in which we toss a coin a hundred times while wearing a red shirt and a hundred times while wearing a blue shirt, and see how many heads we get each time.
Our hypothesis here is that wearing a red shirt improves conversion.
From the results, we see that we get 20% more heads when wearing a red shirt. Should we always wear a red shirt to all coin tosses? What happens if we repeat the experiment?
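To get a feel for what repetition does, here is a small Python sketch of the shirt experiment. It assumes the coin is fair (probability of heads 0.5) no matter which shirt is worn, so any difference between the two counts is pure chance. The function names and seeds are illustrative only.

```python
import random


def count_heads(n_tosses, rng, p_heads=0.5):
    """Toss a coin n_tosses times and count the heads."""
    return sum(1 for _ in range(n_tosses) if rng.random() < p_heads)


def run_shirt_experiment(n_tosses=100, seed=None):
    """One run of the experiment: n_tosses per shirt, same fair coin for both."""
    rng = random.Random(seed)
    return {"red": count_heads(n_tosses, rng), "blue": count_heads(n_tosses, rng)}


# Repeat the whole experiment a few times and compare the head counts.
for run in range(5):
    print(run, run_shirt_experiment(seed=run))
```

Each run produces a different pair of counts, and the "winning" shirt can flip from run to run even though the shirt has no effect at all in this simulation.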
One of the most important aspects of A/B tests is that the results are reproducible. If you can’t replicate the success of an experiment, then there is no use in seeing an improvement in the sample. But how do we define success? The answer to all of the above lies in confidence intervals. To build basic intuition, always remember that all A/B tests are performed on samples of users, and different samples will have different means. Confidence intervals allow us to capture that variation and draw reasonable conclusions.
A/B test results are often quoted at a significance level. A low significance level of 5% or 1% means that a difference as large as the one we are seeing would be unlikely to occur just by chance if there were no real difference between the variations. Equivalently, we say that the difference is ‘statistically significant’ at a confidence level of 95% or 99% (the complements of 5% and 1%, respectively).
Confidence intervals are ranges around the mean that are determined using the standard error. In the case of the binomial distribution, the standard error is given by $SE = \sqrt{p(1-p)/N}$, where p is the probability of success and N is the sample size.
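As a quick sketch, the usual binomial standard error, the square root of p(1 - p)/N, can be computed in a couple of lines of Python. The function name is just an illustrative choice.

```python
import math


def standard_error(p, n):
    """Standard error of a proportion: sqrt(p * (1 - p) / n)."""
    return math.sqrt(p * (1 - p) / n)


# Fair coin (p = 0.5) tossed 100 times: the standard error is about 0.05.
print(round(standard_error(0.5, 100), 4))
```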
The 95% confidence intervals are typically constructed using the normal distribution. We can use the Central Limit Theorem to approximate the outcomes of a binomial distribution with the normal distribution, or bell curve. (A common rule of thumb used to justify this approximation is to verify that Np and N(1 - p) are both greater than 5.) Let’s look at the coin toss example again and see what the distribution looks like. It is just a normal distribution centered at 0.5 (a fair coin) with the standard error calculated above as 0.05. In a normal distribution, 95% of the values lie within about 1.96 standard deviations of the mean, which here is $0.5 \pm 1.96 \times 0.05$. Graphically, the distribution for multiple experiments of 100 coin tosses is a bell curve in which 95% of the observations lie between ~0.40 and ~0.60.
Looking back at our coin toss results, we see that getting 55 heads or 44 heads is reasonable and lies well within the 95% confidence interval so the results actually aren’t surprising at all!
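The check above can be sketched in Python as follows. The multiplier 1.96 is the standard normal value for a two-sided 95% interval; the function name and layout are assumptions for illustration.

```python
import math

Z_95 = 1.96  # two-sided 95% multiplier for the normal distribution


def confidence_interval(p, n, z=Z_95):
    """Normal-approximation interval p +/- z * sqrt(p * (1 - p) / n)."""
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se


# Fair coin, 100 tosses: the interval is roughly 0.40 to 0.60.
low, high = confidence_interval(0.5, 100)
print(round(low, 3), round(high, 3))

# Do 55 heads and 44 heads out of 100 fall inside that range?
for heads in (55, 44):
    proportion = heads / 100
    print(heads, low <= proportion <= high)
```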
In summary, a 95% confidence interval means that if we were to repeat the experiment multiple times, the intervals that we construct would cover the true population mean about 95% of the time. Conversely, around 5% of the time our confidence interval will not contain the true population mean.
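This "repeat the experiment many times" reading can be checked with a small simulation. The sketch below assumes a fair coin, 100 tosses per experiment, and an interval built from each experiment's own observed proportion using the normal approximation. The names, the number of repeats, and the seed are all illustrative.

```python
import math
import random


def interval_covers_true_p(true_p, n, rng, z=1.96):
    """Run one experiment and report whether its 95% interval contains true_p."""
    heads = sum(1 for _ in range(n) if rng.random() < true_p)
    p_hat = heads / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se <= true_p <= p_hat + z * se


def coverage(true_p=0.5, n=100, repeats=2000, seed=0):
    """Fraction of repeated experiments whose interval covers the true value."""
    rng = random.Random(seed)
    hits = sum(interval_covers_true_p(true_p, n, rng) for _ in range(repeats))
    return hits / repeats


print(coverage())
```

The printed fraction should come out close to 0.95. It will not be exactly 0.95, both because the simulation is finite and because the normal approximation to the binomial is itself only approximate at this sample size.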
Type I and Type II errors
While A/B testing gives you a solid launchpad for optimizing product features and conversions, at times you could see improvements where there actually aren’t any, or the experiment could show no effect despite an actual improvement. If we categorize the results from A/B tests, we can have four possible outcomes.
Two of these outcomes, the Type I and Type II errors (the two red boxes), are the cases where we draw the wrong inference from our experimental data, and these are the ones that we need to be careful about when designing the experiment. Graphically, they correspond to the regions where the distributions of the two variations overlap. As we discussed above, statistical significance deals with the Type I error, where we declare a ‘winner’ when there is none. Type II errors, on the other hand, hide the winning variations, and since running an experiment takes resources and time, you would want to take this into account when designing the test.
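One way to see a Type I error in action is to simulate an "A/A" test, where both variations share the same true click probability, and count how often a test still declares a significant difference. The sketch below uses a pooled two-proportion z-test, which is one common choice and an assumption here, since the post does not prescribe a specific test. The 10% click probability, sample sizes, and seed are made up for illustration.

```python
import math
import random
from statistics import NormalDist


def two_proportion_p_value(x_a, n_a, x_b, n_b):
    """Two-sided p-value of a pooled two-proportion z-test."""
    p_pool = (x_a + x_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (x_b / n_b - x_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rate(p=0.10, n=500, repeats=1000, alpha=0.05, seed=0):
    """Share of A/A tests (no real difference) flagged as significant."""
    rng = random.Random(seed)

    def clicks():
        return sum(1 for _ in range(n) if rng.random() < p)

    false_wins = sum(
        two_proportion_p_value(clicks(), n, clicks(), n) < alpha
        for _ in range(repeats)
    )
    return false_wins / repeats


print(false_positive_rate())
```

With a 5% significance level, roughly 5% of these no-difference tests should still come out "significant". Those are exactly the false winners that a Type I error describes.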
Getting quick and reliable results from A/B testing
So we want to get reliable results from our A/B tests while minimizing the Type I and Type II errors. Type II errors can be minimized if we minimize the overlap between the two distributions. Remember that the spread of each curve is determined by the standard error, which is inversely proportional to the square root of the sample size.
Having a large enough sample is important to minimize the Type II error. Consider the distribution for 1000 coin tosses instead of 100: increasing the sample size reduces the spread and hence the chance of a Type II error.
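To put rough numbers on the effect of sample size, the sketch below computes the standard error and the resulting 95% range for a fair coin at a few sample sizes. The helper name and the chosen sample sizes are illustrative.

```python
import math


def spread_for_sample_size(n, p=0.5, z=1.96):
    """Return (standard error, lower bound, upper bound) of the 95% range."""
    se = math.sqrt(p * (1 - p) / n)
    return se, p - z * se, p + z * se


for n in (100, 1000, 10000):
    se, low, high = spread_for_sample_size(n)
    print(n, round(se, 4), round(low, 3), round(high, 3))
```

Going from 100 to 10,000 tosses multiplies the sample size by 100 but shrinks the standard error only by a factor of 10, because the spread falls with the square root of the sample size.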
Larger differences are easier to detect, with a lower chance of a Type II error. Again thinking about the overlap between the two distributions, it is easy to see that the further apart the two curves are centered, the smaller the overlap and hence the lower the chance of making a Type II error.
So make bold changes to your website that would have a big impact, run the test for a sufficient amount of time, and you should be good to go. Type I errors can be reduced by using a stricter significance level (for example, 1% instead of 5%, which corresponds to a higher confidence level), though that also increases the required sample size. Hence, there is always a trade-off between the accuracy of your results and the time that you want to invest. If you are making big changes with significant impact on the business, it is always a good idea to have enough users for your test while requiring a high confidence level.
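The trade-off can be made concrete with a standard textbook approximation for the number of users needed per variation when comparing two proportions. This is a sketch under assumptions: the formula is the usual normal-approximation one, and the baseline rate, the lifts, and the 80% power (the chance of detecting a real difference, i.e. one minus the Type II error rate) are illustrative values that do not come from this post.

```python
import math
from statistics import NormalDist


def sample_size_per_variation(p_base, p_new, alpha=0.05, power=0.80):
    """Approximate users needed per variation to detect p_base vs p_new."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_beta = NormalDist().inv_cdf(power)           # guards against Type II errors
    variance = p_base * (1 - p_base) + p_new * (1 - p_new)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p_new - p_base) ** 2)


# Hypothetical 10% baseline conversion rate.
print(sample_size_per_variation(0.10, 0.12))              # small lift
print(sample_size_per_variation(0.10, 0.15))              # bold change
print(sample_size_per_variation(0.10, 0.12, alpha=0.01))  # stricter threshold
```

A bolder change needs far fewer users to detect, while a stricter threshold for declaring a winner needs more, which is the accuracy-versus-time trade-off in numbers.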
I hope this post was helpful in providing an overview of what’s happening when you are running your A/B tests. Let me know in the comments about any interesting A/B tests that you have run. Don’t forget to read ‘How (not) to do A/B testing’ in the next post to avoid some common mistakes that most people make. Happy testing!