A/B testing is a randomized experiment in which two variants, “A” and “B,” are compared to determine which is more effective. It is a popular tool in web analytics, yet many people who use it frequently do not understand it well.
A/B testing should be firmly rooted in statistical hypothesis testing, but in practice it often is not. Beyond hypothesis testing, there are additional concerns in designing, executing, and interpreting A/B tests. This article covers the basics and the things worth knowing before you run one.
1. Don’t come to conclusions based on small sample sizes
This seems obvious to anyone with even a little statistical training, but it is important enough to put first. Choosing a sample size for an A/B test requires care; it is not as straightforward as most people think, and it is only one piece of a puzzle involving statistical confidence, which dictates both the number of samples and the time the experiment must run. A properly designed experiment determines the number of samples and conversions required for the desired statistical confidence up front, and then lets the experiment play out fully rather than stopping early.
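To make the sample-size point concrete, here is a minimal sketch of the standard two-proportion z-test approximation, using only Python's standard library. The baseline conversion rate and minimum detectable effect below are illustrative assumptions, not numbers from this article.

```python
from statistics import NormalDist
from math import ceil

def samples_per_variant(p_base, mde, alpha=0.05, power=0.8):
    """Approximate visitors needed per variant to detect an absolute
    lift of `mde` over a baseline conversion rate `p_base`,
    via the common two-proportion z-test approximation."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_beta = z.inv_cdf(power)           # desired statistical power
    p_var = p_base + mde
    variance = p_base * (1 - p_base) + p_var * (1 - p_var)
    return ceil(((z_alpha + z_beta) ** 2) * variance / mde ** 2)

# Illustrative numbers: 5% baseline conversion, hoping to detect +1 point.
print(samples_per_variant(0.05, 0.01))  # several thousand visitors per variant
```

Note how quickly the required sample size grows as the effect you want to detect shrinks; this is why stopping an experiment early, before the planned sample size is reached, undermines the result.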
2. Don’t ignore the psychology of A/B testing
Let’s say we are running an email A/B test: two groups receive the same email content under different subject lines. The subject lines are the variants, since they are visible to the user before the email is opened. The metric might be open rate or click rate, depending on the goal of the campaign; usually the goal is to get the user to follow through on some call to action (CTA), which makes click rate the better metric.
Now think about how the already-visible subject line can lead to different click rates even after the email is opened. It’s all about psychology. Suppose we have set expectations with two subject lines, and only one of them is realistically specific. The vague subject line has not prepared the reader for what is inside, so there is a higher chance of disappointment and clicks will suffer. The specific subject line, on the other hand, has set an expectation that the email actually fulfills, and readers who open it are much more likely to click through.
3. Beware the local minimum; A/B testing is not suited to everything
A/B testing is not for everything; it is suited only to certain kinds of changes. Changing a landing page is probably a good A/B testing candidate, and changing the position of a button on a page or form may be a good test as well. Redesigning an entire website, however, may or may not be a good A/B test, depending on how the experiment is approached.
Generally, gradual change is well suited to A/B testing. However, gradual changes may not, by themselves, make valid variant candidates for accomplishing what you want. Conceptualize your product as a mathematical function: a local minimum is analogous to a design that your incremental changes have settled into.
Your product can become firmly rooted in such a design local minimum when you only ever tweak the existing product; the incremental improvements may look attractive at first, but the global minimum might be reachable only through a more comprehensive redesign. The key point is that jumping head first into A/B testing is a bad idea. First, define your goals; once you are clear that A/B testing will help you reach them, decide on your experiments; then design them and implement the A/B tests.
4. It’s all about the buckets
First, think about how we can best ensure comparability between buckets prior to bucket assignment, without knowing anything about the distribution of attributes in the population.
The answer is simple: random selection and random bucket assignment. Statistically, the best approach is to select users and assign buckets at random, without regard to any attribute of the larger population.
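One common way to implement attribute-blind assignment is to hash a user identifier, which gives a split that is effectively random yet stable across visits. This is a minimal sketch; the function name, experiment label, and user IDs are hypothetical.

```python
import hashlib

def assign_bucket(user_id, experiment="exp-001", treatment_share=0.5):
    """Deterministically map a user to 'treatment' or 'control'.
    Hashing (experiment name + user_id) ignores every user attribute,
    and the same user always lands in the same bucket."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    fraction = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform in [0, 1]
    return "treatment" if fraction < treatment_share else "control"

print(assign_bucket("user-42"))  # stable across calls for the same user
```

Seeding the hash with the experiment name means a user's bucket in one experiment does not determine their bucket in the next.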
Let’s say you are testing a change to a website feature and you are interested in responses only from a specific region, such as the US. By first splitting the sample into two groups (control and treatment) irrespective of region, US visitors are automatically divided between the two groups.
Within those two buckets, visitor attributes can then be inspected at test time, for example:
if region == "US" and bucket == "treatment":
    pass  # do something treatment-related here
elif region == "US" and bucket == "control":
    pass  # do something control-related here
else:
    pass  # catch-all for non-US visitors (not relevant to this test)
5. Include only those people in your analysis who could have been affected by the change
If the experiment includes users whose experience is not affected by our change, we are adding noise and reducing our ability to detect the effect of the variants.
Here are a pair of examples:
- If you change a specific page layout, only add users to the experiment if they actually visit that page.
- If an experiment lowers the free-shipping threshold from $x to $y, include only those users whose cart sizes fall between $x and $y; they are the only users who will see a difference between the treatment and control groups.
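The eligibility rules above can be sketched as a simple gate that runs before bucket assignment. Everything here is an illustrative assumption: the experiment names, the user fields, and the $25/$50 thresholds are hypothetical stand-ins for $y and $x.

```python
def should_enroll(user, experiment):
    """Enroll a user only if the change could affect their experience.
    The experiment names and user fields below are illustrative."""
    if experiment == "new-page-layout":
        # only users who actually visit the changed page belong in the test
        return user.get("visited_page", False)
    if experiment == "lower-free-shipping":
        # only carts between the new ($25) and old ($50) thresholds
        # would see any difference between treatment and control
        return 25 <= user.get("cart_total", 0) < 50
    return False

print(should_enroll({"cart_total": 30}, "lower-free-shipping"))  # True
```

Ineligible users are simply never enrolled, so they add no noise to the comparison.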
Imagine you are running an experiment on your search page: a visitor who never visits the search page and buys something from the homepage should not enter the experiment at all. It is evident that A/B testing is a specialty in its own right, and entering into experiments without understanding the process will only lead to surprises. The simple points discussed above should be useful as you explore further.