A/B Testing: an introduction and much more

Srijan Bhushan
Jul 18, 2022


Is what we observe the truth, or just a coincidence?

What do you mean by an A/B test? What is that?

A/B tests, also known as controlled experiments, are widely used in the software world to answer one simple question: "will this new thing I am about to introduce make my website and business better, worse, or have no effect?" This new thing can be a UI (user interface) change, a feed algorithm change, a content change, a new marketing campaign, or something else.

What’s a simple use case, an example?

Let’s say there is an online coffee delivery website called coffeetogo.com. One of the people who works on the website had an idea to change the website’s background color from white to grey. Before launching the change, an A/B test was conducted to see the effect of the new background color on the time users spend on the website. Unexpectedly, the time users spent on the website decreased, so the decision was made not to launch the change. Crisis averted.

Why do you have to call it an A/B test? Why can’t we just call it a simple test and run that?

An A/B test is called so because we set up two groups of people: group A, who see the older version of the website, and group B, who see the newer version. Both groups are then compared to see how time spent on the website changes, and the result is verified with statistical confidence measures.

That’s a fair argument: why don’t we just launch a simple test with the new thing and observe the outcome? Continuing the coffeetogo.com example, we could conduct a simple test where we show everyone the new background color and watch how their time spent behaves. The reason we do not do that, and instead run a controlled A/B test with two groups of people and statistical confidence measures, is that correlation does not equal causation. Even if in the simple test we observe that time spent by users goes up, we will not know for sure whether the background color was the real reason; it might just be a correlation (a relationship, not a cause), since other things could have changed at the same time, such as a holiday season or a marketing push. To know whether the background color was indeed the real cause, and not a coincidence, we conduct an A/B test with two groups and statistical measures, so that we can be confident about causality.
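For intuition, here is a minimal sketch (in Python) of how the two groups might be compared once the data is collected. The sample sizes, means, and effect below are made up purely for illustration.

```python
# Minimal sketch: compare time spent on site between group A (control) and
# group B (treatment) with Welch's two-sample t-test.
# All numbers below are made up for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
time_spent_a = rng.normal(loc=5.0, scale=2.0, size=1000)  # minutes, white background
time_spent_b = rng.normal(loc=4.8, scale=2.0, size=1000)  # minutes, grey background

t_stat, p_value = stats.ttest_ind(time_spent_a, time_spent_b, equal_var=False)
print(f"mean A = {time_spent_a.mean():.2f} min, mean B = {time_spent_b.mean():.2f} min")
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# A small p-value (e.g. < 0.05) suggests the observed difference between the
# groups is unlikely to be a coincidence.
```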

What are the steps to set up an A/B test?

Companies have experimentation platforms that can run thousands of experiments at scale, and there have been many instances where an A/B test of a new feature has shown strong growth in revenue or user engagement. But how do we set up an A/B test? What is the basic process?

A few steps, as follows:

Step 1: Form the hypothesis, for example: “a new feature X on the product website will change user engagement.” Also form the null hypothesis, which is: “a new feature X on the product website will have no effect on user engagement.”

Step 2: Form your OEC (overall evaluation criterion). This could be a single metric or a combination of several. For example, you might focus on “average sessions per user” or “average revenue per user” as the metric. Decide which metric you, as a business, think is the most important. Your evaluation metrics can and should also include guardrail metrics.

An OEC is like a car’s dashboard
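As an illustration, here is a hypothetical sketch of how an OEC and a guardrail metric could be computed from raw event data; the column names and numbers are invented for the example.

```python
# Hypothetical sketch: compute an OEC ("average sessions per user") and a
# guardrail metric ("average page load time") from raw event logs.
# Column names and values are invented for illustration.
import pandas as pd

events = pd.DataFrame({
    "user_id":      [1, 1, 2, 2, 2, 3],
    "session_id":   ["s1", "s2", "s3", "s4", "s5", "s6"],
    "load_time_ms": [320, 290, 450, 400, 380, 310],
})

# OEC: average number of distinct sessions per user
oec = events.groupby("user_id")["session_id"].nunique().mean()

# Guardrail: average page load time, which should not regress in the treatment
guardrail = events["load_time_ms"].mean()

print(f"OEC (avg sessions per user): {oec:.2f}")
print(f"Guardrail (avg load time, ms): {guardrail:.1f}")
```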

Step 3: Decide on your randomization unit. Do you want your A and B groups to be randomly assigned by user, by time, or by something else?
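When the randomization unit is the user, a common approach is to hash a stable user ID into a bucket, so that each user always sees the same variant. Here is a rough sketch; the experiment salt and the 50/50 split are assumptions for the example.

```python
# Rough sketch: deterministic user-level assignment to group A or B.
# The experiment salt and the 50/50 split are assumptions for this example.
import hashlib

def assign_variant(user_id: str, experiment_salt: str = "coffeetogo-bg-color") -> str:
    """Hash the user ID plus an experiment-specific salt into bucket A or B."""
    digest = hashlib.sha256(f"{experiment_salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100          # map the hash to a number in 0..99
    return "A" if bucket < 50 else "B"      # 50/50 split between control and treatment

print(assign_variant("user-123"))  # the same user always gets the same variant
```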

Step 4: Decide on the required statistical power (the probability of correctly rejecting the null hypothesis, i.e. 1 - β, where β is the Type II error rate), the statistical significance level (alpha, α), the practical significance (the minimum effect size worth acting on), and the sample size (n) for each group.

sample size per group ≈ 2 × (z(1 - α/2) + z(1 - β))² × σ² / δ²

where z(1 - α/2) and z(1 - β) are standard normal quantiles, σ is the standard deviation of the metric, and δ is the practical significance (the minimum detectable effect).

You also need to estimate how long you want to run the A/B test for. This is usually driven by the time it takes to collect the required sample sizes, plus any cyclical behavior of users (for example, weekday versus weekend patterns). To keep the terminology straight: α is the statistical significance level, i.e. the Type I error rate, and β is the Type II error rate, so power = 1 - β.
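As a concrete sketch, the sample-size formula above can be computed as follows; the standard deviation and minimum detectable effect are made-up numbers.

```python
# Sketch: sample size per group for a two-sample comparison of means.
# n per group ≈ 2 * (z(1 - α/2) + z(1 - β))^2 * σ^2 / δ^2
# The standard deviation (sigma) and minimum detectable effect (delta)
# are made-up numbers for illustration.
from scipy.stats import norm

alpha = 0.05   # significance level (Type I error rate)
power = 0.80   # 1 - β, where β is the Type II error rate
sigma = 2.0    # std dev of time spent per user, in minutes (assumed)
delta = 0.2    # practical significance / minimum detectable effect, in minutes (assumed)

z_alpha = norm.ppf(1 - alpha / 2)  # ≈ 1.96
z_beta = norm.ppf(power)           # ≈ 0.84
n_per_group = 2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2
print(f"required sample size per group: {n_per_group:.0f}")  # roughly 1,570 users
```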

Step 5: Run the experiment for the time period you decided on. Make sure that, as the test runs, there are no spillover effects and that everything is running as expected.

Step 6: Measure and interpret the results of your test, including its statistical and practical significance. What do you think of the results? It could be that you see an increase in the OEC metric, average sessions per user, and that the increase is significant!
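As a sketch of what this interpretation step might look like, assuming the OEC is average sessions per user and an assumed practical significance threshold of 0.1 sessions per user:

```python
# Sketch: estimate the lift in the OEC, its 95% confidence interval, and a
# p-value, then compare the lift against a practical significance threshold.
# Data and thresholds are made up for illustration.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
oec_a = rng.poisson(lam=3.0, size=2000)  # sessions per user, control group
oec_b = rng.poisson(lam=3.2, size=2000)  # sessions per user, treatment group

lift = oec_b.mean() - oec_a.mean()
se = np.sqrt(oec_a.var(ddof=1) / len(oec_a) + oec_b.var(ddof=1) / len(oec_b))
ci_low, ci_high = lift - 1.96 * se, lift + 1.96 * se
p_value = 2 * (1 - norm.cdf(abs(lift / se)))  # two-sided z-test on the difference

practical_threshold = 0.1  # assumed minimum lift worth launching for
print(f"lift = {lift:.2f}, 95% CI = [{ci_low:.2f}, {ci_high:.2f}], p = {p_value:.4f}")
if p_value < 0.05 and lift >= practical_threshold:
    print("statistically and practically significant: launch candidate")
else:
    print("not enough evidence to launch")
```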

Step 7: Decide whether or not to launch the new feature.

Could there be something wrong with your A/B test? Always be careful and skeptical.

Like with everything important, you need to be careful with your A/B tests. There are a number of things that can go wrong and make your A/B tests invalid.

Here are a few important pitfalls:

  • Lack of randomization: your A and B samples might not be truly random. There might be some skew in the samples, which makes them invalid and incomparable. So make sure your samples are truly random; you can check this by comparing attribute distributions between the two groups.
  • Insufficient sample size and sample ratio mismatch: the sample sizes of your A and B groups could be too small. Always check that your test reaches the required sample size for both groups. Also check that the observed split between A and B matches the intended ratio; a sample ratio mismatch usually means the assignment mechanism is broken (see the sketch after this list).
  • Spillover feature change: another feature launch or A/B test might be spilling over into your A/B test. Make sure there are no parallel new features or A/B tests being run on your sample.
  • Insufficient observation time: the time the A/B test runs for is too short or badly chosen. There might be cyclical behavior (weekly or seasonal patterns, for example) that is not being captured, making the test invalid.
  • Feature implementation: a segment — like a type of browser or OS — might not be experiencing the feature (variant) due to a bug.
  • Instrumentation and data discrepancies: there might be issues with how your logging, tracking, and feature instrumentation are implemented. Make sure those technical aspects are working as expected.
  • Network effects: there might be cases, especially in social products, where one of the A or B groups is affecting, or taking resources from, the other group.
  • External effects: weather, holiday seasons, internet outages, etc.
  • Bots: bot traffic can introduce invalid data into your test.
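For example, the sample ratio mismatch mentioned above can be caught with a simple chi-square goodness-of-fit test; the user counts below are invented for illustration.

```python
# Sketch: sample ratio mismatch (SRM) check with a chi-square goodness-of-fit
# test. Observed user counts are invented; the intended split is 50/50.
from scipy.stats import chisquare

observed = [50421, 49312]              # users actually bucketed into A and B
total = sum(observed)
expected = [total * 0.5, total * 0.5]  # counts a true 50/50 split would give

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square = {stat:.2f}, p = {p_value:.4f}")
# A very small p-value (e.g. < 0.001) is a red flag: the assignment mechanism
# is probably broken and the test results should not be trusted.
```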

Conclusion and parting thoughts

Most experiments fail, in the sense that the favorable outcome does not show up. So be ready and agile: reject your ideas and features when the data says so, and keep testing new features continuously. The outcome of an experiment cannot reliably be predicted in advance, no matter how confident you are; many people have tried, failed, and learnt that predicting the outcome of an experiment is futile. Let the experiment and the OEC tell us the results.
