How long should my test run for?


AKA: How many conversions do I need? When working in optimisation or analytics this question comes up a lot, so rather than frequently repeating yourself it's a good subject to have documented somewhere to refer your inquisitive marketing colleagues to. Below is an article I've compiled with the help of several sources to best answer this question.

A common misconception is that you can stop the test as soon as a statistical confidence level of 95% or higher is reached. Consider this: in a run of 1,000 A/A tests (two identical pages tested against each other), at some point during the test:
  • 77% reached 90% significance
  • 53% reached 95% significance
So if you stop your test as soon as you see significance, there's roughly a 50% chance it's just noise. [3] Moreover, data from Google suggests that 90% of content changes have no impact or a negative one [1], so we need to be sure we're testing correctly!
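To see why peeking is dangerous, here's a rough simulation (my own sketch, not the study behind the numbers above): it runs many A/A tests, checks for significance after every batch of visitors, and counts how many ever cross the 95% threshold. The conversion rate, traffic and peeking schedule are made-up values, so the exact percentage will differ from the figures quoted above, but the inflated false positive rate is the point.

```python
# Sketch: simulate the "peeking" problem with A/A tests.
# Both arms share the same true conversion rate, so every "significant"
# result is a false positive caused by checking the test repeatedly.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

TRUE_RATE = 0.05           # both variants convert at 5% (assumed)
VISITORS_PER_ARM = 20_000  # assumed traffic per arm
CHECK_EVERY = 1_000        # peek at the results after every 1,000 visitors per arm
N_TESTS = 1_000

ever_significant = 0
for _ in range(N_TESTS):
    a = rng.random(VISITORS_PER_ARM) < TRUE_RATE
    b = rng.random(VISITORS_PER_ARM) < TRUE_RATE
    for n in range(CHECK_EVERY, VISITORS_PER_ARM + 1, CHECK_EVERY):
        rate_a, rate_b = a[:n].mean(), b[:n].mean()
        pooled = (a[:n].sum() + b[:n].sum()) / (2 * n)
        se = np.sqrt(pooled * (1 - pooled) * 2 / n)   # pooled two-proportion z-test
        p_value = 2 * stats.norm.sf(abs(rate_a - rate_b) / se)
        if p_value < 0.05:                            # "reached 95% significance" at this peek
            ever_significant += 1
            break

print(f"{ever_significant / N_TESTS:.0%} of A/A tests looked significant at some point")
```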

Sample size
A healthy sample size (visitors, conversions) is at the heart of making accurate statistical conclusions. An A/B test reaches confidence when the observed difference is bigger than chance alone can plausibly explain. Imagine you are trying to find out whether there's a difference between the heights of men and women. If you only measured a handful of men and women you would risk not detecting that men are taller than women. Why? Because random fluctuations mean you might choose an especially tall woman or group of women. However, if you measure many people, the averages for men and women will stabilise and you will detect the difference that truly exists. That's because statistical power increases with sample size. [1]
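If you want to see that play out, here's a quick simulation (my own illustration, with assumed heights of roughly 178 cm vs 165 cm and a 7 cm spread): with only a handful of people per group the real difference is often missed, while with larger groups it's detected almost every time.

```python
# Sketch: statistical power grows with sample size.
# A real difference (men's vs women's average height) is often missed with a
# handful of measurements, and almost never missed once the sample is large.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
N_SIMS = 2_000

for n in (3, 10, 50):                        # people measured per group
    detected = 0
    for _ in range(N_SIMS):
        men = rng.normal(178, 7, n)          # assumed mean 178 cm, sd 7 cm
        women = rng.normal(165, 7, n)        # assumed mean 165 cm, sd 7 cm
        result = stats.ttest_ind(men, women)
        detected += result.pvalue < 0.05
    print(f"n={n:>2} per group: difference detected {detected / N_SIMS:.0%} of the time")
```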

Understanding how many visitors / conversions you need
There are a couple of things you need to know:

  • Your baseline conversion rate (the current conversion rate of the page you're testing).
  • The minimum improvement you want to be able to detect, sometimes referred to as the Minimum Detectable Effect (MDE). You *could* think of this as the uplift expected from the experiment. Be warned: if you limit yourself to detecting uplifts of 10%+ you will miss the smaller wins that are out there.

The table below gives you a general guide for visitor and conversion volume *per branch* (an A/B test has 2 branches). Example: you're running an A/B test with a baseline conversion rate of 5% and you want to be able to detect a relative win of 5%; this would require roughly 120K visitors per branch, i.e. 120K * 2 = 240K visitors in total, which is approximately 12K conversions.
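Here's a rough sketch of where a number like that comes from, using the standard two-proportion sample size formula. The 80% power and two-sided 95% significance level are common defaults I've assumed, not figures taken from the table:

```python
# Sketch: visitors needed per branch to detect a relative uplift,
# using the standard two-proportion z-test sample size formula.
# Assumes a two-sided 5% significance level and 80% power.
import math
from scipy import stats

def visitors_per_branch(baseline_rate: float, relative_mde: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mde)
    p_bar = (p1 + p2) / 2
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)

# The worked example above: 5% baseline, 5% relative MDE -> ~122,000 per branch
n = visitors_per_branch(0.05, 0.05)
print(f"{n:,} visitors per branch, {2 * n:,} in total, "
      f"~{2 * n * 0.05:,.0f} conversions")   # conversions at the 5% baseline rate
```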



How to calculate the Minimum Detectable Effect
The value used will depend on what's being tested, among other factors. From experience, when little or no testing has been carried out bigger wins are possible, but once the low-hanging fruit has been picked you could be looking more in the 2-7% range; this means longer and fewer experiments, as the sketch below illustrates.
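To put some numbers on that trade-off, here's the same calculation done with statsmodels' power calculator (which gives essentially the same figures as the formula above) swept across a range of MDEs. The 5% baseline, 80% power and 95% significance level are my assumptions for illustration:

```python
# Sketch: how the required sample balloons as the MDE shrinks.
# Assumes a 5% baseline conversion rate, 80% power and a two-sided 5% alpha.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.05
solver = NormalIndPower()

for relative_mde in (0.02, 0.05, 0.10, 0.20):
    target = baseline * (1 + relative_mde)
    effect = proportion_effectsize(target, baseline)   # Cohen's h
    n = solver.solve_power(effect_size=effect, alpha=0.05,
                           power=0.80, ratio=1.0, alternative='two-sided')
    print(f"relative MDE {relative_mde:>4.0%}: ~{n:,.0f} visitors per branch")
```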

Time
In addition to having a sufficient sample size (volume), experiments also need to run for at least one full business cycle; this is normally 7 days (although 14 would be better). Experiments should also not be stopped mid-cycle: if the sample size is reached on day 10, we should continue the test for another 4 days. Average order value and conversion rate can differ from one day of the week to the next, and results should reflect the full mix of visitor types, which can vary between early morning on a weekday and the afternoon of a Sunday. Moreover, running experiments for sufficient time means we're less likely to be misled by the novelty effect. [3] [5] [6]
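As a small sketch of what that means in practice (the daily traffic figure is made up), you can round the runtime up to whole business cycles once you know the required sample size:

```python
# Sketch: round an experiment's runtime up to whole business cycles,
# so a test that hits its sample size on day 10 still runs to day 14.
import math

def days_to_run(visitors_needed_per_branch: int,
                daily_visitors_per_branch: int,
                cycle_days: int = 7) -> int:
    """Days needed to reach the sample size, rounded up to full cycles."""
    raw_days = math.ceil(visitors_needed_per_branch / daily_visitors_per_branch)
    return math.ceil(raw_days / cycle_days) * cycle_days

# e.g. ~122,000 visitors per branch at an assumed 13,000 visitors per branch per day
print(days_to_run(122_000, 13_000))   # 10 raw days -> rounded up to 14
```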

Capping sample size
Now that we know how many visitors / orders our experiment needs, we can cap the volume. This means our experiment stops once we've completed a full business cycle and we have the minimum required sample size; at this point it's declared a winner or a loser.
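In code the stopping rule is simply both conditions together (a sketch with hypothetical names):

```python
# Sketch: the experiment only stops once BOTH conditions hold --
# a whole number of business cycles completed AND the required sample reached.
def should_stop(days_run: int, visitors_per_branch: int,
                required_per_branch: int, cycle_days: int = 7) -> bool:
    completed_full_cycle = days_run > 0 and days_run % cycle_days == 0
    has_sample = visitors_per_branch >= required_per_branch
    return completed_full_cycle and has_sample

print(should_stop(days_run=10, visitors_per_branch=130_000, required_per_branch=122_000))  # False
print(should_stop(days_run=14, visitors_per_branch=170_000, required_per_branch=122_000))  # True
```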

What happens if your experiment is a mega-win?
Medical experiments use sequential experiment design for exactly this: it lets you set up checkpoints in advance at which you decide whether or not to continue the experiment. This means that if the minimum detectable effect was originally set to 5% but the treatment creative is outperforming by 15%, it's possible to evaluate this at predefined checkpoints while still maintaining the required significance level. [2]
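Here's a deliberately simple sketch of the idea (my own illustration, not the scheme from [2]): plan a fixed number of interim looks up front and spend a Bonferroni-corrected slice of the overall alpha at each one. Real group-sequential designs usually use Pocock or O'Brien-Fleming boundaries, which are less conservative.

```python
# Sketch: pre-planned interim checkpoints with a Bonferroni-corrected alpha.
# A treatment that massively outperforms the planned MDE can be called early
# at one of these looks without inflating the overall false positive rate.
import math
from scipy import stats

def significant_at_checkpoint(conv_a: int, n_a: int, conv_b: int, n_b: int,
                              n_looks: int = 4, overall_alpha: float = 0.05) -> bool:
    """Two-proportion z-test at one pre-planned look, alpha split across looks."""
    alpha_per_look = overall_alpha / n_looks          # conservative Bonferroni split
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    p_value = 2 * stats.norm.sf(abs(rate_a - rate_b) / se)
    return p_value < alpha_per_look

# A treatment far ahead of the control can be declared a winner at an interim look:
print(significant_at_checkpoint(conv_a=500, n_a=10_000, conv_b=600, n_b=10_000))  # True
```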

References
Most A/B test results are illusory [1]
How not to run an A/B test [2]
How many conversions do I need [3]
Sample size calculator [4]
How long to run an A/B test for [5]
Novelty effect [6]

Right, that's it - thanks for reading. If you have any comments, questions or feedback please leave them below. And you can follow new posts from this blog on Twitter, Email or RSS.