
A simple hack is to run an A-A-B-B test instead of an A-B test. Rather than splitting 50-50, use a 25-25-25-25 split. When A1 and A2 agree (and B1 and B2 agree), you know the noise has settled enough to be statistically meaningful, and you can compare A to B. Depending on the dataset, this could happen in minutes or weeks.
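For concreteness, a minimal sketch of that 25-25-25-25 assignment in Python (the bucket names and the per-visitor random assignment are just illustrative):

    import random

    # Four equal buckets: A1 and A2 both serve variant A, B1 and B2 both serve B.
    def assign_bucket():
        return random.choice(['A1', 'A2', 'B1', 'B2'])

    bucket = assign_bucket()
    variant = 'A' if bucket.startswith('A') else 'B'
    # Log (bucket, variant, clicked) per visitor so that A1 vs A2 and
    # B1 vs B2 can be compared later.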


To explain this in a different way, let's use a simplified example:

Suppose I have a website with a "Click Me" button that's green in color. I want to increase clicks and think to myself, "perhaps if it was a red button instead of a green button, more people would click!" To test this, I would run an A-B test along the lines of:

    color = 'red' if random.randrange(2) == 0 else 'green'   # Python; assumes "import random"

In theory, I just push this code and track the number of clicks on the red button versus the green button and then pick the best. But in practice, when I push the code, there might be 5 clicks on green and none on red in the first hour. Maybe green is better? Maybe I didn't wait long enough? Okay, let's wait longer. A few hours later, there's now 10 clicks on red and only 6 clicks on green. Okay, so red is better? Let's wait even longer. A week later, there's 5000 clicks on red and 4500 clicks on green. That seems like enough data that I can make a conclusion about red vs. green. But is there a better way?

This is where A-A-B-B testing can help. Let's start by looking at just the A-A part of the test. If I split my audience into two groups (green1 and green2) and show them both green buttons, the results should be identical because both buttons are green. If I check back in an hour and the "green1" and the "green2" groups are off by 20%, then I have a large margin of error and need to wait longer. If I check back in 6 hours and they're off by 10%, then I need to wait longer. If I check back in a day and green1 and green2 are only off by 1% then that means we've probably waited long enough and my margin of error is around 1%. I can now add green1+green2 and compare it to red1+red2 groups and see if there's a clear winner (e.g. red is 5% better). And this only took a day instead of a week!
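A rough sketch of that A-A convergence check, assuming we log clicks and visitors per group (the 1% threshold and the numbers are illustrative):

    # If the two A groups still disagree by more than ~1%, the noise is
    # larger than the margin we care about, so keep waiting.
    def relative_gap(clicks1, visitors1, clicks2, visitors2):
        rate1 = clicks1 / visitors1
        rate2 = clicks2 / visitors2
        return abs(rate1 - rate2) / max(rate1, rate2)

    # e.g. green1: 498 clicks / 10000 visitors, green2: 502 clicks / 10000 visitors
    if relative_gap(498, 10_000, 502, 10_000) < 0.01:
        print("A-A has converged; now compare green1+green2 against red1+red2")
    else:
        print("Still too noisy; keep collecting data")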


Using four buckets instead of two like that will improve your confidence in the results, but it will also double the required sample size / testing duration. You could just as easily use two buckets and wait twice as long to achieve the same effect.
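A back-of-the-envelope way to see the cost, assuming a roughly 5% click-through rate (the standard error of a proportion scales as sqrt(p(1-p)/n), so halving each group's size inflates it by about 1.4x):

    import math

    def std_error(p, n):
        # Standard error of an observed proportion p over n visitors.
        return math.sqrt(p * (1 - p) / n)

    p = 0.05                 # assumed click-through rate
    visitors = 10_000        # total traffic during the test

    print(std_error(p, visitors // 2))   # two buckets:  ~0.0031 per group
    print(std_error(p, visitors // 4))   # four buckets: ~0.0044 per group (sqrt(2) worse)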


A/A testing (null testing) or A/A/B testing serves a different purpose than A/B testing.

Microsoft Research suggested (http://ai.stanford.edu/~ronnyk/2009controlledExperimentsOnTh...) that you continuously run A/A tests alongside your experiments. An A/A test can:

- Collect data and assess its variability for power calculations

- Test the experimentation system itself (the null hypothesis should be rejected about 5% of the time when a 95% confidence level is used; see the sketch below)

- Tell whether users are being split according to the planned percentages
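On the second point, a small simulation (illustrative numbers, a standard two-proportion z-test) should reject the null in roughly 5% of A/A runs when testing at a 95% confidence level:

    import math
    import random

    def aa_test_rejects(n, p, z_crit=1.96):
        # One A/A run: both groups draw from the same true rate p, so any
        # "significant" difference is a false positive.
        a = sum(random.random() < p for _ in range(n))
        b = sum(random.random() < p for _ in range(n))
        pooled = (a + b) / (2 * n)
        se = math.sqrt(2 * pooled * (1 - pooled) / n)
        if se == 0:
            return False
        z = abs(a / n - b / n) / se
        return z > z_crit              # two-sided test at alpha = 0.05

    runs = 1000
    rejections = sum(aa_test_rejects(n=5000, p=0.05) for _ in range(runs))
    print(rejections / runs)           # should come out near 0.05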


Can you explain why? I'm struggling with the math behind the whole thing as it is, but intuitively this sounds like a very clever hack. I wonder why it would double the experiment time when, effectively, people are still seeing either an A or a B variant.


That comment is brilliant, thanks for contributing it.

You'll probably have to check that it holds sequentially too, at least to be sure the As and Bs stay matched over time, but it seems to me an elegant solution to the problem (not that I'm a statistician, though).


This is better than stopping as soon as you see a statistically significant finding, which is nearly always the wrong thing to do (repeated peeking inflates the false-positive rate; see the sketch below). Do you have any math behind this?
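A rough illustration of that peeking problem, simulated on A/A data with a two-proportion z-test (the batch sizes and number of peeks are illustrative):

    import math
    import random

    def peeking_finds_a_winner(batch, peeks, p, z_crit=1.96):
        # A/A data again, but now we peek after every batch of visitors and
        # stop as soon as the z-test looks significant.
        a_clicks = b_clicks = n = 0
        for _ in range(peeks):
            n += batch
            a_clicks += sum(random.random() < p for _ in range(batch))
            b_clicks += sum(random.random() < p for _ in range(batch))
            pooled = (a_clicks + b_clicks) / (2 * n)
            se = math.sqrt(2 * pooled * (1 - pooled) / n)
            if se > 0 and abs(a_clicks / n - b_clicks / n) / se > z_crit:
                return True            # declared a spurious "winner" early
        return False

    runs = 500
    hits = sum(peeking_finds_a_winner(batch=500, peeks=20, p=0.05) for _ in range(runs))
    print(hits / runs)                 # well above the nominal 5% false-positive rate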


I'm not sure I understand - isn't that essentially an A/B test because 25 + 25 = 50?


I believe it lets you compensate for the possibility that, say, all of your conversions might be coming from the bottom 1% of your users. Segmenting A into A1/A2 therefore insulates your interpretation of the results for A from being as heavily skewed.


Yes, but in your A/B test you shouldn't be comparing the first half against the second half. Each visitor should be randomly assigned, so that should mitigate the problem you mentioned.



