20 lines of code that will beat A/B testing every time
Posted on: 2012-05-28 21:10:15

A/B testing is used far too often, for something that performs so badly. It is defective by design: segment users into two groups. Show the A group the old, tried and true stuff. Show the B group the new whiz-bang design with the bigger buttons and slightly different copy. After a while, take a look at the stats and figure out which group presses the button more often. Sounds good, right? The problem is staring you in the face. It is the same dilemma faced by researchers administering drug studies. During drug trials, you can only give half the patients the life-saving treatment. The others get sugar water. If the treatment works, the group that got the sugar water lost out. This sacrifice is made to get good data. But it doesn't have to be this way.

In recent years, hundreds of the brightest minds of modern civilization have been hard at work not curing cancer. Instead, they have been refining techniques for getting you and me to click on banner ads. It has been working. Both Google and Microsoft are focusing on using more information about visitors to predict what to show them. Strangely, anything better than A/B testing is absent from mainstream tools, including Google Analytics and Google Website Optimizer. I hope to change that by raising awareness about better techniques.

With a simple 20-line change to how A/B testing works, one that you can implement today, you can always do better than A/B testing -- sometimes two or three times better. This method has several good points:

  • It can reasonably handle more than two options at once, e.g. A, B, C, D, E, F, G, …
  • New options can be added or removed at any time.
But the most enticing part is that you can set it and forget it. If your time is really worth $1000/hour, you don't have time to go back and check how every change you made is doing and pick the winning options. You don't have time to write rambling blog entries about how you got your site redesigned and changed this and that and it worked or it didn't work. Let the algorithm do its job. These 20 lines of code automatically find the best choice quickly, and then use it until it stops being the best choice.

The Multi-armed bandit problem


Picture from Microsoft Research

The multi-armed bandit problem takes its terminology from a casino. You are faced with a wall of slot machines, each with its own lever. You suspect that some slot machines pay out more frequently than others. How can you learn which machine is the best, and get the most coins in the fewest trials?

As with many techniques in machine learning, the simplest strategy is hard to beat. More complicated techniques are worth considering, but they may eke out only a few hundredths of a percentage point of performance. One strategy that has been shown to perform well time after time in practical problems is the epsilon-greedy method. For each lever, we keep track of the number of times it has been pulled and the total reward it has returned. 10% of the time, we choose a lever at random. The other 90% of the time, we choose the lever that has the highest expectation of reward.

def choose():
    if random.random() < 0.1:
        # exploration!
        # choose a random lever 10% of the time.
    else:
        # exploitation!
        # for each lever,
            # calculate the expectation of reward.
            # This is the total reward given by that lever
            # divided by the number of trials of the lever.
        # choose the lever with the greatest expectation of reward.
    # increment the number of times the chosen lever has been played.
    # store test data in redis, choice in session key, etc..

def reward(choice, amount):
    # add the reward to the total for the given lever.
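
For concreteness, here is one way the skeleton above might be filled in as a self-contained, in-memory sketch. The counts dictionary, the hard-coded choices, and the default reward of 1 per click are illustrative assumptions; in practice the data would live in Redis and the choice in a session key, as the comments above suggest.

import random

# In-memory test data: one [trials, total reward] pair per lever.
counts = {"orange": [1, 1], "green": [1, 1], "white": [1, 1]}

EPSILON = 0.1   # fraction of visits spent exploring

def choose():
    if random.random() < EPSILON:
        # exploration: pick a lever at random
        choice = random.choice(list(counts))
    else:
        # exploitation: pick the lever with the highest expected reward,
        # i.e. total reward divided by number of trials
        choice = max(counts, key=lambda c: counts[c][1] / counts[c][0])
    counts[choice][0] += 1   # one more trial for the chosen lever
    return choice

def reward(choice, amount=1):
    # add the reward to the total for the given lever
    counts[choice][1] += amount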

Why does this work?

Let's say we are choosing a colour for the "Buy now!" button. The choices are orange, green, or white. We initialize all three choices to 1 win out of 1 try. It doesn't really matter what we initialize them to, because the algorithm will adapt. So when we start out, the internal test data looks like this:

Orange        Green         White
1/1 = 100%    1/1 = 100%    1/1 = 100%

Then a web site visitor comes along and we have to show them a button. We choose the first one with the highest expectation of winning. The algorithm thinks they all work 100% of the time, so it chooses the first one: orange. But, alas, the visitor doesn't click on the button.

Orange        Green         White
1/2 = 50%     1/1 = 100%    1/1 = 100%

Another visitor comes along. We definitely won't show them orange, since we think it only has a 50% chance of working. So we choose Green. They don't click. The same thing happens for several more visitors, and we end up cycling through the choices. In the process, we refine our estimate of the click through rate for each option downwards.

Orange        Green         White
1/4 = 25%     1/4 = 25%     1/4 = 25%

But suddenly, someone clicks on the orange button! Quickly, the browser makes an Ajax call to our reward function, $.ajax({url: "/reward?testname=buy-button"}), and our code updates the results:

Orange        Green         White
2/5 = 40%     1/4 = 25%     1/4 = 25%

When our intrepid web developer sees this, he scratches his head. What the F*? The orange button is the worst choice. Its font is tiny! The green button is obviously the better one. All is lost! The greedy algorithm will always choose orange forever now!

But wait, let's see what happens if Orange is really the suboptimal choice. Since the algorithm now believes it is the best, it will always be shown. That is, until it stops working well. Then the other choices start to look better.

Orange        Green         White
2/9 = 22%     1/4 = 25%     1/4 = 25%

After many more visits, the best choice, if there is one, will have been found, and will be shown 90% of the time. Here are some results based on an actual web site that I have been working on. We also have an estimate of the click through rate for each choice.

Orange             Green              White
114/4071 = 2.8%    205/6385 = 3.2%    59/2264 = 2.6%
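
If you want to watch this convergence happen, a few lines of simulation are enough. The true click-through rates below are made-up numbers that roughly mirror the table above; with rates this close together, expect the winner to need many visits to pull clearly ahead.

import random

true_ctr = {"orange": 0.028, "green": 0.032, "white": 0.026}   # assumed rates
counts = {c: [1, 1] for c in true_ctr}    # [trials, reward], initialized to 1/1
shown = {c: 0 for c in true_ctr}

def choose():
    if random.random() < 0.1:
        choice = random.choice(list(counts))                               # explore
    else:
        choice = max(counts, key=lambda c: counts[c][1] / counts[c][0])    # exploit
    counts[choice][0] += 1
    return choice

for _ in range(100000):
    c = choose()
    shown[c] += 1
    if random.random() < true_ctr[c]:
        counts[c][1] += 1    # the simulated visitor clicked

print(shown)                                                        # impressions per option
print({c: round(counts[c][1] / counts[c][0], 4) for c in counts})   # estimated CTRs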

Edit: What about the randomization?

I have not discussed the randomization part. The randomization of 10% of trials forces the algorithm to explore the options. It is a trade-off between trying new things in hopes of something better, and sticking with what it knows will work. There are several variations of the epsilon-greedy strategy. In the epsilon-first strategy, you can explore 100% of the time in the beginning and once you have a good sample, switch to pure-greedy. Alternatively, you can have it decrease the amount of exploration as time passes. The epsilon-greedy strategy that I have described is a good balance between simplicity and performance. Learning about the other algorithms, such as UCB, Boltzmann Exploration, and methods that take context into account, is fascinating, but optional if you just want something that works.
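
As a sketch of the decreasing-exploration idea, the exploration rate can be tied to how much data has accumulated. The decay schedule below is an arbitrary choice, shown only to illustrate the shape of the approach; counts uses the same [trials, reward] layout as above.

import random

def choose_with_decay(counts, min_epsilon=0.01):
    # Explore heavily at first, then less and less as data accumulates.
    total_plays = sum(trials for trials, _ in counts.values())
    epsilon = max(min_epsilon, 1.0 / (1.0 + total_plays / 100.0))
    if random.random() < epsilon:
        choice = random.choice(list(counts))
    else:
        choice = max(counts, key=lambda c: counts[c][1] / counts[c][0])
    counts[choice][0] += 1
    return choice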

Wait a minute, why isn't everybody doing this?

Statistics are hard for most people to understand. People distrust things that they do not understand, and they especially distrust machine learning algorithms, even if they are simple. Mainstream tools don't support this, because then you'd have to educate people about it, and about statistics, and that is hard. Some common objections might be:
  • Showing the different options at different rates will skew the results. (No it won't. You always have an estimate of the click through rate for each choice)
  • This won't adapt to change. (Your visitors probably don't change. But if you really want to, in the reward function, multiply the old reward value by a forgetting factor; see the sketch after this list.)
  • This won't handle changing several things at once that depend on each-other. (Agreed. Neither will A/B testing.)
  • I won't know what the click is worth for 30 days so how can I reward it?
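
A minimal sketch of the forgetting-factor idea mentioned above; the 0.99 decay rate is an arbitrary assumption, and decaying both counters keeps their ratio an estimate of the recent click-through rate.

def reward_with_forgetting(counts, choice, amount=1, decay=0.99):
    # Fade out old evidence so the estimate tracks recent visitor behaviour.
    trials, total_reward = counts[choice]
    counts[choice] = [trials * decay, total_reward * decay + amount]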

Comments

Kevin

2012-05-29 18:28:52
Decision optimization is the next step past A/B testing.

There's a startup called Conductrics that's doing this as a service. Really cool stuff, especially since I have no formal background in statistical modeling.

Lawrence

2012-05-29 18:33:48
I'd also love to see a startup crank this into a functional piece of code, as Leviathan mentioned. It may be 20 lines, but it's beyond my capability. Nonetheless, I've bookmarked this for a future project. Sounds like a much better way to A/B test.

peanutsmonkey

2012-05-29 18:39:03
Hi,

I'd be really keen to understand this further. Would it be possible to drop you an email? If you drop me an email (my name, followed by an "at" symbol, ending with gmail period com), I'll respond.

Ryan

2012-05-29 19:10:08
Made a typo in the last example: B is favorite and has 19,000 views, A is not the favorite and has 1,000. If B was the favorite all along, then all is good. But if A is the favorite only after reaching 1,000 views, then it takes 20,000 views to make that determination.

MicroAngelo

2012-05-29 19:31:49
Ryan, I think you've misunderstood how this method works: the percentage figure is the expectation (probability of success) a.k.a. click-through-rate, not the percentage of time that entry is shown.

In your example where A is initially unpopular, B will then be shown, but *only for as long as it's successful*. If it's not popular either then its expectation will very rapidly decrease until A starts getting shown again.

All other things being equal, this method will show you which option gives you the best CTR, which is all you really care about anyway.

Tim

2012-05-29 19:32:45
Beating a random strategy (or one that relies on randomness for a portion of its execution) is always a tough proposition. This is a really nice approach, both pragmatic and practical - love it.

Eduardo

2012-05-29 19:48:21
I don't see the point here. The only difference is that you test only a small part of your audience. You could do that with GWO as well. Notice that GWO also drops options that don't perform very well if you tell it to, so you don't lose too many conversions based on the test.

An important feature of A/B tools is that a specific user always sees the same option, so they avoid the "It was not like this yesterday" effect.

You should check Webtrends Optimize, that one is a really innovative tool. It not only selects the best option to your average audience but also selects the best option for different types of users, based on where they come from for example.

Jaap

2012-05-29 20:21:41
Interesting theory, thanks!

Ryan

2012-05-29 23:53:14
MicroAngelo, I understand the point you're making, in that the test measures the CTR percentage and not the number of times it's shown. The point I was trying to make is that you will see the current favorite more than you will see the non-favorites, because the algorithm displays the favorite 90% of the time and a random choice of the non-favorites 10% of the time. As one of the non-favorites becomes more popular, it will take the place of the favorite. That's all well and good, but that assumes that you have a reasonably good distribution of data coming in for each page. That's why the CTR percentage only becomes valid after x number of views. After several thousand page views, you can be reasonably certain that your CTR is stable. My point is that if the data is not very well distributed, it could take a much longer time to reach the number of page views needed to make the determination.

This is kinda like quick sort. In most cases, with a random distribution, it will run in O(n log(n)), but in the worst case, mathematically, it's O(n^2). I think the algorithm described above could really make great improvements over the current standard A/B testing, but anyone that uses it needs to know the pros and cons so that it can be tweaked properly. In some cases, maybe the 90-10 split could be more optimal at 20-80 or 30-70. It really depends on what kind of data you have and finding a "sweet spot" for it. With careful analysis of the specific application, it could prove to be very powerful... but you do have to know what's going on and make accurate assumptions about the data. The situation that I thought of where this would not be optimal is if you have a lot of data initially for one of the tests that doesn't match the eventual CTR after x number of views.

Liviu Taloi

2012-05-30 02:30:39
Hi, nice approach. But there are some things to take into consideration:

- assuming that you are working with an ecommerce site, you should be consistent: as a user passes through more than 1 product page, he must always see the same colour on that button; it will be confusing for that specific user to see all the rainbow colours on the "add to cart" button.

- there should also be a normal/control group, an unbiased group that will receive the old version of the button; you want to see the increase in CTR/conversions/whatever with respect to the control group; but.. I think that somehow your approach is shorter than the A/B variant, where at the end you should do a follow-up test running only the winner in order to validate the result.

So.. it seems a very nice approach, but there is some software (SaaS) already doing this, not with 20 lines of code of course (and with a lot of money). There are MVT or A/B SaaS tools that, instead of leaving the owner to choose from the possible winners, choose the winner automatically during the test.

One little thing I am afraid of: if you want to test different "creatives"/banners/images/html-banners on different sections of a product page, for example, you would have to write these 20 lines of code for each zone that has multiple variants to choose from. So it could get quite messy inside the script source that generates that specific web page.

The approach using SaaS with section-divs where you upload creatives/banners/images is easier than writing 20 lines of code for each zone that we want to test, especially if you are not a programmer, or you cannot hire one for this task. The reports are also nice.. but this programming approach you are proposing is quite cost effective, I suppose.

Anyway, it's worth exploring this solution. It could be more cost effective for many little A/B tests.

Nice article! Thx.

PS. Sorry for my English as I'm not a native English speaking guy.

Hardi

2012-05-30 03:15:13
If I understood correctly, then mynaweb.com is offering a service that aims to simplify implementing this exact process using JS.

jsquash

2012-05-30 03:22:48
Great idea, but here is the one benefit of A/B testing: when testing a page your rankings will not be affected; Google says this. What about with this approach? You're sending traffic (and thus robots, etc) to these multiple pages and this could harm your overall SEO strategy.

Of course if you're not worried about SEO (such as if you have a big brand) then great, but while I do not enjoy the very simple A/B testing, I will still use it as my rankings will not be affected.

Just a thought :-)

Anton IMNorth

2012-05-30 05:03:11
Great article! I've honestly never seen A/B testing from this angle before and it really makes sense to try this out. I like the ajax suggestion and I'll probably look for a script like that myself.

ray

2012-05-30 06:24:33
Holy crap!

Richard Heylen

2012-05-30 07:35:14
Nice, but one of your numbers is wrong. 2/9=22% not 11%

Trevor

2012-05-30 07:39:56
2/9 = 22%, not 11%. But very interesting...

bob

2012-05-30 08:08:10
I'm using the Max A/B plugin for wordpress. It allows 4 variations. I'm guessing your approach would be superior. If only a plugin existed...

solecoder

2012-05-30 09:51:47
Good post. Thanks. Definitely going to use this. Perhaps this method should be called A/Z testing or A/n testing.

Jason Cohen

2012-05-30 10:10:15
Nice post; I like it!

It's statistically sound because it doesn't manipulate the percentages; rather, it just concentrates on one of the options, in a sense making its percentage "more accurate" by developing more N.

When N is small for all variants, natural fluctuations will cause various options to be "best," switching a lot, which in fact is just what you want it to do.

If you were concerned about being a little more statistically sound in the low-N period, you could just say "If the total number of trials is less than some threshold T, display the buttons round-robin." That way you can set T = 100 or something like n_choices * 50, and that gives all the options a "fair start."

Nice!
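
(A possible shape for the round-robin warm-up Jason describes, assuming the [trials, reward] bookkeeping from the post's pseudocode and an arbitrary threshold T:)

import random

def choose_with_warmup(counts, T=100):
    total_plays = sum(trials for trials, _ in counts.values())
    if total_plays < T:
        # warm-up: always play the least-shown option, giving every choice a fair start
        choice = min(counts, key=lambda c: counts[c][0])
    else:
        # after the warm-up, fall back to plain epsilon-greedy
        if random.random() < 0.1:
            choice = random.choice(list(counts))
        else:
            choice = max(counts, key=lambda c: counts[c][1] / counts[c][0])
    counts[choice][0] += 1
    return choice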

Michael Leo

2012-05-30 11:14:38
Steve,

This is an interesting approach. Thanks!

When doing this, you or others might also want to consider the statistical significance of the results. Running a chi-squared test of the results on a periodic basis will begin to tell you when/if the variances you are seeing are statistically significant. And it's not stats for stats' sake - there's real benefit there. It means that you can more quickly and confidently find your winner, stop the test and swap over to the option that performs the best.

I'm not a statistician (probably the opposite of that) but at my last job I took an Excel spreadsheet our analytics team was using to verify the validity of test results and brought it online. It took a little digging to find the equation being run by the "chitest()" function in excel but once I did, it turned into about 50 lines of code to prep the data and run the chi-squared test.

If I have time I'll try to generalize that code to work with N buckets (we were just testing A & B) and post it somewhere.
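
(For reference, a generic N-bucket significance check along these lines takes only a few lines with scipy; this is not Michael's spreadsheet port, just a sketch of the same idea:)

from scipy.stats import chi2_contingency

def is_significant(results, alpha=0.05):
    # results maps option name -> (clicks, impressions)
    table = [[clicks, impressions - clicks]
             for clicks, impressions in results.values()]
    chi2, p_value, dof, expected = chi2_contingency(table)
    return p_value < alpha

# With the click-through numbers from the post:
print(is_significant({"orange": (114, 4071), "green": (205, 6385), "white": (59, 2264)}))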

Danil

2012-05-30 11:27:39
Is the exploration branch "random lever" or "random non favorite" lever? Does it matter?

Also - 10% seems awfully high once we start converging on a solution. Can anybody demonstrate that (a) my intuition is wrong or (b) that there's a way to improve upon this?

Justin Megawarne

2012-05-30 12:06:21
Actually, a lot of medical studies will give one group the existing life-saving drug, and the other group will receive the new life-saving drug, because of ethical concerns. (Think about it: you can’t give sugar water to patients with a life-threatening disease.)

Kurt

2012-05-30 12:30:35
I think the point that "This won't handle changing several things at once that depend on each-other." is what kills the k-armed bandit for a lot of applications. Sometimes the distribution of your data just isn't simple or unimodal, and the best choice is ill-defined (it may be A for some users and B for others, C for some URLs and D for others, etc.). For these applications, you have to use more sophisticated ML techniques anyway.

nitin

2012-05-30 12:36:50
This is quite an interesting sampling strategy - although I would caution that one has to know the population to ensure that the 10% subset is well accounted for, so that the data is not skewed.

ben

2012-05-30 13:22:22
Isn't this how Omniture's Test and Target works? It constantly uses the best results for the most people, automatically.

Sean Tierney

2012-05-30 13:53:10
Ryan, as others have pointed out GWO offers the ability to intelligently serve the winning variant more frequently. The only downside of this is that it dramatically decelerates the pace with which you can test. If you're optimizing for immediate yield (maybe you hit Techcrunch and are just trying to capture max signups) this is a good strategy but if you're optimizing for speed of learning it's like drinking through a straw.

I thought Optimizely had this capability too but upon examining their interface I don't see that feature. Anyways nice post.

sean

Chris

2012-05-30 14:03:10
It might be interesting to write a second algorithm that determines the frequency to show sub-optimal options. Either early in the trial, or when two of the options are very close later in the trial. I suppose this could be done manually (eliminating obvious laggers) but sometimes bad options need to be marginalized.

Ben

2012-05-30 14:53:06
You could also work in some simulated annealing. At times when your website is busiest, you either a) want the most reliable choice to make the most money, or b) want the most random ones to collect the most data, quickly. Regardless, you can change the rate of random choices depending on how busy your site is.

Wyly

2012-05-30 21:28:49
This has a couple of assumptions that are dangerous in A/B testing, the primary one being that your list or your click-throughs are of similar entities. A/B test is largely a misnomer in that it is really matrix-based, with input type being the third axis. As an example, the A/B winner might be different if the input user is an attorney vs. an elementary teacher. It also might vary based on day part or country or gender.

Andrew

2012-05-30 21:33:31
I am interested where the value of 10% comes from. When a converged solution is found, this artificial value will give an unnecessary error on the result, while at the beginning it will constrain other options that are not quite first place (but close). I recommend replacing this 10% value with a scale of the standard deviation. In this situation results with no correlation will all have a random chance. However, once a winner begins to emerge this percentage will rapidly decrease and the winner will be selected.
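
(One standard algorithm in the spirit Andrew describes is UCB1, which replaces the fixed 10% with an uncertainty bonus that shrinks as an option accumulates data; sketch below, reusing the [trials, reward] layout from the post:)

import math

def choose_ucb1(counts):
    total_plays = sum(trials for trials, _ in counts.values())
    def score(c):
        trials, total_reward = counts[c]
        # average reward plus an exploration bonus for rarely-tried options
        return total_reward / trials + math.sqrt(2 * math.log(total_plays) / trials)
    choice = max(counts, key=score)
    counts[choice][0] += 1
    return choice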

Kent

2012-05-31 04:02:18
This is still a/b testing....it simply automates the manual work of a decent a/b test, which is great.

I just wouldn't categorize it any differently.

Mike

2012-05-31 04:50:16
Sorry, but this isn't at all clear

Gilles

2012-05-31 06:27:44
This is good, but there are some limitations (or additional complexity):

- sometimes (often) we want user affinity, i.e. we want the same user to have the same behaviour each time he sees the button.

- sometimes choices are made early / are static, e.g. I'm generating emails with 2 templates... Once sent, it is difficult to take back :)

If anyone has some advice to handle these cases...

For other situations (and there are a lot), this is a great idea.

Gilles

Lars Ebert

2012-05-31 07:44:34
This is a very interesting method. Thank you very much, I'll definitely try this algorithm soon :)

mtcoder

2012-05-31 08:45:14
Great approach, and with a few cookies you can provide a user with a "similar" design for a set time period. So if the user comes back on the same day they see the same site, but when they come back next week things change - this value is easily adjusted. Same thing goes for giving them the same "add to cart" button style for all pages on the site. The code can be built to be reusable. I use .NET for my sites, so I went as far as just adding onto the base button, label, and "banner" controls to include the random options and tracking callbacks. Since all the styling is in CSS you don't have to worry about presenting different pages, just a different CSS style for the control at render; this keeps your SEO working perfectly fine. I usually start a new site change option with about 40% randomness, and after, let's say, 500 results (aka visits) the randomness is pulled down to 20%, then 10%, with the favorite getting more face time. Just speeds up the curve process a bit. Otherwise you get a few clicks on a horrible option just because it's a regular visitor's only means of performing the function of the button, and a small randomness drags things on. Having a bit higher randomness that is then tapered downwards helps keep things moving; we don't have months to wait, after all.

I've been working on a few algorithms for cross-item comparison - where, say, "click me" is blue and the banner is X size, and I want to test the different combinations. So basically creating test sets over individual items, which proves more coherent with web design. I am really testing which style sheet / page design (still randomly provided and measured) seems to get the most time, clicks, navigation, etc., and pulling all of those factors into a scorecard. I know it seems like a bunch of work, but once you have the scripts it works for any page / site you build from then on, and having it automated saves so much time down the road - and talk about great stats to provide to your clients.

Alan8

2012-05-31 11:00:39
Great article! Good concept, clearly presented. And useful!

There's no "Like" button, but if there was, consider it pressed.

swampwiz

2012-05-31 13:42:03
Excellent analogy of the algorithm with patients in a drug test. Perhaps the AMA should read this posting!

Andreas Ek

2012-05-31 16:24:35
Thanks Steve!

Implemented the logic as a WordPress plugin with shortcodes.

You can find it at GitHub or a google search for "FlowSplit"

https://github.com/EkAndreas/flowsplit/wiki/1---Introduction

Sherko

2012-05-31 16:24:45
Good work!

Your idea is straightforward and easy to understand.

You could modify the code to report its results every 1000 (or any number of) visits. It would be interesting to see how people's reactions change with time.

Martin Hammerschmied

2012-05-31 17:42:10
In machine learning this concept is called reinforcement learning.

Frank

2012-05-31 18:25:41
Interesting post.

I built an Excel sheet that simulates this algorithm for 6 items of differing (theoretical) click-through rates. I also used a genetic optimization algorithm to test whether 10% is the right number for randomization.

Over 15k trials, I found something interesting. The average click-through rate declined (almost linearly) as the amount of randomization went up. Of course, I then suspected that we were losing click-throughs because we couldn't get to the "right" answer quickly enough.

But I measured the theoretical loss - the people who would have clicked on the best option had it been presented to them. (I measured this by pulling a random number for each trial and if it was small enough to clear the highest hurdle, but not small enough to clear the hurdle presented, it was an 'unnecessary loss.')

I found that unnecessary losses stayed relatively the same for all randomization trials. That was a surprise.

So, try for a low randomization number. That seems to indicate that the randomization is just there in case things change. For a static set of options and static customers, no randomization gets us to the "right" answer fastest with the best average click-through rates.

Anyways, happy to share my spreadsheet with anyone who wants to see it.

roger

2012-05-31 21:54:11
I like the idea and the reasoning is sound - now where/how does an average person with some technical skills implement this?

Thomas

2012-06-01 00:37:16
I have a question: is this how the akinator website works?! It has really astonishing results ...

Wing Tang Wong

2012-06-01 00:54:27
Very interesting. My takeaway from this is not that this replaces A/B testing, but rather that it covers an idea for making the cyclical process of A/B testing-evaluation-selection more automatic.

I.e., let's say you have some ideas for how to change your website, banner, etc., but you only want to have a small pool at a time: given a queue of ideas from your marketing/design team, the site will only make use of the ideas 3-4 at a time. As the process finds a winner over a period of time and stability is achieved (configurable), the "losers" in the pool are kicked out and replacements are injected into the batch from the queue.

In this way, elements can be auto-evaluated, judged, and swapped out. I'm sure some enterprising coder can also include a report attached to each option and put the losers in the fail bucket and add the winners to the high performer bucket.

And yeah, consistent display of the site for a given user. Though letting it mix things up periodically is a good thing as well... :)

Wing wingtangwong.com

David Dean

2012-06-01 02:28:02
This was really one of those very rare moments where you slap your forehead and think to yourself, "why the *hell* have I not been doing this all along?!" Dead simple, makes perfect sense.

I'm off to implement this all over the place. Thanks for the post... really.

Ross

2012-06-01 11:19:16
I think the point has been made already but you need consistency between page views, this code wouldn't perform that.

Your real point though is about sampling, and well, who cares if it's 50/50 or 90/10; you will get statistical significance quicker with closer ratios and fewer variations. 90/10 with 5 different results will take a long time for smaller sites to become significant.

Every tool I have used allows for more than A/B (2 variations) so that isn't an argument.

Yes, people understand an even distribution; why complicate it? People aren't doing enough of this as it is, let alone with the barrier to entry made higher still.

I work in a digital agency and people use the tools available to them. Nobody here is a programmer, so they use tools like Optimizely and Visual Website Optimizer - because they're easy!!! I can't stress this enough: simple for marketers to use & understand is key here. The super-clever guys at Google and Microsoft can do whatever they like; the normal 99% of people need simple & easy to understand. Storing data in redis or whatever is a million miles beyond their capabilities.

@OptimiseOrDie

2012-06-01 14:43:52
An interesting idea and one I've thought about for 2 years now. It is actually possible to build a tool which will do this and so if anyone would like to work with me on it, I'm into being part of a team for sure.

There are some algorithms that will work very well to solve a big problem with A/B testing. Change.

As seasons, cycles, traffic, external marketing, tv, display, customers, markets, businesses, strategies - Change - so do the results of that A/B test you declared 6 months ago.

Just because it did 12.5% better in 2 weeks in February does not mean it will self optimise with the CHANGE going on. So people delude themselves that they are still getting the 12.5% (in their mind) when in reality, they don't know.

Hmmmm.

To solve this though, I often test after going live with 5/10% splits to verify the control performance still tracks lower. This helps to convince people who think the split or multivariate test has suddenly driven conversion rates down.

Extend this idea further and use an evolutionary genetic algorithm to select, test and then repeat verify against runners up, the original control and random new items fed in. Something that can learn and adapt to patterns in an evolutionary way will be far better at automatic tuning of raw assets into optimal recipes. It will also keep performing long after your last a/b test finished, and will adapt better to change sources you have little control over.

I want to build this! I want to build it now! If anyone else feels the same and can help, I'd love to do it.

Craig.

Nitin

2012-06-01 15:03:25
I use this strategy for choosing what to feed the kids for dinner. They don't seem to be settling down on a favorite.

Eric

2012-06-01 15:57:24
Actually, I bet this could be adapted to do multiple variables at once by linking two things (such as font size and color) and changing them at different rates, i.e. doing the same thing for both font size and then color.

Then you'll have a graph of each of these as an independent probability (i.e. how often you get blue and how often you get large font). Then you could also do some linking between the lists to show what the other option was when it was clicked (i.e. blue and large) and then analyze that.

Or am I missing something?

Andreas Ek

2012-06-01 17:02:10
Now, Steves logics is downloadable as a public WordPress plugin at http://wordpress.org/extend/plugins/flowsplit/

someone

2012-06-02 16:46:11
I'm sorry, but this is just too obvious. And that QBASIC thing (that's the only thing I clicked on besides this) - we've ALL done it. I know I have.

Maciej Klepaczewski

2012-06-04 04:38:58
I wrote a small program that compares this strategy to a simple A/B test. My results:

=====

Initial parameters for A/B test:

{pA} - 0.005 (0.5%) success chance (click, sale etc.) of group A

{pB} - 0.01 (1%) success chance of group B

Initial parameters for new approach:

{p} - 0.2 (20%) of viewers are assigned to random group

{pA} - 0.005 (0.5%) success chance of group A

{pB} - 0.01 (1%) success chance of group B

Each test consisted of 10k impressions; 100k tests were performed. Results:

A/B:

Group B "won" in 99.844% of cases; total successes (groups A+B) over all tests: 7.5 million.

New approach:

Group B "won" in 98.827% of cases, but provided 9.5 million successes (almost 27% better than the A/B test!).

So, it seems that although A/B tests answer more quickly which group is "better", they also generate fewer sales/clicks/whatever.

I'm not a statistician, but I would love to see the math that gives a sound explanation of the above results.

Sean Colombo

2012-06-05 13:58:07
Great post :) The system sounds really interesting.

Some of the hurdles to adoption that I could see would be performance-related.

Firstly, relatively-fresh weightings should be available to the client. That seems like it requires either making the edge-caching of your pages fairly short, or making a server-side call on many requests.

Currently, split-testing (a/b/c/d/e, etc. - not sure why someone would limit to just a/b) on high-traffic sites can determine treatment-groups using hashes of experiment ids and unique identifiers for the user from the CDN. This is what we do at Wikia.

Since we cache most pages for 24 hours in our CDN, we can bake experiment configurations into the page (the weights in split-testing typically do not change over the course of a day).

Can you think of a similar way to get relatively-fresh weighting changes to the client? Perhaps using the 24h stale weights by default and then making async requests for fresher data while the page is idle? That seems like it should work. Thoughts?

---

Secondly, a treatment event needs to be logged for every active experiment every time that a user is treated (eg: to say either that they clicked or didn't click). With split-testing, you only need to send a treatment-event the first time a user is treated with a new experiment and they stick in this group until the experiment ends or the configuration switches them out of it. Nothing strikes me as a good solution to that problem. Seems like you'll have to just take that performance-hit.

Thanks again for the post! Would love to hear your ideas on performance for this method :)

Ed

2012-06-11 11:13:53
Very interesting approach. Do you have any comments regarding VWO's response post? Personally I'd like to see both algorithms implemented.

IanHutchinson

2012-06-18 16:13:12
We are split testing the first image at the start of videos to compare impressions to click throughs. This article tells me to always set the best tracking video at 90% and the others get an even distribution of the 10%.

I will have to give this some more thought. Feel free to take a look at what we do - Vidyard dot com.

Bharath M

2012-06-27 02:05:19
Mind blowing quote.

In recent years, hundreds of the brightest minds of modern civilization have been hard at work not curing cancer. Instead, they have been refining techniques for getting you and me to click on banner ads.

Anonymous Coward

2012-06-30 03:10:50
That's simply trading away 10% of button displays for getting a valid statement about which one works most of the times. 10% has to be statistically relevant for the process to work (i. e. 10% has to be a large number). It has to be even higher if you want to optimize up to the point where you show the orange button only to above average under 25 income females from Europe, and the green one to lower income between 25 and 50 males from South America. Large ad networks like the one operated by Google have such large numbers, and can also derive the demographic characteristics of the people doing the clicking. Smaller sites or medical trial teams don't have such big numbers to experiment on.

Rick

2012-07-02 10:32:48
I found a tool that does just this at http://www.growthgiant.com

LMF

2012-07-06 11:33:24
True, but statistical power to detect a difference is maximized at an equal split. So I believe it's better to run an equal split until a significant difference is determined, then send all traffic to the winner.

Horia Margarit

2012-07-11 12:12:24
Some people have pointed out the critical error with this approach. Suppose you have a sample of 20 page visits, and 19 of them landed on ugly Orange, while only 1 landed on pretty Green. Then your distribution over the expectations of clicks is [19/20, 01/20, 00/20]. But what if Green turns out to be the optimal choice for the population from which you are sampling?

In the best possible case, you will need the next 19 page views after the 20th to display Green, _and_ for Green to be clicked every time, for the next 19 visits in order to determine that Green is the optimal choice. This best possible case would yield a distribution over the expectations of clicks of [19/39, 20/39, 00/39]. How likely is it for this algorithm to display Green these last 19 page views?

For simplicity, let's assume that a visitor will click the Green button from now on if it is present, and further assume that a visitor will no longer click the Orange or White buttons if they are present. With this simplifying assumption, we have made the probability of displaying Green on the 28th trial independent of the probability of displaying Green on the 27th trial, and so on until the 39th trial. But we only explore 10% of the time and we have two exploratory options, so the probability of displaying Green any given time is 0.05. The probability of displaying Green the next 19 times is thus 0.05^19, which equals 19.0735 x 10^-26.

That means that you'd have to run through an additional 100 trillion x 1 trillion trials in order to actually display Green 19 times! Let's not forget about the best case assumptions: no user will ever click on White or Orange these 100 trillion x 1 trillion times, they will only click on the Green and they will do so each of the extremely rare 19 times you display it.

Those are impossible odds!

Yet the algorithm works pragmatically. It works because users are randomly in disagreement, but directionally in agreement. In other words, this algorithm works because its behaviour mirrors user behaviour. Your population of users will, with some probability alpha, agree on what is the best looking button or copy-edit or whatever. And with probability 1-alpha they will choose something else. But the space of something else is large, and each user will choose a different thing in that space from the others. So the critical error with this approach is actually its greatest strength.

It marginalizes random user disagreements to the point where they become totally insignificant.

Steve

2012-07-16 17:04:16
@Horia Margarit

I don't think your scenario would ever play out. The example has all 3 buttons being initialized with a 100% "click" rate.

Orange could not possibly be shown 19 times in a row without it being clicked on because the algorithm displays the best possible (i.e. highest click-ratio) button each visit.

If Orange is shown 19 times in a row, it's because of a statistical anomaly, AND also because 19 people in a row clicked on that Orange button.

As soon as 1 of those people don't click on Orange, the success rate for Orange drops below 100% and the script jumps to another colour, i.e. Green. One failure on Green and we're on to White. One failure on White and we're back on Orange.

What you're missing is that the selection of what button gets displayed isn't random, it's based on the success rate of what has already worked before it.

Arjan Haring

2012-07-26 06:52:10
Hallelujah!

So we actually optimize the explore versus exploit trade-off through probability matching.

And we are in beta ;-) check out PersuasionAPI.

GlynRob

2012-08-13 08:44:01
Excellent article.

I built an example using this method using Redis and Codeigniter to see it in action.

How to build can be seen at:

glynrob.com/database/redis-in-codeigniter/

Full code is available on GitHub if anyone wants to try it for themselves.

Ted Dunning

2012-10-27 17:26:28
Nice article, especially since you correctly define multi-armed bandit and correctly name epsilon-greedy.

It wouldn't hurt to describe the slightly more sophisticated algorithms based on probability matching (search for "bayesian bandit" to find my first blog entry, and check out the October posting where I give references to the original literature). The code really isn't any more complex than epsilon-greedy, and Bayesian Bandits dominate the performance of epsilon-greedy and require no knobs. Even better, they also handle contextual problems which are impossible to deal with using epsilon-greedy.
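
(For readers curious what Ted is describing, a Thompson-sampling "Bayesian bandit" selection can be sketched in a few lines, assuming binary rewards and the same [trials, reward] bookkeeping as the post:)

import random

def choose_bayesian(counts):
    # Sample a plausible click-through rate for each option from a Beta
    # posterior, then show the option whose sample is highest.
    def sample(c):
        trials, successes = counts[c]
        failures = max(trials - successes, 0)
        return random.betavariate(successes + 1, failures + 1)
    choice = max(counts, key=sample)
    counts[choice][0] += 1
    return choice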

Sylvain Galibert

2012-10-28 12:23:06
This was certainly an interesting post. That said, I don't think it would work the way you expect in the real world for the following reasons:

0. Some form of stability must be ensured for the visitors. You can show A, then B. But you can't show A, B, B, A, over and over again. (That's the easiest problem to solve on this list).

As you say, the problem is similar to the multi-armed bandits, BUT it is NOT the same problem. There are important differences that are not taken into account in your model:

1. The number of slot machines is static, whereas the number of options to try is constantly expanding and contracting as your design evolves

You are NOT planning on testing 3 options endlessly. You want to test a number of variations as time goes on. Each new option goes against the established options. So unless you reset all the counters every time you introduce a new button color or font, new options will be far more volatile than their established counterparts. This could result in new options dropping below the threshold of existing options long before a statistically significant sample is reached and take a very long time to resurface. I.e.:

To measure a large 50% improvement on a new button from an existing 2% conversion rate, you need about 2700 visitors at a 95% confidence level. But by the time you reach 100 visitors, your conversion rate could have fallen below 2%, and from that point on, how long before it receives enough visits to prove its worth.

2. On a slot machine, one lever pull is equivalent to another. It does not matter who pulls the lever, what time of the day the lever is pulled, what day of the week the lever is pulled, what period of the year it is, what website the machine was visiting before pulling the lever, etc.

For your website visitors/slot machines, all these factors matter and more. How many people buy translation during the week-end? Not too many. How many people buy toys 2 weeks before Christmas? This method does not account for these differences. A button tested on Saturday afternoon on a translation website will be massively penalized compared to a button tested on Tuesday afternoon. Similarly, if your website gets slashdotted, you may have a sudden spike in visitors who might be either totally uninterested about buying your product (they just want to read a cool thing you wrote) or completely determined in signing up for your cool new service. And then there are seasonal items. Your Halloween themed "buy" button might perform quite well during Halloween, but how long will it remain to the top after Halloween?

3. Fashion and design trends. On the web, the context changes.

By context, I mean overall standards and conventions in web design. That glossy button of yours that has accumulated outstanding conversion rating over time is just not "in" any more. Visitors are now used to more subtle interfaces like the one of Facebook. Unless you have a mechanism to decay the value of clicks over time you will end up with "winning" options that endure when they shouldn't.

The problem here is that for this system to work, it needs to run in a controlled, static environment for a significant period of time. And you really don't have that:

Imagine you are in a casino trying to run your algorithm on a wall of one armed bandits. 5 minutes after you start, a bunch of guys come in and start playing on the same machines as you. Then repairmen come and add twice as many machines. Half of those are a new model. Then they update the OS of 4 of the machines. On and on. Does it still feel like your algorithm would work in that environment? That's far closer to the environment in which most websites operate

Each time a change is introduced, you need a full reset, but this algorithm is useful only when it is allowed to run long enough to reach a statistically significant result, and that makes it poorly suited for website testing, at least for most websites. It might work for huge websites where the numbers of visitors are so large that statistical significance can be reached before the context changes, with some tweaks taking into account things like the time of day/week/season, running a counter reset on unexpected variations (if button A suddenly converts 3x as much as before when it had already been tested often enough to reach statistical significance, something is up)

AB testing, on the other hand, does not suffer as much from this volatile environment because it does not try to favor results before statistical significance is reached. On Sunday or at night, both A & B suffer/benefit equally, whereas with your system, a design which might have been ahead on Friday by a large margin may lose its lead over the week-end, get overtaken on Monday, finally recover towards the end of the day only to tank again overnight. Some potentially interesting options might take a long time to reach statistical significance.

Another advantage of AB testing, and possibly the most important issue here, is that it teaches us and helps us understand what is happening. After a few tries, you can work out some general rules like: "bigger buttons are better", "fewer options are necessary for non qualified visitors, but qualified visitors (from such and such websites) will fill out longer forms", "Pictures can control the attention of visitors", etc. which can then guide your UI evolution.

This can be done to a point with your system, but it's much less reliable: "A" has a conversion rate of 60% and "B" a conversion rate of 22%, but "B" has been tested 600k times in the low season and "A" has only been tested 900 times in the high season. Is "A" really better than "B"? You can't really compare A and B which prevents you from learning as much as you would in AB testing. You are stuck guessing and are unable to extract and verify the rules that work for your website.

Also, AB testing is more hands on which forces you to think with the data: Your Halloween button will never live through Christmas because of the fantastic sales from October/November with AB testing (or your usual "best" button will be shown 90% of the time through Christmas because the new Christmas buttons didn't get a good enough conversion early in December and by the time they recovered from that, the Christmas buying season is over)

There is this old quote - I don't remember where it's from or how it goes exactly, but I think it is pretty applicable here: "When trying to write software that learns by itself, you find out that it doesn't... but you do."

I wasn't quite planning on posting a thousand words in the comments. If you feel like responding to it, I would be happy to hear about it. You can contact me at dev@preptags.com.

pere rovira

2013-01-23 17:48:43
Great article and great discussion. Google Analytics Content Experiments, which replaced the discontinued Google Website Optimizer, uses the statistical model you describe. The main benefits are testing faster and reducing risk. Still, the main shortcoming of any kind of testing strategy is that companies do not test often enough. The key to success is to always be testing, and that's more to do with agile companies than the method used.

Steve

2013-02-19 21:59:09
" All other things being equal, this method will show you which option gives you the best CTR, which is all you really care about anyway"

Totally incorrect. You've got into the wrong way of thinking, like most - ROI and conversion rate are just as important as CTR.

What if you have Ad A, B and C... A has the best CTR yet B has a much lower CTR but a better ROI and Conversion stats?

The A/B test will come out with the right results if you give them enough impressions.

Aditya Ramachandran

2013-04-18 11:48:43
This really speaks to me. Check out Subintent (www.subintent.com). We're using machine learning to optimize conversion funnels, kind of like what you're missing today...

Matt

2013-10-10 04:18:15
Google Content Experiments has been using this approach for quite some time!