20 lines of code that will beat A/B testing every time
Posted on: 2012-05-28 21:10:15

A/B testing is used far too often, for something that performs so badly. It is defective by design: segment users into two groups. Show the A group the old, tried and true stuff. Show the B group the new whiz-bang design with the bigger buttons and slightly different copy. After a while, take a look at the stats and figure out which group presses the button more often. Sounds good, right? The problem is staring you in the face. It is the same dilemma faced by researchers administering drug studies. During drug trials, you can only give half the patients the life-saving treatment. The others get sugar water. If the treatment works, the group that got the sugar water lost out. This sacrifice is made to get good data. But it doesn't have to be this way.

In recent years, hundreds of the brightest minds of modern civilization have been hard at work not curing cancer. Instead, they have been refining techniques for getting you and me to click on banner ads. It has been working. Both Google and Microsoft are focusing on using more information about visitors to predict what to show them. Strangely, anything better than A/B testing is absent from mainstream tools, including Google Analytics and Google Website Optimizer. I hope to change that by raising awareness about better techniques.

With a simple 20-line change to how A/B testing works, one that you can implement today, you can always do better than A/B testing -- sometimes two or three times better. This method has several good points:

  • It can reasonably handle more than two options at once, e.g. A, B, C, D, E, F, G, …
  • New options can be added or removed at any time.
But the most enticing part is that you can set it and forget it. If your time is really worth $1000/hour, you don't have time to go back and check how every change you made is doing and pick the winning options. You don't have time to write rambling blog entries about how you got your site redesigned and changed this and that and it worked or it didn't work. Let the algorithm do its job. These 20 lines of code automatically find the best choice quickly, and then use it until it stops being the best choice.

The Multi-armed bandit problem


Picture from Microsoft Research

The multi-armed bandit problem takes its terminology from a casino. You are faced with a wall of slot machines, each with its own lever. You suspect that some slot machines pay out more frequently than others. How can you learn which machine is the best, and get the most coins in the fewest trials?

As with many techniques in machine learning, the simplest strategy is hard to beat. More complicated techniques are worth considering, but they may eke out only a few hundredths of a percentage point of performance. One strategy that has been shown to perform well time after time in practical problems is the epsilon-greedy method. For each lever, we keep track of the number of times it has been pulled and the total reward it has returned. 10% of the time, we choose a lever at random. The other 90% of the time, we choose the lever that has the highest expectation of reward.

def choose():
    if random.random() < 0.1:
        # exploration!
        # choose a random lever 10% of the time.
    else:
        # exploitation!
        # for each lever,
            # calculate the expectation of reward.
            # This is the total reward given by that lever
            # divided by the number of trials of the lever.
        # choose the lever with the greatest expectation of reward.
    # increment the number of times the chosen lever has been played.
    # store test data in redis, choice in session key, etc..

def reward(choice, amount):
    # add the reward to the total for the given lever.
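
For concreteness, here is one way the skeleton above might be filled in as a self-contained, in-memory sketch. The counts dictionary, the hard-coded choices, and the default reward of 1 per click are illustrative assumptions; in practice the data would live in Redis and the choice in a session key, as the comments above suggest.

import random

# In-memory test data: one [trials, total reward] pair per lever.
counts = {"orange": [1, 1], "green": [1, 1], "white": [1, 1]}

EPSILON = 0.1   # fraction of visits spent exploring

def choose():
    if random.random() < EPSILON:
        # exploration: pick a lever at random
        choice = random.choice(list(counts))
    else:
        # exploitation: pick the lever with the highest expected reward,
        # i.e. total reward divided by number of trials
        choice = max(counts, key=lambda c: counts[c][1] / counts[c][0])
    counts[choice][0] += 1   # one more trial for the chosen lever
    return choice

def reward(choice, amount=1):
    # add the reward to the total for the given lever
    counts[choice][1] += amount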

Why does this work?

Let's say we are choosing a colour for the "Buy now!" button. The choices are orange, green, or white. We initialize all three choices to 1 win out of 1 try. It doesn't really matter what we initialize them to, because the algorithm will adapt. So when we start out, the internal test data looks like this:

Orange        Green         White
1/1 = 100%    1/1 = 100%    1/1 = 100%

Then a web site visitor comes along and we have to show them a button. We choose the first one with the highest expectation of winning. The algorithm thinks they all work 100% of the time, so it chooses the first one: orange. But, alas, the visitor doesn't click on the button.

Orange        Green         White
1/2 = 50%     1/1 = 100%    1/1 = 100%

Another visitor comes along. We definitely won't show them orange, since we think it only has a 50% chance of working. So we choose Green. They don't click. The same thing happens for several more visitors, and we end up cycling through the choices. In the process, we refine our estimate of the click through rate for each option downwards.

Orange        Green         White
1/4 = 25%     1/4 = 25%     1/4 = 25%

But suddenly, someone clicks on the orange button! Quickly, the browser makes an Ajax call to our reward function, $.ajax({url: "/reward?testname=buy-button"}), and our code updates the results:

Orange        Green         White
2/5 = 40%     1/4 = 25%     1/4 = 25%

When our intrepid web developer sees this, he scratches his head. What the F*? The orange button is the worst choice. Its font is tiny! The green button is obviously the better one. All is lost! The greedy algorithm will always choose orange forever now!

But wait, let's see what happens if Orange is really the suboptimal choice. Since the algorithm now believes it is the best, it will always be shown. That is, until it stops working well. Then the other choices start to look better.

Orange        Green         White
2/9 = 22%     1/4 = 25%     1/4 = 25%

After many more visits, the best choice, if there is one, will have been found, and will be shown 90% of the time. Here are some results based on an actual web site that I have been working on. We also have an estimate of the click through rate for each choice.

Orange             Green              White
114/4071 = 2.8%    205/6385 = 3.2%    59/2264 = 2.6%
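
If you want to watch this convergence happen, a few lines of simulation are enough. The true click-through rates below are made-up numbers that roughly mirror the table above; with rates this close together, expect the winner to need many visits to pull clearly ahead.

import random

true_ctr = {"orange": 0.028, "green": 0.032, "white": 0.026}   # assumed rates
counts = {c: [1, 1] for c in true_ctr}    # [trials, reward], initialized to 1/1
shown = {c: 0 for c in true_ctr}

def choose():
    if random.random() < 0.1:
        choice = random.choice(list(counts))                               # explore
    else:
        choice = max(counts, key=lambda c: counts[c][1] / counts[c][0])    # exploit
    counts[choice][0] += 1
    return choice

for _ in range(100000):
    c = choose()
    shown[c] += 1
    if random.random() < true_ctr[c]:
        counts[c][1] += 1    # the simulated visitor clicked

print(shown)                                                        # impressions per option
print({c: round(counts[c][1] / counts[c][0], 4) for c in counts})   # estimated CTRs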

Edit: What about the randomization?

I have not discussed the randomization part. The randomization of 10% of trials forces the algorithm to explore the options. It is a trade-off between trying new things in hopes of something better, and sticking with what it knows will work. There are several variations of the epsilon-greedy strategy. In the epsilon-first strategy, you can explore 100% of the time in the beginning and once you have a good sample, switch to pure-greedy. Alternatively, you can have it decrease the amount of exploration as time passes. The epsilon-greedy strategy that I have described is a good balance between simplicity and performance. Learning about the other algorithms, such as UCB, Boltzmann Exploration, and methods that take context into account, is fascinating, but optional if you just want something that works.
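
As a sketch of the decreasing-exploration idea, the exploration rate can be tied to how much data has accumulated. The decay schedule below is an arbitrary choice, shown only to illustrate the shape of the approach; counts uses the same [trials, reward] layout as above.

import random

def choose_with_decay(counts, min_epsilon=0.01):
    # Explore heavily at first, then less and less as data accumulates.
    total_plays = sum(trials for trials, _ in counts.values())
    epsilon = max(min_epsilon, 1.0 / (1.0 + total_plays / 100.0))
    if random.random() < epsilon:
        choice = random.choice(list(counts))
    else:
        choice = max(counts, key=lambda c: counts[c][1] / counts[c][0])
    counts[choice][0] += 1
    return choice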

Wait a minute, why isn't everybody doing this?

Statistics are hard for most people to understand. People distrust things that they do not understand, and they especially distrust machine learning algorithms, even if they are simple. Mainstream tools don't support this, because then you'd have to educate people about it, and about statistics, and that is hard. Some common objections might be:
  • Showing the different options at different rates will skew the results. (No it won't. You always have an estimate of the click through rate for each choice)
  • This won't adapt to change. (Your visitors probably don't change. But if you really want to, in the reward function, multiply the old reward value by a forgetting factor; see the sketch after this list.)
  • This won't handle changing several things at once that depend on each-other. (Agreed. Neither will A/B testing.)
  • I won't know what the click is worth for 30 days so how can I reward it?
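
A minimal sketch of the forgetting-factor idea mentioned above; the 0.99 decay rate is an arbitrary assumption, and decaying both counters keeps their ratio an estimate of the recent click-through rate.

def reward_with_forgetting(counts, choice, amount=1, decay=0.99):
    # Fade out old evidence so the estimate tracks recent visitor behaviour.
    trials, total_reward = counts[choice]
    counts[choice] = [trials * decay, total_reward * decay + amount]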

Comments

Kevin

2012-05-29 18:28:52
Decision optimization is the next step past A/B testing.

There's a startup called Conductrics that's doing this as a service. Really cool stuff, especially since I have no formal background in statistical modeling.

Lawrence

2012-05-29 18:33:48
I'd also love to see a startup crank this into a functional piece of code, as Leviathan mentioned. It may be 20 lines, but it's beyond my capability. Nonetheless, I've bookmarked this for a future project. Sounds like a much better way to A/B test.

peanutsmonkey

2012-05-29 18:39:03
Hi,

I'd be really keen to understand this further. Would it be possible to drop you an email? If you drop me an email (my name, followed by an "at" symbol, ending with gmail period com), I'll respond.

Ryan

2012-05-29 19:10:08
Made a typo in the last example: B is favorite and has 19,000 views, A is not the favorite and has 1,000. If B was the favorite all along, then all is good. But if A is the favorite only after reaching 1,000 views, then it takes 20,000 views to make that determination.

MicroAngelo

2012-05-29 19:31:49
Ryan, I think you've misunderstood how this method works: the percentage figure is the expectation (probability of success) a.k.a. click-through-rate, not the percentage of time that entry is shown.

In your example where A is initially unpopular, B will then be shown, but *only for as long as it's successful*. If it's not popular either then its expectation will very rapidly decrease until A starts getting shown again.

All other things being equal, this method will show you which option gives you the best CTR, which is all you really care about anyway.

Tim

2012-05-29 19:32:45
Beating a random strategy (or one that relies on randomness for a portion of its execution) is always a tough proposition. This is a really nice approach, both pragmatic and practical - love it.

Eduardo

2012-05-29 19:48:21
I don't see the point here. The only difference is that you test only a small part of your audience. You could do that with GWO as well. Notice that GWO also drops options that don't perform very well if you tell it to, so you don't lose too many conversions based on the test.

An important feature of A/B tools is that a specific user always sees the same option, so they avoid the "It was not like this yesterday" effect.

You should check Webtrends Optimize, that one is a really innovative tool. It not only selects the best option to your average audience but also selects the best option for different types of users, based on where they come from for example.

Jaap

2012-05-29 20:21:41
Interesting theory, thanks!

Ryan

2012-05-29 23:53:14
MicroAngelo, I understand the point you're making, in that the test measures the CTR percentage and not the number of times it's shown. The point I was trying to make is that you will see the current favorite more than you will see the non-favorites, because the algorithm displays the favorite 90% of the time and a random choice of the non-favorites 10% of the time. As one of the non-favorites becomes more popular, it will take the place of the favorite. That's all well and good, but that assumes that you have a reasonably good distribution of data coming in for each page. That's why the CTR percentage only becomes valid after x number of views. After several thousand page views, you can be reasonably certain that your CTR is stable. My point is that if the data is not very well distributed, it could take a much longer time to reach the number of page views needed to make the determination.

This is kinda like quick sort. In most cases, with a random distribution, it will run in O(n log(n)), but in the worst case, mathematically, it's O(n^2). I think the algorithm described above could really make great improvements over the current standard A/B testing, but anyone that uses it needs to know the pros and cons so that it can be tweaked properly. In some cases, maybe the 90-10 split could be more optimal at 20-80 or 30-70. It really depends on what kind of data you have and finding a "sweet spot" for it. With careful analysis of the specific application, it could prove to be very powerful... but you do have to know what's going on and make accurate assumptions about the data. The situation that I thought of where this would not be optimal is if you have a lot of data initially for one of the tests that doesn't match the eventual CTR after x number of views.

Liviu Taloi

2012-05-30 02:30:39
Hi, nice approach. But there are some things to take into consideration:

- assuming that you are working with an ecommerce site, you should be consistent: as a user passes through more than 1 product page, he must always see the same colour on that button; it will be confusing for that specific user to see all the rainbow colours on the "add to cart" button.

- there should also be a normal/control group, an unbiased group that will receive the old version of the button; you want to see the increase in CTR/conversions/whatever with respect to the control group; but.. I think that somehow your approach is shorter than the A/B variant, where at the end you should do a follow-up test running only the winner in order to validate the result.

So.. it seems a very nice approach, but there is some software (SaaS) already doing this, not with 20 lines of code of course (and with a lot of money). There are MVT or A/B SaaS tools that, instead of leaving the owner to choose from the possible winners, choose the winner automatically during the test.

One little thing I am afraid of: if you want to test different "creatives"/banners/images/html-banners on different sections of a product page, for example, you would have to write these 20 lines of code for each zone that has multiple variants to choose from. So it could get quite messy inside the script source that generates that specific web page.

The approach using SaaS with section-divs where you upload creatives/banners/images is easier than writing 20 lines of code for each zone that we want to test, especially if you are not a programmer, or you cannot hire one for this task. The reports are also nice.. but this programming approach you are proposing is quite cost effective, I suppose.

Anyway, it's worth exploring this solution. It could be more cost effective for many little A/B tests.

Nice article! Thx.

PS. Sorry for my English as I'm not a native English speaking guy.

Hardi

2012-05-30 03:15:13
If I understood correctly, then mynaweb.com is offering a service that aims to simplify implementing this exact process using JS.

jsquash

2012-05-30 03:22:48
Great idea, but here is the one benefit of A/B testing: when testing a page your rankings will not be affected; Google says this. What about with this approach? You're sending traffic (and thus robots, etc) to these multiple pages and this could harm your overall SEO strategy.

Of course if you're not worried about SEO (such as if you have a big brand) then great, but while I do not enjoy the very simple A/B testing, I will still use it as my rankings will not be affected.

Just a thought :-)

Anton IMNorth

2012-05-30 05:03:11
Great article! I've honestly never seen A/B testing from this angle before and it really makes sense to try this out. I like the ajax suggestion and I'll probably look for a script like that myself.

ray

2012-05-30 06:24:33
Holy crap!

Richard Heylen

2012-05-30 07:35:14
Nice, but one of your numbers is wrong. 2/9=22% not 11%

Trevor

2012-05-30 07:39:56
2/9 = 22%, not 11%. But very interesting...

bob

2012-05-30 08:08:10
I'm using the Max A/B plugin for wordpress. It allows 4 variations. I'm guessing your approach would be superior. If only a plugin existed...

solecoder

2012-05-30 09:51:47
Good post. Thanks. Definitely going to use this. Perhaps this method should be called A/Z testing or A/n testing.

Jason Cohen

2012-05-30 10:10:15
Nice post; I like it!

It's statistically sound because it doesn't manipulate the percentages; rather, it just concentrates on one of the options, in a sense making its percentage "more accurate" by developing more N.

When N is small for all variants, natural fluctuations will cause various options to be "best," switching a lot, which in fact is just what you want it to do.

If you were concerned about being a little more statistically sound in the low-N period, you could just say "If the total number of trials is less than some threshold T, display the buttons round-robin." That way you can set T = 100 or something like n_choices * 50, and that gives all the options a "fair start."

Nice!
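
(A possible shape for the round-robin warm-up Jason describes, assuming the [trials, reward] bookkeeping from the post's pseudocode and an arbitrary threshold T:)

import random

def choose_with_warmup(counts, T=100):
    total_plays = sum(trials for trials, _ in counts.values())
    if total_plays < T:
        # warm-up: always play the least-shown option, giving every choice a fair start
        choice = min(counts, key=lambda c: counts[c][0])
    else:
        # after the warm-up, fall back to plain epsilon-greedy
        if random.random() < 0.1:
            choice = random.choice(list(counts))
        else:
            choice = max(counts, key=lambda c: counts[c][1] / counts[c][0])
    counts[choice][0] += 1
    return choice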

Michael Leo

2012-05-30 11:14:38
Steve,

This is an interesting approach. Thanks!

When doing this, you or others might also want to consider the statistical significance of the results. Running a chi-squared test of the results on a periodic basis will begin to tell you when/if the variances you are seeing are statistically significant. And it's not stats for stats' sake - there's real benefit there. It means that you can more quickly and confidently find your winner, stop the test and swap over to the option that performs the best.

I'm not a statistician (probably the opposite of that) but at my last job I took an Excel spreadsheet our analytics team was using to verify the validity of test results and brought it online. It took a little digging to find the equation being run by the "chitest()" function in excel but once I did, it turned into about 50 lines of code to prep the data and run the chi-squared test.

If I have time I'll try to generalize that code to work with N buckets (we were just testing A & B) and post it somewhere.
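
(For reference, a generic N-bucket significance check along these lines takes only a few lines with scipy; this is not Michael's spreadsheet port, just a sketch of the same idea:)

from scipy.stats import chi2_contingency

def is_significant(results, alpha=0.05):
    # results maps option name -> (clicks, impressions)
    table = [[clicks, impressions - clicks]
             for clicks, impressions in results.values()]
    chi2, p_value, dof, expected = chi2_contingency(table)
    return p_value < alpha

# With the click-through numbers from the post:
print(is_significant({"orange": (114, 4071), "green": (205, 6385), "white": (59, 2264)}))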

Danil

2012-05-30 11:27:39
Is the exploration branch "random lever" or "random non favorite" lever? Does it matter?

Also - 10% seems awfully high once we start converging on a solution. Can anybody demonstrate that (a) my intuition is wrong or (b) that there's a way to improve upon this?

Justin Megawarne

2012-05-30 12:06:21
Actually, a lot of medical studies will give one group the existing life-saving drug, and the other group will receive the new life-saving drug, because of ethical concerns. (Think about it: you can’t give sugar water to patients with a life-threatening disease.)

Kurt

2012-05-30 12:30:35
I think the point that "This won't handle changing several things at once that depend on each-other." is what kills the k-armed bandit for a lot of applications. Sometimes the distribution of your data just isn't simple or unimodal, and the best choice is ill-defined (it may be A for some users and B for others, C for some URLs and D for others, etc.). For these applications, you have to use more sophisticated ML techniques anyway.

nitin

2012-05-30 12:36:50
This is quite an interesting sampling strategy - although I would caution that one has to know the population to ensure that the 10% subset is well accounted for, so that the data is not skewed.

ben

2012-05-30 13:22:22
Isn't this how Omniture's Test and Target works? It constantly uses the best results for the most people, automatically.

Sean Tierney

2012-05-30 13:53:10
Ryan, as others have pointed out GWO offers the ability to intelligently serve the winning variant more frequently. The only downside of this is that it dramatically decelerates the pace with which you can test. If you're optimizing for immediate yield (maybe you hit Techcrunch and are just trying to capture max signups) this is a good strategy but if you're optimizing for speed of learning it's like drinking through a straw.

I thought Optimizely had this capability too but upon examining their interface I don't see that feature. Anyways nice post.

sean

Chris

2012-05-30 14:03:10
It might be interesting to write a second algorithm that determines the frequency to show sub-optimal options. Either early in the trial, or when two of the options are very close later in the trial. I suppose this could be done manually (eliminating obvious laggers) but sometimes bad options need to be marginalized.

Ben

2012-05-30 14:53:06
You could also work in some simulated annealing. At times when your website is busiest, you either a) want the most reliable choice to make the most money, or b) want the most random ones to collect the most data, quickly. Regardless, you can change the rate of random choices depending on how busy your site is.

Wyly

2012-05-30 21:28:49
This has a couple of assumptions that are dangerous in A/B testing, the primary one being that your list or your click-throughs are of similar entities. A/B test is largely a misnomer in that it is really matrix-based, with input type being the third axis. As an example, the A/B winner might be different if the input user is an attorney vs. an elementary teacher. It also might vary based on day part or country or gender.

Andrew

2012-05-30 21:33:31
I am interested where the value of 10% comes from. When a converged solution is found, this artificial value will give an unnecessary error on the result, while at the beginning it will constrain other options that are not quite first place (but close). I recommend replacing this 10% value with a scale of the standard deviation. In this situation results with no correlation will all have a random chance. However, once a winner begins to emerge this percentage will rapidly decrease and the winner will be selected.
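
(One standard algorithm in the spirit Andrew describes is UCB1, which replaces the fixed 10% with an uncertainty bonus that shrinks as an option accumulates data; sketch below, reusing the [trials, reward] layout from the post:)

import math

def choose_ucb1(counts):
    total_plays = sum(trials for trials, _ in counts.values())
    def score(c):
        trials, total_reward = counts[c]
        # average reward plus an exploration bonus for rarely-tried options
        return total_reward / trials + math.sqrt(2 * math.log(total_plays) / trials)
    choice = max(counts, key=score)
    counts[choice][0] += 1
    return choice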

Kent

2012-05-31 04:02:18
This is still a/b testing....it simply automates the manual work of a decent a/b test, which is great.

I just wouldn't categorize it any differently.

Mike

2012-05-31 04:50:16
Sorry, but this isn't at all clear

Gilles

2012-05-31 06:27:44
This is good, but there are some limitations (or additional complexity):

- sometimes (often) we want user affinity, i.e. we want the same user to have the same behaviour each time he sees the button.

- sometimes choices are made early / are static, e.g. I'm generating emails with 2 templates... Once sent, it is difficult to take back :)

If anyone has some advice to handle these cases...

For other situations (and there are a lot), this is a great idea.

Gilles

Lars Ebert

2012-05-31 07:44:34
This is a very interesting method. Thank you very much, I'll definitely try this algorithm soon :)

mtcoder

2012-05-31 08:45:14
Great approach, and with a few cookies you can provide a user with a "similar" design for a set time period. So if the user comes back on the same day they see the same site, but when they come back next week things change - this value is easily adjusted. Same thing goes for giving them the same "add to cart" button style for all pages on the site. The code can be built to be reusable. I use .NET for my sites, so I went as far as just adding onto the base button, label, and "banner" controls to include the random options and tracking callbacks. Since all the styling is in CSS you don't have to worry about presenting different pages, just a different CSS style for the control at render; this keeps your SEO working perfectly fine. I usually start a new site change option with about 40% randomness, and after, let's say, 500 results (aka visits) the randomness is pulled down to 20%, then 10%, with the favorite getting more face time. Just speeds up the curve process a bit. Otherwise you get a few clicks on a horrible option just because it's a regular visitor's only means of performing the function of the button, and a small randomness drags things on. Having a bit higher randomness that is then tapered downwards helps keep things moving; we don't have months to wait, after all.

I've been working on a few algorithms for cross-item comparison - where, say, "click me" is blue and the banner is X size, and I want to test the different combinations. So basically creating test sets over individual items, which proves more coherent with web design. I am really testing which style sheet / page design (still randomly provided and measured) seems to get the most time, clicks, navigation, etc., and pulling all of those factors into a scorecard. I know it seems like a bunch of work, but once you have the scripts it works for any page / site you build from then on, and having it automated saves so much time down the road - and talk about great stats to provide to your clients.

Alan8

2012-05-31 11:00:39
Great article! Good concept, clearly presented. And useful!

There's no "Like" button, but if there was, consider it pressed.

swampwiz

2012-05-31 13:42:03
Excellent analogy of the algorithm with patients in a drug test. Perhaps the AMA should read this posting!

Andreas Ek

2012-05-31 16:24:35
Thanks Steve!

Implemented the logic as a WordPress plugin with shortcodes.

You can find it at GitHub or a google search for "FlowSplit"

https://github.com/EkAndreas/flowsplit/wiki/1---Introduction

Sherko

2012-05-31 16:24:45
Good work!

Your idea is straightforward and easy to understand.

You could modify the code to report its results every 1000 (or any number of) visits. It would be interesting to see how people's reactions change with time.

Martin Hammerschmied

2012-05-31 17:42:10
In machine learning this concept is called reinforcement learning.

Frank

2012-05-31 18:25:41
Interesting post.

I built an Excel sheet that simulates this algorithm for 6 items of differing (theoretical) click-through rates. I also used a genetic optimization algorithm to test whether 10% is the right number for randomization.

Over 15k trials, I found something interesting. The average click-through rate declined (almost linearly) as the amount of randomization went up. Of course, I then suspected that we were losing click-throughs because we couldn't get to the "right" answer quickly enough.

But I measured the theoretical loss - the people who would have clicked on the best option had it been presented to them. (I measured this by pulling a random number for each trial and if it was small enough to clear the highest hurdle, but not small enough to clear the hurdle presented, it was an 'unnecessary loss.')

I found that unnecessary losses stayed relatively the same for all randomization trials. That was a surprise.

So, try for a low randomization number. That seems to indicate that the randomization is just there in case things change. For a static set of options and static customers, no randomization gets us to the "right" answer fastest with the best average click-through rates.

Anyways, happy to share my spreadsheet with anyone who wants to see it.

roger

2012-05-31 21:54:11
I like the idea and the reasoning is sound - now where/how does an average person with some technical skills implement this?

Thomas

2012-06-01 00:37:16
I have a question: is this how the akinator website works?! It has really astonishing results ...

Wing Tang Wong

2012-06-01 00:54:27
Very interesting. My takeaway from this is not that this replaces A/B testing, but rather that it covers an idea for making the cyclical process of A/B testing-evaluation-selection more automatic.

I.e., let's say you have some ideas for how to change your website, banner, etc., but you only want to have a small pool at a time: given a queue of ideas from your marketing/design team, the site will only make use of the ideas 3-4 at a time. As the process finds a winner over a period of time and stability is achieved (configurable), the "losers" in the pool are kicked out and replacements are injected into the batch from the queue.

In this way, elements can be auto-evaluated, judged, and swapped out. I'm sure some enterprising coder can also include a report attached to each option and put the losers in the fail bucket and add the winners to the high performer bucket.

And yeah, consistent display of the site for a given user. Though letting it mix things up periodically is a good thing as well... :)

Wing wingtangwong.com

David Dean

2012-06-01 02:28:02
This was really one of those very rare moments where you slap your forehead and think to yourself, "why the *hell* have I not been doing this all along?!" Dead simple, makes perfect sense.

I'm off to implement this all over the place. Thanks for the post... really.

Ross

2012-06-01 11:19:16
I think the point has been made already but you need consistency between page views, this code wouldn't perform that.

Your real point though is about sampling, and well, who cares if it's 50/50 or 90/10; you will get statistical significance quicker with closer ratios and fewer variations. 90/10 with 5 different results will take a long time for smaller sites to become significant.

Every tool I have used allows for more than A/B (2 variations) so that isn't an argument.

Yes, people understand an even distribution; why complicate it? People aren't doing enough of this as it is, let alone with the barrier to entry made higher still.

I work in a digital agency and people use the tools available to them. Nobody here is a programmer, so they use tools like Optimizely and Visual Website Optimizer - because they're easy!!! I can't stress this enough: simple for marketers to use & understand is key here. The super-clever guys at Google and Microsoft can do whatever they like; the normal 99% of people need simple & easy to understand. Storing data in redis or whatever is a million miles beyond their capabilities.

@OptimiseOrDie

2012-06-01 14:43:52
An interesting idea and one I've thought about for 2 years now. It is actually possible to build a tool which will do this and so if anyone would like to work with me on it, I'm into being part of a team for sure.

There are some algorithms that will work very well to solve a big problem with A/B testing. Change.

As seasons, cycles, traffic, external marketing, tv, display, customers, markets, businesses, strategies - Change - so do the results of that A/B test you declared 6 months ago.

Just because it did 12.5% better in 2 weeks in February does not mean it will self optimise with the CHANGE going on. So people delude themselves that they are still getting the 12.5% (in their mind) when in reality, they don't know.

Hmmmm.

To solve this though, I often test after going live with 5/10% splits to verify the control performance still tracks lower. This helps to convince people who think the split or multivariate test has suddenly driven conversion rates down.

Extend this idea further and use an evolutionary genetic algorithm to select, test and then repeat verify against runners up, the original control and random new items fed in. Something that can learn and adapt to patterns in an evolutionary way will be far better at automatic tuning of raw assets into optimal recipes. It will also keep performing long after your last a/b test finished, and will adapt better to change sources you have little control over.

I want to build this! I want to build it now! If anyone else feels the same and can help, I'd love to do it.

Craig.

Nitin

2012-06-01 15:03:25
I use this strategy for choosing what to feed the kids for dinner. They don't seem to be settling down on a favorite.

Eric

2012-06-01 15:57:24
Actually, I bet this could be adapted to do multiple variables at once by linking two things (such as font size and color) and changing them at different rates, i.e. doing the same thing for both font size and then color.

Then you'll have a graph of each of these as an independent probability (i.e. how often you get blue and how often you get large font). Then you could also do some linking between the lists to show what the other option was when it was clicked (i.e. blue and large) and then analyze that.

Or am I missing something?

Andreas Ek

2012-06-01 17:02:10
Now, Steves logics is downloadable as a public WordPress plugin at http://wordpress.org/extend/plugins/flowsplit/

someone

2012-06-02 16:46:11
I'm sorry, but this is just too obvious. And that QBASIC thing (that's the only thing I clicked on besides this) - we've ALL done it. I know I have.

Maciej Klepaczewski

2012-06-04 04:38:58
I wrote a small program that compares this strategy to a simple A/B test. My results:

=====

Initial parameters for A/B test:

{pA} - 0.005 (0.5%) success chance (click, sale etc.) of group A

{pB} - 0.01 (1%) success chance of group B

Initial parameters for new approach:

{p} - 0.2 (20%) of viewers are assigned to random group

{pA} - 0.005 (0.5%) success chance of group A

{pB} - 0.01 (1%) success chance of group B

Each test consisted of 10k impressions; 100k tests were performed. Results:

A/B:

Group B "won" in 99.844% of cases; total successes (groups A+B) over all tests: 7.5 million.

New approach:

Group B "won" in 98.827% of cases, but provided 9.5 million successes (almost 27% better than the A/B test!).

So, it seems that although A/B tests answer more quickly which group is "better", they also generate fewer sales/clicks/whatever.

I'm not a statistician, but I would love to see the math that gives a sound explanation of the above results.

Sean Colombo

2012-06-05 13:58:07
Great post :) The system sounds really interesting.

Some of the hurdles to adoption that I could see would be performance-related.

Firstly, relatively-fresh weightings should be available to the client. That seems like it requires either making the edge-caching of your pages fairly short, or making a server-side call on many requests.

Currently, split-testing (a/b/c/d/e, etc. - not sure why someone would limit to just a/b) on high-traffic sites can determine treatment-groups using hashes of experiment ids and unique identifiers for the user from the CDN. This is what we do at Wikia.

Since we cache most pages for 24 hours in our CDN, we can bake experiment configurations into the page (the weights in split-testing typically do not change over the course of a day).

Can you think of a similar way to get relatively-fresh weighting changes to the client? Perhaps using the 24h stale weights by default and then making async requests for fresher data while the page is idle? That seems like it should work. Thoughts?

---

Secondly, a treatment event needs to be logged for every active experiment every time that a user is treated (eg: to say either that they clicked or didn't click). With split-testing, you only need to send a treatment-event the first time a user is treated with a new experiment and they stick in this group until the experiment ends or the configuration switches them out of it. Nothing strikes me as a good solution to that problem. Seems like you'll have to just take that performance-hit.

Thanks again for the post! Would love to hear your ideas on performance for this method :)

Ed

2012-06-11 11:13:53
Very interesting approach. Do you have any comments regarding VWO's response post? Personally I'd like to see both algorithms implemented.

IanHutchinson

2012-06-18 16:13:12
We are split testing the first image at the start of videos to compare impressions to click throughs. This article tells me to always set the best tracking video at 90% and the others get an even distribution of the 10%.

I will have to give this some more thought. Feel free to take a look at what we do - Vidyard dot com.

Bharath M

2012-06-27 02:05:19
Mind blowing quote.

In recent years, hundreds of the brightest minds of modern civilization have been hard at work not curing cancer. Instead, they have been refining techniques for getting you and me to click on banner ads.

Anonymous Coward

2012-06-30 03:10:50
That's simply trading away 10% of button displays for getting a valid statement about which one works most of the times. 10% has to be statistically relevant for the process to work (i. e. 10% has to be a large number). It has to be even higher if you want to optimize up to the point where you show the orange button only to above average under 25 income females from Europe, and the green one to lower income between 25 and 50 males from South America. Large ad networks like the one operated by Google have such large numbers, and can also derive the demographic characteristics of the people doing the clicking. Smaller sites or medical trial teams don't have such big numbers to experiment on.

Rick

2012-07-02 10:32:48
I found a tool that does just this at http://www.growthgiant.com

LMF

2012-07-06 11:33:24
True, but statistical power to detect a difference is maximized at an equal split. So I believe it's better to run an equal split until a significant difference is determined, then send all traffic to the winner.

Horia Margarit

2012-07-11 12:12:24
Some people have pointed out the critical error with this approach. Suppose you have a sample of 20 page visits, and 19 of them landed on ugly Orange, while only 1 landed on pretty Green. Then your distribution over the expectations of clicks is [19/20, 01/20, 00/20]. But what if Green turns out to be the optimal choice for the population from which you are sampling?

In the best possible case, you will need the next 19 page views after the 20th to display Green, _and_ for Green to be clicked every time, for the next 19 visits in order to determine that Green is the optimal choice. This best possible case would yield a distribution over the expectations of clicks of [19/39, 20/39, 00/39]. How likely is it for this algorithm to display Green these last 19 page views?

For simplicity, let's assume that a visitor will click the Green button from now on if it is present, and further assume that a visitor will no longer click the Orange or White buttons if they are present. With this simplifying assumption, we have made the probability of displaying Green on the 28th trial independent of the probability of displaying Green on the 27th trial, and so on until the 39th trial. But we only explore 10% of the time and we have two exploratory options, so the probability of displaying Green any given time is 0.05. The probability of displaying Green the next 19 times is thus 0.05^19, which equals 19.0735 x 10^-26.

That means that you'd have to run through an additional 100 trillion x 1 trillion trials in order to actually display Green 19 times! Let's not forget about the best case assumptions: no user will ever click on White or Orange these 100 trillion x 1 trillion times, they will only click on the Green and they will do so each of the extremely rare 19 times you display it.

Those are impossible odds!

Yet the algorithm works pragmatically. It works because users are randomly in disagreement, but directionally in agreement. In other words, this algorithm works because its behaviour mirrors user behaviour. Your population of users will, with some probability alpha, agree on what is the best looking button or copy-edit or whatever. And with probability 1-alpha they will choose something else. But the space of something else is large, and each user will choose a different thing in that space from the others. So the critical error with this approach is actually its greatest strength.

It marginalizes random user disagreements to the point where they become totally insignificant.

Steve

2012-07-16 17:04:16
@Horia Margarit

I don't think your scenario would ever play out. The example has all 3 buttons being initialized with a 100% "click" rate.

Orange could not possibly be shown 19 times in a row without it being clicked on because the algorithm displays the best possible (i.e. highest click-ratio) button each visit.

If Orange is shown 19 times in a row, it's because of a statistical anomaly, AND also because 19 people in a row clicked on that Orange button.

As soon as 1 of those people don't click on Orange, the success rate for Orange drops below 100% and the script jumps to another colour, i.e. Green. One failure on Green and we're on to White. One failure on White and we're back on Orange.

What you're missing is that the selection of what button gets displayed isn't random, it's based on the success rate of what has already worked before it.

Arjan Haring

2012-07-26 06:52:10
Hallelujah!

So we actually optimize the explore versus exploit trade-off through probability matching.

And we are in beta ;-) check out PersuasionAPI.

GlynRob

2012-08-13 08:44:01
Excellent article.

I built an example using this method using Redis and Codeigniter to see it in action.

How to build can be seen at:

glynrob.com/database/redis-in-codeigniter/

Full code is available on GitHub if anyone wants to try it for themselves.

Ted Dunning

2012-10-27 17:26:28
Nice article, especially since you correctly define multi-armed bandit and correctly name epsilon-greedy.

It wouldn't hurt to describe the slightly more sophisticated algorithms based on probability matching (search for "bayesian bandit" to find my first blog entry, and check out the October posting where I give references to the original literature). The code really isn't any more complex than epsilon-greedy, and Bayesian Bandits dominate the performance of epsilon-greedy and require no knobs. Even better, they also handle contextual problems which are impossible to deal with using epsilon-greedy.
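
(For readers curious what Ted is describing, a Thompson-sampling "Bayesian bandit" selection can be sketched in a few lines, assuming binary rewards and the same [trials, reward] bookkeeping as the post:)

import random

def choose_bayesian(counts):
    # Sample a plausible click-through rate for each option from a Beta
    # posterior, then show the option whose sample is highest.
    def sample(c):
        trials, successes = counts[c]
        failures = max(trials - successes, 0)
        return random.betavariate(successes + 1, failures + 1)
    choice = max(counts, key=sample)
    counts[choice][0] += 1
    return choice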

Sylvain Galibert

2012-10-28 12:23:06
This was certainly an interesting post. That said, I don't think it would work the way you expect in the real world for the following reasons:

0. Some form of stability must be ensured for the visitors. You can show A, then B. But you can't show A, B, B, A, over and over again. (That's the easiest problem to solve on this list).

As you say, the problem is similar to the multi-armed bandits, BUT it is NOT the same problem. There are important differences that are not taken into account in your model:

1. The number of slot machines is static, whereas the number of options to try is constantly expanding and contracting as your design evolves

You are NOT planning on testing 3 options endlessly. You want to test a number of variations as time goes on. Each new option goes against the established options. So unless you reset all the counters every time you introduce a new button color or font, new options will be far more volatile than their established counterparts. This could result in new options dropping below the threshold of existing options long before a statistically significant sample is reached and take a very long time to resurface. I.e.:

To measure a large 50% improvement on a new button from an existing 2% conversion rate, you need about 2700 visitors at a 95% confidence level. But by the time you reach 100 visitors, your conversion rate could have fallen below 2%, and from that point on, how long before it receives enough visits to prove its worth.

2. On a slot machine, one lever pull is equivalent to another. It does not matter who pulls the lever, what time of the day the lever is pulled, what day of the week the lever is pulled, what period of the year it is, what website the machine was visiting before pulling the lever, etc.

For your website visitors/slot machines, all these factors matter and more. How many people buy translation during the week-end? Not too many. How many people buy toys 2 weeks before Christmas? This method does not account for these differences. A button tested on Saturday afternoon on a translation website will be massively penalized compared to a button tested on Tuesday afternoon. Similarly, if your website gets slashdotted, you may have a sudden spike in visitors who might be either totally uninterested about buying your product (they just want to read a cool thing you wrote) or completely determined in signing up for your cool new service. And then there are seasonal items. Your Halloween themed "buy" button might perform quite well during Halloween, but how long will it remain to the top after Halloween?

3. Fashion and design trends. On the web, the context changes.

By context, I mean overall standards and conventions in web design. That glossy button of yours that has accumulated outstanding conversion rating over time is just not "in" any more. Visitors are now used to more subtle interfaces like the one of Facebook. Unless you have a mechanism to decay the value of clicks over time you will end up with "winning" options that endure when they shouldn't.

The problem here is that for this system to work, it needs to run in a controlled, static environment for a significant period of time. And you really don't have that:

Imagine you are in a casino trying to run your algorithm on a wall of one armed bandits. 5 minutes after you start, a bunch of guys come in and start playing on the same machines as you. Then repairmen come and add twice as many machines. Half of those are a new model. Then they update the OS of 4 of the machines. On and on. Does it still feel like your algorithm would work in that environment? That's far closer to the environment in which most websites operate

Each time a change is introduced, you need a full reset, but this algorithm is useful only when it is allowed to run long enough to reach a statistically significant result, and that makes it poorly suited for website testing, at least for most websites. It might work for huge websites where the numbers of visitors are so large that statistical significance can be reached before the context changes, with some tweaks taking into account things like the time of day/week/season, running a counter reset on unexpected variations (if button A suddenly converts 3x as much as before when it had already been tested often enough to reach statistical significance, something is up)

AB testing, on the other hand, does not suffer as much from this volatile environment because it does not try to favor results before statistical significance is reached. On Sunday or at night, both A & B suffer/benefit equally, whereas with your system, a design which might have been ahead on Friday by a large margin may lose its lead over the week-end, get overtaken on Monday, finally recover towards the end of the day only to tank again overnight. Some potentially interesting options might take a long time to reach statistical significance.

Another advantage of AB testing, and possibly the most important issue here, is that it teaches us and helps us understand what is happening. After a few tries, you can work out some general rules like: "bigger buttons are better", "fewer options are necessary for non qualified visitors, but qualified visitors (from such and such websites) will fill out longer forms", "Pictures can control the attention of visitors", etc. which can then guide your UI evolution.

This can be done to a point with your system, but it's much less reliable: "A" has a conversion rate of 60% and "B" a conversion rate of 22%, but "B" has been tested 600k times in the low season and "A" has only been tested 900 times in the high season. Is "A" really better than "B"? You can't really compare A and B which prevents you from learning as much as you would in AB testing. You are stuck guessing and are unable to extract and verify the rules that work for your website.

Also, AB testing is more hands on which forces you to think with the data: Your Halloween button will never live through Christmas because of the fantastic sales from October/November with AB testing (or your usual "best" button will be shown 90% of the time through Christmas because the new Christmas buttons didn't get a good enough conversion early in December and by the time they recovered from that, the Christmas buying season is over)

There is this old quote - I don't remember where it's from or how it goes exactly, but I think it is pretty applicable here: "When trying to write software that learns by itself, you find out that it doesn't... but you do."

I wasn't quite planning on posting a thousand words in the comments. If you feel like responding to it, I would be happy to hear about it. You can contact me at dev@preptags.com.

pere rovira

2013-01-23 17:48:43
Great article and great discussion. Google Analytics Content Experiments, which replaced the discontinued Google Website Optimizer, uses the statistical model you describe. The main benefits are testing faster and reducing risk. Still, the main shortcoming of any kind of testing strategy is that companies do not test often enough. The key to success is to always be testing, and that's more to do with agile companies than the method used.

Steve

2013-02-19 21:59:09
" All other things being equal, this method will show you which option gives you the best CTR, which is all you really care about anyway"

Totally incorrect. You've got into the wrong way of thinking, like most - ROI and conversion rate are just as important as CTR.

What if you have Ad A, B and C... A has the best CTR yet B has a much lower CTR but a better ROI and Conversion stats?

The A/B test will come out with the right results if you give them enough impressions.

Aditya Ramachandran

2013-04-18 11:48:43
This really speaks to me. Check out Subintent (www.subintent.com). We're using machine learning to optimize conversion funnels, kind of like what you're missing today...

Matt

2013-10-10 04:18:15
Google Content Experiments has been using this approach for quite some time!