20 lines of code that will beat A/B testing every time
A/B testing is used far too often for something that performs so poorly. It is defective by design: segment users into two groups. Show the A group the old, tried-and-true stuff. Show the B group the new whiz-bang design with the bigger buttons and slightly different copy. After a while, take a look at the stats and figure out which group presses the button more often. Sounds good, right? The problem is staring you in the face. It is the same dilemma faced by researchers administering drug studies. During drug trials, you can only give half the patients the life-saving treatment. The others get sugar water. If the treatment works, the control group lost out. This sacrifice is made to get good data. But it doesn't have to be this way.
In recent years, hundreds of the brightest minds of modern civilization have been hard at work not curing cancer. Instead, they have been refining techniques for getting you and me to click on banner ads. It has been working. Both Google and Microsoft are focusing on using more information about visitors to predict what to show them. Strangely, anything better than A/B testing is absent from mainstream tools, including Google Analytics and Google Website Optimizer. I hope to change that by raising awareness about better techniques.
With a simple 20-line change to how A/B testing works, one that you can implement today, you can always do better than a plain A/B test, sometimes two or three times better. This method has several good points: it handles more than two options at once, it keeps adapting as conditions change, and you can set it and forget it.
The Multi-armed bandit problem

(Picture from Microsoft Research)

The multi-armed bandit problem takes its terminology from a casino. You are faced with a wall of slot machines, each with its own lever. You suspect that some slot machines pay out more frequently than others. How can you learn which machine is the best, and get the most coins in the fewest trials?
Like many techniques in machine learning, the simplest strategy is hard to beat. More complicated techniques are worth considering, but they may eke out only a few hundredths of a percentage point of performance. One strategy that has been shown to perform well time after time in practical problems is the epsilon-greedy method. We always keep track of the number of pulls of each lever and the amount of reward we have received from it. 10% of the time, we choose a lever at random. The other 90% of the time, we choose the lever that has the highest expectation of reward.
import random

# One entry per lever: how many times it has been shown, and the total
# reward it has earned. Kept in memory here; in practice, store the test
# data in redis, the choice in a session key, etc.
counts = {"orange": 1, "green": 1, "white": 1}
rewards = {"orange": 1, "green": 1, "white": 1}

def choose():
    if random.random() < 0.1:
        # exploration!
        # choose a random lever 10% of the time.
        choice = random.choice(list(counts))
    else:
        # exploitation!
        # for each lever, calculate the expectation of reward:
        # the total reward given by that lever divided by the number
        # of trials of the lever.
        # choose the lever with the greatest expectation of reward.
        choice = max(counts, key=lambda lever: rewards[lever] / counts[lever])
    # increment the number of times the chosen lever has been played.
    counts[choice] += 1
    return choice

def reward(choice, amount):
    # add the reward to the total for the given lever.
    rewards[choice] += amount
Why does this work?
Let's say we are choosing a colour for the "Buy now!" button. The choices are orange, green, or white. We initialize all three choices to 1 win out of 1 try. It doesn't really matter what we initialize them to, because the algorithm will adapt. So when we start out, the internal test data looks like this:

Orange: 1/1 = 100%   Green: 1/1 = 100%   White: 1/1 = 100%

Then a web site visitor comes along and we have to show them a button. We choose the one with the highest expectation of winning. The algorithm thinks they all work 100% of the time, so it chooses the first one: orange. But, alas, the visitor doesn't click on the button.

Orange: 1/2 = 50%   Green: 1/1 = 100%   White: 1/1 = 100%

Another visitor comes along. We definitely won't show them orange, since we think it only has a 50% chance of working. So we choose green. They don't click. The same thing happens for several more visitors, and we end up cycling through the choices. In the process, we refine our estimate of the click-through rate for each option downwards.

Orange: 1/4 = 25%   Green: 1/4 = 25%   White: 1/4 = 25%

But suddenly, someone clicks on the orange button! Quickly, the browser makes an Ajax call to our reward function:

$.ajax({url: "/reward?testname=buy-button"});

and our code updates the results:

Orange: 2/5 = 40%   Green: 1/4 = 25%   White: 1/4 = 25%

When our intrepid web developer sees this, he scratches his head. What the F*? The orange button is the worst choice. Its font is tiny! The green button is obviously the better one. All is lost! The greedy algorithm will always choose it forever now!

But wait, let's see what happens if orange really is the suboptimal choice. Since the algorithm now believes it is the best, it will always be shown. That is, until it stops working well. Then the other choices start to look better.

Orange: 2/9 = 22%   Green: 1/4 = 25%   White: 1/4 = 25%

After many more visits, the best choice, if there is one, will have been found, and will be shown 90% of the time. Here are some results based on an actual web site that I have been working on. We also have an estimate of the click-through rate for each choice.

Orange: 114/4071 = 2.8%   Green: 205/6385 = 3.2%   White: 59/2264 = 2.6%

But the most enticing part is that you can set it and forget it. If your time is really worth $1000/hour, you don't have time to go back and check how every change you made is doing and pick options. You don't have time to write rambling blog entries about how you got your site redesigned and changed this and that and it worked or it didn't work. Let the algorithm do its job. These 20 lines of code automatically find the best choice quickly, and then use it until it stops being the best choice.

Edit: What about the randomization?
I have not discussed the randomization part. The randomization of 10% of trials forces the algorithm to explore the options. It is a trade-off between trying new things in hopes of something better, and sticking with what it knows will work. There are several variations of the epsilon-greedy strategy. In the epsilon-first strategy, you can explore 100% of the time in the beginning and once you have a good sample, switch to pure-greedy. Alternatively, you can have it decrease the amount of exploration as time passes. The epsilon-greedy strategy that I have described is a good balance between simplicity and performance. Learning about the other algorithms, such as UCB, Boltzmann Exploration, and methods that take context into account, is fascinating, but optional if you just want something that works.
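To give a flavour of those other algorithms, here is a minimal sketch of UCB1 built on the same counts and rewards bookkeeping as the code above; it is an illustration of the idea, not part of the 20 lines.

import math

def ucb_choose():
    # Play the lever with the highest upper confidence bound: its average
    # reward plus an exploration bonus that shrinks as the lever
    # accumulates trials.
    total_plays = sum(counts.values())
    def ucb(lever):
        return (rewards[lever] / counts[lever]
                + math.sqrt(2.0 * math.log(total_plays) / counts[lever]))
    choice = max(counts, key=ucb)
    counts[choice] += 1
    return choice

Instead of a fixed 10% of random pulls, UCB1 gives every lever an exploration bonus that fades away as that lever gets tried more often.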
Wait a minute, why isn't everybody doing this?
Statistics are hard for most people to understand. People distrust things that they do not understand, and they especially distrust machine learning algorithms, even if they are simple. Mainstream tools don't support this, because then you'd have to educate people about it, and about statistics, and that is hard. Some common objections come up again and again; you will find several of them in the comments below.
Now we even have AI to really help marketers crush their A/B tests.
Check out the post I just wrote around A/B testing levels of sophistication and how technology is changing things in 2017:
www.retentionscience.com/ab-testing/
I just found out that Kameleoon does it.
"Multi-armed bandit tests
(Adaptive traffic distribution)"
from kameleoon.com/en/pricing-ab-testing-personalization.html
Also, conductrics.com/real-time-optimization/ states "Using machine learning methods".
Also vwo.com/features/ with their "Multivariate Testing".
You are probably not the first one to come up with this idea, but you surely gave it to some of the players listed above :-) Good work!
Love the site UX btw. It's super straightforward and works well. Fanciness is overrated.
Re. this comment system: The UI is a lot more interesting than most blogs. Everything wraps the blog entry. Very simple and straightforward. I like it. But below the article is the expected place for a comment "box". Why not add a "Post a comment" button or link that takes you up here?
Like how you don't have to sign up tho.
The ε-first algorithm makes more sense than an ε-greedy one, in this context, because the total number of trials is approximately infinite. You can get a solid answer about the 'best' in a tiny fraction of the total page views the UI will encounter, but waiting for the results of a low-ε ε-greedy algorithm would be a major logistical headache.
If your ε is high enough to make the ε-greedy algorithm converge on the right result quickly, it's high enough to be a continual nuisance to users afterwards. You'll always want to *stop* the test and implement something, and going for the ε-greedy algorithm just slows that process down.
An ε-decreasing algorithm would work, but I don't think there's a compelling reason to choose it over ε-first.
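For concreteness, an ε-first variant of the earlier choose() function might look roughly like this sketch; the exploration budget of 1000 trials is an arbitrary number for illustration, not a recommendation.

import random

EXPLORE_TRIALS = 1000  # arbitrary exploration budget, for illustration only

def epsilon_first_choose():
    total_plays = sum(counts.values())
    if total_plays < EXPLORE_TRIALS:
        # explore 100% of the time until the budget is spent
        choice = random.choice(list(counts))
    else:
        # then switch to pure greedy: highest average reward wins
        choice = max(counts, key=lambda lever: rewards[lever] / counts[lever])
    counts[choice] += 1
    return choice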
What I would like to add to this discussion is that there are many factors at play besides statistical testing power: the quality of the variations, how different the variations are, whether variations are dropped in real time, whether new tests can be queued asynchronously, whether a meaningless test is automatically stopped, and so on.
To solve these systemic problems, and to automate everything in the process of A/B testing banners except for the bulk input of actual changes, is the reason I founded Perfectbanner.
I believe the pseudo code should say:

# calculate the expectation of reward.
# This is the total reward given by that lever divided by
# the number of trials of the lever.

instead of:

# calculate the expectation of reward.
# This is the number of trials of the lever divided by the total reward
# given by that lever.

The higher the reward earned by the lever, the higher the expectation of the reward should be.
Totally incorrect. You've fallen into the wrong way of thinking, like most people - ROI and conversion rate are just as important as CTR.
What if you have ads A, B and C... A has the best CTR, yet B has a much lower CTR but better ROI and conversion stats?
The A/B test will come out with the right results if you give them enough impressions.
As you say, the problem is similar to the multi-armed bandits, BUT it is NOT the same problem. There are important differences that are not taken into account in your model:
0. Some form of stability must be ensured for the visitors. You can show A, then B. But you can't show A, B, B, A, over and over again. (That's the easiest problem to solve on this list.)
1. The number of slot machines is static, whereas the number of options to try is constantly expanding and contracting as your design evolves
You are NOT planning on testing 3 options endlessly. You want to test a number of variations as time goes on. Each new option goes against the established options. So unless you reset all the counters every time you introduce a new button color or font, new options will be far more volatile than their established counterparts. This could result in new options dropping below the threshold of existing options long before a statistically significant sample is reached and take a very long time to resurface. I.e.:
To measure a large 50% improvement on a new button from an existing 2% conversion rate, you need about 2700 visitors at a 95% confidence level. But by the time you reach 100 visitors, your conversion rate could have fallen below 2%, and from that point on, how long will it be before it receives enough visits to prove its worth?
2. On a slot machine, one lever pull is equivalent to another. It does not matter who pulls the lever, what time of the day the lever is pulled, what day of the week the lever is pulled, what period of the year it is, what website the machine was visiting before pulling the lever, etc.
For your website visitors/slot machines, all these factors matter and more. How many people buy translations during the weekend? Not too many. How many people buy toys 2 weeks before Christmas? This method does not account for these differences. A button tested on Saturday afternoon on a translation website will be massively penalized compared to a button tested on Tuesday afternoon. Similarly, if your website gets slashdotted, you may have a sudden spike in visitors who might be either totally uninterested in buying your product (they just want to read a cool thing you wrote) or completely determined to sign up for your cool new service. And then there are seasonal items. Your Halloween-themed "buy" button might perform quite well during Halloween, but how long will it remain at the top after Halloween?
3. Fashion and design trends. On the web, the context changes.
By context, I mean overall standards and conventions in web design. That glossy button of yours that has accumulated an outstanding conversion rating over time is just not "in" any more. Visitors are now used to more subtle interfaces like Facebook's. Unless you have a mechanism to decay the value of clicks over time, you will end up with "winning" options that endure when they shouldn't.
The problem here is that for this system to work, it needs to run in a controlled, static environment for a significant period of time. And you really don't have that:
Imagine you are in a casino trying to run your algorithm on a wall of one-armed bandits. 5 minutes after you start, a bunch of guys come in and start playing on the same machines as you. Then repairmen come and add twice as many machines. Half of those are a new model. Then they update the OS of 4 of the machines. On and on. Does it still feel like your algorithm would work in that environment? That's far closer to the environment in which most websites operate.
Each time a change is introduced, you need a full reset, but this algorithm is useful only when it is allowed to run long enough to reach a statistically significant result, and that makes it poorly suited for website testing, at least for most websites. It might work for huge websites where the numbers of visitors are so large that statistical significance can be reached before the context changes, with some tweaks that take into account things like the time of day/week/season, and a counter reset on unexpected variations (if button A suddenly converts 3x as much as before when it had already been tested often enough to reach statistical significance, something is up).
A/B testing, on the other hand, does not suffer as much from this volatile environment because it does not try to favor results before statistical significance is reached. On Sunday or at night, both A & B suffer/benefit equally, whereas with your system, a design which might have been ahead on Friday by a large margin may lose its lead over the weekend, get overtaken on Monday, finally recover towards the end of the day only to tank again overnight. Some potentially interesting options might take a long time to reach statistical significance.
Another advantage of AB testing, and possibly the most important issue here, is that it teaches us and helps us understand what is happening. After a few tries, you can work out some general rules like: "bigger buttons are better", "fewer options are necessary for non qualified visitors, but qualified visitors (from such and such websites) will fill out longer forms", "Pictures can control the attention of visitors", etc. which can then guide your UI evolution.
This can be done to a point with your system, but it's much less reliable: "A" has a conversion rate of 60% and "B" a conversion rate of 22%, but "B" has been tested 600k times in the low season and "A" has only been tested 900 times in the high season. Is "A" really better than "B"? You can't really compare A and B which prevents you from learning as much as you would in AB testing. You are stuck guessing and are unable to extract and verify the rules that work for your website.
Also, A/B testing is more hands-on, which forces you to think with the data: with A/B testing, your Halloween button will never live on through Christmas on the strength of its fantastic October/November sales, whereas with your system it might (or your usual "best" button will be shown 90% of the time through Christmas, because the new Christmas buttons didn't get a good enough conversion early in December, and by the time they recovered from that, the Christmas buying season is over).
There is this old quote - I don't remember where it's from or how it goes exactly, but I think it is pretty applicable here: "When trying to write software that learns by itself, you find out that it doesn't... but you do."
I wasn't quite planning on posting a thousand words in the comments. If you feel like responding to it, I would be happy to hear about it. You can contact me at dev@preptags.com.
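One rough way to approximate the decay mechanism mentioned in point 3 above is to age the counters periodically so that old clicks gradually stop dominating. This is only a sketch on top of the earlier counts and rewards bookkeeping, with an arbitrary decay factor, not something from the original post:

def decay(factor=0.99):
    # Periodically (say, once a day) shrink every lever's history by the
    # same factor, so recent clicks weigh more than old ones. The factor
    # here is an arbitrary illustration, not a tuned value.
    for lever in counts:
        counts[lever] = max(1.0, counts[lever] * factor)
        rewards[lever] = rewards[lever] * factor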
It wouldn't hurt to describe the slightly more sophisticated algorithms based on probability matching (search for "bayesian bandit" to find my first blog entry, and check out the October posting where I give references to the original literature). The code really isn't any more complex than epsilon-greedy, and Bayesian bandits dominate the performance of epsilon-greedy and require no knobs. Even better, they also handle contextual problems, which are impossible to deal with using epsilon-greedy.
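For readers who want to see it, here is a minimal Beta-Bernoulli sketch of that probability-matching idea (often called Thompson sampling), reusing the counts and rewards bookkeeping from the post. It illustrates the approach described in the comment above, not the exact code from the referenced blog entries.

import random

def bayesian_choose():
    # Sample a plausible click-through rate for each lever from a Beta
    # posterior (successes + 1, failures + 1), then play the lever whose
    # sample came out highest. Uncertain levers get explored automatically,
    # without an epsilon knob.
    def sample(lever):
        successes = rewards[lever]
        failures = counts[lever] - rewards[lever]
        return random.betavariate(successes + 1, failures + 1)
    choice = max(counts, key=sample)
    counts[choice] += 1
    return choice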
I built an example using this method using Redis and Codeigniter to see it in action.
How to build can be seen at:
glynrob.com/database/redis-in-codeigniter/
Full code is available on GitHub if anyone wants to try it for themselves.
So we actually optimize the explore versus exploit trade-off through probability matching.
And we are in beta ;-) check out PersuasionAPI.
I don't think your scenario would ever play out. The example has all 3 buttons being initialized with a 100% "click" rate.
Orange could not possibly be shown 19 times in a row without it being clicked on because the algorithm displays the best possible (i.e. highest click-ratio) button each visit.
If Orange is shown 19 times in a row, it's because of a statistical anomaly, AND also because 19 people in a row clicked on that Orange button.
As soon as 1 of those people don't click on Orange, the success rate for Orange drops below 100% and the script jumps to another colour, i.e. Green. One failure on Green and we're on to White. One failure on White and we're back on Orange.
What you're missing is that the selection of what button gets displayed isn't random, it's based on the success rate of what has already worked before it.
In the best possible case, you will need the next 19 page views after the 20th to display Green, _and_ for Green to be clicked every one of those 19 times, in order to determine that Green is the optimal choice. This best possible case would yield a distribution over the expectations of clicks of [19/39, 20/39, 0/39]. How likely is it for this algorithm to display Green these last 19 page views?
For simplicity, let's assume that a visitor will click the Green button from now on if it is present, and further assume that a visitor will no longer click the Orange or White buttons if they are present. With this simplifying assumption, we have made the probability of displaying Green on the 28th trial independent of the probability of displaying Green on the 27th trial, and so on until the 39th trial. But we only explore 10% of the time and we have two exploratory options, so the probability of displaying Green any given time is 0.05. The probability of displaying Green the next 19 times is thus 0.05^19, which is about 1.9 x 10^-25.
That means that you'd have to run through an additional 100 trillion x 1 trillion trials in order to actually display Green 19 times! Let's not forget about the best case assumptions: no user will ever click on White or Orange these 100 trillion x 1 trillion times, they will only click on the Green and they will do so each of the extremely rare 19 times you display it.
Those are impossible odds!
Yet the algorithm works pragmatically. It works because users are randomly in disagreement, but directionally in agreement. In other words, this algorithm works because its behaviour mirrors user behaviour. Your population of users will, with some probability alpha, agree on what the best-looking button or copy edit or whatever is. And with probability 1-alpha they will choose something else. But the space of something else is large, and each user will choose a different thing in that space from the others. So the critical error with this approach is actually its greatest strength.
It marginalizes random user disagreements to the point where they become totally insignificant.
"In recent years, hundreds of the brightest minds of modern civilization have been hard at work not curing cancer. Instead, they have been refining techniques for getting you and me to click on banner ads."
I will have to give this some more thought. Feel free to take a look at what we do - Vidyard dot com.
Some of the hurdles to adoption that I could see would be performance-related.
Firstly, relatively-fresh weightings should be available to the client. That seems like it requires either making the edge-caching of your pages fairly short, or making a server-side call on many requests.
Currently, split-testing (a/b/c/d/e, etc. - not sure why someone would limit to just a/b) on high-traffic sites can determine treatment-groups using hashes of experiment ids and unique identifiers for the user from the CDN. This is what we do at Wikia.
Since we cache most pages for 24 hours in our CDN, we can bake experiment configurations into the page (the weights in split-testing typically do not change over the course of a day).
Can you think of a similar way to get relatively-fresh weighting changes to the client? Perhaps using the 24h stale weights by default and then making async requests for fresher data while the page is idle? That seems like it should work. Thoughts?
---
Secondly, a treatment event needs to be logged for every active experiment every time that a user is treated (eg: to say either that they clicked or didn't click). With split-testing, you only need to send a treatment-event the first time a user is treated with a new experiment and they stick in this group until the experiment ends or the configuration switches them out of it. Nothing strikes me as a good solution to that problem. Seems like you'll have to just take that performance-hit.
Thanks again for the post! Would love to hear your ideas on performance for this method :)
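A rough server-side sketch of the "stale weights by default, refresh asynchronously" idea from the comment above: serve whatever weights are cached and kick off a background refresh once they are older than some threshold. The 24-hour TTL and the fetch_weights_from_store() helper are hypothetical placeholders, not part of any existing system.

import threading
import time

WEIGHTS_TTL = 24 * 60 * 60  # seconds; matches the 24h page cache mentioned above
_cached_weights = {"orange": 1.0, "green": 1.0, "white": 1.0}
_cached_at = 0.0

def fetch_weights_from_store():
    # Hypothetical placeholder: read fresh counts/rewards from redis or a
    # database and turn them into display weights.
    return dict(_cached_weights)

def _refresh():
    global _cached_weights
    _cached_weights = fetch_weights_from_store()

def get_weights():
    # Always return immediately with whatever we have; if it is stale,
    # refresh it in the background so later requests get fresher data.
    global _cached_at
    if time.time() - _cached_at > WEIGHTS_TTL:
        _cached_at = time.time()
        threading.Thread(target=_refresh, daemon=True).start()
    return _cached_weights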
=====
Initial parameters for A/B test:
{pA} - 0.005 (0.5%) success chance (click, sale etc.) of group A
{pB} - 0.01 (1%) success chance of group B
Initial parameters for new approach:
{p} - 0.2 (20%) of viewers are assigned to random group
{pA} - 0.005 (0.5%) success chance of group A
{pB} - 0.01 (1%) success chance of group B
Each test consisted of 10k impressions; 100k tests were performed. Results:
A/B:
Group B "won" in 99,844% cases, total successes (group A+B) over all tests: 7,5 mln.
New approach:
Group B "won" in 98,827% of cases, but provided 9,5 mln successes. (almost 27% better than A/B test!)
So, it seems that although A/B tests more quickly answer which group is "better", they also generate fewer sales/clicks/whatever.
I'm not a statistician but I would love to see math that gives a sound explanation of the above results.
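Here is a minimal sketch of that kind of simulation, using the parameters stated above (0.5% and 1% success chances, a 20% random slice for the adaptive version, 10k impressions per test). It only illustrates the setup described in the comment; exact totals will vary from run to run.

import random

P_A, P_B = 0.005, 0.01   # success chances from the comment
IMPRESSIONS = 10_000
EXPLORE = 0.2            # fraction of viewers assigned at random
N_TESTS = 100            # the comment used 100k tests; kept small so this runs quickly

def run_ab():
    # classic 50/50 split test
    successes = 0
    for _ in range(IMPRESSIONS):
        p = P_A if random.random() < 0.5 else P_B
        successes += random.random() < p
    return successes

def run_adaptive():
    # greedy on observed success rate, except for a 20% random slice
    stats = {"A": [1, 1], "B": [1, 1]}   # [successes, trials]
    successes = 0
    for _ in range(IMPRESSIONS):
        if random.random() < EXPLORE:
            group = random.choice(["A", "B"])
        else:
            group = max(stats, key=lambda g: stats[g][0] / stats[g][1])
        won = random.random() < (P_A if group == "A" else P_B)
        stats[group][0] += won
        stats[group][1] += 1
        successes += won
    return successes

print("A/B:", sum(run_ab() for _ in range(N_TESTS)))
print("Adaptive:", sum(run_adaptive() for _ in range(N_TESTS)))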
Then you'll have a graph of each of these as an independent probability (i.e. how often you get blue and how often you get large font). Then you could also do some linking between the lists to show what the other option was when it was clicked (i.e. blue and large) and then analyze that.
Or am I missing something?
There are some algorithms that will work very well to solve a big problem with A/B testing. Change.
As seasons, cycles, traffic, external marketing, tv, display, customers, markets, businesses, strategies - Change - so do the results of that A/B test you declared 6 months ago.
Just because it did 12.5% better in 2 weeks in February does not mean it will self optimise with the CHANGE going on. So people delude themselves that they are still getting the 12.5% (in their mind) when in reality, they don't know.
Hmmmm.
To solve this though, I often test after going live with 5/10% splits to verify the control performance still tracks lower. This helps to convince people who think the split or multivariate test has suddenly driven conversion rates down.
Extend this idea further and use an evolutionary genetic algorithm to select, test and then re-verify against runners-up, the original control, and random new items fed in. Something that can learn and adapt to patterns in an evolutionary way will be far better at automatic tuning of raw assets into optimal recipes. It will also keep performing long after your last A/B test finished, and will adapt better to sources of change you have little control over.
I want to build this! I want to build it now! If anyone else feels the same and can help, I'd love to do it.
Craig.
Your real point, though, is about sampling, and, well, who cares if it's 50/50 or 90/10? You will get statistical significance quicker with closer ratios and fewer variations; 90/10 with 5 different variations will take a long time to become significant on smaller sites.
Every tool I have used allows for more than A/B (2 variations) so that isn't an argument.
Yes, people understand an even distribution; why complicate it? People aren't doing enough of this as it is, let alone with a higher barrier to entry.
I work in a digital agency and people use the tools available to them. Nobody here is a programmer, so they use tools like Optimizely and Visual Website Optimizer, because they're easy!!! I can't stress this enough: simple for marketers to use and understand is key here. The super-clever guys at Google and Microsoft can do whatever they like; the normal 99% of people need simple and easy to understand, and storing data in redis or whatever is a million miles beyond their capabilities.
I'm off to implement this all over the place. Thanks for the post... really.
I.e., let's say you have some ideas for how to change your website, banner, etc., but you only want a small pool at a time: given a queue of ideas from your marketing/design team, the site will only make use of 3-4 ideas at a time. As the process finds a winner over a period of time and stability is achieved (configurable), the "losers" in the pool are kicked out and replacements are injected into the batch from the queue.
In this way, elements can be auto-evaluated, judged, and swapped out. I'm sure some enterprising coder can also include a report attached to each option and put the losers in the fail bucket and add the winners to the high performer bucket.
And yeah, consistent display of the site for a given user. Though letting it mix things up periodically is a good thing as well... :)
Wing wingtangwong.com
I built an Excel sheet that simulates this algorithm for 6 items of differing (theoretical) click-through rates. I also used a genetic optimization algorithm to test whether 10% is the right number for randomization.
Over 15k trials, I found something interesting. The average click-through rate declined (almost linearly) as the amount of randomization went up. Of course, then I suspected that we were losing click-throughs because we couldn't get to the "right" answer quickly enough.
But I measured the theoretical loss - the people who would have clicked on the best option had it been presented to them. (I measured this by pulling a random number for each trial and if it was small enough to clear the highest hurdle, but not small enough to clear the hurdle presented, it was an 'unnecessary loss.')
I found that unnecessary losses stayed relatively the same for all randomization trials. That was a surprise.
So, try for a low randomization number. That seems to indicate that the randomization is just there in case things change. For a static set of options and static customers, no randomization gets us to the "right" answer fastest with the best average click-through rates.
Anyways, happy to share my spreadsheet with anyone who wants to see it.
Your idea is straightforward and easy to understand.
You could modify the code to report its results after every 1,000 (or any number of) visits. It would be interesting to see how people's reactions change over time.
Implemented the logic as a WordPress plugin with shortcodes.
You can find it on GitHub or via a Google search for "FlowSplit":
https://github.com/EkAndreas/flowsplit/wiki/1---Introduction
There's no "Like" button, but if there was, consider it pressed.
I've been working on a few algorithms for cross-item comparison: if the "click me" is blue and the banner is X size, and so on for the different combinations I want to test. So basically I'm creating test sets over individual items, which fits better with web design. I am really testing which randomly provided style sheet / page design seems to get the most time, clicks, navigation, etc., and pulling all of those factors into a scorecard. I know it seems like a bunch of work, but once you have the scripts it works for any page or site you build from then on, having it automated saves so much time down the road, and talk about great stats to provide to your clients.
- sometimes (often) we want user affinity, i.e. we want the same user to get the same experience each time he sees the button.
- sometimes choices are made early / are static, e.g. I'm generating emails with 2 templates... Once sent, it's difficult to take them back :)
If anyone has some advice on handling these cases...
For other situations (and there are a lot of them), this is a great idea.
Gilles
I just wouldn't categorize it any differently.
This is kinda like quicksort. In most cases, with a random distribution, it will run in O(n log(n)), but in the worst case, mathematically, it's O(n^2). I think the algorithm described above could really make great improvements over the current standard A/B testing, but anyone that uses it needs to know the pros and cons so that it can be tweaked properly. In some cases, maybe the 90-10 split would be better as 80-20 or 70-30. It really depends on what kind of data you have and finding a "sweet spot" for it. With careful analysis of the specific application, it could prove to be very powerful... but you do have to know what's going on and make accurate assumptions about the data. The situation I thought of where this would not be optimal is if you have a lot of data initially for one of the tests that doesn't match the eventual CTR after x number of views.
I thought Optimizely had this capability too but upon examining their interface I don't see that feature. Anyways nice post.
sean
Also - 10% seems awfully high once we start converging on a solution. Can anybody demonstrate that (a) my intuition is wrong or (b) that there's a way to improve upon this?
This is an interesting approach. Thanks!
When doing this, you or others might also want to consider the statistical significance of the results. Running a chi-squared test of the results on a periodic basis will begin to tell you when/if the variances you are seeing are statistically significant. And it's not stats for stats' sake - there's real benefit there. It means that you can more quickly and confidently find your winner, stop the test and swap over to the option that performs the best.
I'm not a statistician (probably the opposite of that) but at my last job I took an Excel spreadsheet our analytics team was using to verify the validity of test results and brought it online. It took a little digging to find the equation being run by the "chitest()" function in Excel, but once I did, it turned into about 50 lines of code to prep the data and run the chi-squared test.
If I have time I'll try to generalize that code to work with N buckets (we were just testing A & B) and post it somewhere.
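For anyone who wants the N-bucket generalization right away, here is a minimal sketch using scipy's chi-squared test of independence, assuming the observed data is clicks versus non-clicks per option:

from scipy.stats import chi2_contingency

def significance(clicks, trials):
    # clicks and trials are parallel lists, one entry per option.
    # Build a clicks / non-clicks contingency table and run the
    # chi-squared test of independence on it.
    table = [clicks, [t - c for c, t in zip(clicks, trials)]]
    chi2, p_value, dof, expected = chi2_contingency(table)
    return p_value

# Example with the three buttons from the post:
print(significance([114, 205, 59], [4071, 6385, 2264]))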
It's statistically sound because it doesn't manipulate the percentages; rather, it just concentrates on one of the options, making its percentage "more accurate" in the sense of developing a larger N.
When N is small for all variants, natural fluctuations will cause various options to be "best," switching a lot, which in fact is just what you want it to do.
If you were concerned about being a little more statistically sound in the low-N period, you could just say "If the total number of trials is less than some threshold T, display the buttons round-robin." That way you can set T = 100 or something like n_choices * 50, and that gives all the options a "fair start."
Nice!
Of course, if you're not worried about SEO (say, you have a big brand) then great, but while I do not enjoy very simple A/B testing, I will still use it, as my rankings will not be affected.
Just a thought :-)
- assuming that you are working with an ecommerce site, you should be consistent: as a user passes through more than one product page, he must always see the same color for that button; it will be confusing for that specific user to see all the rainbow colors on the "add to cart" button.
- there should also be a normal/control group, an unbiased group that will receive the old version of the button; you want to see the increase in CTR/conversions/whatever with respect to the control group. But I think your approach is somehow shorter than the A/B variant, where at the end you should do a follow-up test running only the winner in order to validate the result.
So, it seems a very nice approach, but there is software (SaaS) already doing this, not with 20 lines of code of course (and with a lot of money). There are MVT or A/B SaaS tools that, instead of leaving the owner to choose from the possible winners, choose the winner automatically during the test.
One little thing I am afraid of is that if you want to test different "creatives"/banners/images/html-banners on different sections of a product page, for example, you have to write these 20 lines of code for each zone that has multiple variants to choose from. So it will be quite messy inside the script source that generates that specific web page.
The SaaS approach, which uses section divs where you upload creatives/banners/images, is easier than writing 20 lines of code for each zone we want to test, especially if you are not a programmer or cannot hire one for this task. The reports are also nice. But the programming approach you are proposing is quite cost effective, I suppose.
Anyway, it's worth exploring this solution. It could be more cost effective for many little A/B tests.
Nice article! Thx.
PS. Sorry for my English as I'm not a native English speaking guy.
An important feature of A/B tools is that a specific user always sees the same option, so they avoid the "It was not like this yesterday" effect.
You should check Webtrends Optimize, that one is a really innovative tool. It not only selects the best option to your average audience but also selects the best option for different types of users, based on where they come from for example.
In your example where A is initially unpopular, B will then be shown, but *only for as long as it's successful*. If it's not popular either then its expectation will very rapidly decrease until A starts getting shown again.
All other things being equal, this method will show you which option gives you the best CTR, which is all you really care about anyway.
I'd be really keen to understand this further. Would it be possible to drop you an email? If you drop me an email with my name, followed by an "at" symbol, ending with gmail period com, I'll respond.
There's a startup called Conductrics that's doing this as a service. Really cool stuff, especially since I have no formal background in statistical modeling.