A/B Testing and the Infinite Monkey Theory


1. A/B Testing and the Infinite Monkey Theorem. Lukasz Twardowski, www.useitbetter.com (http://en.wikipedia.org/wiki/Portraits_of_Shakespeare)
2. A monkey hitting keys at random for an infinite amount of time will almost surely type the complete works of William Shakespeare.
3. A monkey A/B testing at random for an infinite amount of time will almost surely reach the conversion rate of Amazon.
4. THEORY: A/B testing helps find out which of two versions performs better while both run simultaneously.
5. We do this because every day is different, unlike in the Groundhog Day movie. (Groundhog Day, 1993, dir. Harold Ramis)
6. A single change, bad or good, will not change a trend. Unless a change is A/B tested, you won't know its impact. (http://nerds.airbnb.com/experiments-at-airbnb/)
7. Why the monkey metaphor?
8. EXERCISE 1. Provide the benchmark: the industry average hit rate for A/B testing = ___
9. The industry average hit rate for A/B testing = 14%. Just 1 out of 7 A/B tests is successful! (http://conversionxl.com/ab-tests-fail/)
10. How to be the greatest monkey in the biz if infinity is not an option? (King Kong, 1933, dir. Merian Cooper, Ernest Schoedsack)
11. How to be the best monkey in the biz? Be a quick monkey.
12. EXERCISE 2. Do the math: 1 out of 7 tests wins x 2 weeks per test = slow growth. Running one test at a time, that is one winning test roughly every 14 weeks, so only three or four wins a year. Unless you experiment at scale.
13-15. The currency in which you pay for A/B tests is traffic. The more you have, the more tests you can run. Never waste what you have.
16. Shop Direct, a 100+ year old company, scaled to 101 experiments a month in two years. Etsy, a startup launched in 2005, does 25 releases a day, most of them A/B tests. (http://www.slideshare.net/danmckinley/design-for-continuous-experimentation)
17. Zero Tests Per Month: "Here's the test idea, numbers and execution. Can we proceed?" "Let's meet to discuss. Maybe next week?" "Looks good. Will check with Z and get back to you." "So here's the test idea, numbers..." "Sorry, had other priorities. Can we meet next week?" "Sure! (D***!)" "Have you checked with Z?" "Have you?" "Have you?"
18. Ground rules: 1. Test ideas are subject to prioritization, not approval.
19. EXERCISE 3. The magic formula: evidence x opportunity size x strategy = priority. The worst idea gets tested if resources are available. (A minimal scoring sketch follows after these slides.)
20. 101 Tests Per Month: "OK then, we'll do this, this and that test. Others will wait." "Guys, our strategy shifted to checkout optimization." "Guys, we need to increase basket value." "Now this and that one." "And this." "These two would work." "Xmas is coming! DO NOTHING!" "This, this and that."
21. Ground rules: 2. Accept the fact that things will go wrong.
22. How to be the best monkey in the biz? Cheat like a monkey.
23. If 1 out of 7 tests wins, what about the other 6?
24. EXERCISE 1. What was the result of the Button Colors Test by Groove? (https://www.groovehq.com/blog/failed-ab-tests)
25. If 1 out of 7 tests wins, what about the other 6? 5 of them will be inconclusive.
26. Most tests are inconclusive because: a) too few users were using the changed feature for it to reach statistical significance; b) the changed feature had little to do with the metrics used to evaluate the test; c) there were multiple changes in the same test and they levelled each other out. (A minimal significance-check sketch follows after these slides.)
27. EXERCISE 4. Complete the sentence: A/B testing is NOT about __________ (making money). You do it to find out what works and how well.
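Slide 19's formula can be read as a simple multiplicative score. The sketch below is only one way to make that reading concrete; the 1-5 scales, the example backlog items and the scores are assumptions for illustration, not part of the deck.

```python
def priority(evidence, opportunity_size, strategy_fit):
    """Slide 19's 'magic formula': evidence x opportunity size x strategy = priority.
    Each input is a team-agreed score (assumed here to be 1-5); the deck does not
    prescribe a scale."""
    return evidence * opportunity_size * strategy_fit

# Hypothetical backlog: even the weakest idea keeps a score and may still run
# if resources are available, because ideas are prioritized, never "approved".
backlog = {
    "shorter checkout form": priority(evidence=4, opportunity_size=3, strategy_fit=5),
    "homepage hero video":   priority(evidence=2, opportunity_size=4, strategy_fit=2),
}
for idea, score in sorted(backlog.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{idea}: {score}")
```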
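To make slide 26's point about statistical significance concrete, here is a minimal two-proportion z-test of the kind commonly used to evaluate an A/B test. It is a generic textbook check, not the deck's own method, and the visitor and conversion counts are invented for illustration.

```python
import math

def ab_significance(conv_a, visitors_a, conv_b, visitors_b):
    """Two-sided two-proportion z-test for a simple A/B test."""
    p_a, p_b = conv_a / visitors_a, conv_b / visitors_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    pooled = (conv_a + conv_b) / (visitors_a + visitors_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return p_b - p_a, p_value

# Invented numbers: 3.0% vs 3.3% conversion on 10,000 visitors per variant.
lift, p = ab_significance(300, 10_000, 330, 10_000)
print(f"lift: {lift:.2%}, p-value: {p:.3f}")  # p is about 0.22, i.e. inconclusive
```

With this few users on the changed feature, even a real 10% relative lift stays inconclusive, which is exactly the situation slide 26a describes.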
28. You can successfully run tests that have no chance of success.
29-30. Cheat: experiment to test significance. Removing a feature or slowing down the website are changes with no chance of "winning", but if the results show the change didn't reduce conversion, they also show we shouldn't waste time on that area.
31. Cheat: one change per test, and order matters. The plan "select products, produce videos, upload, add links, launch test" bundles several changes; tested together (add links + select products + produce videos), the result comes back INCONCLUSIVE.
32. Cheat: measure against your hypothesis. "Adding videos had no impact on conversion" is INCONCLUSIVE; "people don't click watch-video links" is CONCLUSIVE.
33. A great presentation by Etsy: goo.gl/WQpY65
34-35. The benefit you get from A/B testing is knowledge, not revenue. Revenue will come as a result of applied knowledge.
36. How to be the best monkey in the biz? Don't be a monkey. Don't be a gnome either.
37. What about this 1 test out of 7 that fails?
38. 3 out of 4 companies (that are A/B testing) make changes based on intuition or best practices. (Chart: 50% of companies A/B testing, 50% not. http://conversionxl.com/ab-tests-fail/)
39. EXERCISE 5. Solve the equation: collect underpants + ? = profit.
40. A/B Testing Flow, fail-fast approach: an A/B test is launched; test results come back negative; the idea gets killed and the next test is launched.
41. One failed test doesn't make collecting underpants a bad idea.
42. Example of an A/B testing flow at Spotify: prepare for failure. Pre-test research is done; users' behaviours are logged; users are surveyed alongside the test; the A/B test is launched; test results come back negative; survey responses give a clue why; respondents' logs give another clue; respondents are emailed to clarify the issue; the issue is solved and the test is relaunched. (Courtesy of @bendressler, researcher at Spotify.)
43. The real price you pay for not researching why tests fail is the death of great ideas.
44-45. Evidence-led flow for hypothesis-based A/B testing: user testing, voice of customer and qual/quant analytics provide insight and evidence; a hypothesis is written ("I predict that doing B will change X by Y% because of Z"); metrics-based evaluation asks whether the metrics are good and the hypothesis is accepted or rejected; a hypothesis check asks what really happened. (A minimal sketch of such a hypothesis record follows after these slides.)
46. UseItBetter, the platform for evidence-led experimentation at scale: collect behavioural data, build segmentation rules, explore, analyse, visualise, quantify an opportunity, translate an insight into a test. (Average stats per website from the last month: 1 TB of raw behavioural data, 40M unique interactions, 41 sets of rules created.)
47. An analyst researching for an infinite amount of time will almost surely get you to a 100% hit ratio. Which isn't good either.
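One way to make the "I predict that doing B will change X by Y% because of Z" template operational is to store each test as a structured record and judge it against its own hypothesis metric, in the spirit of slide 32. This is only an illustrative sketch; the field names, the video example and all numbers are assumptions, not taken from the deck.

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    """One record per test, following the deck's template:
    'I predict that doing B will change X by Y% because of Z.'"""
    change: str            # B: what we change
    metric: str            # X: the metric the change should move
    predicted_lift: float  # Y: expected relative change
    rationale: str         # Z: the insight/evidence behind the prediction

# Hypothetical example in the spirit of slide 32 (names and numbers are made up):
h = Hypothesis(
    change="add product videos to the listing page",
    metric="clicks on the 'watch video' link",
    predicted_lift=0.10,
    rationale="session recordings show users hunting for more product detail",
)

def evaluate(h: Hypothesis, observed_lift: float) -> str:
    """Judge the test against its own hypothesis metric, not only against revenue."""
    verdict = "supported" if observed_lift >= h.predicted_lift else "not supported"
    return f"'{h.change}' -> {h.metric}: {observed_lift:.0%} observed, hypothesis {verdict}"

print(evaluate(h, observed_lift=0.02))  # conclusive on its own terms: people don't click
```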
48-55. If you are going to A/B test:
1. Never waste your traffic.
2. Many small changes are better than one big change.
3. Even the smallest change needs an insight.
4. Prepare for failure.
5. It's OK to fail if you know why you failed.
6. Iterate.
7. Be honest.
56. Disclaimer: for the sake of this presentation, I assumed that the results of the 7 tests I referred to had been correctly read out by people who are familiar with terms like statistical significance, confidence intervals, p-value, etc. Otherwise, it's likely that the one winning test was just a phantom.
57. THE FINAL EXERCISE. Get in touch: Łukasz Twardowski, https://linkedin.com/in/twardowski