Bandit Basics – A Different Take on Online Optimization


Page 1: Bandit Basics – A Different Take on Online Optimization

Page 2: Who Is This Guy?

Matt Gershoff

CEO: Conductrics

Many Years in Database Marketing (New York and Paris) and a bit of Web Analytics

www.conductrics.com

twitter: @mgershoff  Email: [email protected]

Page 3: Speak Up

Page 4: What Are We Going to Hear?

• Optimization Basics

• Multi-Armed Bandit

– It's a Problem, Not a Method

• Some Methods

– A/B Testing

– Epsilon-Greedy

– Upper Confidence Bound (UCB)

• Some Results

Page 5: Choices · Targeting · Learning · Optimization

Page 6: OPTIMIZATION

If THIS Then THAT (ITTT) brings together:

1. Decision Rules

2. Predictive Analytics

3. Choice Optimization

Page 7: OPTIMIZATION

Find and Apply the Rule with the most Value.

(Slide visual: a field of many candidate "If THIS Then THAT" rules.)

Page 8: OPTIMIZATION

(Diagram: If THIS Then THAT as inputs and outputs.)

THIS – Inputs, variables whose values are given to you: Facebook, High Spend, Urban GEO, …, Home Page, App Use.

THAT – Outputs, variables whose values you control: Offer A, Offer B, Offer C, …, Offer Y, Offer Z.

A predictive model maps the input features F1, F2, …, Fm through a weighted sum (Σ) to an estimated Value_i for each choice. (A toy code sketch follows below.)
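To make the picture concrete, here is a toy sketch of that scoring step. All feature names and weights are hypothetical, invented for illustration; this is not Conductrics code:

```python
# THIS: variables whose values are given to you (one visitor's features).
visitor = {"facebook": 1, "high_spend": 0, "urban_geo": 1, "app_use": 1}

# THAT: variables you control, each with per-feature weights (F1..Fm).
weights = {
    "Offer A": {"facebook": 0.4, "high_spend": 0.1, "urban_geo": 0.2, "app_use": 0.0},
    "Offer B": {"facebook": 0.1, "high_spend": 0.5, "urban_geo": 0.0, "app_use": 0.3},
    "Offer C": {"facebook": 0.2, "high_spend": 0.2, "urban_geo": 0.3, "app_use": 0.2},
}

def value(offer):
    """Weighted sum of features: the model's estimated value of this rule."""
    return sum(w * visitor.get(f, 0) for f, w in weights[offer].items())

# Find and apply the rule with the most value.
best = max(weights, key=value)
print(best, round(value(best), 2))
```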

Page 9: But …

For the THAT side – Offer A?, Offer B?, Offer C?, …, Offer Y?, Offer Z? – the values are unknown:

1. We Don't Have Data on 'THAT'

2. Need to Collect – Sample

3. How to Sample Efficiently?

Page 10: Where

Marketing Applications:

• Websites

• Mobile

• Social Media Campaigns

• Banner Ads

Pharma: Clinical Trials


Page 11: What Is a Multi-Armed Bandit?

A or B?

One-Armed Bandit → Slot Machine

The problem: how do you pick between slot machines so that you walk out of the casino with the most $$$ at the end of the night?

Page 12: Objective

Pick so as to get the most return/profit you can over time.

Technical term: Minimize Regret.
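In symbols (a standard textbook definition, not spelled out on the slide): if $\mu^{*}$ is the mean payoff of the best arm and $r_t$ is the reward earned at play $t$, then after $T$ plays the regret is

$$ R_T = T\,\mu^{*} - \mathbb{E}\left[\sum_{t=1}^{T} r_t\right] $$

and a good bandit policy keeps $R_T$ growing as slowly as possible.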


Page 13: … but How to Pick?

A or B?

Sequential Selection: you need to sample, but do it efficiently.

Page 14: Explore – Collect Data

• Data Collection is costly – an investment.

• Be efficient – balance the potential value of collecting new data with exploiting what you currently know.

Page 15: Multi-Armed Bandits

"Bandit problems embody in essential form a conflict evident in all human action: choosing actions which yield immediate reward vs. choosing actions … whose benefit will come only later."* – Peter Whittle

*Source: Qing Zhao, UC Davis. Plenary talk at SPAWC, June 2010.

Page 16: Exploration vs. Exploitation

1) Explore/Learn – try out different actions to learn how they perform over time. This is a data collection task.

2) Exploit/Earn – take advantage of what you have learned to get the highest payoff: your current best guess.

Page 17: Not a New Problem

1933 – first work on competing options

1940 – a WWII problem the Allies attempted to tackle

1953 – Bellman formulates it as a Dynamic Programming problem

Source: http://www.lancs.ac.uk/~winterh/GRhist.html

Page 18: Testing

• Explore First

– All actions have an equal chance of selection (uniform random).

– Use hypothesis testing to select a 'Winner'.

• Then Exploit – keep only the 'Winner' for selection (a sketch of this explore-then-exploit pattern follows below).
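A minimal simulation of that pattern, with two options whose true conversion rates are made up for illustration (my sketch, not Conductrics code):

```python
import math
import random

# Assumed true conversion rates, for simulation only.
TRUE_RATE = {"A": 0.10, "B": 0.12}

def pull(arm):
    return 1 if random.random() < TRUE_RATE[arm] else 0

# Explore first: every action has an equal chance of selection.
wins = {"A": 0, "B": 0}
n = {"A": 0, "B": 0}
for _ in range(10_000):
    arm = random.choice(["A", "B"])
    n[arm] += 1
    wins[arm] += pull(arm)

# Hypothesis test: two-proportion z statistic on the explore data.
p_a, p_b = wins["A"] / n["A"], wins["B"] / n["B"]
pooled = (wins["A"] + wins["B"]) / (n["A"] + n["B"])
se = math.sqrt(pooled * (1 - pooled) * (1 / n["A"] + 1 / n["B"]))
z = (p_b - p_a) / se

# Then exploit: keep only the 'Winner' (95% two-sided threshold).
winner = ("B" if z > 0 else "A") if abs(z) > 1.96 else None
print(f"p_A={p_a:.3f} p_B={p_b:.3f} z={z:.2f} winner={winner}")
```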


Page 19: Learn First

(Timeline: an Explore/Learn phase – data collection/sample – followed by an Exploit/Earn phase – apply learning.)

Page 20: P-Values: A Digression

P-Values are:

• NOT the probability that the Null is true, P(Null = True | DATA).

• P(DATA (or more extreme) | Null = True).

• Not a great tool for deciding when to stop sampling – peeking at the p-value as data come in inflates the false-positive rate (see the simulation below). See:

http://andrewgelman.com/2010/09/noooooooooooooo_1/

http://www.stat.duke.edu/~berger/papers/02-01.pdf
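A small simulation of the stopping problem (my own illustration, not from the deck): both arms share the same true rate, yet checking a z-test after every batch and stopping at |z| > 1.96 flags a "winner" far more often than the nominal 5%.

```python
import math
import random

def peeking_trial(rate=0.10, batches=100, batch_size=100):
    s = [0, 0]  # conversions per arm
    n = [0, 0]  # visitors per arm
    for _ in range(batches):
        for i in (0, 1):  # both arms have the same true rate: null is true
            n[i] += batch_size
            s[i] += sum(random.random() < rate for _ in range(batch_size))
        pooled = (s[0] + s[1]) / (n[0] + n[1])
        se = math.sqrt(pooled * (1 - pooled) * (1 / n[0] + 1 / n[1]))
        if se > 0 and abs(s[0] / n[0] - s[1] / n[1]) / se > 1.96:
            return True  # we would have stopped on a false positive
    return False

false_pos = sum(peeking_trial() for _ in range(200)) / 200
print(f"Null runs stopped as 'significant': {false_pos:.0%}")
```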


Page 21: A Couple of Other Methods

1. Epsilon-Greedy – nice and simple

2. Upper Confidence Bound (UCB) – adapts to uncertainty

Page 22: 1) Epsilon-Greedy

Page 23: Greedy

What do you mean by 'Greedy'? Make whatever choice seems best at the moment.

Page 24: Epsilon-Greedy

What do you mean by 'Epsilon-Greedy'?

• Explore – randomly select an action ε percent of the time (say 20%).

• Exploit – play greedy (pick the current best) 1 − ε of the time (say 80%).

Page 25: Epsilon-Greedy

(Flow per user: Explore/Learn 20% of the time – select randomly, like A/B testing; Exploit/Earn 80% of the time – select the current best, i.e. be greedy.)

Page 26: Epsilon-Greedy

Action | Value
------ | -----
A      | $5.00
B      | $4.00
C      | $3.00
D      | $2.00
E      | $1.00

80% of the time: select the best (A). 20% of the time: select at random. (A code sketch follows below.)
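A minimal epsilon-greedy sketch in Python (illustrative only; the value estimates are seeded from the table above, whereas in practice they start empty and are learned from observed rewards):

```python
import random

EPSILON = 0.20  # fraction of traffic used to explore

# Estimated value per action, seeded from the slide's table.
est_value = {"A": 5.0, "B": 4.0, "C": 3.0, "D": 2.0, "E": 1.0}
n_plays = {a: 1 for a in est_value}  # pretend one observation each

def select_action():
    if random.random() < EPSILON:               # Explore (20%): pick at random
        return random.choice(list(est_value))
    return max(est_value, key=est_value.get)    # Exploit (80%): pick current best

def update(action, reward):
    # Incremental mean: new = old + (reward - old) / n
    n_plays[action] += 1
    est_value[action] += (reward - est_value[action]) / n_plays[action]
```

Each visit calls select_action(), serves the chosen offer, and then feeds the observed payoff back through update().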


Page 27: Continuous Sampling

(Timeline: Explore/Learn and Exploit/Earn run side by side over the whole time, rather than in separate phases.)

Page 28: Epsilon-Greedy

– Super simple / low cost to implement

– Tends to be surprisingly effective

– Less affected by 'Seasonality'

– Not optimal (hard to pick the best ε)

– Doesn't use a measure of variance

– Should you decrease exploration over time, and how?

Page 29: Upper Confidence Bound

Basic idea:

1) Calculate both a mean and a measure of uncertainty (variance) for each action.

2) Make greedy selections based on mean + uncertainty bonus.
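As a formula (one common form, consistent with the slides' confidence-interval framing; $\bar{x}_a$ is the action's mean reward, $s_a$ its standard deviation, $n_a$ its play count, and $z$ the chosen interval width):

$$ \mathrm{score}_a = \bar{x}_a + z \cdot \frac{s_a}{\sqrt{n_a}} $$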


Page 30: Confidence Interval Review

Confidence Interval = Mean ± z·Std

(Figure: a bell curve spanning Mean − 2·Std to Mean + 2·Std.)


Page 31: Upper Confidence = Mean + Bonus

Score each option using the upper portion of the interval as a bonus (sketched in code below).
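A minimal UCB sketch for 0/1 conversion rewards (my illustration; it uses mean plus z times the standard error as the bonus, matching the slides' framing, rather than the UCB1 log formula):

```python
import math

Z = 2.0  # interval width: roughly a 95% upper bound

class Arm:
    def __init__(self):
        self.n = 0        # times this action was served
        self.mean = 0.0   # running mean reward

    def update(self, reward):
        self.n += 1
        self.mean += (reward - self.mean) / self.n

    def score(self):
        if self.n == 0:
            return float("inf")  # force every action to be tried once
        # Standard error of a Bernoulli mean: sqrt(p(1-p)/n)
        se = math.sqrt(self.mean * (1 - self.mean) / self.n)
        return self.mean + Z * se  # mean + uncertainty bonus

arms = {"A": Arm(), "B": Arm(), "C": Arm()}

def select_action():
    # Greedy on the upper bound: exploration happens automatically,
    # because rarely served arms carry a bigger bonus.
    return max(arms, key=lambda name: arms[name].score())
```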


Page 32: Upper Confidence Bound

(Chart: estimated reward for actions A, B, and C on a $0–$10 scale, each with its confidence interval.)

1) Use the upper portion of each CI as a 'Bonus'.

2) Make greedy selections → Select A.

Page 33: Upper Confidence Bound

(Chart: the same three actions after more data on A.)

1) Selecting action 'A' reduces its uncertainty bonus (because there is more data).

2) Action 'C' now has the highest score → Select C.

Page 34: Upper Confidence Bound

• Like an A/B Test – uses a variance measure

• Unlike an A/B Test – no hypothesis test

• Automatically balances Exploration with Exploitation

Page 35: Case Study

Treatment | Conversion Rate | Served
V2V3      | 9.9%            | 14,893
V2V2      | 9.7%            |  9,720
V2V1      | 8.0%            |  2,441
V1V3      | 3.3%            |  2,090
V2V3      | 2.6%            |  1,849
V2V2      | 2.0%            |  1,817
V1V1      | 1.8%            |  1,926
V3V1      | 1.8%            |  1,821
V1V2      | 1.5%            |  1,873

Page 36: Case Study

Test Method  | Conversion Rate
Adaptive     | 7%
Non-Adaptive | 4.5%

Page 37: A/B Testing vs. Bandit

(Diagram: traffic being allocated across Option A, Option B, and Option C.)

Page 38: Why Should I Care?

• More Efficient Learning

• Automation

• Changing World

Page 39: Questions?

Page 40: Thank You!

Matt Gershoff

p) 646-384-5151

e) [email protected]

t) @mgershoff
