
Page 1: multi-armed bandit


CTG Data Science Lab, August 17, 2016

Multi-armed Bandit Problem: Potential Improvement for DARTS

Aniruddha Bhargava, Yika Yujia Luo

Page 2: multi-armed bandit


Agenda

1. Problem Overview

2. Algorithms

Non-contextual cases

Contextual cases

3. Industry Review

4. Advanced Topics

Page 3: multi-armed bandit


Problem Overview

Page 4: multi-armed bandit


When do we run into the Multi-armed Bandit Problem (MAB)?

Gambling, Research Funding, Clinical Trials, Content Management

Page 5: multi-armed bandit


What is the Multi-armed Bandit Problem (MAB)?

Goal: Pick the best restaurant efficiently

Logistics: Select a restaurant for each person, who leaves you a tip afterwards

[Slide graphic: each restaurant's collected tips ($1, $8, $10 in one round; $3, $6, $6 in another) and its running average tip, shown as $2, $7, and $6.]
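A minimal Python sketch of the bookkeeping behind this example (restaurant names A, B, C and the tip values are hypothetical): keep an incremental running average of tips per restaurant, which is the estimate every policy below builds on.

counts = {"A": 0, "B": 0, "C": 0}
avg_tip = {"A": 0.0, "B": 0.0, "C": 0.0}

def record_tip(restaurant, tip):
    """Incrementally update one restaurant's running average tip."""
    counts[restaurant] += 1
    avg_tip[restaurant] += (tip - avg_tip[restaurant]) / counts[restaurant]

# Hypothetical tips from two rounds of visitors.
for restaurant, tip in [("A", 1), ("B", 8), ("C", 10), ("A", 3), ("B", 6), ("C", 6)]:
    record_tip(restaurant, tip)

print(avg_tip)   # {'A': 2.0, 'B': 7.0, 'C': 8.0}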

Page 6: multi-armed bandit


MAB Terminology

Exploration: the process of learning people's preferences; it always involves a certain degree of randomness

Exploitation: using the current, reliable estimates to select the best-looking restaurant

Arm: a restaurant

Expected Reward: the average tip a restaurant yields in the long run

Regret: the expected tip loss from sending a person to a restaurant that is not the best

Policy: the strategy you use to select restaurants

Total Cumulative Regret: the total tips you lose; the standard performance measure for bandit algorithms

User: the people sent to the restaurants

Reward: tips

[Slide graphic: people are sent to restaurants whose expected tips are $1, $8, and $10. Against the best restaurant ($10), the per-person regrets shown are $9, $9, $2, and $0, adding up to a total regret of $20; the tips actually received are shown as $0, $8, $2, and $6.]
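A minimal sketch of how total cumulative regret is computed from the expected tips on this slide ($1, $8, $10); the sequence of restaurant choices is hypothetical but reproduces the slide's total of $20.

expected_tip = {"A": 1.0, "B": 8.0, "C": 10.0}        # expected tips from the slide
best = max(expected_tip.values())                      # the best restaurant pays $10 on average

choices = ["A", "A", "B", "C"]                         # a hypothetical sequence of assignments
regrets = [best - expected_tip[c] for c in choices]    # [9.0, 9.0, 2.0, 0.0]
total_cumulative_regret = sum(regrets)                 # 20.0, matching the slide's $20
print(regrets, total_cumulative_regret)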

Page 7: multi-armed bandit


Big Picture

MAB combines decision making with optimization:

Decision making: choose the best product (find the best restaurant to go to)

Optimization: minimize total regret (avoid sending people to bad restaurants as much as possible)

Page 8: multi-armed bandit


Algorithms (Non-contextual Cases)

“Anytime you are faced with the problem of both exploring and exploiting a search space, you have a bandit problem. Any method of solving that problem is a bandit algorithm”

-- Chris Stucchio

Page 9: multi-armed bandit


Non-contextual vs. Contextual

[Diagram: users on one side and products (restaurants) on the other; no individual features are used.]

IMPORTANT THING HERE: Although everyone has different tastes, we pick one best restaurant for everyone

Page 10: multi-armed bandit


MAB Policies

A/B Testing

Adaptive: ε-greedy, Upper Confidence Bound (UCB), Thompson Sampling

… and there are many more bandit algorithms.

Page 11: multi-armed bandit


A/B Testing

Exploration: person i is routed 100% at random, so each of the three restaurants receives 33.3% of the traffic.

Exploitation: once the test is over, person j is sent to the winning restaurant 100% of the time.
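A minimal sketch of this explore-then-exploit behaviour; get_tip is a hypothetical callback returning the tip a person leaves, and n_explore is an assumed length for the test phase.

import random

def ab_test(restaurants, get_tip, n_explore=300):
    """Explore: route n_explore people uniformly at random. Exploit: everyone afterwards goes to the winner."""
    totals = {r: 0.0 for r in restaurants}
    counts = {r: 0 for r in restaurants}
    for _ in range(n_explore):                        # 100% random; ~33.3% per restaurant with three arms
        r = random.choice(restaurants)
        totals[r] += get_tip(r)
        counts[r] += 1
    # The restaurant with the best observed average gets 100% of the traffic from now on.
    return max(restaurants, key=lambda r: totals[r] / max(counts[r], 1))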

Page 12: multi-armed bandit


ε-greedy

Select (ε = 0.2): with probability 80%, send person i to the restaurant with the highest average tips; with probability 20%, send them to a restaurant chosen uniformly at random (33.3% each).

Update: record person i's feedback and update that restaurant's average tip value.
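A minimal sketch of the select/update loop above; the avg_tip and counts dictionaries are whatever the caller maintains per restaurant.

import random

def epsilon_greedy_select(avg_tip, epsilon=0.2):
    """80% exploit the best average tip, 20% explore a uniformly random restaurant."""
    restaurants = list(avg_tip)
    if random.random() < epsilon:                  # explore
        return random.choice(restaurants)
    return max(restaurants, key=avg_tip.get)       # exploit

def epsilon_greedy_update(avg_tip, counts, restaurant, tip):
    """Record person i's feedback and update that restaurant's average tip value."""
    counts[restaurant] += 1
    avg_tip[restaurant] += (tip - avg_tip[restaurant]) / counts[restaurant]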

Page 13: multi-armed bandit


Upper Confidence Bound (UCB)

Select (100% of people): send person i to the restaurant with the highest upper confidence bound.

Update: record person i's feedback and update the upper confidence bound on that restaurant's average tips.

The bound combines the three quantities on the slide (average tips from restaurant j, the number of people who went to restaurant j, and the total number of people); in the standard UCB1 form:

UCB_j = (average tips from restaurant j) + sqrt( 2 * ln(#people) / #people who went to restaurant j )
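A minimal sketch of the select step, assuming the standard UCB1 index (average tip plus an exploration bonus that shrinks as a restaurant collects more visits); the update step is the same running-average bookkeeping as before.

import math

def ucb_select(avg_tip, counts, total_people):
    """Send the next person to the restaurant with the highest upper confidence bound."""
    def ucb(r):
        if counts[r] == 0:
            return float("inf")                     # visit every restaurant at least once
        return avg_tip[r] + math.sqrt(2.0 * math.log(total_people) / counts[r])
    return max(avg_tip, key=ucb)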

Page 14: multi-armed bandit


Thompson Sampling (Bayesian)

Sampling: maintain a distribution over each of the three restaurants' average tips (McDonald's, Subway, Chili's on the slide) and randomly draw one value from each distribution.

Select (100% of people): send person i to the restaurant with the highest sampled tip.

Update: record person i's feedback and update that restaurant's average-tip distribution.

[Slide graphic: the three estimated distributions over average tips ($).]
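A minimal Thompson-sampling sketch; modelling each restaurant's average tip as a Gaussian whose spread shrinks with the number of visits is a simplifying assumption for illustration, not the slide's exact model.

import math
import random

def thompson_select(avg_tip, counts, prior_std=3.0):
    """Draw one plausible mean tip per restaurant, then pick the restaurant with the highest draw."""
    samples = {}
    for r in avg_tip:
        std = prior_std / math.sqrt(counts[r] + 1)   # uncertainty shrinks as a restaurant is visited more
        samples[r] = random.gauss(avg_tip[r], std)
    return max(samples, key=samples.get)

def thompson_update(avg_tip, counts, restaurant, tip):
    """Record the feedback and refresh that restaurant's estimated average tip."""
    counts[restaurant] += 1
    avg_tip[restaurant] += (tip - avg_tip[restaurant]) / counts[restaurant]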

Page 15: multi-armed bandit


Thompson Sampling (Bayesian)

[Slide graphic: two pairs of overlapping posterior distributions. With little data, Pr(r < b) = 10%; after more data the distributions separate and Pr(r < b) = 0.01%.]

Page 16: multi-armed bandit


Algorithm Comparison

1. Exploration vs. Exploitation

2. Total Regret

3. Batch Update

Page 17: multi-armed bandit


Algorithm Comparison: Exploration vs. Exploitation

IMPORTANT THING HERE: Exploration costs money!

[Chart: exploration (%) versus time (%) for A/B Testing, ε-greedy, and UCB/Thompson.]

Page 18: multi-armed bandit


Algorithm Comparison: Total Regret

[Charts: total regret over time for an adaptive policy vs. A/B testing, with the share of people sent to each of the three restaurants (M, S, C): 44% / 28% / 28% in one panel and 70% / 18% / 12% in the other.]

Page 19: multi-armed bandit


Algorithm Comparison: Batch Update

Robustness to batch updates: A/B Testing is very robust; ε-greedy depends; UCB is not robust; Thompson Sampling is robust.

[Diagram: the system poses a question to a user and stores the answer; the model is only updated later, on many answers at once.]

Page 20: multi-armed bandit


Algorithm Comparison: Summary

A/B Testing
• Pros: easy to implement; good for a small number of arms; robust to batch updates
• Cons: high total regret

ε-greedy
• Pros: easy to implement; if a good ε is found, lower total regret and finds the best arm faster than ε-first
• Cons: need to figure out a good ε; high total regret

UCB
• Pros: good for a large number of arms; finds the best arm fast; low total regret
• Cons: not robust to batch updates

Thompson Sampling
• Pros: low total regret; robust to batch updates
• Cons: sensitive to statistical assumptions

Page 21: multi-armed bandit


Non-contextual vs. Contextual

User features: Female, Vegetarian, Married, Latino

Product features: Burger, Non-Vegetarian, Cheap, Good Service

IMPORTANT THING HERE: Everyone has different tastes, so we pick one best restaurant for each person

Page 22: multi-armed bandit


Agenda

1. Problem Overview

2. Algorithms

Non-contextual cases

Contextual cases

3. Industry Review

4. Advanced Topics

Page 23: multi-armed bandit


Algorithms (Contextual Bandits)

Page 24: multi-armed bandit


What do we mean by context?

User side:

• Likes spicy food, refined tastes, plays violin, Male, …

• From Wisconsin, likes German food, likes Football, Male, …

• Student, doesn't like seafood, allergic to cats, Female, …

• Chief of AFC, watches shows on competitive eating, Female, …

Arm side:

• Tex-Mex style, sit-down dining, founded in 1975, …

• Serves sandwiches, has veggie options, founded in 1965, …

• Breakfast, lunch, and dinner, cheap, founded in 1940, …

Page 25: multi-armed bandit


User Context

[Chart: average reward over time. The non-contextual policy converges to the best possible reward without context; the contextual (user) policy converges to the higher best possible reward with context.]

Page 26: multi-armed bandit


Arm Context

[Chart: average reward over time for three policies: non-contextual, arm context only, and both arm and user context. User context raises the best achievable reward, while arm context reaches that ceiling faster.]

Page 27: multi-armed bandit

User context can increase the optimal reward; arm context can get you there faster!

Takeaway Message

Page 28: multi-armed bandit


Exploiting Context

User side:

• Population segmentation (e.g. DARTS)

• Clustering users (sketched below)

• Learning an embedding

Arm side:

• Linear models: LinUCB, Linear TS, OFUL

• Maintain an estimate of the best arm

• More data → shrink uncertainty
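A hedged sketch of the user-side idea (not the DARTS implementation): cluster users on their feature vectors with scikit-learn's KMeans, then give each cluster its own simple bandit; the features, cluster count, and ε value are all assumptions.

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D user features (e.g. a meat vs. vegetarian score and a spicy vs. mild score).
users = np.random.rand(200, 2)
clusters = KMeans(n_clusters=3, n_init=10).fit(users)

n_restaurants = 3
counts = np.zeros((3, n_restaurants))        # one row of statistics per cluster
avg_tip = np.zeros((3, n_restaurants))

def select_restaurant(user_features, epsilon=0.2):
    """Route the user to their cluster, then run that cluster's own epsilon-greedy bandit."""
    c = int(clusters.predict(user_features.reshape(1, -1))[0])
    if np.random.rand() < epsilon:
        return c, np.random.randint(n_restaurants)
    return c, int(np.argmax(avg_tip[c]))

def update(cluster, restaurant, tip):
    """Update the chosen cluster's running average tip for the selected restaurant."""
    counts[cluster, restaurant] += 1
    avg_tip[cluster, restaurant] += (tip - avg_tip[cluster, restaurant]) / counts[cluster, restaurant]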

Page 29: multi-armed bandit


Exploiting User Context

Assumptions:

• Users can be represented as points in space

• Users cluster together, so points that are close are similar

• Stationarity

Page 30: multi-armed bandit


Exploiting User Context

[Scatter plot: team members placed in a two-dimensional taste space, with a meat vs. vegetarian axis and a spicy vs. mild axis; users who sit close together have similar tastes.]

Page 31: multi-armed bandit


Exploiting User Context

Linear: [the same scatter plot, with the user space split by a straight, linear boundary.]

Page 32: multi-armed bandit


Exploiting User Context

Quadratic: [the same scatter plot, split by a curved, quadratic boundary.]

Page 33: multi-armed bandit


Exploiting User Context

Hierarchical: [the same scatter plot; the top level of the hierarchy splits the users into three groups, with percentages of 40%, 35%, and 25% shown.]

Page 34: multi-armed bandit


Exploiting User Context

Hierarchical: [the same plot, with group-membership percentages shown for individual users: 80% / 15% / 5% for one user and 5% / 15% / 80% for another.]

Page 35: multi-armed bandit


Exploiting User Context

Hierarchical: [membership percentages for more users: 5/50/45, 80/15/5, 5/10/85, and 15/80/5 (%).]

Page 36: multi-armed bandit


Exploiting User Context

Hierarchical: [the same plot, with membership percentages for several users: 80/15/5, 5/5/90, 10/45/45, 5/50/45, and 15/80/5 (%).]

Page 37: multi-armed bandit


Exploiting Arm Context: Linear Models

Here we look only at arm context, with no user context.

Assumptions:
• We can represent arms as vectors.
• Rewards are a noisy version of the inner product between the arm vector and an unknown parameter.
• Stationarity.

Methods include:
• Linear UCB
• Linear Thompson Sampling
• OFUL (Optimism in the Face of Uncertainty – Linear)
• … and many more.

Page 38: multi-armed bandit


The Math Slide

Standard noisy linear model: r_t = x_t^T θ* + η_t

θ* : the optimal arm
x_t : the arm pulled at time t
r_t : the reward at time t
η_t : the noise at time t
C_t : the confidence set at time t
λ : the ridge term
X_t : the matrix of all arms pulled up to time t

Collect all the data and write: r = X θ* + η

Least squares solution: θ_LS = (X^T X)^{-1} X^T r

Ridge regression: θ_ridge = (X^T X + λI)^{-1} X^T r

Typical linear bandit algorithm:

θ_0 = 0
for t = 0, 1, 2, …
    x_t = argmax_{x ∈ C_t} x^T θ_t
    θ_{t+1} = (X_t^T X_t + λI)^{-1} X_t^T r_{1:t}

where r_{1:t} is the vector of rewards observed up to time t.
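A minimal numpy sketch of one round of the algorithm above: the ridge estimate of θ plus an optimistic, UCB-style choice over a finite arm set. The per-arm confidence width and the alpha weight stand in for the confidence set C_t and are assumptions of this sketch.

import numpy as np

def linear_bandit_step(arms, X_hist, r_hist, lam=1.0, alpha=1.0):
    """One round of a LinUCB-style linear bandit.

    arms   : (K, d) feature vectors of the candidate arms
    X_hist : (t, d) features of the arms pulled so far
    r_hist : (t,)   rewards observed so far
    """
    d = arms.shape[1]
    A = X_hist.T @ X_hist + lam * np.eye(d)        # X_t^T X_t + lambda * I
    A_inv = np.linalg.inv(A)
    theta = A_inv @ X_hist.T @ r_hist              # ridge estimate of theta*
    # Optimistic score: estimated reward plus a confidence width that shrinks with more data.
    widths = np.sqrt(np.einsum("ki,ij,kj->k", arms, A_inv, arms))
    scores = arms @ theta + alpha * widths
    return int(np.argmax(scores)), theta

# Hypothetical usage: 5 arms in 3 dimensions, no history yet.
arms = np.random.rand(5, 3)
choice, theta = linear_bandit_step(arms, np.empty((0, 3)), np.empty(0))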

Page 39: multi-armed bandit


Exploiting Arm Context

[Diagram: the set of arms x_1, x_2, … (dishes such as mince pie, buffalo wings, tofu scramble, grilled vegetables, ratatouille, tandoori chicken, jalapeño scramble, pad thai, and penne arrabiata) plotted as vectors in the meat vs. vegetarian / spicy vs. mild space, together with θ*, the optimal arm.]

Page 40: multi-armed bandit


Exploiting Arm Context

[Diagram: the arm chosen next (buffalo wings) makes an angle θ with the optimal arm.] The reward (= cos(θ)) is small, but we can still infer information about other arms!

Page 41: multi-armed bandit


Exploiting Arm Context

[Diagram: after this pull we have an estimate θ_1 of the optimal arm surrounded by a region of uncertainty, the confidence set C_1; the next arm is chosen from that region.]

Page 42: multi-armed bandit


Exploiting Arm Context

We've already homed in on a pretty good choice.

[Diagram: the next arm chosen, x_2, already lies close to the optimal arm; the estimate and its region of uncertainty are shown as before.]

Page 43: multi-armed bandit


Exploiting Arm Context

And the process continues …

[Diagram: the updated estimate θ_2 with its smaller confidence set C_2.]

Page 44: multi-armed bandit


Some Caveats

• Big assumption that we know good features.

• Finding features takes a lot of work.

• Few arms, many people → learn an embedding of the arms.

• Few people, many arms → featurize the arms and use linear bandits.

• Linear models are a naive assumption; see kernel methods.

Page 45: multi-armed bandit


Agenda

1. Problem Overview

2. Algorithms

Non-contextual cases

Contextual cases

3. Industry Review

4. Advanced Topics

Page 46: multi-armed bandit


Industry Review

Page 47: multi-armed bandit


Companies using MAB

Page 48: multi-armed bandit


Headlines, Photos and Ads

Washington Post, Google

Page 49: multi-armed bandit


Used Upper Confidence Bound (UCB) to pick headlines and photos

Washington Post

Page 50: multi-armed bandit


Google Experiments

• Used Thompson Sampling (TS)
• Updated models twice a day
• Two metrics used to decide when to end an experiment:
  • 95% confidence that an alternate arm is better, or …
  • the "potential value remaining in the experiment"

Page 51: multi-armed bandit

The more arms, the higher the gain over A/B testing.

Takeaway Message

Page 52: multi-armed bandit


Advanced Topics

Page 53: multi-armed bandit


Topics

• Biasing

• Data Joining and Latency

• Non-stationarity

Page 54: multi-armed bandit


Bias

Website 1 showed its two options with probability 50% / 50% and sold 100 / 20 of them.

Website 2 showed its two options with probability 90% / 10% and sold 100 / 20 of them.

Who did better?

Page 55: multi-armed bandit


Bias

• Be careful when using past data!

• Inverse Propensity Score Matching

• New sales estimates (reweighting both websites to a 50/50 policy):

Website 1: 100*0.5 + 20*0.5 = 60

Website 2: 100*0.5*(0.5/0.9) + 20*0.5*(0.5/0.1) ≈ 77.8
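A minimal sketch that reproduces these estimates: each option's sales are reweighted by (target probability / probability it was actually shown), following the slide's formula including its 0.5 factor.

def reweighted_sales(sold, logged_probs, target_prob=0.5):
    """Inverse-propensity reweighting of per-option sales toward a target (here 50/50) policy."""
    return sum(s * target_prob * (target_prob / p) for s, p in zip(sold, logged_probs))

site1 = reweighted_sales([100, 20], [0.5, 0.5])   # 100*0.5 + 20*0.5                   = 60
site2 = reweighted_sales([100, 20], [0.9, 0.1])   # 100*0.5*(0.5/0.9) + 20*0.5*(0.5/0.1) ≈ 77.8
print(site1, round(site2, 1))                     # Website 2 looks better once the bias is removed.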

Page 56: multi-armed bandit


Data Joining and Latency

[Diagram, courtesy of the Microsoft MWT white paper: the context and decision are logged at decision time, while the rewards arrive only after some latency, so the two streams must be joined before learning.]

Page 57: multi-armed bandit


Non-Stationarity – Beer example

My yearly beer taste:

January: stouts and porters

April: pale ales and IPAs

July: wits and lagers

October: Oktoberfests and reds

December: Christmas ales

Page 58: multi-armed bandit


Non-Stationarity

Preferences change over time, and there may be periodicity in the data; tax season is a great example.

Some solutions:

• Slow changes → a system with finite memory (see the sketch below)

• Abrupt changes → subspace tracking / anomaly detection
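A minimal sketch of the finite-memory idea for slow changes: an exponentially discounted running average forgets old tips, so the estimate can track drifting preferences (gamma is an assumed forgetting factor).

def discounted_update(avg, weight, tip, gamma=0.99):
    """Exponentially discounted running mean: old tips fade away with factor gamma."""
    weight = gamma * weight + 1.0         # discounted effective sample size
    avg = avg + (tip - avg) / weight      # move the estimate toward the newest tip
    return avg, weight

# Hypothetical usage: start from nothing and feed tips as they arrive.
avg, weight = 0.0, 0.0
for tip in [1, 3, 2, 8, 9, 10]:           # preferences drift upward over time
    avg, weight = discounted_update(avg, weight, tip)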

Page 59: multi-armed bandit

Preferences change over time, biases creep into logged data, and data needs to be joined from different sources.

Takeaway Message

Page 60: multi-armed bandit


Thank You. Questions?