
CTG Data Science Lab, August 17, 2016

Multi-armed Bandit Problem: Potential Improvement for DARTS

Aniruddha Bhargava, Yika Yujia Luo


Agenda

1. Problem Overview

2. Algorithms

Non-contextual cases

Contextual cases

3. Industry Review

4. Advanced Topics


Problem Overview


When do we run into the Multi-armed Bandit Problem (MAB)?

Examples: gambling, research funding, clinical trials, and content management.


What is the Multi-armed Bandit Problem (MAB)?

Goal: Pick the best restaurant efficiently

Logistics: Select a restaurant for each person, who leaves you a tip afterwards

How?

[Illustration: individual tips of $1, $8, $10 and $3, $6, $6 observed at the three restaurants, with running averages of $2, $7, and $6.]


MAB Terminology

Arm: a restaurant.

User: a person sent to a restaurant.

Reward: the tip.

Exploration: learning people's preferences; always involves a certain degree of randomness.

Exploitation: using the current, reliable knowledge to select the restaurant that looks best.

Expected reward: the average tip a restaurant would earn in the long run.

Regret: the expected tip loss from sending a person to a restaurant that is not the best one.

Policy: the strategy you use to select restaurants.

Total cumulative regret: the total tips you lose; the standard performance measure for bandit algorithms.

Example: with expected tips of $1, $8, and $10 at the three restaurants, sending a person to the $1 restaurant incurs $9 of regret, the $8 restaurant $2, and the $10 restaurant $0. Sending two people to the $1 restaurant, one to the $8 restaurant, and one to the $10 restaurant gives a total regret of $9 + $9 + $2 + $0 = $20.
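A minimal sketch of that bookkeeping, using the illustrative expected-tip values from the example above (restaurant labels are placeholders):

```python
# Tally total cumulative regret for the example above.
expected_tips = {"r1": 1.0, "r2": 8.0, "r3": 10.0}
best = max(expected_tips.values())                 # $10, the best restaurant

choices = ["r1", "r1", "r2", "r3"]                 # where four people were sent
total_regret = sum(best - expected_tips[c] for c in choices)
print(total_regret)                                # 9 + 9 + 2 + 0 = 20
```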


MAB Big Picture

MAB sits at the intersection of decision making and optimization:

Decision making: choose the best product (find the best restaurant to send people to).

Optimization: minimize total regret (avoid sending people to bad restaurants as much as possible).


Algorithms (Non-contextual Cases)

“Anytime you are faced with the problem of both exploring and exploiting a search space, you have a bandit problem. Any method of solving that problem is a bandit algorithm”

-- Chris Stucchio


Non-contextual vs. Contextual

Non-contextual case (user → product, no features used).

Key point: although everyone has different tastes, we pick one best restaurant for everyone.


MAB Policies

A/B testing

Adaptive policies: ε-greedy, Upper Confidence Bound (UCB), Thompson Sampling

… and there are many more bandit algorithms.


A/B Testing

Exploration phase: each person i is assigned 100% at random, i.e. to each of the three restaurants with probability 33.3%.

Exploitation phase: once the test ends, every person j is sent to the winning restaurant 100% of the time.
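A minimal explore-then-commit sketch of this policy; the restaurant names and the exploration length are illustrative assumptions, not from the deck:

```python
import random

# A/B testing as explore-then-commit: assign the first EXPLORE_PEOPLE users
# uniformly at random, then send everyone to the best-looking restaurant.
restaurants = ["McDonald's", "Subway", "Chili's"]   # illustrative arms
avg_tips = {r: 0.0 for r in restaurants}
counts = {r: 0 for r in restaurants}
EXPLORE_PEOPLE = 300                                # illustrative test length

def select():
    if sum(counts.values()) < EXPLORE_PEOPLE:
        return random.choice(restaurants)               # explore: 33.3% each
    return max(restaurants, key=lambda r: avg_tips[r])  # exploit: 100% to winner

def update(restaurant, tip):
    counts[restaurant] += 1
    avg_tips[restaurant] += (tip - avg_tips[restaurant]) / counts[restaurant]
```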


ε-greedy

Select (with ε = 0.2): with probability 80%, send person i to the restaurant with the highest average tips; with probability 20% (= ε), send them to a restaurant chosen uniformly at random (33.3% each).

Update: record person i's feedback and update that restaurant's average tips.
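A minimal sketch of that select/update loop (restaurant names illustrative):

```python
import random

# ε-greedy: explore with probability ε, otherwise exploit the best average.
restaurants = ["McDonald's", "Subway", "Chili's"]
avg_tips = {r: 0.0 for r in restaurants}
counts = {r: 0 for r in restaurants}
EPSILON = 0.2

def select():
    if random.random() < EPSILON:
        return random.choice(restaurants)               # explore (20%)
    return max(restaurants, key=lambda r: avg_tips[r])  # exploit (80%)

def update(restaurant, tip):
    # Record the person's feedback as a running average of tips.
    counts[restaurant] += 1
    avg_tips[restaurant] += (tip - avg_tips[restaurant]) / counts[restaurant]
```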


Upper Confidence Bound (UCB)

Select (100%, no randomness): send person i to the restaurant with the highest upper confidence bound on its average tips. In UCB1 the bound for restaurant j combines the average tips from restaurant j, the number of people sent to restaurant j, and the total number of people: UCB_j = (average tips from restaurant j) + sqrt(2 ln(#people) / #people sent to restaurant j).

Update: record person i's feedback and update that restaurant's average tips and its confidence bound.
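A minimal UCB1-style sketch of that loop; the restaurant names are illustrative:

```python
import math

# UCB1: average tips plus an uncertainty bonus that shrinks with more visits.
restaurants = ["McDonald's", "Subway", "Chili's"]
avg_tips = {r: 0.0 for r in restaurants}
counts = {r: 0 for r in restaurants}

def ucb(restaurant, total_people):
    if counts[restaurant] == 0:
        return float("inf")  # try every restaurant at least once
    bonus = math.sqrt(2 * math.log(total_people) / counts[restaurant])
    return avg_tips[restaurant] + bonus

def select():
    total_people = sum(counts.values()) + 1
    return max(restaurants, key=lambda r: ucb(r, total_people))

def update(restaurant, tip):
    counts[restaurant] += 1
    avg_tips[restaurant] += (tip - avg_tips[restaurant]) / counts[restaurant]
```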


Thompson Sampling (Bayesian)

Sampling: maintain a distribution over each restaurant's average tips (e.g. McDonald's, Subway, Chili's) and randomly draw one value from each distribution.

Select (100%, driven by the sampling): send person i to the restaurant whose draw has the highest tips.

Update: record person i's feedback and update that restaurant's average-tip distribution.
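A minimal sketch, assuming a Gaussian model for each restaurant's average tips; the N(0, 1) prior and unit-variance likelihood are illustrative modelling choices, not from the deck:

```python
import random

# Thompson sampling with a Gaussian posterior over each restaurant's mean tip.
restaurants = ["McDonald's", "Subway", "Chili's"]
sum_tips = {r: 0.0 for r in restaurants}
counts = {r: 0 for r in restaurants}

def select():
    # Draw one value from each posterior, send the person to the highest draw.
    draws = {}
    for r in restaurants:
        post_var = 1.0 / (1.0 + counts[r])     # posterior variance
        post_mean = sum_tips[r] * post_var     # posterior mean
        draws[r] = random.gauss(post_mean, post_var ** 0.5)
    return max(draws, key=draws.get)

def update(restaurant, tip):
    counts[restaurant] += 1
    sum_tips[restaurant] += tip
```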


Thompson Sampling (Bayesian)

[Figure: two pairs of posterior distributions over average tips; in one the overlap gives Pr(r < b) = 10%, in the other Pr(r < b) = 0.01%.]


Algorithm Comparison

1. Exploration vs. Exploitation

2. Total Regret

3. Batch Update


Algorithm Comparison: Exploration vs. Exploitation

Key point: exploration costs money!

[Chart: exploration rate (%) against time (%) for A/B testing, ε-greedy, and UCB/Thompson.]


Algorithm Comparison: Total Regret

[Charts: share of people sent to restaurants M, S, and C over time under an adaptive policy vs. A/B testing; one allocation is 44% / 28% / 28%, the other 70% / 18% / 12%.]


Algorithm Comparison: Batch Update

Batch update: the system asks users questions and stores many answers before updating the model.

Robustness to batch updates:
A/B Testing: very robust | ε-greedy: depends | UCB: not robust | Thompson: robust


Algorithm Comparison: Summary

A/B Testing
Pros: easy to implement; good for a small number of arms; robust to batch updates.
Cons: high total regret.

ε-greedy
Pros: easy to implement; if a good ε is found, lower total regret and faster at finding the best arm than ε-first.
Cons: need to figure out a good ε; high total regret.

UCB
Pros: good for a large number of arms; finds the best arm fast; low total regret.
Cons: not robust to batch updates.

Thompson Sampling
Pros: finds the best arm fast; low total regret; robust to batch updates.
Cons: sensitive to statistical assumptions.


Non-contextual vs. Contextual

Contextual case: we can use user features (e.g. female, vegetarian, married, Latino) and product features (e.g. burger, non-vegetarian, cheap, good service).

Key point: everyone has different tastes, so we pick one best restaurant for each person.



Algorithms (Contextual Bandits)


What do we mean by context?

User side:
Likes spicy food, refined tastes, plays violin, male, …
From Wisconsin, likes German food, likes football, male, …
Student, doesn't like seafood, allergic to cats, female, …
Chief of AFC, watches shows on competitive eating, female, …

Arm side:
Tex-Mex style, sit-down dining, founded in 1975, …
Serves sandwiches, has veggie options, founded in 1965, …
Breakfast, lunch, and dinner, cheap, founded in 1940, …


User Context

[Chart: average reward over time for a non-contextual policy vs. a contextual (user) policy, with the best possible reward without context and the best possible reward with context shown as reference levels.]


Arm Context

[Chart: average reward over time for a non-contextual policy, a policy using only arm context, and a policy using both arm and user context, with the best possible reward with and without user context shown as reference levels.]

Takeaway message: user context can increase the optimal reward; arm context can get you there faster!


Exploiting Context

User side:
Population segmentation (e.g. DARTS)
Clustering users
Learning an embedding

Arm side:
Linear models: LinUCB, Linear TS, OFUL
Maintain an estimate of the best arm
More data → shrinking uncertainty


Exploiting User Context

Assumptions:
• Users can be represented as points in space
• Users cluster together, so points that are close are similar
• Stationarity


Exploiting User Context

[Figure sequence: users (Joe, Yao, Nichola, Peter, Aniruddha, Rachel, Sophie, Yika, Vineeta, Jason, Andre, Chris, Madeline, John) plotted in a feature space with a spicy–mild axis and a meat–vegetarian axis.]

Linear: separate the users with a straight decision boundary.

Quadratic: separate the users with a curved decision boundary.

Hierarchical: recursively split the users into clusters and give each cluster its own distribution over the restaurants (e.g. 80% / 15% / 5% for one cluster and 5% / 15% / 80% for another), refining the splits and the probabilities as more feedback arrives.


Exploiting Arm Context: Linear Models

Look at only arm context, with no user context.

Assumptions:
• We can represent arms as vectors.
• Rewards are a noisy version of the inner product.
• Stationarity.

Methods include:
• Linear UCB
• Linear Thompson Sampling
• OFUL (Optimism in the Face of Uncertainty – Linear)
• … and many more.


The Math Slide

Standard noisy linear model: rt = xtᵀ θ* + ηt

θ*: the optimal arm (unknown parameter vector)
xt: arm pulled at time t
rt: reward at time t
ηt: noise at time t
Ct: confidence set
λ: ridge term
Xt: matrix of all arms pulled up to time t

Collect all the data and write: r = X θ* + η
Least-squares solution: θLS = (XᵀX)⁻¹ Xᵀr
Ridge regression: θLSR = (XᵀX + λI)⁻¹ Xᵀr

Typical linear bandit algorithm:
θ0 = 0
for t = 0, 1, 2, …
  xt = argmax over x ∈ Ct of xᵀθt
  θt+1 = (XtᵀXt + λI)⁻¹ Xtᵀrt, where Xt and rt collect the arms pulled and rewards observed up to time t
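A minimal sketch of this loop in a LinUCB-flavored form (ridge estimate of θ* plus an optimism bonus in place of the explicit confidence-set argmax); the toy arm vectors, λ, and the exploration weight α are illustrative assumptions:

```python
import numpy as np

d = 3                                    # feature dimension
arms = [np.array([1.0, 0.0, 0.0]),
        np.array([0.0, 1.0, 0.0]),
        np.array([0.5, 0.5, 0.7])]
lam, alpha = 1.0, 1.0                    # ridge term and exploration weight

A = lam * np.eye(d)                      # Xt^T Xt + lambda * I
b = np.zeros(d)                          # Xt^T rt

def select():
    # Pick the arm maximizing x^T theta plus an uncertainty bonus.
    A_inv = np.linalg.inv(A)
    theta = A_inv @ b                    # ridge-regression estimate of theta*
    scores = [x @ theta + alpha * np.sqrt(x @ A_inv @ x) for x in arms]
    return int(np.argmax(scores))

def update(arm_index, reward):
    global A, b
    x = arms[arm_index]
    A += np.outer(x, x)
    b += reward * x
```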


Exploiting Arm Context

[Figure: the set of arms x1, x2, … (dishes such as mince pie, buffalo wings, tofu scramble, grilled vegetables, ratatouille, tandoori chicken, jalapeño scramble, pad thai, penne arrabiata) plotted on the spicy–mild vs. meat–vegetarian plane, with θ*, the optimal arm, marked.]

[Figure: the next arm chosen (buffalo wings) makes an angle θ with the optimal arm. The reward (= cos θ) is small, but we can still infer information about the other arms!]

[Figure: after this pull we form an estimate θ1 of the optimal arm, together with a region of uncertainty (the confidence set C1).]

[Figure: the next arm chosen, x2, already lies close to the estimate of the optimal arm: we've already homed in on a pretty good choice.]

[Figure: and the process continues: the estimate is updated to θ2 and the region of uncertainty shrinks to C2.]


Some Caveats

• Big assumption that we know good features.
• Finding features takes a lot of work.
• Few arms, many people → learn an embedding of the arms.
• Few people, many arms → featurize the arms and use linear bandits.
• Linear models are a naive assumption; see kernel methods.



Industry Review


Companies using MAB


Headlines, Photos, and Ads

Washington Post and Google


Washington Post

Used Upper Confidence Bound (UCB) to pick headlines and photos.


Google Experiments

• Used Thompson Sampling (TS)
• Updated models twice a day
• Two metrics used to decide when an experiment ends:
  – 95% confidence that an alternative is better, or
  – the "potential value remaining in the experiment"

Takeaway message: the more arms, the higher the gain over A/B testing.


Advanced Topics


Topics

• Bias
• Data joining and latency
• Non-stationarity


Bias

Website 1 showed two options with probability 50% / 50% and sold 100 / 20 of them.

Website 2 showed two options with probability 90% / 10% and sold 100 / 20 of them.

Who did better?


Bias

• Be careful when using past data!
• Inverse propensity score weighting corrects for the unequal exposure probabilities.
• New sales estimates (reweighted to a common 50/50 allocation):

Website 1: 100*0.5 + 20*0.5 = 60
Website 2: 100*0.5*(0.5/0.9) + 20*0.5*(0.5/0.1) ≈ 78
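A minimal sketch of that reweighting arithmetic; the helper function name is an illustrative choice:

```python
# Reweight each option's observed sales to a common 50/50 allocation using
# inverse propensity scores: sold * target_prob * (target_prob / actual_prob).
def reweighted_sales(sold, shown_probs, target_prob=0.5):
    return sum(s * target_prob * (target_prob / p)
               for s, p in zip(sold, shown_probs))

print(reweighted_sales([100, 20], [0.5, 0.5]))  # Website 1: 60.0
print(reweighted_sales([100, 20], [0.9, 0.1]))  # Website 2: ~77.8
```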


Data Joining and Latency

[Diagram, courtesy of the Microsoft MWT white paper: the logged (context, decision) events and the rewards arrive as separate streams and must be joined; rewards come back with latency.]


Non-Stationarity – Beer Example

My yearly beer taste:

January: stouts and porters
April: pale ales and IPAs
July: wits and lagers
October: Oktoberfests and reds
December: Christmas ales


Non-Stationarity

Preferences change over time. There may be periodicity in the data; tax season is a great example.

Some solutions:

• Slow changes → a system with finite memory (see the sketch below)
• Abrupt changes → subspace tracking / anomaly detection
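A minimal sliding-window sketch of the finite-memory idea; the window size and class name are illustrative choices, not from the deck:

```python
from collections import deque

# Track an arm's average reward over only the last `window` observations,
# so old preferences eventually fall out of the estimate.
class SlidingWindowAverage:
    def __init__(self, window=500):
        self.rewards = deque(maxlen=window)   # oldest observations drop off

    def update(self, reward):
        self.rewards.append(reward)

    def value(self):
        return sum(self.rewards) / len(self.rewards) if self.rewards else 0.0
```

A windowed (or discounted) average like this can stand in for the running averages used by ε-greedy or UCB above.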

Takeaway message: preferences change over time, biases are added, and data needs to be joined from different sources.


Thank You. Questions?
