
CTG Data Science Lab, August 17, 2016

Multi-armed Bandit Problem: Potential Improvement for DARTS

Aniruddha Bhargava, Yika Yujia Luo


Agenda

1. Problem Overview

2. Algorithms

Non-contextual cases

Contextual cases

3. Industry Review

4. Advanced Topics


Problem Overview


When do we run into the Multi-armed Bandit Problem (MAB)?

Examples: gambling, research funding, clinical trials, and content management.


What is the Multi-armed Bandit Problem (MAB)?

Goal: Pick the best restaurant efficiently

Logistics: Select a restaurant for each person, who leaves you a tip afterwards

How?

[Illustration: individual tips of $1, $8, $10 and $3, $6, $6 observed at the three restaurants, with running averages of $2, $7, and $6.]


MAB Terminology

Arm: a restaurant.

User: a person sent to a restaurant.

Reward: the tip.

Exploration: learning people's preferences; always involves a certain degree of randomness.

Exploitation: using the current, reliable knowledge to select the restaurant that looks best.

Expected reward: the average tip a restaurant would earn in the long run.

Regret: the expected tip loss from sending a person to a restaurant that is not the best one.

Policy: the strategy you use to select restaurants.

Total cumulative regret: the total tips you lose; the standard performance measure for bandit algorithms.

Example: with expected tips of $1, $8, and $10 at the three restaurants, sending a person to the $1 restaurant incurs $9 of regret, the $8 restaurant $2, and the $10 restaurant $0. Sending two people to the $1 restaurant, one to the $8 restaurant, and one to the $10 restaurant gives a total regret of $9 + $9 + $2 + $0 = $20.
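A minimal sketch of that bookkeeping, using the illustrative expected-tip values from the example above (restaurant labels are placeholders):

```python
# Tally total cumulative regret for the example above.
expected_tips = {"r1": 1.0, "r2": 8.0, "r3": 10.0}
best = max(expected_tips.values())                 # $10, the best restaurant

choices = ["r1", "r1", "r2", "r3"]                 # where four people were sent
total_regret = sum(best - expected_tips[c] for c in choices)
print(total_regret)                                # 9 + 9 + 2 + 0 = 20
```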


MAB Big Picture

MAB sits at the intersection of decision making and optimization:

Decision making: choose the best product (find the best restaurant to send people to).

Optimization: minimize total regret (avoid sending people to bad restaurants as much as possible).


Algorithms (Non-contextual Cases)

“Anytime you are faced with the problem of both exploring and exploiting a search space, you have a bandit problem. Any method of solving that problem is a bandit algorithm”

-- Chris Stucchio


Non-contextual vs. Contextual

Non-contextual case (user → product, no features used).

Key point: although everyone has different tastes, we pick one best restaurant for everyone.


MAB Policies

A/B testing

Adaptive policies: ε-greedy, Upper Confidence Bound (UCB), Thompson Sampling

… and there are many more bandit algorithms.


A/B Testing

Exploration phase: each person i is assigned 100% at random, i.e. to each of the three restaurants with probability 33.3%.

Exploitation phase: once the test ends, every person j is sent to the winning restaurant 100% of the time.
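A minimal explore-then-commit sketch of this policy; the restaurant names and the exploration length are illustrative assumptions, not from the deck:

```python
import random

# A/B testing as explore-then-commit: assign the first EXPLORE_PEOPLE users
# uniformly at random, then send everyone to the best-looking restaurant.
restaurants = ["McDonald's", "Subway", "Chili's"]   # illustrative arms
avg_tips = {r: 0.0 for r in restaurants}
counts = {r: 0 for r in restaurants}
EXPLORE_PEOPLE = 300                                # illustrative test length

def select():
    if sum(counts.values()) < EXPLORE_PEOPLE:
        return random.choice(restaurants)               # explore: 33.3% each
    return max(restaurants, key=lambda r: avg_tips[r])  # exploit: 100% to winner

def update(restaurant, tip):
    counts[restaurant] += 1
    avg_tips[restaurant] += (tip - avg_tips[restaurant]) / counts[restaurant]
```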


ε-greedy

Select (with ε = 0.2): with probability 80%, send person i to the restaurant with the highest average tips; with probability 20% (= ε), send them to a restaurant chosen uniformly at random (33.3% each).

Update: record person i's feedback and update that restaurant's average tips.
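A minimal sketch of that select/update loop (restaurant names illustrative):

```python
import random

# ε-greedy: explore with probability ε, otherwise exploit the best average.
restaurants = ["McDonald's", "Subway", "Chili's"]
avg_tips = {r: 0.0 for r in restaurants}
counts = {r: 0 for r in restaurants}
EPSILON = 0.2

def select():
    if random.random() < EPSILON:
        return random.choice(restaurants)               # explore (20%)
    return max(restaurants, key=lambda r: avg_tips[r])  # exploit (80%)

def update(restaurant, tip):
    # Record the person's feedback as a running average of tips.
    counts[restaurant] += 1
    avg_tips[restaurant] += (tip - avg_tips[restaurant]) / counts[restaurant]
```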


Upper Confidence Bound (UCB)

Select (100%, no randomness): send person i to the restaurant with the highest upper confidence bound on its average tips. In UCB1 the bound for restaurant j combines the average tips from restaurant j, the number of people sent to restaurant j, and the total number of people: UCB_j = (average tips from restaurant j) + sqrt(2 ln(#people) / #people sent to restaurant j).

Update: record person i's feedback and update that restaurant's average tips and its confidence bound.
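A minimal UCB1-style sketch of that loop; the restaurant names are illustrative:

```python
import math

# UCB1: average tips plus an uncertainty bonus that shrinks with more visits.
restaurants = ["McDonald's", "Subway", "Chili's"]
avg_tips = {r: 0.0 for r in restaurants}
counts = {r: 0 for r in restaurants}

def ucb(restaurant, total_people):
    if counts[restaurant] == 0:
        return float("inf")  # try every restaurant at least once
    bonus = math.sqrt(2 * math.log(total_people) / counts[restaurant])
    return avg_tips[restaurant] + bonus

def select():
    total_people = sum(counts.values()) + 1
    return max(restaurants, key=lambda r: ucb(r, total_people))

def update(restaurant, tip):
    counts[restaurant] += 1
    avg_tips[restaurant] += (tip - avg_tips[restaurant]) / counts[restaurant]
```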


Thompson Sampling (Bayesian)

Sampling: maintain a distribution over each restaurant's average tips (e.g. McDonald's, Subway, Chili's) and randomly draw one value from each distribution.

Select (100%, driven by the sampling): send person i to the restaurant whose draw has the highest tips.

Update: record person i's feedback and update that restaurant's average-tip distribution.
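A minimal sketch, assuming a Gaussian model for each restaurant's average tips; the N(0, 1) prior and unit-variance likelihood are illustrative modelling choices, not from the deck:

```python
import random

# Thompson sampling with a Gaussian posterior over each restaurant's mean tip.
restaurants = ["McDonald's", "Subway", "Chili's"]
sum_tips = {r: 0.0 for r in restaurants}
counts = {r: 0 for r in restaurants}

def select():
    # Draw one value from each posterior, send the person to the highest draw.
    draws = {}
    for r in restaurants:
        post_var = 1.0 / (1.0 + counts[r])     # posterior variance
        post_mean = sum_tips[r] * post_var     # posterior mean
        draws[r] = random.gauss(post_mean, post_var ** 0.5)
    return max(draws, key=draws.get)

def update(restaurant, tip):
    counts[restaurant] += 1
    sum_tips[restaurant] += tip
```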


Thompson Sampling (Bayesian)

[Figure: two pairs of posterior distributions over average tips; in one the overlap gives Pr(r < b) = 10%, in the other Pr(r < b) = 0.01%.]


Algorithm Comparison

1. Exploration vs. Exploitation

2. Total Regret

3. Batch Update


Algorithm Comparison: Exploration vs. Exploitation

Key point: exploration costs money!

[Chart: exploration rate (%) against time (%) for A/B testing, ε-greedy, and UCB/Thompson.]


Algorithm Comparison: Total Regret

[Charts: share of people sent to restaurants M, S, and C over time under an adaptive policy vs. A/B testing; one allocation is 44% / 28% / 28%, the other 70% / 18% / 12%.]


Algorithm Comparison: Batch Update

Batch update: the system asks users questions and stores many answers before updating the model.

Robustness to batch updates:
A/B Testing: very robust | ε-greedy: depends | UCB: not robust | Thompson: robust


Algorithm Comparison: Summary

A/B Testing
Pros: easy to implement; good for a small number of arms; robust to batch updates.
Cons: high total regret.

ε-greedy
Pros: easy to implement; if a good ε is found, lower total regret and faster at finding the best arm than ε-first.
Cons: need to figure out a good ε; high total regret.

UCB
Pros: good for a large number of arms; finds the best arm fast; low total regret.
Cons: not robust to batch updates.

Thompson Sampling
Pros: finds the best arm fast; low total regret; robust to batch updates.
Cons: sensitive to statistical assumptions.


Non-contextual vs. Contextual

Contextual case: we can use user features (e.g. female, vegetarian, married, Latino) and product features (e.g. burger, non-vegetarian, cheap, good service).

Key point: everyone has different tastes, so we pick one best restaurant for each person.



Algorithms (Contextual Bandits)


What do we mean by context?

User side:
Likes spicy food, refined tastes, plays violin, male, …
From Wisconsin, likes German food, likes football, male, …
Student, doesn't like seafood, allergic to cats, female, …
Chief of AFC, watches shows on competitive eating, female, …

Arm side:
Tex-Mex style, sit-down dining, founded in 1975, …
Serves sandwiches, has veggie options, founded in 1965, …
Breakfast, lunch, and dinner, cheap, founded in 1940, …


User Context

[Chart: average reward over time for a non-contextual policy vs. a contextual (user) policy, with the best possible reward without context and the best possible reward with context shown as reference levels.]


Arm Context

[Chart: average reward over time for a non-contextual policy, a policy using only arm context, and a policy using both arm and user context, with the best possible reward with and without user context shown as reference levels.]

Takeaway message: user context can increase the optimal reward; arm context can get you there faster!


Exploiting Context

User side:
Population segmentation (e.g. DARTS)
Clustering users
Learning an embedding

Arm side:
Linear models: LinUCB, Linear TS, OFUL
Maintain an estimate of the best arm
More data → shrinking uncertainty


Exploiting User Context

Assumptions:
• Users can be represented as points in space
• Users cluster together, so points that are close are similar
• Stationarity


Exploiting User Context

[Figure sequence: users (Joe, Yao, Nichola, Peter, Aniruddha, Rachel, Sophie, Yika, Vineeta, Jason, Andre, Chris, Madeline, John) plotted in a feature space with a spicy–mild axis and a meat–vegetarian axis.]

Linear: separate the users with a straight decision boundary.

Quadratic: separate the users with a curved decision boundary.

Hierarchical: recursively split the users into clusters and give each cluster its own distribution over the restaurants (e.g. 80% / 15% / 5% for one cluster and 5% / 15% / 80% for another), refining the splits and the probabilities as more feedback arrives.


Exploiting Arm Context: Linear Models

Look at only arm context, with no user context.

Assumptions:
• We can represent arms as vectors.
• Rewards are a noisy version of the inner product.
• Stationarity.

Methods include:
• Linear UCB
• Linear Thompson Sampling
• OFUL (Optimism in the Face of Uncertainty – Linear)
• … and many more.


The Math Slide

Standard noisy linear model: rt = xtᵀ θ* + ηt

θ*: the optimal arm (unknown parameter vector)
xt: arm pulled at time t
rt: reward at time t
ηt: noise at time t
Ct: confidence set
λ: ridge term
Xt: matrix of all arms pulled up to time t

Collect all the data and write: r = X θ* + η
Least-squares solution: θLS = (XᵀX)⁻¹ Xᵀr
Ridge regression: θLSR = (XᵀX + λI)⁻¹ Xᵀr

Typical linear bandit algorithm:
θ0 = 0
for t = 0, 1, 2, …
  xt = argmax over x ∈ Ct of xᵀθt
  θt+1 = (XtᵀXt + λI)⁻¹ Xtᵀrt, where Xt and rt collect the arms pulled and rewards observed up to time t
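A minimal sketch of this loop in a LinUCB-flavored form (ridge estimate of θ* plus an optimism bonus in place of the explicit confidence-set argmax); the toy arm vectors, λ, and the exploration weight α are illustrative assumptions:

```python
import numpy as np

d = 3                                    # feature dimension
arms = [np.array([1.0, 0.0, 0.0]),
        np.array([0.0, 1.0, 0.0]),
        np.array([0.5, 0.5, 0.7])]
lam, alpha = 1.0, 1.0                    # ridge term and exploration weight

A = lam * np.eye(d)                      # Xt^T Xt + lambda * I
b = np.zeros(d)                          # Xt^T rt

def select():
    # Pick the arm maximizing x^T theta plus an uncertainty bonus.
    A_inv = np.linalg.inv(A)
    theta = A_inv @ b                    # ridge-regression estimate of theta*
    scores = [x @ theta + alpha * np.sqrt(x @ A_inv @ x) for x in arms]
    return int(np.argmax(scores))

def update(arm_index, reward):
    global A, b
    x = arms[arm_index]
    A += np.outer(x, x)
    b += reward * x
```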


Exploiting Arm Context

[Figure: the set of arms x1, x2, … (dishes such as mince pie, buffalo wings, tofu scramble, grilled vegetables, ratatouille, tandoori chicken, jalapeño scramble, pad thai, penne arrabiata) plotted on the spicy–mild vs. meat–vegetarian plane, with θ*, the optimal arm, marked.]

[Figure: the next arm chosen (buffalo wings) makes an angle θ with the optimal arm. The reward (= cos θ) is small, but we can still infer information about the other arms!]

[Figure: after this pull we form an estimate θ1 of the optimal arm, together with a region of uncertainty (the confidence set C1).]

[Figure: the next arm chosen, x2, already lies close to the estimate of the optimal arm: we've already homed in on a pretty good choice.]

[Figure: and the process continues: the estimate is updated to θ2 and the region of uncertainty shrinks to C2.]


Some Caveats

• Big assumption that we know good features.
• Finding features takes a lot of work.
• Few arms, many people → learn an embedding of the arms.
• Few people, many arms → featurize the arms and use linear bandits.
• Linear models are a naive assumption; see kernel methods.



Industry Review


Companies using MAB


Headlines, Photos, and Ads

Washington Post and Google


Washington Post

Used Upper Confidence Bound (UCB) to pick headlines and photos.


Google Experiments

• Used Thompson Sampling (TS)
• Updated models twice a day
• Two metrics used to decide when an experiment ends:
  – 95% confidence that an alternative is better, or
  – the "potential value remaining in the experiment"

Takeaway message: the more arms, the higher the gain over A/B testing.


Advanced Topics


Topics

• Bias
• Data joining and latency
• Non-stationarity


Bias

Website 1 showed two options with probability 50% / 50% and sold 100 / 20 of them.

Website 2 showed two options with probability 90% / 10% and sold 100 / 20 of them.

Who did better?


Bias

• Be careful when using past data!
• Inverse propensity score weighting corrects for the unequal exposure probabilities.
• New sales estimates (reweighted to a common 50/50 allocation):

Website 1: 100*0.5 + 20*0.5 = 60
Website 2: 100*0.5*(0.5/0.9) + 20*0.5*(0.5/0.1) ≈ 78
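A minimal sketch of that reweighting arithmetic; the helper function name is an illustrative choice:

```python
# Reweight each option's observed sales to a common 50/50 allocation using
# inverse propensity scores: sold * target_prob * (target_prob / actual_prob).
def reweighted_sales(sold, shown_probs, target_prob=0.5):
    return sum(s * target_prob * (target_prob / p)
               for s, p in zip(sold, shown_probs))

print(reweighted_sales([100, 20], [0.5, 0.5]))  # Website 1: 60.0
print(reweighted_sales([100, 20], [0.9, 0.1]))  # Website 2: ~77.8
```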


Data Joining and Latency

[Diagram, courtesy of the Microsoft MWT white paper: the logged (context, decision) events and the rewards arrive as separate streams and must be joined; rewards come back with latency.]


Non-Stationarity – Beer Example

My yearly beer taste:

January: stouts and porters
April: pale ales and IPAs
July: wits and lagers
October: Oktoberfests and reds
December: Christmas ales


Non-Stationarity

Preferences change over time. There may be periodicity in the data; tax season is a great example.

Some solutions:

• Slow changes → a system with finite memory (see the sketch below)
• Abrupt changes → subspace tracking / anomaly detection
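A minimal sliding-window sketch of the finite-memory idea; the window size and class name are illustrative choices, not from the deck:

```python
from collections import deque

# Track an arm's average reward over only the last `window` observations,
# so old preferences eventually fall out of the estimate.
class SlidingWindowAverage:
    def __init__(self, window=500):
        self.rewards = deque(maxlen=window)   # oldest observations drop off

    def update(self, reward):
        self.rewards.append(reward)

    def value(self):
        return sum(self.rewards) / len(self.rewards) if self.rewards else 0.0
```

A windowed (or discounted) average like this can stand in for the running averages used by ε-greedy or UCB above.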

Takeaway message: preferences change over time, biases are added, and data needs to be joined from different sources.


Thank You. Questions?
