Multi-armed Bandit Problems with Dependent Arms


1

Multi-armed Bandit Problems with Dependent Arms

Sandeep Pandey (spandey@cs.cmu.edu)

Deepayan Chakrabarti (deepay@yahoo-inc.com)

Deepak Agarwal (dagarwal@yahoo-inc.com)

2

Background: Bandits

Bandit “arms”

μ1, μ2, μ3 (unknown reward probabilities)

Pull arms sequentially so as to maximize the total expected reward

• Show ads on a webpage to maximize clicks

• Product recommendation to maximize sales
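As a concrete point of reference for this setting, here is a minimal sketch of a standard independent-arms policy (UCB1); the Bernoulli reward probabilities and the horizon below are illustrative assumptions, not values from the slides.

```python
import math
import random

def ucb1(mus, horizon=10000, seed=0):
    """Pull Bernoulli arms with unknown means `mus` using the UCB1 index."""
    rng = random.Random(seed)
    n = [0] * len(mus)   # pulls per arm
    s = [0] * len(mus)   # successes per arm
    total = 0
    for t in range(1, horizon + 1):
        if t <= len(mus):
            arm = t - 1                      # pull each arm once first
        else:                                # then pick the largest upper confidence bound
            arm = max(range(len(mus)),
                      key=lambda i: s[i] / n[i] + math.sqrt(2 * math.log(t) / n[i]))
        reward = 1 if rng.random() < mus[arm] else 0
        n[arm] += 1
        s[arm] += reward
        total += reward
    return total

# e.g., three ads with unknown click probabilities (illustrative values)
print(ucb1([0.30, 0.28, 1e-6]))
```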

3

Dependent Arms

Reward probabilities μi are generally assumed to be independent of each other

What if they are dependent? E.g., ads on similar topics, using similar text/phrases, should have similar rewards:

“Skiing, snowboarding”: μ1 = 0.3

“Skiing, snowshoes”: μ2 = 0.28

“Get Vonage!”: μ3 = 10⁻⁶

“Snowshoe rental”: μ4 = 0.31

4

Dependent Arms

Reward probabilities μi are generally assumed to be independent of each other

What if they are dependent? E.g., ads on similar topics, using similar text/phrases, should have similar rewards. A click on one ad suggests that other “similar” ads may generate clicks as well. Can we increase total reward using this dependency?

5

Cluster Model of Dependence

[Figure: Arms 1 and 2 in Cluster 1; Arms 3 and 4 in Cluster 2]

μi ~ f(π[i]), where f is some known distribution and π[i] is the unknown cluster-specific parameter of arm i's cluster

Successes si ~ Bin(ni, μi), where ni = # pulls of arm i
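A minimal simulation sketch of this generative model, assuming f(π) is a Beta distribution centered at the cluster parameter π; the concentration value, cluster parameters, and cluster sizes are illustrative assumptions, not from the slides.

```python
import random

def sample_cluster_bandit(cluster_params, arms_per_cluster, concentration=50, seed=0):
    """Draw per-arm reward probabilities mu_i ~ f(pi[cluster of arm i]).

    Here f(pi) is taken to be Beta(concentration*pi, concentration*(1-pi)),
    so arms in the same cluster get similar (but not identical) mu_i.
    """
    rng = random.Random(seed)
    mus, cluster_of = [], []
    for c, pi in enumerate(cluster_params):
        for _ in range(arms_per_cluster):
            mus.append(rng.betavariate(concentration * pi, concentration * (1 - pi)))
            cluster_of.append(c)
    return mus, cluster_of

def pull(mu, rng):
    """One pull of an arm: success with probability mu (so s_i ~ Bin(n_i, mu_i))."""
    return 1 if rng.random() < mu else 0

mus, cluster_of = sample_cluster_bandit(cluster_params=[0.3, 0.05], arms_per_cluster=2)
print(mus, cluster_of)
```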

6

Cluster Model of Dependence

[Figure: Arms 1 and 2 with μi ~ f(π1); Arms 3 and 4 with μi ~ f(π2)]

Total reward:

Discounted: ∑_{t=0}^{∞} α^t · E[R(t)], where α is the discounting factor

Undiscounted: ∑_{t=0}^{T} E[R(t)]
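For concreteness, here is how the two objectives read when computed on a realized reward sequence; the sequence, α, and T below are made up for illustration.

```python
def discounted_total(rewards, alpha):
    """sum over t >= 0 of alpha^t * R(t)"""
    return sum((alpha ** t) * r for t, r in enumerate(rewards))

def undiscounted_total(rewards, T):
    """sum over t = 0..T of R(t)"""
    return sum(rewards[:T + 1])

rewards = [1, 0, 1, 1, 0, 1]
print(discounted_total(rewards, alpha=0.9), undiscounted_total(rewards, T=5))
```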

7

Discounted Reward

[Figure: two separate belief-state MDPs, one per cluster. In the MDP for cluster 1, pulling Arm 1 or Arm 2 transitions between estimates (x1, x2), (x'1, x'2), (x"1, x"2); in the MDP for cluster 2, pulling Arm 3 or Arm 4 transitions between (x3, x4), (x'3, x'4), (x"3, x"4).]

The optimal policy can be computed using per-cluster MDPs only.

Optimal Policy:

• Compute an (“index”, arm) pair for each cluster

• Pick the cluster with the largest index, and pull the corresponding arm

8

Discounted Reward

[Figure: the same per-cluster belief-state MDPs as on the previous slide]

The optimal policy can be computed using per-cluster MDPs only.

Optimal Policy:

• Compute an (“index”, arm) pair for each cluster

• Pick the cluster with the largest index, and pull the corresponding arm

• Reduces the problem to smaller state spaces

• Reduces to Gittins’ Theorem [1979] for independent bandits

• Approximation bounds on the index for k-step lookahead
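A rough sketch of the flavor of this scheme: here each cluster's index is approximated by a k-step lookahead value over its own belief-state MDP with per-arm Beta(1, 1) beliefs, and the policy pulls the best first-step arm of the highest-index cluster. This is only an illustration under those assumptions, not the paper's exact index.

```python
from functools import lru_cache

ALPHA = 0.9  # discounting factor (illustrative)

@lru_cache(maxsize=None)
def lookahead(state, k, alpha=ALPHA):
    """k-step lookahead value of one cluster's belief-state MDP.

    state = ((s_1, f_1), ...): success/failure counts for the arms in the
    cluster; each arm's reward probability is summarized by its posterior
    mean (s + 1) / (s + f + 2), i.e. a Beta(1, 1) prior.
    Returns (value, best first arm to pull).
    """
    # NOTE (simplification): only the pulled arm's belief is updated here; under the
    # cluster model, a success on one arm would also shift beliefs for its cluster-mates.
    if k == 0:
        return 0.0, None
    best_val, best_arm = float("-inf"), None
    for i, (s, f) in enumerate(state):
        p = (s + 1) / (s + f + 2)
        win = state[:i] + ((s + 1, f),) + state[i + 1:]
        lose = state[:i] + ((s, f + 1),) + state[i + 1:]
        val = (p * (1 + alpha * lookahead(win, k - 1, alpha)[0])
               + (1 - p) * alpha * lookahead(lose, k - 1, alpha)[0])
        if val > best_val:
            best_val, best_arm = val, i
    return best_val, best_arm

def choose_pull(cluster_states, k=3):
    """Compute an (index, arm) pair per cluster; pull the arm of the largest-index cluster."""
    best_cluster = max(range(len(cluster_states)),
                       key=lambda c: lookahead(cluster_states[c], k)[0])
    _, arm = lookahead(cluster_states[best_cluster], k)
    return best_cluster, arm

# two clusters of two arms, each arm given as (successes, failures) observed so far
print(choose_pull([((3, 1), (0, 0)), ((0, 2), (1, 1))]))
```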

9

Cluster Model of Dependence

[Figure: Arms 1 and 2 with μi ~ f(π1); Arms 3 and 4 with μi ~ f(π2)]

Total reward:

Discounted: ∑_{t=0}^{∞} α^t · E[R(t)], where α is the discounting factor

Undiscounted: ∑_{t=0}^{T} E[R(t)]

10

Undiscounted Reward

All arms in a cluster are similar, so they can be grouped into one hypothetical “cluster arm”.

[Figure: Arms 1 and 2 grouped into “cluster arm” 1; Arms 3 and 4 grouped into “cluster arm” 2]

11

Undiscounted Reward

[Figure: Arms 1 and 2 form “cluster arm” 1; Arms 3 and 4 form “cluster arm” 2]

Two-Level Policy

In each iteration:

• Pick a “cluster arm” using a traditional bandit policy

• Pick an arm within that cluster using a traditional bandit policy

Each “cluster arm” must have some estimated reward probability (a minimal sketch follows).
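A minimal sketch of the Two-Level Policy under simple assumptions: an ε-greedy rule at both levels and the MEAN estimate (∑si / ∑ni) as each “cluster arm”'s reward probability. The ε value, the optimistic default for unpulled arms, and the example inputs are illustrative, not from the slides.

```python
import random

def two_level_policy(mus, cluster_of, horizon=10000, eps=0.1, seed=0):
    """Two-Level Policy sketch: a bandit rule over "cluster arms", then over arms inside."""
    rng = random.Random(seed)
    n = [0] * len(mus)      # pulls per arm
    s = [0] * len(mus)      # successes per arm
    clusters = sorted(set(cluster_of))
    members = {c: [i for i in range(len(mus)) if cluster_of[i] == c] for c in clusters}

    def rate(num, den):
        return num / den if den else 1.0    # optimistic default for unpulled arms/clusters

    total = 0
    for _ in range(horizon):
        # Level 1: pick a "cluster arm" by its MEAN estimate (eps-greedy)
        if rng.random() < eps:
            c = rng.choice(clusters)
        else:
            c = max(clusters, key=lambda cc: rate(sum(s[i] for i in members[cc]),
                                                  sum(n[i] for i in members[cc])))
        # Level 2: pick an arm within that cluster (eps-greedy)
        if rng.random() < eps:
            arm = rng.choice(members[c])
        else:
            arm = max(members[c], key=lambda i: rate(s[i], n[i]))
        reward = 1 if rng.random() < mus[arm] else 0
        n[arm] += 1
        s[arm] += reward
        total += reward
    return total

# illustrative inputs: four arms in two clusters
print(two_level_policy([0.30, 0.28, 1e-6, 0.02], cluster_of=[0, 0, 1, 1]))
```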

12

Issues

What is the reward probability of a “cluster arm”?

How do cluster characteristics affect performance?

13

Reward probability of a “cluster arm”

What is the reward probability r of a “cluster arm”?

MEAN: r = ∑si / ∑ni, i.e., the average success rate, summing over all arms in the cluster [Kocsis+/2006, Pandey+/2007]

Initially, r = μavg = average μ of arms in the cluster

Finally, r = μmax = max μ among arms in the cluster

⇒ “Drift” in the reward probability of the “cluster arm”

14

Reward probability drift causes problems

Drift ⇒ non-optimal clusters might temporarily look better ⇒ the optimal arm is explored only O(log T) times

[Figure: Arms 1 and 2 in Cluster 1; Arms 3 and 4 in Cluster 2; the best (optimal) arm, with reward probability μopt, lies in the opt cluster]

15

Reward probability of a “cluster arm”

What is the reward probability r of a “cluster arm”?

MEAN: r = ∑si / ∑ni

MAX: r = max( E[μi] ), over all arms i in the cluster

PMAX: r = E[ max(μi) ], over all arms i in the cluster

Both MAX and PMAX aim to estimate μmax and thus reduce drift
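A sketch of the three estimates for one cluster, assuming a Beta(1, 1) posterior per arm and a Monte Carlo approximation of E[ max(μi) ] for PMAX; both of those implementation choices are assumptions, not the paper's exact recipe.

```python
import random

def cluster_arm_estimates(counts, n_samples=2000, seed=0):
    """counts = [(s_i, n_i), ...] for the arms in one cluster; returns (MEAN, MAX, PMAX)."""
    rng = random.Random(seed)
    # MEAN: pooled success rate, sum s_i / sum n_i
    mean = sum(s for s, _ in counts) / max(sum(n for _, n in counts), 1)
    # MAX: max over arms of the posterior mean E[mu_i] (Beta(1, 1) prior)
    mx = max((s + 1) / (n + 2) for s, n in counts)
    # PMAX: Monte Carlo estimate of E[ max_i mu_i ] under the per-arm posteriors
    draws = [max(rng.betavariate(s + 1, n - s + 1) for s, n in counts)
             for _ in range(n_samples)]
    pmax = sum(draws) / n_samples
    return mean, mx, pmax

print(cluster_arm_estimates([(3, 10), (1, 2)]))
```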

16

Reward probability of a “cluster arm”

MEAN: r = ∑si / ∑ni

MAX: r = max( E[μi] )

PMAX: r = E[ max(μi) ]

Both MAX and PMAX aim to estimate μmax and thus reduce drift:

         Bias in estimation of μmax    Variance of estimator
MAX      High                          Low
PMAX     Unbiased                      High

17

Comparison of schemes

10 clusters, 11.3 arms/cluster on average. MAX performs best.

18

Issues

What is the reward probability of a “cluster arm”?

How do cluster characteristics affect performance?

19

Effects of cluster characteristics

We analytically study the effects of cluster characteristics on the “crossover-time”.

Crossover-time: the time when the expected reward probability of the optimal cluster becomes the highest among all “cluster arms”.

20

Effects of cluster characteristics

Crossover-time Tc for MEAN depends on:

Cluster separation Δ = μopt – (max μ outside the opt cluster): Δ increases ⇒ Tc decreases

Cluster size Aopt: Aopt increases ⇒ Tc increases

Cohesiveness in the opt cluster, 1 – avg(μopt – μi): cohesiveness increases ⇒ Tc decreases
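These three quantities are simple functions of the per-arm μi; a small sketch with made-up μ values loosely echoing the earlier ad example:

```python
def cluster_characteristics(clusters):
    """clusters = {name: [mu_i, ...]}; returns (Delta, A_opt, cohesiveness) for the opt cluster."""
    # the optimal cluster is the one containing the globally best arm
    opt = max(clusters, key=lambda c: max(clusters[c]))
    mu_opt = max(clusters[opt])
    # separation: mu_opt minus the best mu outside the opt cluster
    delta = mu_opt - max(m for c, mus in clusters.items() if c != opt for m in mus)
    a_opt = len(clusters[opt])                                    # size of the opt cluster
    cohesiveness = 1 - sum(mu_opt - m for m in clusters[opt]) / a_opt
    return delta, a_opt, cohesiveness

print(cluster_characteristics({"ski": [0.30, 0.28, 0.31], "voip": [1e-6, 0.02]}))
```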

21

Experiments (effect of separation)

Δ increases ⇒ Tc decreases ⇒ higher reward

22

Experiments (effect of size)

Aopt increases ⇒ Tc increases ⇒ lower reward

23

Experiments (effect of cohesiveness)

Cohesiveness increases ⇒ Tc decreases ⇒ higher reward

24

Related Work

Typical multi-armed bandit problems: do not consider dependencies; very few arms

Bandits with side information: cannot handle dependencies among arms

Active learning: emphasis on the number of examples required to achieve a given prediction accuracy

25

Conclusions

We analyze bandits where dependencies are encapsulated within clusters

Discounted reward: the optimal policy is an index scheme on the clusters

Undiscounted reward: a Two-Level Policy with MEAN, MAX, and PMAX; analysis of the effect of cluster characteristics on performance, for MEAN

26

Discounted Reward

[Figure: a belief-state MDP over all four arms. Each state holds the estimated reward probabilities (x1, x2, x3, x4) of all four arms; pulling Arm 1 leads to a success or a failure, either of which changes the belief for both arms 1 and 2; pulling Arms 2, 3, or 4 acts analogously.]

• Create a belief-state MDP

• Each state contains the estimated reward probabilities for all arms

• Solve for the optimal policy
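A minimal sketch of this naive construction, assuming independent Beta(1, 1) beliefs per arm and a finite pull budget; note that, unlike the dependent-arms update on the slide, this simplified version changes only the pulled arm's belief.

```python
from functools import lru_cache

ALPHA = 0.9  # discounting factor (illustrative)

@lru_cache(maxsize=None)
def optimal_value(state, horizon, alpha=ALPHA):
    """Finite-horizon optimal value of the joint belief-state MDP.

    state = ((s_1, f_1), ..., (s_K, f_K)): success/failure counts for ALL arms,
    i.e. one state per combination of estimated reward probabilities.
    Simplification: only the pulled arm's counts change, whereas under the
    cluster model a success on arm 1 would also shift the belief for arm 2.
    """
    if horizon == 0:
        return 0.0
    best = float("-inf")
    for i, (s, f) in enumerate(state):
        p = (s + 1) / (s + f + 2)                    # posterior mean, Beta(1, 1) prior
        win = state[:i] + ((s + 1, f),) + state[i + 1:]
        lose = state[:i] + ((s, f + 1),) + state[i + 1:]
        val = (p * (1 + alpha * optimal_value(win, horizon - 1, alpha))
               + (1 - p) * alpha * optimal_value(lose, horizon - 1, alpha))
        best = max(best, val)
    return best

# four arms, no data yet: the state space grows combinatorially with the horizon,
# which is why decomposing into per-cluster MDPs (slide 7) is attractive
print(optimal_value(((0, 0),) * 4, horizon=4))
```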

27

Background: Bandits

Bandit “arms”

p1, p2, p3 (unknown payoff probabilities)

Regret = optimal payoff – actual payoff
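Written as a tiny sketch (the payoff probabilities and pull sequence below are made up):

```python
def regret(mus, pulls):
    """Regret = optimal expected payoff - actual expected payoff of the chosen pulls."""
    mu_star = max(mus)
    return sum(mu_star - mus[arm] for arm in pulls)

print(regret([0.30, 0.28, 1e-6], pulls=[0, 1, 2, 0, 0]))
```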

28

Reward probability of a “cluster arm”

What is the reward probability of a “cluster arm”?

Eventually, every “cluster arm” must converge to μmax, the reward probability of the most rewarding arm within that cluster, since a bandit policy is used within each cluster.

However, “drift” causes problems.

29

Experiments

Simulation based on one week’s worth of data from a large-scale ad-matching application

10 clusters, with 11.3 arms/cluster on average

30

Comparison of schemes

10 clusters, 11.3 arms/cluster on average; cluster separation Δ = 0.08; cluster size Aopt = 31; cohesiveness = 0.75. MAX performs best.

31

Reward probability drift causes problems

Intuitively, to reduce regret, we must quickly converge to the optimal “cluster arm”, and then to the best arm within that cluster.

[Figure: Arms 1 and 2 in Cluster 1; Arms 3 and 4 in Cluster 2; the best (optimal) arm, with reward probability μopt, lies in the opt cluster]
