Multi-armed Bandit Problems with Dependent Arms


1

Multi-armed Bandit Problems with Dependent Arms

Sandeep Pandey (spandey@cs.cmu.edu)

Deepayan Chakrabarti (deepay@yahoo-inc.com)

Deepak Agarwal (dagarwal@yahoo-inc.com)

2

Background: Bandits

Bandit “arms”

μ1, μ2, μ3 (unknown reward probabilities)

Pull arms sequentially so as to maximize the total expected reward

• Show ads on a webpage to maximize clicks

• Product recommendation to maximize sales
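As a concrete point of reference for this setting, here is a minimal sketch of a standard independent-arms policy (UCB1); the Bernoulli reward probabilities and the horizon below are illustrative assumptions, not values from the slides.

```python
import math
import random

def ucb1(mus, horizon=10000, seed=0):
    """Pull Bernoulli arms with unknown means `mus` using the UCB1 index."""
    rng = random.Random(seed)
    n = [0] * len(mus)   # pulls per arm
    s = [0] * len(mus)   # successes per arm
    total = 0
    for t in range(1, horizon + 1):
        if t <= len(mus):
            arm = t - 1                      # pull each arm once first
        else:                                # then pick the largest upper confidence bound
            arm = max(range(len(mus)),
                      key=lambda i: s[i] / n[i] + math.sqrt(2 * math.log(t) / n[i]))
        reward = 1 if rng.random() < mus[arm] else 0
        n[arm] += 1
        s[arm] += reward
        total += reward
    return total

# e.g., three ads with unknown click probabilities (illustrative values)
print(ucb1([0.30, 0.28, 1e-6]))
```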

3

Dependent Arms

Reward probabilities μi are generally assumed to be independent of each other

What if they are dependent? E.g., ads on similar topics, using similar text/phrases, should have similar rewards:

“Skiing, snowboarding”: μ1 = 0.3

“Skiing, snowshoes”: μ2 = 0.28

“Get Vonage!”: μ3 = 10⁻⁶

“Snowshoe rental”: μ4 = 0.31

4

Dependent Arms

Reward probabilities μi are generally assumed to be independent of each other

What if they are dependent? E.g., ads on similar topics, using similar text/phrases, should have similar rewards. A click on one ad suggests that other “similar” ads may generate clicks as well. Can we increase total reward using this dependency?

5

Cluster Model of Dependence

[Figure: Arms 1 and 2 in Cluster 1; Arms 3 and 4 in Cluster 2]

μi ~ f(π[i]), where f is some known distribution and π[i] is the unknown cluster-specific parameter of arm i's cluster

Successes si ~ Bin(ni, μi), where ni = # pulls of arm i
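A minimal simulation sketch of this generative model, assuming f(π) is a Beta distribution centered at the cluster parameter π; the concentration value, cluster parameters, and cluster sizes are illustrative assumptions, not from the slides.

```python
import random

def sample_cluster_bandit(cluster_params, arms_per_cluster, concentration=50, seed=0):
    """Draw per-arm reward probabilities mu_i ~ f(pi[cluster of arm i]).

    Here f(pi) is taken to be Beta(concentration*pi, concentration*(1-pi)),
    so arms in the same cluster get similar (but not identical) mu_i.
    """
    rng = random.Random(seed)
    mus, cluster_of = [], []
    for c, pi in enumerate(cluster_params):
        for _ in range(arms_per_cluster):
            mus.append(rng.betavariate(concentration * pi, concentration * (1 - pi)))
            cluster_of.append(c)
    return mus, cluster_of

def pull(mu, rng):
    """One pull of an arm: success with probability mu (so s_i ~ Bin(n_i, mu_i))."""
    return 1 if rng.random() < mu else 0

mus, cluster_of = sample_cluster_bandit(cluster_params=[0.3, 0.05], arms_per_cluster=2)
print(mus, cluster_of)
```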

6

Cluster Model of Dependence

[Figure: Arms 1 and 2 with μi ~ f(π1); Arms 3 and 4 with μi ~ f(π2)]

Total reward:

Discounted: ∑_{t=0}^{∞} α^t · E[R(t)], where α is the discounting factor

Undiscounted: ∑_{t=0}^{T} E[R(t)]
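For concreteness, here is how the two objectives read when computed on a realized reward sequence; the sequence, α, and T below are made up for illustration.

```python
def discounted_total(rewards, alpha):
    """sum over t >= 0 of alpha^t * R(t)"""
    return sum((alpha ** t) * r for t, r in enumerate(rewards))

def undiscounted_total(rewards, T):
    """sum over t = 0..T of R(t)"""
    return sum(rewards[:T + 1])

rewards = [1, 0, 1, 1, 0, 1]
print(discounted_total(rewards, alpha=0.9), undiscounted_total(rewards, T=5))
```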

7

Discounted Reward

[Figure: two separate belief-state MDPs, one per cluster. In the MDP for cluster 1, pulling Arm 1 or Arm 2 transitions between estimates (x1, x2), (x'1, x'2), (x"1, x"2); in the MDP for cluster 2, pulling Arm 3 or Arm 4 transitions between (x3, x4), (x'3, x'4), (x"3, x"4).]

The optimal policy can be computed using per-cluster MDPs only.

Optimal Policy:

• Compute an (“index”, arm) pair for each cluster

• Pick the cluster with the largest index, and pull the corresponding arm

8

Discounted Reward

[Figure: the same per-cluster belief-state MDPs as on the previous slide]

The optimal policy can be computed using per-cluster MDPs only.

Optimal Policy:

• Compute an (“index”, arm) pair for each cluster

• Pick the cluster with the largest index, and pull the corresponding arm

• Reduces the problem to smaller state spaces

• Reduces to Gittins’ Theorem [1979] for independent bandits

• Approximation bounds on the index for k-step lookahead
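A rough sketch of the flavor of this scheme: here each cluster's index is approximated by a k-step lookahead value over its own belief-state MDP with per-arm Beta(1, 1) beliefs, and the policy pulls the best first-step arm of the highest-index cluster. This is only an illustration under those assumptions, not the paper's exact index.

```python
from functools import lru_cache

ALPHA = 0.9  # discounting factor (illustrative)

@lru_cache(maxsize=None)
def lookahead(state, k, alpha=ALPHA):
    """k-step lookahead value of one cluster's belief-state MDP.

    state = ((s_1, f_1), ...): success/failure counts for the arms in the
    cluster; each arm's reward probability is summarized by its posterior
    mean (s + 1) / (s + f + 2), i.e. a Beta(1, 1) prior.
    Returns (value, best first arm to pull).
    """
    # NOTE (simplification): only the pulled arm's belief is updated here; under the
    # cluster model, a success on one arm would also shift beliefs for its cluster-mates.
    if k == 0:
        return 0.0, None
    best_val, best_arm = float("-inf"), None
    for i, (s, f) in enumerate(state):
        p = (s + 1) / (s + f + 2)
        win = state[:i] + ((s + 1, f),) + state[i + 1:]
        lose = state[:i] + ((s, f + 1),) + state[i + 1:]
        val = (p * (1 + alpha * lookahead(win, k - 1, alpha)[0])
               + (1 - p) * alpha * lookahead(lose, k - 1, alpha)[0])
        if val > best_val:
            best_val, best_arm = val, i
    return best_val, best_arm

def choose_pull(cluster_states, k=3):
    """Compute an (index, arm) pair per cluster; pull the arm of the largest-index cluster."""
    best_cluster = max(range(len(cluster_states)),
                       key=lambda c: lookahead(cluster_states[c], k)[0])
    _, arm = lookahead(cluster_states[best_cluster], k)
    return best_cluster, arm

# two clusters of two arms, each arm given as (successes, failures) observed so far
print(choose_pull([((3, 1), (0, 0)), ((0, 2), (1, 1))]))
```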

9

Cluster Model of Dependence

[Figure: Arms 1 and 2 with μi ~ f(π1); Arms 3 and 4 with μi ~ f(π2)]

Total reward:

Discounted: ∑_{t=0}^{∞} α^t · E[R(t)], where α is the discounting factor

Undiscounted: ∑_{t=0}^{T} E[R(t)]

10

Undiscounted Reward

All arms in a cluster are similar, so they can be grouped into one hypothetical “cluster arm”.

[Figure: Arms 1 and 2 grouped into “cluster arm” 1; Arms 3 and 4 grouped into “cluster arm” 2]

11

Undiscounted Reward

[Figure: Arms 1 and 2 form “cluster arm” 1; Arms 3 and 4 form “cluster arm” 2]

Two-Level Policy

In each iteration:

• Pick a “cluster arm” using a traditional bandit policy

• Pick an arm within that cluster using a traditional bandit policy

Each “cluster arm” must have some estimated reward probability (a minimal sketch follows).
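A minimal sketch of the Two-Level Policy under simple assumptions: an ε-greedy rule at both levels and the MEAN estimate (∑si / ∑ni) as each “cluster arm”'s reward probability. The ε value, the optimistic default for unpulled arms, and the example inputs are illustrative, not from the slides.

```python
import random

def two_level_policy(mus, cluster_of, horizon=10000, eps=0.1, seed=0):
    """Two-Level Policy sketch: a bandit rule over "cluster arms", then over arms inside."""
    rng = random.Random(seed)
    n = [0] * len(mus)      # pulls per arm
    s = [0] * len(mus)      # successes per arm
    clusters = sorted(set(cluster_of))
    members = {c: [i for i in range(len(mus)) if cluster_of[i] == c] for c in clusters}

    def rate(num, den):
        return num / den if den else 1.0    # optimistic default for unpulled arms/clusters

    total = 0
    for _ in range(horizon):
        # Level 1: pick a "cluster arm" by its MEAN estimate (eps-greedy)
        if rng.random() < eps:
            c = rng.choice(clusters)
        else:
            c = max(clusters, key=lambda cc: rate(sum(s[i] for i in members[cc]),
                                                  sum(n[i] for i in members[cc])))
        # Level 2: pick an arm within that cluster (eps-greedy)
        if rng.random() < eps:
            arm = rng.choice(members[c])
        else:
            arm = max(members[c], key=lambda i: rate(s[i], n[i]))
        reward = 1 if rng.random() < mus[arm] else 0
        n[arm] += 1
        s[arm] += reward
        total += reward
    return total

# illustrative inputs: four arms in two clusters
print(two_level_policy([0.30, 0.28, 1e-6, 0.02], cluster_of=[0, 0, 1, 1]))
```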

12

Issues

What is the reward probability of a “cluster arm”?

How do cluster characteristics affect performance?

13

Reward probability of a “cluster arm”

What is the reward probability r of a “cluster arm”?

MEAN: r = ∑si / ∑ni, i.e., the average success rate, summing over all arms in the cluster [Kocsis+/2006, Pandey+/2007]

Initially, r = μavg = average μ of arms in the cluster

Finally, r = μmax = max μ among arms in the cluster

⇒ “Drift” in the reward probability of the “cluster arm”

14

Reward probability drift causes problems

Drift ⇒ non-optimal clusters might temporarily look better ⇒ the optimal arm is explored only O(log T) times

[Figure: Arms 1 and 2 in Cluster 1; Arms 3 and 4 in Cluster 2; the best (optimal) arm, with reward probability μopt, lies in the opt cluster]

15

Reward probability of a “cluster arm”

What is the reward probability r of a “cluster arm”?

MEAN: r = ∑si / ∑ni

MAX: r = max( E[μi] ), over all arms i in the cluster

PMAX: r = E[ max(μi) ], over all arms i in the cluster

Both MAX and PMAX aim to estimate μmax and thus reduce drift
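A sketch of the three estimates for one cluster, assuming a Beta(1, 1) posterior per arm and a Monte Carlo approximation of E[ max(μi) ] for PMAX; both of those implementation choices are assumptions, not the paper's exact recipe.

```python
import random

def cluster_arm_estimates(counts, n_samples=2000, seed=0):
    """counts = [(s_i, n_i), ...] for the arms in one cluster; returns (MEAN, MAX, PMAX)."""
    rng = random.Random(seed)
    # MEAN: pooled success rate, sum s_i / sum n_i
    mean = sum(s for s, _ in counts) / max(sum(n for _, n in counts), 1)
    # MAX: max over arms of the posterior mean E[mu_i] (Beta(1, 1) prior)
    mx = max((s + 1) / (n + 2) for s, n in counts)
    # PMAX: Monte Carlo estimate of E[ max_i mu_i ] under the per-arm posteriors
    draws = [max(rng.betavariate(s + 1, n - s + 1) for s, n in counts)
             for _ in range(n_samples)]
    pmax = sum(draws) / n_samples
    return mean, mx, pmax

print(cluster_arm_estimates([(3, 10), (1, 2)]))
```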

16

Reward probability of a “cluster arm”

MEAN: r = ∑si / ∑ni

MAX: r = max( E[μi] )

PMAX: r = E[ max(μi) ]

Both MAX and PMAX aim to estimate μmax and thus reduce drift:

         Bias in estimation of μmax    Variance of estimator
MAX      High                          Low
PMAX     Unbiased                      High

17

Comparison of schemes

10 clusters, 11.3 arms/cluster on average. MAX performs best.

18

Issues

What is the reward probability of a “cluster arm”?

How do cluster characteristics affect performance?

19

Effects of cluster characteristics

We analytically study the effects of cluster characteristics on the “crossover-time”.

Crossover-time: the time when the expected reward probability of the optimal cluster becomes the highest among all “cluster arms”.

20

Effects of cluster characteristics

Crossover-time Tc for MEAN depends on:

Cluster separation Δ = μopt – (max μ outside the opt cluster): Δ increases ⇒ Tc decreases

Cluster size Aopt: Aopt increases ⇒ Tc increases

Cohesiveness in the opt cluster, 1 – avg(μopt – μi): cohesiveness increases ⇒ Tc decreases
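These three quantities are simple functions of the per-arm μi; a small sketch with made-up μ values loosely echoing the earlier ad example:

```python
def cluster_characteristics(clusters):
    """clusters = {name: [mu_i, ...]}; returns (Delta, A_opt, cohesiveness) for the opt cluster."""
    # the optimal cluster is the one containing the globally best arm
    opt = max(clusters, key=lambda c: max(clusters[c]))
    mu_opt = max(clusters[opt])
    # separation: mu_opt minus the best mu outside the opt cluster
    delta = mu_opt - max(m for c, mus in clusters.items() if c != opt for m in mus)
    a_opt = len(clusters[opt])                                    # size of the opt cluster
    cohesiveness = 1 - sum(mu_opt - m for m in clusters[opt]) / a_opt
    return delta, a_opt, cohesiveness

print(cluster_characteristics({"ski": [0.30, 0.28, 0.31], "voip": [1e-6, 0.02]}))
```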

21

Experiments (effect of separation)

Δ increases ⇒ Tc decreases ⇒ higher reward

22

Experiments (effect of size)

Aopt increases ⇒ Tc increases ⇒ lower reward

23

Experiments (effect of cohesiveness)

Cohesiveness increases ⇒ Tc decreases ⇒ higher reward

24

Related Work

Typical multi-armed bandit problems: do not consider dependencies; very few arms

Bandits with side information: cannot handle dependencies among arms

Active learning: emphasis on the number of examples required to achieve a given prediction accuracy

25

Conclusions

We analyze bandits where dependencies are encapsulated within clusters

Discounted reward: the optimal policy is an index scheme on the clusters

Undiscounted reward: a Two-Level Policy with MEAN, MAX, and PMAX; analysis of the effect of cluster characteristics on performance, for MEAN

26

Discounted Reward

[Figure: a belief-state MDP over all four arms. Each state holds the estimated reward probabilities (x1, x2, x3, x4) of all four arms; pulling Arm 1 leads to a success or a failure, either of which changes the belief for both arms 1 and 2; pulling Arms 2, 3, or 4 acts analogously.]

• Create a belief-state MDP

• Each state contains the estimated reward probabilities for all arms

• Solve for the optimal policy
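A minimal sketch of this naive construction, assuming independent Beta(1, 1) beliefs per arm and a finite pull budget; note that, unlike the dependent-arms update on the slide, this simplified version changes only the pulled arm's belief.

```python
from functools import lru_cache

ALPHA = 0.9  # discounting factor (illustrative)

@lru_cache(maxsize=None)
def optimal_value(state, horizon, alpha=ALPHA):
    """Finite-horizon optimal value of the joint belief-state MDP.

    state = ((s_1, f_1), ..., (s_K, f_K)): success/failure counts for ALL arms,
    i.e. one state per combination of estimated reward probabilities.
    Simplification: only the pulled arm's counts change, whereas under the
    cluster model a success on arm 1 would also shift the belief for arm 2.
    """
    if horizon == 0:
        return 0.0
    best = float("-inf")
    for i, (s, f) in enumerate(state):
        p = (s + 1) / (s + f + 2)                    # posterior mean, Beta(1, 1) prior
        win = state[:i] + ((s + 1, f),) + state[i + 1:]
        lose = state[:i] + ((s, f + 1),) + state[i + 1:]
        val = (p * (1 + alpha * optimal_value(win, horizon - 1, alpha))
               + (1 - p) * alpha * optimal_value(lose, horizon - 1, alpha))
        best = max(best, val)
    return best

# four arms, no data yet: the state space grows combinatorially with the horizon,
# which is why decomposing into per-cluster MDPs (slide 7) is attractive
print(optimal_value(((0, 0),) * 4, horizon=4))
```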

27

Background: Bandits

Bandit “arms”

p1, p2, p3 (unknown payoff probabilities)

Regret = optimal payoff – actual payoff
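Written as a tiny sketch (the payoff probabilities and pull sequence below are made up):

```python
def regret(mus, pulls):
    """Regret = optimal expected payoff - actual expected payoff of the chosen pulls."""
    mu_star = max(mus)
    return sum(mu_star - mus[arm] for arm in pulls)

print(regret([0.30, 0.28, 1e-6], pulls=[0, 1, 2, 0, 0]))
```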

28

Reward probability of a “cluster arm”

What is the reward probability of a “cluster arm”?

Eventually, every “cluster arm” must converge to μmax, the reward probability of the most rewarding arm within that cluster, since a bandit policy is used within each cluster.

However, “drift” causes problems.

29

Experiments

Simulation based on one week’s worth of data from a large-scale ad-matching application

10 clusters, with 11.3 arms/cluster on average

30

Comparison of schemes

10 clusters, 11.3 arms/cluster on average; cluster separation Δ = 0.08; cluster size Aopt = 31; cohesiveness = 0.75. MAX performs best.

31

Reward probability drift causes problems

Intuitively, to reduce regret, we must quickly converge to the optimal “cluster arm”, and then to the best arm within that cluster.

[Figure: Arms 1 and 2 in Cluster 1; Arms 3 and 4 in Cluster 2; the best (optimal) arm, with reward probability μopt, lies in the opt cluster]
