Tutorial

Bandit Processes and Index Policies

Richard Weber, University of Cambridge

Young European Queueing Theorists (YEQT VII) workshop on Scheduling and priorities in queueing systems, Eindhoven, November 4–5–6, 2013

Tutorial slides

http://www.statslab.cam.ac.uk/~rrw1/talks/yetq.pdf

Abstract

We consider the problem of optimal control for a number of alternative Markov decision processes, where at each moment in time exactly one process must be "continued" while all other processes are "frozen". Only the process that is continued produces any reward and changes its state. The aim is to maximize expected total discounted reward. A familiar example would be the problem of optimally scheduling the processing of jobs that reside in a single-server queue. At each moment one of the jobs is processed by the server, while all the other jobs are made to wait. Our aim is to minimize the expected total holding cost incurred until all jobs are complete. Another example is that of hunting for an apartment; we must choose the order in which to view apartments and decide when to stop viewing and rent the best apartment of those that have been viewed. This type of problem has a beautiful and surprising solution in terms of Gittins indices. In this tutorial I will review the theory of bandit processes and Gittins indices, describe some applications in scheduling and queueing, and tell you about some frontiers of research in the field.

Reading list

D. Bertsimas and J. Nino-Mora. Conservation laws, extended polymatroids and multiarmed bandit problems; a polyhedral approach to indexable systems. Math. Oper. Res., 21:257–306, 1996.

J. C. Gittins, K. D. Glazebrook and R. R. Weber. Multi-armed Bandit Allocation Indices (2nd edition). Wiley, 2011.

K. D. Glazebrook, D. J. Hodge, C. Kirkbride and R. J. Minty. Stochastic scheduling: A short history of index policies and new approaches to index generation for dynamic resource allocation. J. Scheduling, 2013.

K. Liu, R. R. Weber and Q. Zhao. Indexability and Whittle index for restless bandit problems involving reset processes. Proc. 50th IEEE Conference on Decision and Control and European Control Conference (CDC-ECC), 12–15 December 2011, 7690–7696.

R. R. Weber and G. Weiss. On an index policy for restless bandits. J. Appl. Probab., 27:637–648, 1990.

P. Whittle. Restless bandits: activity allocation in a changing world. In A Celebration of Applied Probability, ed. J. Gani, J. Appl. Probab., 25A:287–298, 1988.

P. Whittle. Risk-sensitivity, large deviations and stochastic control. Eur. J. Oper. Res., 73:295–303, 1994.

Roadmap

Warm up
  Single machine scheduling
  Interchange arguments
  Index policies
  Pandora's problem
  Scheduling the M/M/1 queue

Bandit processes
  Bandit processes
  Gittins index theorem
  Playing golf with more than one ball
  Pandora plays golf

Multi-class queues
  M/M/1
  Achievable region method
  Tax problems
  Branching bandits
  M/G/1

Restless bandits
  Restless bandits
  Whittle index
  Asymptotic optimality
  Risk-sensitive indices

Warm up

Single machine scheduling

N jobs are to be processed successively on one machine. In what order should we process them?

Job i has a known processing time t_i, a positive integer. On completion of job i a reward r_i is obtained.

If processed in the order 1, 2, . . . , N, the total discounted reward is

r1 β^{t1} + r2 β^{t1+t2} + · · · + rN β^{t1+···+tN},

where 0 < β < 1.

Dynamic programming solution

Let S_k ⊆ {1, . . . , N} be a set of k uncompleted jobs. The dynamic programming equation is

F(S_k) = max_{i∈S_k} [ β^{t_i} r_i + β^{t_i} F(S_k − {i}) ],

with F(∅) = 0.

In principle we can solve the scheduling problem with dynamic programming.

But of course there is an easier way . . .
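
Here, as a sanity check, is a minimal sketch of this dynamic program in Python, memoizing F over subsets of uncompleted jobs. The job data are illustrative values, not taken from the slides.

```python
from functools import lru_cache

jobs = [(9, 5), (6, 2), (4, 3)]   # (reward r_i, processing time t_i)
beta = 0.9

@lru_cache(maxsize=None)
def F(uncompleted):
    """Maximal discounted reward from a frozenset of uncompleted jobs."""
    if not uncompleted:
        return 0.0
    best = float("-inf")
    for i in uncompleted:
        r, t = jobs[i]
        # Process job i first: collect beta^{t_i} r_i, then continue optimally.
        best = max(best, beta**t * (r + F(uncompleted - {i})))
    return best

print(F(frozenset(range(len(jobs)))))
```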

Interchange arguments

Consider processing jobs in the order i1, . . . , i_k, i, j, i_{k+3}, . . . , i_N, and consider interchanging the order of jobs i and j: i1, . . . , i_k, j, i, i_{k+3}, . . . , i_N.

Rewards under the two schedules are respectively

R1 + β^{T+t_i} r_i + β^{T+t_i+t_j} r_j + R2
R1 + β^{T+t_j} r_j + β^{T+t_j+t_i} r_i + R2,

where T = t_{i1} + · · · + t_{i_k}, and R1 and R2 are the rewards accruing from the jobs coming before and after i, j (these are the same in both schedules).

Simple algebra =⇒ the reward of the first schedule is greater if

r_i β^{t_i}/(1 − β^{t_i}) > r_j β^{t_j}/(1 − β^{t_j}).

Hence a schedule can be optimal only if the jobs are taken in decreasing order of the indices r_i β^{t_i}/(1 − β^{t_i}).

Theorem
Total discounted reward is maximized by the index policy of always processing the uncompleted job of greatest index, computed as

G_i = r_i β^{t_i} (1 − β) / (1 − β^{t_i}).

Notice that (1 − β^{t_i})/(1 − β) = 1 + β + · · · + β^{t_i−1}, so G_i → r_i/t_i as β → 1.
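
The theorem is easy to check numerically. Below is a small sketch (illustrative job data, assuming the index formula above) that sorts jobs by G_i and verifies the result against brute-force enumeration of all orders.

```python
from itertools import permutations

jobs = [(9, 5), (6, 2), (4, 3)]   # (r_i, t_i); illustrative values
beta = 0.9

def total_reward(order):
    # r_1 beta^{t_1} + r_2 beta^{t_1+t_2} + ...
    elapsed, total = 0, 0.0
    for i in order:
        r, t = jobs[i]
        elapsed += t
        total += r * beta**elapsed
    return total

def gittins(i):
    r, t = jobs[i]
    return r * beta**t * (1 - beta) / (1 - beta**t)

index_order = sorted(range(len(jobs)), key=gittins, reverse=True)
best = max(permutations(range(len(jobs))), key=total_reward)
assert abs(total_reward(index_order) - total_reward(best)) < 1e-12
print(index_order, total_reward(index_order))   # here: [1, 0, 2]
```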

Weitzman’s Pandora problem

Weitzman, M. L. (1979). Optimal search for the best alternative. Econometrica, 47:641–654.

Martin L. Weitzman is Professor of Economics at Harvard University.

Pandora has n boxes. Box i contains a prize, of value x_i, distributed with known c.d.f. F_i. At known cost c_i she can open box i and discover x_i. Pandora may open boxes in any order, and stop at will.

She opens a subset of boxes S ⊆ {1, . . . , n} and then stops. She wishes to maximize the expected value of

R = max_{i∈S} x_i − Σ_{i∈S} c_i.

Reasons for liking Weitzman's problem

Weitzman's problem is attractive.

1. It has many applications:
   - hunting for a house
   - selling a house (accepting the best offer)
   - searching for a job
   - looking for a research project to focus upon.

2. It has an index policy solution, a so-called Pandora rule:
   - Calculate a 'reservation prize' value for each box.
   - Open boxes in descending order of reservation prizes until a prize is found whose value exceeds the reservation prize of any unopened box.

Index policy for Pandora's problem

We seek to maximize the expected value of

R = max_{i∈S} x_i − Σ_{i∈S} c_i,

where S is the set of opened boxes. The reservation value (index) of box i is

x_i* = inf{ y : y ≥ −c_i + E[max(x_i, y)] }.

Weitzman's Pandora rule. Open the unopened box with greatest reservation value, until all reservation values are less than the greatest prize that has been found.
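
A sketch of how one might compute reservation values and run the Pandora rule, for boxes with discrete prize distributions. The boxes here are hypothetical, and the bisection bracket [0, 1e6] is an assumption.

```python
import random

def reservation_value(prizes, probs, cost, lo=0.0, hi=1e6, tol=1e-9):
    """x* = inf{y : y >= -c + E max(x, y)}, i.e. the root of E[(x-y)^+] = c."""
    def excess(y):
        return sum(p * max(x - y, 0.0) for x, p in zip(prizes, probs)) - cost
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if excess(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Three hypothetical boxes: (possible prizes, probabilities, opening cost).
boxes = [([0, 10], [0.5, 0.5], 1.0),
         ([2, 6],  [0.5, 0.5], 0.5),
         ([0, 20], [0.9, 0.1], 1.5)]
xstar = [reservation_value(p, q, c) for p, q, c in boxes]

def pandora(sample):
    """Run the Pandora rule on one realization of the prizes."""
    order = sorted(range(len(boxes)), key=lambda i: -xstar[i])
    best, spent = 0.0, 0.0
    for i in order:
        if best >= xstar[i]:
            break            # stop: best prize beats every remaining index
        spent += boxes[i][2]
        best = max(best, sample[i])
    return best - spent

random.seed(0)
payoffs = [pandora([random.choices(p, q)[0] for p, q, _ in boxes])
           for _ in range(100_000)]
print(xstar, sum(payoffs) / len(payoffs))
```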

Deal or no deal

Scheduling in a two-class M/M/1 queue

Jobs of two types arrive at a single server according to independent Poisson streams, with rates λ_i, i = 1, 2.

Service times of type (class) i jobs are exponentially distributed with mean μ_i^{-1}, i = 1, 2, independent of each other and of the arrival streams.

Assume the overall rate at which work arrives at the system, namely λ1/μ1 + λ2/μ2, is less than 1.

Our goal: minimize holding cost rate

We wish to choose a policy π, required to be past-measurable [non-anticipative], non-idling and preemptive, to minimize the long-term holding cost rate, i.e.

minimize_π { c1 E_π(N1) + c2 E_π(N2) },

where c_i is a positive class-i holding cost rate, N_i is the number of class-i jobs in the system, and E_π is expectation under the steady-state distribution induced by π.

A conservation law

The work-in-system process, W^π(t), is the aggregate of the remaining service times of all jobs in the system at time t.

The Pollaczek–Khinchine formula for the M/G/1 queue states that if X has the distribution of a service time then the mean work in system is

E_π[W] = (1/2) λ E[X²] / (1 − λ E[X]).

Since π is non-idling, E_π[W] is independent of π, and so

E_π[W] = E_π(N1)/μ1 + E_π(N2)/μ2 = (ρ1 μ1^{-1} + ρ2 μ2^{-1}) / (1 − ρ1 − ρ2),

where ρ_i = λ_i/μ_i is the rate at which class-i work enters the system.

Some constraints

Let x_i^π = E_π(N_i)/μ_i be the mean work-in-system of class i.

The work-in-system of type-1 jobs is no less than it would be if we always gave them priority. Hence

x1^π = E_π(N1)/μ1 ≥ ρ1 μ1^{-1} / (1 − ρ1),

with equality under the policy (denoted 1 → 2) that always gives priority to type-1 jobs. Similarly,

x2^π = E_π(N2)/μ2 ≥ ρ2 μ2^{-1} / (1 − ρ2),

with equality under 2 → 1.

A mathematical program

Consider the optimization problem

minimize_π { c1 E_π(N1) + c2 E_π(N2) } = minimize_π { c1 μ1 x1^π + c2 μ2 x2^π } = minimize_{(x1,x2)∈X} { c1 μ1 x1 + c2 μ2 x2 },

where X is the set of (x1^π, x2^π) that are achievable for some π. We have argued that

X ⊆ P = { (x1, x2) : x1 ≥ ρ1 μ1^{-1}/(1 − ρ1),  x2 ≥ ρ2 μ2^{-1}/(1 − ρ2),  x1 + x2 = (ρ1 μ1^{-1} + ρ2 μ2^{-1})/(1 − ρ1 − ρ2) }.

A linear program relaxation

The LP relaxation of our optimization problem is

minimize_{(x1,x2)∈P} { c1 μ1 x1 + c2 μ2 x2 },

with P as above. Easy to solve!

If c1μ1 > c2μ2 then the optimal solution is at x* where

x1* = ρ1 μ1^{-1}/(1 − ρ1),   x2* = (ρ1 μ1^{-1} + ρ2 μ2^{-1})/(1 − ρ1 − ρ2) − ρ1 μ1^{-1}/(1 − ρ1).

Furthermore x* ∈ X, and it is achieved by the policy 1 → 2.
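
A tiny numerical sketch of this solution; the arrival rates, service rates and costs below are made-up values, not from the slides.

```python
# Illustrative parameters (an assumption of this sketch).
lam = [0.3, 0.2]; mu = [1.0, 0.8]; c = [2.0, 1.0]
rho = [l / m for l, m in zip(lam, mu)]
total = (rho[0] / mu[0] + rho[1] / mu[1]) / (1 - rho[0] - rho[1])

if c[0] * mu[0] > c[1] * mu[1]:       # give priority to class 1  (1 -> 2)
    x1 = (rho[0] / mu[0]) / (1 - rho[0]); x2 = total - x1
else:                                  # give priority to class 2  (2 -> 1)
    x2 = (rho[1] / mu[1]) / (1 - rho[1]); x1 = total - x2

print(x1, x2, c[0] * mu[0] * x1 + c[1] * mu[1] * x2)
```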

Lessons

The long-term holding cost rate is minimized by giving priority to the job class of greatest c_iμ_i.

The policy 1 → 2 is an index policy: at all times the server devotes its effort to the job of greatest index, i.e. greatest c_iμ_i.

The obvious generalization of this result is true if there are more than two job classes.

Given the hint of this simple example, how would you prove it?

Proof strategy

Find a set of linear inequalities that must be satisfied. These define a polytope P, such that any achievable x must lie in P. There are constraints like

x_i ≥ ρ_i μ_i^{-1}/(1 − ρ_i),   x1 + x2 ≥ (ρ1 μ1^{-1} + ρ2 μ2^{-1})/(1 − ρ1 − ρ2),

and in general

Σ_{i∈S} x_i ≥ (Σ_{i∈S} ρ_i μ_i^{-1}) / (1 − Σ_{i∈S} ρ_i),   for all S ⊆ {1, . . . , m}.

Minimize Σ_i c_iμ_ix_i over x ∈ P.

Show the optimal solution, x*, is in X and is achieved by the index policy that prioritizes jobs in decreasing order of c_iμ_i.

Bandit processes

Two-job scheduling problem

(r1, t1) = (9, 5),  (r2, t2) = (6, 2),  0 < β < 1.

Job 1 pays 9 on its 5th step of processing: 0, 0, 0, 0, 9, 0, . . .
Job 2 pays 6 on its 2nd step of processing: 0, 6, 0, 0, 0, 0, . . .

Processing job 2 first and then job 1 earns

Reward = 0·β + 6·β^2 + 0·β^3 + 0·β^4 + 0·β^5 + 0·β^6 + 9·β^7 = 6β^2 + 9β^7.
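
For this example the two possible orders, and the two indices of the previous slides, can be compared directly. A short sketch; the value β = 0.9 is an assumption.

```python
beta = 0.9                      # any 0 < beta < 1; 0.9 is an assumption
r1, t1, r2, t2 = 9, 5, 6, 2

order_12 = r1 * beta**t1 + r2 * beta**(t1 + t2)   # job 1 first
order_21 = r2 * beta**t2 + r1 * beta**(t2 + t1)   # job 2 first

# Indices G_i = r_i beta^{t_i} (1 - beta) / (1 - beta^{t_i}).
G1 = r1 * beta**t1 * (1 - beta) / (1 - beta**t1)
G2 = r2 * beta**t2 * (1 - beta) / (1 - beta**t2)

print(order_12, order_21)       # order 2-then-1 wins ...
print(G1, G2)                   # ... as predicted by G2 > G1
```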

Two-armed bandit

Arm 1 yields rewards 3, 10, 4, 9, 12, 1, . . .
Arm 2 yields rewards 5, 6, 2, 15, 2, 7, . . .

Pulling arm 2 twice, then arm 1 five times, then arm 2 again, and so on (taking rewards 5, 6, 3, 10, 4, 9, 12, 2, 15, . . .) earns

Reward = 5 + 6β + 3β^2 + 10β^3 + · · ·

with 0 < β < 1. Of course, in practice we must choose which arms to pull without knowing the future sequences of rewards.

Each of the two arms is a bandit process.

Bandit processes

A bandit process is a special type of Markov decision process in which there are just two possible actions:

u = 1 (continue): produces reward r(x_t), and the state changes to x_{t+1} according to Markov dynamics P_i(x_t, x_{t+1}).

u = 0 (freeze): produces no reward, and the state does not change (hence the term 'freeze').

A simple family of alternative bandit processes (SFABP) is a collection of N such bandit processes, in known states x1(t), . . . , xN(t).

SFABP

At each time t ∈ {0, 1, 2, . . .}, one bandit process is to be activated (pulled/continued). If arm i is activated then it changes state, x → y with probability P_i(x, y), and produces reward r_i(x_i(t)). All other bandit processes remain passive (not pulled/frozen).

Objective: maximize the expected total β-discounted reward

E[ Σ_{t=0}^∞ r_{i_t}(x_{i_t}(t)) β^t ],

where i_t is the arm pulled at time t (0 < β < 1).

Dynamic effort allocation

Job scheduling: in what order should I work on the tasks in my in-tray?

Research projects: how should I allocate my research time amongst my favorite open problems so as to maximize the value of my completed research?

Searching for information: shall I spend more time browsing the web, or go to the library, or ask a friend?

Dating strategy: should I contact a new prospect, or try another date with someone I have dated before?

Dynamic programming solution

The dynamic programming equation is

F(x1, . . . , xN) = max_i { r_i(x_i) + β Σ_y P_i(x_i, y) F(x1, . . . , x_{i−1}, y, x_{i+1}, . . . , xN) }.

If bandit i moves on a state space of size k_i, then (x1, . . . , xN) moves on a state space of size Π_i k_i (exponential in N).

Gittins index solution

Theorem [Gittins, '74, '79, '89]. Expected discounted reward is maximized by always continuing the bandit having greatest value of the 'dynamic allocation index'

G_i(x_i) = sup_{τ≥1} E[ Σ_{t=0}^{τ−1} r_i(x_i(t)) β^t | x_i(0) = x_i ] / E[ Σ_{t=0}^{τ−1} β^t | x_i(0) = x_i ],

where τ is a (past-measurable) stopping time. G_i(x_i) is called the Gittins index.

Gittins and Jones (1974). A dynamic allocation index for the sequential design of experiments. In Gani, J., editor, Progress in Statistics, pages 241–66. North-Holland, Amsterdam, NL. Read at the 1972 European Meeting of Statisticians, Budapest.

Multi-armed bandit allocation indices

1st edition, 1989, Gittins. Many applications to clinical trials, job scheduling, search, etc.

2nd edition, 2011, Gittins, Glazebrook and Weber.

Exploration versus exploitation

"Bandit problems embody in essential form a conflict evident in all human action: information versus immediate payoff."

— Peter Whittle (1989)

Gittins index

G_i(x_i) = sup_{τ≥1} E[ Σ_{t=0}^{τ−1} r_i(x_i(t)) β^t | x_i(0) = x_i ] / E[ Σ_{t=0}^{τ−1} β^t | x_i(0) = x_i ]

The numerator is the expected discounted reward up to τ; the denominator is the expected discounted time up to τ.

Note the role of the stopping time τ. Stopping times are times recognisable when they occur. How do you make perfect toast?

There is a rule for timing toast,
One never has to guess,
Just wait until it starts to smoke,
then 7 seconds less. (David Kendall)

In the scheduling problem τ = t_i and

G_i = r_i β^{t_i} / (1 + β + · · · + β^{t_i−1}) = (1 − β) r_i β^{t_i} / (1 − β^{t_i}).

Single machine scheduling

N jobs are to be processed successively on one machine. Job i has a known processing time t_i, a positive integer. On completion of job i a reward r_i is obtained.

Total discounted reward is maximized by the index policy which processes jobs in decreasing order of the indices

G_i = r_i β^{t_i} / (1 + β + · · · + β^{t_i−1}) = (1 − β) r_i β^{t_i} / (1 − β^{t_i}).

Gittins index via calibration

We can also define the Gittins index by comparing to an arm of fixed pay rate:

G_i(x_i) = sup{ λ : Σ_{t=0}^∞ β^t λ ≤ sup_{τ≥1} E[ Σ_{t=0}^{τ−1} β^t r_i(x_i(t)) + Σ_{t=τ}^∞ β^t λ | x_i(0) = x_i ] }.

Consider a problem with two bandit processes: the bandit process B_i and a calibrating bandit process, say Λ, which pays out a known reward λ at each step it is continued.

The Gittins index of B_i is the value of λ for which we are indifferent as to which of B_i and Λ to continue initially.
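
A sketch of the calibration as a computation, for a small finite-state bandit: bisect on the standard arm's rate λ, with an inner value iteration in which the player may always retire permanently to the standard arm (worth λ/(1−β)). The chain and rewards below are made up; this is not code from the tutorial.

```python
import numpy as np

def gittins_index(r, P, beta, state):
    """Index of 'state' by calibration against a standard arm paying lambda."""
    lo, hi = float(r.min()), float(r.max())   # the index lies in [min r, max r]
    for _ in range(60):                       # bisection on lambda
        lam = (lo + hi) / 2
        retire = lam / (1 - beta)
        V = np.full(len(r), retire)
        for _ in range(500):                  # value iteration to convergence
            V = np.maximum(retire, r + beta * P @ V)
        if r[state] + beta * P[state] @ V > retire + 1e-12:
            lo = lam      # continuing beats the standard arm: raise lambda
        else:
            hi = lam
    return (lo + hi) / 2

# A hypothetical 3-state bandit (rows of P sum to 1).
r = np.array([1.0, 0.0, 2.0])
P = np.array([[0.1, 0.8, 0.1],
              [0.5, 0.4, 0.1],
              [0.2, 0.2, 0.6]])
print([gittins_index(r, P, 0.9, s) for s in range(3)])
```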

Gittins index theorem is surprising!

Peter Whittle tells the story:

"A colleague of high repute asked an equally well-known colleague:
— What would you say if you were told that the multi-armed bandit problem had been solved?
— Sir, the multi-armed bandit problem is not of such a nature that it can be solved."

What has happened since 1989?

The index theorem has become better known.

Alternative proofs have been explored:
  Playing golf with N balls
  Achievable performance region approach

Many applications (economics, engineering, . . . ).

Notions of indexation have been generalized:
  Restless bandits

Proofs of the Index Theorem

Since Gittins (1974, 1979), many researchers have reproved, remodelled and resituated the index theorem:

Beale (1979); Karatzas (1984); Varaiya, Walrand, Buyukkoc (1985); Chen, Katehakis (1986); Kallenberg (1986); Katehakis, Veinott (1986); Eplett (1986); Kertz (1986); Tsitsiklis (1986); Mandelbaum (1986, 1987); Lai, Ying (1988); Whittle (1988); Weber (1992); El Karoui, Karatzas (1993); Ishikida and Varaiya (1994); Tsitsiklis (1994); Bertsimas, Nino-Mora (1996); Glazebrook, Garbe (1996); Kaspi, Mandelbaum (1998); Bauerle, Stidham (2001); Dumitriu, Tetali, Winkler (2003).

Proofs of the Index Theorem

Interchange arguments (but cunning ones!)

Economic/gaming argument

Linear programming relaxation (achievable region method)

Golf with N balls

Dumitriu, Tetali and Winkler (2003). On playing golf with two balls.

N balls are strewn about a golf course at locations x1, . . . , xN. Hitting ball i, that is in location x_i, costs c(x_i); the ball moves, x_i → y, with probability P(x_i, y), and goes in the hole with probability P(x_i, 0).

Objective
Minimize the expected total cost incurred up to sinking a first ball.

Answer
When ball i is in location x_i it has an index γ_i(x_i). Play the ball of smallest index, until a ball goes in the hole.

Gittins index theorem for golf with N balls

Golf with one ball
Consider golf with one ball, initially in location x_i. Let's offer the golfer a prize λ, obtained when he sinks this ball in the hole (state 0). We might ask, what is the least λ for which it is optimal for him to take at least one more stroke, allowing him the option to retire at any point thereafter?

γ_i(x_i) = inf{ λ : 0 ≤ sup_{τ≥1} E[ λ 1{x_i(τ)=0} − Σ_{t=0}^{τ−1} c_i(x_i(t)) | x_i(0) = x_i ] }.

Call γ_i(x_i) the fair prize (or Gittins index).
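
The same bisection idea computes the fair prize numerically. In the sketch below (hypothetical course data, an assumption), one stroke from a location pays its cost, wins λ if the ball sinks, and otherwise the golfer continues only while his continuation value is nonnegative.

```python
import numpy as np

def fair_prize(c, P, state):
    """gamma(state): smallest prize for which taking at least one more
    stroke (with the option to retire afterwards) has nonnegative value."""
    lo, hi = 0.0, 1e6
    for _ in range(60):                      # bisection on the prize lambda
        lam = (lo + hi) / 2
        H = np.zeros(len(c))
        for _ in range(500):
            # One stroke from each location: pay c, win lam if sunk,
            # then keep playing only while the continuation value is >= 0.
            H = -c + P[:, 0] * lam + P[:, 1:] @ np.maximum(0.0, H)
        if H[state] >= 0:
            hi = lam
        else:
            lo = lam
    return (lo + hi) / 2

# Hypothetical course: rows = locations 1..3; column 0 of P is the hole.
c = np.array([1.0, 2.0, 0.5])
P = np.array([[0.2, 0.3, 0.3, 0.2],
              [0.1, 0.2, 0.4, 0.3],
              [0.4, 0.2, 0.2, 0.2]])
print([fair_prize(c, P, s) for s in range(3)])
```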

How to play golf with one ball and an increasing fair prize

Having been offered a fair prize, the golfer will play until the ball goes in the hole, or reaches a state x_i(t) from which the offered prize is no longer great enough to tempt him to play further.

If the latter occurs, let us increase the prize to γ_i(x_i(t)). It becomes the 'prevailing prize' at time t, i.e. max_{s≤t} γ_i(x_i(s)). The prevailing prize is nondecreasing in t.

Now the golfer need never retire and can keep playing until the ball goes in the hole, say at time τ. But his expected profit is just 0:

E[ γ_i(x_i(τ−1)) − Σ_{t=0}^{τ−1} c_i(x_i(t)) | x_i(0) = x_i ] = 0.

Golf with 1 ball

[Figure: a ball travels through locations x, x′, x″ with fair prizes γ(x) = 3.0, γ(x′) = 2.5, γ(x″) = 4.0. The prevailing prize sequence is 3.0, 3.0, 4.0, . . .]

Golf with 2 balls

[Figure: ball 1 travels through x, x′, x″ with γ(x) = 3.0, γ(x′) = 2.5, γ(x″) = 4.0; ball 2 travels through y, y′, y″ with γ(y) = 3.2, γ(y′) = 3.5, γ(y″) = 4.2.]

Index theorem for golf with N balls

Suppose the golfer keeps playing until a ball goes in the hole. His prize is the prevailing prize of the ball he sinks.

Prevailing prizes are defined in such a way that the golfer cannot make a strictly positive profit, and so for any policy σ,

E_σ(cost incurred) ≥ E_σ(prize eventually won).   (1)

Let π be the policy: always play the ball with least prevailing prize. Because each ball's sequence of prevailing prizes is nondecreasing,

E_σ(prize eventually won) ≥ E_π(prize eventually won).   (2)

But the golfer breaks even under π:

E_π(prize eventually won) = E_π(cost incurred).   (3)

Combining (1)–(3), E_σ(cost incurred) ≥ E_π(cost incurred), so π is optimal.

Golf and the multi-armed bandit

Having solved the golf problem, the solution to the multi-armed bandit problem follows. Just let P(x, 0) = 1 − β for all x. The expected cost incurred until a first ball is sunk then equals the expected total β-discounted cost over the infinite horizon.

The fair prize, g(x), is 1/(1 − β) times the Gittins index, G(x):

g(x) = inf{ g : sup_{τ≥1} E[ Σ_{t=0}^{τ−1} −c(x(t)) β^t + (1 − β)(1 + β + · · · + β^{τ−1}) g | x(0) = x ] ≥ 0 }

     = (1/(1 − β)) inf_{τ≥1} E[ Σ_{t=0}^{τ−1} c(x(t)) β^t | x(0) = x ] / E[ Σ_{t=0}^{τ−1} β^t | x(0) = x ].

Gittins index theorem and Weitzman's problem

Theorem (Gittins index theorem, 1972). The problem posed by a family of alternative bandit processes is solved by always continuing the bandit process having the greatest Gittins index.

Compare this to the solution to Weitzman's problem, which is:

Theorem (Weitzman's Pandora rule, 1979). Pandora's problem is solved by always opening the unopened box with greatest reservation value, until all reservation values are less than the greatest prize that has been found.

Pandora plays golf

Learn how to play golf with more than one ball and you can solve Weitzman's Pandora's boxes problem.

Each of Pandora's boxes is a ball, starting in state 1, say. The first time ball i is hit, a cost c_i is incurred and the ball lands at location x_i (chosen as a sample from F_i). The second time ball i is hit (from its current state x_i), a cost −x_i is incurred, the ball goes in the hole, and the game ends.

The problem of minimizing the expected cost of putting a ball in the hole ≡ the problem of maximizing the expected value of Pandora's greatest discovered prize, net of the costs of opening boxes.

Gittins =⇒ Pandora, mentioned by Chade and Smith (2006).

Summary of Lecture 1

Interchange arguments
Index policies
Pandora's problem
Achievable region method for M/M/1 queue
Bandit processes
Gittins index theorem
Golf with many balls
Solution to Pandora's problem

Summary of Lecture 2

Index policies for multi-class queues
  M/M/1 (preemptive)
  Achievable region method
  Branching bandits
  Tax problems
  M/G/1 (nonpreemptive)

Index policies for restless bandits
  Restless bandits
  Whittle index
  Asymptotic optimality
  Risk-sensitive indices

Multi-class queues

Multi-class M/M/1 preemptive

Let the average work-in-system of class i be x_i^π = E_π[N_i]/μ_i.

Find linear inequalities that must be satisfied by the x_i^π. These define a polytope P, such that achievable x^π are in P. In particular, for any S ⊆ E = {1, 2, . . . , k}, consider

Σ_{i∈S} x_i^π = average work-in-system due to job classes in S ≤ f(S)

for some f(S). Equality is achieved by always giving priority to jobs of classes not in S.

Minimize Σ_i c_iμ_ix_i over x ∈ P.

Show the optimal solution, x*, is achievable by the index policy that prioritizes jobs in decreasing order of c_iμ_i.

Proof of cμ-rule optimality

Linear program relaxation:

minimize Σ_i c_iμ_ix_i
subject to  Σ_{i∈S} x_i ≤ f(S), for all S ⊂ E,
            Σ_{i∈E} x_i = f(E),
            x_i ≥ 0.

Dual linear program:

maximize Σ_S y_S f(S)
subject to  Σ_{S : i∈S} y_S ≤ c_iμ_i, for all i,
            y_S ≤ 0 for S ⊂ E; y_E is unconstrained.

Assume c1μ1 > · · · > ckμk. Let π be the policy 1 → 2 → · · · → k. Denote S_j = {j, . . . , k}. Taking x = x^π gives Σ_{i∈S_j} x_i = f(S_j), j = 1, . . . , k, so primal feasibility holds. Complementary slackness and dual feasibility hold if we set the y_S as follows:

y_E = c1μ1,
y_{S2} = c2μ2 − c1μ1,
y_{S3} = c3μ3 − c2μ2,
...
y_{Sk} = ckμk − c_{k−1}μ_{k−1},

and all other y_S = 0.

=⇒ π is optimal.

Nonpreemptive multi-class M/G/1

Jobs are again of k classes {1, . . . , k}, with holding cost rates ci.

A class i job has service time ti, drawn from distribution Fi. While it is being processed, a random number of new jobs of type j arrive, distributed as a Poisson random variable with mean λjti.

This is an example of a branching bandit.

The Gittins index theorem holds. (The proof is similar to that in the last lecture, except that when hit, the golf ball splits into many golf balls.)

Problem: minimize time-average holding cost.


Nonpreemptive multi-class M/G/1

Let 1/µi = Eti, and assume c1µ1 > · · · > ckµk.

Assume the Gittins index theorem for branching bandits.

We now prove that the nonpreemptive scheduling policy that minimizes the expected weighted holding cost is the cµ-rule: the priority policy π : 1 → 2 → · · · → k.

Tax problems

Consider a job which enters at time 0, leaves at time t, and pays tax in between: the discounted cost is

∫₀ᵗ ci e−αs ds = (1/α)[ci − ci e−αt].

An alternative view (of the right-hand side) is that we

'pay ci/α on entry',

'receive refund ci/α on exit' (with discount factor e−αt applied).

If we cannot control when jobs enter, then we just want to maximize the collection of refunds.
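(To see the decomposition: ∫₀ᵗ ci e−αs ds = (ci/α)(1 − e−αt) = ci/α − (ci/α)e−αt, i.e. a certain payment of ci/α at entry, minus a refund of ci/α discounted back from the exit time.)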


Gittins index in tax problems

The Gittins index is

Gi = sup_τ E[sum of discounted refunds collected up to τ] / E[integral of discounted time up to τ],

which, in the limit α → 0, tends to

sup_τ E[sum of refunds collected up to τ] / Eτ.

(A policy which is α-discount optimal for all sufficiently small α is average-cost optimal.)

Nonpreemptive M/G/1 queue

Suppose that G1 > G2 > · · · > Gk.

G1 = sup_τ E[sum of refunds collected up to τ] / Eτ = c1/Et1 = c1µ1.

To find Gi we start with one class i job, process it, and then process any 'daughters' in classes 1, . . . , i − 1, until the system is again clear of jobs in classes 1, . . . , i − 1.

Poisson arrivals =⇒ the expected refunds, C, and the expected time, T, accumulated during this clearing are proportional to ti. So for some θ,

Gi = E[ci + (θti)C] / E[ti + (θti)T].


Nonpreemptive M/G/1 queue

Similarly, we might start with one job in class i − 1, process it, and then clear all daughter jobs in classes 1, . . . , i − 2.

But the index calculation gives the same value if we also clear all daughter class i − 1 jobs, and so, for the same θ as above,

Gi−1 = E[ci−1 + (θti−1)C] / E[ti−1 + (θti−1)T],

Gi = E[ci + (θti)C] / E[ti + (θti)T].

Gi−1 ≥ Gi =⇒ ci−1/Eti−1 ≥ ci/Eti.
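(The implication is clean algebra: since C and T are constants, dividing numerator and denominator by Eti gives

Gi = (ci/Eti + θC) / (1 + θT),

which is increasing in ci/Eti = ciµi, so the Gi and the ciµi are ordered in the same way.)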


Optimality of cµ-rule in multi-class M/G/1

The Gittins index theorem for branching bandits, plus the fact that the Gittins indices are ordered the same as the ciµi, give:

Theorem

The average waiting cost in a multi-class M/G/1 queue is minimized (amongst nonpreemptive strategies) by always processing the job of greatest ciµi, where µi = 1/Eti.

Notice that the Gi do depend on the arrival rates, but their ordering does not.

Restless bandits

Spinning plates


Restless bandits

[Whittle '88]

Two actions are available: active (a = 1) or passive (a = 0).

Rewards, r(x, a), and transitions, P(y | x, a), depend on the state and the action taken.

Objective: maximize the time-average reward from n restless bandits under a constraint that only m (m < n) of them receive the active action simultaneously.

active a = 1                passive a = 0
work, increasing fatigue    rest, recovery
high speed                  low speed, with P(y | x, 0) = εP(y | x, 1), y ≠ x
inspection                  no inspection

Opportunistic spectrum access

Communication channels may be busy or free.

[Figure: busy/free periods of Channel 1 and Channel 2 over times 0, 1, 2, 3, . . . , T, with the transmission opportunities marked.]

The aim is to 'inspect' m out of n channels, maximizing the number of these that are found to be free.


Relaxed problem for a single restless bandit

Consider a relaxed problem, posed for a single bandit only.

We seek to maximize the average reward obtained from this bandit under a constraint that a = 1 for only a fraction ρ = m/n of the time.

LP for the relaxed problem

Let zax be the proportion of time that the bandit is in state x and action a is taken (under a stationary Markov policy).

An upper bound for our problem can be found from an LP in the variables {zax : x ∈ E, a ∈ {0, 1}}:

maximize ∑x,a r(x, a) zax

s.t. zax ≥ 0, for all x, a;

∑x,a zax = 1;

∑a zax = ∑y,a zay P(x | y, a), for all x;

∑x z0x = 1 − ρ.
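For concreteness, a minimal sketch of this LP for a hypothetical two-state bandit (the transition matrices, rewards and ρ below are invented for illustration):

# Relaxed-problem LP for a made-up 2-state restless bandit.
import numpy as np
from scipy.optimize import linprog

P = {0: np.array([[0.9, 0.1],      # P(y | x, a = 0), rows indexed by x
                  [0.3, 0.7]]),
     1: np.array([[0.5, 0.5],      # P(y | x, a = 1)
                  [0.1, 0.9]])}
r = {0: np.array([0.0, 0.0]),      # r(x, a = 0)
     1: np.array([1.0, 2.0])}      # r(x, a = 1)
rho = 0.4                          # allowed active fraction

# Variables ordered z = (z(x=0,a=0), z(1,0), z(0,1), z(1,1)).
obj = -np.concatenate([r[0], r[1]])         # maximize => minimize negative
A_eq, b_eq = [[1, 1, 1, 1]], [1.0]          # proportions sum to one
for x in (0, 1):                            # balance at each state x
    row = np.zeros(4)
    for a in (0, 1):
        row[2 * a + x] += 1                 # sum_a z_x^a ...
        for y in (0, 1):
            row[2 * a + y] -= P[a][y, x]    # ... = sum_{y,a} z_y^a P(x|y,a)
    A_eq.append(row); b_eq.append(0.0)
A_eq.append([1, 1, 0, 0]); b_eq.append(1 - rho)   # passive fraction

res = linprog(obj, A_eq=np.array(A_eq), b_eq=b_eq, bounds=[(0, None)] * 4)
print(res.x, -res.fun)   # occupation measure and the resulting upper bound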


The subsidy problem

The optimal value of the dual LP is g, which can be found from the average-cost dynamic programming equation

φ(x) + g = max_{a∈{0,1}} { r(x, a) + λ(1 − a) + ∑y φ(y) P(y | x, a) }.

g, φ(x) and λ are the Lagrange multipliers for the constraints.

λ may be interpreted as a subsidy for taking a = 0.

The solution partitions the state space into sets: E0 (a = 0), E1 (a = 1) and E01 (randomization between a = 0 and a = 1).
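Continuing the sketch above (and assuming the bandit is indexable, so that bisection on λ is valid), the subsidy problem can be solved repeatedly to locate the index of each state; q_values and whittle_index are invented helper names, and P, r are the hypothetical data from the LP sketch:

import numpy as np

def q_values(lam, P, r, n_iter=2000):
    # Relative value iteration for the average-cost subsidy problem:
    # Q(x, a) = r(x, a) + lam*(1 - a) + sum_y P(y | x, a) phi(y).
    phi = np.zeros(2)
    for _ in range(n_iter):
        Q = np.array([[r[a][x] + lam * (1 - a) + P[a][x] @ phi
                       for a in (0, 1)] for x in (0, 1)])
        v = Q.max(axis=1)
        phi = v - v[0]            # relative values, pinned at state 0
    return Q

def whittle_index(x, P, r, lo=-100.0, hi=100.0, tol=1e-6):
    # Least subsidy at which the passive action is optimal in state x.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        Q = q_values(mid, P, r)
        if Q[x, 0] >= Q[x, 1]:    # passive already optimal at subsidy mid
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

print([whittle_index(x, P, r) for x in (0, 1)])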


Indexability

It is reasonable that, as the subsidy λ (for a = 0) increases from −∞ to +∞, the set of states E0 (where a = 0 is optimal) should increase monotonically.

If it does, then we say the bandit is indexable.

The Whittle index, W(x), is the least subsidy for which it can be optimal to take a = 0 in state x.

This motivates a heuristic policy:

Apply the active action to the m bandits with the greatest Whittle indices.

Like Gittins indices for classical bandits, Whittle indices can be computed separately for each bandit.

The Whittle index is the same as the Gittins index when a = 0 is the freezing action.


Two questions

Under what assumptions is a restless bandit indexable?

This is somewhat mysterious.

Special classes of restless bandits are indexable, such as 'dual speed' bandits: Glazebrook, Nino-Mora and Ansell (2002), W. (2007).

Indexability can be proved in some problems (such as the opportunistic spectrum access problem, Liu and Zhao (2009)).

How good is the heuristic policy using Whittle indices?

It may be optimal (opportunistic spectrum access with identical channels: Ahmad, Liu, Javidi and Zhao (2009)).

There are lots of papers with numerical work.

It is often asymptotically optimal: W. and Weiss (1990).



Asymptotic optimality

At time t there are (n1, . . . , nk) bandits in states 1, . . . , k. Suppose a priority policy orders the states 1, 2, . . . . Let

m = ρn;

zi = ni/n be the proportion of bandits in state i;

nai = the number in state i that receive action a;

uai(z) = nai/ni.

[Figure: bandits grouped by state, n1, n2, n3, n4, with the active budget m = ρn filled from the highest-priority states down: u11(z) = u12(z) = 1, 0 < u13(z) < 1, and the remaining bandits passive.]

qaij = the rate at which a bandit in state i jumps to state j under action a;

qij(z) = u0i(z) q0ij + u1i(z) q1ij.


Fluid model approximation

Now suppose n is large. Transitions will be happening very fast.

By the law of large numbers we expect that the 'path' z(t) evolves close to the fluid model, in which

dzi/dt = ∑j [uj q1ji + (1 − uj) q0ji] zj − zi ∑j [ui q1ij + (1 − ui) q0ij],

where uj = uj(z) is a function of z such that ∑i ui(z)zi = ρ.

Suppose, in order of Whittle index, the states are 1, . . . , k.

For z1 + · · · + zh−1 < ρ < z1 + · · · + zh, we have

uj(z) = 1, for j < h;

uh(z) = (ρ − ∑i<h zi)/zh;

uj(z) = 0, for j > h.


Fluid approximation

But this is not so bad. For i < h,

dzi/dt = ∑j<h q1ji zj + ∑j>h q0ji zj + (ρ − ∑j<h zj) q1hi + (∑j≤h zj − ρ) q0hi − zi ∑j q1ij.

Similar expressions hold for i > h and i = h.

The general form is

dz/dt = A(z)z + b(z),

where A(z) and b(z) are constant within polyhedral regions.
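The piecewise-affine dynamics are easy to explore numerically. A minimal sketch (the 3-state rate matrices below are invented; they only illustrate the mechanics of the priority allocation u(z) and Euler integration):

# Euler integration of dz/dt under a priority policy.
import numpy as np

k, rho, dt = 3, 0.5, 1e-3
q0 = np.array([[-1.0, 1.0, 0.0],   # passive rate matrix (rows sum to 0)
               [0.5, -1.0, 0.5],
               [1.0, 0.0, -1.0]])
q1 = np.array([[-2.0, 1.0, 1.0],   # active rate matrix
               [1.0, -2.0, 1.0],
               [0.0, 2.0, -2.0]])

def u_of_z(z):
    # Fill the active budget rho from the highest-priority state down;
    # at most one state h gets 0 < u_h < 1.
    u, budget = np.zeros(k), rho
    for i in range(k):
        u[i] = min(1.0, budget / z[i]) if z[i] > 0 else 0.0
        budget -= u[i] * z[i]
    return u

z = np.array([0.2, 0.3, 0.5])
for _ in range(200_000):           # integrate out to t = 200
    u = u_of_z(z)
    Q = u[:, None] * q1 + (1 - u)[:, None] * q0   # row j = rates out of j
    z = z + dt * (Q.T @ z)          # dz_i/dt = sum_j Q_ji z_j
print(z)   # approximate equilibrium z*, when one exists; for other rate
           # matrices the path can instead approach a limit cycle, as below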

Asymptotic optimality

dz/dt = A(z)z + b(z)

The system has an asymptotically stable equilibrium point if, from any start z(0), z(t) → z∗ as t → ∞.

Theorem [W. and Weiss '90]

If the bandits are indexable, and the fluid model for the Whittle index policy has an asymptotically stable equilibrium point, then the Whittle index policy is asymptotically optimal, in the sense that the reward per bandit tends to the reward obtained under the relaxed policy.

(Proof via large deviations theory for sample paths.)

Possibility of limit cycle

[Figure: the simplex z1 + z2 + z3 = 1, partitioned into three regions with affine dynamics ż = A1z + b1, ż = A2z + b2 and ż = A3z + b3.]

Here k = 3 and ρ = 1/2. The dynamics differ in the regions z1 > 1/2; z1 < 1/2 with z1 + z2 > 1/2; and z1 + z2 < 1/2.

Heuristic may not be asymptotically optimal

The counterexample uses two 4 × 4 transition-rate matrices, (q0ij) and (q1ij), which differ only in the rates out of state 3 [matrix entries not recoverable here], together with

r0 = (0, 1, 10, 10), r1 = (10, 10, 10, 0), ρ = 0.835.

The bandit is indexable.

The equilibrium point is (z1, z2, z3, z4) = (0.409, 0.327, 0.100, 0.164), so z1 + z2 + z3 = 0.836.

The relaxed policy obtains 10 per bandit per unit time.


Heuristic is not asymptotically optimal

But the equilibrium point z is not asymptotically stable.

[Figure: trajectories of z1, z2, z3, z4 against t, settling into a limit cycle around the levels 0.42, 0.32, 0.10, 0.16, with bands showing where a = 1, a = 0 and the randomized a = 0/1 are applied.]

The relaxed policy obtains 10 per bandit.

The heuristic obtains only 9.9993 per bandit.

Equilibrium point optimization

Recall the fluid model equilibrium point optimization problem:

minimize c(z, u) subject to ż = a(z, u) = 0, ∑i ziui = ρ, ∑i zi = 1,

where

c(z, u) = ∑i [c1i ui + c0i (1 − ui)] zi,

a(z, u)i = ∑j [uj q1ji + (1 − uj) q0ji] zj − zi ∑j [ui q1ij + (1 − ui) q0ij].

Dual:

φi + g = min { c0i − λ + ∑j q0ij φj , c1i + ∑j q1ij φj }.

'Large deviations'-inspired model

Suppose n is large, but we try to model more sensitively the dependence on n by considering deviations from the fluid model:

dz/dt = a(z, u) + ε(z, u).

Assume

P(ε(z, u)dt = η dt) ∼ e−nI(z,u,η)dt

(inspired by the theory of large deviations).

I(z, u, 0) = 0 and I(z, u, η) is convex increasing in η.

Risk-sensitive control

We are now positioned to consider a risk-sensitive performance measure:

E[∫₀ᵀ c(z, u) dt] is replaced by −(1/θ) log E[e−θ∫₀ᵀ c(z,u)dt] ≈ E[∫₀ᵀ c(z, u) dt] − (θ/2) var(∫₀ᵀ c(z, u) dt).

θ > 0 corresponds to risk-seeking.

θ < 0 corresponds to risk-aversion.
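(This is the small-θ cumulant expansion: writing X = ∫₀ᵀ c(z, u) dt,

−(1/θ) log E[e−θX] = −(1/θ)[−θEX + (θ²/2) var(X) + O(θ³)] = EX − (θ/2) var(X) + O(θ²).)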


Large deviations

Consider a path {z(t), u(t), ε(t), 0 ≤ t ≤ T}. Then

P(path) × cost(path) = e−n∫₀ᵀ I(z,u,ε)dt × e−θ∫₀ᵀ c(z,u)dt.

Let θ = βn.

E[exp(−θ∫₀ᵀ c(z, u)dt)] is determined by summing the above over all paths, and so essentially

E[exp(−θ∫₀ᵀ c(z, u)dt)] ∼ exp(−nβ inf_ε [∫₀ᵀ (c(z, u) + β−1 I(z, u, ε)) dt]).

Risk-sensitive optimal equilibrium point

The constraint ż = a(z, u) + ε can be imposed with a Lagrange multiplier, and so we seek z, u, ε, φ to extremize

∫₀ᵀ [c(z, u) + φᵀ(ż − a(z, u) − ε) + β−1 I(z, u, ε)] dt.

One could now look for the extremizing path. But suppose a fixed point is reached for large T. It is the point found by extremizing

c(z, u) + φᵀ(−a(z, u) − ε) + β−1 I(z, u, ε) + λ(1 − ρ − ∑i zi(1 − ui)).


Example

Consider a large number of components of a single type.

z1, z2 are the proportions of components in the up and down states.

u = 0: up components go down at rate z1;

u = 1: up components go down at rate z1, and down components go up at rate z2.

Stochastic model:

ż2 = −z2u + z1 + ε2,

where ε2 dt is Brownian motion with variance z1 + z2u.

I(z, u, ε2) = ε2² / (2(z1 + z2u)).
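(This is the standard Gaussian rate function: for noise of instantaneous variance σ², a deviation of size η has rate I(η) = η²/(2σ²); here σ² = z1 + z2u.)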

Risk-sensitive optimal equilibrium point

Suppose the cost rate is cz2u − rz1.

In pursuit of an index, we ask for what λ both u = 0 and u = 1 are optimal when extremizing (w.r.t. zi, φ, ε2)

cz2u − rz1 − φ(−z2u + z1 + ε2) + β−1 ε2²/(2(z1 + z2u)) − λz2(1 − u).

λ is a subsidy for not attending to components that are down.

. . . we find a candidate for a risk-sensitive index:

λ∗i = (1/2)(ri − ci) + (1/8)β(ri + ci)².

Consider two component types for which r1 − c1 = r2 − c2.

If β = 0, we are indifferent between them.

If β > 0 (risk-seeking), there is a distinct preference.
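(For instance, with invented numbers: take (r1, c1) = (3, 1) and (r2, c2) = (5, 3), so r1 − c1 = r2 − c2 = 2. Then λ∗1 = 1 + (β/8)·16 = 1 + 2β and λ∗2 = 1 + (β/8)·64 = 1 + 8β; for β > 0 the type with the larger ri + ci has the strictly greater index.)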


Summary of Lecture 2

Multi-class M/M/1 (preemptive)

Achievable region method

Multi-class M/G/1 (nonpreemptive)

Branching bandits

Tax problems

Restless bandits

Whittle index

Asymptotic optimality

Risk-sensitive indices

Questions

Reading list

D. Bertsimas and J. Nino-Mora. Conservation laws, extended polymatroids and multiarmed bandit problems; a polyhedral approach to indexable systems. Math. Oper. Res., 21:257–306, 1996.

J. C. Gittins, K. D. Glazebrook and R. R. Weber. Multiarmed Bandit Allocation Indices (2nd edition), Wiley, 2011.

K. D. Glazebrook, D. J. Hodge, C. Kirkbride and R. J. Minty. Stochastic scheduling: a short history of index policies and new approaches to index generation for dynamic resource allocation. J. Scheduling, 2013.

K. Liu, R. R. Weber and Q. Zhao. Indexability and Whittle index for restless bandit problems involving reset processes. In Proc. 50th IEEE Conference on Decision and Control and European Control Conference (CDC-ECC), 12–15 December 2011, 7690–7696.

R. R. Weber and G. Weiss. On an index policy for restless bandits. J. Appl. Probab., 27:637–648, 1990.

P. Whittle. Restless bandits: activity allocation in a changing world. In A Celebration of Applied Probability, ed. J. Gani, J. Appl. Probab., 25A:287–298, 1988.

P. Whittle. Risk-sensitivity, large deviations and stochastic control. Eur. J. Oper. Res., 73:295–303, 1994.