A Simple Distribution-Free Approach to the Max k-Armed Bandit Problem
Matthew Streeter and Stephen Smith, Carnegie Mellon University


Page 1

A Simple Distribution-Free Approach to the Max k-Armed Bandit Problem

Matthew Streeter and Stephen Smith

Carnegie Mellon University

Page 2

Outline
- The max k-armed bandit problem
- Previous work
- Our distribution-free approach
- Experimental evaluation

Page 3

What is the max k-armed bandit problem?

Page 4

The classical k-armed bandit (> 50 years of papers)

- You are in a room with k slot machines
- Pulling the arm of machine i returns a payoff drawn (independently at random) from an unknown distribution D_i
- Allowed n total pulls
- Goal: maximize total payoff

Page 5

The max k-armed bandit (introduced ~2003)

- You are in a room with k slot machines
- Pulling the arm of machine i returns a payoff drawn (independently at random) from an unknown distribution D_i
- Allowed n total pulls
- Goal: maximize the highest payoff
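The only change from the classical problem is the objective. A minimal Python illustration, with each arm modeled as a zero-argument callable returning one payoff (the hypothetical arms and schedule here are ours, for illustration only):

```python
import random

# Hypothetical arms: each call is one pull of that machine.
arms = [lambda: random.random() for _ in range(5)]

# Any schedule of n = 100 pulls (here, simple round-robin).
payoffs = [arms[i % len(arms)]() for i in range(100)]

total_payoff = sum(payoffs)  # classical k-armed bandit objective
best_payoff = max(payoffs)   # max k-armed bandit objective
```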

Page 6

Why study it?

Page 7

Goal: improve multi-start heuristics

A multi-start heuristic runs an underlying randomized heuristic many times and returns the best solution found

Examples:
- HBSS (Bresina 1996)
- VBSS (Cicirello & Smith 2005)
- GRASPs (Feo & Resende 1995, and many others)

Page 8

Application: selecting among heuristics

- Given: some optimization problem and k randomized heuristics
- Each time you run a heuristic, you get a solution with a certain quality
- Allowed n runs
- Goal: maximize the quality of the best solution

Page 9

The max k-armed bandit: example

Given n pulls, how can we maximize the (expected) maximum payoff?

- If n = 1, you should pull the blue arm (higher mean)
- If n = 1000, you should mainly pull the maroon arm (higher variance)

[Figure: payoff distributions of the blue and maroon arms]
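A quick Monte Carlo check of this intuition. The Gaussian payoff distributions below are hypothetical stand-ins for the blue and maroon arms (the slides do not specify them):

```python
import random
import statistics

blue = lambda: random.gauss(0.5, 0.05)    # higher mean, low variance
maroon = lambda: random.gauss(0.4, 0.20)  # lower mean, high variance

def expected_max(arm, n, trials=1000):
    # Monte Carlo estimate of E[max payoff over n pulls of one arm]
    return statistics.mean(max(arm() for _ in range(n)) for _ in range(trials))

print(expected_max(blue, 1), expected_max(maroon, 1))        # blue wins at n = 1
print(expected_max(blue, 1000), expected_max(maroon, 1000))  # maroon wins at n = 1000
```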

Page 10

Distributional assumptions?

Without distributional assumptions, the optimal strategy is not interesting.

For example, suppose payoffs are in {0, 1} and the arms are shuffled so you don’t know which is which. The optimal strategy samples the arms in round-robin order!

You can’t distinguish a “good” arm until you receive a payoff of 1, at which point the max payoff can’t be improved.

Page 11

Distributional assumptions?

All previous work assumed each machine returns payoffs from a generalized extreme value (GEV) distribution.

Why? The Extremal Types Theorem: let M_n be the maximum of n independent draws from some fixed distribution. As n → ∞, the distribution of M_n (suitably normalized) converges to a GEV distribution.

The GEV sometimes gives an excellent fit to the payoff distributions we care about.
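For reference, the GEV family here is the standard one from extreme value theory (textbook material, not from the slides): if (M_n − b_n)/a_n converges in distribution for some normalizing constants a_n > 0 and b_n, the limit has CDF

```latex
G(x) = \exp\left\{ -\left[ 1 + \xi \, \frac{x - \mu}{\sigma} \right]^{-1/\xi} \right\},
\qquad 1 + \xi \, \frac{x - \mu}{\sigma} > 0,
```

with location μ, scale σ > 0, and shape ξ. The Gumbel case assumed by Cicirello & Smith is the ξ → 0 limit, G(x) = exp(−e^{−(x−μ)/σ}).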

Page 12

Previous work

- Cicirello & Smith (CP 2004, AAAI 2005): assumed Gumbel distributions (a special case of the GEV); no rigorous performance guarantees, but good results selecting among heuristics for the RCPSP/max
- Streeter & Smith (AAAI 2006): a rigorous result for general GEV distributions, but no experimental evaluation

Page 13

Our contributions

- Threshold Ascent: a strategy that solves the max k-armed bandit problem using a classical k-armed bandit solver as a subroutine
- Chernoff interval estimation: a strategy for the classical k-armed bandit problem that works well when mean payoffs are small (we assume payoffs in [0, 1])

Page 14

Threshold Ascent

Parameters: a strategy S for the classical k-armed bandit problem, an integer m > 0

Idea:
- Initialize threshold t
- Use S to maximize the number of payoffs that exceed t
- Once m payoffs > t have been received, increase t and repeat

Page 15

Threshold Ascent

Designed to work well when: for t > t_critical, there is a growing gap between the probability that the eventually-best arm yields a payoff > t and the corresponding probability for the other arms.
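One way to formalize that gap (our reading, not necessarily the paper’s exact condition): writing p_i(t) = Pr[one pull of arm i yields a payoff > t] and i* for the eventually-best arm,

```latex
\frac{p_{i^*}(t)}{\max_{i \neq i^*} p_i(t)} \;\longrightarrow\; \infty
\qquad \text{as } t \to \infty,
```

so once t passes t_critical, payoffs above t increasingly single out i*.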

Page 16

Threshold Ascent

Parameters: a strategy S for the classical k-armed bandit problem, an integer m > 0

Idea:
- Initialize threshold t
- Use S to maximize the number of payoffs that exceed t
- Once m payoffs > t have been received, increase t and repeat

Notes:
- m controls the exploration/exploitation tradeoff (larger m means the algorithm converges more before increasing t)
- as t gets large, S sees a classical k-armed bandit instance where almost all payoffs are zero
- we don’t really start S from scratch each time we increase t
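Putting the pieces together, here is a minimal runnable sketch, assuming payoffs in [0, 1]. The subroutine S is Chernoff interval estimation applied to the indicator rewards 1[payoff > t]; the rule for raising t (to the m-th largest payoff seen so far) and the exact bound constants are illustrative assumptions on our part, not the paper’s precise choices.

```python
import math

def threshold_ascent(arms, n, m=100, delta=0.01):
    # arms: list of zero-argument callables, each returning a payoff in [0, 1]
    k = len(arms)
    per_arm = [[] for _ in range(k)]  # raw payoffs per arm
    observed = []                     # all payoffs, used to pick new thresholds
    t, best = 0.0, 0.0
    a = math.log(1.0 / delta)         # confidence parameter

    for _ in range(n):
        def upper_bound(i):
            pulls = len(per_arm[i])
            if pulls == 0:
                return math.inf       # try every arm at least once
            mean = sum(1 for x in per_arm[i] if x > t) / pulls
            # Chernoff-style interval: the radius scales with sqrt(mean),
            # which stays informative as t rises and mean approaches zero
            return mean + math.sqrt(2 * mean * a / pulls) + 3 * a / pulls

        i = max(range(k), key=upper_bound)
        x = arms[i]()
        per_arm[i].append(x)
        observed.append(x)
        best = max(best, x)

        # Once m payoffs exceed t, raise t. Because raw payoffs are stored,
        # the success counts are simply recomputed for the new t, matching
        # the note above that S is not restarted from scratch.
        if sum(1 for x in observed if x > t) >= m:
            t = sorted(observed, reverse=True)[m - 1]

    return best
```

With m = 100 and delta = 0.01 (99% confidence intervals), this matches the parameter settings reported in the evaluation below.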

Page 17

Interval Estimation Interval estimation (Lai & Robbins 1987, Kaelbling 1993) maintains confidence interval for each arm’s mean payoff; pulls arm with highest upper bound

[Figure: confidence intervals for Arm 1, Arm 2, and Arm 3; the arm with the highest upper bound is pulled next]
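In code, interval estimation is a short loop once the confidence radius is fixed. A generic sketch (the function names and the radius interface are ours):

```python
import math

def interval_estimation(arms, n, radius):
    # Maintain per-arm payoff sums and pull counts; always pull the arm
    # whose upper confidence bound (empirical mean + radius) is highest.
    sums = [0.0] * len(arms)
    pulls = [0] * len(arms)

    def upper(i):
        if pulls[i] == 0:
            return math.inf           # pull each arm at least once
        mean = sums[i] / pulls[i]
        return mean + radius(mean, pulls[i])

    for _ in range(n):
        i = max(range(len(arms)), key=upper)
        sums[i] += arms[i]()
        pulls[i] += 1
    return sums, pulls
```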

Page 18

Chernoff Interval Estimation

We analyze a variant of interval estimation with confidence intervals derived from Chernoff bounds

regret = μ* − average_payoff(strategy), where μ* = mean payoff of the best arm

We prove an O(√μ* · X) regret bound, where X = √(k (log n) / n).

Using Hoeffding’s inequality gives only O(X) (Auer et al. 2002). As μ* → 0, our bound is much better.

Can get comparable bounds using “multiplicative weight update” algorithms
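To see why the Chernoff-style interval helps, compare the two radii at the same confidence level; both forms below are standard, but the constants are illustrative rather than the paper’s exact expressions. Either function plugs into the interval_estimation sketch above.

```python
import math

a = math.log(1 / 0.01)  # log(1/delta) for 99% confidence

def hoeffding_radius(mean, pulls):
    # Hoeffding: the radius does not depend on the empirical mean
    return math.sqrt(a / (2 * pulls))

def chernoff_radius(mean, pulls):
    # Multiplicative Chernoff: the radius shrinks with sqrt(mean), so it is
    # far tighter for the near-zero means a high threshold t produces
    return math.sqrt(2 * mean * a / pulls) + 3 * a / pulls

for mean in (0.5, 0.05, 0.005):
    print(mean, hoeffding_radius(mean, 1000), chernoff_radius(mean, 1000))
```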

Page 19

Experimental Evaluation

Page 20

The RCPSP/max

- Assign start times to activities subject to resource and temporal constraints
- Goal: find a schedule with minimum makespan
- NP-hard; “one of the most intractable problems in operations research” (Möhring 2000)
- Multi-start heuristics give state-of-the-art performance (Cicirello & Smith 2005)

Page 21

Evaluation

Five multi-start heuristics; each is a randomized rule for greedily building a schedule:
- LPF: “longest path following”
- LST: “latest start time”
- MST: “minimum slack time”
- MTS: “most total successors”
- RSM: “resource scheduling method”

Three max k-armed bandit strategies:
- Threshold Ascent (m = 100, S = Chernoff interval estimation with 99% confidence intervals)
- round-robin sampling
- QD-BEACON (Cicirello & Smith 2004, 2005)

Note: we use a less aggressive variant of interval estimation in these experiments.

Page 22

Evaluation

- Ran on 169 instances from the ProGen/max library
- For each instance, ran each of the five rules 10,000 times and saved the results to a file
- For each of the three strategies, solved each instance as a max 5-armed bandit with n = 10,000 pulls
- Define regret = the difference between the maximum possible payoff and the maximum payoff actually obtained

Page 23

Results

Threshold Ascent outperforms the other max k-armed bandit strategies, as well as the five “pure” strategies

Page 24

Summary & Conclusions

- The max k-armed bandit problem is a simple online learning problem with applications to heuristic search
- We described a new, distribution-free approach to the max k-armed bandit problem
- Our strategy is effective at selecting among randomized priority dispatching rules for the RCPSP/max