Taming the monster: A fast and simple algorithm for contextual bandits
Presented by Satyen Kale


Joint work with Alekh Agarwal, Daniel Hsu, John Langford, Lihong Li, and Rob Schapire

Learning to interact: example #1

Loop:
1. Patient arrives with symptoms, medical history, genome, …
2. Physician prescribes treatment.
3. Patient’s health responds (e.g., improves, worsens).

Goal: prescribe treatments that yield good health outcomes.

Learning to interact: example #2

Loop:
1. User visits website with profile, browsing history, …
2. Website operator chooses content/ads to display.
3. User reacts to content/ads (e.g., click, “like”).

Goal: choose content/ads that yield desired user behavior.

Contextual bandit setting (i.i.d. version)

Set X of contexts/features and K possible actions.

For t = 1, 2, …, T:
0. Nature draws (x_t, r_t) from distribution D over X × [0,1]^K.
1. Observe context x_t. [e.g., user profile, browsing history]
2. Choose action a_t ∈ [K]. [e.g., content/ad to display]
3. Collect reward r_t(a_t). [e.g., indicator of click or positive feedback]

Goal: algorithm for choosing actions a_t that yield high reward.

Contextual setting: use features x_t to choose good actions a_t.

Bandit setting: r_t(a) for a ≠ a_t is not observed.
⇒ exploration vs. exploitation.
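To make the protocol concrete, here is a minimal Python sketch of the i.i.d. contextual bandit loop. The synthetic distribution D (a toy linear reward model), the dimensions, and the uniformly random action choice are illustrative assumptions, not part of the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
K, T, d = 5, 1000, 10                       # actions, rounds, context dimension
theta = rng.random((K, d))                  # hidden parameters defining mean rewards (toy choice)

def draw_context_and_rewards():
    """Stand-in for nature's draw of (x_t, r_t) from D over X x [0,1]^K."""
    x = rng.random(d)
    r = np.clip(theta @ x / d + 0.1 * rng.random(K), 0.0, 1.0)
    return x, r

total_reward = 0.0
for t in range(T):
    x_t, r_t = draw_context_and_rewards()   # step 0 (the full vector r_t stays hidden)
    a_t = int(rng.integers(K))               # steps 1-2: a deliberately naive uniform choice
    total_reward += r_t[a_t]                 # step 3: only r_t[a_t] is observed
print("average reward:", total_reward / T)
```

Replacing the uniform choice with anything that uses x_t is exactly the learning problem the rest of the talk addresses.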

Learning objective and difficulties

No single action is good in all situations – need to exploit context.

Policy class Π: set of functions (“policies”) from X → [K]
(e.g., advice of experts, linear classifiers, neural networks).

Regret (i.e., relative performance) to policy class Π:
Regret(T) = max_{π ∈ Π} Σ_{t=1}^{T} r_t(π(x_t)) − Σ_{t=1}^{T} r_t(a_t)

… a strong benchmark if Π contains a policy with high reward.

Difficulties:
- feedback on the chosen action only informs about a subset of policies;
- explicit bookkeeping is computationally infeasible when Π is large.

Arg max oracle (AMO)

Given fully-labeled data (x_1, r_1), …, (x_t, r_t), the AMO returns
arg max_{π ∈ Π} Σ_{s=1}^{t} r_s(π(x_s)).

Abstraction for efficient search of policy class Π.

In practice: implement using standard heuristics (e.g., convex relaxations, backprop) for cost-sensitive multiclass learning algorithms.
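As one concrete illustration of such a heuristic, here is a hedged Python sketch that approximates an AMO with per-action ridge regression from scikit-learn; the implicit policy class (“argmax of K regressors”) and the regression-based reduction are assumptions for this example, not the paper's construction:

```python
import numpy as np
from sklearn.linear_model import Ridge

def amo(xs, rs, K):
    """Heuristic stand-in for the arg max oracle: given fully labeled data
    (x_1, r_1), ..., (x_t, r_t) with r_s in [0,1]^K, return a policy with
    (approximately) maximal total reward  sum_s r_s(pi(x_s))."""
    X = np.asarray(xs)
    R = np.asarray(rs)                                   # shape (t, K)
    models = [Ridge(alpha=1.0).fit(X, R[:, a]) for a in range(K)]

    def policy(x):
        scores = [m.predict(np.asarray(x).reshape(1, -1))[0] for m in models]
        return int(np.argmax(scores))                    # pick the action with highest predicted reward
    return policy
```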

Our results

New fast and simple algorithm for contextual bandits:
- Optimal regret bound (up to log factors): Õ(√(K T log |Π|)).
- Amortized Õ(√(K / (T log |Π|))) calls to the arg max oracle (AMO) per round, i.e., Õ(√(K T / log |Π|)) calls in total.

Comparison to previous work:
- [Thompson’33]: no general analysis.
- [ACBFS’02]: Exp4 algorithm; optimal regret, enumerates policies.
- [LZ’07]: ε-greedy variant; suboptimal regret, one AMO call/round.
- [DHKKLRZ’11]: “monster paper”; optimal regret, O(T^5 K^4) AMO calls/round.

Note: Exp4 also works in adversarial setting.

Rest of this talk

1. Action distributions, reward estimates via inverse probability weights [oldies but goodies]

2. Algorithm for finding policy distributions that balance exploration/exploitation [new]

3. Warm-start / epoch trick [new]

Basic algorithm structure (same as Exp4)

Start with initial distribution Q_1 over policies Π.

For t = 1, 2, …, T:
0. Nature draws (x_t, r_t) from distribution D over X × [0,1]^K.
1. Observe context x_t.
2a. Compute distribution p_t over actions {1, 2, …, K} (based on Q_t and x_t).
2b. Draw action a_t from p_t.
3. Collect reward r_t(a_t).
4. Compute new distribution Q_{t+1} over policies Π.
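A sketch of step 2a in Python: project the policy distribution Q_t onto an action distribution p_t, mixing in a minimum exploration probability μ so that inverse-probability weights stay bounded. The uniform handling of leftover mass and this particular smoothing form are assumptions for illustration:

```python
import numpy as np

def action_distribution(Q, x, K, mu):
    """Project a (sub-)distribution Q over policies onto a distribution over actions.

    Q  : dict mapping policy (a callable x -> action in {0,...,K-1}) to its weight
    mu : minimum probability mixed in for every action (assumed smoothing scheme)
    """
    p = np.zeros(K)
    for policy, w in Q.items():
        p[policy(x)] += w
    p += (1.0 - p.sum()) / K          # spread any leftover mass uniformly (one simple choice)
    return (1.0 - K * mu) * p + mu    # every action now has probability at least mu
```

Sampling a_t from the returned p_t, and recording p_t(a_t), is what makes the inverse-probability-weighted estimates described next well defined.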

Inverse probability weighting (old trick)

Importance-weighted estimate of reward from round t:
r̂_t(a) = r_t(a_t) · 1{a = a_t} / p_t(a_t).

Unbiased, and has range & variance bounded by 1/p_t(a).

Can estimate total reward and regret of any policy:
R̂_t(π) = Σ_{s=1}^{t} r̂_s(π(x_s)),   R̂eg_t(π) = max_{π′ ∈ Π} R̂_t(π′) − R̂_t(π).
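A minimal Python sketch of these estimators (the function names and array-based interface are mine, for illustration):

```python
import numpy as np

def ips_reward_estimate(a_t, r_observed, p_t, K):
    """Importance-weighted estimate of the full reward vector for one round:
    r_hat[a] = r_t(a_t) * 1{a == a_t} / p_t(a_t), unbiased with range/variance <= 1/p_t(a)."""
    r_hat = np.zeros(K)
    r_hat[a_t] = r_observed / p_t[a_t]
    return r_hat

def estimated_total_reward(policy, contexts, reward_estimates):
    """Plug-in estimate of a policy's total reward: sum over rounds of r_hat_s(policy(x_s))."""
    return sum(r_hat[policy(x)] for x, r_hat in zip(contexts, reward_estimates))
```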

Constructing policy distributions

Optimization problem (OP):

Find policy distribution Q such that:

Low estimated regret (LR) – “exploitation”

Low estimation variance (LV) – “exploration”

Theorem: If we obtain policy distributions Q_t by solving (OP), then with high probability, the regret after T rounds is at most Õ(√(K T log |Π|)).
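For orientation, a schematic (not verbatim) form of the two constraint families, with constants suppressed; here Q^μ(a|x) denotes the smoothed action distribution induced by Q, and μ_t is a minimum exploration probability of order roughly √(log|Π| / (K t)). This is a paraphrase under those assumptions, not the paper's exact statement:

```latex
% Schematic only; see the paper for the exact constants and definitions.
\text{(LR)}\quad \sum_{\pi \in \Pi} Q(\pi)\,\widehat{\mathrm{Reg}}_t(\pi) \;\le\; O(K \mu_t),
\qquad
\text{(LV)}\quad \forall \pi \in \Pi:\;
\widehat{\mathbb{E}}_{x}\!\left[\frac{1}{Q^{\mu_t}(\pi(x)\mid x)}\right]
\;\le\; O(K) + \frac{\widehat{\mathrm{Reg}}_t(\pi)}{O(\mu_t)},
\quad\text{where } Q^{\mu}(a\mid x) = (1-K\mu)\sum_{\pi} Q(\pi)\,\mathbf{1}\{\pi(x)=a\} + \mu .
```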

Feasibility

Feasibility of (OP): implied by minimax argument.

Monster solution [DHKKLRZ’11]: solves variant of (OP) with ellipsoid algorithm, where Separation Oracle = AMO + perceptron + ellipsoid.

Coordinate descent algorithm

INPUT: initial weights Q.
LOOP:
  IF (LR) is violated, THEN replace Q by cQ.
  IF there is a policy π causing (LV) to be violated, THEN
    UPDATE Q(π) = Q(π) + α.
  ELSE
    RETURN Q.

Above, both 0 < c < 1 and α have closed form expressions.

(Technical detail: actually optimize over sub-distributions Q that may sum to < 1.)

Claim: whether (LV) is violated (and by which policy π) can be checked by making one AMO call per iteration.
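A structural Python sketch of this loop; lr_violated, find_lv_violator, shrink_factor, and step_size are hypothetical helpers standing in for the paper's closed-form tests and updates, with the single AMO call hidden inside find_lv_violator:

```python
def coordinate_descent_for_op(Q, lr_violated, find_lv_violator, shrink_factor, step_size,
                              max_iters=10_000):
    """Structure of the coordinate descent solver for (OP); a sketch, not the exact algorithm.

    Q is a dict mapping policies to nonnegative weights (a sub-distribution, may sum to < 1).
    """
    for _ in range(max_iters):
        if lr_violated(Q):                           # (LR): total estimated regret too high
            c = shrink_factor(Q)                     # closed-form 0 < c < 1
            Q = {pi: c * w for pi, w in Q.items()}
        pi = find_lv_violator(Q)                     # the one AMO call per iteration happens here
        if pi is None:                               # no (LV) constraint is violated
            return Q                                 # Q solves (OP)
        Q[pi] = Q.get(pi, 0.0) + step_size(Q, pi)    # closed-form alpha > 0
    return Q
```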

Iteration bound for coordinate descent

# steps of coordinate descent: bounded via a potential function argument.

Also gives a bound on the sparsity of Q.

Warm-start

If we warm-start coordinate descent (initialize with Q_t to get Q_{t+1}), then only
Õ(√(K T / log |Π|)) coordinate descent iterations are needed over all T rounds.

Caveat: need one AMO call/round to even check if (OP) is solved.

Epoch trick

Regret analysis: Q_t has low instantaneous expected regret (crucially relying on the i.i.d. assumption).
Therefore the same Q_t can be used for O(t) more rounds!

Epoching: split the T rounds into epochs and solve (OP) once per epoch.

Doubling: only update on rounds 2^1, 2^2, 2^3, 2^4, …
Total of O(log T) updates, so overall # AMO calls unchanged (up to log factors).

Squares: only update on rounds 1^2, 2^2, 3^2, 4^2, …
Total of O(T^{1/2}) updates, each requiring only a small number of AMO calls on average.
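A small Python helper illustrating which rounds trigger a re-solve of (OP) under the two schedules (the function name is mine):

```python
def update_rounds(T, schedule="doubling"):
    """Rounds on which (OP) is re-solved under the two epoch schedules."""
    if schedule == "doubling":                      # rounds 2^1, 2^2, 2^3, ... -> O(log T) updates
        rounds, t = [], 2
        while t <= T:
            rounds.append(t)
            t *= 2
        return rounds
    # "squares": rounds 1^2, 2^2, 3^2, ...          -> O(sqrt(T)) updates
    return [i * i for i in range(1, int(T ** 0.5) + 1)]

print(update_rounds(100, "doubling"))   # [2, 4, 8, 16, 32, 64]
print(update_rounds(100, "squares"))    # [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
```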

Experiments

Algorithm       | Epsilon-greedy | Bagging | Linear UCB | “Online Cover” | [Supervised]
Loss            | 0.095          | 0.059   | 0.128      | 0.053          | 0.051
Time (seconds)  | 22             | 339     | 212000     | 17             | 6.9

Bandit problem derived from a classification task (RCV1). Reporting progressive validation loss.

“Online Cover” = variant with stateful AMO.
