
Reinforcement Learning: Exploration vs Exploitation

Marcello Restelli

March–April, 2015


Exploration vs Exploitation Dilemma

Online decision making involves a fundamental choice:

Exploitation: make the best decision given current information
Exploration: gather more information

The best long–term strategy may involve short–term sacrifices
Gather enough information to make the best overall decisions


Examples

Restaurant Selection
  Exploitation: Go to favorite restaurant
  Exploration: Try a new restaurant

Online Banner Advertisements
  Exploitation: Show the most successful advert
  Exploration: Show a different advert

Oil Drilling
  Exploitation: Drill at the best known location
  Exploration: Drill at a new location

Game Playing
  Exploitation: Play the move you believe is best
  Exploration: Play an experimental move

Clinical Trial
  Exploitation: Choose the best treatment so far
  Exploration: Try a new treatment


Common Approaches in RL

ε–Greedy

$$a_t = \begin{cases} a_t^* & \text{with probability } 1-\varepsilon \\ \text{random action} & \text{with probability } \varepsilon \end{cases}$$

Softmax
  Bias exploration towards promising actions
  Softmax action selection methods grade action probabilities by estimated values
  The most common softmax uses a Gibbs (or Boltzmann) distribution:

$$\pi(a \mid s) = \frac{e^{Q(s,a)/\tau}}{\sum_{a' \in A} e^{Q(s,a')/\tau}}$$

τ is a "computational" temperature:
  τ → ∞: uniform random selection, P = 1/|A|
  τ → 0: greedy selection
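As a quick illustration of these two rules, here is a minimal Python sketch; the value array `Q`, the rate `epsilon`, and the temperature `tau` are hypothetical inputs chosen for the example, not quantities defined in the slides.

```python
import numpy as np

def epsilon_greedy(Q, epsilon, rng=np.random.default_rng()):
    """Pick argmax_a Q[a] with probability 1 - epsilon, a random arm with probability epsilon."""
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))
    return int(np.argmax(Q))

def softmax_action(Q, tau, rng=np.random.default_rng()):
    """Sample an arm from the Gibbs/Boltzmann distribution over estimated values."""
    prefs = np.asarray(Q, dtype=float) / tau
    prefs -= prefs.max()                        # numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(rng.choice(len(Q), p=probs))
```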


Multi–Arm Bandits (MABs)

A multi–armed bandit is a tuple ⟨A, R⟩
  A is a set of N possible actions (one per machine = arm)
  R(r|a) is an unknown probability distribution of rewards given the chosen action
At each time step t the agent selects an action a_t ∈ A
The environment generates a reward r_t ∼ R(·|a_t)

The goal is to maximize the cumulative reward $\sum_{t=1}^{T} r_t$
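For concreteness, a minimal environment sketch, assuming Gaussian reward distributions (the slides leave R(r|a) unspecified); the class name `GaussianBandit` and its interface are illustrative only.

```python
import numpy as np

class GaussianBandit:
    """N-armed bandit with unknown Gaussian reward distributions R(.|a)."""
    def __init__(self, n_arms, rng=None):
        self.rng = rng or np.random.default_rng()
        self.q_true = self.rng.normal(0.0, 1.0, size=n_arms)  # unknown Q(a)

    def pull(self, a):
        """Return a reward r_t ~ N(Q(a), 1) for the chosen arm."""
        return self.rng.normal(self.q_true[a], 1.0)

    @property
    def optimal_value(self):
        return self.q_true.max()   # V* = max_a Q(a)
```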


Regret

The action–value is the mean reward for action a

$$Q(a) = \mathbb{E}[r \mid a]$$

The optimal value V* is

$$V^* = Q(a^*) = \max_{a \in A} Q(a)$$

The regret is the opportunity loss for one step

$$l_t = \mathbb{E}[V^* - Q(a_t)]$$

The total regret is the total opportunity loss

$$L_T = \mathbb{E}\left[\sum_{t=1}^{T} \left(V^* - Q(a_t)\right)\right]$$

Maximize cumulative reward ≡ minimize total regret


Counting Regret

The count N_t(a) is the expected number of selections of action a
The gap ∆_a is the difference in value between action a and the optimal action a*: ∆_a = V* − Q(a)

Regret is a function of the gaps and the counts

$$L_T = \mathbb{E}\left[\sum_{t=1}^{T} \left(V^* - Q(a_t)\right)\right] = \sum_{a \in A} \mathbb{E}[N_T(a)]\,(V^* - Q(a)) = \sum_{a \in A} \mathbb{E}[N_T(a)]\,\Delta_a$$

A good algorithm ensures small counts for large gaps
Problem: the gaps are not known!
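A tiny sketch of this decomposition, purely for illustration: given per-arm pull counts and the (normally unknown) true values, it computes the total regret as the count-weighted sum of gaps.

```python
import numpy as np

def total_regret(counts, q_true):
    """Total regret from per-arm counts N_T(a) and true values Q(a)."""
    counts = np.asarray(counts, dtype=float)
    q_true = np.asarray(q_true, dtype=float)
    gaps = q_true.max() - q_true          # Delta_a = V* - Q(a)
    return float(np.dot(counts, gaps))    # sum_a N_T(a) * Delta_a

# Example: arm 0 is optimal; pulling arm 2 often with a large gap hurts most.
print(total_regret(counts=[80, 15, 5], q_true=[1.0, 0.8, 0.2]))  # 15*0.2 + 5*0.8 = 7.0
```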


Greedy Algorithm

We consider algorithms that estimate Q_t(a) ≈ Q(a)

Estimate the value of each action by Monte–Carlo evaluation:

$$Q_t(a) = \frac{1}{N_t(a)} \sum_{\tau=1}^{t} r_\tau \mathbf{1}(a_\tau = a)$$

The greedy algorithm selects the action with the highest estimated value:

$$a_t^* = \arg\max_{a \in A} Q_t(a)$$

Greedy can lock onto a suboptimal action forever
⇒ Greedy has linear total regret


ε–Greedy Algorithm

The ε–greedy algorithm continues to explore forever
  With probability 1 − ε select a = arg max_{a∈A} Q_t(a)
  With probability ε select a random action

Constant ε ensures a minimum per-step regret

$$l_t \ge \frac{\varepsilon}{|A|} \sum_{a \in A} \Delta_a$$

⇒ ε–greedy has linear total regret


ε–Greedy on the 10–Armed Testbed

N = 10 possible actions
The Q(a) are chosen randomly from a normal distribution N(0, 1)
Rewards r_t are also normal: N(Q(a_t), 1)
1000 plays
Results averaged over 2000 trials
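A sketch of this testbed experiment, assuming ε–greedy with sample-average estimates; the plots from the original slides (average reward per play for different ε) are not reproduced here, but this loop regenerates the underlying data.

```python
import numpy as np

def run_testbed(epsilon, n_arms=10, n_plays=1000, n_trials=2000, seed=0):
    """Average reward per play of epsilon-greedy on the 10-armed Gaussian testbed."""
    rng = np.random.default_rng(seed)
    avg_reward = np.zeros(n_plays)
    for _ in range(n_trials):
        q_true = rng.normal(0.0, 1.0, n_arms)       # Q(a) ~ N(0, 1)
        q_est = np.zeros(n_arms)                    # sample-average estimates
        counts = np.zeros(n_arms)
        for t in range(n_plays):
            if rng.random() < epsilon:
                a = int(rng.integers(n_arms))
            else:
                a = int(np.argmax(q_est))
            r = rng.normal(q_true[a], 1.0)          # r_t ~ N(Q(a_t), 1)
            counts[a] += 1
            q_est[a] += (r - q_est[a]) / counts[a]  # incremental mean
            avg_reward[t] += r
    return avg_reward / n_trials

# e.g. compare epsilon = 0, 0.01, 0.1 as in the original experiment
```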


Optimistic Initial Values

All methods depend on Q_0(a), i.e., they are biased
Encourage exploration: initialize action values optimistically


Decaying εt–Greedy Algorithm

Pick a decay schedule for ε_1, ε_2, ...
Consider the following schedule:

$$c > 0, \qquad d = \min_{a \,:\, \Delta_a > 0} \Delta_a, \qquad \varepsilon_t = \min\left\{1, \frac{c|A|}{d^2 t}\right\}$$

Decaying ε_t–greedy has logarithmic asymptotic total regret!
Unfortunately, this schedule requires advance knowledge of the gaps
Goal: find an algorithm with sublinear regret for any multi–armed bandit (without knowledge of R)
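A tiny sketch of this schedule, assuming the smallest gap d is known (which, as just noted, it usually is not); the function name and the choice c = 1 are illustrative.

```python
def epsilon_schedule(t, n_arms, d, c=1.0):
    """Decaying exploration rate eps_t = min(1, c*|A| / (d^2 * t)) for t >= 1."""
    return min(1.0, c * n_arms / (d * d * t))

# Example: with |A| = 10, smallest gap d = 0.2, c = 1:
#   eps_1 = 1.0 (capped), eps_1000 = min(1, 10 / (0.04 * 1000)) = 0.25
```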


Two Formulations

Bayesian formulation
  (Q(a_1), Q(a_2), ...) are random variables with prior distribution (f_1, f_2, ...)
  Policy: choose an arm based on the priors (f_1, f_2, ...) and the observation history
  Total discounted reward over an infinite horizon:

$$V_\pi(f_1, f_2, \ldots) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \beta^t R_{\pi(t)} \,\Big|\, f_1, f_2, \ldots\right], \qquad 0 < \beta < 1$$

Frequentist formulation
  (Q(a_1), Q(a_2), ...) are unknown deterministic parameters
  Policy: choose an arm based on the observation history
  Total reward over a finite horizon of length T


Bandit and MDP

The multi-armed bandit as a class of MDPs
  N independent arms with fully observable states [Z_1(t), ..., Z_N(t)]
  One arm is activated at each time step
  The active arm changes state (a known Markov process) and offers reward R_i(Z_i(t))
  Passive arms are frozen and generate no reward

Why is sampling stochastic processes with unknowndistributions an MDP?

The state of each arm is the posterior distribution f_i(t) (the information state)
For an active arm, f_i(t+1) is updated from f_i(t) and the new observation
For a passive arm, f_i(t+1) = f_i(t)

Solving MAB using DP: exponential complexity w.r.t. N


Gittins Index

The index structure of the optimal policy [Gittins '74]
  Assign each state of each arm a priority index
  Activate the arm with the highest current index value

Complexity
  Arms are decoupled (one N-dimensional problem becomes N separate 1-dimensional problems)
  Linear complexity in N
  Forward induction: polynomial (cubic) in the state-space size of a single arm
  Computational cost is too high for real-time use

Approximations of the Gittins index
  Thompson sampling (a minimal sketch follows below)

Incomplete learning
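Since Thompson sampling is listed as an approximation, here is a minimal sketch for Bernoulli arms with Beta priors; the Bernoulli/Beta conjugate choice and all names are assumptions of the example, not something the slides specify.

```python
import numpy as np

def thompson_step(successes, failures, rng=np.random.default_rng()):
    """One Thompson-sampling decision for Bernoulli arms with Beta(1, 1) priors.

    successes[a], failures[a] are the observed counts for arm a; sample a
    plausible mean for each arm from its posterior and pull the best sample.
    """
    s = np.asarray(successes, dtype=float)
    f = np.asarray(failures, dtype=float)
    samples = rng.beta(s + 1.0, f + 1.0)
    return int(np.argmax(samples))

# After pulling arm a and observing reward r in {0, 1}:
#   successes[a] += r;  failures[a] += 1 - r
```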


Optimism in Face of Uncertainty

The more uncertain we are about an action–value
The more important it is to explore that action
It could turn out to be the best action


Lower Bound

The performance of any algorithm is determined by the similarity between the optimal arm and the other arms
Hard problems have similar–looking arms with different means
This is formally described by the gap ∆_a and the similarity in distributions KL(R(·|a) ‖ R(·|a*))

Theorem (Lai and Robbins). The asymptotic total regret is at least logarithmic in the number of steps:

$$\lim_{T \to \infty} L_T \ge \log T \sum_{a \,:\, \Delta_a > 0} \frac{\Delta_a}{KL\left(R(\cdot \mid a) \,\|\, R(\cdot \mid a^*)\right)}$$


Upper Confidence Bounds

Estimate an upper confidence U_t(a) for each action value
Such that Q(a) ≤ Q_t(a) + U_t(a) with high probability
This depends on the number of times N_t(a) that action a has been selected
  Small N_t(a) ⇒ large U_t(a) (the estimated value is uncertain)
  Large N_t(a) ⇒ small U_t(a) (the estimated value is accurate)

Select the action maximizing the Upper Confidence Bound (UCB):

$$a_t = \arg\max_{a \in A} \left( Q_t(a) + U_t(a) \right)$$


Hoeffding’s Inequality

Theorem (Hoeffding's Inequality). Let X_1, ..., X_t be i.i.d. random variables in [0, 1], and let $\bar{X}_t = \frac{1}{t}\sum_{\tau=1}^{t} X_\tau$ be the sample mean. Then

$$\mathbb{P}\left[\mathbb{E}[X] > \bar{X}_t + u\right] \le e^{-2tu^2}$$

We will apply Hoeffding's Inequality to the rewards of the bandit, conditioned on selecting action a:

$$\mathbb{P}\left[Q(a) > Q_t(a) + U_t(a)\right] \le e^{-2 N_t(a) U_t(a)^2}$$


Calculating Upper Confidence Bounds

Pick a probability p that the true value exceeds the UCB
Now solve for U_t(a):

$$e^{-2 N_t(a) U_t(a)^2} = p \quad\Rightarrow\quad U_t(a) = \sqrt{\frac{-\log p}{2 N_t(a)}}$$

Reduce p as we observe more rewards, e.g., p = t^{-4}
This ensures we select the optimal action as t → ∞:

$$U_t(a) = \sqrt{\frac{2 \log t}{N_t(a)}}$$


UCB1

This leads to the UCB1 algorithm:

$$a_t = \arg\max_{a \in A} \left( Q_t(a) + \sqrt{\frac{2 \log t}{N_t(a)}} \right)$$

Theorem. At time T, the expected total regret of the UCB policy is at most

$$\mathbb{E}[L_T] \le 8 \log T \sum_{a \,:\, \Delta_a > 0} \frac{1}{\Delta_a} + \left(1 + \frac{\pi^2}{3}\right) \sum_{a \in A} \Delta_a$$
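A minimal UCB1 sketch, following the usual convention of pulling each arm once before applying the index; the `pull(a)` callback is a stand-in for whatever bandit environment is being used.

```python
import math
import random

def ucb1(pull, n_arms, horizon):
    """Run UCB1 for `horizon` steps; `pull(a)` returns a reward in [0, 1].

    Returns the empirical means and pull counts of each arm.
    """
    q = [0.0] * n_arms          # empirical means Q_t(a)
    n = [0] * n_arms            # counts N_t(a)
    for t in range(1, horizon + 1):
        if t <= n_arms:
            a = t - 1           # initialization: pull each arm once
        else:
            a = max(range(n_arms),
                    key=lambda i: q[i] + math.sqrt(2 * math.log(t) / n[i]))
        r = pull(a)
        n[a] += 1
        q[a] += (r - q[a]) / n[a]   # incremental mean update
    return q, n

# e.g. ucb1(lambda a: float(random.random() < [0.2, 0.5, 0.7][a]), n_arms=3, horizon=1000)
```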


Example: UCB vs ε–Greedy on the 10–armed Bandit


Other Upper Confidence Bounds

Upper confidence bounds can be applied to other inequalities:
  Bernstein's inequality
  Empirical Bernstein's inequality
  Chernoff's inequality
  Azuma's inequality
  ...


The Adversarial Bandit Setting

N arms
At each round t = 1, ..., T:
  The learner chooses I_t ∈ {1, ..., N}
  At the same time, the adversary selects a reward vector r_t = (r_{1,t}, ..., r_{N,t}) ∈ [0, 1]^N
  The learner receives the reward r_{I_t,t}, while the rewards of the other arms are not received


Variation on Softmax: EXP3.P

It is possible to drive regret down by annealing τ

EXP3.P: Exponential-weight algorithm for exploration and exploitation
The probability of choosing arm a at time t is

$$\pi_t(a) = (1-\beta)\,\frac{w_t(a)}{\sum_{a' \in A} w_t(a')} + \frac{\beta}{|A|}$$

$$w_{t+1}(a) = \begin{cases} w_t(a)\, e^{\eta\, r_t(a)/\pi_t(a)} & \text{if arm } a \text{ is pulled at } t \\ w_t(a) & \text{otherwise} \end{cases}$$

η > 0 and β > 0 are the parameters of the algorithm
Regret: $\mathbb{E}[L_T] \le O\left(\sqrt{T |A| \log |A|}\right)$
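A compact sketch in the spirit of EXP3 (without the extra confidence term that distinguishes EXP3.P); the parameter values and class interface are illustrative only.

```python
import math
import random

class Exp3:
    """Exponential-weight bandit for the adversarial setting (EXP3-style sketch)."""
    def __init__(self, n_arms, eta=0.1, beta=0.1):
        self.n, self.eta, self.beta = n_arms, eta, beta
        self.w = [1.0] * n_arms

    def probs(self):
        total = sum(self.w)
        return [(1 - self.beta) * wi / total + self.beta / self.n for wi in self.w]

    def select(self):
        return random.choices(range(self.n), weights=self.probs())[0]

    def update(self, arm, reward):
        """Importance-weighted update for the pulled arm only (reward in [0, 1])."""
        p = self.probs()[arm]
        self.w[arm] *= math.exp(self.eta * reward / p)
```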


MAB with Infinitely Many Arms

Unstructured set of actions
  UCB Arm-Increasing Rule

Structured set of actions
  Linear bandits
  Lipschitz bandits
  Unimodal bandits
  Bandits in trees


The Contextual Bandit Setting

At each round t = 1, ..., T:
  The world produces some context s ∈ S
  The learner chooses I_t ∈ {1, ..., N}
  The world reacts with reward r_{I_t,t}
  There are no dynamics

Learn a good policy (low regret) for choosing actions given the context
Ideas:
  Run a different MAB for each distinct context (see the sketch below)
  Perform generalization over contexts
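A minimal sketch of the first idea: keep one independent UCB1 learner per distinct context in a dictionary; the class and method names are illustrative.

```python
import math
from collections import defaultdict

class PerContextUCB1:
    """One independent UCB1 bandit per distinct context (no generalization)."""
    def __init__(self, n_arms):
        self.n_arms = n_arms
        self.counts = defaultdict(lambda: [0] * n_arms)    # N_t(a) per context
        self.values = defaultdict(lambda: [0.0] * n_arms)  # Q_t(a) per context
        self.rounds = defaultdict(int)                     # rounds seen per context

    def select(self, context):
        self.rounds[context] += 1
        n, q, t = self.counts[context], self.values[context], self.rounds[context]
        for a in range(self.n_arms):        # pull each unseen arm once first
            if n[a] == 0:
                return a
        return max(range(self.n_arms),
                   key=lambda a: q[a] + math.sqrt(2 * math.log(t) / n[a]))

    def update(self, context, arm, reward):
        n, q = self.counts[context], self.values[context]
        n[arm] += 1
        q[arm] += (reward - q[arm]) / n[arm]
```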


Exploration–Exploitation in MDPs

Multi–armed bandits are one–step decision–making problems
MDPs represent sequential decision–making problems
What exploration–exploitation approaches can be used in MDPs?
How can we measure the efficiency of an RL algorithm in formal terms?


Sample Complexity

Let M be an MDP with N states, K actions, discount factor γ ∈ [0, 1) and maximal reward R_max > 0
Let A be an algorithm that acts in the environment, producing the experience s_0, a_0, r_1, s_1, a_1, r_2, ...

Let $V^A_t = \mathbb{E}\left[\sum_{\tau=0}^{\infty} \gamma^\tau r_{t+\tau} \,\Big|\, s_0, a_0, r_1, \ldots, s_{t-1}, a_{t-1}, r_t, s_t\right]$

Let V* be the value function of the optimal policy

Definition [Kakade, 2003]. Let ε > 0 be a prescribed accuracy and δ > 0 an allowed probability of failure. The expression η(ε, δ, N, K, γ, R_max) is a sample complexity bound for algorithm A if, independently of the choice of s_0, with probability at least 1 − δ, the number of timesteps for which V^A_t < V*(s_t) − ε is at most η(ε, δ, N, K, γ, R_max).


Efficient Exploration

Definition. An algorithm whose sample complexity is polynomial in $\frac{1}{\varepsilon}$, $\log\frac{1}{\delta}$, N, K, $\frac{1}{1-\gamma}$, and $R_{max}$ is called PAC-MDP (probably approximately correct in MDPs).


Exploration Strategies: PAC–MDP Efficiency

ε–greedy and Boltzmann exploration are not PAC-MDP efficient: their sample complexity is exponential in the number of states

Other approaches with exponential complexity:
  variance minimization of action–values
  optimistic value initialization
  state bonuses: frequency, recency, error, ...

Examples of PAC–MDP efficient approaches:
  model–based: E3, R–MAX, MBIE
  model–free: Delayed Q–learning

Bayesian RL: the optimal exploration strategy
  only tractable for very simple problems


Explicit Explore or Exploit (E3) Algorithm
Kearns and Singh, 2002

Model–based approach with polynomial sample complexity (PAC–MDP)
  uses optimism in the face of uncertainty
  assumes knowledge of the maximum reward

Maintains counts of states and actions to quantify confidence in the model estimates

A state s is known if all actions in s have been executed sufficiently often

From the observed data, E3 constructs two MDPs:
  MDP_known: includes the known states with (approximately exact) estimates of P and R (drives exploitation)
  MDP_unknown: MDP_known without rewards, plus a special state s' where the agent receives the maximum reward (drives exploration)


E3 Sketch

Input: state s
Output: action a

if s is known then
    Plan in MDP_known
    if the resulting plan has value above some threshold then
        return the first action of the plan
    else
        Plan in MDP_unknown
        return the first action of the plan
    end if
else
    return the action with the fewest observations in s
end if
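A schematic Python rendering of the decision rule above; `is_known`, `plan_known`, `plan_unknown`, `value_of`, and the other arguments are placeholders for components the slides do not spell out.

```python
def e3_action(s, actions, is_known, plan_known, plan_unknown, value_of, counts, threshold):
    """Decision rule of E3 (schematic sketch).

    actions          -- list of actions available in state s
    is_known(s)      -- True if every action in s has been tried often enough
    plan_known(s)    -- plan (sequence of actions) computed in MDP_known
    plan_unknown(s)  -- plan computed in MDP_unknown (exploration MDP)
    value_of(plan)   -- estimated value of a plan in MDP_known
    counts[(s, a)]   -- number of times action a was executed in state s
    """
    if is_known(s):
        plan = plan_known(s)
        if value_of(plan) > threshold:
            return plan[0]              # exploit: the plan is already good enough
        return plan_unknown(s)[0]       # explore: head towards unknown states
    # Unknown state: balanced wandering, pick the least-tried action.
    return min(actions, key=lambda a: counts.get((s, a), 0))
```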


E3 Example


R–MAX
Brafman and Tennenholtz, 2002

Similar to E3: implicit instead of explicit exploration
Based on the reward function

$$R_{R\text{-}MAX}(s,a) = \begin{cases} R(s,a) & \text{if } c(s,a) \ge m \quad (s,a \text{ known}) \\ R_{max} & \text{if } c(s,a) < m \quad (s,a \text{ unknown}) \end{cases}$$

Sample complexity: $O\left(\frac{|S|^2 |A|}{\varepsilon^3 (1-\gamma)^6}\right)$
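A small sketch of this reward substitution; the count table, the empirical reward estimates, and the threshold `m` are the quantities an R-MAX implementation would maintain, with names chosen here for illustration.

```python
def rmax_reward(s, a, counts, r_empirical, m, r_max):
    """Optimistic reward used by R-MAX when planning.

    counts[(s, a)]      -- number of times (s, a) has been experienced
    r_empirical[(s, a)] -- empirical mean reward for known pairs
    """
    if counts.get((s, a), 0) >= m:          # (s, a) is known
        return r_empirical[(s, a)]
    return r_max                            # unknown: assume the best possible reward
```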


Limitations of E3 and R–MAX

E3 and R–MAX are called “efficient” because their sample complexity scales only polynomially in the number of states
In natural environments, however, the number of states is enormous: it is exponential in the number of state variables
Hence E3 and R–MAX scale exponentially in the number of state variables
Generalization over states and actions is crucial for exploration

Relational Reinforcement Learning: tries to model the structure of the environment
KWIK-R–MAX: an extension of R–MAX that is polynomial in the number of parameters used to approximate the state-transition model


Delayed Q–learning

Optimistic initialization
Greedy action selection
Q–learning updates in batches (no learning rate)
Updates only if values significantly decrease

Sample complexity: $O\left(\frac{|S| |A|}{\varepsilon^4 (1-\gamma)^8}\right)$


Bayesian RL

The agent maintains a distribution (belief) b(m) over MDP models m
  typically, the MDP structure is fixed and the belief is over its parameters
  the belief is updated after each observation (s, a, r, s'): b → b' (a minimal sketch follows below)
  only tractable for very simple problems

The Bayes–optimal policy is π* = arg max_π V_π(b, s)
  no other policy leads to more reward in expectation w.r.t. the prior distribution over MDPs
  it solves the exploration–exploitation tradeoff implicitly: it minimizes uncertainty about the parameters while exploiting where it is certain
  it is not PAC-MDP efficient!
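As a minimal illustration of the belief update b → b' for a discrete MDP with unknown transition probabilities, here is a Dirichlet-count sketch; the conjugate Dirichlet choice and all names are assumptions of the example, not part of the slides.

```python
from collections import defaultdict

class DirichletTransitionBelief:
    """Belief over the transition model of a discrete MDP: one Dirichlet per (s, a).

    The posterior is summarized by pseudo-counts; observing (s, a, s') simply
    increments the corresponding count (the update b -> b').
    """
    def __init__(self, prior=1.0):
        self.prior = prior
        self.counts = defaultdict(lambda: defaultdict(float))  # counts[(s, a)][s']

    def update(self, s, a, s_next):
        self.counts[(s, a)][s_next] += 1.0

    def posterior_mean(self, s, a, states):
        """Expected transition probabilities P(s' | s, a) under the current belief."""
        c = self.counts[(s, a)]
        total = sum(c.values()) + self.prior * len(states)
        return {s2: (c[s2] + self.prior) / total for s2 in states}
```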