COSC 878 Seminar on Large Scale Statistical Machine Learning

Page 1: COSC 878 Seminar on Large Scale Statistical Machine Learning 1


COSC 878

Seminar on Large Scale Statistical Machine Learning

Page 2: COSC 878 Seminar on Large Scale Statistical Machine Learning 1


Today’s Plan

• Course Website: http://people.cs.georgetown.edu/~huiyang/cosc-878/
• Join the Google group: https://groups.google.com/forum/#!forum/cosc878
• Student Introductions
• Team-up and Presentation Scheduling
• First Talk

Page 3: COSC 878 Seminar on Large Scale Statistical Machine Learning 1

Reinforcement Learning: A Survey

Grace, 1/13/15

Page 4: COSC 878 Seminar on Large Scale Statistical Machine Learning 1

What is Reinforcement Learning

• The problem faced by an agent that must learn behavior through trial-and-error interactions with a dynamic environment

Page 5: COSC 878 Seminar on Large Scale Statistical Machine Learning 1

Solving RL Problems – Two Strategies

• Search in the behavior space
  – Find one behavior that performs well in the environment
  – Genetic algorithms, genetic programming
• Statistical methods and dynamic programming
  – Estimate the utility of taking actions in states of the world
  – We focus on this strategy

Page 6: COSC 878 Seminar on Large Scale Statistical Machine Learning 1

Standard RL model

Page 7: COSC 878 Seminar on Large Scale Statistical Machine Learning 1

What we learn in RL

• The agent’s job is to find a policy π that maximizes some long-run measure of reinforcement
  – A policy π maps states to actions
  – Reinforcement = reward

Page 8: COSC 878 Seminar on Large Scale Statistical Machine Learning 1

Difference between RL and Supervised Learning

• In RL, there is no presentation of input/output pairs
  – No training data
  – Only the immediate reward is known
  – The best actions in the long run are not known
• In RL, the system needs to be evaluated online while learning
  – Online evaluation (knowing the online performance) is important

Page 9: COSC 878 Seminar on Large Scale Statistical Machine Learning 1

Difference between RL and AI/Planning

• AI/planning algorithms are less general
  – They require a predefined model of state transitions
  – And they assume determinism
• RL, in contrast, assumes that the state space can be enumerated and stored in memory

Page 10: COSC 878 Seminar on Large Scale Statistical Machine Learning 1

Models

• The difficult part:
  – How to incorporate the future into the model
• Three models:
  – Finite horizon
  – Infinite horizon
  – Average-reward

Page 11: COSC 878 Seminar on Large Scale Statistical Machine Learning 1

Finite Horizon

• At a given moment in time, the agent optimizes its expected reward for the next h steps

• Ignore what will happen after h steps
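Written out, the finite-horizon objective is the expected sum of the next h rewards (a standard formulation, not shown on the slide; r_t denotes the reward received at step t):

E\left[\sum_{t=0}^{h} r_t\right]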

Page 12: COSC 878 Seminar on Large Scale Statistical Machine Learning 1

Infinite Horizon

• Maximize the long-run reward
• Does not put a limit on the number of future steps
• Future rewards are discounted geometrically
• Mathematically more tractable than the finite-horizon model

The discount factor γ is between 0 and 1.
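Written out, the discounted infinite-horizon objective is (a standard formulation; γ is the discount factor above):

E\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]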

Page 13: COSC 878 Seminar on Large Scale Statistical Machine Learning 1

Average-reward
• Maximize the long-run average reward
• It is the limiting case of the infinite-horizon model as γ approaches 1
• Weakness:
  – Cannot tell when the large rewards are received
  – When we prefer a large initial reward, this model has no way to express it
• Cures:
  – Maximize both the long-run average and the initial rewards
  – The bias-optimal model
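Written out, the average-reward objective is (a standard formulation):

\lim_{h \to \infty} E\left[\frac{1}{h}\sum_{t=0}^{h-1} r_t\right]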

Page 14: COSC 878 Seminar on Large Scale Statistical Machine Learning 1

Compare model optimality

• All unlabeled arrows produce a reward of 0

• A single action

Page 15: COSC 878 Seminar on Large Scale Statistical Machine Learning 1

Compare model optimality

Finite horizon, h = 4
• Upper line: 0 + 0 + 2 + 2 + 2 = 6
• Middle: 0 + 0 + 0 + 0 + 0 = 0
• Lower: 0 + 0 + 0 + 0 + 0 = 0

Page 16: COSC 878 Seminar on Large Scale Statistical Machine Learning 1

Compare model optimality

Infinite horizon, γ = 0.9
• Upper line: 0·0.9^0 + 0·0.9^1 + 2·0.9^2 + 2·0.9^3 + 2·0.9^4 + … = 2·0.9^2·(1 + 0.9 + 0.9^2 + …) = 1.62·1/(1 − 0.9) = 16.2
• Middle: … + 10·0.9^5 + … ≈ 59
• Lower: … + 11·0.9^6 + … ≈ 58.5
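A quick numerical check of these discounted sums (a small illustrative script; the three reward streams are assumed from the example above, truncated after a few hundred steps):

def discounted_return(rewards, gamma=0.9):
    # Sum gamma^t * r_t over the given (truncated) reward stream.
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Assumed reward streams: +2 per step from step 2, +10 per step from step 5,
# +11 per step from step 6.
upper  = [0, 0] + [2] * 300
middle = [0] * 5 + [10] * 300
lower  = [0] * 6 + [11] * 300

print(round(discounted_return(upper), 1))   # 16.2
print(round(discounted_return(middle), 1))  # ~59.0
print(round(discounted_return(lower), 1))   # ~58.5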

Page 17: COSC 878 Seminar on Large Scale Statistical Machine Learning 1

Compare model optimality

Average reward
• Upper line: ≈ 2
• Middle: ≈ 10
• Lower: ≈ 11

Page 18: COSC 878 Seminar on Large Scale Statistical Machine Learning 1

Parameters

• The finite-horizon and infinite-horizon models both have parameters
  – h (the horizon length)
  – γ (the discount factor)
• These parameters affect the choice of optimality model
  – Choose them carefully in your application
• The average-reward model’s advantage: it is not influenced by these parameters

Page 19: COSC 878 Seminar on Large Scale Statistical Machine Learning 1


MARKOV MODELS

Page 20: COSC 878 Seminar on Large Scale Statistical Machine Learning 1

Markov Process
• Markov Property¹ (the “memoryless” property): a system’s next state depends only on its current state

  Pr(S_{i+1} | S_i, …, S_0) = Pr(S_{i+1} | S_i)

• Markov Process: a stochastic process with the Markov property

  e.g. a chain of states s_0 → s_1 → … → s_i → s_{i+1} → …

¹ A. A. Markov, ’06

Page 21: COSC 878 Seminar on Large Scale Statistical Machine Learning 1


Family of Markov Models
• Markov Chain
• Hidden Markov Model
• Markov Decision Process
• Partially Observable Markov Decision Process
• Multi-armed Bandit

Page 22: COSC 878 Seminar on Large Scale Statistical Machine Learning 1

Markov Chain
• Discrete-time Markov process
• Example: Google PageRank¹

Pagerank(S) = \frac{1-\alpha}{N} + \alpha \sum_{Y \in \Pi} \frac{Pagerank(Y)}{L(Y)}

where N is the number of pages, Π is the set of pages linked to S, L(Y) is the number of outlinks of Y, and the (1 − α)/N term is the random-jump contribution.

[Diagram: a small web graph of pages A, B, C, D, E, each annotated with its Pagerank value]

• State S – a web page
• Transition probability M
• PageRank: how likely a random web surfer will land on a page
• The stable-state distribution of such a Markov chain is PageRank
• (S, M)

¹ L. Page et al., ’99
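A minimal power-iteration sketch of this computation (illustrative only: the link structure of the five-page graph and the value α = 0.85 are assumed, not given on the slide):

import numpy as np

# Hypothetical link graph: page -> pages it links to (assumed for illustration).
links = {'A': ['B', 'C'], 'B': ['C'], 'C': ['A'], 'D': ['C'], 'E': ['A', 'D']}
pages = sorted(links)
N = len(pages)
alpha = 0.85  # assumed value, not from the slide

# Transition matrix M: M[i, j] = probability of moving from page j to page i.
M = np.zeros((N, N))
for j, src in enumerate(pages):
    for dst in links[src]:
        M[pages.index(dst), j] = 1.0 / len(links[src])

# Power iteration: repeatedly apply the PageRank update until the distribution stabilizes.
pr = np.full(N, 1.0 / N)
for _ in range(100):
    pr = (1 - alpha) / N + alpha * M @ pr

print(dict(zip(pages, pr.round(3))))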

Page 23: COSC 878 Seminar on Large Scale Statistical Machine Learning 1

Hidden Markov Model
• A Markov chain whose states are hidden; observable symbols are emitted with some probability according to the states¹

[Diagram: hidden states s_0 → s_1 → s_2 → … with transition probabilities p_i; each state s_i emits an observation o_i with emission probability e_i]

• s_i – hidden state
• p_i – transition probability
• o_i – observation
• e_i – observation probability (emission probability)

¹ Leonard E. Baum et al., ’66

(S, M, O, e)
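To make the (S, M, O, e) tuple concrete, here is a minimal forward-algorithm sketch that computes the probability of an observation sequence under an HMM (a toy two-state, two-symbol model; every number is assumed for illustration):

import numpy as np

# Hypothetical 2-state, 2-symbol HMM (all values assumed for illustration).
start = np.array([0.6, 0.4])          # initial state distribution
M = np.array([[0.7, 0.3],             # M[i, j] = P(next state j | current state i)
              [0.4, 0.6]])
e = np.array([[0.9, 0.1],             # e[i, k] = P(observe symbol k | state i)
              [0.2, 0.8]])

def forward(observations):
    # alpha[i] = P(observations so far, current hidden state = i)
    alpha = start * e[:, observations[0]]
    for o in observations[1:]:
        alpha = (alpha @ M) * e[:, o]
    return alpha.sum()   # total probability of the observation sequence

print(forward([0, 1, 0]))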

Page 24: COSC 878 Seminar on Large Scale Statistical Machine Learning 1

Markov Decision Process
• An MDP extends a Markov chain with actions and rewards¹

[Diagram: a chain of states s_0 → s_1 → s_2 → s_3 → …; at each state s_i the agent takes action a_i, receives reward r_i, and transitions with probability p_i]

• s_i – state
• a_i – action
• r_i – reward
• p_i – transition probability

¹ R. Bellman, ’57

(S, M, A, R, γ)

Page 25: COSC 878 Seminar on Large Scale Statistical Machine Learning 1

Definition of MDP
• A tuple (S, M, A, R, γ)
  – S: state space
  – M: transition matrix, M_a(s, s') = P(s' | s, a)
  – A: action space
  – R: reward function, R(s, a) = immediate reward for taking action a at state s
  – γ: discount factor, 0 < γ ≤ 1
• Policy π
  – π(s) = the action taken at state s
• Goal: find an optimal policy π* maximizing the expected total rewards
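A minimal encoding of this tuple in Python (a toy two-state, two-action MDP; all numbers are assumed for illustration, not taken from the slides):

# Toy MDP with states {0, 1} and actions {0, 1}; all values are illustrative.
S = [0, 1]                      # state space
A = [0, 1]                      # action space
gamma = 0.9                     # discount factor

# Transition matrix: M[a][s][s2] = P(s2 | s, a)
M = {
    0: {0: {0: 0.8, 1: 0.2}, 1: {0: 0.1, 1: 0.9}},
    1: {0: {0: 0.5, 1: 0.5}, 1: {0: 0.3, 1: 0.7}},
}

# Reward function: R[(s, a)] = immediate reward for taking action a in state s
R = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 2.0, (1, 1): 0.5}

# A (deterministic) policy maps each state to an action.
pi = {0: 1, 1: 0}

# Example lookup: probability of landing in state 1 after taking action 1 in state 0.
print(M[1][0][1])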


Page 26: COSC 878 Seminar on Large Scale Statistical Machine Learning 1

Policy

Policy: π(s) = a
According to the policy, select action a at state s.

• π(s0) = move right and up
• π(s1) = move right and up
• π(s2) = move right

[Slide altered from Carlos Guestrin’s ML lecture]

Page 27: COSC 878 Seminar on Large Scale Statistical Machine Learning 1

Value of Policy

Value: Vπ(s) — expected long-term reward starting from s

Start from s_0:

V^\pi(s_0) = E\left[R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \gamma^3 R(s_3) + \gamma^4 R(s_4) + \cdots\right]

Future rewards are discounted by γ ∈ [0, 1)

[Figure: starting at s_0, the policy selects action π(s_0) and the agent receives reward R(s_0)]

[Slide altered from Carlos Guestrin’s ML lecture]

Page 28: COSC 878 Seminar on Large Scale Statistical Machine Learning 1

Value of Policy

Value: Vπ(s) — expected long-term reward starting from s

Start from s_0:

V^\pi(s_0) = E\left[R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \gamma^3 R(s_3) + \gamma^4 R(s_4) + \cdots\right]

Future rewards are discounted by γ ∈ [0, 1)

[Figure: same as the previous slide, expanded one step: from s_0, action π(s_0) may lead to one of several next states s_1, s_1', s_1'', with rewards R(s_1), R(s_1'), R(s_1'')]

[Slide altered from Carlos Guestrin’s ML lecture]

Page 29: COSC 878 Seminar on Large Scale Statistical Machine Learning 1

Value of Policy

Value: Vπ(s) — expected long-term reward starting from s

Start from s_0:

V^\pi(s_0) = E\left[R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \gamma^3 R(s_3) + \gamma^4 R(s_4) + \cdots\right]

Future rewards are discounted by γ ∈ [0, 1)

[Figure: the tree expands a further step: from each possible next state s_1, s_1', s_1'', the policy actions π(s_1), π(s_1'), π(s_1'') lead to states s_2, s_2', s_2'' with rewards R(s_2), R(s_2'), R(s_2'')]

[Slide altered from Carlos Guestrin’s ML lecture]

Page 30: COSC 878 Seminar on Large Scale Statistical Machine Learning 1


Computing the value of a policy

V^\pi(s_0) = E\left[R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \gamma^3 R(s_3) + \cdots\right]
           = R(s_0) + \gamma\, E\left[R(s_1) + \gamma R(s_2) + \gamma^2 R(s_3) + \cdots\right]
           = R(s_0) + \gamma\, E\left[V^\pi(s_1)\right]
           = R(s_0) + \gamma \sum_{s_1} M_{\pi(s_0)}(s_0, s_1)\, V^\pi(s_1)

Here Vπ is the value function, s_0 is the current state, and s_1 is a possible next state.
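A minimal iterative policy-evaluation sketch based on this recursion (a toy two-state MDP; the rewards, transition probabilities, and policy are all assumed for illustration):

# Toy two-state MDP (all values assumed for illustration).
states = [0, 1]
gamma = 0.9
R = {0: 0.0, 1: 1.0}                      # state reward R(s)
pi = {0: 'a', 1: 'b'}                     # fixed policy: action taken in each state
# M[(s, a)][s2] = P(s2 | s, a)
M = {(0, 'a'): {0: 0.5, 1: 0.5},
     (1, 'b'): {0: 0.2, 1: 0.8}}

# Repeatedly apply V(s) <- R(s) + gamma * sum_s2 M_pi(s)(s, s2) V(s2) until convergence.
V = {s: 0.0 for s in states}
for _ in range(1000):
    V = {s: R[s] + gamma * sum(p * V[s2] for s2, p in M[(s, pi[s])].items())
         for s in states}

print(V)   # approximate value of the policy at each state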

Page 31: COSC 878 Seminar on Large Scale Statistical Machine Learning 1

Optimality — Bellman Equation

The Bellman equation¹ for an MDP is a recursive definition of the optimal value function V*(·), the state-value function:

V^*(s) = \max_a \left[ R(s, a) + \gamma \sum_{s'} M_a(s, s')\, V^*(s') \right]

Optimal Policy:

\pi^*(s) = \arg\max_a \left[ R(s, a) + \gamma \sum_{s'} M_a(s, s')\, V^*(s') \right]

¹ R. Bellman, ’57

Page 32: COSC 878 Seminar on Large Scale Statistical Machine Learning 1

Optimality — Bellman Equation

The Bellman equation can be rewritten in terms of the action-value function Q(s, a):

Q(s, a) = R(s, a) + \gamma \sum_{s'} M_a(s, s')\, V^*(s')

Relationship between V and Q:

V^*(s) = \max_a Q(s, a)

Optimal Policy:

\pi^*(s) = \arg\max_a Q(s, a)

Page 33: COSC 878 Seminar on Large Scale Statistical Machine Learning 1


MDP algorithms

• Value Iteration
• Policy Iteration
• Modified Policy Iteration
• Prioritized Sweeping
• Temporal Difference (TD) Learning
• Q-Learning

The first four are model-based approaches; TD Learning and Q-Learning are model-free approaches.

Solve the Bellman equation → optimal value V*(s) → optimal policy π*(s)

[Bellman ’57; Howard ’60; Puterman and Shin ’78; Singh & Sutton ’96; Sutton & Barto ’98; Sutton ’88; Watkins ’92]

[Slide altered from Carlos Guestrin’s ML lecture]
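For concreteness, a minimal value-iteration sketch following the model-based route above (a toy MDP; all states, actions, transition probabilities, and rewards are assumed for illustration):

# Toy MDP (all values assumed for illustration).
states = [0, 1]
actions = ['a', 'b']
gamma = 0.9
# R[(s, a)] = immediate reward; M[(s, a)][s2] = P(s2 | s, a)
R = {(0, 'a'): 0.0, (0, 'b'): 1.0, (1, 'a'): 2.0, (1, 'b'): 0.5}
M = {(0, 'a'): {0: 0.8, 1: 0.2}, (0, 'b'): {0: 0.1, 1: 0.9},
     (1, 'a'): {0: 0.5, 1: 0.5}, (1, 'b'): {0: 0.3, 1: 0.7}}

def q(s, a, V):
    # One-step lookahead: R(s, a) + gamma * sum_s2 M_a(s, s2) V(s2)
    return R[(s, a)] + gamma * sum(p * V[s2] for s2, p in M[(s, a)].items())

# Value iteration: repeatedly apply the Bellman optimality update.
V = {s: 0.0 for s in states}
for _ in range(1000):
    V = {s: max(q(s, a, V) for a in actions) for s in states}

# Extract the optimal (greedy) policy from V*.
pi_star = {s: max(actions, key=lambda a: q(s, a, V)) for s in states}
print(V, pi_star)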