[Backup diagrams: 1-step TD (= TD(0)), 2-step TD, ..., n-step TD, ..., Monte Carlo.]

Targets:
TD(0):       G_{t:t+1} = R_t + γ V(S_{t+1})
n-step TD:   G_{t:t+n} = R_t + γ R_{t+1} + ... + γ^{n-1} R_{t+n-1} + γ^n V(S_{t+n})
Monte Carlo: G_t = R_t + γ R_{t+1} + ... + γ^{T-t-1} R_{T-1}
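The three targets can be sketched in a few lines of Python (the helper names, value table, and reward sequence below are illustrative, not from the slides; rewards follow the notes' convention that R_t is received on leaving S_t):

```python
# Sketch of the three targets above (illustrative helper, assumed values).
# rewards[k] is R_{t+k}; V maps a state to its current value estimate.

def n_step_target(rewards, bootstrap_state, V, gamma, n):
    """G_{t:t+n} = R_t + g*R_{t+1} + ... + g^(n-1)*R_{t+n-1} + g^n * V(S_{t+n})."""
    G = sum(gamma**k * r for k, r in enumerate(rewards[:n]))
    return G + gamma**n * V[bootstrap_state]

def mc_target(rewards, gamma):
    """Monte Carlo return: discounted sum of all rewards to the episode end."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

V = {"s1": 0.4}                            # assumed value estimate
rewards = [1.0, 0.0, 0.0, 1.0]             # R_t, R_{t+1}, R_{t+2}, R_{t+3}
td0 = n_step_target(rewards, "s1", V, 0.5, 1)   # 1.0 + 0.5*0.4  = 1.2
td2 = n_step_target(rewards, "s1", V, 0.5, 2)   # 1.0 + 0.25*0.4 = 1.1
mc  = mc_target(rewards, 0.5)                   # 1.0 + 0.125    = 1.125
```

The only difference between the three is how many real rewards are used before bootstrapping on V.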
Chain Markov Reward Process (Length = 19): start in the middle; step left or right with p = 1/2; reward -1 at the left terminal, +1 at the right terminal.

Mini 5-chain (states A B C D E), α = 0.5. True V(s) is plotted on the slide on a -1 to +1 scale.

Experiences:
① B, C, D, E, +1
② C, B, A, -1
③ A, B, C, D, E, +1

After experience ①:
- 1-step TD: only V(E) moves (to 0.5).
- 3-step TD: V(C), V(D), V(E) move.
- n-step TD (n ≥ 5): every visited state moves.

Update rule: V_{t+n}(S_t) ← V_{t+n-1}(S_t) + α [G_{t:t+n} - V_{t+n-1}(S_t)]
[Repeated frames of the same slide: the value estimates after experiences ② (C, B, A, -1) and ③ (A, B, C, D, E, +1). Intermediate values such as ±0.25, 0.375, 0.5, and 0.75 appear across A-E as the terminal rewards propagate; the larger n is, the further each experience pushes the reward back along the chain.]
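The propagation pattern in these frames can be reproduced with a short sketch. Assumptions (labeled, since the slides are partly illegible): α = 0.5, no discounting, offline updates at the end of the episode, and the first experience B, C, D, E with a +1 reward on leaving E:

```python
# Mini 5-chain, one offline pass of n-step TD over a finished episode.
# Assumed: alpha = 0.5, gamma = 1; rewards[t] is received on leaving states[t].

GAMMA, ALPHA = 1.0, 0.5

def n_step_td_episode(V, states, rewards, n):
    """Apply the n-step TD update V(S_t) += alpha*(G_{t:t+n} - V(S_t)) for each t."""
    T = len(states)
    V = dict(V)
    for t in range(T):
        horizon = min(n, T - t)                    # truncate at episode end
        G = sum(GAMMA**k * rewards[t + k] for k in range(horizon))
        if t + n < T:                              # bootstrap only if truncated
            G += GAMMA**n * V[states[t + n]]
        V[states[t]] += ALPHA * (G - V[states[t]])
    return V

V0 = {s: 0.0 for s in "ABCDE"}
states, rewards = list("BCDE"), [0.0, 0.0, 0.0, 1.0]   # experience ①

for n in (1, 3, 5):
    V = n_step_td_episode(V0, states, rewards, n)
    print(n, [V[s] for s in "ABCDE"])
# n=1: only V(E) moves (to 0.5); n=3: C, D, E move; n=5: all visited states move.
```

This matches the qualitative picture above: after one episode, 1-step TD has touched only the state adjacent to the reward, while n-step TD with n ≥ episode length has updated every visited state.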
Chain Markov Reward Process (Length = 19); state "A" at distance = 10 from the +1 goal.

Influence state A's value via the +1 reward:
- MC: must get to the +1 goal at least once after visiting A.
- TD(0): must visit the +1 goal at least once, at any time, and then wait for the value to propagate back.
Statistical properties:
- MC: unbiased, high variance; fast propagation of reward (to all visited states).
- TD(0): biased updates, low variance; slow propagation of reward (only the most recent state).
- n-step TD: medium variance; medium propagation of reward.

The correct balance is problem specific!
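The bias/variance claim can be checked empirically on a small random walk. This is a sketch with assumed parameters (5 non-terminal states, terminal rewards -1/+1, α = 0.1, 100 episodes per run), not an experiment from the slides:

```python
import random

# Empirical check of "MC: high variance, TD(0): low variance" on a
# 5-state random walk (assumed toy parameters).

def episode(start=2, n_states=5):
    """Random walk; returns (visited states, terminal reward)."""
    s, visited = start, []
    while 0 <= s < n_states:
        visited.append(s)
        s += random.choice((-1, 1))
    return visited, (1.0 if s >= n_states else -1.0)

def estimate(method, episodes=100, alpha=0.1, n_states=5):
    V = [0.0] * n_states
    for _ in range(episodes):
        visited, r = episode()
        if method == "mc":            # every visited state gets the full return
            for s in visited:
                V[s] += alpha * (r - V[s])
        else:                         # td0: one-step bootstrapped target
            for t, s in enumerate(visited):
                target = V[visited[t + 1]] if t + 1 < len(visited) else r
                V[s] += alpha * (target - V[s])
    return V[2]                       # start-state estimate (true value is 0)

def variance(vals):
    m = sum(vals) / len(vals)
    return sum((v - m) ** 2 for v in vals) / len(vals)

random.seed(0)
mc_runs = [estimate("mc") for _ in range(200)]
td_runs = [estimate("td") for _ in range(200)]
# Across independent runs, the MC estimates typically spread more widely
# than the TD(0) estimates, matching "unbiased, high variance" above.
```

Shrinking α or adding episodes changes the numbers but not the qualitative gap.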
[Backup diagrams: n-step TD (on-policy prediction) and n-step SARSA (on-policy control).]
[Backup diagrams: n-step TD (on-policy prediction), n-step SARSA (on-policy control), n-step off-policy SARSA (off-policy control; requires importance sampling!), n-step Expected SARSA (off-policy control).]
Targets:
- n-step SARSA:            G_{t:t+n} = R_t + γ R_{t+1} + ... + γ^{n-1} R_{t+n-1} + γ^n Q(S_{t+n}, A_{t+n})
- n-step off-policy SARSA: G_{t:t+n} = ρ_{t+1:t+n} (R_t + γ R_{t+1} + ... + γ^{n-1} R_{t+n-1} + γ^n Q(S_{t+n}, A_{t+n}))
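The off-policy target can be sketched as follows. All names and numbers are illustrative; π and b are explicit (state, action) → probability tables, and the ratio runs over the actions after A_t (which the updated pair already conditions on), up to and including the bootstrap action:

```python
# Sketch of the off-policy n-step SARSA target above (assumed toy values).
# pi / b map (state, action) -> probability under the target / behavior policy.

def off_policy_n_step_sarsa_target(steps, bootstrap_sa, Q, pi, b, gamma):
    """rho * (R_t + gamma R_{t+1} + ... + gamma^n Q(S_{t+n}, A_{t+n})).

    steps = [(S_t, A_t, R_t), ..., (S_{t+n-1}, A_{t+n-1}, R_{t+n-1})];
    bootstrap_sa = (S_{t+n}, A_{t+n}).  A_t is given, so the importance
    ratio covers A_{t+1}, ..., A_{t+n}.
    """
    ratio_sas = [(s, a) for s, a, _ in steps[1:]] + [bootstrap_sa]
    rho = 1.0
    for sa in ratio_sas:
        rho *= pi[sa] / b[sa]
    n = len(steps)
    G = sum(gamma**k * r for k, (_, _, r) in enumerate(steps))
    return rho * (G + gamma**n * Q[bootstrap_sa])

# Toy 2-step example with made-up probabilities and Q-values:
pi = {("s1", "a"): 1.0, ("s2", "a"): 1.0}
b  = {("s1", "a"): 0.5, ("s2", "a"): 0.5}
Q  = {("s2", "a"): 1.0}
steps = [("s0", "a", 0.0), ("s1", "a", 1.0)]
G = off_policy_n_step_sarsa_target(steps, ("s2", "a"), Q, pi, b, 0.5)
# rho = (1/0.5)*(1/0.5) = 4;  G = 4 * (0 + 0.5*1 + 0.25*1) = 3.0
```

Note how each behavior-sampled action multiplies in another π/b ratio, which is exactly where the variance of this estimator comes from.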
1-step Expected SARSA is off-policy, but needs no importance sampling! No actions are considered other than:
1. the action prescribed by Q(s, a) (the one being updated),
2. actions taken in expectation w.r.t. π_target.

n-step Expected SARSA takes n - 1 actions from π_b that require reweighing.
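The "actions in expectation" point can be sketched directly: the last step of an Expected SARSA target averages over the target policy instead of using the sampled action, so that step needs no ratio. The policy and Q-values below are assumed toy numbers:

```python
# Sketch: the final step of an Expected SARSA target averages over the
# *target* policy, so the bootstrap action needs no importance sampling.

def expected_sarsa_target(reward, next_state, Q, pi, actions, gamma):
    """R_t + gamma * sum_a pi(a | S_{t+1}) * Q(S_{t+1}, a)."""
    exp_q = sum(pi[(next_state, a)] * Q[(next_state, a)] for a in actions)
    return reward + gamma * exp_q

pi = {("s1", "left"): 0.25, ("s1", "right"): 0.75}   # assumed target policy
Q  = {("s1", "left"): 0.0, ("s1", "right"): 1.0}
t = expected_sarsa_target(1.0, "s1", Q, pi, ["left", "right"], 0.5)
# 1.0 + 0.5 * (0.25*0 + 0.75*1) = 1.375
```

In the n-step version, only the n - 1 intermediate sampled actions still need π/b reweighting; this last expectation step does not.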
Control Variates

[Figure: densities p(x) and p(y) with means μ_x = E[X] (unknown) and μ_y = E[Y] (known).]

Assumption: X is positively correlated with Y, i.e. when X > E[X], it is likely that Y > E[Y]. Example: rain and traffic.
Sample x ~ p(x) and y ~ p(y), and consider two different estimators of E[X]:

x̂_1 = x
x̂_2 = x + [μ_y - y]    (the bracketed term has expectation zero!)
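A quick numerical check of the two estimators. The joint model for X and Y below is an assumption chosen only to make them positively correlated (in the spirit of the rain/traffic example):

```python
import random

# Variance check for the two estimators above: x_hat1 = x and
# x_hat2 = x + (mu_y - y), with X and Y positively correlated (toy model).

random.seed(0)
mu_y = 0.0
samples = []
for _ in range(10000):
    y = random.gauss(0.0, 1.0)            # "rain": known mean mu_y = 0
    x = 2.0 + y + random.gauss(0.0, 0.3)  # "traffic": correlated with y
    samples.append((x, x + (mu_y - y)))

def variance(vals):
    m = sum(vals) / len(vals)
    return sum((v - m) ** 2 for v in vals) / len(vals)

plain = variance([s[0] for s in samples])
cv    = variance([s[1] for s in samples])
# Both estimate E[X] = 2, but the control-variate version has far lower
# variance (roughly 0.09 vs 1.09 under this toy model).
```

Subtracting the zero-mean term [μ_y - y] cancels the shared fluctuation, which is exactly the mechanism the off-policy return will exploit next.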
n-step PDIS with CV (Control Variates):

G_{t:h} = ρ_t (R_t + γ G_{t+1:h}) + (1 - ρ_t) V_{h-1}(S_t)

The (1 - ρ_t) V_{h-1}(S_t) term is the control variate. Since E[ρ_t] = 1:

E[1 - ρ_t] = 0  →  E[(1 - ρ_t) V_{h-1}(S_t)] = 0
Compare to without CV:

G_{t:h} = ρ_t (R_t + γ G_{t+1:h})

If ρ_t = 0:
- with CV:    G_{t:h} = V_{h-1}(S_t)
- without CV: G_{t:h} = 0

The control variate asks: "how much better or worse than average is this sample?"
One-step off-policy: Q-learning and Expected SARSA.

[Backup diagrams: Q-learning, Expected SARSA.]

The two are the same when π_target is the greedy policy. Both are off-policy, with no importance sampling!

n-step off-policy:

[Backup diagrams: n-step Q-learning, n-step Expected SARSA, n-step Tree Backups.]

Again both off-policy, with no importance sampling!
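The "same when the target policy is greedy" claim can be checked directly (toy Q-values assumed):

```python
# With a greedy target policy, the Expected SARSA backup equals the
# Q-learning backup max_a Q(S', a). Toy Q-values below are assumed.

Q = {("s1", "left"): 0.2, ("s1", "right"): 0.9}
actions = ["left", "right"]

q_learning = max(Q[("s1", a)] for a in actions)

greedy = max(actions, key=lambda a: Q[("s1", a)])
pi = {a: (1.0 if a == greedy else 0.0) for a in actions}  # greedy target policy
expected_sarsa = sum(pi[a] * Q[("s1", a)] for a in actions)
# Both backups equal 0.9: identical updates, no importance sampling needed.
```

With a non-greedy (e.g. ε-greedy) target policy the two backups differ, and Expected SARSA generalizes Q-learning.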
Tree Backups Target :