[Backup diagrams: 1-step TD (= TD(0)), 2-step TD, ..., n-step TD, ..., Monte Carlo.]

Targets:
TD(0):       G_{t:t+1} = R_t + γ V(S_{t+1})
n-step TD:   G_{t:t+n} = R_t + γ R_{t+1} + ... + γ^{n-1} R_{t+n-1} + γ^n V(S_{t+n})
Monte Carlo: G_t = R_t + γ R_{t+1} + ... + γ^{T-t-1} R_{T-1}
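The three targets can be sketched in a few lines of Python (the helper names, value table, and reward sequence below are illustrative, not from the slides; rewards follow the notes' convention that R_t is received on leaving S_t):

```python
# Sketch of the three targets above (illustrative helper, assumed values).
# rewards[k] is R_{t+k}; V maps a state to its current value estimate.

def n_step_target(rewards, bootstrap_state, V, gamma, n):
    """G_{t:t+n} = R_t + g*R_{t+1} + ... + g^(n-1)*R_{t+n-1} + g^n * V(S_{t+n})."""
    G = sum(gamma**k * r for k, r in enumerate(rewards[:n]))
    return G + gamma**n * V[bootstrap_state]

def mc_target(rewards, gamma):
    """Monte Carlo return: discounted sum of all rewards to the episode end."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

V = {"s1": 0.4}                            # assumed value estimate
rewards = [1.0, 0.0, 0.0, 1.0]             # R_t, R_{t+1}, R_{t+2}, R_{t+3}
td0 = n_step_target(rewards, "s1", V, 0.5, 1)   # 1.0 + 0.5*0.4  = 1.2
td2 = n_step_target(rewards, "s1", V, 0.5, 2)   # 1.0 + 0.25*0.4 = 1.1
mc  = mc_target(rewards, 0.5)                   # 1.0 + 0.125    = 1.125
```

The only difference between the three is how many real rewards are used before bootstrapping on V.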
Chain Markov Reward Process (Length = 19): start in the middle; step left or right with p = 1/2; reward -1 at the left terminal, +1 at the right terminal.

Mini 5-chain (states A B C D E), α = 0.5. True V(s) is plotted on the slide on a -1 to +1 scale.

Experiences:
① B, C, D, E, +1
② C, B, A, -1
③ A, B, C, D, E, +1

After experience ①:
- 1-step TD: only V(E) moves (to 0.5).
- 3-step TD: V(C), V(D), V(E) move.
- n-step TD (n ≥ 5): every visited state moves.

Update rule: V_{t+n}(S_t) ← V_{t+n-1}(S_t) + α [G_{t:t+n} - V_{t+n-1}(S_t)]
[Repeated frames of the same slide: the value estimates after experiences ② (C, B, A, -1) and ③ (A, B, C, D, E, +1). Intermediate values such as ±0.25, 0.375, 0.5, and 0.75 appear across A-E as the terminal rewards propagate; the larger n is, the further each experience pushes the reward back along the chain.]
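The propagation pattern in these frames can be reproduced with a short sketch. Assumptions (labeled, since the slides are partly illegible): α = 0.5, no discounting, offline updates at the end of the episode, and the first experience B, C, D, E with a +1 reward on leaving E:

```python
# Mini 5-chain, one offline pass of n-step TD over a finished episode.
# Assumed: alpha = 0.5, gamma = 1; rewards[t] is received on leaving states[t].

GAMMA, ALPHA = 1.0, 0.5

def n_step_td_episode(V, states, rewards, n):
    """Apply the n-step TD update V(S_t) += alpha*(G_{t:t+n} - V(S_t)) for each t."""
    T = len(states)
    V = dict(V)
    for t in range(T):
        horizon = min(n, T - t)                    # truncate at episode end
        G = sum(GAMMA**k * rewards[t + k] for k in range(horizon))
        if t + n < T:                              # bootstrap only if truncated
            G += GAMMA**n * V[states[t + n]]
        V[states[t]] += ALPHA * (G - V[states[t]])
    return V

V0 = {s: 0.0 for s in "ABCDE"}
states, rewards = list("BCDE"), [0.0, 0.0, 0.0, 1.0]   # experience ①

for n in (1, 3, 5):
    V = n_step_td_episode(V0, states, rewards, n)
    print(n, [V[s] for s in "ABCDE"])
# n=1: only V(E) moves (to 0.5); n=3: C, D, E move; n=5: all visited states move.
```

This matches the qualitative picture above: after one episode, 1-step TD has touched only the state adjacent to the reward, while n-step TD with n ≥ episode length has updated every visited state.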
Chain Markov Reward Process (Length = 19); state "A" at distance = 10 from the +1 goal.

Influence state A's value via the +1 reward:
- MC: must get to the +1 goal at least once after visiting A.
- TD(0): must visit the +1 goal at least once, at any time, and then wait for the value to propagate back.
Statistical properties:
- MC: unbiased, high variance; fast propagation of reward (to all visited states).
- TD(0): biased updates, low variance; slow propagation of reward (only the most recent state).
- n-step TD: medium variance; medium propagation of reward.

The correct balance is problem specific!
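The bias/variance claim can be checked empirically on a small random walk. This is a sketch with assumed parameters (5 non-terminal states, terminal rewards -1/+1, α = 0.1, 100 episodes per run), not an experiment from the slides:

```python
import random

# Empirical check of "MC: high variance, TD(0): low variance" on a
# 5-state random walk (assumed toy parameters).

def episode(start=2, n_states=5):
    """Random walk; returns (visited states, terminal reward)."""
    s, visited = start, []
    while 0 <= s < n_states:
        visited.append(s)
        s += random.choice((-1, 1))
    return visited, (1.0 if s >= n_states else -1.0)

def estimate(method, episodes=100, alpha=0.1, n_states=5):
    V = [0.0] * n_states
    for _ in range(episodes):
        visited, r = episode()
        if method == "mc":            # every visited state gets the full return
            for s in visited:
                V[s] += alpha * (r - V[s])
        else:                         # td0: one-step bootstrapped target
            for t, s in enumerate(visited):
                target = V[visited[t + 1]] if t + 1 < len(visited) else r
                V[s] += alpha * (target - V[s])
    return V[2]                       # start-state estimate (true value is 0)

def variance(vals):
    m = sum(vals) / len(vals)
    return sum((v - m) ** 2 for v in vals) / len(vals)

random.seed(0)
mc_runs = [estimate("mc") for _ in range(200)]
td_runs = [estimate("td") for _ in range(200)]
# Across independent runs, the MC estimates typically spread more widely
# than the TD(0) estimates, matching "unbiased, high variance" above.
```

Shrinking α or adding episodes changes the numbers but not the qualitative gap.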
[Backup diagrams: n-step TD (on-policy prediction) and n-step SARSA (on-policy control).]
[Backup diagrams: n-step TD (on-policy prediction), n-step SARSA (on-policy control), n-step off-policy SARSA (off-policy control; requires importance sampling!), n-step Expected SARSA (off-policy control).]
Targets:
- n-step SARSA:            G_{t:t+n} = R_t + γ R_{t+1} + ... + γ^{n-1} R_{t+n-1} + γ^n Q(S_{t+n}, A_{t+n})
- n-step off-policy SARSA: G_{t:t+n} = ρ_{t+1:t+n} (R_t + γ R_{t+1} + ... + γ^{n-1} R_{t+n-1} + γ^n Q(S_{t+n}, A_{t+n}))
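The off-policy target can be sketched as follows. All names and numbers are illustrative; π and b are explicit (state, action) → probability tables, and the ratio runs over the actions after A_t (which the updated pair already conditions on), up to and including the bootstrap action:

```python
# Sketch of the off-policy n-step SARSA target above (assumed toy values).
# pi / b map (state, action) -> probability under the target / behavior policy.

def off_policy_n_step_sarsa_target(steps, bootstrap_sa, Q, pi, b, gamma):
    """rho * (R_t + gamma R_{t+1} + ... + gamma^n Q(S_{t+n}, A_{t+n})).

    steps = [(S_t, A_t, R_t), ..., (S_{t+n-1}, A_{t+n-1}, R_{t+n-1})];
    bootstrap_sa = (S_{t+n}, A_{t+n}).  A_t is given, so the importance
    ratio covers A_{t+1}, ..., A_{t+n}.
    """
    ratio_sas = [(s, a) for s, a, _ in steps[1:]] + [bootstrap_sa]
    rho = 1.0
    for sa in ratio_sas:
        rho *= pi[sa] / b[sa]
    n = len(steps)
    G = sum(gamma**k * r for k, (_, _, r) in enumerate(steps))
    return rho * (G + gamma**n * Q[bootstrap_sa])

# Toy 2-step example with made-up probabilities and Q-values:
pi = {("s1", "a"): 1.0, ("s2", "a"): 1.0}
b  = {("s1", "a"): 0.5, ("s2", "a"): 0.5}
Q  = {("s2", "a"): 1.0}
steps = [("s0", "a", 0.0), ("s1", "a", 1.0)]
G = off_policy_n_step_sarsa_target(steps, ("s2", "a"), Q, pi, b, 0.5)
# rho = (1/0.5)*(1/0.5) = 4;  G = 4 * (0 + 0.5*1 + 0.25*1) = 3.0
```

Note how each behavior-sampled action multiplies in another π/b ratio, which is exactly where the variance of this estimator comes from.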
1-step Expected SARSA is off-policy, but needs no importance sampling! No actions are considered other than:
1. the action prescribed by Q(s, a) (the one being updated),
2. actions taken in expectation w.r.t. π_target.

n-step Expected SARSA takes n - 1 actions from π_b that require reweighing.
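The "actions in expectation" point can be sketched directly: the last step of an Expected SARSA target averages over the target policy instead of using the sampled action, so that step needs no ratio. The policy and Q-values below are assumed toy numbers:

```python
# Sketch: the final step of an Expected SARSA target averages over the
# *target* policy, so the bootstrap action needs no importance sampling.

def expected_sarsa_target(reward, next_state, Q, pi, actions, gamma):
    """R_t + gamma * sum_a pi(a | S_{t+1}) * Q(S_{t+1}, a)."""
    exp_q = sum(pi[(next_state, a)] * Q[(next_state, a)] for a in actions)
    return reward + gamma * exp_q

pi = {("s1", "left"): 0.25, ("s1", "right"): 0.75}   # assumed target policy
Q  = {("s1", "left"): 0.0, ("s1", "right"): 1.0}
t = expected_sarsa_target(1.0, "s1", Q, pi, ["left", "right"], 0.5)
# 1.0 + 0.5 * (0.25*0 + 0.75*1) = 1.375
```

In the n-step version, only the n - 1 intermediate sampled actions still need π/b reweighting; this last expectation step does not.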
Control Variates

[Figure: densities p(x) and p(y) with means μ_x = E[X] (unknown) and μ_y = E[Y] (known).]

Assumption: X is positively correlated with Y, i.e. when X > E[X], it is likely that Y > E[Y]. Example: rain and traffic.
Sample x ~ p(x) and y ~ p(y), and consider two different estimators of E[X]:

x̂_1 = x
x̂_2 = x + [μ_y - y]    (the bracketed term has expectation zero!)
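A quick numerical check of the two estimators. The joint model for X and Y below is an assumption chosen only to make them positively correlated (in the spirit of the rain/traffic example):

```python
import random

# Variance check for the two estimators above: x_hat1 = x and
# x_hat2 = x + (mu_y - y), with X and Y positively correlated (toy model).

random.seed(0)
mu_y = 0.0
samples = []
for _ in range(10000):
    y = random.gauss(0.0, 1.0)            # "rain": known mean mu_y = 0
    x = 2.0 + y + random.gauss(0.0, 0.3)  # "traffic": correlated with y
    samples.append((x, x + (mu_y - y)))

def variance(vals):
    m = sum(vals) / len(vals)
    return sum((v - m) ** 2 for v in vals) / len(vals)

plain = variance([s[0] for s in samples])
cv    = variance([s[1] for s in samples])
# Both estimate E[X] = 2, but the control-variate version has far lower
# variance (roughly 0.09 vs 1.09 under this toy model).
```

Subtracting the zero-mean term [μ_y - y] cancels the shared fluctuation, which is exactly the mechanism the off-policy return will exploit next.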
n-step PDIS with CV (Control Variates):

G_{t:h} = ρ_t (R_t + γ G_{t+1:h}) + (1 - ρ_t) V_{h-1}(S_t)

The (1 - ρ_t) V_{h-1}(S_t) term is the control variate. Since E[ρ_t] = 1:

E[1 - ρ_t] = 0  →  E[(1 - ρ_t) V_{h-1}(S_t)] = 0
Compare to without CV:

G_{t:h} = ρ_t (R_t + γ G_{t+1:h})

If ρ_t = 0:
- with CV:    G_{t:h} = V_{h-1}(S_t)
- without CV: G_{t:h} = 0

The control variate asks: "how much better or worse than average is this sample?"
One-step off-policy: Q-learning and Expected SARSA.

[Backup diagrams: Q-learning, Expected SARSA.]

The two are the same when π_target is the greedy policy. Both are off-policy, with no importance sampling!

n-step off-policy:

[Backup diagrams: n-step Q-learning, n-step Expected SARSA, n-step Tree Backups.]

Again both off-policy, with no importance sampling!
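The "same when the target policy is greedy" claim can be checked directly (toy Q-values assumed):

```python
# With a greedy target policy, the Expected SARSA backup equals the
# Q-learning backup max_a Q(S', a). Toy Q-values below are assumed.

Q = {("s1", "left"): 0.2, ("s1", "right"): 0.9}
actions = ["left", "right"]

q_learning = max(Q[("s1", a)] for a in actions)

greedy = max(actions, key=lambda a: Q[("s1", a)])
pi = {a: (1.0 if a == greedy else 0.0) for a in actions}  # greedy target policy
expected_sarsa = sum(pi[a] * Q[("s1", a)] for a in actions)
# Both backups equal 0.9: identical updates, no importance sampling needed.
```

With a non-greedy (e.g. ε-greedy) target policy the two backups differ, and Expected SARSA generalizes Q-learning.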
Tree Backups Target :