Selective attention in RL
B. Ravindran
Joint work with Shravan Matthur, Vimal Mathew, Sanjay Karanth, and Andy Barto
Features, features, everywhere!
• We inhabit a world rich in sensory input
• Focus on features relevant to task at hand
• Two questions:
 1. How do you characterize relevance?
 2. How do you identify relevant features?
Outline
• Characterization of relevance
 – MDP homomorphisms
 – Factored representations
 – Relativized options
• Identifying features
 – Option schemas
 – Deictic options
Markov Decision Processes
• An MDP M is the tuple M = ⟨S, A, Ψ, P, R⟩:
 – S: set of states
 – A: set of actions
 – Ψ ⊆ S × A: set of admissible state-action pairs
 – P: Ψ × S → [0, 1]: transition probability
 – R: Ψ → ℝ: expected reward
• Policy π: Ψ → [0, 1]
• Maximize total expected reward
 – optimal policy
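As a concrete reference, a minimal sketch (not from the talk; names are illustrative) of the MDP tuple as plain Python data:

```python
# Minimal sketch of the MDP tuple M = <S, A, Psi, P, R> (illustrative names).
from dataclasses import dataclass, field

@dataclass
class MDP:
    S: set                                   # states
    A: set                                   # actions
    Psi: set                                 # admissible (s, a) pairs, a subset of S x A
    P: dict = field(default_factory=dict)    # (s, a, s') -> transition probability
    R: dict = field(default_factory=dict)    # (s, a) -> expected reward

# A two-state example: action "go" moves left <-> right with reward 1.
M = MDP(
    S={"left", "right"},
    A={"go"},
    Psi={("left", "go"), ("right", "go")},
    P={("left", "go", "right"): 1.0, ("right", "go", "left"): 1.0},
    R={("left", "go"): 1.0, ("right", "go"): 1.0},
)
```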
Notion of Equivalence
(A, E) ≡ (B, N)
(A, W) ≡ (B, S)
(A, N) ≡ (B, E)
(A, S) ≡ (B, W)
[Figure: gridworld with symmetric states A and B and compass directions N, E, S, W]
M = ⟨S, A, Ψ, P, R⟩,  M′ = ⟨S′, A′, Ψ′, P′, R′⟩
Find reduced models that preserve some aspects of the original model
MDP Homomorphism
[Figure: commutative diagram – h aggregates state-action pairs (s, a) of M into (s′, a′) of M′ so that the dynamics P, R commute with P′, R′]
MDPs M = ⟨S, A, Ψ, P, R⟩ and M′ = ⟨S′, A′, Ψ′, P′, R′⟩.
A surjection h: Ψ → Ψ′, defined by h((s, a)) = (f(s), g_s(a)), where f: S → S′ and g_s: A_s → A′_f(s), for all s ∈ S, are surjections, such that for all s, s′ ∈ S and a ∈ A_s:
(1) P′(f(s), g_s(a), f(s′)) = Σ_{t ∈ [s′]_f} P(s, a, t)
(2) R′(f(s), g_s(a)) = R(s, a)
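A minimal sketch of checking conditions (1) and (2) directly, assuming finite MDPs stored as Python dictionaries (all names are illustrative):

```python
from collections import defaultdict

def is_homomorphism(P, R, Pp, Rp, f, g):
    """P[(s, a, t)], R[(s, a)]: dynamics of M;  Pp, Rp: dynamics of M'.
    f: dict s -> f(s);  g: dict (s, a) -> g_s(a) (state-dependent action recoding)."""
    # Blocks [t]_f: original states grouped by their image under f.
    blocks = defaultdict(set)
    for t, t_img in f.items():
        blocks[t_img].add(t)
    for (s, a) in R:                          # iterate over admissible pairs in Psi
        sa_img = (f[s], g[(s, a)])
        # Condition (1): P'(f(s), g_s(a), f(s')) = sum of P(s, a, t) over t in [s']_f
        for sp_img, block in blocks.items():
            total = sum(P.get((s, a, t), 0.0) for t in block)
            if abs(Pp.get(sa_img + (sp_img,), 0.0) - total) > 1e-9:
                return False
        # Condition (2): R'(f(s), g_s(a)) = R(s, a)
        if abs(Rp[sa_img] - R[(s, a)]) > 1e-9:
            return False
    return True
```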
Example
(A, E) ≡ (B, N)
(A, W) ≡ (B, S)
(A, N) ≡ (B, E)
(A, S) ≡ (B, W)
[Figure: the same symmetric gridworld with compass directions N, E, S, W]
M = ⟨S, A, Ψ, P, R⟩,  M′ = ⟨S′, A′, Ψ′, P′, R′⟩
h(A, E) = h(B, N) = ({A, B}, E)
State-dependent action recoding
Some Theoretical Results
• Optimal value equivalence:
 If h(s, a) = (s′, a′), then Q*(s, a) = Q*(s′, a′).
 Corollary: If h(s1, a1) = h(s2, a2), then Q*(s1, a1) = Q*(s2, a2).
• Theorem: If M′ is a homomorphic image of M, then a policy optimal in M′ induces an optimal policy in M. [generalizing results of Dean and Givan, 1997]
• Solve the homomorphic image and lift the policy to the original MDP.
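A minimal sketch of the lifting step (illustrative names; deterministic policies assumed): act in M by recoding the state through f and pulling the image action back through g_s:

```python
def lift_policy(pi_image, f, g):
    """pi_image: dict s' -> a', an optimal policy in the image M';
    f: dict s -> f(s);  g: dict (s, a) -> g_s(a)."""
    def pi(s):
        a_img = pi_image[f[s]]
        # g_s is a surjection, so any action that maps to a_img will do.
        return next(a for (t, a) in g if t == s and g[(t, a)] == a_img)
    return pi
```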
More results
• Polynomial-time algorithm to find reduced images (Dean and Givan '97; Lee and Yannakakis '92; Ravindran '04)
• Approximate homomorphisms (Ravindran and Barto '04)
 – Bounds for the loss in the optimal value function
• Soft homomorphisms (Sorg and Singh '09)
 – Fuzzy notions of equivalence between two MDPs
 – Efficient algorithm for finding them
• Transfer learning (Soni, Singh et al. '06), partial observability (Wolfe '10), etc.
Still more results
• Symmetries are special cases of homomorphisms (Matthur and Ravindran '08)
 – Finding symmetries is GI-complete
 – Harder than finding general reductions
• Efficient algorithms for constructing the reduced image (Matthur and Ravindran '07)
 – Factored MDPs
 – Polynomial in the size of the smaller MDP
Attention?
• How to use this concept for modeling attention?
• Combine with hierarchical RL
 – Look at sub-task-specific relevance
 – Structured homomorphisms
 – Deixis (δεῖξις, 'to point')
Factored MDPs
• State and action spaces defined as products of features/variables.
• Factor transition probabilities.
• Exploit structure to define simple transformations.
[Figure: 2-slice temporal Bayes net over variables x1, x2, x3 and reward r, with factors P(x1′ | x1, x2), P(x2′ | x1, x2), P(x3′ | x2, x3), and P(r | x1, x2)]
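A minimal sketch of a factored transition model matching the net above (binary variables and the uniform CPT values are placeholder assumptions):

```python
import itertools, random

# Parent structure matching the figure: x1' and x2' depend on (x1, x2),
# x3' depends on (x2, x3). CPT entries are P(v' = 1 | parents).
parents = {"x1": ("x1", "x2"), "x2": ("x1", "x2"), "x3": ("x2", "x3")}
cpt = {v: {pa: 0.5 for pa in itertools.product((0, 1), repeat=2)} for v in parents}

def step(state):
    """Sample the next state one variable at a time from the factored model."""
    return {v: int(random.random() < cpt[v][(state[p1], state[p2])])
            for v, (p1, p2) in parents.items()}

print(step({"x1": 0, "x2": 1, "x3": 0}))
```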
Using Factored Representations
• Represent symmetry information in terms of features.
 – E.g., the NE-SW symmetry can be represented as ((x, y), N) ≡ ((y, x), E) and ((x, y), W) ≡ ((y, x), S) (see the sketch below)
• Simple forms of transformations
 – Projections
 – Permutations
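A minimal sketch of this NE-SW symmetry as a feature permutation plus action recoding (illustrative names):

```python
# The NE-SW symmetry as a feature-level transformation: permute the
# coordinate features and recode the action (N <-> E, W <-> S).
ACTION_MAP = {"N": "E", "E": "N", "W": "S", "S": "W"}

def reflect(state, action):
    x, y = state
    return (y, x), ACTION_MAP[action]

assert reflect((2, 5), "N") == ((5, 2), "E")
assert reflect((2, 5), "W") == ((5, 2), "S")
```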
Hierarchical Reinforcement Learning
Options framework
• Options (Sutton, Precup, and Singh, 1999): a generalization of actions to include temporally-extended courses of action
An option is a triple ⟨I, π, β⟩ where:
• I ⊆ S is the set of states in which the option may be started
• π: Ψ → [0, 1] is the (stochastic) policy followed during the option
• β: S → [0, 1] is the probability of terminating in each state
Example: robot docking
• π: pre-defined controller
• β: terminate when docked or charger not visible
• I: all states in which charger is in sight
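A minimal sketch of the option triple, using the docking example (the state and action names are made up):

```python
from dataclasses import dataclass
from typing import Callable, Set

@dataclass
class Option:
    I: Set[str]                      # initiation set: states where the option may start
    pi: Callable[[str], str]         # policy followed while the option runs
    beta: Callable[[str], float]     # probability of terminating in each state

dock = Option(
    I={"charger_in_sight"},
    pi=lambda s: "servo_toward_charger",   # stands in for the pre-defined controller
    beta=lambda s: 1.0 if s in ("docked", "charger_not_visible") else 0.0,
)
```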
Sub-goal Options
• Gather all the red objects
• Five options – one for each room
• Sub-goal options implicitly represent the option policy
• Option MDPs are related to each other
Relativized Options
• Relativized options (Ravindran and Barto '03)
 – Spatial abstraction: MDP homomorphisms
 – Temporal abstraction: options framework
• Abstract representation of a related family of sub-tasks
 – Each sub-task can be derived by applying well-defined transformations
Relativized Options (Cont)
Relativized option: O = ⟨h, M_O, I, β⟩
 – h: option homomorphism
 – M_O: option MDP (the image of h)
 – I ⊆ S: initiation set
 – β: S_O → [0, 1]: termination criterion
[Figure: agent architecture – environment percepts are reduced through h to option-MDP states; the option's actions and the top-level actions act on the environment]
Rooms World Task
• Single relativized option – get-object-exit-room
• Especially useful when learning the option policy
 – Speed-up
 – Knowledge transfer
• Terminology: Iba '89
• Related to parameterized sub-tasks (Dietterich '00; Andre and Russell '01, '02)
Option Schema
• Finding the right transformation?
 – Given a set of candidate transformations
• The option MDP and policy can be viewed as a policy schema (Schmidt '75)
 – Template of a policy
 – Acquire the schema in a prototypical setting
 – Learn bindings of sensory inputs and actions to the schema
Problem Formulation
• Given:
 – ⟨M_O, I, β⟩ of a relativized option
 – H, a family of transformations
• Identify the option homomorphism h
• Formulate as a parameter estimation problem
 – One parameter for each sub-task, taking values from H
 – Samples: ⟨s1, a1⟩, ⟨s2, a2⟩, …
 – Bayesian learning
Algorithm
• Assume a uniform prior p_0(h, s)
• Experience: ⟨s_n, a_n, s_{n+1}⟩
• Update posteriors:
 p_n(h, s) = P(⟨s_n, a_n, s_{n+1}⟩ | h, s) · p_{n−1}(h, s) / (normalizing factor),
 where P(⟨s_n, a_n, s_{n+1}⟩ | h, s) = P_O(f(s_n), g_{s_n}(a_n), f(s_{n+1}))
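A minimal sketch of this update for one sub-task (illustrative names; H maps each candidate transformation h to its (f, g) pair):

```python
def update_posteriors(p, H, P_O, s_n, a_n, s_next):
    """p: dict h -> current weight p_{n-1}(h, s) for this sub-task;
    H: dict h -> (f, g), with f a state map and g a (state, action) recoding;
    P_O(s', a', s''): transition probability of the option MDP M_O."""
    for h, (f, g) in H.items():
        p[h] *= P_O(f[s_n], g[(s_n, a_n)], f[s_next])
    z = sum(p.values())                  # the normalizing factor
    for h in p:
        p[h] = p[h] / z if z > 0 else 1.0 / len(p)
    return p
```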
Complex Game World
• Symmetric option MDP
• One delayer
• 40 transformations
 – 8 spatial transformations combined with 5 projections
• Parameters of the option MDP differ from those of the rooms world task
Results: Speed of Convergence
• Learning the policy is more difficult than learning the correct transformation!
Results: Transformation Weights in Room 4
• Transformation 12 eventually converges to 1
Results: Transformation Weights in Room 2
• Weights oscillate a lot
• Some transformation dominates eventually
 – Which one changes from one run to another
Deictic Representation
• Making attention more explicit
• Sense the world via pointers – selective attention
• Actions defined with respect to pointers
• Agre '88
 – Game domain: Pengo
• Pointers can be arbitrarily complex
 – ice-cube-next-to-me
 – robot-chasing-me
Example: "Move block … to top of block …"
Deixis and Abstraction
• Deictic pointers project states and actions onto some abstract state-action space
• Consistent representation (Whitehead and Ballard '91)
 – States with the same abstract representation have the same optimal value
 – Lion algorithm; works with deterministic systems
• Extend relativized options to model deictic representation (Ravindran, Barto, and Mathew '07)
 – Factored MDPs
 – Restrict the transformations available
  • Only projections
 – Homomorphism conditions ensure a consistent representation
Deictic Option Schema
• Deictic option schema: ⟨K, D, O⟩
 – O: a relativized option
 – K: a set of deictic pointers
 – D: a collection of sets of possible projections, one for each pointer
• Finding the correct transformation for a given state gives a consistent representation
• Use a factored version of the parameter estimation algorithm
Classes of Pointers
• Independent pointers
• Mutually dependent pointers
• Dependent pointers
[Figure: 2-slice DBN fragments over variables x1, x2, x3 illustrating the three pointer classes]
Problem Formulation
• Given:
 – ⟨K, D, O⟩ of a relativized option
• Identify the right pointer configuration for each sub-task
• Formulate as a parameter estimation problem
 – One parameter for each set of connected pointers per sub-task
 – Takes values from the corresponding sets of projections in D
 – Samples: ⟨s1, a1⟩, ⟨s2, a2⟩, …
 – Heuristic modification of Bayesian learning
Heuristic Update Rule
• Use a heuristic update rule:
 w_n^l(h, s) = P_O^l(f(s_n), g_{s_n}(a_n), f(s_{n+1})) · w_{n−1}^l(h, s) / (normalizing factor),
 where the likelihood is floored: P_O^l(s, a, s′) ← max(ν, P_O^l(s, a, s′)), and ν is a small positive constant
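A minimal sketch of the floored update for one pointer component l (illustrative names, mirroring the Bayesian sketch earlier):

```python
NU = 1e-3   # the small positive constant nu

def heuristic_update(w, H, P_O_l, s_n, a_n, s_next):
    """Per-component weight update w_n^l: like the Bayesian update, but the
    likelihood is floored at nu so one unlikely transition cannot zero a weight."""
    for h, (f, g) in H.items():
        w[h] *= max(NU, P_O_l(f[s_n], g[(s_n, a_n)], f[s_next]))
    z = sum(w.values())                  # the normalizing factor
    for h in w:
        w[h] /= z
    return w
```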
Game Domain
• 2 deictic pointers: delayer and retriever
• 2 fixed pointers: where-am-I and have-diamond
• 8 possible values each for delayer and retriever
 – 64 transformations
Experimental Setup
• Composite agent
 – Uses 64 transformations and a single-component weight vector
• Deictic agent
 – Uses a 2-component weight vector
 – 8 transformations per component
• Hierarchical SMDP Q-learning
Experimental Results – Speed of Convergence
Experimental Results – Timing
• Execution time
 – Composite: 14 hours 29 minutes
 – Factored: 4 hours 16 minutes
Experimental Results – Composite Weights
Mean: 2006, Std. Dev.: 1673
Experimental Results – Delayer Weights
Mean: 52, Std. Dev.: 28.21
Experimental Results – Retriever Weights
Mean: 3045, Std. Dev.: 2332.42
Robosoccer
• Hard to learn a policy for the entire game
• Look at simpler problems
 – Keepaway
 – Half-field offence
• Learn a policy in a relative frame of reference
• Keep changing the bindings
Summary
• Richer representations needed for RL
 – Deictic representations
• Deictic option schemas
 – Combine ideas from hierarchical RL and MDP homomorphisms
 – Capture aspects of deictic representation
Future
• Build an integrated cognitive architecture that uses both bottom-up and top-down information
 – In perception
 – In decision making
• Combine aspects of supervised, unsupervised, and reinforcement learning, and planning approaches