DESCRIPTION
In real-world scenarios, decision making can be a very challenging task even for modern computers. Generalized reinforcement learning (GRL) was developed to facilitate complex decision making in highly dynamical systems through flexible policy generalization mechanisms using kernel-based methods. GRL combines the use of sampling, kernel functions, stochastic processes, non-parametric regression, and functional clustering.
Generalized Reinforcement Learning Framework
Barnett P. Chiu
3.22.2013
Overview
• Standard formulation of reinforcement learning
• Challenges in the standard RL framework due to its representation
• Generalized/alternative action formulation
  – Action as an operator
  – Parametric-action model
• Reinforcement field
  – Using a kernel as a similarity measure over "decision contexts" (i.e. generalized state-action pairs)
  – Value predictions using functions (vectors) from an RKHS (a vector space)
  – Representing the policy using kernelized samples
Reinforcement Learning: Examples
• A learning paradigm that formalizes sequential decision making under uncertainty
  – Navigating in an unknown environment
  – Playing and winning a game (e.g. Backgammon)
  – Retrieving information over the web (finding the right info on the right websites)
  – Assigning user tasks to a set of computational resources
• References:
  – Reinforcement Learning: A Survey, by Leslie P. Kaelbling, Michael L. Littman, Andrew W. Moore
  – Autonomous helicopter flight: Andrew Ng
Reinforcement Learning: Optimization Objective
• Optimize performance through trial and error
  – The agent interacts with the environment and performs actions that induce a state trajectory toward maximizing reward
  – Task dependent; can have multiple subgoals/subtasks
  – Learning from incomplete background knowledge
• Ex. 1: Navigating in an unknown environment
  – Objective: shortest path + avoiding obstacles + minimizing fuel consumption + ...
• Ex. 2: Assigning user tasks to a set of servers with unknown resource capacity
  – Objective: minimize turnaround time, maximize success rate, load balancing, ...
Reinforcement Learning: Markov Decision Process (typical but not always)
Reinforcement Learning: Markov Decision Process
Potential function: Q : S × A → utility
Challenges in Standard RL Formulations (1)
• Challenges from large state and action spaces
  – The complexity of RL methods depends largely on the dimensionality of the state space representation

Later ...
• Solution: generalized action representation
  – Explicitly express errors/variations of actions
  – It becomes possible to express correlations within a state-action combination and reveal their interdependency
  – The value function no longer needs to express a concrete mapping from state-action pairs to their values
  – Enables simultaneous control over multiple parameters that collectively describe behavioral details as actions are performed
x = (x_s, x_a) ∈ S ⊗ A+        // a decision context (generalized state-action pair)
k(x, x′)                        // compare two decision contexts
Challenges in Standard RL Formulations (2)
• Challenges from environmental shifts
  – Unexpected behaviors inherent in actions (same action but different outcomes over time)
  – E.g. recall the rover navigation example earlier
    • Navigational policy learned under different surface conditions
• Challenges from irreducible and varying action sets
  – Large number of decision points
  – Feasible actions do not stay the same
  – E.g. assigning tasks to time-varying computational resources in a dynamic virtual cluster (DVC)
    • Compute resources are dynamically acquired with limited walltimes
Reinforcement Learning Algorithms ...
• In most complex domains, T and R need to be estimated → RL framework
  – Temporal Difference (TD) learning
    • Q-learning, SARSA, TD(λ)
  – Example of TD learning: SARSA
  – SARSA update rule (both update rules are sketched in code below):
    Q(s, a) ← Q(s, a) + α [ r + γ Q(s′, a′) − Q(s, a) ]
  – Q-learning update rule:
    Q(s, a) ← Q(s, a) + α [ r + γ max_{a′} Q(s′, a′) − Q(s, a) ]
  – Function approximation, SMC, policy gradient, etc.
"But the problems are ..."
[Figure: MDP interaction loop: at state s_t the agent takes action a_t, receives reward r_t, and transitions to state s_{t+1}]
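The following is a minimal, illustrative sketch (not from the original slides) of the two tabular TD update rules shown above; the toy table sizes, learning rate, and discount factor are assumptions.

import numpy as np

# Q is an |S| x |A| table; alpha is the learning rate, gamma the discount factor.

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.95):
    """SARSA: on-policy TD target uses the action actually taken next."""
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    """Q-learning: off-policy TD target uses the greedy (max) successor value."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# Example usage on a toy 5-state, 3-action table:
Q = np.zeros((5, 3))
Q = sarsa_update(Q, s=0, a=1, r=1.0, s_next=2, a_next=0)
Q = q_learning_update(Q, s=2, a=0, r=-0.5, s_next=3)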
Alternative View of Actions
• Actions are not just decision choices
  – Variational procedure (e.g. principle of least {action, time})
  – Errors
    • Poor calibration of actuators
    • Extra-bumpy surfaces
  – Real-world domains may involve actions with high complexity
    • Robotic control (e.g. simultaneous control over a set of joint parameters)

"What is an action really?"
• Actions induce a shift in the state configuration
  – A continuous process
  – A process involving errors
  – State and action have hidden correlations
    • Current knowledge base (state)
    • New info retrieved from the external world (action)
  – Similarity between decisions: (x1, a1) vs. (x2, a2)
Action as an Operator
• Action operator (aop)
  – Acts on the current state and produces its successor state
  • The aop takes an input state ... (1)
  • The aop resolves stochastic effects in the action ... (2)
    – Recall: the action is now parameterized by constrained random variables (e.g. Δr, Δθ)
  • Given (1) and (2), the aop maps the input state to the output state (i.e. the successor state)
  • The current state vector + action vector → augmented state
• E.g.
Δx = Δr cos(Δθ),  Δy = Δr sin(Δθ)

diag(1 + Δx/x, 1 + Δy/y) · (x, y)ᵀ = (x + Δx, y + Δy)ᵀ

⇒ (x_s, x_a) = ((x, y), (Δx, Δy))
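A hedged sketch of the idea above: a parametric action (Δr, Δθ) is sampled within its bounds, resolved into a displacement, and paired with the current state to form the augmented state x = (x_s, x_a). The bounds, the sampling distribution, and the function names are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

def action_operator(state, r_bounds, theta_bounds):
    x, y = state
    # (2) resolve the stochastic effect: draw concrete parameters within the action's scope
    dr = rng.uniform(*r_bounds)
    dtheta = rng.uniform(*theta_bounds)
    dx, dy = dr * np.cos(dtheta), dr * np.sin(dtheta)
    successor = (x + dx, y + dy)      # (1)+(2): map the input state to the successor state
    augmented = ((x, y), (dx, dy))    # (x_s, x_a)
    return successor, augmented

successor, x_aug = action_operator(state=(1.0, 2.0),
                                   r_bounds=(0.5, 1.0),
                                   theta_bounds=(0.0, np.pi / 6))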
Value Prediction: Part I
"How to connect the notion of action operator with value predictions?"
Parametric Actions (1)
• Actions as a random process
• Example: x_a = (x_1, x_2) = (Δr, Δθ)

  Y(x_i) = P((x_i + w_i) ∈ Γ),  (x_a)_i = x_i | Y(x_i) ≥ η

• These 12 parametric actions each take on parameters bounded within pie-shaped scopes

[Figure: 12 parametric actions (labeled 1-12), each a pie-shaped sector of the (Δr, Δθ) plane; A and B mark example positions]

  diag(1 + Δx/x, 1 + Δy/y) · (x, y)ᵀ = (x + Δx, y + Δy)ᵀ
  ⇒ (x_s, x_a) = ((x, y), (Δx, Δy))
Parametric Actions (2)
• Augmented state space
• Learn a potential function Q(x_s, x_a) = Q(x, y, Δx, Δy)
• ... action as an operator
Using GPR: A Thought Process (1)
• Need to gauge the similarity/correlation between any two decisions → kernel functions
• Need to estimate the (potential) value of an arbitrary combination of state and action without exploring the entire state space → the value predictor is a "function" of kernels
• The class of functions that exhibits the above properties → functions drawn from a Reproducing Kernel Hilbert Space (RKHS)
• GP regression (GPR) induces such functions [GPML, Rasmussen]
  – Representer Theorem [B. Schölkopf et al. 2000]
k(x, x′)

Q+(x*) = Σ_{i=1}^{n} α_i k(x_i, x*)

cost((x_1, y_1, f(x_1)), ..., (x_m, y_m, f(x_m))) + regularizer(f)

Q+(·) = Σ_{i=1}^{n} α_i k(x_i, ·)
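An illustrative sketch of the representer-theorem form above: the value predictor is a weighted sum of kernels centered on the retained decision contexts, so an arbitrary augmented state can be scored without enumerating the state space. The squared-exponential kernel, the ridge-style solve for the weights α, and the random toy data are assumptions, not the slides' exact procedure.

import numpy as np

def se_kernel(a, b, length_scale=1.0, signal_var=1.0):
    # pairwise squared-exponential kernel between rows of a and b
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return signal_var * np.exp(-0.5 * d2 / length_scale ** 2)

def fit_alpha(X, q, noise_var=0.1):
    # alpha = (K + sigma^2 I)^{-1} q, so Q+(x*) = sum_i alpha_i k(x_i, x*)
    K = se_kernel(X, X)
    return np.linalg.solve(K + noise_var * np.eye(len(X)), q)

def predict_q(X, alpha, x_star):
    return se_kernel(x_star, X) @ alpha

X = np.random.rand(20, 4)     # 20 augmented states (x_s, x_a), illustrative
q = np.random.rand(20)        # utility estimates from the baseline learner, illustrative
alpha = fit_alpha(X, q)
q_star = predict_q(X, alpha, np.random.rand(3, 4))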
Using GPR: A Thought Process (2)
• With GPR, we work with samples of experience
• The approach
  – Find the "best" function(s) that explain the data (decisions made so far)
    • Model selection
  – Keep only the most important samples from which to derive the value functions
    • Sample selection
• It is relatively easy to implement the notion of model selection with GPR
  – Tune the hyperparameters of the kernel (as a covariance function of the GP) such that the marginal likelihood (of the data) is maximized
  – Periodic model selection to cope with environmental shifts
• Sample selection
  – Two-tier learning architecture
    • Use a baseline RL algorithm to obtain utility estimates
    • Use GPR to achieve generalization

Ω : {(x_1, q_1), ..., (x_m, q_m)}
*GPR I: Gaussian Process
• GP: a probability distribution over a set of random variables (or function values), any finite set of which has a (joint) Gaussian distribution
  – Given a prior assumption over functions (see previous page)
  – Make observations (e.g. policy learning) and gather evidence
  – The posterior distribution over functions eliminates those not consistent with the evidence

f | M_i ~ GP(m(x), k(x, x′)),   y | x, f, M_i ~ N(f, σ²_noise I)
Prior and Posterior
[Figure: samples from the GP prior (left) and from the posterior after conditioning on observations (right); input x on the horizontal axis, output f(x) on the vertical axis]

Predictive distribution:
p(y* | x*, x, y) ~ N( k(x*, x)ᵀ [K + σ²_noise I]⁻¹ y,
                      k(x*, x*) + σ²_noise − k(x*, x)ᵀ [K + σ²_noise I]⁻¹ k(x*, x) )

(Figure and equation from Rasmussen, Gaussian Process Regression lectures, 2009)
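Below is a minimal sketch matching the predictive distribution above, returning both the mean and the covariance; the squared-exponential kernel, noise level, and toy 1-D data are assumptions used only for illustration.

import numpy as np

def kernel(A, B, ell=1.0, sf2=1.0):
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return sf2 * np.exp(-0.5 * d2 / ell ** 2)

def gp_predict(X, y, X_star, noise_var=0.01):
    # mean = k(X*, X) [K + sigma^2 I]^{-1} y
    # cov  = k(X*, X*) + sigma^2 I - k(X*, X) [K + sigma^2 I]^{-1} k(X, X*)
    K = kernel(X, X) + noise_var * np.eye(len(X))
    K_s = kernel(X_star, X)
    K_ss = kernel(X_star, X_star) + noise_var * np.eye(len(X_star))
    K_inv = np.linalg.solve(K, np.eye(len(X)))
    mean = K_s @ K_inv @ y
    cov = K_ss - K_s @ K_inv @ K_s.T
    return mean, cov

X = np.linspace(-5, 5, 15).reshape(-1, 1)
y = np.sin(X).ravel() + 0.1 * np.random.randn(15)
mean, cov = gp_predict(X, y, np.linspace(-5, 5, 50).reshape(-1, 1))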
*GPR II: State Correlation Hypothesis
• Kernel: covariance function (PD kernel)
  – Correlation hypothesis for states
  – Prior:
    k(x, x′) = θ_0 exp(−½ (x − x′)ᵀ D⁻¹ (x − x′)) + θ_1 σ²_noise
             = θ_0 exp(−½ Σ_i (x_i − x′_i)² / θ_i) + θ_1 σ²_noise
    (one length-scale θ_i per input dimension, D = diag(θ_i))
  – Observe samples Ω : {(x_1, q_1), ..., (x_m, q_m)}
  – Compute the GP posterior (over latent functions):
    Q+(x_i) | X, q ~ GP( m_post(x_i) = k(x_i, X)ᵀ K(X, X)⁻¹ q,
                         cov_post(x_i) = k(x_i, x_i) − k(x_i, X)ᵀ K(X, X)⁻¹ k(X, x_i) )
  – Predictive distribution → averaging over all possible posterior weights with respect to a Gaussian likelihood function
*GPR III: Value Prediction
• Predictive distribution at a test point x*
  – Prediction at a new test point is achieved by comparing it with all the samples retained in memory
  – The predictive value depends largely on correlated samples
  – The (sample) correlation hypothesis (i.e. the kernel) applies across the entire state space
  – Reproducing property from the RKHS [B. Schölkopf and A. Smola, 2002]

q* = Q+(x*) = k(x*)ᵀ K(X, X)⁻¹ q = Σ_{i=1}^{n} α_i k(x_i, x*)

cov(q*) = k(x*, x*) − k*ᵀ (K + σ_n² I)⁻¹ k*

⟨Q+(·), k(·, x*)⟩ = Q+(x*)
*GPR IV: Model Selection
• Maximizing the marginal likelihood (ARD)
  – A trade-off between data fit and model complexity
  – Optimization: take the partial derivative w.r.t. each hyperparameter to get
    • conjugate gradient optimization
  – We obtain the optimal hyperparameters that best explain the data
  – The resulting model follows the Occam's Razor principle
  – Computing K⁻¹ is expensive → reinforcement sampling

log p(q | X) = −½ qᵀ K⁻¹ q − ½ log |K| − (n/2) log 2π

∂/∂θ_i log p(q | X, θ) = ½ qᵀ K⁻¹ (∂K/∂θ_i) K⁻¹ q − ½ tr(K⁻¹ ∂K/∂θ_i)
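A hedged sketch of ARD-style model selection: maximize the log marginal likelihood above over the kernel hyperparameters (here log length-scales per dimension, log signal variance, log noise variance). The slides mention conjugate gradients; this sketch substitutes scipy's L-BFGS-B with numerical gradients, and all names and data are illustrative.

import numpy as np
from scipy.optimize import minimize

def ard_kernel(X, log_ell, log_sf2):
    scaled = X / np.exp(log_ell)            # per-dimension length scales (ARD)
    d2 = np.sum((scaled[:, None, :] - scaled[None, :, :]) ** 2, axis=-1)
    return np.exp(log_sf2) * np.exp(-0.5 * d2)

def neg_log_marginal_likelihood(params, X, q):
    d = X.shape[1]
    log_ell, log_sf2, log_sn2 = params[:d], params[d], params[d + 1]
    K = ard_kernel(X, log_ell, log_sf2) + (np.exp(log_sn2) + 1e-6) * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, q))
    # -log p(q|X) = 1/2 q^T K^{-1} q + 1/2 log|K| + n/2 log(2*pi)
    return (0.5 * q @ alpha + np.sum(np.log(np.diag(L)))
            + 0.5 * len(q) * np.log(2 * np.pi))

X = np.random.rand(30, 4)                   # augmented states (x_s, x_a), illustrative
q = np.random.rand(30)                      # observed fitness values, illustrative
x0 = np.zeros(X.shape[1] + 2)
result = minimize(neg_log_marginal_likelihood, x0, args=(X, q), method="L-BFGS-B")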
Value Predictions (Part II)
• What remains to be solved:
  – How to obtain the training signals {q_i}?
    • Use a baseline RL agent to estimate utility based on an MDP with concrete action choices
    • Use GPR to generalize the utility estimates with the parameterized action representation
  – How to train the desired value function using only essential samples?
    • Memory constraint!
    • Sample replacement using reinforcement signals
      – Old samples referencing old utility estimates can be replaced by new samples with new estimates: experience association
1b. Estimate utility for each "regular" state-action pair
  – Expand the action into its parametric representation
  – The kernel assumption accounts for random effects in the action
[Figure: two-tier architecture: the baseline MDP learner observes (s_t, a_t, r_t) transitions and passes utility estimates u_t, together with the parametric action a(x_a), to the GPR layer]
k(·, (x_s, x_a)) = k(·, x)

Q+(x) = Q+(x_s, x_a) = Σ_{i=1}^{m} α_i k(x, x_i)

⟨Q+(·), k(·, x)⟩

Ω : {(x_1, q_1), ..., (x_m, q_m)}
(Excerpt, p. 108)

... data X for simplicity and thus p(w | X) = p(w) = N(0, Σ) = N(0, σ²I). Conversely, the prior over function values is explicitly modeled by a covariance matrix K evaluated pair-wise from the data set X through the kernel; i.e. K_ij = k(x_i, x_j). In the case where components of q correspond to noisy observations of f, the log likelihood in (5.48) is shifted to the following:

log p(q | X) = −½ qᵀ (K + σ²I)⁻¹ q − ½ log |K + σ²I| − (n/2) log 2π   (5.50)

The result of (5.50) can be obtained directly by assuming q | X ~ N(0, K + σ²I). Alternatively, one can use a noisy kernel to explicitly incorporate the noise assumption such that K_ij = k(x_i, x_j) + σ²δ_ij, where δ_ij is the Kronecker delta. In this case, (5.50) is reduced to (5.48) by "absorbing" the noise effect into the covariance matrix K. In the later discussion, we will consider the noise as part of the definition of the kernel function for notational convenience.

Now, consider again the noisy training set Ω. We denote the set of new noisy test points as X*, which are induced by the stochastic effect in the action parameters. The corresponding targets are denoted q*, containing a sequence of predictive values with respect to the input in X*. With the assumption of a noisy kernel k(x_i, x_j) + σ²δ_ij as the covariance function, the predictive distribution over new targets is thus given by q* | X, X*, q ~ N(q̄*, cov[q*]), where

q̄* = K(X*, X) K(X, X)⁻¹ q   (5.51)

cov[q*] = K(X*, X*) − K(X*, X) K(X, X)⁻¹ K(X, X*)   (5.52)

Here we note that the derivation to arrive at (5.51) and (5.52) is similar to the weight-space model; the interested reader may refer to [14, 107] for more details. With only one test point, (5.51) reduces to ...
(Excerpt, p. 111)

... where m′ ∈ ℕ, α′_j ∈ ℝ, analogous to the parameters for Q, and x′_j ∈ X′. By virtue of the symmetric property inherent in both ⟨·,·⟩ and k(·,·), one can then postulate that the inner product of the two functions Q and Q′ can be well defined in the following form:

⟨Q, Q′⟩ = Σ_{i=1}^{m} Σ_{j=1}^{m′} α_i α′_j k(x_i, x′_j)   (5.59)

The next step is to verify that (5.59) does compute a valid inner product. Notice that (5.59) can be expressed explicitly as a linear combination of both Q and Q′; that is,

⟨Q, Q′⟩ = Σ_{j=1}^{m′} α′_j ( Σ_{i=1}^{m} α_i k(x_i, x′_j) ) = Σ_{j=1}^{m′} α′_j Q(x′_j)   (5.60)

Similarly,

⟨Q, Q′⟩ = Σ_{i=1}^{m} α_i ( Σ_{j=1}^{m′} α′_j k(x_i, x′_j) ) = Σ_{i=1}^{m} α_i Q′(x_i)   (5.61)

Equations (5.60) and (5.61) imply that ⟨·,·⟩ is both symmetric and bilinear, since ⟨Q, Q′⟩ is linear in both Q and Q′, and ⟨Q, Q′⟩ = ⟨Q′, Q⟩. The symmetric property also follows from the fact that k(·,·) is symmetric. Further, the positive definiteness of k(·,·) suggests that for any function Q (and Q′), ⟨·,·⟩ always evaluates to a nonnegative value; that is,

⟨Q, Q⟩ = ‖Q‖²_H = Σ_{i,j=1}^{m} α_i α_j k(x_i, x_j) ≥ 0   (5.62)

With the inner product definition in (5.59), one can thus evaluate the inner product between Q and a kernelized pattern k(·, x) in terms of

⟨Q, k(·, x)⟩ = Σ_{i=1}^{m} α_i k(x_i, x) = Q(x)   (5.63)
1a. Baseline RL: e.g. SARSA
  – MDP-based RL treats actions as decision choices
  – Estimate utility without looking at action parameters
2. Use the generalization capacity of the GP to predict values
3. From the baseline RL layer to GPR (next)
(Excerpt, p. 124)

Q+(s⁺) ≡ Q+(x) = Q+(⟨x_1^s, ..., x_n^s, x_1^a, ..., x_m^a⟩)   (6.2)

Thus, if the agent traverses the state space with m steps, a sequence of training samples will be generated and retained in memory: Ω : {(x_1, q_1), ..., (x_m, q_m)} = (X, q), where X and q, respectively, denote the set of augmented states and their corresponding observed fitness values. It is helpful for the moment to consider these training samples in Ω as a functional data set [39] referenced by the experience particles distributed over the state space. The mechanism for estimating their values is deferred to Section 6.5. Each experience particle effectively defines a control policy that generalizes into the neighboring (augmented) state space through properties of the kernel to be discussed shortly. The similarity of any two particles (or, equivalently, two referenced augmented states) takes into account both the state vector x_s and the action vector x_a. This formulation is made possible by allowing the action to take on continuous variables along with the use of a kernel function as a correlation hypothesis associating one augmented state to another through their inner products in the kernel-induced feature space [22]. With the above constructs defined, the next step toward establishing the reinforcement field is to represent the fitness function in a manner that integrates with parametric actions and, in the meantime, serves as a "critic" for the policy embedded in experience particles. In this research, we represent the fitness value function through a progressively updated Gaussian process.

6.4 Value Predictions using GPR

Consider the training set Ω : {(x_1, q_1), ..., (x_m, q_m)} given earlier. The goal is to identify a function Q+(x) (Definition 6.2) fitting these samples with a trade-off between quality of the prediction (i.e. data fit) and smoothness assumptions of the function (i.e. model complexity) [14]. Value predictions using GPR first assume a normal prior distribution over functions, and subsequently reject those functions not consistent with the observations (i.e. experience ...)
3. From the baseline RL layer to GPR
  3a. Take a sample of the random action vector
  3b. Form the augmented state: (s, a) → (x_s, x_a)
  3c. Propagate the utility signal from the baseline learner and use it as the training signal
  3d. Insert the new (functional) data into the current working memory
4. Use GPR to predict new test points
  4a. Kernelize the new test point: (s, a) → (x_s, x_a) → k(·, x)
  4b. Take the inner product of k(·, x) and Q+ to obtain the utility estimate (fitness value)
Policy Estimation Using Q+
• Policy evaluation using Q+(x_s, x_a) ~ GP
  – Q+ can be estimated through GPR
  – Define the policy through Q+: e.g. softmax (Gibbs distribution), sketched below
  – π[Q+] is an increasing functional over Q+
π(s, a^(i)) = exp(Q+(s, a^(i)) / τ) / Σ_j exp(Q+(s, a^(j)) / τ)
            = exp(Q+(x_s, x_a^(i)) / τ) / Σ_j exp(Q+(x_s, x_a^(j)) / τ)
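A minimal sketch of the Gibbs/softmax policy above: the probability of selecting parametric action i at state x_s is proportional to exp(Q+(x_s, x_a^(i)) / τ). Here q_plus is assumed to be any callable returning the fitness value of an augmented state (e.g. the GPR predictor sketched earlier); all names are illustrative.

import numpy as np

def softmax_policy(q_plus, x_s, candidate_actions, tau=1.0, rng=None):
    rng = rng or np.random.default_rng()
    values = np.array([q_plus(np.concatenate([x_s, x_a])) for x_a in candidate_actions])
    logits = (values - values.max()) / tau          # subtract max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    i = rng.choice(len(candidate_actions), p=probs)
    return candidate_actions[i], probs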
Particle Reinforcement (1)
• Recall from ARD that we want to minimize the dimension of K
  – Preserve only the "essential info"
  – Samples that lead to an increase in the TD signal → positively-polarized particles
  – Samples that lead to a decrease in the TD signal → negatively-polarized particles
  – Positive particles lead to a policy that is aligned with the global objective (maximizing payoff)
  – Negative particles serve as counterexamples for the agent to avoid repeating the same mistakes
  – Example next
Particle Reinforcement (2)
[Figure: state partitions populated with positively (+) and negatively (−) polarized particles]

"Problem: how to replace older samples?"
• Maintain a set of state partitions
• Keep track of both positive particles and negative particles
  – Positive particles refer to the desired control policy while negative particles point out what to avoid
• "Interpolate" the control policy. Recall:

π[Q+] = π(s, a^(i)) = exp(Q+(x_s, x_a^(i)) / τ) / Σ_j exp(Q+(x_s, x_a^(j)) / τ)
Experience Association (1)
• Basic learning principle:
  – "If a decision led to a desired result in the past, then a similar decision should be replicated to cope with similar situations in the future"
• Again, use the kernel as a similarity measure
• Agent: "Is my current situation similar to a particular experience in the past?"
• Agent: "I see there are two highly related instances in memory where I did action #2, which led to a pretty decent result. OK, I'll try that again (or something similar)."
Experience Association (2): Policy Generalization

x = (x_s, x_a) ∈ S ⊗ A+

[Figure: example experience particles A, B, and C over the partitioned state space, each paired with one of the 12 pie-shaped parametric actions]

(x_s, x_a) ≈ (x_s + Δx_s, x_a + Δx_a), compared via the kernel k(x, x′)

• The similarity measure comes in handy
  – relate similar samples of experience
  – sample update
Reinforcement Field
• A reinforcement field is a vector field in a Hilbert space established by one or more kernels through their linear combination as a representation of the fitness function, where each kernel is centered around a particular augmented state vector:

Q+(·) = Σ_{i=1}^{n} α_i k(·, x_i)
Reinforcement Field: Example
[Figure: grid world with the 12 pie-shaped parametric actions]
• Objective: travel to the destination while circumventing obstacles
• The state space is partitioned into a set of local regions
• Gray areas are filled with obstacles
• A strong penalty is imposed when the agent runs into obstacle-filled areas
Reinforcement Field: Using Different Kernels

k(x, x′) = θ_0 exp(−½ (x − x′)ᵀ D⁻¹ (x − x′)) + θ_1 σ²_noise

k(x, x′) = θ_0 k_s(x_s, x′_s) k_a(x_a, x′_a) + θ_1 σ²_noise
Reinforcement Field Example
[Figure: the grid-world example with start state S, gray obstacle-filled regions, and the destination; same setup and objective as above]
Action Operator: Step-by-Step (1)
• At a given state s ∈ S, the agent chooses a (parametric) action according to the current policy π[Q+]
• The action operator resolves the random effect in the action parameters through a sampling process such that the (stochastic) action is reduced to a fixed action vector x_a
• The action vector resolved above is subsequently paired with the current state vector x_s to form an augmented state x = (x_s, x_a)
Action Operator: Step-by-Step (2)
• The new augmented state x is kernelized in terms of k(·, x) such that any state x implicitly maps to a function that expects another state as an argument; k(x′, x) evaluates to a high value provided that x and x′ are strongly correlated
• The value prediction for the new augmented state is given by the reproducing property (in the RKHS):

⟨Q+, k(·, x)⟩ = ⟨ Σ_{i=1}^{m} α_i k(·, x_i), k(·, x) ⟩ = Σ_{i=1}^{m} α_i k(x, x_i) = Q+(x)
Next ...
• The entire reinforcement field is generated by a set of training samples that are aligned with the global objective: maximizing payoff
• "Can we learn decision concepts out of these training samples by treating them as structured functional data?"

Ω : {(x_1, q_1), ..., (x_m, q_m)}
A Few Observations
• Consider Ω : {(x_1, q_1), ..., (x_m, q_m)}
• Properties of GPR
  – Correlated inputs have correlated signals
  – Can we assemble correlated samples together to form clusters of "similar decisions"?
  – Functional clustering
• Cluster criteria
  – Similarity takes into account both the input patterns and their corresponding signals
Example: Task Assignment Domain (1)
• Given a large set of computational resources and a continuous stream of user tasks, find the optimal task-assignment policy
  – Simplified control policy with actions: dispatch job or do not dispatch job (→ actions as decision choices)
    • Not practical
    • Users' concern: under what conditions can we optimize a given performance metric (e.g. minimizing turnaround time)?
  – Characterize each candidate server in terms of its resource capacity (e.g. CPU percentage time, disk space, memory, bandwidth, owner-imposed usage criteria, etc.)
    • Actions: dispatching the current task to machine X
    • Problems?
Example: Task Assignment Domain (2)
• A general resource-sharing environment could have a large number of distributed resources (e.g. Grid networks, volunteer computing, cloud computing, etc.)
  – 1000 machines → 1000 decision points per state
• Treating user tasks and machines collectively as a multi-agent system?
  – Combinatorial state/action space
  – Large number of agents
Task Assignment, Match Making: Other Similar Examples
• Recommender systems: selecting {movies, music, ...} catering to the interests/needs of (potential) customers; matching users with their favorite items (content-based)
  – Online advertising campaigns trying to match products and services with the right target audience
• NLP applications: question-answering systems
  – Matching questions with relevant documents
Generalizing RL with Functional Clustering
• Abstract action representation
  – (Functional) pattern discovery from within the generalized state-action pairs (as localized policies)
  – Functional clustering by relating correlated inputs w.r.t. their correlated functional responses
    • Inputs: generalized state-action pairs
    • Functional responses: utilities
  – Covariance matrix from GPR → fully-connected similarity graph → graph Laplacian
  – Spectral clustering → a set of abstractions over (functionally) similar localized policies
  – Control policy over these abstractions used as control actions
    • Reduces decision points per state
    • Reveals interesting correlations between state features and action parameters, e.g. match-making criteria
Policy Generalization

x = (x_s, x_a) ∈ S ⊗ A+

(x_s, x_a) ≈ (x_s + Δx_s, x_a + Δx_a), compared via the kernel k(x, x′)

"Policy as a functional over Q+"
"→ Experience Association"
Experience Association (1)
• Basic learning principle:
  – "If a decision led to a desired result in the past, then a similar decision can be re-applied to cope with similar situations in the future"
• Again, use the kernel as a similarity measure
• Agent: "Is my current situation similar to a particular scenario in the past?"
• Agent: "I see there are two instances of similar memory where I did action #2, and that action led to a decent result. OK, I'll try that again (or something similar)." "And if not, let me avoid repeating the mistake."
• Hypothetical state s_h⁺
  – Definition
  – Agent: "I look at a past experience, replicate the action of that experience, and apply it to my current situation; is the result going to be similar?"
*Experience Association (2)
• Step 1: form a hypothetical state ("If I were to be ...")
  – I am at a state s′, with state vector x_s′
  – I pick a (relevant) particle ω^(i) ∈ Ω^(i) ← A^(i) from a set (later used in an abstraction); the particle is s⁺ : (x_s, x_a)
  – Replicate the action and apply it to my own state: s_h⁺ = (x_s′, x_a)
• Step 2: compare ("... is the result going to be similar?")
  – Compare the result using the kernel (the state correlation hypothesis):
    k(s_h⁺, s⁺) = k((x_s′, x_a), (x_s, x_a))
  – If k(s_h⁺, s⁺) ≥ τ, then this sample is correlated to my state and the state of the target sample is in context; otherwise, it is out of context (see the sketch below)
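A hedged sketch of the two steps above: form a hypothetical state s_h⁺ by pairing my state vector x_s′ with a remembered particle's action x_a, then test k(s_h⁺, s⁺) ≥ τ to decide whether that particle is "in context". The squared-exponential kernel, the threshold, and the toy particle are illustrative assumptions.

import numpy as np

def se_kernel(a, b, ell=1.0):
    return np.exp(-0.5 * np.sum((a - b) ** 2) / ell ** 2)

def in_context(x_s_mine, particle, tau=0.8):
    x_s, x_a = particle                          # remembered augmented state (x_s, x_a)
    s_plus = np.concatenate([x_s, x_a])
    s_h_plus = np.concatenate([x_s_mine, x_a])   # replicate the action on my own state
    return se_kernel(s_h_plus, s_plus) >= tau

particle = (np.array([0.0, 0.0]), np.array([0.5, 0.1]))
print(in_context(np.array([0.1, -0.1]), particle))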
*Experience Association (3)
• Normalize the kernel so that it assumes a probability-like semantics
  – This is a generalization of the concept of probability amplitude in QM
• Nadaraya-Watson model (Section 5.7)

k̃(x, x′) = k(x, x′) / √( k(x, x) k(x′, x′) ) = ⟨Φ̃(x), Φ̃(x′)⟩,  Φ̃ = Φ / ‖Φ‖
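A minimal sketch of the normalization above, which scales similarity values into [0, 1] (cosine normalization in the feature space induced by k). Since a squared-exponential kernel already has k(x, x) = 1, the demo below uses a polynomial kernel as an assumption so the normalization has a visible effect.

import numpy as np

def normalized_kernel(k, x, x_prime):
    return k(x, x_prime) / np.sqrt(k(x, x) * k(x_prime, x_prime))

poly = lambda a, b: (a @ b + 1.0) ** 2          # illustrative polynomial kernel
print(normalized_kernel(poly, np.array([0.0, 1.0]), np.array([0.2, 0.9])))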
Concept-Driven Learning Architecture (CDLA)
• The agent derives the control policy only at the level of abstract actions
  – Further reduces decision points per state
  – Finds interesting patterns across state features and action parameters, e.g. match-making criteria
• We have the necessary representation to form (functional) clusters
  – Kernel as a similarity measure over augmented states
  – Covariance matrix K from GPR
• Each abstract action is represented through a set of experience particles
CDLA Conceptual Hierarchy
• Graph representation from K, where K_ij = k(x_i, x_j)
• A set of unstructured experience particles
• Partitioned graph with two abstract actions
*Spectral Clustering: Big Picture
• Construct the similarity graph (← GPR)
• Graph Laplacian (GL)
• Graph cut as the objective function (e.g. normalized cut)
• Optimize the graph-cut criterion
  – Minimizing the normalized cut → partitions as tight as possible
    ← maximal in-cluster links (weights) and minimal between-cluster links
  – NP-hard → spectral relaxation
    • Use eigenvectors of the GL as a continuous relaxation of the cluster indicator vectors
• Evaluate final clusters using a selected instance-based clustering algorithm (k-means++); see the sketch below
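A hedged sketch of the pipeline above: build an affinity graph (here from a kernel matrix over augmented states), form a normalized graph Laplacian, take the eigenvectors with the smallest eigenvalues as relaxed cluster indicators, and cluster in that eigenspace. The symmetric normalized Laplacian and the tiny k-means below are illustrative stand-ins; a library k-means++ (e.g. scikit-learn) would normally replace the toy loop.

import numpy as np
from scipy.linalg import eigh

def spectral_clustering(W, k, seed=0):
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L_sym = np.eye(len(W)) - (W * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]
    eigvals, eigvecs = eigh(L_sym)                 # eigenvalues in ascending order
    U = eigvecs[:, :k]                             # k smallest: relaxed cluster indicators
    U = U / (np.linalg.norm(U, axis=1, keepdims=True) + 1e-12)
    # toy k-means in the eigenspace
    rng = np.random.default_rng(seed)
    centers = U[rng.choice(len(U), k, replace=False)]
    for _ in range(50):
        labels = np.argmin(((U[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        centers = np.array([U[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels

# Affinity from a kernel over augmented states (illustrative data):
X = np.random.rand(40, 4)
W = np.exp(-0.5 * np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))
labels = spectral_clustering(W, k=3)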
*Spectral Clustering: Definitions

Pairwise affinity:    w_nm = k(x_n, x_m)
Degree:               D_n = Σ_{m=1}^{N} w_nm
Volume of a set:      Vol(C) = Σ_{n∈C} D_n
Cut between 2 sets:   Cut(C_1, C_2) = Σ_{n∈C_1} Σ_{m∈C_2} w_nm
*Spectral Clustering: Graph Cut
• Graph cut
  – Naïve cut:
    Cut(A^(1), ..., A^(k)) = ½ Σ_{i=1}^{k} W(A^(i), V \ A^(i))
  – (K-way) ratio cut:
    RatioCut(A^(1), ..., A^(k)) = ½ Σ_{i=1}^{k} W(A^(i), V \ A^(i)) / |A^(i)|
  – (K-way) normalized cut:
    NCut(A^(1), ..., A^(k)) = ½ Σ_{i=1}^{k} W(A^(i), V \ A^(i)) / vol(A^(i))
  – NP-hard!
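Below is a small, illustrative sketch computing the graph-cut quantities defined above for a given partition (a list of node-index arrays) over a pairwise affinity matrix W; the random symmetric W and the example partition are assumptions.

import numpy as np

def cut(W, A, B):
    return W[np.ix_(A, B)].sum()

def vol(W, A):
    return W[A].sum()                              # sum of degrees of nodes in A

def ncut(W, partition):
    V = np.arange(len(W))
    total = 0.0
    for A in partition:
        complement = np.setdiff1d(V, A)
        total += cut(W, A, complement) / vol(W, A)
    return 0.5 * total

W = np.random.rand(6, 6)
W = (W + W.T) / 2                                  # symmetric affinity matrix
print(ncut(W, [np.array([0, 1, 2]), np.array([3, 4, 5])]))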
*Spectral Clustering: Approximation
• Approximation
  – Given a set of data to cluster
  – Form the affinity matrix W
  – Form a graph Laplacian L = f(W)
  – Find the leading k eigenvectors: Lv = λv
  – Cluster the data set in the eigenspace
  – Project back to the original data
• Major differences among algorithms: the choice of L = f(W)
*Random Walk Graph Laplacian (1)
• Definition:
  L_rw = I − D⁻¹K = I − P,   equivalently L_rw = D⁻¹(D − K)
• First-order Markov transition matrix P
  – Each entry is the probability of transitioning from node i to node j in a single step:
    P_ij = K_ij / d_i = k(x_i, x_j) / Σ_{j=1}^{m} K_ij,   where d_i = Σ_{j=1}^{m} K_ij = Σ_{j=1}^{m} k(x_i, x_j)
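A minimal sketch of the definitions above: P = D⁻¹K is the one-step transition matrix of a random walk over the affinity graph, and L_rw = I − P is the random-walk graph Laplacian. The small affinity matrix K is an illustrative assumption.

import numpy as np

def random_walk_laplacian(K):
    d = K.sum(axis=1)                  # degree d_i = sum_j k(x_i, x_j)
    P = K / d[:, None]                 # P_ij = K_ij / d_i
    L_rw = np.eye(len(K)) - P          # L_rw = I - D^{-1} K
    return L_rw, P

K = np.array([[1.0, 0.8, 0.1],
              [0.8, 1.0, 0.2],
              [0.1, 0.2, 1.0]])
L_rw, P = random_walk_laplacian(K)
print(P.sum(axis=1))                   # each row of P sums to 1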
*Random Walk Graph Laplacian (2)
• K-way normalized cut: find a partitioning such that the probability of transitioning across clusters is minimized
  – When used with the random-walk GL, this corresponds to minimizing the probability of state transitions between clusters
• In CDLA
  – Each abstract action corresponds to a coherent decision concept
  – Why? By taking any action in any of the states associated with the same concept, there is a minimal chance of transitioning to the states associated with other concepts

NCut(C_1, ..., C_k) = Σ_{i=1}^{k} ( 1 − P(C_i → C_i | C_i) )
                    = Σ_{i=1}^{k} ( 1 − (Σ_{n∈C_i} Σ_{m∈C_i} w_nm) / (Σ_{n∈C_i} Σ_{m=1}^{N} w_nm) )
Functional Clustering using GPSC
• GPR + SC → GP spectral clustering (GPSC)
  – Kernel as the correlation hypothesis
  – The same hypothesis used as the similarity measure for SC
  – Correlated inputs share approximately identical outputs: similar augmented states ~ close fitness values or utilities
• Warning: the reverse is NOT true, and this is why we need multiple concepts that may share similar output signals
  – E.g. match-making job and machine requirements
CDLA: Context Matching
• Each abstract action implicitly defines an action-selection strategy
  – In context: at a given state, find the most correlated state pair with its action, followed by applying that action
  – Out of context: random action selection
    • Applicable to 1) an infant agent, 2) an empty cluster, 3) referenced particles that don't match (by experience association)
• Caveat:
  – The utility (fitness value) for a random action selection does not correspond to the true estimate of resolving the abstract action:
    Q+(s, A^(i)) ≠ Q+(s, a^(i)) !
  – Need another way to adjust the utility estimate for its fitness value
Empirical Study: Task Assignment Domain

Parameter                    | Spec. values
-----------------------------|-------------------------------------------------------------------------
Task feature set (state)     | Task type, size, expected runtime
Server feature set (action)  | Service type, percentage CPU time, memory, disk space, CPU speed, job slots
Kernel k                     | SE kernel + noise (see (5.3))
Num. of abstract actions     | 10 (assumed to be known)
Model update cycle T         | 10 state transitions towards 100

Goal: find for each incoming user task the best candidate server(s) that are mutually agreeable in terms of matching criteria.
Empirical Study: Learned Concepts

Concept | Task type | Size | Expected runtime | Service type | %CPU time | Fitness value
--------|-----------|------|------------------|--------------|-----------|---------------
1       | 1         | 1.1  | 0.93             | 1            | 9.784     | 120.41
2       | 2         | 2.5  | 1.98             | 2            | 10.235    | 128.13
3       | 3         | 3.2  | 2.92             | 3            | 15.29     | 135.23
4       | 1         | 1.0  | 1.02             | 2            | 20.36     | -50.05
5       | 2         | 2.0  | 2.09             | 3            | 0.58      | -47.28

• Illustration of 5 different learned decision concepts.
• The top 3 rows indicate successful matches while the bottom 2 rows indicate failed matches.
Empirical Study: Comparison
[Figure: reward per episode over 100 episodes for G-SARSA, Condor, and Random]
• Performance comparison among (1) stripped-down Condor match-making (black), (2) G-SARSA (blue), and (3) Random (red)
Sample Promotion (1)

k(s_h⁺, s) ≥ τ

Two possibilities:
(1) the target is indeed correlated to a given abstract action (AA)
(2) the target is NOT correlated to ANY abstract action → trial and error + sample promotion
Sample Promotion (2)
• Recall: each abstract action implicitly defines an action-selection strategy
• Key: the functional data set must be as accurate as possible in terms of its predictive strength
• Match a new experience particle against the memory, find the most relevant piece, and use its value estimation
• The out-of-context case leads to randomized action selection
• How does the agent still manage to gain experience in this case?
• Randomized action selection + sample promotion
  – Case 1: the random action does get a positive result
    • Match the result per abstract action by sampling, using the experience-association operation: (s, a^(i)) → (x_s, x_a^(i))
    • If indeed correlated to some experience, then update the fitness value; otherwise → case 2
  – Case 2: discard the sample because there are no points of reference; the sample is not useful
(Excerpt, p. 191)

... reinforcement learning framework, CDLA, by integrating the reinforcement field and decision concept formation into a coherent learning process represented by the center box in Figure 8.1. In particular, the novel learning architecture addresses the core issue of standard reinforcement learning algorithms from the perspective of parametric action formulation, extended state space representation and sample-driven policy representation. All the components in CDLA are represented directly or indirectly through the kernel function, which serves as the fundamental building block for value prediction, concept formation as well as policy search. In Figure 8.1, kernelized components are marked by a circled k in yellow whereas indirectly-kernelized ...

Figure 8.1: CDLA schematic. Periodic iterations between value function approximation and concept-driven cluster formation constitute the heartbeat of CDLA. The conceptual sequence from 1a to 1e corresponds to the logical flow that progressively updates the reinforcement field as a policy generalization mechanism. The sequence from 2a to 2d, on the other hand, references the logical flow toward establishing an abstract control policy based on a set of coherent decision concepts as high-level control decisions.
GRL Schematic

Demo of 4 Decision Concepts A(1) ~ A(4)
[Figure: timeline demo of an agent acting under four learned decision concepts A(1)-A(4) with labels Collect, Avoid, and Clean; events include "Discovery of plant life" (+), "Trash removed and organized" (+), "Fallen into deep water" (−), and "Trapped in high hills" (−). Concept polarity: {A(2), A(4)} positive, {A(1), A(3)} negative.]
Future Work
More on ...
• Using spectral clustering to cluster functional data
• Experience association
• Context matching
• Evolving the samples
  – Sample promotion
  – Probabilistic model for experience associations
• Evolving the clusters
  – Need to adjust value estimates for abstract actions as new samples join in
  – Morphing clusters