DESCRIPTION
In real-world scenarios, decision making can be a very challenging task even for modern computers. Generalized reinforcement learning (GRL) was developed to facilitate complex decision making in highly dynamical systems through flexible policy generalization mechanisms using kernel-based methods. GRL combines the use of sampling, kernel functions, stochastic processes, non-parametric regression, and functional clustering.
Generalized Reinforcement Learning Framework
Barnett P. Chiu
3.22.2013
Overview
• Standard formulation of reinforcement learning
• Challenges in the standard RL framework due to its representation
• Generalized/alternative action formulation
  – Action as an operator
  – Parametric-action model
• Reinforcement field
  – Using a kernel as a similarity measure over "decision contexts" (i.e. generalized state-action pairs)
  – Value predictions using functions (vectors) from an RKHS (a vector space)
  – Representing the policy using kernelized samples
Reinforcement Learning: Examples
• A learning paradigm that formalizes sequential decision making under uncertainty
  – Navigating in an unknown environment
  – Playing and winning a game (e.g. Backgammon)
  – Retrieving information over the web (finding the right info on the right websites)
  – Assigning user tasks to a set of computational resources
• References:
  – Reinforcement Learning: A Survey, by Leslie P. Kaelbling, Michael L. Littman, Andrew W. Moore
  – Autonomous helicopter flight: Andrew Ng
Reinforcement Learning: Optimization Objective
• Optimize performance through trial and error
  – The agent interacts with the environment and performs actions that induce a state trajectory toward maximizing reward
  – Task dependent; can have multiple subgoals/subtasks
  – Learning from incomplete background knowledge
• Ex. 1: Navigating in an unknown environment
  – Objective: shortest path + avoiding obstacles + minimizing fuel consumption + ...
• Ex. 2: Assigning user tasks to a set of servers with unknown resource capacity
  – Objective: minimize turnaround time, maximize success rate, load balancing, ...
Reinforcement Learning: Markov Decision Process (typical but not always)
Reinforcement Learning: Markov Decision Process
Potential function: Q : S × A → utility
Challenges in Standard RL Formulations (1)
• Challenges from large state and action spaces
  – The complexity of RL methods depends largely on the dimensionality of the state space representation

Later ...
• Solution: generalized action representation
  – Explicitly express errors/variations of actions
  – It becomes possible to express correlations within a state-action combination and reveal their interdependency
  – The value function no longer needs to express a concrete mapping from state-action pairs to their values
  – Enables simultaneous control over multiple parameters that collectively describe behavioral details as actions are performed
x = (x_s, x_a) ∈ S ⊗ A+        // a decision context (generalized state-action pair)
k(x, x′)                        // compare two decision contexts
Challenges in Standard RL Formulations (2)
• Challenges from environmental shifts
  – Unexpected behaviors inherent in actions (same action but different outcomes over time)
  – E.g. recall the rover navigation example earlier
    • Navigational policy learned under different surface conditions
• Challenges from irreducible and varying action sets
  – Large number of decision points
  – Feasible actions do not stay the same
  – E.g. assigning tasks to time-varying computational resources in a dynamic virtual cluster (DVC)
    • Compute resources are dynamically acquired with limited walltimes
Reinforcement Learning Algorithms ...
• In most complex domains, T and R need to be estimated → RL framework
  – Temporal Difference (TD) learning
    • Q-learning, SARSA, TD(λ)
  – Example of TD learning: SARSA
  – SARSA update rule (both update rules are sketched in code below):
    Q(s, a) ← Q(s, a) + α [ r + γ Q(s′, a′) − Q(s, a) ]
  – Q-learning update rule:
    Q(s, a) ← Q(s, a) + α [ r + γ max_{a′} Q(s′, a′) − Q(s, a) ]
  – Function approximation, SMC, policy gradient, etc.
"But the problems are ..."
[Figure: MDP interaction loop: at state s_t the agent takes action a_t, receives reward r_t, and transitions to state s_{t+1}]
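The following is a minimal, illustrative sketch (not from the original slides) of the two tabular TD update rules shown above; the toy table sizes, learning rate, and discount factor are assumptions.

import numpy as np

# Q is an |S| x |A| table; alpha is the learning rate, gamma the discount factor.

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.95):
    """SARSA: on-policy TD target uses the action actually taken next."""
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    """Q-learning: off-policy TD target uses the greedy (max) successor value."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# Example usage on a toy 5-state, 3-action table:
Q = np.zeros((5, 3))
Q = sarsa_update(Q, s=0, a=1, r=1.0, s_next=2, a_next=0)
Q = q_learning_update(Q, s=2, a=0, r=-0.5, s_next=3)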
Alternative View of Actions
• Actions are not just decision choices
  – Variational procedure (e.g. principle of least {action, time})
  – Errors
    • Poor calibration of actuators
    • Extra-bumpy surfaces
  – Real-world domains may involve actions with high complexity
    • Robotic control (e.g. simultaneous control over a set of joint parameters)

"What is an action really?"
• Actions induce a shift in the state configuration
  – A continuous process
  – A process involving errors
  – State and action have hidden correlations
    • Current knowledge base (state)
    • New info retrieved from the external world (action)
  – Similarity between decisions: (x1, a1) vs. (x2, a2)
Action as an Operator
• Action operator (aop)
  – Acts on the current state and produces its successor state
  • The aop takes an input state ... (1)
  • The aop resolves stochastic effects in the action ... (2)
    – Recall: the action is now parameterized by constrained random variables (e.g. Δr, Δθ)
  • Given (1) and (2), the aop maps the input state to the output state (i.e. the successor state)
  • The current state vector + action vector → augmented state
• E.g.
Δx = Δr cos(Δθ),  Δy = Δr sin(Δθ)

diag(1 + Δx/x, 1 + Δy/y) · (x, y)ᵀ = (x + Δx, y + Δy)ᵀ

⇒ (x_s, x_a) = ((x, y), (Δx, Δy))
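A hedged sketch of the idea above: a parametric action (Δr, Δθ) is sampled within its bounds, resolved into a displacement, and paired with the current state to form the augmented state x = (x_s, x_a). The bounds, the sampling distribution, and the function names are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

def action_operator(state, r_bounds, theta_bounds):
    x, y = state
    # (2) resolve the stochastic effect: draw concrete parameters within the action's scope
    dr = rng.uniform(*r_bounds)
    dtheta = rng.uniform(*theta_bounds)
    dx, dy = dr * np.cos(dtheta), dr * np.sin(dtheta)
    successor = (x + dx, y + dy)      # (1)+(2): map the input state to the successor state
    augmented = ((x, y), (dx, dy))    # (x_s, x_a)
    return successor, augmented

successor, x_aug = action_operator(state=(1.0, 2.0),
                                   r_bounds=(0.5, 1.0),
                                   theta_bounds=(0.0, np.pi / 6))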
Value Prediction: Part I
"How to connect the notion of action operator with value predictions?"
Parametric Actions (1)
• Actions as a random process
• Example: x_a = (x_1, x_2) = (Δr, Δθ)

  Y(x_i) = P((x_i + w_i) ∈ Γ),  (x_a)_i = x_i | Y(x_i) ≥ η

• These 12 parametric actions each take on parameters bounded within pie-shaped scopes

[Figure: 12 parametric actions (labeled 1-12), each a pie-shaped sector of the (Δr, Δθ) plane; A and B mark example positions]

  diag(1 + Δx/x, 1 + Δy/y) · (x, y)ᵀ = (x + Δx, y + Δy)ᵀ
  ⇒ (x_s, x_a) = ((x, y), (Δx, Δy))
Parametric Actions (2)
• Augmented state space
• Learn a potential function Q(x_s, x_a) = Q(x, y, Δx, Δy)
• ... action as an operator
Using GPR: A Thought Process (1)
• Need to gauge the similarity/correlation between any two decisions → kernel functions
• Need to estimate the (potential) value of an arbitrary combination of state and action without exploring the entire state space → the value predictor is a "function" of kernels
• The class of functions that exhibits the above properties → functions drawn from a Reproducing Kernel Hilbert Space (RKHS)
• GP regression (GPR) induces such functions [GPML, Rasmussen]
  – Representer Theorem [B. Schölkopf et al. 2000]
k(x, x′)

Q+(x*) = Σ_{i=1}^{n} α_i k(x_i, x*)

cost((x_1, y_1, f(x_1)), ..., (x_m, y_m, f(x_m))) + regularizer(f)

Q+(·) = Σ_{i=1}^{n} α_i k(x_i, ·)
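An illustrative sketch of the representer-theorem form above: the value predictor is a weighted sum of kernels centered on the retained decision contexts, so an arbitrary augmented state can be scored without enumerating the state space. The squared-exponential kernel, the ridge-style solve for the weights α, and the random toy data are assumptions, not the slides' exact procedure.

import numpy as np

def se_kernel(a, b, length_scale=1.0, signal_var=1.0):
    # pairwise squared-exponential kernel between rows of a and b
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return signal_var * np.exp(-0.5 * d2 / length_scale ** 2)

def fit_alpha(X, q, noise_var=0.1):
    # alpha = (K + sigma^2 I)^{-1} q, so Q+(x*) = sum_i alpha_i k(x_i, x*)
    K = se_kernel(X, X)
    return np.linalg.solve(K + noise_var * np.eye(len(X)), q)

def predict_q(X, alpha, x_star):
    return se_kernel(x_star, X) @ alpha

X = np.random.rand(20, 4)     # 20 augmented states (x_s, x_a), illustrative
q = np.random.rand(20)        # utility estimates from the baseline learner, illustrative
alpha = fit_alpha(X, q)
q_star = predict_q(X, alpha, np.random.rand(3, 4))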
Using GPR: A Thought Process (2)
• With GPR, we work with samples of experience
• The approach
  – Find the "best" function(s) that explain the data (decisions made so far)
    • Model selection
  – Keep only the most important samples from which to derive the value functions
    • Sample selection
• It is relatively easy to implement the notion of model selection with GPR
  – Tune the hyperparameters of the kernel (as a covariance function of the GP) such that the marginal likelihood (of the data) is maximized
  – Periodic model selection to cope with environmental shifts
• Sample selection
  – Two-tier learning architecture
    • Use a baseline RL algorithm to obtain utility estimates
    • Use GPR to achieve generalization

Ω : {(x_1, q_1), ..., (x_m, q_m)}
*GPR I: Gaussian Process
• GP: a probability distribution over a set of random variables (or function values), any finite set of which has a (joint) Gaussian distribution
  – Given a prior assumption over functions (see previous page)
  – Make observations (e.g. policy learning) and gather evidence
  – The posterior distribution over functions eliminates those not consistent with the evidence

f | M_i ~ GP(m(x), k(x, x′)),   y | x, f, M_i ~ N(f, σ²_noise I)
Prior and Posterior
[Figure: samples from the GP prior (left) and from the posterior after conditioning on observations (right); input x on the horizontal axis, output f(x) on the vertical axis]

Predictive distribution:
p(y* | x*, x, y) ~ N( k(x*, x)ᵀ [K + σ²_noise I]⁻¹ y,
                      k(x*, x*) + σ²_noise − k(x*, x)ᵀ [K + σ²_noise I]⁻¹ k(x*, x) )

(Figure and equation from Rasmussen, Gaussian Process Regression lectures, 2009)
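Below is a minimal sketch matching the predictive distribution above, returning both the mean and the covariance; the squared-exponential kernel, noise level, and toy 1-D data are assumptions used only for illustration.

import numpy as np

def kernel(A, B, ell=1.0, sf2=1.0):
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return sf2 * np.exp(-0.5 * d2 / ell ** 2)

def gp_predict(X, y, X_star, noise_var=0.01):
    # mean = k(X*, X) [K + sigma^2 I]^{-1} y
    # cov  = k(X*, X*) + sigma^2 I - k(X*, X) [K + sigma^2 I]^{-1} k(X, X*)
    K = kernel(X, X) + noise_var * np.eye(len(X))
    K_s = kernel(X_star, X)
    K_ss = kernel(X_star, X_star) + noise_var * np.eye(len(X_star))
    K_inv = np.linalg.solve(K, np.eye(len(X)))
    mean = K_s @ K_inv @ y
    cov = K_ss - K_s @ K_inv @ K_s.T
    return mean, cov

X = np.linspace(-5, 5, 15).reshape(-1, 1)
y = np.sin(X).ravel() + 0.1 * np.random.randn(15)
mean, cov = gp_predict(X, y, np.linspace(-5, 5, 50).reshape(-1, 1))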
*GPR II: State Correlation Hypothesis
• Kernel: covariance function (PD kernel)
  – Correlation hypothesis for states
  – Prior:
    k(x, x′) = θ_0 exp(−½ (x − x′)ᵀ D⁻¹ (x − x′)) + θ_1 σ²_noise
             = θ_0 exp(−½ Σ_i (x_i − x′_i)² / θ_i) + θ_1 σ²_noise
    (one length-scale θ_i per input dimension, D = diag(θ_i))
  – Observe samples Ω : {(x_1, q_1), ..., (x_m, q_m)}
  – Compute the GP posterior (over latent functions):
    Q+(x_i) | X, q ~ GP( m_post(x_i) = k(x_i, X)ᵀ K(X, X)⁻¹ q,
                         cov_post(x_i) = k(x_i, x_i) − k(x_i, X)ᵀ K(X, X)⁻¹ k(X, x_i) )
  – Predictive distribution → averaging over all possible posterior weights with respect to a Gaussian likelihood function
*GPR III: Value Prediction
• Predictive distribution at a test point x*
  – Prediction at a new test point is achieved by comparing it with all the samples retained in memory
  – The predictive value depends largely on correlated samples
  – The (sample) correlation hypothesis (i.e. the kernel) applies across the entire state space
  – Reproducing property from the RKHS [B. Schölkopf and A. Smola, 2002]

q* = Q+(x*) = k(x*)ᵀ K(X, X)⁻¹ q = Σ_{i=1}^{n} α_i k(x_i, x*)

cov(q*) = k(x*, x*) − k*ᵀ (K + σ_n² I)⁻¹ k*

⟨Q+(·), k(·, x*)⟩ = Q+(x*)
*GPR IV: Model Selection
• Maximizing the marginal likelihood (ARD)
  – A trade-off between data fit and model complexity
  – Optimization: take the partial derivative w.r.t. each hyperparameter to get
    • conjugate gradient optimization
  – We obtain the optimal hyperparameters that best explain the data
  – The resulting model follows the Occam's Razor principle
  – Computing K⁻¹ is expensive → reinforcement sampling

log p(q | X) = −½ qᵀ K⁻¹ q − ½ log |K| − (n/2) log 2π

∂/∂θ_i log p(q | X, θ) = ½ qᵀ K⁻¹ (∂K/∂θ_i) K⁻¹ q − ½ tr(K⁻¹ ∂K/∂θ_i)
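A hedged sketch of ARD-style model selection: maximize the log marginal likelihood above over the kernel hyperparameters (here log length-scales per dimension, log signal variance, log noise variance). The slides mention conjugate gradients; this sketch substitutes scipy's L-BFGS-B with numerical gradients, and all names and data are illustrative.

import numpy as np
from scipy.optimize import minimize

def ard_kernel(X, log_ell, log_sf2):
    scaled = X / np.exp(log_ell)            # per-dimension length scales (ARD)
    d2 = np.sum((scaled[:, None, :] - scaled[None, :, :]) ** 2, axis=-1)
    return np.exp(log_sf2) * np.exp(-0.5 * d2)

def neg_log_marginal_likelihood(params, X, q):
    d = X.shape[1]
    log_ell, log_sf2, log_sn2 = params[:d], params[d], params[d + 1]
    K = ard_kernel(X, log_ell, log_sf2) + (np.exp(log_sn2) + 1e-6) * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, q))
    # -log p(q|X) = 1/2 q^T K^{-1} q + 1/2 log|K| + n/2 log(2*pi)
    return (0.5 * q @ alpha + np.sum(np.log(np.diag(L)))
            + 0.5 * len(q) * np.log(2 * np.pi))

X = np.random.rand(30, 4)                   # augmented states (x_s, x_a), illustrative
q = np.random.rand(30)                      # observed fitness values, illustrative
x0 = np.zeros(X.shape[1] + 2)
result = minimize(neg_log_marginal_likelihood, x0, args=(X, q), method="L-BFGS-B")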
Value Predictions (Part II)
• What remains to be solved:
  – How to obtain the training signals {q_i}?
    • Use a baseline RL agent to estimate utility based on an MDP with concrete action choices
    • Use GPR to generalize the utility estimates with the parameterized action representation
  – How to train the desired value function using only essential samples?
    • Memory constraint!
    • Sample replacement using reinforcement signals
      – Old samples referencing old utility estimates can be replaced by new samples with new estimates: experience association
1b. Estimate utility for each "regular" state-action pair
  – Expand the action into its parametric representation
  – The kernel assumption accounts for random effects in the action
[Figure: two-tier architecture: the baseline MDP learner observes (s_t, a_t, r_t) transitions and passes utility estimates u_t, together with the parametric action a(x_a), to the GPR layer]
k(·, (x_s, x_a)) = k(·, x)

Q+(x) = Q+(x_s, x_a) = Σ_{i=1}^{m} α_i k(x, x_i)

⟨Q+(·), k(·, x)⟩

Ω : {(x_1, q_1), ..., (x_m, q_m)}
(Excerpt, p. 108)

... data X for simplicity and thus p(w | X) = p(w) = N(0, Σ) = N(0, σ²I). Conversely, the prior over function values is explicitly modeled by a covariance matrix K evaluated pair-wise from the data set X through the kernel; i.e. K_ij = k(x_i, x_j). In the case where components of q correspond to noisy observations of f, the log likelihood in (5.48) is shifted to the following:

log p(q | X) = −½ qᵀ (K + σ²I)⁻¹ q − ½ log |K + σ²I| − (n/2) log 2π   (5.50)

The result of (5.50) can be obtained directly by assuming q | X ~ N(0, K + σ²I). Alternatively, one can use a noisy kernel to explicitly incorporate the noise assumption such that K_ij = k(x_i, x_j) + σ²δ_ij, where δ_ij is the Kronecker delta. In this case, (5.50) is reduced to (5.48) by "absorbing" the noise effect into the covariance matrix K. In the later discussion, we will consider the noise as part of the definition of the kernel function for notational convenience.

Now, consider again the noisy training set Ω. We denote the set of new noisy test points as X*, which are induced by the stochastic effect in the action parameters. The corresponding targets are denoted q*, containing a sequence of predictive values with respect to the input in X*. With the assumption of a noisy kernel k(x_i, x_j) + σ²δ_ij as the covariance function, the predictive distribution over new targets is thus given by q* | X, X*, q ~ N(q̄*, cov[q*]), where

q̄* = K(X*, X) K(X, X)⁻¹ q   (5.51)

cov[q*] = K(X*, X*) − K(X*, X) K(X, X)⁻¹ K(X, X*)   (5.52)

Here we note that the derivation to arrive at (5.51) and (5.52) is similar to the weight-space model; the interested reader may refer to [14, 107] for more details. With only one test point, (5.51) reduces to ...
(Excerpt, p. 111)

... where m′ ∈ ℕ, α′_j ∈ ℝ, analogous to the parameters for Q, and x′_j ∈ X′. By virtue of the symmetric property inherent in both ⟨·,·⟩ and k(·,·), one can then postulate that the inner product of the two functions Q and Q′ can be well defined in the following form:

⟨Q, Q′⟩ = Σ_{i=1}^{m} Σ_{j=1}^{m′} α_i α′_j k(x_i, x′_j)   (5.59)

The next step is to verify that (5.59) does compute a valid inner product. Notice that (5.59) can be expressed explicitly as a linear combination of both Q and Q′; that is,

⟨Q, Q′⟩ = Σ_{j=1}^{m′} α′_j ( Σ_{i=1}^{m} α_i k(x_i, x′_j) ) = Σ_{j=1}^{m′} α′_j Q(x′_j)   (5.60)

Similarly,

⟨Q, Q′⟩ = Σ_{i=1}^{m} α_i ( Σ_{j=1}^{m′} α′_j k(x_i, x′_j) ) = Σ_{i=1}^{m} α_i Q′(x_i)   (5.61)

Equations (5.60) and (5.61) imply that ⟨·,·⟩ is both symmetric and bilinear, since ⟨Q, Q′⟩ is linear in both Q and Q′, and ⟨Q, Q′⟩ = ⟨Q′, Q⟩. The symmetric property also follows from the fact that k(·,·) is symmetric. Further, the positive definiteness of k(·,·) suggests that for any function Q (and Q′), ⟨·,·⟩ always evaluates to a nonnegative value; that is,

⟨Q, Q⟩ = ‖Q‖²_H = Σ_{i,j=1}^{m} α_i α_j k(x_i, x_j) ≥ 0   (5.62)

With the inner product definition in (5.59), one can thus evaluate the inner product between Q and a kernelized pattern k(·, x) in terms of

⟨Q, k(·, x)⟩ = Σ_{i=1}^{m} α_i k(x_i, x) = Q(x)   (5.63)
1a. Baseline RL: e.g. SARSA
  – MDP-based RL treats actions as decision choices
  – Estimate utility without looking at action parameters
2. Use the generalization capacity of the GP to predict values
3. From the baseline RL layer to GPR (next)
(Excerpt, p. 124)

Q+(s⁺) ≡ Q+(x) = Q+(⟨x_1^s, ..., x_n^s, x_1^a, ..., x_m^a⟩)   (6.2)

Thus, if the agent traverses the state space with m steps, a sequence of training samples will be generated and retained in memory: Ω : {(x_1, q_1), ..., (x_m, q_m)} = (X, q), where X and q, respectively, denote the set of augmented states and their corresponding observed fitness values. It is helpful for the moment to consider these training samples in Ω as a functional data set [39] referenced by the experience particles distributed over the state space. The mechanism for estimating their values is deferred to Section 6.5. Each experience particle effectively defines a control policy that generalizes into the neighboring (augmented) state space through properties of the kernel to be discussed shortly. The similarity of any two particles (or, equivalently, two referenced augmented states) takes into account both the state vector x_s and the action vector x_a. This formulation is made possible by allowing the action to take on continuous variables along with the use of a kernel function as a correlation hypothesis associating one augmented state to another through their inner products in the kernel-induced feature space [22]. With the above constructs defined, the next step toward establishing the reinforcement field is to represent the fitness function in a manner that integrates with parametric actions and, in the meantime, serves as a "critic" for the policy embedded in experience particles. In this research, we represent the fitness value function through a progressively updated Gaussian process.

6.4 Value Predictions using GPR

Consider the training set Ω : {(x_1, q_1), ..., (x_m, q_m)} given earlier. The goal is to identify a function Q+(x) (Definition 6.2) fitting these samples with a trade-off between quality of the prediction (i.e. data fit) and smoothness assumptions of the function (i.e. model complexity) [14]. Value predictions using GPR first assume a normal prior distribution over functions, and subsequently reject those functions not consistent with the observations (i.e. experience ...)
3. From the baseline RL layer to GPR
  3a. Take a sample of the random action vector
  3b. Form the augmented state: (s, a) → (x_s, x_a)
  3c. Propagate the utility signal from the baseline learner and use it as the training signal
  3d. Insert the new (functional) data into the current working memory
4. Use GPR to predict new test points
  4a. Kernelize the new test point: (s, a) → (x_s, x_a) → k(·, x)
  4b. Take the inner product of k(·, x) and Q+ to obtain the utility estimate (fitness value)
Policy Estimation Using Q+
• Policy evaluation using Q+(x_s, x_a) ~ GP
  – Q+ can be estimated through GPR
  – Define the policy through Q+: e.g. softmax (Gibbs distribution), sketched below
  – π[Q+] is an increasing functional over Q+
π(s, a^(i)) = exp(Q+(s, a^(i)) / τ) / Σ_j exp(Q+(s, a^(j)) / τ)
            = exp(Q+(x_s, x_a^(i)) / τ) / Σ_j exp(Q+(x_s, x_a^(j)) / τ)
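A minimal sketch of the Gibbs/softmax policy above: the probability of selecting parametric action i at state x_s is proportional to exp(Q+(x_s, x_a^(i)) / τ). Here q_plus is assumed to be any callable returning the fitness value of an augmented state (e.g. the GPR predictor sketched earlier); all names are illustrative.

import numpy as np

def softmax_policy(q_plus, x_s, candidate_actions, tau=1.0, rng=None):
    rng = rng or np.random.default_rng()
    values = np.array([q_plus(np.concatenate([x_s, x_a])) for x_a in candidate_actions])
    logits = (values - values.max()) / tau          # subtract max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    i = rng.choice(len(candidate_actions), p=probs)
    return candidate_actions[i], probs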
Particle Reinforcement (1)
• Recall from ARD that we want to minimize the dimension of K
  – Preserve only the "essential info"
  – Samples that lead to an increase in the TD signal → positively-polarized particles
  – Samples that lead to a decrease in the TD signal → negatively-polarized particles
  – Positive particles lead to a policy that is aligned with the global objective (maximizing payoff)
  – Negative particles serve as counterexamples for the agent to avoid repeating the same mistakes
  – Example next
Particle Reinforcement (2)
[Figure: state partitions populated with positively (+) and negatively (−) polarized particles]

"Problem: how to replace older samples?"
• Maintain a set of state partitions
• Keep track of both positive particles and negative particles
  – Positive particles refer to the desired control policy while negative particles point out what to avoid
• "Interpolate" the control policy. Recall:

π[Q+] = π(s, a^(i)) = exp(Q+(x_s, x_a^(i)) / τ) / Σ_j exp(Q+(x_s, x_a^(j)) / τ)
Experience Association (1)
• Basic learning principle:
  – "If a decision led to a desired result in the past, then a similar decision should be replicated to cope with similar situations in the future"
• Again, use the kernel as a similarity measure
• Agent: "Is my current situation similar to a particular experience in the past?"
• Agent: "I see there are two highly related instances in memory where I did action #2, which led to a pretty decent result. OK, I'll try that again (or something similar)."
Experience Association (2): Policy Generalization

x = (x_s, x_a) ∈ S ⊗ A+

[Figure: example experience particles A, B, and C over the partitioned state space, each paired with one of the 12 pie-shaped parametric actions]

(x_s, x_a) ≈ (x_s + Δx_s, x_a + Δx_a), compared via the kernel k(x, x′)

• The similarity measure comes in handy
  – relate similar samples of experience
  – sample update
Reinforcement Field
• A reinforcement field is a vector field in a Hilbert space established by one or more kernels through their linear combination as a representation of the fitness function, where each kernel is centered around a particular augmented state vector:

Q+(·) = Σ_{i=1}^{n} α_i k(·, x_i)
Reinforcement Field: Example
[Figure: grid world with the 12 pie-shaped parametric actions]
• Objective: travel to the destination while circumventing obstacles
• The state space is partitioned into a set of local regions
• Gray areas are filled with obstacles
• A strong penalty is imposed when the agent runs into obstacle-filled areas
Reinforcement Field: Using Different Kernels

k(x, x′) = θ_0 exp(−½ (x − x′)ᵀ D⁻¹ (x − x′)) + θ_1 σ²_noise

k(x, x′) = θ_0 k_s(x_s, x′_s) k_a(x_a, x′_a) + θ_1 σ²_noise
Reinforcement Field Example
[Figure: the grid-world example with start state S, gray obstacle-filled regions, and the destination; same setup and objective as above]
Action Operator: Step-by-Step (1)
• At a given state s ∈ S, the agent chooses a (parametric) action according to the current policy π[Q+]
• The action operator resolves the random effect in the action parameters through a sampling process such that the (stochastic) action is reduced to a fixed action vector x_a
• The action vector resolved above is subsequently paired with the current state vector x_s to form an augmented state x = (x_s, x_a)
Action Operator: Step-by-Step (2)
• The new augmented state x is kernelized in terms of k(·, x) such that any state x implicitly maps to a function that expects another state as an argument; k(x′, x) evaluates to a high value provided that x and x′ are strongly correlated
• The value prediction for the new augmented state is given by the reproducing property (in the RKHS):

⟨Q+, k(·, x)⟩ = ⟨ Σ_{i=1}^{m} α_i k(·, x_i), k(·, x) ⟩ = Σ_{i=1}^{m} α_i k(x, x_i) = Q+(x)
Next ...
• The entire reinforcement field is generated by a set of training samples that are aligned with the global objective: maximizing payoff
• "Can we learn decision concepts out of these training samples by treating them as structured functional data?"

Ω : {(x_1, q_1), ..., (x_m, q_m)}
A Few Observations
• Consider Ω : {(x_1, q_1), ..., (x_m, q_m)}
• Properties of GPR
  – Correlated inputs have correlated signals
  – Can we assemble correlated samples together to form clusters of "similar decisions"?
  – Functional clustering
• Cluster criteria
  – Similarity takes into account both the input patterns and their corresponding signals
Example: Task Assignment Domain (1)
• Given a large set of computational resources and a continuous stream of user tasks, find the optimal task-assignment policy
  – Simplified control policy with actions: dispatch job or do not dispatch job (→ actions as decision choices)
    • Not practical
    • Users' concern: under what conditions can we optimize a given performance metric (e.g. minimizing turnaround time)?
  – Characterize each candidate server in terms of its resource capacity (e.g. CPU percentage time, disk space, memory, bandwidth, owner-imposed usage criteria, etc.)
    • Actions: dispatching the current task to machine X
    • Problems?
Example: Task Assignment Domain (2)
• A general resource-sharing environment could have a large number of distributed resources (e.g. Grid networks, volunteer computing, cloud computing, etc.)
  – 1000 machines → 1000 decision points per state
• Treating user tasks and machines collectively as a multi-agent system?
  – Combinatorial state/action space
  – Large number of agents
Task Assignment, Match Making: Other Similar Examples
• Recommender systems: selecting {movies, music, ...} catering to the interests/needs of (potential) customers; matching users with their favorite items (content-based)
  – Online advertising campaigns trying to match products and services with the right target audience
• NLP applications: question-answering systems
  – Matching questions with relevant documents
Generalizing RL with Functional Clustering
• Abstract action representation
  – (Functional) pattern discovery from within the generalized state-action pairs (as localized policies)
  – Functional clustering by relating correlated inputs w.r.t. their correlated functional responses
    • Inputs: generalized state-action pairs
    • Functional responses: utilities
  – Covariance matrix from GPR → fully-connected similarity graph → graph Laplacian
  – Spectral clustering → a set of abstractions over (functionally) similar localized policies
  – Control policy over these abstractions used as control actions
    • Reduces decision points per state
    • Reveals interesting correlations between state features and action parameters, e.g. match-making criteria
Policy Generalization

x = (x_s, x_a) ∈ S ⊗ A+

(x_s, x_a) ≈ (x_s + Δx_s, x_a + Δx_a), compared via the kernel k(x, x′)

"Policy as a functional over Q+"
"→ Experience Association"
Experience Association (1)
• Basic learning principle:
  – "If a decision led to a desired result in the past, then a similar decision can be re-applied to cope with similar situations in the future"
• Again, use the kernel as a similarity measure
• Agent: "Is my current situation similar to a particular scenario in the past?"
• Agent: "I see there are two instances of similar memory where I did action #2, and that action led to a decent result. OK, I'll try that again (or something similar)." "And if not, let me avoid repeating the mistake."
• Hypothetical state s_h⁺
  – Definition
  – Agent: "I look at a past experience, replicate the action of that experience, and apply it to my current situation; is the result going to be similar?"
*Experience Association (2)
• Step 1: form a hypothetical state ("If I were to be ...")
  – I am at a state s′, with state vector x_s′
  – I pick a (relevant) particle ω^(i) ∈ Ω^(i) ← A^(i) from a set (later used in an abstraction); the particle is s⁺ : (x_s, x_a)
  – Replicate the action and apply it to my own state: s_h⁺ = (x_s′, x_a)
• Step 2: compare ("... is the result going to be similar?")
  – Compare the result using the kernel (the state correlation hypothesis):
    k(s_h⁺, s⁺) = k((x_s′, x_a), (x_s, x_a))
  – If k(s_h⁺, s⁺) ≥ τ, then this sample is correlated to my state and the state of the target sample is in context; otherwise, it is out of context (see the sketch below)
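A hedged sketch of the two steps above: form a hypothetical state s_h⁺ by pairing my state vector x_s′ with a remembered particle's action x_a, then test k(s_h⁺, s⁺) ≥ τ to decide whether that particle is "in context". The squared-exponential kernel, the threshold, and the toy particle are illustrative assumptions.

import numpy as np

def se_kernel(a, b, ell=1.0):
    return np.exp(-0.5 * np.sum((a - b) ** 2) / ell ** 2)

def in_context(x_s_mine, particle, tau=0.8):
    x_s, x_a = particle                          # remembered augmented state (x_s, x_a)
    s_plus = np.concatenate([x_s, x_a])
    s_h_plus = np.concatenate([x_s_mine, x_a])   # replicate the action on my own state
    return se_kernel(s_h_plus, s_plus) >= tau

particle = (np.array([0.0, 0.0]), np.array([0.5, 0.1]))
print(in_context(np.array([0.1, -0.1]), particle))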
*Experience Association (3)
• Normalize the kernel so that it assumes a probability-like semantics
  – This is a generalization of the concept of probability amplitude in QM
• Nadaraya-Watson model (Section 5.7)

k̃(x, x′) = k(x, x′) / √( k(x, x) k(x′, x′) ) = ⟨Φ̃(x), Φ̃(x′)⟩,  Φ̃ = Φ / ‖Φ‖
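A minimal sketch of the normalization above, which scales similarity values into [0, 1] (cosine normalization in the feature space induced by k). Since a squared-exponential kernel already has k(x, x) = 1, the demo below uses a polynomial kernel as an assumption so the normalization has a visible effect.

import numpy as np

def normalized_kernel(k, x, x_prime):
    return k(x, x_prime) / np.sqrt(k(x, x) * k(x_prime, x_prime))

poly = lambda a, b: (a @ b + 1.0) ** 2          # illustrative polynomial kernel
print(normalized_kernel(poly, np.array([0.0, 1.0]), np.array([0.2, 0.9])))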
Concept-Driven Learning Architecture (CDLA)
• The agent derives the control policy only at the level of abstract actions
  – Further reduces decision points per state
  – Finds interesting patterns across state features and action parameters, e.g. match-making criteria
• We have the necessary representation to form (functional) clusters
  – Kernel as a similarity measure over augmented states
  – Covariance matrix K from GPR
• Each abstract action is represented through a set of experience particles
CDLA Conceptual Hierarchy
• Graph representation from K, where K_ij = k(x_i, x_j)
• A set of unstructured experience particles
• Partitioned graph with two abstract actions
*Spectral Clustering: Big Picture
• Construct the similarity graph (← GPR)
• Graph Laplacian (GL)
• Graph cut as the objective function (e.g. normalized cut)
• Optimize the graph-cut criterion
  – Minimizing the normalized cut → partitions as tight as possible
    ← maximal in-cluster links (weights) and minimal between-cluster links
  – NP-hard → spectral relaxation
    • Use eigenvectors of the GL as a continuous relaxation of the cluster indicator vectors
• Evaluate final clusters using a selected instance-based clustering algorithm (k-means++); see the sketch below
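A hedged sketch of the pipeline above: build an affinity graph (here from a kernel matrix over augmented states), form a normalized graph Laplacian, take the eigenvectors with the smallest eigenvalues as relaxed cluster indicators, and cluster in that eigenspace. The symmetric normalized Laplacian and the tiny k-means below are illustrative stand-ins; a library k-means++ (e.g. scikit-learn) would normally replace the toy loop.

import numpy as np
from scipy.linalg import eigh

def spectral_clustering(W, k, seed=0):
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L_sym = np.eye(len(W)) - (W * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]
    eigvals, eigvecs = eigh(L_sym)                 # eigenvalues in ascending order
    U = eigvecs[:, :k]                             # k smallest: relaxed cluster indicators
    U = U / (np.linalg.norm(U, axis=1, keepdims=True) + 1e-12)
    # toy k-means in the eigenspace
    rng = np.random.default_rng(seed)
    centers = U[rng.choice(len(U), k, replace=False)]
    for _ in range(50):
        labels = np.argmin(((U[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        centers = np.array([U[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels

# Affinity from a kernel over augmented states (illustrative data):
X = np.random.rand(40, 4)
W = np.exp(-0.5 * np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))
labels = spectral_clustering(W, k=3)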
*Spectral Clustering: Definitions

Pairwise affinity:    w_nm = k(x_n, x_m)
Degree:               D_n = Σ_{m=1}^{N} w_nm
Volume of a set:      Vol(C) = Σ_{n∈C} D_n
Cut between 2 sets:   Cut(C_1, C_2) = Σ_{n∈C_1} Σ_{m∈C_2} w_nm
*Spectral Clustering: Graph Cut
• Graph cut
  – Naïve cut:
    Cut(A^(1), ..., A^(k)) = ½ Σ_{i=1}^{k} W(A^(i), V \ A^(i))
  – (K-way) ratio cut:
    RatioCut(A^(1), ..., A^(k)) = ½ Σ_{i=1}^{k} W(A^(i), V \ A^(i)) / |A^(i)|
  – (K-way) normalized cut:
    NCut(A^(1), ..., A^(k)) = ½ Σ_{i=1}^{k} W(A^(i), V \ A^(i)) / vol(A^(i))
  – NP-hard!
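Below is a small, illustrative sketch computing the graph-cut quantities defined above for a given partition (a list of node-index arrays) over a pairwise affinity matrix W; the random symmetric W and the example partition are assumptions.

import numpy as np

def cut(W, A, B):
    return W[np.ix_(A, B)].sum()

def vol(W, A):
    return W[A].sum()                              # sum of degrees of nodes in A

def ncut(W, partition):
    V = np.arange(len(W))
    total = 0.0
    for A in partition:
        complement = np.setdiff1d(V, A)
        total += cut(W, A, complement) / vol(W, A)
    return 0.5 * total

W = np.random.rand(6, 6)
W = (W + W.T) / 2                                  # symmetric affinity matrix
print(ncut(W, [np.array([0, 1, 2]), np.array([3, 4, 5])]))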
*Spectral Clustering: Approximation
• Approximation
  – Given a set of data to cluster
  – Form the affinity matrix W
  – Form a graph Laplacian L = f(W)
  – Find the leading k eigenvectors: Lv = λv
  – Cluster the data set in the eigenspace
  – Project back to the original data
• Major differences among algorithms: the choice of L = f(W)
*Random Walk Graph Laplacian (1)
• Definition:
  L_rw = I − D⁻¹K = I − P,   equivalently L_rw = D⁻¹(D − K)
• First-order Markov transition matrix P
  – Each entry is the probability of transitioning from node i to node j in a single step:
    P_ij = K_ij / d_i = k(x_i, x_j) / Σ_{j=1}^{m} K_ij,   where d_i = Σ_{j=1}^{m} K_ij = Σ_{j=1}^{m} k(x_i, x_j)
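A minimal sketch of the definitions above: P = D⁻¹K is the one-step transition matrix of a random walk over the affinity graph, and L_rw = I − P is the random-walk graph Laplacian. The small affinity matrix K is an illustrative assumption.

import numpy as np

def random_walk_laplacian(K):
    d = K.sum(axis=1)                  # degree d_i = sum_j k(x_i, x_j)
    P = K / d[:, None]                 # P_ij = K_ij / d_i
    L_rw = np.eye(len(K)) - P          # L_rw = I - D^{-1} K
    return L_rw, P

K = np.array([[1.0, 0.8, 0.1],
              [0.8, 1.0, 0.2],
              [0.1, 0.2, 1.0]])
L_rw, P = random_walk_laplacian(K)
print(P.sum(axis=1))                   # each row of P sums to 1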
*Random Walk Graph Laplacian (2)
• K-way normalized cut: find a partitioning such that the probability of transitioning across clusters is minimized
  – When used with the random-walk GL, this corresponds to minimizing the probability of state transitions between clusters
• In CDLA
  – Each abstract action corresponds to a coherent decision concept
  – Why? By taking any action in any of the states associated with the same concept, there is a minimal chance of transitioning to the states associated with other concepts

NCut(C_1, ..., C_k) = Σ_{i=1}^{k} ( 1 − P(C_i → C_i | C_i) )
                    = Σ_{i=1}^{k} ( 1 − (Σ_{n∈C_i} Σ_{m∈C_i} w_nm) / (Σ_{n∈C_i} Σ_{m=1}^{N} w_nm) )
Functional Clustering using GPSC
• GPR + SC → GP spectral clustering (GPSC)
  – Kernel as the correlation hypothesis
  – The same hypothesis used as the similarity measure for SC
  – Correlated inputs share approximately identical outputs: similar augmented states ~ close fitness values or utilities
• Warning: the reverse is NOT true, and this is why we need multiple concepts that may share similar output signals
  – E.g. match-making job and machine requirements
CDLA: Context Matching
• Each abstract action implicitly defines an action-selection strategy
  – In context: at a given state, find the most correlated state pair with its action, followed by applying that action
  – Out of context: random action selection
    • Applicable to 1) an infant agent, 2) an empty cluster, 3) referenced particles that don't match (by experience association)
• Caveat:
  – The utility (fitness value) for a random action selection does not correspond to the true estimate of resolving the abstract action:
    Q+(s, A^(i)) ≠ Q+(s, a^(i)) !
  – Need another way to adjust the utility estimate for its fitness value
Empirical Study: Task Assignment Domain

Parameter                    | Spec. values
-----------------------------|-------------------------------------------------------------------------
Task feature set (state)     | Task type, size, expected runtime
Server feature set (action)  | Service type, percentage CPU time, memory, disk space, CPU speed, job slots
Kernel k                     | SE kernel + noise (see (5.3))
Num. of abstract actions     | 10 (assumed to be known)
Model update cycle T         | 10 state transitions towards 100

Goal: find for each incoming user task the best candidate server(s) that are mutually agreeable in terms of matching criteria.
Empirical Study: Learned Concepts

Concept | Task type | Size | Expected runtime | Service type | %CPU time | Fitness value
--------|-----------|------|------------------|--------------|-----------|---------------
1       | 1         | 1.1  | 0.93             | 1            | 9.784     | 120.41
2       | 2         | 2.5  | 1.98             | 2            | 10.235    | 128.13
3       | 3         | 3.2  | 2.92             | 3            | 15.29     | 135.23
4       | 1         | 1.0  | 1.02             | 2            | 20.36     | -50.05
5       | 2         | 2.0  | 2.09             | 3            | 0.58      | -47.28

• Illustration of 5 different learned decision concepts.
• The top 3 rows indicate successful matches while the bottom 2 rows indicate failed matches.
Empirical Study: Comparison
[Figure: reward per episode over 100 episodes for G-SARSA, Condor, and Random]
• Performance comparison among (1) stripped-down Condor match-making (black), (2) G-SARSA (blue), and (3) Random (red)
Sample Promotion (1)

k(s_h⁺, s) ≥ τ

Two possibilities:
(1) the target is indeed correlated to a given abstract action (AA)
(2) the target is NOT correlated to ANY abstract action → trial and error + sample promotion
Sample Promotion (2)
• Recall: each abstract action implicitly defines an action-selection strategy
• Key: the functional data set must be as accurate as possible in terms of its predictive strength
• Match a new experience particle against the memory, find the most relevant piece, and use its value estimation
• The out-of-context case leads to randomized action selection
• How does the agent still manage to gain experience in this case?
• Randomized action selection + sample promotion
  – Case 1: the random action does get a positive result
    • Match the result per abstract action by sampling, using the experience-association operation: (s, a^(i)) → (x_s, x_a^(i))
    • If indeed correlated to some experience, then update the fitness value; otherwise → case 2
  – Case 2: discard the sample because there are no points of reference; the sample is not useful
(Excerpt, p. 191)

... reinforcement learning framework, CDLA, by integrating the reinforcement field and decision concept formation into a coherent learning process represented by the center box in Figure 8.1. In particular, the novel learning architecture addresses the core issue of standard reinforcement learning algorithms from the perspective of parametric action formulation, extended state space representation and sample-driven policy representation. All the components in CDLA are represented directly or indirectly through the kernel function, which serves as the fundamental building block for value prediction, concept formation as well as policy search. In Figure 8.1, kernelized components are marked by a circled k in yellow whereas indirectly-kernelized ...

Figure 8.1: CDLA schematic. Periodic iterations between value function approximation and concept-driven cluster formation constitute the heartbeat of CDLA. The conceptual sequence from 1a to 1e corresponds to the logical flow that progressively updates the reinforcement field as a policy generalization mechanism. The sequence from 2a to 2d, on the other hand, references the logical flow toward establishing an abstract control policy based on a set of coherent decision concepts as high-level control decisions.
GRL Schematic

Demo of 4 Decision Concepts A(1) ~ A(4)
[Figure: timeline demo of an agent acting under four learned decision concepts A(1)-A(4) with labels Collect, Avoid, and Clean; events include "Discovery of plant life" (+), "Trash removed and organized" (+), "Fallen into deep water" (−), and "Trapped in high hills" (−). Concept polarity: {A(2), A(4)} positive, {A(1), A(3)} negative.]
Future Work
More on ...
• Using spectral clustering to cluster functional data
• Experience association
• Context matching
• Evolving the samples
  – Sample promotion
  – Probabilistic model for experience associations
• Evolving the clusters
  – Need to adjust value estimates for abstract actions as new samples join in
  – Morphing clusters