ECE 517: Reinforcement Learning in Artificial Intelligence
Lecture 17: TRTRL, Implementation Considerations, Apprenticeship Learning
Dr. Itamar Arel
College of Engineering
Department of Electrical Engineering and Computer Science
The University of Tennessee
Fall 2010
November 3, 2010
ECE 517: Reinforcement Learning in AI
Outline
Recap on RNNs
Implementation and usage issues with RTRL
Computational complexity and resources required
Vanishing gradient problem
Apprenticeship learning
Recap on RNNs
RNNs are potentially much stronger than FFNNs
Can capture temporal dependencies
Embed complex state representation (i.e., memory)
Models of discrete-time dynamic systems
They are (very) complex to train
TDNN – limited performance based on window
RTRL – calculates a dynamic gradient on-line
RTRL reviewed
RTRL is a gradient-descent-based method
It relies on sensitivities p_{ij}^k expressing the impact of any weight w_{ij} on the activation of neuron k
The algorithm then consists of computing weight changes
Let's look at the resources involved …
p_{ij}^k(t+1) = f'\big(net_k(t)\big)\Big[\sum_{l \in U} w_{kl}\, p_{ij}^l(t) + \delta_{ik}\, z_j(t)\Big]

\Delta w_{ij}(t) = \alpha \sum_{k \in U} e_k(t)\, p_{ij}^k(t)
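As a concrete illustration, the two update rules above can be sketched in NumPy. This is a hypothetical minimal implementation, not the lecture's code: the tanh nonlinearity, the learning rate, and all variable names are assumptions.

```python
import numpy as np

def rtrl_step(W, p, z, net, e, lr=0.1):
    """One RTRL update (illustrative sketch, not the lecture's code).

    W   : (N, N+M) weights for N units and M external inputs
    p   : (N, N, N+M) sensitivities, p[k, i, j] = d y_k / d w_ij
    z   : (N+M,) concatenated unit outputs and external inputs
    net : (N,) pre-activations at time t
    e   : (N,) errors (zero where no target exists)
    """
    N = W.shape[0]
    fprime = 1.0 - np.tanh(net) ** 2              # assume f = tanh

    # p_ij^k(t+1) = f'(net_k) [ sum_l w_kl p_ij^l + delta_ik z_j ]
    p_new = np.einsum('kl,lij->kij', W[:, :N], p)
    for i in range(N):
        p_new[i, i, :] += z                       # the delta_ik z_j term
    p_new *= fprime[:, None, None]

    # delta w_ij = lr * sum_k e_k p_ij^k
    dW = lr * np.einsum('k,kij->ij', e, p_new)
    return p_new, W + dW
```

Note that the full (N, N, N+M) sensitivity tensor is carried between steps — exactly the storage cost examined on the next slides.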
Implementing RTRL – computations involved
The key component in RTRL is the sensitivities matrix
Must be calculated for each neuron
RTRL, however, is NOT local …
Can the calculations be efficiently distributed?
p_{ij}^k(t+1) = f'\big(net_k(t)\big)\Big[\sum_{l \in U} w_{kl}\, p_{ij}^l(t) + \delta_{ik}\, z_j(t)\Big]
N^3 sensitivity values, each updated in O(N) time → O(N^4) operations per time step
Implementing RTRL – storage requirements
Let's assume a fully-connected network of N neurons
Memory resources:
Weights matrix, w_{ij} — N^2
Activations, y_k — N
Sensitivity matrix — N^3
Total memory requirements: O(N^3)
Let's go over an example:
Assume we have 1000 neurons in the system
Each value requires 20 bits to represent
→ ~20 Gb of storage!
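The 20 Gb figure follows directly from the dominant O(N^3) term; a quick sanity check with the slide's numbers:

```python
# Back-of-envelope check of the slide's example: N = 1000 neurons,
# 20 bits per stored value; the O(N^3) sensitivity matrix dominates.
N = 1000
bits_per_value = 20
sensitivity_entries = N ** 3                 # one p_ij^k per (k, i, j)
total_bits = sensitivity_entries * bits_per_value
print(total_bits / 1e9)                      # -> 20.0 (gigabits)
```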
Possible solutions – static subgrouping
Zipser et al. (1989) suggested static grouping of neurons
Relaxing the "fully-connected" requirement
Has backing in neuroscience
Average "branching factor" in the brain ~ 1000
Reduces the complexity by simply leaving out elements of the sensitivity matrix based upon subgrouping of neurons
Neurons are subgrouped arbitrarily
Sensitivities between groups are ignored
All connections still exist in the forward path
If g is the number of subgroups, then …
Storage is O(N^3/g^2)
Computational speedup is g^3
Communications: each node communicates with N/g nodes
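The scaling claims above can be tallied in a few lines (a sketch; `subgrouped_costs` is a hypothetical helper, not from the lecture):

```python
# Resource counts under Zipser-style static subgrouping with g groups.
def subgrouped_costs(N, g):
    storage = N ** 3 // g ** 2   # g groups, each storing (N/g)^3 sensitivities
    speedup = g ** 3             # O(N^4) work becomes O(N^4 / g^3)
    fanout = N // g              # nodes each node communicates with
    return storage, speedup, fanout

print(subgrouped_costs(1000, 10))            # -> (10000000, 1000, 100)
```

With g = 10 the earlier 10^9-entry sensitivity matrix shrinks by two orders of magnitude, at the cost of ignoring all cross-group gradient terms.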
Possible solutions – static subgrouping (cont.)
Zipser's empirical tests indicate that these networks can solve many of the problems full RTRL solves
One caveat of subgrouped RTRL training is that each subnet must have at least one unit for which a target exists (since gradient information is not exchanged between groups)
Others have proposed dynamic subgrouping
Subgrouping based on maximal gradient information
Not realistic for hardware realization
Open research question: how to calculate the gradient without the O(N^3) storage requirement?
Truncated Real Time Recurrent Learning (TRTRL)
Motivation: to obtain a scalable version of the RTRL algorithm while minimizing performance degradation
How? Limit the sensitivities of each neuron to its ingress (incoming) and egress (outgoing) links
Performing Sensitivity Calculations in TRTRL
For all nodes that are not in the output set, the egress sensitivity values for node i are calculated by imposing k = j in the original RTRL sensitivity equation, such that

p_{ij}^j(t+1) = f'\big(s_j(t)\big)\Big[w_{ji}\, p_{ij}^i(t) + \delta_{ij}\, z_j(t)\Big]

Similarly, the ingress sensitivity values for node j are given by

p_{ij}^i(t+1) = f'\big(s_i(t)\big)\Big[w_{ii}\, p_{ij}^i(t) + z_j(t)\Big]

For output neurons, a nonzero sensitivity element must exist in order to update the weights:

p_{ij}^k(t+1) = f'\big(s_k(t)\big)\Big[w_{ki}\, p_{ij}^i(t) + w_{kj}\, p_{ij}^j(t) + \delta_{ik}\, z_j(t)\Big]
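A sketch of how little state the truncated update needs per weight, following the reconstructed equations above (the function name, tanh activation, and indexing convention are all assumptions, not the lecture's code):

```python
import numpy as np

def trtrl_sensitivities(W, s, z, p_in, p_eg, i, j):
    """Update the two stored sensitivities for weight w_ij (sketch).

    p_in = p_ij^i (ingress) and p_eg = p_ij^j (egress) are plain scalars;
    TRTRL keeps only these instead of a full column of N values.
    Assumes f = tanh; s holds pre-activations, z unit outputs/inputs.
    """
    fp = lambda k: 1.0 - np.tanh(s[k]) ** 2
    # ingress: impose k = i, keep only the self-recurrent term of the sum
    p_in_new = fp(i) * (W[i, i] * p_in + z[j])
    # egress: impose k = j, keep only the term fed by node i
    p_eg_new = fp(j) * (W[j, i] * p_in + (z[j] if i == j else 0.0))
    return p_in_new, p_eg_new
```

All information each neuron needs is local to its own incoming and outgoing links, which is what makes the algorithm distributable.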
Resource Requirements of TRTRL
The network structure remains the same with TRTRL; only the calculation of sensitivities is reduced
Significant reduction in resource requirements …
Computational load for each neuron drops from O(N^3) to O(2KN), where K denotes the number of output neurons
Total computational complexity is now O(2KN^2)
Storage requirements drop from O(N^3) to O(N^2)
Example revisited: for N = 100 and 10 outputs → ~100k multiplications and only ~20 kB of storage!
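Re-running the arithmetic for this example (the exact constants below are my computation; the slide rounds to ~100k multiplications and ~20 kB):

```python
# N = 100 neurons, K = 10 output neurons, 20 bits per stored value
# (same per-value size as the earlier O(N^3) storage example).
N, K = 100, 10
multiplies = 2 * K * N ** 2        # O(2KN^2) total computational load
stored_bytes = N ** 2 * 20 // 8    # O(N^2) sensitivities at 20 bits each
print(multiplies, stored_bytes)    # -> 200000 25000
```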
Further TRTRL Improvements – Clustering of Neurons
TRTRL introduced localization and memory improvement
Clustered TRTRL adds scalability by reducing the number of long connection lines between processing elements
[Figure: clustered TRTRL network, input layer to output layer]
Test case #1: Frequency Doubler
Input: sin(x), target output: sin(2x)
Both networks had 12 neurons
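A toy dataset for this task is easy to generate (a sketch; the sample count and range are arbitrary choices, not the lecture's setup):

```python
import numpy as np

# Frequency-doubler training pairs: input sin(x), target sin(2x).
x = np.linspace(0.0, 4.0 * np.pi, 200)
inputs, targets = np.sin(x), np.sin(2.0 * x)
assert inputs.shape == targets.shape == (200,)
```

Because the target at time t depends on the phase history of the input, a memoryless feedforward map cannot solve it — the recurrent state must carry temporal context.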
Vanishing Gradient Problem
Recap on goals: find temporal dependencies in data with an RNN
The idea behind RTRL: when an error value is found, apply it to inputs seen an indefinite number of epochs ago
In 1994, Bengio et al. showed that both BPTT and RTRL suffer from the problem of vanishing gradient information
When using gradient-based training rules, the "error signal" that is applied to previous inputs tends to vanish
Because of this, long-term dependencies in the data are often overlooked
Short-term memory is OK; long-term (>10 epochs) is lost
Vanishing Gradient Problem (cont.)
A learning error yields gradients on outputs, and therefore on the state variables s_t
Since the weights (parameters) are shared across time:
[Figure: RNN with input x_t, state s_t, output y_t, shared weights W]

y_t = f(s_{t-1}, x_t)

\frac{\partial E_t}{\partial W} = \sum_{\tau \le t} \frac{\partial E_t}{\partial s_t}\,\frac{\partial s_t}{\partial s_\tau}\,\frac{\partial s_\tau}{\partial W}

\frac{\partial s_t}{\partial s_1} = \frac{\partial s_t}{\partial s_{t-1}}\cdots\frac{\partial s_2}{\partial s_1} = f'(s_{t-1})\,f'(s_{t-2})\cdots f'(s_1) \to 0 \text{ for large } t
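The geometric collapse of that Jacobian product is easy to see numerically; a sketch with tanh and random pre-activations (all values made up):

```python
import numpy as np

# Each factor f'(s_tau) = 1 - tanh(s_tau)^2 lies in (0, 1], so the
# product ds_t/ds_1 shrinks geometrically with the horizon length.
rng = np.random.default_rng(0)
s = rng.normal(size=100)            # 100 time steps of toy pre-activations
factors = 1.0 - np.tanh(s) ** 2     # f'(s_tau) at each step
grad = float(np.prod(factors))
print(grad < 1e-6)                  # -> True: long-range signal has vanished
```

This is the quantitative reason the "error signal" applied to inputs many steps back effectively disappears.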
What is Apprenticeship Learning?
Many times we want to train an agent based on a reference controller
Riding a bicycle
Flying a plane
Starting from scratch may take a very long time
Particularly for large state/action spaces
May cost a lot (e.g. helicopter crashing)
Process:
Train agent on reference controller
Evaluate trained agent
Improve trained agent
Note: reference controller can be anything (e.g. heuristic controller for Car Race problem)
Formalizing Apprenticeship Learning
Let's assume we have a reference policy π from which we want our agent to learn
We would first like to learn the (approximate) value function, V^π
Once we have V^π, we can try to improve it based on the policy improvement theorem, i.e.
By following the original policy greedily we obtain a better policy!
In practice, many issues should be considered, such as state-space coverage and exploration/exploitation
Train on zero exploration, then explore gradually …
\pi'(s) = \arg\max_a Q^{\pi}(s, a)

V^{\pi'}(s) \ge V^{\pi}(s)
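The greedy-improvement step can be sketched on a made-up Q-table (the four states and two actions below are illustrative, not from the lecture):

```python
import numpy as np

# pi'(s) = argmax_a Q^pi(s, a): act greedily with respect to the value
# estimates learned while following the reference policy.
Q = np.array([[0.1, 0.5],    # state 0
              [0.9, 0.2],    # state 1
              [0.3, 0.3],    # state 2 (tie -> argmax picks action 0)
              [0.0, 0.7]])   # state 3
pi_improved = Q.argmax(axis=1)
print(pi_improved)           # -> [1 0 0 1]
```

By the policy improvement theorem, this greedy policy is guaranteed to be at least as good as the reference policy at every state.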