Transfer Learning Via Advice Taking
Jude Shavlik, University of Wisconsin-Madison
Acknowledgements

Lisa Torrey, Trevor Walker, & Rich Maclin
DARPA IPTO Grant HR0011-04-1-0007
NRL Grant N00173-06-1-G002
DARPA IPTO Grant FA8650-06-C-7606
What Would You Like to Say to This Penguin?

IF   a Bee is (Near and West) &
     an Ice is (Near and North)
THEN BEGIN
       Move East
       Move North
     END
Empirical Results

[Chart: reinforcement on the test set vs. number of training episodes (0 to 4000), comparing learning with advice against learning without advice.]
Our Approach to Transfer Learning

Source Task → [Extraction] → Extracted Knowledge → [Mapping] → Transferred Knowledge → [Refinement] → Target Task
Potential Benefits of TransferPotential Benefits of Transfer
perf
orm
ance
training
with transferwithout transfer
higher start
steeper slope higher asymptote
Outline

Reinforcement Learning w/ Advice
Transfer via Rule Extraction & Advice Taking
Transfer via Macros
Transfer via Markov Logic Networks (time permitting)
Wrap Up
Reinforcement Learning (RL) Overview

Agent-environment loop: sense state → choose action → receive reward

Policy: choose the action with the highest Q-value in the current state
Use the rewards to estimate the Q-values of actions in states
Each state is described by a set of features
The RoboCup Domain

RoboCup subtasks: KeepAway, BreakAway, MoveDownfield
(variant of Stone & Sutton, ICML 2001)
Q Learning (Watkins PhD, 1989)

Q function: (state, action) → value

policy(state) = argmax_action Q(state, action)

For large state spaces, need function approximation
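For readers who want the textbook mechanics, here is a minimal tabular Q-learning sketch in Python; the env interface (reset, step, actions) is an assumption, and the talk's actual learner uses the support-vector regression described next rather than a table:

    import random
    from collections import defaultdict

    def q_learning(env, episodes, alpha=0.1, gamma=0.9, epsilon=0.1):
        """Tabular Watkins-style Q-learning from observed rewards."""
        Q = defaultdict(float)                     # Q[(state, action)], defaults to 0
        for _ in range(episodes):
            state, done = env.reset(), False
            while not done:
                if random.random() < epsilon:      # explore occasionally
                    action = random.choice(env.actions)
                else:                              # policy(state) = argmax_action Q
                    action = max(env.actions, key=lambda a: Q[(state, a)])
                next_state, reward, done = env.step(action)
                # Update Q toward reward plus discounted best next value
                best_next = max(Q[(next_state, a)] for a in env.actions)
                Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
                state = next_state
        return Q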
Learning the Q Function

A standard approach: linear support-vector regression

Q-value = weight vectorᵀ ● feature vector

e.g.  (0.2, -0.1, 0.9, …)ᵀ ● (distance(me, teammate1), distance(me, opponent1), angle(opponent1, me, teammate1), …)

Set weights to minimize
  Model size + C × Data misfit
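In symbols, one linear-programming form of this trade-off, where A holds one feature vector per row and y the training Q estimates (the 1-norm choices here are an assumption in the spirit of Fung, Mangasarian & Shavlik; the slide gives only the general shape):

    \min_{w,b}\; \|w\|_1 \;+\; C\,\|Aw + b\mathbf{1} - y\|_1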
Advice in RL

Advice provides constraints on Q values under specified conditions:

  IF   an opponent is near me
  AND  a teammate is open
  THEN Q(pass(teammate)) > Q(move(ahead))

Apply as soft constraints in optimization:

  Model size + C × Data misfit + μ × Advice misfit
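Schematically, the earlier objective gains a third term, with nonnegative slack variables z measuring how far the learned function strays from each advice constraint (again a sketch, not the exact published formulation):

    \min_{w,b,\,z \ge 0}\; \|w\|_1 \;+\; C\,\|Aw + b\mathbf{1} - y\|_1 \;+\; \mu\,\mathbf{1}^{\top} z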
Aside: Generalizing the Idea of a Training Example for Support Vector Machines (SVMs)

Can extend the SVM linear program to handle "regions as training examples"

Fung, Mangasarian, & Shavlik: NIPS 2003, COLT 2004
Specifying Advice for Support Vector Regression

If input (x) is in the region specified by B and d:
    B x ≤ d
then output (y) should be above some line (h′x + β):
    y ≥ h′x + β
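Why this is implementable: by Farkas' lemma, for a nonempty advice region the implication above is equivalent to a finite set of linear constraints in new nonnegative variables u; this is the knowledge-based support-vector construction of Fung et al., and in the soft version slack variables relax the two conditions:

    Bx \le d \;\Rightarrow\; w^{\top}x + b \ge h^{\top}x + \beta
    \quad\Longleftrightarrow\quad
    \exists\, u \ge 0:\;\; B^{\top}u + w - h = 0, \quad d^{\top}u + \beta - b \le 0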
Sample Advice

Advice format:  If B x ≤ d  Then  f(x) ≥ h x + β

If   distanceToGoal ≤ 10
and  shotAngle ≥ 30
Then Q(shoot) ≥ 0.9

With x = (distanceToTeammate, shotAngle, distanceToGoal)ᵀ, this becomes

  B = ( 0  -1   0 )    d = ( -30 )    h = 0,  β = 0.9
      ( 0   0   1 )        (  10 )
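A quick numpy check of this encoding (the feature order is the one assumed above):

    import numpy as np

    # x = (distanceToTeammate, shotAngle, distanceToGoal)
    B = np.array([[0.0, -1.0, 0.0],   # -shotAngle <= -30  (i.e., shotAngle >= 30)
                  [0.0,  0.0, 1.0]])  # distanceToGoal <= 10
    d = np.array([-30.0, 10.0])

    x = np.array([15.0, 40.0, 8.0])   # a hypothetical state
    if np.all(B @ x <= d):            # the advice region holds here
        print("advice active: require Q(shoot) >= 0.9")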
Sample Advice-Taking Results

if distanceToGoal ≤ 10
and shotAngle ≥ 30
then prefer shoot over all other actions
(i.e., Q(shoot) > Q(pass) and Q(shoot) > Q(move))

[Chart: Prob(Score Goal) vs. games played (0 to 25,000), advice vs. std RL; 2-on-1 BreakAway, rewards +1, -1.]
Outline

Reinforcement Learning w/ Advice
Transfer via Rule Extraction & Advice Taking
Transfer via Macros
Transfer via Markov Logic Networks
Wrap Up
Close-Transfer Scenarios
2-on-1 BreakAway
3-on-2 BreakAway
4-on-3 BreakAway
Distant-Transfer Scenarios
3-on-2 BreakAway
3-on-2 KeepAway
3-on-2 MoveDownfield
Our First Transfer-Learning Approach:
Exploit the fact that models and advice are in the same language
Source Q functions:
  Qx = wx1 f1 + wx2 f2 + bx
  Qy = wy1 f1 + by
  Qz = wz2 f2 + bz

Mapped Q functions:
  Q′x = wx1 f′1 + wx2 f′2 + bx
  Q′y = wy1 f′1 + by
  Q′z = wz2 f′2 + bz

Advice:
  if Q′x > Q′y and Q′x > Q′z then prefer x′

Advice (expanded):
  if  wx1 f′1 + wx2 f′2 + bx > wy1 f′1 + by
  and wx1 f′1 + wx2 f′2 + bx > wz2 f′2 + bz
  then prefer x′ to y′ and z′
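Mechanically, the expanded advice is just inequality construction from the mapped weights. A throwaway sketch (the dict encoding of Q functions and the numeric weights are assumptions for illustration):

    def advice_from_source(q_funcs, preferred, others):
        """Expand 'prefer' advice from mapped Q functions.
        q_funcs: {action: (bias, {feature: weight})} -- encoding assumed."""
        def expr(action):
            bias, weights = q_funcs[action]
            return " + ".join([f"{w}*{feat}" for feat, w in weights.items()] + [str(bias)])
        tests = [f"{expr(preferred)} > {expr(a)}" for a in others]
        return "if " + " and ".join(tests) + f" then prefer {preferred}"

    # Mapped Q functions from the slide, with made-up numeric weights:
    q = {"x'": (0.2, {"f1'": 1.4, "f2'": 0.5}),
         "y'": (0.1, {"f1'": 0.9}),
         "z'": (0.3, {"f2'": 1.1})}
    print(advice_from_source(q, "x'", ["y'", "z'"]))
    # -> if 1.4*f1' + 0.5*f2' + 0.2 > 0.9*f1' + 0.1 and ... then prefer x'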
User Advice in Skill Transfer

There may be new skills in the target that cannot be learned from the source.
We allow (human) users to add their own advice about these skills.

User advice for KeepAway to BreakAway:
  IF:  distance(me, GoalPart) < 10
  AND  angle(GoalPart, me, goalie) > 40
  THEN: prefer shoot(GoalPart)
Sample Human Interaction

"Use what you learned in KeepAway, and add in this new action SHOOT."

"Here is some advice about shooting …"

"Now go practice for awhile."
Policy Transfer to 3-on-2 BreakAway

[Chart: probability of goal vs. training games (0 to 3000), comparing Standard RL with policy transfer from 2-on-1 BreakAway, 3-on-2 MoveDownfield, and 3-on-2 KeepAway.]
Torrey, Walker, Shavlik & Maclin: ECML 2005
Our Second Approach: Use Inductive Logic Programming (ILP) on the SOURCE to extract advice

Given
  Positive and negative examples for each action
Do
  Learn first-order rules that describe most positive examples but few negative examples

Example facts:
  good_action(pass(t1), state1)
  good_action(pass(t2), state3)
  good_action(pass(t1), state2)
  good_action(pass(t2), state2)
  good_action(pass(t1), state3)

A learned rule:
  good_action(pass(Teammate), State) :-
      distance(me, Teammate, State) > 10,
      distance(Teammate, goal, State) < 15.
Searching for an ILP Clause (top-down search using A*)

P :- true
  P :- Q      P :- R      P :- S
    P :- R, Q      P :- R, S
      …
        P :- R, S, V, W, X
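For concreteness, a toy best-first version of this top-down search; covers(body, example) is an assumed oracle (e.g., a Prolog proof), and the simple coverage score stands in for the heuristics a real A* ILP search would use:

    import heapq, itertools

    def top_down_clause_search(literals, covers, pos, neg, max_len=5):
        """Best-first, top-down search for a clause body, starting from P :- true.
        Each step specializes a body by adding one literal; score = positives
        covered minus negatives covered."""
        def score(body):
            return (sum(covers(body, e) for e in pos)
                    - sum(covers(body, e) for e in neg))

        tie = itertools.count()              # deterministic heap tie-breaking
        start = frozenset()                  # empty body = P :- true
        frontier = [(-score(start), next(tie), start)]
        seen = {start}
        best = (score(start), start)
        while frontier:
            neg_s, _, body = heapq.heappop(frontier)
            if -neg_s > best[0]:
                best = (-neg_s, body)
            if len(body) < max_len:
                for lit in set(literals) - body:   # specialize: add one literal
                    child = body | {lit}
                    if child not in seen:
                        seen.add(child)
                        heapq.heappush(frontier, (-score(child), next(tie), child))
        return best                                # (score, set of body literals)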
Skill Transfer to 3-on-2 BreakAway

[Chart: probability of goal vs. training games (0 to 3000), comparing Standard RL with skill transfer from 2-on-1 BreakAway, 3-on-2 MoveDownfield, and 3-on-2 KeepAway.]
Torrey, Shavlik, Walker & Maclin: ECML 2006, ICML Workshop 2006
Approach #3: Relational Macros

A relational macro is a finite-state machine
  Nodes represent internal states of the agent in which independent policies apply
  Conditions for transitions and actions are learned via ILP

Example with two nodes, hold and pass:
  hold ← true
  pass(Teammate) ← isOpen(Teammate)
  transition conditions: isClose(Opponent), allOpponentsFar
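A sketch of such a macro as code, with plain Python predicates standing in for the ILP-learned conditions; the state encoding and the transition directions (hold to pass when an opponent is close, back when all opponents are far) are assumptions:

    class RelationalMacro:
        """Finite-state machine: each node carries its own policy, and
        learned conditions gate the transitions between nodes."""

        def __init__(self, policies, transitions, start):
            self.policies = policies        # {node: policy(state) -> action}
            self.transitions = transitions  # {node: [(condition(state), next_node)]}
            self.node = start

        def act(self, state):
            for condition, nxt in self.transitions.get(self.node, []):
                if condition(state):        # take the first transition that fires
                    self.node = nxt
                    break
            return self.policies[self.node](state)

    # The slide's two-node macro; the state dict keys are assumed.
    macro = RelationalMacro(
        policies={"hold": lambda s: "hold",
                  "pass": lambda s: "pass(" + next(t for t in s["teammates"] if s["isOpen"][t]) + ")"},
        transitions={"hold": [(lambda s: s["isCloseOpponent"], "pass")],
                     "pass": [(lambda s: s["allOpponentsFar"], "hold")]},
        start="hold")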
Step 1: Learning Macro Structure

Objective: find (via ILP) an action pattern that separates good and bad games, e.g. move(ahead) → pass(Teammate) → shoot(GoalPart):

macroSequence(Game, StateA) ←
    actionTaken(Game, StateA, move, ahead, StateB),
    actionTaken(Game, StateB, pass, _, StateC),
    actionTaken(Game, StateC, shoot, _, gameEnd).
Step 2: Learning Macro Conditions

Objective: describe when transitions and actions should be taken.

For the transition from move to pass:
  transition(State) ←
      distance(Teammate, goal, State) < 15.

For the policy in the pass node:
  action(State, pass(Teammate)) ←
      angle(Teammate, me, Opponent, State) > 30.
Learned 2-on-1 BreakAway Macro

pass(Teammate) → move(Direction) → shoot(goalRight) → shoot(goalLeft)

The player with the ball executes the macro.
The shoot(goalRight) step is apparently a leading pass.
Transfer via Demonstration

Demonstration (sketched in code below):
1. Execute the macro strategy to get Q-value estimates
2. Infer low Q values for actions not taken by the macro
3. Compute an initial Q function with these examples
4. Continue learning with standard RL

Advantage: potential for a large immediate jump in performance
Disadvantage: risk that the agent will blindly follow an inappropriate strategy
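A sketch of steps 1 through 3, assuming env, macro, and fit_q interfaces, and using discounted Monte Carlo returns as the step-1 Q estimates (the actual method's estimates may be computed differently):

    def transfer_via_demonstration(env, macro, actions, fit_q, games=100,
                                   gamma=0.9, low_q=0.0):
        examples = []
        for _ in range(games):
            state, done = env.reset(), False
            trajectory = []
            while not done:
                action = macro.act(state)               # step 1: execute the macro strategy
                next_state, reward, done = env.step(action)
                trajectory.append((state, action, reward))
                state = next_state
            ret = 0.0
            for s, a, r in reversed(trajectory):        # discounted return as a Q estimate
                ret = r + gamma * ret
                examples.append((s, a, ret))
                for other in actions:
                    if other != a:
                        examples.append((s, other, low_q))  # step 2: low Q for untaken actions
        return fit_q(examples)                          # step 3: fit initial Q; step 4: standard RL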
Macro Transfer to 3-on-2 BreakAway

[Chart: probability of goal vs. training games (0 to 3000), comparing Standard RL, Model Reuse, Skill Transfer, and Relational Macro.]
Torrey, Shavlik, Walker & Maclin: ILP 2007
Macro Transfer to 4-on-3 BreakAway

[Chart: probability of goal vs. training games (0 to 3000), comparing Standard RL, Model Reuse (a variant of Taylor & Stone), Skill Transfer, and Relational Macro.]
Torrey, Shavlik, Walker & Maclin: ILP 2007
Outline

Reinforcement Learning w/ Advice
Transfer via Rule Extraction & Advice Taking
Transfer via Macros
Transfer via Markov Logic Networks
Wrap Up
Approach #4: Markov Logic Networks (Richardson and Domingos, MLj 2006)

[Network fragment: evidence nodes dist1 > 5, ang1 > 45, and dist2 < 10 linked to the Q-bin nodes 0 ≤ Q < 0.5 and 0.5 ≤ Q < 1.0.]

Weighted formulas:
  IF dist1 > 5 AND ang1 > 45 THEN 0 ≤ Q < 0.5    (weight = 2.1)
  IF dist2 < 10 AND ang1 > 45 THEN 0.5 ≤ Q < 1.0  (weight = 1.7)
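Under MLN semantics, each assignment of the Q bins gets probability proportional to exp of the summed weights of the formulas it satisfies. A minimal Python sketch of that computation for the two rules above, treating the bins as mutually exclusive (a simplifying assumption):

    import math

    BIN_A, BIN_B = "0 <= Q < 0.5", "0.5 <= Q < 1.0"
    # Each formula is an implication: body(evidence) => Q falls in the given bin.
    formulas = [
        (2.1, lambda ev, qbin: not (ev["dist1>5"] and ev["ang1>45"]) or qbin == BIN_A),
        (1.7, lambda ev, qbin: not (ev["dist2<10"] and ev["ang1>45"]) or qbin == BIN_B),
    ]

    def bin_probabilities(evidence, bins):
        """P(bin) proportional to exp(sum of weights of satisfied formulas)."""
        raw = {b: math.exp(sum(w for w, f in formulas if f(evidence, b))) for b in bins}
        z = sum(raw.values())
        return {b: v / z for b, v in raw.items()}

    evidence = {"dist1>5": True, "ang1>45": True, "dist2<10": True}
    print(bin_probabilities(evidence, [BIN_A, BIN_B]))  # the 2.1 weight pulls mass to BIN_A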
Using MLNs to Learn a Q Function

1. Perform hierarchical clustering to find a set of good Q-value bins
2. Use ILP to learn rules that classify examples into bins, e.g.
     IF dist1 > 5 AND ang1 > 45 THEN 0 ≤ Q < 0.1
3. Use MLN weight-learning methods to choose weights for these formulas
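A sketch of step 1 under obvious assumptions (scipy's hierarchical clustering and an arbitrary distance threshold; the actual clustering choices may differ):

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage

    def q_value_bins(q_values, gap=0.15):
        """Cluster scalar Q-values hierarchically; report each cluster's
        [min, max] interval as a candidate bin."""
        z = linkage(np.asarray(q_values, dtype=float).reshape(-1, 1), method="average")
        labels = fcluster(z, t=gap, criterion="distance")
        clusters = {}
        for q, lab in zip(q_values, labels):
            clusters.setdefault(lab, []).append(q)
        return sorted((min(c), max(c)) for c in clusters.values())

    print(q_value_bins([0.02, 0.05, 0.11, 0.47, 0.52, 0.90, 0.94]))
    # e.g. -> [(0.02, 0.11), (0.47, 0.52), (0.9, 0.94)]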
MLN Transfer to 3-on-2 BreakAway

[Chart: probability of goal vs. training games (0 to 3000), comparing Standard RL, MLN Transfer, Macro Transfer, and Value-function Transfer.]

Torrey, Shavlik, Natarajan, Kuppili & Walker: AAAI TL Workshop 2008
Outline

Reinforcement Learning w/ Advice
Transfer via Rule Extraction & Advice Taking
Transfer via Macros
Transfer via Markov Logic Networks
Wrap Up
Summary of Our Transfer Methods

1. Directly reuse weighted sums as advice
2. Use ILP to learn generalized advice for each action
3. Use ILP to learn macro-operators
4. Use Markov Logic Networks to learn probability distributions for Q functions
Our Desiderata for Transfer in RL

Transfer knowledge in first-order logic
Accept advice from humans expressed naturally
Refine transferred knowledge
Improve performance in related target tasks
Major challenge: avoid negative transfer
Related Work in RL Transfer

Value-function transfer (Taylor & Stone 2005)
Policy reuse (Fernandez & Veloso 2006)
State abstractions (Walsh et al. 2006)
Options (Croonenborghs et al. 2007)

Torrey and Shavlik survey paper is online
Conclusion

Transfer learning is an important perspective for machine learning
- move beyond isolated learning tasks

Appealing ways to do transfer learning are via the advice-taking and demonstration perspectives

Long-term goal: instructable computing
- teach computers the same way we teach humans