Transfer Learning Via Advice Taking
Jude Shavlik, University of Wisconsin-Madison
Acknowledgements

Lisa Torrey, Trevor Walker, & Rich Maclin
DARPA IPTO Grant HR0011-04-1-0007
NRL Grant N00173-06-1-G002
DARPA IPTO Grant FA8650-06-C-7606
What Would You Like to Say to This Penguin?

IF   a Bee is (Near and West) &
     an Ice is (Near and North)
THEN BEGIN
       Move East
       Move North
     END
Empirical Results

[Chart: reinforcement on the test set vs. number of training episodes (0 to 4000), comparing learning with advice against learning without advice.]
Our Approach to Transfer Learning

Source Task → [Extraction] → Extracted Knowledge → [Mapping] → Transferred Knowledge → [Refinement] → Target Task
Potential Benefits of TransferPotential Benefits of Transfer
perf
orm
ance
training
with transferwithout transfer
higher start
steeper slope higher asymptote
Outline

Reinforcement Learning w/ Advice
Transfer via Rule Extraction & Advice Taking
Transfer via Macros
Transfer via Markov Logic Networks (time permitting)
Wrap Up
Reinforcement Learning (RL) Overview

Agent-environment loop: sense state → choose action → receive reward

Policy: choose the action with the highest Q-value in the current state
Use the rewards to estimate the Q-values of actions in states
Each state is described by a set of features
The RoboCup Domain

RoboCup subtasks: KeepAway, BreakAway, MoveDownfield
(variant of Stone & Sutton, ICML 2001)
Q Learning (Watkins PhD, 1989)

Q function: (state, action) → value

policy(state) = argmax_action Q(state, action)

For large state spaces, need function approximation
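For readers who want the textbook mechanics, here is a minimal tabular Q-learning sketch in Python; the env interface (reset, step, actions) is an assumption, and the talk's actual learner uses the support-vector regression described next rather than a table:

    import random
    from collections import defaultdict

    def q_learning(env, episodes, alpha=0.1, gamma=0.9, epsilon=0.1):
        """Tabular Watkins-style Q-learning from observed rewards."""
        Q = defaultdict(float)                     # Q[(state, action)], defaults to 0
        for _ in range(episodes):
            state, done = env.reset(), False
            while not done:
                if random.random() < epsilon:      # explore occasionally
                    action = random.choice(env.actions)
                else:                              # policy(state) = argmax_action Q
                    action = max(env.actions, key=lambda a: Q[(state, a)])
                next_state, reward, done = env.step(action)
                # Update Q toward reward plus discounted best next value
                best_next = max(Q[(next_state, a)] for a in env.actions)
                Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
                state = next_state
        return Q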
Learning the Q Function

A standard approach: linear support-vector regression

Q-value = weight vectorᵀ ● feature vector

e.g.  (0.2, -0.1, 0.9, …)ᵀ ● (distance(me, teammate1), distance(me, opponent1), angle(opponent1, me, teammate1), …)

Set weights to minimize
  Model size + C × Data misfit
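In symbols, one linear-programming form of this trade-off, where A holds one feature vector per row and y the training Q estimates (the 1-norm choices here are an assumption in the spirit of Fung, Mangasarian & Shavlik; the slide gives only the general shape):

    \min_{w,b}\; \|w\|_1 \;+\; C\,\|Aw + b\mathbf{1} - y\|_1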
Advice in RL

Advice provides constraints on Q values under specified conditions:

  IF   an opponent is near me
  AND  a teammate is open
  THEN Q(pass(teammate)) > Q(move(ahead))

Apply as soft constraints in optimization:

  Model size + C × Data misfit + μ × Advice misfit
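Schematically, the earlier objective gains a third term, with nonnegative slack variables z measuring how far the learned function strays from each advice constraint (again a sketch, not the exact published formulation):

    \min_{w,b,\,z \ge 0}\; \|w\|_1 \;+\; C\,\|Aw + b\mathbf{1} - y\|_1 \;+\; \mu\,\mathbf{1}^{\top} z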
Aside: Generalizing the Idea of a Training Example for Support Vector Machines (SVMs)

Can extend the SVM linear program to handle "regions as training examples"

Fung, Mangasarian, & Shavlik: NIPS 2003, COLT 2004
Specifying Advice for Support Vector Regression

If input (x) is in the region specified by B and d:
    B x ≤ d
then output (y) should be above some line (h′x + β):
    y ≥ h′x + β
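Why this is implementable: by Farkas' lemma, for a nonempty advice region the implication above is equivalent to a finite set of linear constraints in new nonnegative variables u; this is the knowledge-based support-vector construction of Fung et al., and in the soft version slack variables relax the two conditions:

    Bx \le d \;\Rightarrow\; w^{\top}x + b \ge h^{\top}x + \beta
    \quad\Longleftrightarrow\quad
    \exists\, u \ge 0:\;\; B^{\top}u + w - h = 0, \quad d^{\top}u + \beta - b \le 0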
Sample Advice

Advice format:  If B x ≤ d  Then  f(x) ≥ h x + β

If   distanceToGoal ≤ 10
and  shotAngle ≥ 30
Then Q(shoot) ≥ 0.9

With x = (distanceToTeammate, shotAngle, distanceToGoal)ᵀ, this becomes

  B = ( 0  -1   0 )    d = ( -30 )    h = 0,  β = 0.9
      ( 0   0   1 )        (  10 )
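A quick numpy check of this encoding (the feature order is the one assumed above):

    import numpy as np

    # x = (distanceToTeammate, shotAngle, distanceToGoal)
    B = np.array([[0.0, -1.0, 0.0],   # -shotAngle <= -30  (i.e., shotAngle >= 30)
                  [0.0,  0.0, 1.0]])  # distanceToGoal <= 10
    d = np.array([-30.0, 10.0])

    x = np.array([15.0, 40.0, 8.0])   # a hypothetical state
    if np.all(B @ x <= d):            # the advice region holds here
        print("advice active: require Q(shoot) >= 0.9")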
Sample Advice-Taking Results

if distanceToGoal ≤ 10
and shotAngle ≥ 30
then prefer shoot over all other actions
(i.e., Q(shoot) > Q(pass) and Q(shoot) > Q(move))

[Chart: Prob(Score Goal) vs. games played (0 to 25,000), advice vs. std RL; 2-on-1 BreakAway, rewards +1, -1.]
Outline

Reinforcement Learning w/ Advice
Transfer via Rule Extraction & Advice Taking
Transfer via Macros
Transfer via Markov Logic Networks
Wrap Up
Close-Transfer Scenarios
2-on-1 BreakAway
3-on-2 BreakAway
4-on-3 BreakAway
Distant-Transfer Scenarios
3-on-2 BreakAway
3-on-2 KeepAway
3-on-2 MoveDownfield
Our First Transfer-Learning Approach:
Exploit the fact that models and advice are in the same language
Source Q functions:
  Qx = wx1 f1 + wx2 f2 + bx
  Qy = wy1 f1 + by
  Qz = wz2 f2 + bz

Mapped Q functions:
  Q′x = wx1 f′1 + wx2 f′2 + bx
  Q′y = wy1 f′1 + by
  Q′z = wz2 f′2 + bz

Advice:
  if Q′x > Q′y and Q′x > Q′z then prefer x′

Advice (expanded):
  if  wx1 f′1 + wx2 f′2 + bx > wy1 f′1 + by
  and wx1 f′1 + wx2 f′2 + bx > wz2 f′2 + bz
  then prefer x′ to y′ and z′
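Mechanically, the expanded advice is just inequality construction from the mapped weights. A throwaway sketch (the dict encoding of Q functions and the numeric weights are assumptions for illustration):

    def advice_from_source(q_funcs, preferred, others):
        """Expand 'prefer' advice from mapped Q functions.
        q_funcs: {action: (bias, {feature: weight})} -- encoding assumed."""
        def expr(action):
            bias, weights = q_funcs[action]
            return " + ".join([f"{w}*{feat}" for feat, w in weights.items()] + [str(bias)])
        tests = [f"{expr(preferred)} > {expr(a)}" for a in others]
        return "if " + " and ".join(tests) + f" then prefer {preferred}"

    # Mapped Q functions from the slide, with made-up numeric weights:
    q = {"x'": (0.2, {"f1'": 1.4, "f2'": 0.5}),
         "y'": (0.1, {"f1'": 0.9}),
         "z'": (0.3, {"f2'": 1.1})}
    print(advice_from_source(q, "x'", ["y'", "z'"]))
    # -> if 1.4*f1' + 0.5*f2' + 0.2 > 0.9*f1' + 0.1 and ... then prefer x'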
User Advice in Skill Transfer

There may be new skills in the target that cannot be learned from the source.
We allow (human) users to add their own advice about these skills.

User advice for KeepAway to BreakAway:
  IF:  distance(me, GoalPart) < 10
  AND  angle(GoalPart, me, goalie) > 40
  THEN: prefer shoot(GoalPart)
Sample Human Interaction

"Use what you learned in KeepAway, and add in this new action SHOOT."

"Here is some advice about shooting …"

"Now go practice for awhile."
Policy Transfer to 3-on-2 BreakAway

[Chart: probability of goal vs. training games (0 to 3000), comparing Standard RL with policy transfer from 2-on-1 BreakAway, 3-on-2 MoveDownfield, and 3-on-2 KeepAway.]
Torrey, Walker, Shavlik & Maclin: ECML 2005
Our Second Approach: Use Inductive Logic Programming (ILP) on the SOURCE to extract advice

Given
  Positive and negative examples for each action
Do
  Learn first-order rules that describe most positive examples but few negative examples

Example facts:
  good_action(pass(t1), state1)
  good_action(pass(t2), state3)
  good_action(pass(t1), state2)
  good_action(pass(t2), state2)
  good_action(pass(t1), state3)

A learned rule:
  good_action(pass(Teammate), State) :-
      distance(me, Teammate, State) > 10,
      distance(Teammate, goal, State) < 15.
Searching for an ILP Clause (top-down search using A*)

P :- true
  P :- Q      P :- R      P :- S
    P :- R, Q      P :- R, S
      …
        P :- R, S, V, W, X
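For concreteness, a toy best-first version of this top-down search; covers(body, example) is an assumed oracle (e.g., a Prolog proof), and the simple coverage score stands in for the heuristics a real A* ILP search would use:

    import heapq, itertools

    def top_down_clause_search(literals, covers, pos, neg, max_len=5):
        """Best-first, top-down search for a clause body, starting from P :- true.
        Each step specializes a body by adding one literal; score = positives
        covered minus negatives covered."""
        def score(body):
            return (sum(covers(body, e) for e in pos)
                    - sum(covers(body, e) for e in neg))

        tie = itertools.count()              # deterministic heap tie-breaking
        start = frozenset()                  # empty body = P :- true
        frontier = [(-score(start), next(tie), start)]
        seen = {start}
        best = (score(start), start)
        while frontier:
            neg_s, _, body = heapq.heappop(frontier)
            if -neg_s > best[0]:
                best = (-neg_s, body)
            if len(body) < max_len:
                for lit in set(literals) - body:   # specialize: add one literal
                    child = body | {lit}
                    if child not in seen:
                        seen.add(child)
                        heapq.heappush(frontier, (-score(child), next(tie), child))
        return best                                # (score, set of body literals)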
Skill Transfer to 3-on-2 BreakAway

[Chart: probability of goal vs. training games (0 to 3000), comparing Standard RL with skill transfer from 2-on-1 BreakAway, 3-on-2 MoveDownfield, and 3-on-2 KeepAway.]
Torrey, Shavlik, Walker & Maclin: ECML 2006, ICML Workshop 2006
Approach #3: Relational Macros

A relational macro is a finite-state machine
  Nodes represent internal states of the agent in which independent policies apply
  Conditions for transitions and actions are learned via ILP

Example with two nodes, hold and pass:
  hold ← true
  pass(Teammate) ← isOpen(Teammate)
  transition conditions: isClose(Opponent), allOpponentsFar
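A sketch of such a macro as code, with plain Python predicates standing in for the ILP-learned conditions; the state encoding and the transition directions (hold to pass when an opponent is close, back when all opponents are far) are assumptions:

    class RelationalMacro:
        """Finite-state machine: each node carries its own policy, and
        learned conditions gate the transitions between nodes."""

        def __init__(self, policies, transitions, start):
            self.policies = policies        # {node: policy(state) -> action}
            self.transitions = transitions  # {node: [(condition(state), next_node)]}
            self.node = start

        def act(self, state):
            for condition, nxt in self.transitions.get(self.node, []):
                if condition(state):        # take the first transition that fires
                    self.node = nxt
                    break
            return self.policies[self.node](state)

    # The slide's two-node macro; the state dict keys are assumed.
    macro = RelationalMacro(
        policies={"hold": lambda s: "hold",
                  "pass": lambda s: "pass(" + next(t for t in s["teammates"] if s["isOpen"][t]) + ")"},
        transitions={"hold": [(lambda s: s["isCloseOpponent"], "pass")],
                     "pass": [(lambda s: s["allOpponentsFar"], "hold")]},
        start="hold")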
Step 1: Learning Macro Structure

Objective: find (via ILP) an action pattern that separates good and bad games, e.g. move(ahead) → pass(Teammate) → shoot(GoalPart):

macroSequence(Game, StateA) ←
    actionTaken(Game, StateA, move, ahead, StateB),
    actionTaken(Game, StateB, pass, _, StateC),
    actionTaken(Game, StateC, shoot, _, gameEnd).
Step 2: Learning Macro Conditions

Objective: describe when transitions and actions should be taken.

For the transition from move to pass:
  transition(State) ←
      distance(Teammate, goal, State) < 15.

For the policy in the pass node:
  action(State, pass(Teammate)) ←
      angle(Teammate, me, Opponent, State) > 30.
Learned 2-on-1 BreakAway Macro

pass(Teammate) → move(Direction) → shoot(goalRight) → shoot(goalLeft)

The player with the ball executes the macro.
The shoot(goalRight) step is apparently a leading pass.
Transfer via Demonstration

Demonstration (sketched in code below):
1. Execute the macro strategy to get Q-value estimates
2. Infer low Q values for actions not taken by the macro
3. Compute an initial Q function with these examples
4. Continue learning with standard RL

Advantage: potential for a large immediate jump in performance
Disadvantage: risk that the agent will blindly follow an inappropriate strategy
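A sketch of steps 1 through 3, assuming env, macro, and fit_q interfaces, and using discounted Monte Carlo returns as the step-1 Q estimates (the actual method's estimates may be computed differently):

    def transfer_via_demonstration(env, macro, actions, fit_q, games=100,
                                   gamma=0.9, low_q=0.0):
        examples = []
        for _ in range(games):
            state, done = env.reset(), False
            trajectory = []
            while not done:
                action = macro.act(state)               # step 1: execute the macro strategy
                next_state, reward, done = env.step(action)
                trajectory.append((state, action, reward))
                state = next_state
            ret = 0.0
            for s, a, r in reversed(trajectory):        # discounted return as a Q estimate
                ret = r + gamma * ret
                examples.append((s, a, ret))
                for other in actions:
                    if other != a:
                        examples.append((s, other, low_q))  # step 2: low Q for untaken actions
        return fit_q(examples)                          # step 3: fit initial Q; step 4: standard RL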
Macro Transfer to 3-on-2 BreakAway

[Chart: probability of goal vs. training games (0 to 3000), comparing Standard RL, Model Reuse, Skill Transfer, and Relational Macro.]
Torrey, Shavlik, Walker & Maclin: ILP 2007
Macro Transfer to 4-on-3 BreakAway

[Chart: probability of goal vs. training games (0 to 3000), comparing Standard RL, Model Reuse (a variant of Taylor & Stone), Skill Transfer, and Relational Macro.]
Torrey, Shavlik, Walker & Maclin: ILP 2007
Outline

Reinforcement Learning w/ Advice
Transfer via Rule Extraction & Advice Taking
Transfer via Macros
Transfer via Markov Logic Networks
Wrap Up
Approach #4: Markov Logic Networks (Richardson and Domingos, MLj 2006)

[Network fragment: evidence nodes dist1 > 5, ang1 > 45, and dist2 < 10 linked to the Q-bin nodes 0 ≤ Q < 0.5 and 0.5 ≤ Q < 1.0.]

Weighted formulas:
  IF dist1 > 5 AND ang1 > 45 THEN 0 ≤ Q < 0.5    (weight = 2.1)
  IF dist2 < 10 AND ang1 > 45 THEN 0.5 ≤ Q < 1.0  (weight = 1.7)
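Under MLN semantics, each assignment of the Q bins gets probability proportional to exp of the summed weights of the formulas it satisfies. A minimal Python sketch of that computation for the two rules above, treating the bins as mutually exclusive (a simplifying assumption):

    import math

    BIN_A, BIN_B = "0 <= Q < 0.5", "0.5 <= Q < 1.0"
    # Each formula is an implication: body(evidence) => Q falls in the given bin.
    formulas = [
        (2.1, lambda ev, qbin: not (ev["dist1>5"] and ev["ang1>45"]) or qbin == BIN_A),
        (1.7, lambda ev, qbin: not (ev["dist2<10"] and ev["ang1>45"]) or qbin == BIN_B),
    ]

    def bin_probabilities(evidence, bins):
        """P(bin) proportional to exp(sum of weights of satisfied formulas)."""
        raw = {b: math.exp(sum(w for w, f in formulas if f(evidence, b))) for b in bins}
        z = sum(raw.values())
        return {b: v / z for b, v in raw.items()}

    evidence = {"dist1>5": True, "ang1>45": True, "dist2<10": True}
    print(bin_probabilities(evidence, [BIN_A, BIN_B]))  # the 2.1 weight pulls mass to BIN_A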
Using MLNs to Learn a Q Function

1. Perform hierarchical clustering to find a set of good Q-value bins
2. Use ILP to learn rules that classify examples into bins, e.g.
     IF dist1 > 5 AND ang1 > 45 THEN 0 ≤ Q < 0.1
3. Use MLN weight-learning methods to choose weights for these formulas
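A sketch of step 1 under obvious assumptions (scipy's hierarchical clustering and an arbitrary distance threshold; the actual clustering choices may differ):

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage

    def q_value_bins(q_values, gap=0.15):
        """Cluster scalar Q-values hierarchically; report each cluster's
        [min, max] interval as a candidate bin."""
        z = linkage(np.asarray(q_values, dtype=float).reshape(-1, 1), method="average")
        labels = fcluster(z, t=gap, criterion="distance")
        clusters = {}
        for q, lab in zip(q_values, labels):
            clusters.setdefault(lab, []).append(q)
        return sorted((min(c), max(c)) for c in clusters.values())

    print(q_value_bins([0.02, 0.05, 0.11, 0.47, 0.52, 0.90, 0.94]))
    # e.g. -> [(0.02, 0.11), (0.47, 0.52), (0.9, 0.94)]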
MLN Transfer to 3-on-2 BreakAway

[Chart: probability of goal vs. training games (0 to 3000), comparing Standard RL, MLN Transfer, Macro Transfer, and Value-function Transfer.]

Torrey, Shavlik, Natarajan, Kuppili & Walker: AAAI TL Workshop 2008
Outline

Reinforcement Learning w/ Advice
Transfer via Rule Extraction & Advice Taking
Transfer via Macros
Transfer via Markov Logic Networks
Wrap Up
Summary of Our Transfer Methods

1. Directly reuse weighted sums as advice
2. Use ILP to learn generalized advice for each action
3. Use ILP to learn macro-operators
4. Use Markov Logic Networks to learn probability distributions for Q functions
Our Desiderata for Transfer in RL

Transfer knowledge in first-order logic
Accept advice from humans expressed naturally
Refine transferred knowledge
Improve performance in related target tasks
Major challenge: avoid negative transfer
Related Work in RL Transfer

Value-function transfer (Taylor & Stone 2005)
Policy reuse (Fernandez & Veloso 2006)
State abstractions (Walsh et al. 2006)
Options (Croonenborghs et al. 2007)

Torrey and Shavlik survey paper is online
Conclusion

Transfer learning is an important perspective for machine learning
- move beyond isolated learning tasks

Appealing ways to do transfer learning are via the advice-taking and demonstration perspectives

Long-term goal: instructable computing
- teach computers the same way we teach humans