
Machine Learning via Advice Taking

Jude Shavlik

Thanks To ...

Rich Maclin, Lisa Torrey, Trevor Walker

Prof. Olvi Mangasarian, Glenn Fung, Ted Wild

DARPA

Quote (2002) from DARPA

Sometimes an assistant will merely watch you and draw conclusions.

Sometimes you have to tell a new person, 'Please don't do it this way' or 'From now on when I say X, you do Y.'

It's a combination of learning by example and by being guided.

Widening the “Communication Pipeline” between Humans and Machine Learners

[Diagram: a human teacher communicating with the pupil, a machine learner.]

Our Approach to Building Better Machine Learners

• Human partner expresses advice “naturally” and w/o knowledge of ML agent’s internals

• Agent incorporates advice directly into the function it is learning

• Additional feedback (rewards, I/O pairs, inferred labels, more advice) used to refine learner continually

“Standard” Machine Learning vs. Theory Refinement

• Positive Examples (“should see doctor”)
  temp = 102.1, age = 21, sex = F, …
  temp = 101.7, age = 37, sex = M, …

• Negative Examples (“take two aspirins”)
  temp = 99.1, age = 43, sex = M, …
  temp = 99.6, age = 24, sex = F, …

• Approximate Domain Knowledge
  if temp = high and age = young … then negative example

Related work by labs of Mooney, Pazzani, Cohen, Giles, etc

Rich Maclin’s PhD (1995)

IF   a Bee is (Near and West)
AND  an Ice is (Near and North)
THEN BEGIN
  Move East
  Move North
END

Sample Results

[Figure: reinforcement on the test set vs. number of training episodes (0 to 4000), comparing learning with advice and without advice.]

Our Motto

Give advice, rather than commands, to your computer.

Outline

• Prior Knowledge and Support Vector Machines
  • Intro to SVM’s
  • Linear Separation
  • Non-Linear Separation
  • Function Fitting (“Regression”)
• Advice-Taking Reinforcement Learning
• Transfer Learning via Advice Taking

Support Vector Machines: Maximizing the Margin between Bounding Planes

[Figure: classes A+ and A− separated by the bounding planes x’w = γ + 1 and x’w = γ − 1; the margin between the planes is 2 / ||w||₂, and the points lying on them are the support vectors.]

Linear Algebra for SVM’s

• Given p points in n-dimensional space
• Represent them by the p-by-n matrix A of reals
• Each point Aᵢ is in class +1 or −1, recorded as Dᵢᵢ = ±1
• Separate by two bounding planes:
  Aᵢw ≥ γ + 1, for Dᵢᵢ = +1
  Aᵢw ≤ γ − 1, for Dᵢᵢ = −1
• More succinctly: D(Aw − eγ) ≥ e, where e is a vector of ones

“Slack” Variables: Dealing with Data that is not Linearly Separable

[Figure: overlapping classes A+ and A−; points on the wrong side of their bounding plane receive a positive slack y, and the support vectors are marked.]

Support Vector Machines: Quadratic Programming Formulation

• Solve this quadratic program

  min (over w, γ, y)   ν e’y + ½ ||w||₂²
  s.t.   D(Aw − eγ) + y ≥ e,   y ≥ 0

• Maximize the margin by minimizing ½ ||w||₂²
• Minimize the sum of the slack variables, e’y, weighted by ν

Support Vector Machines: Linear Programming Formulation

Use the 1-norm instead of the 2-norm (typically runs faster; better feature selection; might generalize better, NIPS ’03)

  min (over w, γ, y)   ν e’y + ||w||₁
  s.t.   D(Aw − eγ) + y ≥ e,   y ≥ 0
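To make the linear program concrete, here is a minimal sketch that solves the 1-norm SVM LP above with numpy and scipy.optimize.linprog; the toy data, the value of ν, and all variable names are illustrative assumptions, not part of the original slides.

# Minimal sketch of the 1-norm SVM linear program:
#   min (over w, gamma, y)   nu * e'y + ||w||_1
#   s.t.  D(Aw - e*gamma) + y >= e,  y >= 0
# The absolute values |w_j| are handled with auxiliary variables a_j >= |w_j|.
import numpy as np
from scipy.optimize import linprog

def linear_svm_1norm(A, d, nu=1.0):
    """A: p-by-n data matrix, d: labels in {+1, -1}, nu: slack weight."""
    p, n = A.shape
    # Variable vector z = [w (n), a (n), gamma (1), y (p)]
    c = np.concatenate([np.zeros(n), np.ones(n), [0.0], nu * np.ones(p)])
    I_n, I_p = np.eye(n), np.eye(p)
    DA = d[:, None] * A                        # D A, with D = diag(d)
    # a >= |w|  <=>  w - a <= 0  and  -w - a <= 0
    abs1 = np.hstack([ I_n, -I_n, np.zeros((n, 1)), np.zeros((n, p))])
    abs2 = np.hstack([-I_n, -I_n, np.zeros((n, 1)), np.zeros((n, p))])
    # D(Aw - e*gamma) + y >= e  <=>  -DA w + d*gamma - y <= -e
    margin = np.hstack([-DA, np.zeros((p, n)), d[:, None], -I_p])
    A_ub = np.vstack([abs1, abs2, margin])
    b_ub = np.concatenate([np.zeros(2 * n), -np.ones(p)])
    bounds = ([(None, None)] * n + [(0, None)] * n +    # w free, a >= 0
              [(None, None)] + [(0, None)] * p)         # gamma free, y >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    w, gamma = res.x[:n], res.x[2 * n]
    return w, gamma

# Toy usage: two roughly separable clusters; classify with sign(x'w - gamma)
rng = np.random.default_rng(0)
A = np.vstack([rng.normal(+2, 0.5, (20, 2)), rng.normal(-2, 0.5, (20, 2))])
d = np.concatenate([np.ones(20), -np.ones(20)])
w, gamma = linear_svm_1norm(A, d)
print("training accuracy:", np.mean(np.sign(A @ w - gamma) == d))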

Knowledge-Based SVM’s: Generalizing “Example” from POINT to REGION

[Figure: classes A+ and A−, now including polyhedral knowledge regions in addition to the individual training points.]

Incorporating “Knowledge Sets” Into the SVM Linear Program

• Suppose that the knowledge set { x | Bx ≤ b } belongs to class A+
• Hence it must lie in the half space { x | x’w ≥ γ + 1 }
• We therefore have the implication

  Bx ≤ b  ⇒  x’w ≥ γ + 1

• This implication is equivalent to a set of linear constraints (proof in the NIPS ’02 paper)
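For intuition, here is a sketch of why the implication reduces to linear constraints, via standard LP duality (the precise statement and proof are in the NIPS ’02 paper). The implication says the minimum of x’w over the knowledge set is at least γ + 1; dualizing that minimization gives the equivalent finite system

\[
\exists\, u \ge 0 \;\text{ such that }\; B^{\top}u + w = 0 \;\text{ and }\; b^{\top}u + \gamma + 1 \le 0 .
\]

These conditions are linear in (u, w, γ), so they can be added directly to the SVM linear program (and slacked, in the version with slack variables).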

Resulting LP for KBSVM’s

We get a linear program (LP) whose knowledge constraints range over the # of advice regions.

KBSVM with Slack Variables

[Figure: the same LP with the knowledge constraints relaxed; the right-hand sides that were 0 are replaced by slack variables.]

SVMs and Non-Linear Separating Surfaces

[Figure: data that is not linearly separable in the original features (f1, f2) becomes linearly separable after the non-linear mapping to h(f1, f2) and g(f1, f2).]

Non-linearly map to new space

Linearly separate in new space (using kernels)

Result is non-linear separator in original space

Fung et al. (2003) presents knowledge-based non-linear SVMs.

Support Vector Regression (aka Kernel Regression)

Linearly approximate a function, given the array A of inputs and the vector y of (numeric) outputs:

  f(x) ≈ x’w + b

Find weights such that

  Aw + be ≈ y

In dual space, w = A’α, so we get

  (A A’)α + be ≈ y

“Kernel-izing” (to get a non-linear approximation):

  K(A, A’)α + be ≈ y

[Figure: a fitted curve through sample (x, y) points.]

What to Optimize?

Linear program to optimize:

• The 1st term is the “regularizer” that minimizes model complexity
• The 2nd term is the approximation error, weighted by the parameter C
• This becomes the classical “least squares” fit if the quadratic version is used and the first term is ignored

Predicting Y for New X

  y = K(x’, A’)α + b

• Use the kernel to compute a “distance” to each training point (i.e., each row in A)
• Weight by αᵢ (hopefully many αᵢ are zero) and sum
• Add b (a scalar)
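Putting the two preceding slides together, here is a minimal sketch of the fit and the prediction step: a Gaussian kernel and the 1-norm linear program sketched above, solved with scipy.optimize.linprog. The kernel choice, parameter values, and all names are illustrative assumptions rather than the papers’ exact formulation.

# Sketch: 1-norm kernel regression
#   minimize  ||alpha||_1 + C ||s||_1
#   s.t.      -s <= K(A, A') alpha + b e - y <= s
# solved as a linear program, then prediction y_new = K(x', A') alpha + b.
import numpy as np
from scipy.optimize import linprog

def gaussian_kernel(X, Z, sigma=1.0):
    """K[i, j] = exp(-||X_i - Z_j||^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def fit_kernel_regression(A, y, C=10.0, sigma=1.0):
    p = A.shape[0]
    K = gaussian_kernel(A, A, sigma)
    # Variable vector z = [alpha (p), a (p, bounds on |alpha|), b (1), s (p)]
    c = np.concatenate([np.zeros(p), np.ones(p), [0.0], C * np.ones(p)])
    I, Z, e = np.eye(p), np.zeros((p, p)), np.ones((p, 1))
    A_ub = np.vstack([
        np.hstack([ I, -I, np.zeros((p, 1)), Z]),   #  alpha - a <= 0
        np.hstack([-I, -I, np.zeros((p, 1)), Z]),   # -alpha - a <= 0
        np.hstack([ K,  Z,  e, -I]),                #  K alpha + b e - s <= y
        np.hstack([-K,  Z, -e, -I]),                # -K alpha - b e - s <= -y
    ])
    b_ub = np.concatenate([np.zeros(2 * p), y, -y])
    bounds = ([(None, None)] * p + [(0, None)] * p +
              [(None, None)] + [(0, None)] * p)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    alpha, b = res.x[:p], res.x[2 * p]
    return alpha, b

def predict(A_train, alpha, b, X_new, sigma=1.0):
    # y = K(x', A') alpha + b: kernel "distances" weighted by alpha, plus b
    return gaussian_kernel(X_new, A_train, sigma) @ alpha + b

# Toy usage: fit y = sin(x) on 40 points, then predict at x = 1.5
A = np.linspace(0, 6, 40).reshape(-1, 1)
y = np.sin(A).ravel()
alpha, b = fit_kernel_regression(A, y)
print(predict(A, alpha, b, np.array([[1.5]])), "vs", np.sin(1.5))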

Knowledge-Based SVR
Mangasarian, Shavlik, & Wild, JMLR ’04

Add soft constraints to the linear program (so the learner need only follow the advice approximately):

  minimize   ||w||₁ + C ||s||₁ + penalty for violating advice
  such that  y − s ≤ Aw + be ≤ y + s,   plus a “slacked” match to the advice

[Figure: example advice “In this region, y should exceed 4”, drawn as a region that the fitted curve is pushed above.]

Testbeds: Subtasks of RoboCup

• Mobile KeepAway: keep the ball from the opponents [Stone & Sutton, ICML 2001]
• BreakAway: score a goal [Maclin et al., AAAI 2005]

Reinforcement Learning Overview

• The agent repeatedly receives a state (described by a set of features), takes an action, and receives a reward
• Use the rewards to estimate the Q-values of actions in states
• Policy: choose the action with the highest Q-value in the current state
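For readers who want the mechanics spelled out, here is a generic tabular Q-learning sketch (not the RoboCup agent, which approximates Q-values with kernel regression); the environment interface and parameter values are assumptions for illustration.

# Generic tabular Q-learning sketch: estimate Q-values from rewards and act
# greedily (with epsilon-exploration) on the current estimates.
import random
from collections import defaultdict

def q_learning(env, episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    """env is assumed to provide reset() -> state, step(a) -> (state, reward, done),
    and a list env.actions; states must be hashable."""
    Q = defaultdict(float)                     # Q[(state, action)] -> value

    def policy(state):
        if random.random() < epsilon:          # explore occasionally
            return random.choice(env.actions)
        # otherwise choose the action with the highest Q-value in this state
        return max(env.actions, key=lambda a: Q[(state, a)])

    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            best_next = max(Q[(next_state, a)] for a in env.actions)
            # temporal-difference update toward reward + discounted future value
            Q[(state, action)] += alpha * (reward + gamma * best_next
                                           - Q[(state, action)])
            state = next_state
    return Q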

Incorporating Advice in KBKR

Advice format:   Bx ≤ d   ⇒   f(x) ≥ hx + β

Example:
  If distanceToGoal ≤ 10 and shotAngle ≥ 30
  Then Q(shoot) ≥ 0.9

For instance, with the feature vector x = (distanceToTeammate, distanceToGoal, shotAngle)’, this becomes

  B = [ 0  1  0 ;  0  0  −1 ],   d = (10, −30)’,   h = 0,   β = 0.9
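A small sketch of how such a rule can be held as matrices and tested against a state; the feature ordering and names follow the example above and are illustrative. In KBKR itself the (B, d, h, β) description is turned into slacked linear constraints on the learned Q-function rather than checked at run time.

# Represent "If distanceToGoal <= 10 and shotAngle >= 30 then Q(shoot) >= 0.9"
# in the KBKR advice format  Bx <= d  =>  f(x) >= hx + beta.
import numpy as np

# Feature vector x = (distanceToTeammate, distanceToGoal, shotAngle)
B = np.array([[0.0, 1.0,  0.0],    # distanceToGoal <= 10
              [0.0, 0.0, -1.0]])   # -shotAngle <= -30, i.e. shotAngle >= 30
d = np.array([10.0, -30.0])
h = np.zeros(3)                    # conclusion's right-hand side: 0*x + 0.9
beta = 0.9

def advice_applies(x):
    """True when the state x lies in the advice region Bx <= d."""
    return np.all(B @ x <= d)

def conclusion_lower_bound(x):
    """Lower bound the advice asserts for Q(shoot) when it applies."""
    return h @ x + beta

x = np.array([15.0, 8.0, 40.0])    # hypothetical state: near goal, wide angle
if advice_applies(x):
    print("advice says Q(shoot) >=", conclusion_lower_bound(x))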

Giving Advice About Relative Values of Multiple Functions

Maclin et al., AAAI ’05

When the input satisfies

preconditions(input)

Then

f1(input) > f2(input)

Sample Advice-Taking Results

if distanceToGoal ≤ 10 and shotAngle ≥ 30
then prefer shoot over all other actions
(as constraints: Q(shoot) > Q(pass), Q(shoot) > Q(move))

[Figure: Prob(Score Goal) vs. games played (0 to 25,000) in 2-on-1 BreakAway with rewards +1/−1, comparing the advice learner to standard RL.]

Transfer Learning

• Agent learns Task A (the source task)
• Agent encounters a related Task B (the target task)
• Agent uses knowledge from Task A to learn Task B faster
• Agent discovers how the tasks are related (here, we use a user mapping to tell the agent this)

Transfer Learning: The Goal for the Target Task

[Figure: performance vs. training for the target task, with and without transfer; the goals are a better start, a faster rise, and a better asymptote.]

Our Transfer Algorithm

1. Observe source-task games and use ILP to learn skills
2. Translate the learned skills into transfer advice for the target task
3. If there is user advice, add it in
4. Learn the target task with KBKR

Learning Skills By Observation

• Source-task games are sequences of (state, action) pairs
• Learning skills is like learning to classify states by their correct actions
• ILP = Inductive Logic Programming

Example observed state:
  State 1:
    distBetween(me, teammate2) = 15
    distBetween(me, teammate1) = 10
    distBetween(me, opponent1) = 5
    ...
    action = pass(teammate2)
    outcome = caught(teammate2)
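Below is a minimal sketch of the “classify states by their correct actions” view: turning observed (state, action, outcome) records into positive and negative examples for one skill. The record format, and the rule that only successful passes count as positives, are assumptions made for illustration.

# Sketch: build training examples for the "pass" skill from observed games.
# A game is a list of records: (state_features: dict, action: str, outcome: str).
def label_pass_examples(games):
    positives, negatives = [], []
    for game in games:
        for state, action, outcome in game:
            if action.startswith("pass(") and outcome.startswith("caught("):
                positives.append(state)        # a successful pass: positive example
            else:
                negatives.append(state)        # other actions: negative examples
    return positives, negatives

# Toy usage with one observed step (features as in the example state above)
game = [({"distBetween(me,teammate2)": 15,
          "distBetween(me,teammate1)": 10,
          "distBetween(me,opponent1)": 5},
         "pass(teammate2)", "caught(teammate2)")]
pos, neg = label_pass_examples([game])
print(len(pos), "positive,", len(neg), "negative examples")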

ILP: Searching for First-Order Rules

Search a space of increasingly specific clauses:

  P :- true
  P :- Q      P :- R      P :- S
  P :- R, Q      P :- R, S
  ...
  P :- R, S, V, W, X

We also use a random-sampling approach.

Advantages of ILP

• Can produce first-order rules for skills
  • Captures only the essential aspects of the skill
  • We expect these aspects to transfer better
• Can incorporate background knowledge

  pass(Teammate)   vs.   pass(teammate1), …, pass(teammateN)

Example of a Skill Learned by ILP from KeepAway

pass(Teammate) :-
    distBetween(me, Teammate) > 14,
    passAngle(Teammate) > 30,
    passAngle(Teammate) < 150,
    distBetween(me, Opponent) < 7.

We also gave “human” advice about shooting, since that is a new skill in BreakAway.
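To make the first-order flavor of the learned rule concrete, here is a small sketch that grounds the logical variables Teammate and Opponent against a single state; the state representation and names are assumptions for illustration.

# Sketch: evaluate the learned pass(Teammate) rule on one state by trying
# each possible binding of the logical variable Teammate.
def pass_rule_fires(state, teammate, opponents):
    """state maps ('distBetween', obj) and ('passAngle', obj) to numbers."""
    return (state[("distBetween", teammate)] > 14 and
            30 < state[("passAngle", teammate)] < 150 and
            any(state[("distBetween", opp)] < 7 for opp in opponents))

def teammates_to_pass_to(state, teammates, opponents):
    # The variable Teammate ranges over all teammates (one first-order rule),
    # rather than needing a separate propositional rule per teammate.
    return [t for t in teammates if pass_rule_fires(state, t, opponents)]

state = {("distBetween", "teammate1"): 10, ("passAngle", "teammate1"): 60,
         ("distBetween", "teammate2"): 20, ("passAngle", "teammate2"): 45,
         ("distBetween", "opponent1"): 5}
print(teammates_to_pass_to(state, ["teammate1", "teammate2"], ["opponent1"]))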

TL Level 7: KA to BA Raw Curves

TL Level 7: KA to BA Averaged Curves

TL Level 7: Statistics (TL metrics computed on average reward)

Type  Metric                                 KA to BA            MD to BA
                                             Score    P Value    Score    P Value
I     Jump start                             0.05     0.0312     0.08     0.0086
      Jump start smoothed                    0.08     0.0002     0.06     0.0014
II    Transfer ratio                         1.82     0.0034     1.86     0.0004
      Transfer ratio (truncated)             1.82     0.0032     1.86     0.0004
      Average relative reduction (narrow)    0.58     0.0042     0.54     0.0004
      Average relative reduction (wide)      0.70     0.0018     0.71     0.0008
      Ratio (of area under the curves)       1.37     0.0056     1.41     0.0012
      Transfer difference                    503.57   0.0046     561.27   0.0008
      Transfer difference (scaled)           1017.00  0.0040     1091.2   0.0016
III   Asymptotic advantage                   0.09     0.0086     0.11     0.0040
      Asymptotic advantage smoothed          0.08     0.0116     0.10     0.0030

Boldface indicates a significant difference was found

Conclusion

• Can use much more than I/O pairs in ML

• Give advice to computers; they automatically refine it based on feedback from the user or the environment

• Advice is an appealing mechanism for transferring learned knowledge computer-to-computer

Some Papers (on-line, use Google :-)

Creating Advice-Taking Reinforcement Learners, Maclin & Shavlik, Machine Learning 1996

Knowledge-Based Support Vector Machine Classifiers, Fung, Mangasarian, & Shavlik, NIPS 2002

Knowledge-Based Nonlinear Kernel Classifiers, Fung, Mangasarian, & Shavlik, COLT 2003

Knowledge-Based Kernel Approximation, Mangasarian, Shavlik, & Wild, JMLR 2004

Giving Advice about Preferred Actions to Reinforcement Learners Via Knowledge-Based Kernel Regression, Maclin, Shavlik, Torrey, Walker, & Wild, AAAI 2005

Skill Acquisition via Transfer Learning and Advice Taking, Torrey, Shavlik, Walker, & Maclin, ECML 2006

Backups

Breakdown of Results

[Figure: probability of scoring a goal vs. games played (0 to 5000), comparing four learners: all advice, transfer advice only, user advice only, and no advice.]

What if User Advice is Bad?

[Figure: probability of scoring a goal vs. games played (0 to 5000), comparing transfer with good advice, transfer with bad advice, bad advice only, and no advice.]

Related Work on Transfer

• Q-function transfer in RoboCup
  • Taylor & Stone (AAMAS 2005, AAAI 2005)
• Transfer via policy reuse
  • Fernandez & Veloso (AAMAS 2006, ICML workshop 2006)
  • Madden & Howley (AI Review 2004)
  • Torrey et al. (ECML 2005)
• Transfer via relational RL
  • Driessens et al. (ICML workshop 2006)
