
Page 1:

Modelling Motivation for Experience-Based Attention Focus in Reinforcement Learning

Candidate: Kathryn Merrick
School of Information Technologies, University of Sydney

PhD Thesis Defence, July 2007

Supervisor: Prof. Mary Lou Maher
Key Centre for Design Computing and Cognition, University of Sydney

Page 2:

Introduction


Learning environments may be complex, with many states and possible actions

The tasks to be learned may change over time

It may be difficult to predict tasks in advance

Doing ‘everything’ may be infeasible

How can artificial agents focus attention to develop behaviours in complex, dynamic environments?

This thesis considers this question in conjunction with reinforcement learning

Page 3:

Objectives

1. Develop models of motivation that focus attention based on experiences

2. Model complex, dynamic environments using a representation that enables adaptive behaviour

3. Develop learning agents with three aspects of attention focus: behavioural cycles, adaptive behaviour and multi-task learning

4. Develop metrics for comparing the adaptability and multi-task learning behaviour of motivated reinforcement learning (MRL) agents

5. Evaluate the performance and scalability of MRL agents using different models of motivation and different reinforcement learning (RL) approaches


[Figure thumbnails, repeated from later slides: maximum behavioural complexity across Environments 1-5 (interest + competence vs baseline); the Wundt curve mapping novelty 2N(t) to interest reward I(2N(t)) = R(t); and a behavioural cycle over states S1-S4 and actions A1-A4.]
Page 4:

Modelling Motivation as Experience-Based Reward

[Figures: the Wundt curve, mapping novelty 2N(t) to interest reward I(2N(t)) = R(t) as positive feedback minus negative feedback; novelty N(t) habituating over time under a repeated stimulus; policy error falling over time; and competence reward as a Wundt-like function of policy error.]

Motivation as interest and competence, Rm(t) = max(I(t), C(t)):

1. Compute observations and events OS(t), ES(t)
2. Task selection using a self-organising map
3. Compute experience-based reward using policy error and Deci and Ryan's model of optimal challenges
4. Arbitrate by taking the maximum of interest and competence motivation

Motivation as interest, Rm(t) = I(t):

1. Compute observations and events OS(t), ES(t)
2. Task selection using a self-organising map
3. Compute experience-based reward using Stanley's model of habituation and the Wundt curve
4. No arbitration required
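The interest reward can be sketched in a few lines of Python. This is a minimal sketch, assuming a discrete-time habituation update in the style of Stanley's model and unit-height sigmoids for the Wundt curve's positive and negative feedback; the function names, constants and parameter defaults (ρ+, ρ−, F+min, F−min, matching the sensitivity slide) are illustrative, not the thesis implementation.

    import math

    def habituate(novelty, stimulus, alpha=1.05, tau=14.3):
        # One discrete habituation step: novelty decays towards 0
        # while the stimulus is present and recovers towards 1 when
        # it is absent. Constants are illustrative assumptions.
        return novelty + (alpha * (1.0 - novelty) - stimulus) / tau

    def wundt_interest(novelty, rho_pos=10.0, rho_neg=10.0,
                       f_pos_min=0.5, f_neg_min=1.5):
        # Wundt curve: positive sigmoid feedback minus negative
        # sigmoid feedback, evaluated at 2N(t) as on the slide's
        # axis, so moderately novel experiences earn most reward.
        x = 2.0 * novelty
        f_pos = 1.0 / (1.0 + math.exp(-rho_pos * (x - f_pos_min)))
        f_neg = 1.0 / (1.0 + math.exp(-rho_neg * (x - f_neg_min)))
        return f_pos - f_neg

    # A persistent stimulus habituates, and the Wundt curve turns
    # the resulting novelty into an interest reward I(2N(t)) = R(t).
    n = 1.0
    for t in range(5):
        print(t, round(n, 3), round(wundt_interest(n), 3))
        n = habituate(n, stimulus=1.0)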

Page 5:

Representing Complex, Dynamic Environments

S → <sensations>
<sensations> → <PiSensations> <sensations> | ε
<PiSensations> → <sj> <PiSensations> | ε
<sj> → <number> | <string>
<number> → 1 | 2 | 3 | ...
<string> → ...

P = {P1, P2, P3, …, Pi, …}


A → <actions>
<actions> → <PiActions> <actions> | ε
<PiActions> → <Aj> <PiActions> | ε
<Aj> → ...

S(1) = (<visiblePick:1> <visibleForge:1> <visibleSmithy:1>)

A(1) = {A(pick-up, pick), A(pick-up, forge), A(pick-up, smithy)}

S(2) = (<visibleAxe:1> <visibleLathe:1>)

A(2) = {A(pick-up, axe), A(pick-up, lathe)}
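To make this concrete, the fragment below is a small Python sketch, assuming sensed states are stored as attribute:value maps and that one pick-up action is instantiated per visible object; the 'visible' naming convention and the helper name are illustrative assumptions rather than the thesis code.

    # Sensed states as attribute:value maps, mirroring S(1) and S(2).
    s1 = {"visiblePick": 1, "visibleForge": 1, "visibleSmithy": 1}
    s2 = {"visibleAxe": 1, "visibleLathe": 1}

    def available_actions(sensed_state):
        # Instantiate one pick-up action per visible object, so the
        # action set A(t) varies with the sensed state S(t).
        return [("pick-up", name[len("visible"):].lower())
                for name, value in sensed_state.items()
                if name.startswith("visible") and value == 1]

    print(available_actions(s1))
    # [('pick-up', 'pick'), ('pick-up', 'forge'), ('pick-up', 'smithy')]
    print(available_actions(s2))
    # [('pick-up', 'axe'), ('pick-up', 'lathe')]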

Page 6:

Metrics and Evaluation

A classification of different types of MRL and the role played by motivation in these approaches

Metrics for comparing learned behavioural cycles in terms of adaptability and multi-task learning

Evaluation of the performance and scalability of MRL agents using different models of motivation, RL approaches and types of environment

New approaches to the design of non-player characters for games, which can adapt in open-ended virtual worlds


Page 7:

Experiment 1

[Figures: behavioural variety (left) and maximum behavioural complexity (right) for MFRL, MMORL and MHRL agents, comparing interest + competence motivation against the baseline.]


Task-oriented learning emerges when a task-independent motivation signal is used to direct learning.

Greatest behavioural variety in simple environments is achieved by MFRL agents

Greatest behavioural complexity is achieved by MFRL and MHRL agents, which can interleave solutions to multiple tasks

Page 8:

[Figures: one row of plots per agent model (MFRL, MMORL, MHRL). Experiment 2: behavioural variety across Environments 1-5; Experiment 3: maximum behavioural complexity across Environments 1-5; Experiment 4: behavioural variety over 100,000 time steps. Each plot compares interest + competence motivation against the baseline.]


MFRL agents are the most adaptable and the most scalable as the number of tasks in the environment increases

MMORL agents are the most scalable as the complexity of tasks increases

Agents motivated by interest and competence achieve greater adaptability, and show increased behavioural variety and complexity

Page 9:

Conclusions

MRL agents can learn task-oriented behavioural cycles using a task-independent motivation signal

The greatest behavioural variety and complexity in simple environments is achieved by MFRL agents

The greatest adaptability is displayed by MRL agents motivated by interest and competence

The most scalable approach when recall is required uses MMORL

Page 10:

Limitations and Future Work

Scalability of MRL in other types of environments

Additional approaches to motivation: biological, cognitive, social and combined models

Motivation in other machine learning settings: motivated supervised learning and motivated unsupervised learning

Additional metrics for MRL: usefulness, intelligence and rationality


[Image: (Linden, 2007)]

Page 11:
Page 12:

Tasks

• Maintenance tasks: observations

• Achievement tasks: events

[Figure: the agent's sensors transform the world state into a sensed state, from which observations are computed.]

E(t) = S(t) − S(t′) = (Δ(s1(t), s1(t′)), Δ(s2(t), s2(t′)), …, Δ(sL(t), sL(t′)), …)
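As a minimal sketch of this difference in Python, assuming sensed states are attribute:value maps and that unchanged sensations are dropped from the event (the function name is illustrative):

    def event(s_now, s_prev):
        # E(t) = S(t) - S(t'): per-sensation differences between two
        # sensed states, keeping only the sensations that changed.
        keys = set(s_now) | set(s_prev)
        diff = {k: s_now.get(k, 0) - s_prev.get(k, 0) for k in keys}
        return {k: d for k, d in diff.items() if d != 0}

    before = {"Food Machine": 1}
    after = {"Food Machine": 1, "Food": 1}
    print(event(after, before))   # {'Food': 1}, a 'food appeared' event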

Page 13:

Behavioural Cycles

S1 = (<location:Food Machine> <Food Machine:1>)

A1 = use(Food Machine)

S2 = (<location:Food Machine> <Food Machine:1> <Food:1>)

A2 = move to(Food)

S3 = (<location:Food> <Food Machine:1> <Food:1>)

A3 = use(Food)

S4 = (<location:NO_OBJECT> <Food Machine:1> <Food:1>)

A4 = move to(Food Machine)

[Figure: behavioural cycles of increasing length, from a single-action loop on S1 up to an n-action cycle over states S1…Sn with actions A1…An.]
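One way to operationalise a behavioural cycle is to scan a state-action trace for a repeated state and read off the actions between its two occurrences. The fragment below is an illustrative Python sketch of that idea, not the algorithm used in the thesis.

    def find_cycle(trace):
        # trace: list of (state, action) pairs in execution order.
        # Returns the action sequence between the first repeated
        # state and its earlier occurrence, or None if no cycle.
        seen = {}
        for i, (state, _) in enumerate(trace):
            if state in seen:
                return [a for _, a in trace[seen[state]:i]]
            seen[state] = i
        return None

    trace = [("S1", "A1"), ("S2", "A2"), ("S3", "A3"),
             ("S4", "A4"), ("S1", "A1")]
    print(find_cycle(trace))   # ['A1', 'A2', 'A3', 'A4']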

Page 14:

Agent Models

[Figures: the MFRL (left) and MMORL (right) agent models. In both, sensors transform the world state W(t) into a sensed state S(t); observations and events O(t−1), E(t−1) feed a motivation process M that produces the reward Rm(t); and effectors execute the chosen action A(t). In MFRL, a single flat RL process receives S(t) and Rm(t) and updates the policy π(t) from π(t−1), S(t−1) and A(t−1). In MMORL, a reflex process selects a behaviour B(t) from a set of learned behaviours; a MORL process updates the selected behaviour's policy B(t−1).π using S(t−1), S(t), its action set B(t−1).A and its termination condition B(t−1).Ω(S(t−1)).]
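The MFRL data flow can be summarised as one sense-motivate-learn-act step. The sketch below uses plain epsilon-greedy Q-learning driven by the motivation reward Rm(t); the sense and motivation arguments stand in for the sensor and motivation processes in the figure, and everything here is an illustrative assumption rather than the thesis implementation.

    import random
    from collections import defaultdict

    Q = defaultdict(float)                  # Q[(state, action)]
    ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1

    def mfrl_step(world, sense, motivation, actions, s_prev, a_prev):
        s = sense(world)                         # sensors: W(t) -> S(t)
        r_m = motivation(s_prev, s)              # M: experience -> Rm(t)
        best = max(Q[(s, a)] for a in actions)   # greedy value at S(t)
        # RL: update the single flat policy from the previous step.
        Q[(s_prev, a_prev)] += ALPHA * (r_m + GAMMA * best
                                        - Q[(s_prev, a_prev)])
        if random.random() < EPSILON:            # epsilon-greedy A(t)
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda act: Q[(s, act)])
        return s, a                              # A(t) to the effectors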

Page 15:

Sensitivity

[Figures: four Wundt curves, plotting interest reward I(2N(t)) = R(t) against novelty 2N(t), showing how the positive and negative feedback sigmoids shift with the parameter settings captioned below.]

Change in interest with (a) ρ+ = ρ− = 5, F+min = 0.5 and F−min = 1.5 and (b) ρ+ = ρ− = 30, F+min = 0.5 and F−min = 1.5

Change in interest with (a) ρ+ = ρ− = 10, F+min = 0.1 and F−min = 1.9 and (b) ρ+ = ρ− = 10, F+min = 0.9 and F−min = 1.1

Page 16:

Metrics

• A task is complete when its defining observation or event is achieved

• A task is learned when the standard deviation of the number of actions in the behavioural cycles completing the task is less than some error threshold

• Behavioural variety measures the number of tasks learned

• Behavioural complexity measures the number of actions in a behavioural cycle
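These definitions translate directly into code. Below is a sketch, assuming each task's history is kept as a list of the lengths of the behavioural cycles that completed it; the names and the error threshold are illustrative.

    import statistics

    ERROR_THRESHOLD = 1.0   # illustrative threshold

    def is_learned(cycle_lengths):
        # Learned: the number of actions per completing cycle has
        # settled, i.e. its standard deviation is under the threshold.
        return (len(cycle_lengths) > 1
                and statistics.stdev(cycle_lengths) < ERROR_THRESHOLD)

    def behavioural_variety(task_histories):
        # Behavioural variety: the number of tasks learned.
        return sum(is_learned(c) for c in task_histories.values())

    def max_behavioural_complexity(task_histories):
        # Behavioural complexity: actions in the longest learned cycle.
        learned = [max(c) for c in task_histories.values()
                   if is_learned(c)]
        return max(learned, default=0)

    cycles = {"make food": [4, 4, 4], "forge axe": [9, 7, 12]}
    print(behavioural_variety(cycles))         # 1
    print(max_behavioural_complexity(cycles))  # 4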