Modelling Motivation for Experience-Based Attention
Focus in Reinforcement Learning
Candidate: Kathryn Merrick
School of Information Technologies
University of Sydney
PhD Thesis Defence
July, 2007
Supervisor: Prof. Mary Lou Maher, Key Centre for Design Computing and Cognition, University of Sydney
Introduction
Learning environments may be complex, with many states and possible actions
The tasks to be learned may change over time
It may be difficult to predict tasks in advance
Doing ‘everything’ may be infeasible
How can artificial agents focus attention to develop behaviours in complex, dynamic environments?
This thesis considers this question in conjunction with reinforcement learning, through five objectives:
1. Develop models of motivation that focus attention based on experiences
2. Model complex, dynamic environments using a representation that enables adaptive behaviour
3. Develop learning agents with three aspects of attention focus:
• Behavioural cycles
• Adaptive behaviour
• Multi-task learning
4. Develop metrics for comparing adaptability and multi-task learning behaviour of MRL agents.
5. Evaluate performance and scalability of MRL agents using different models of motivation and different RL approaches.
[Chart: maximum behavioural complexity across Environments 1-5, comparing Interest + Competence agents with the Baseline.]
[Chart: the Wundt curve. Interest (reward) I(2N(t)) = R(t) plotted against novelty 2N(t), showing positive feedback, negative feedback, and the resulting interest curve.]
[Diagram: a behavioural cycle through states S1-S4 linked by actions A1-A4.]
Modelling Motivation as Experience-Based Reward
[Charts: novelty N(t) decaying over time in response to a stimulus, and the Wundt curve mapping novelty 2N(t) to interest reward I(2N(t)) = R(t) via positive and negative feedback.]
[Charts: policy error over time, and competence reward plotted against error, showing positive and negative feedback.]
Interest and competence motivation: Rm(t) = max(I(t), C(t))
1. Compute observations and events OS(t), ES(t)
2. Task selection using a self-organising map
3. Compute experience-based reward using policy error and Deci and Ryan's model of optimal challenges
4. Arbitrate by taking the maximum of interest and competence motivation

Interest motivation: Rm(t) = I(t)
1. Compute observations and events OS(t), ES(t)
2. Task selection using a self-organising map
3. Compute experience-based reward using Stanley's model of habituation and the Wundt curve
4. No arbitration required
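To make the two pipelines concrete, the following is a minimal Python sketch, assuming Stanley's habituation model for novelty, sigmoid positive and negative feedback for the Wundt curve, and, as a labelled assumption, the same curve shape for the competence reward; all constants are illustrative, not the thesis's values.

```python
import math

# Illustrative constants only; the thesis's actual parameter settings are
# given in the Wundt-curve captions later in this document.
ALPHA, TAU = 1.05, 14.3          # hypothetical habituation constants
RHO_POS, RHO_NEG = 10.0, 10.0    # sigmoid gradients
F_MIN_POS, F_MIN_NEG = 0.5, 1.5  # turning points of the two sigmoids
F_MAX = 1.0                      # maximum feedback magnitude

def habituate(n, stimulus):
    """One step of Stanley's habituation model with resting level N0 = 1:
    tau * dN/dt = alpha * (N0 - N) - stimulus."""
    return n + (ALPHA * (1.0 - n) - stimulus) / TAU

def sigmoid(x, rho, f_min):
    return F_MAX / (1.0 + math.exp(-rho * (x - f_min)))

def wundt(x):
    """Wundt curve: positive feedback minus negative feedback."""
    return sigmoid(x, RHO_POS, F_MIN_POS) - sigmoid(x, RHO_NEG, F_MIN_NEG)

def interest(novelty):
    """Interest reward I(2N(t)) = R(t), evaluated at 2N(t) as in the figures."""
    return wundt(2.0 * novelty)

def competence(error):
    """Competence reward after Deci and Ryan's optimal challenges: highest
    at intermediate policy error (assumed here to share the Wundt shape)."""
    return wundt(error)

def motivation_reward(novelty, error, use_competence=True):
    """Rm(t) = max(I(t), C(t)) for the combined model, or I(t) alone."""
    i = interest(novelty)
    return max(i, competence(error)) if use_competence else i
```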
Representing Complex, Dynamic Environments
S → <sensations>
<sensations> → <PiSensations> <sensations> | ε
<PiSensations> → <sj> <PiSensations> | ε
<sj> → <number> | <string>
<number> → 1 | 2 | 3 | ...
<string> → ...
P = {P1, P2, P3, …, Pi , …}
A → <actions>
<actions> → <PiActions> <actions> | ε
<PiActions> → <Aj> <PiActions> | ε
<Aj> → ...
S(1) = (<visiblePick:1> <visibleForge:1> <visibleSmithy:1>)
A(1) = {A(pick-up, pick), A(pick-up, forge), A(pick-up, smithy)}
S(2) = (<visibleAxe:1> <visibleLathe:1>)
A(2) = {A(pick-up, axe), A(pick-up, lathe)}
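The action set available in each state can be derived from the sensed state itself; below is a hypothetical Python sketch assuming every sensation of the form <visibleX:1> affords a pick-up action on X. The representation of states as label-value pairs is an illustrative choice.

```python
def available_actions(sensed_state):
    """Return pick-up actions only for objects visible in the sensed state."""
    return [("pick-up", label[len("visible"):].lower())
            for label, value in sensed_state
            if value == 1 and label.startswith("visible")]

S1 = [("visiblePick", 1), ("visibleForge", 1), ("visibleSmithy", 1)]
S2 = [("visibleAxe", 1), ("visibleLathe", 1)]

print(available_actions(S1))  # [('pick-up', 'pick'), ('pick-up', 'forge'), ('pick-up', 'smithy')]
print(available_actions(S2))  # [('pick-up', 'axe'), ('pick-up', 'lathe')]
```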
Metrics and Evaluation
A classification of different types of MRL and the role played by motivation in these approaches.
Metrics for comparing learned behavioural cycles in terms of adaptability and multi-task learning.
Evaluation of the performance and scalability of MRL agents using different:
• Models of motivation
• RL approaches
• Types of environment
New approaches to the design of non-player characters for games, which can adapt in open-ended virtual worlds.
Experiment 1
[Charts: behavioural variety and maximum behavioural complexity achieved by MFRL, MMORL, and MHRL agents, comparing Interest + Competence motivation with the Baseline.]
Task-oriented learning emerges when a task-independent motivation signal is used to direct learning.
The greatest behavioural variety in simple environments is achieved by MFRL agents.
The greatest behavioural complexity is achieved by MFRL and MHRL agents, which can interleave solutions to multiple tasks.
[Charts for Experiments 2-4 (MFRL, MMORL, MHRL): behavioural variety and maximum behavioural complexity across Environments 1-5, and behavioural variety over 100,000 time steps, comparing Interest + Competence motivation with the Baseline.]
MFRL agents are the most adaptable and the most scalable as the number of tasks in the environment increases.
MMORL agents are the most scalable as the complexity of tasks increases.
Agents motivated by interest and competence achieve greater adaptability and show increased behavioural variety and complexity.
Conclusions
MRL agents can learn task-oriented behavioural cycles using a task-independent motivation signal.
The greatest behavioural variety and complexity in simple environments is achieved by MFRL agents.
The greatest adaptability is displayed by MRL agents motivated by interest and competence.
The most scalable approach when recall is required uses MMORL.
Limitations and Future Work
Scalability of MRL in other types of environments
Additional approaches to motivation:
• Biological models
• Cognitive models
• Social models
• Combined models
Motivation in other machine learning settings:
• Motivated supervised learning
• Motivated unsupervised learning
Additional metrics for MRL:
• Usefulness
• Intelligence
• Rationality
[Image: an open-ended virtual world (Linden, 2007)]
Tasks
• Maintenance tasks: observations
• Achievement tasks: events
[Diagram: the agent senses the world state through sensors, producing a sensed state and an observation.]
E(t) = S(t) − S(t′) = (Δ(s1(t), s1(t′)), Δ(s2(t), s2(t′)), …, Δ(sL(t), sL(t′)), …)
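A minimal Python sketch of this event computation, assuming sensed states are represented as dicts from sensation labels to numeric values; only changed sensations are kept, matching the Δ terms above.

```python
def event(s_now, s_prev):
    """E(t) = S(t) - S(t'): the difference of successive sensed states,
    omitting sensations that did not change."""
    labels = set(s_now) | set(s_prev)
    return {l: s_now.get(l, 0) - s_prev.get(l, 0)
            for l in labels
            if s_now.get(l, 0) != s_prev.get(l, 0)}

# Example: using a Food Machine produces Food.
s_prev = {"Food Machine": 1}
s_now = {"Food Machine": 1, "Food": 1}
print(event(s_now, s_prev))  # {'Food': 1}
```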
Behavioural Cycles
S1 = (<location:Food Machine> <Food Machine:1>)
A1 = use(Food Machine)
S2 = (<location:Food Machine> <Food Machine:1> <Food:1>)
A2 = move to(Food)
S3 = (<location:Food> <Food Machine:1> <Food:1>)
A3 = use(Food)
S4 = (<location:NO_OBJECT> <Food Machine:1> <Food:1>)
A4 = move to(Food Machine)
…
[Diagram: behavioural cycles of increasing length, from a one-step loop (S1, A1) to an n-step cycle through states S1 … Sn and actions A1 … An.]
Agent Models
[Diagrams: MFRL and MMORL agent models. Sensors produce the sensed state S(t); the motivation process M computes observations O(t), events E(t), and the reward Rm(t); the RL process updates the policy π(t) from π(t−1), S(t−1), and A(t−1); effectors execute the chosen action A(t). The MMORL model adds behaviours B(t) with reflexes, and a MORL update of B(t−1).π using S(t−1), S(t), B(t−1).A, and B(t−1).Ω(S(t−1)).]
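A minimal Python sketch of the MFRL sense-motivate-learn-act loop implied by the diagram, assuming tabular Q-learning as the RL component; `world`, `motivation`, and the hashable state encoding are hypothetical interfaces for illustration, not the thesis's implementation.

```python
import random
from collections import defaultdict

GAMMA, LR, EPSILON = 0.9, 0.1, 0.1  # illustrative RL constants
Q = defaultdict(float)              # tabular action-value function

def choose_action(state, actions):
    """Epsilon-greedy action selection over the current action set."""
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def step(world, motivation, s_prev, a_prev):
    """One pass of the sense-motivate-learn-act loop."""
    s = world.sense()                      # S(t) from sensors
    r = motivation.reward(s_prev, s)       # Rm(t): experience-based reward
    actions = world.available_actions(s)
    a = choose_action(s, actions)
    if a_prev is not None:                 # Q-learning update of the policy
        best = max(Q[(s, b)] for b in actions)
        Q[(s_prev, a_prev)] += LR * (r + GAMMA * best - Q[(s_prev, a_prev)])
    world.act(a)                           # effectors execute A(t)
    return s, a                            # carried into the next step
```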
[Figure: sensitivity of the Wundt curve. Four panels plot interest (reward) I(2N(t)) = R(t) against novelty 2N(t), each showing positive feedback, negative feedback, and the resulting interest curve.]
Change in interest with (a) ρ+ = ρ− = 5, F+min = 0.5, F−min = 1.5 and (b) ρ+ = ρ− = 30, F+min = 0.5, F−min = 1.5.
Change in interest with (a) ρ+ = ρ− = 10, F+min = 0.1, F−min = 1.9 and (b) ρ+ = ρ− = 10, F+min = 0.9, F−min = 1.1.
Metrics
• A task is complete when its defining observation or event is achieved
• A task is learned when the standard deviation of the number of actions in each behavioural cycle completing the task is less than some error threshold
• Behavioural variety measures the number of tasks learned
• Behavioural complexity measures the number of actions in a behavioural cycle
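These metrics could be computed as below; a minimal Python sketch assuming `cycles` maps each task to the list of action counts of the behavioural cycles that completed it, with a hypothetical error threshold.

```python
import statistics

ERROR_THRESHOLD = 1.0  # hypothetical threshold for "learned"

def is_learned(action_counts):
    """A task is learned when the standard deviation of the number of
    actions per completing cycle falls below the threshold."""
    return (len(action_counts) > 1 and
            statistics.stdev(action_counts) < ERROR_THRESHOLD)

def behavioural_variety(cycles):
    """Number of tasks learned."""
    return sum(is_learned(counts) for counts in cycles.values())

def behavioural_complexity(cycles):
    """Maximum number of actions in a learned behavioural cycle."""
    learned = [max(c) for c in cycles.values() if is_learned(c)]
    return max(learned, default=0)
```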