Modelling Motivation for Experience-Based Attention
Focus in Reinforcement Learning
Candidate: Kathryn Merrick
School of Information Technologies
University of Sydney
PhD Thesis Defence
July, 2007
Supervisor: Prof. Mary Lou Maher, Key Centre for Design Computing and Cognition, University of Sydney
Introduction
Learning environments may be complex, with many states and possible actions
The tasks to be learned may change over time
It may be difficult to predict tasks in advance
Doing ‘everything’ may be infeasible
How can artificial agents focus attention to develop behaviours in complex, dynamic environments?
This thesis considers this question in conjunction with reinforcement learning, through five objectives:
1. Develop models of motivation that focus attention based on experiences
2. Model complex, dynamic environments using a representation that enables adaptive behaviour
3. Develop learning agents with three aspects of attention focus:
• Behavioural cycles
• Adaptive behaviour
• Multi-task learning
4. Develop metrics for comparing adaptability and multi-task learning behaviour of MRL agents.
5. Evaluate performance and scalability of MRL agents using different models of motivation and different RL approaches.
[Chart: maximum behavioural complexity across Environments 1-5, comparing Interest + Competence agents with the Baseline.]
[Chart: the Wundt curve. Interest (reward) I(2N(t)) = R(t) plotted against novelty 2N(t), showing positive feedback, negative feedback, and the resulting interest curve.]
[Diagram: a behavioural cycle through states S1-S4 linked by actions A1-A4.]
Modelling Motivation as Experience-Based Reward
[Charts: novelty N(t) decaying over time in response to a stimulus, and the Wundt curve mapping novelty 2N(t) to interest reward I(2N(t)) = R(t) via positive and negative feedback.]
[Charts: policy error over time, and competence reward plotted against error, showing positive and negative feedback.]
Interest and competence motivation: Rm(t) = max(I(t), C(t))
1. Compute observations and events OS(t), ES(t)
2. Task selection using a self-organising map
3. Compute experience-based reward using policy error and Deci and Ryan's model of optimal challenges
4. Arbitrate by taking the maximum of interest and competence motivation

Interest motivation: Rm(t) = I(t)
1. Compute observations and events OS(t), ES(t)
2. Task selection using a self-organising map
3. Compute experience-based reward using Stanley's model of habituation and the Wundt curve
4. No arbitration required
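To make the two pipelines concrete, the following is a minimal Python sketch, assuming Stanley's habituation model for novelty, sigmoid positive and negative feedback for the Wundt curve, and, as a labelled assumption, the same curve shape for the competence reward; all constants are illustrative, not the thesis's values.

```python
import math

# Illustrative constants only; the thesis's actual parameter settings are
# given in the Wundt-curve captions later in this document.
ALPHA, TAU = 1.05, 14.3          # hypothetical habituation constants
RHO_POS, RHO_NEG = 10.0, 10.0    # sigmoid gradients
F_MIN_POS, F_MIN_NEG = 0.5, 1.5  # turning points of the two sigmoids
F_MAX = 1.0                      # maximum feedback magnitude

def habituate(n, stimulus):
    """One step of Stanley's habituation model with resting level N0 = 1:
    tau * dN/dt = alpha * (N0 - N) - stimulus."""
    return n + (ALPHA * (1.0 - n) - stimulus) / TAU

def sigmoid(x, rho, f_min):
    return F_MAX / (1.0 + math.exp(-rho * (x - f_min)))

def wundt(x):
    """Wundt curve: positive feedback minus negative feedback."""
    return sigmoid(x, RHO_POS, F_MIN_POS) - sigmoid(x, RHO_NEG, F_MIN_NEG)

def interest(novelty):
    """Interest reward I(2N(t)) = R(t), evaluated at 2N(t) as in the figures."""
    return wundt(2.0 * novelty)

def competence(error):
    """Competence reward after Deci and Ryan's optimal challenges: highest
    at intermediate policy error (assumed here to share the Wundt shape)."""
    return wundt(error)

def motivation_reward(novelty, error, use_competence=True):
    """Rm(t) = max(I(t), C(t)) for the combined model, or I(t) alone."""
    i = interest(novelty)
    return max(i, competence(error)) if use_competence else i
```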
Representing Complex, Dynamic Environments
S → <sensations>
<sensations> → <PiSensations> <sensations> | ε
<PiSensations> → <sj> <PiSensations> | ε
<sj> → <number> | <string>
<number> → 1 | 2 | 3 | ...
<string> → ...
P = {P1, P2, P3, …, Pi , …}
A → <actions>
<actions> → <PiActions> <actions> | ε
<PiActions> → <Aj> <PiActions> | ε
<Aj> → ...
S(1) = (<visiblePick:1> <visibleForge:1> <visibleSmithy:1>)
A(1) = {A(pick-up, pick), A(pick-up, forge), A(pick-up, smithy)}
S(2) = (<visibleAxe:1> <visibleLathe:1>)
A(2) = {A(pick-up, axe), A(pick-up, lathe)}
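The action set available in each state can be derived from the sensed state itself; below is a hypothetical Python sketch assuming every sensation of the form <visibleX:1> affords a pick-up action on X. The representation of states as label-value pairs is an illustrative choice.

```python
def available_actions(sensed_state):
    """Return pick-up actions only for objects visible in the sensed state."""
    return [("pick-up", label[len("visible"):].lower())
            for label, value in sensed_state
            if value == 1 and label.startswith("visible")]

S1 = [("visiblePick", 1), ("visibleForge", 1), ("visibleSmithy", 1)]
S2 = [("visibleAxe", 1), ("visibleLathe", 1)]

print(available_actions(S1))  # [('pick-up', 'pick'), ('pick-up', 'forge'), ('pick-up', 'smithy')]
print(available_actions(S2))  # [('pick-up', 'axe'), ('pick-up', 'lathe')]
```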
Metrics and Evaluation
A classification of different types of MRL and the role played by motivation in these approaches.
Metrics for comparing learned behavioural cycles in terms of adaptability and multi-task learning.
Evaluation of the performance and scalability of MRL agents using different:
• Models of motivation
• RL approaches
• Types of environment
New approaches to the design of non-player characters for games, which can adapt in open-ended virtual worlds.
Experiment 1
[Charts: behavioural variety and maximum behavioural complexity achieved by MFRL, MMORL, and MHRL agents, comparing Interest + Competence motivation with the Baseline.]
Task-oriented learning emerges when a task-independent motivation signal is used to direct learning.
The greatest behavioural variety in simple environments is achieved by MFRL agents.
The greatest behavioural complexity is achieved by MFRL and MHRL agents, which can interleave solutions to multiple tasks.
[Charts for Experiments 2-4 (MFRL, MMORL, MHRL): behavioural variety and maximum behavioural complexity across Environments 1-5, and behavioural variety over 100,000 time steps, comparing Interest + Competence motivation with the Baseline.]
MFRL agents are the most adaptable and the most scalable as the number of tasks in the environment increases.
MMORL agents are the most scalable as the complexity of tasks increases.
Agents motivated by interest and competence achieve greater adaptability and show increased behavioural variety and complexity.
Conclusions
MRL agents can learn task-oriented behavioural cycles using a task-independent motivation signal.
The greatest behavioural variety and complexity in simple environments is achieved by MFRL agents.
The greatest adaptability is displayed by MRL agents motivated by interest and competence.
The most scalable approach when recall is required uses MMORL.
Limitations and Future Work
Scalability of MRL in other types of environments
Additional approaches to motivation:
• Biological models
• Cognitive models
• Social models
• Combined models
Motivation in other machine learning settings:
• Motivated supervised learning
• Motivated unsupervised learning
Additional metrics for MRL:
• Usefulness
• Intelligence
• Rationality
[Image: an open-ended virtual world (Linden, 2007)]
Tasks
• Maintenance tasks: observations
• Achievement tasks: events
[Diagram: the agent senses the world state through sensors, producing a sensed state and an observation.]
E(t) = S(t) − S(t′) = (Δ(s1(t), s1(t′)), Δ(s2(t), s2(t′)), …, Δ(sL(t), sL(t′)), …)
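A minimal Python sketch of this event computation, assuming sensed states are represented as dicts from sensation labels to numeric values; only changed sensations are kept, matching the Δ terms above.

```python
def event(s_now, s_prev):
    """E(t) = S(t) - S(t'): the difference of successive sensed states,
    omitting sensations that did not change."""
    labels = set(s_now) | set(s_prev)
    return {l: s_now.get(l, 0) - s_prev.get(l, 0)
            for l in labels
            if s_now.get(l, 0) != s_prev.get(l, 0)}

# Example: using a Food Machine produces Food.
s_prev = {"Food Machine": 1}
s_now = {"Food Machine": 1, "Food": 1}
print(event(s_now, s_prev))  # {'Food': 1}
```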
Behavioural Cycles
S1 = (<location:Food Machine> <Food Machine:1>)
A1 = use(Food Machine)
S2 = (<location:Food Machine> <Food Machine:1> <Food:1>)
A2 = move to(Food)
S3 = (<location:Food> <Food Machine:1> <Food:1>)
A3 = use(Food)
S4 = (<location:NO_OBJECT> <Food Machine:1> <Food:1>)
A4 = move to(Food Machine)
…
[Diagram: behavioural cycles of increasing length, from a one-step loop (S1, A1) to an n-step cycle through states S1 … Sn and actions A1 … An.]
Agent Models
[Diagrams: MFRL and MMORL agent models. Sensors produce the sensed state S(t); the motivation process M computes observations O(t), events E(t), and the reward Rm(t); the RL process updates the policy π(t) from π(t−1), S(t−1), and A(t−1); effectors execute the chosen action A(t). The MMORL model adds behaviours B(t) with reflexes, and a MORL update of B(t−1).π using S(t−1), S(t), B(t−1).A, and B(t−1).Ω(S(t−1)).]
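A minimal Python sketch of the MFRL sense-motivate-learn-act loop implied by the diagram, assuming tabular Q-learning as the RL component; `world`, `motivation`, and the hashable state encoding are hypothetical interfaces for illustration, not the thesis's implementation.

```python
import random
from collections import defaultdict

GAMMA, LR, EPSILON = 0.9, 0.1, 0.1  # illustrative RL constants
Q = defaultdict(float)              # tabular action-value function

def choose_action(state, actions):
    """Epsilon-greedy action selection over the current action set."""
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def step(world, motivation, s_prev, a_prev):
    """One pass of the sense-motivate-learn-act loop."""
    s = world.sense()                      # S(t) from sensors
    r = motivation.reward(s_prev, s)       # Rm(t): experience-based reward
    actions = world.available_actions(s)
    a = choose_action(s, actions)
    if a_prev is not None:                 # Q-learning update of the policy
        best = max(Q[(s, b)] for b in actions)
        Q[(s_prev, a_prev)] += LR * (r + GAMMA * best - Q[(s_prev, a_prev)])
    world.act(a)                           # effectors execute A(t)
    return s, a                            # carried into the next step
```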
[Figure: sensitivity of the Wundt curve. Four panels plot interest (reward) I(2N(t)) = R(t) against novelty 2N(t), each showing positive feedback, negative feedback, and the resulting interest curve.]
Change in interest with (a) ρ+ = ρ− = 5, F+min = 0.5, F−min = 1.5 and (b) ρ+ = ρ− = 30, F+min = 0.5, F−min = 1.5.
Change in interest with (a) ρ+ = ρ− = 10, F+min = 0.1, F−min = 1.9 and (b) ρ+ = ρ− = 10, F+min = 0.9, F−min = 1.1.
Metrics
• A task is complete when its defining observation or event is achieved
• A task is learned when the standard deviation of the number of actions in each behavioural cycle completing the task is less than some error threshold
• Behavioural variety measures the number of tasks learned
• Behavioural complexity measures the number of actions in a behavioural cycle
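These metrics could be computed as below; a minimal Python sketch assuming `cycles` maps each task to the list of action counts of the behavioural cycles that completed it, with a hypothetical error threshold.

```python
import statistics

ERROR_THRESHOLD = 1.0  # hypothetical threshold for "learned"

def is_learned(action_counts):
    """A task is learned when the standard deviation of the number of
    actions per completing cycle falls below the threshold."""
    return (len(action_counts) > 1 and
            statistics.stdev(action_counts) < ERROR_THRESHOLD)

def behavioural_variety(cycles):
    """Number of tasks learned."""
    return sum(is_learned(counts) for counts in cycles.values())

def behavioural_complexity(cycles):
    """Maximum number of actions in a learned behavioural cycle."""
    learned = [max(c) for c in cycles.values() if is_learned(c)]
    return max(learned, default=0)
```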