Reinforcement Learning
(Unsupervised learning)
So far …
• Supervised machine learning: given a set of annotated instances and a set of categories, find a model to automatically categorize unseen instances
• Unsupervised learning: no examples are available
• Previous lesson: clustering
  – Given a set of instances described by feature vectors, group them into clusters such that intra-cluster similarity is maximized and inter-cluster dissimilarity is maximized
Reinforcement learning
• Reinforcement learning:
  – The agent receives no examples and starts with no model of the environment.
  – The agent gets feedback through rewards, or reinforcement.
Learning by reinforcement
• Examples:
  – Learning to play Backgammon
  – A robot learning to move in an environment
• Characteristics:
  – No direct training examples; (delayed) rewards instead
  – Need to balance exploration of the environment with exploitation
  – The environment might be stochastic and/or unknown
  – The actions of the learner affect future rewards
Example: Robot moving in an environment
Example: playing a game (e.g. backgammon, chess)
Reinforcement Learning vs. Supervised Learning
• Supervised learning: the input is an instance, the output is a classification of the instance
• Reinforcement learning: the input is some "goal", the output is a sequence of actions to meet the goal
Elements of RL
• Transition model δ: how actions A influence states S
• Reward R: immediate value of a state-action transition
• Policy π: maps states to actions

[Diagram: the agent-environment loop — the agent's policy selects an action; the environment returns the new state and a reward]

In RL, the "model" to be learned is a policy that meets "at best" some given goal (e.g. win a game)
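To make the agent-environment loop concrete, here is a minimal Python sketch (not from the slides); the `Environment` class and the random placeholder policy are illustrative assumptions only.

```python
import random

class Environment:
    """Toy illustrative environment: states 0..4, reward 1 when state 4 is reached."""
    def __init__(self):
        self.state = 0
    def step(self, action):
        # action: +1 (move right) or -1 (move left), clipped to the state range
        self.state = max(0, min(4, self.state + action))
        reward = 1 if self.state == 4 else 0
        return self.state, reward

def policy(state):
    """Placeholder policy: in RL, this mapping from states to actions is what gets learned."""
    return random.choice([-1, +1])

env = Environment()
state, total_reward = env.state, 0
for _ in range(20):                      # the agent-environment interaction loop
    action = policy(state)               # policy maps state -> action
    state, reward = env.step(action)     # environment returns new state and reward
    total_reward += reward
print("Total reward:", total_reward)
```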
Markov Decision Process (MDP)
• An MDP is a formal model of the RL problem
• At each discrete time step t:
  – The agent observes state s_t and chooses action a_t (according to some probability distribution)
  – It receives reward r_t from the environment and the state changes to s_{t+1}
• Markov assumption: r_t = r(s_t, a_t) and s_{t+1} = δ(s_t, a_t), i.e. r_t and s_{t+1} depend only on the current state and action
  – In general, the functions r and δ may not be deterministic (i.e. they are stochastic) and are not necessarily known to the agent
Agent's Learning Task
Execute actions in the environment, observe the results, and:
• Learn an action policy π : S → A that maximises the expected cumulative reward
    E[r_t + γ r_{t+1} + γ² r_{t+2} + ...]
  from any starting state in S. Here 0 ≤ γ < 1 is the discount factor for future rewards.
• Note:
  – The target function is π : S → A
  – There are no training examples of the form (s, a), but only of the form ((s, a), r)
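As a concrete illustration of the quantity being maximised, the short Python sketch below computes the discounted cumulative reward for a given reward sequence; the reward values and γ here are arbitrary and only serve as an example.

```python
def discounted_return(rewards, gamma=0.9):
    """Compute r_0 + gamma*r_1 + gamma^2*r_2 + ... for a finite reward sequence."""
    total = 0.0
    for i, r in enumerate(rewards):
        total += (gamma ** i) * r
    return total

# Example: three steps with no reward, then a reward of 100 at the goal.
print(discounted_return([0, 0, 0, 100], gamma=0.9))  # 0.9^3 * 100 = 72.9
```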
Example: TD-Gammon
• Immediate reward:
  – +100 if win
  – -100 if lose
  – 0 for all other states
• Trained by playing 1.5 million games against itself
• Now approximately equal to the best human player
Example: Mountain-Car
• States: position and velocity
• Actions: accelerate forward, accelerate backward, coast
• Rewards (two equivalent formulations):
  – Reward = -1 for every step, until the car reaches the top
  – Reward = 1 at the top, 0 otherwise, with γ < 1
• The eventual reward is maximised by minimising the number of steps to the top of the hill
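As a rough illustration (not part of the original slides), the snippet below interacts with this environment through the classic OpenAI Gym API; it assumes a pre-0.26 `gym` release (newer `gymnasium` versions return slightly different tuples) and uses a random policy just to show the -1-per-step reward structure.

```python
import gym

env = gym.make("MountainCar-v0")   # state: (position, velocity); actions: 0=backward, 1=coast, 2=forward
state = env.reset()

total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()             # random policy, just to show the interface
    state, reward, done, info = env.step(action)   # reward is -1 on every step until the top
    total_reward += reward

print("Episode return:", total_reward)
```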
Value function
We will consider deterministic worlds first: rewards are known and the environment is known.
• Given a policy π (adopted by the agent), define an evaluation function over states:
    V^π(s_t) = r_t + γ r_{t+1} + γ² r_{t+2} + ... = Σ_{i≥0} γ^i r_{t+i}
  which can be written recursively as:
    V^π(s_t) = r_t + γ V^π(s_{t+1})
Example: robot in a grid environment
r(state, action): immediate reward values

[Grid-world diagram: six possible states; arrows represent possible actions; G is the goal state. Moving into G yields an immediate reward of 100, all other transitions yield 0; for instance, r((2,3), UP) = 100.]
Example: robot in a grid environment
• Value function: maps states to state values, according to some policy π:
    V^π(s_t) = r_t + γ r_{t+1} + γ² r_{t+2} + ... = Σ_{i≥0} γ^i r_{t+i}

[Diagram: the same grid showing the r(state, action) immediate reward values (100 for moving into G, 0 otherwise) and the corresponding V*(state) values: 90, 100, 0 (G) in one row and 81, 90, 100 in the other. The "value" of being in (1,2) according to some known π is 100.]
Computing V(s) (if the policy is given)
Suppose the following is an optimal policy, denoted π*: π* tells us the best thing to do in each state.
According to π*, we can compute the values of the states for this policy, denoted V^{π*} or V*:
    π* = argmax_π V^π(s)   (for all s)

r(s, a) (immediate reward) values and V*(s) values, with γ = 0.9, for one optimal policy:
    V*(s6) = 100 + 0.9·0 = 100
    V*(s5) = 0 + 0.9·100 = 90
    V*(s4) = 0 + 0.9·90 = 81
In general: V*(s_t) = r_t + γ V*(s_{t+1})
However, π* is usually unknown, and the task is to learn the optimal policy.
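To make the arithmetic above reproducible, here is a small Python sketch (not from the slides) that evaluates V* along the path s4 → s5 → s6 → G used in the example, assuming the only reward is 100 for entering G and γ = 0.9.

```python
gamma = 0.9

# Rewards collected along the optimal path s4 -> s5 -> s6 -> G.
path_rewards = {"s4": 0, "s5": 0, "s6": 100}
order = ["s4", "s5", "s6"]        # s6 is the last state before the (absorbing) goal

V = {"G": 0.0}                    # the goal state has value 0
for state in reversed(order):     # apply V*(s_t) = r_t + gamma * V*(s_{t+1})
    next_state = order[order.index(state) + 1] if state != "s6" else "G"
    V[state] = path_rewards[state] + gamma * V[next_state]

print(V)   # {'G': 0.0, 's6': 100.0, 's5': 90.0, 's4': 81.0}
```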
Learning π*
• The target function is π : state → action
• However…
  – We have no training examples of the form <state, action>
  – We only know examples of the form <<state, action>, reward>
• The reward might not be known for all states!
Utility-based agents
• Try to learn V^{π*} (abbreviated V*)
• Perform look-ahead search to choose the best action from any state s:
    π*(s) = argmax_a [ r(s, a) + γ V*(δ(s, a)) ]
• This works well if the agent knows
  – δ : state × action → state
  – r : state × action → R
• When the agent doesn't know δ and r, it cannot choose actions this way
Q-learning: a utility-based learner in deterministic environments
• Q-values
  – Define a new function, very similar to V*:
      Q(s, a) = r(s, a) + γ V*(δ(s, a))
  – If the agent learns Q, it can choose the optimal action even without knowing δ or r
• Using Q:
      π*(s) = argmax_a Q(s, a)
Learning the Q-value
• Note: Q and V* are closely related:
    V*(s) = max_{a'} Q(s, a')
• This allows us to write Q recursively as:
    Q(s_t, a_t) = r(s_t, a_t) + γ V*(δ(s_t, a_t))
                = r(s_t, a_t) + γ max_{a'} Q(s_{t+1}, a')
• This approach is also called Temporal Difference (TD) learning
Learning the Q-value
• FOR each <s, a> DO
  – Initialize the table entry: Q̂(s, a) = 0
• Observe the current state s
• WHILE (true) DO
  – Select an action a and execute it
  – Receive the immediate reward r
  – Observe the new state s'
  – Update the table entry for Q̂(s, a) as follows:
      Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')
  – Move: record the transition from s to s' (i.e. s ← s')
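The following is a minimal, self-contained Python sketch of this tabular Q-learning loop (not taken from the slides). It assumes a small deterministic environment described by a transition dictionary and a reward dictionary; the names `q_learning`, `transitions` and `rewards` are illustrative only.

```python
import random
from collections import defaultdict

def q_learning(transitions, rewards, goal, gamma=0.9, episodes=100, start_states=None):
    """Tabular Q-learning for a deterministic environment.

    transitions: dict mapping (state, action) -> next state
    rewards:     dict mapping (state, action) -> immediate reward (0 if absent)
    goal:        absorbing goal state that ends an episode
    """
    Q = defaultdict(float)                       # Q̂(s, a), initialized to 0
    states = {s for (s, _) in transitions}

    for _ in range(episodes):
        s = random.choice(sorted(start_states or states))
        while s != goal:
            # Select a (random) applicable action and execute it
            a = random.choice([a for (s2, a) in transitions if s2 == s])
            s_next = transitions[(s, a)]
            r = rewards.get((s, a), 0)
            # Q̂(s, a) <- r + gamma * max_a' Q̂(s', a')
            next_qs = [Q[(s_next, a2)] for (s2, a2) in transitions if s2 == s_next]
            Q[(s, a)] = r + gamma * max(next_qs, default=0.0)
            s = s_next                           # move: s <- s'
    return Q
```

A usage example on the six-state grid of the following slides is shown after the worked example.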
Q-Learning: Example (γ = 0.9)

[Grid: states s1 s2 s3 in the top row, s4 s5 s6 in the bottom row; s3 is the goal. The immediate reward is 1 for the two actions that enter s3 (E from s2 and N from s6) and 0 for every other action.]

At each step:
• Select a possible action a in s and execute it
• Obtain the reward r
• Observe the new state s' reached when executing a from s
• Update the estimate as follows: Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')
• s ← s'

Transition: s = s4, s' = s5 (action E). The "max Q̂" term searches for the most promising action from s'; but in s5 all entries are still 0, therefore:
    Q̂(s4, E) ← 0 + 0.9·0 = 0

Q̂-values table (rows: states; columns: actions N, S, W, E):
  s1: 0  0  0  0
  s2: 0  0  0  0
  s3: 0  0  0  0
  s4: 0  0  0  0
  s5: 0  0  0  0
  s6: 0  0  0  0
Q-Learning: Example (γ = 0.9)

Transition: s = s5, s' = s6 (action E)
Update: Q̂(s5, E) ← 0 + 0.9·0 = 0

Q̂-values table (N, S, W, E):
  s1: 0  0  0  0
  s2: 0  0  0  0
  s3: 0  0  0  0
  s4: 0  0  0  0
  s5: 0  0  0  0
  s6: 0  0  0  0
Q-Learning: Example (γ = 0.9)

Transition: s = s6, s' = s3 (action N). Entering s3 yields r = 1, so Q̂ is updated to 1:
Update: Q̂(s6, N) ← 1 + 0.9·0 = 1

Q̂-values table (N, S, W, E):
  s1: 0  0  0  0
  s2: 0  0  0  0
  s3: 0  0  0  0
  s4: 0  0  0  0
  s5: 0  0  0  0
  s6: 1  0  0  0

GOAL REACHED: END OF FIRST EPISODE
Q-Learning: Example (γ = 0.9)

Transition: s = s4, s' = s1 (action N)
Update: Q̂(s4, N) ← 0 + 0.9·0 = 0

Q̂-values table (N, S, W, E):
  s1: 0  0  0  0
  s2: 0  0  0  0
  s3: 0  0  0  0
  s4: 0  0  0  0
  s5: 0  0  0  0
  s6: 1  0  0  0
Q-Learning: Example (γ = 0.9)

Transition: s = s1, s' = s2 (action E)
Update: Q̂(s1, E) ← 0 + 0.9·0 = 0

Q̂-values table (N, S, W, E):
  s1: 0  0  0  0
  s2: 0  0  0  0
  s3: 0  0  0  0
  s4: 0  0  0  0
  s5: 0  0  0  0
  s6: 1  0  0  0
Q-Learning: Example (γ = 0.9)

Transition: s = s2, s' = s3 (action E)
Update: Q̂(s2, E) ← 1 + 0.9·0 = 1

Q̂-values table (N, S, W, E):
  s1: 0  0  0  0
  s2: 0  0  0  1
  s3: 0  0  0  0
  s4: 0  0  0  0
  s5: 0  0  0  0
  s6: 1  0  0  0

GOAL REACHED: END OF 2nd EPISODE
Q-Learning: Example (γ = 0.9)

Transition: s = s4, s' = s5 (action E)
Update: Q̂(s4, E) ← 0 + 0.9·0 = 0

Q̂-values table (N, S, W, E):
  s1: 0  0  0  0
  s2: 0  0  0  1
  s3: 0  0  0  0
  s4: 0  0  0  0
  s5: 0  0  0  0
  s6: 1  0  0  0
Q-Learning: Example (γ = 0.9)

Transition: s = s5, s' = s2 (action N)
Update: Q̂(s5, N) ← 0 + 0.9·1 = 0.9   (max_{a'} Q̂(s2, a') = 1)

Q̂-values table (N, S, W, E):
  s1: 0    0  0  0
  s2: 0    0  0  1
  s3: 0    0  0  0
  s4: 0    0  0  0
  s5: 0.9  0  0  0
  s6: 1    0  0  0
Q-Learning: Example (γ = 0.9)

Transition: s = s2, s' = s3 (action E)
Update: Q̂(s2, E) ← 1 + 0.9·0 = 1

Q̂-values table (N, S, W, E):
  s1: 0    0  0  0
  s2: 0    0  0  1
  s3: 0    0  0  0
  s4: 0    0  0  0
  s5: 0.9  0  0  0
  s6: 1    0  0  0

GOAL REACHED: END OF 3rd EPISODE
Q-Learning: Example (γ = 0.9)

Transition: s = s4, s' = s5 (action E)
Update: Q̂(s4, E) ← 0 + 0.9·0.9 = 0.81   (max_{a'} Q̂(s5, a') = 0.9)

Q̂-values table (N, S, W, E):
  s1: 0    0  0  0
  s2: 0    0  0  1
  s3: 0    0  0  0
  s4: 0    0  0  0.81
  s5: 0.9  0  0  0
  s6: 1    0  0  0
Q-Learning: Example (γ = 0.9)

Transition: s = s5, s' = s2 (action N)
Update: Q̂(s5, N) ← 0 + 0.9·1 = 0.9 (unchanged)

Q̂-values table (N, S, W, E):
  s1: 0    0  0  0
  s2: 0    0  0  1
  s3: 0    0  0  0
  s4: 0    0  0  0.81
  s5: 0.9  0  0  0
  s6: 1    0  0  0
Q-Learning: Example (γ = 0.9)

Transition: s = s2, s' = s3 (action E)
Update: Q̂(s2, E) ← 1 + 0.9·0 = 1

Q̂-values table (N, S, W, E):
  s1: 0    0  0  0
  s2: 0    0  0  1
  s3: 0    0  0  0
  s4: 0    0  0  0.81
  s5: 0.9  0  0  0
  s6: 1    0  0  0

GOAL REACHED: END OF 4th EPISODE
Q-Learning: Example (γ = 0.9)

Transition: s = s4, s' = s1 (action N)
Update: Q̂(s4, N) ← 0 + 0.9·0 = 0

Q̂-values table (N, S, W, E):
  s1: 0    0  0  0
  s2: 0    0  0  1
  s3: 0    0  0  0
  s4: 0    0  0  0.81
  s5: 0.9  0  0  0
  s6: 1    0  0  0
Q-Learning: Example (γ = 0.9)

Transition: s = s1, s' = s2 (action E)
Update: Q̂(s1, E) ← 0 + 0.9·1 = 0.9

Q̂-values table (N, S, W, E):
  s1: 0    0  0  0.9
  s2: 0    0  0  1
  s3: 0    0  0  0
  s4: 0    0  0  0.81
  s5: 0.9  0  0  0
  s6: 1    0  0  0
Q-Learning: Example (γ = 0.9)

Transition: s = s2, s' = s3 (action E)
Update: Q̂(s2, E) ← 1 + 0.9·0 = 1

Q̂-values table (N, S, W, E):
  s1: 0    0  0  0.9
  s2: 0    0  0  1
  s3: 0    0  0  0
  s4: 0    0  0  0.81
  s5: 0.9  0  0  0
  s6: 1    0  0  0

GOAL REACHED: END OF 5th EPISODE
Q-Learning: Example (γ = 0.9)

After several iterations, the algorithm converges to the following table:

Q̂-values table (N, S, W, E):
  s1: 0     0.72  0     0.9
  s2: 0     0.81  0.81  1
  s3: 0     0     0     0
  s4: 0.81  0     0     0.81
  s5: 0.9   0     0.72  0.9
  s6: 1     0     0.81  0

[Diagram: the same grid annotated with the converged Q̂ values (1, 0.9, 0.81, 0.72) on the corresponding arrows.]
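As a usage example (again not taken from the slides), the `q_learning` sketch given after the algorithm can be run on this six-state grid; the state names and the single reward of 1 for entering the goal s3 follow the slides, while the exact set of available moves is an assumption consistent with the converged table above.

```python
# Grid: s1 s2 s3 (top row), s4 s5 s6 (bottom row); s3 is the goal.
transitions = {
    ("s1", "E"): "s2", ("s2", "E"): "s3", ("s2", "W"): "s1",
    ("s4", "E"): "s5", ("s5", "E"): "s6", ("s5", "W"): "s4", ("s6", "W"): "s5",
    ("s1", "S"): "s4", ("s4", "N"): "s1", ("s2", "S"): "s5",
    ("s5", "N"): "s2", ("s6", "N"): "s3",
}
rewards = {("s2", "E"): 1, ("s6", "N"): 1}   # reward 1 only for entering the goal s3

Q = q_learning(transitions, rewards, goal="s3", gamma=0.9, episodes=200)
print(round(Q[("s5", "N")], 2))   # ≈ 0.9, as in the converged table
print(round(Q[("s4", "N")], 2))   # ≈ 0.81
```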
A crawling robot using Q-learning

An example of Q-learning using 16 states (4 positions for each servo) and 4 possible actions (moving one position up or down for one of the servos). The robot is built with Lego components, two servos, a wheel encoder and a Dwengo board with a microcontroller. The robot learns using the Q-learning algorithm, and its reward is related to the distance moved.
The non-deterministic case

• What if the reward and the state transition are not deterministic? E.g. in Backgammon, learning and playing depend on rolls of the dice!
• Then V and Q need to be redefined by taking expected values:
    V^π(s_t) = E[r_t + γ r_{t+1} + γ² r_{t+2} + ...] = E[Σ_{i≥0} γ^i r_{t+i}]
    Q(s, a) = E[r(s, a) + γ V*(δ(s, a))]
• Similar reasoning and a convergent update iteration will apply
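For reference, a commonly used convergent update in the non-deterministic case (a standard formulation, not spelled out in these slides) blends the old estimate with the new sample using a learning rate α; a minimal sketch:

```python
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One stochastic Q-learning update:
    Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * (r + gamma * max_a' Q(s',a'))."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * (r + gamma * best_next)
    return Q
```

With a suitably decreasing α, this averages over the randomness in rewards and transitions instead of overwriting the estimate at each step.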
Summary
• Reinforcement learning is suitable for learning in uncertain environments where rewards may be delayed and subject to chance
• The goal of a reinforcement learning program is to maximise the eventual reward
• Q-learning is a form of reinforcement learning that doesn't require the learner to have prior knowledge of how its actions affect the environment