An Introduction to Reinforcement Learning
Presenter:
Verena Rieser, [email protected]
Course:
Classification and Clustering, WS 2005
Contents
Part 1: The main ideas of RL
Part 2: The general framework of RL
Part 3: Automatic Optimization of Dialogue Management (Application)
Reinforcement Learning
RL draws on ideas from psychology, artificial intelligence, control theory and operations research, artificial neural networks, and neuroscience.
Part 1: The Idea of Reinforcement Learning
Learning from interaction with an environment to achieve some goal.
Example 1 - A baby playing: no teacher; a sensorimotor connection to the environment. It learns cause-effect / action-consequence relations - how to achieve some goal.
Example 2 - Learning to hold a conversation, etc.: we find out the effects of our actions only later.
Supervised Learning
Inputs → Supervised Learning System → Outputs
Training Info = desired (target) outputs
Error = (target output – actual output)
Reinforcement Learning
Inputs → RL System → Outputs (“actions”)
Training Info = evaluations (“rewards” / “penalties”)
Objective: get as much reward as possible
RL - How does it work?
Learning a mapping from situations to actions in order to maximize a scalar reward/reinforcement signal.
How?
Try out actions to learn which produces the highest reward - trial-and-error search.
Actions affect the immediate reward and all subsequent rewards - delayed effects, delayed rewards.
Exploration/Exploitation Trade-off
High rewards come from trying previously well-rewarded actions - EXPLOITATION (= greedy).
BUT: which actions are best? We must also try actions not tried before - EXPLORATION (= ε, i.e. occasional random choices).
Must do both!
The exploitation/exploration trade-off also depends on the lifetime of the agent.
ε-Greedy Methods on the 10-Armed Testbed [Sutton and Barto, 2002]
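Not from the slides, but to make the trade-off concrete: a minimal Python sketch of ε-greedy action selection on a k-armed bandit testbed. All constants (k = 10, ε = 0.1, the number of steps, Gaussian rewards) are illustrative assumptions, not values taken from the figure.

```python
import random

def epsilon_greedy_bandit(k=10, epsilon=0.1, steps=1000, seed=0):
    """Run epsilon-greedy on a k-armed bandit with Gaussian rewards."""
    rng = random.Random(seed)
    true_means = [rng.gauss(0, 1) for _ in range(k)]  # hidden reward means
    q_estimates = [0.0] * k                           # estimated action values
    counts = [0] * k                                  # how often each arm was tried
    total_reward = 0.0

    for _ in range(steps):
        if rng.random() < epsilon:                    # EXPLORE: random action
            a = rng.randrange(k)
        else:                                         # EXPLOIT: greedy action
            a = max(range(k), key=lambda i: q_estimates[i])
        reward = rng.gauss(true_means[a], 1)          # noisy reward for arm a
        counts[a] += 1
        # incremental sample-average update of the action-value estimate
        q_estimates[a] += (reward - q_estimates[a]) / counts[a]
        total_reward += reward

    return total_reward / steps

if __name__ == "__main__":
    print(epsilon_greedy_bandit())
```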
Part 2: Framework of RL
Temporally situated
Continual learning and planning
Object is to affect the environment
Environment is stochastic and uncertain
Agent-environment loop: the agent sends an action to the environment; the environment returns the new state and a reward.
Elements of RL
Policy: what to do
Reward: what is good
Value: what is good because it predicts reward
Model: what follows what
General RL Algorithm
i. Initialise learner’s internal state
ii. Do forever (!?):
a. Observe current state s
b. Choose action a using some evaluation function
c. Execute action a
d. Let r be immediate reward, s’ new state
e. Update internal state based on s,a,r,s’
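A minimal sketch of the loop above in Python. `env`, `policy`, and `update` are hypothetical placeholders standing in for the abstract components (environment, evaluation function, learning update), not a particular library API.

```python
def run_agent(env, policy, update, episodes=100, max_steps=1000):
    """Generic RL interaction loop: observe, act, receive reward, learn."""
    for _ in range(episodes):
        s = env.reset()                    # (a) observe current state s
        for _ in range(max_steps):
            a = policy(s)                  # (b) choose action a via some evaluation function
            s_next, r, done = env.step(a)  # (c, d) execute a; get reward r and new state s'
            update(s, a, r, s_next)        # (e) update internal state based on (s, a, r, s')
            s = s_next
            if done:
                break
```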
To solve the problem mathematically:
• Formulate it as a Markov Decision Process (MDP) or a Partially Observable Markov Decision Process (POMDP)
• Maximize the state-value and action-value functions using the Bellman optimality equation
• Use approximate solution methods for the Bellman equation, such as dynamic programming, Monte Carlo methods, and temporal-difference learning.
The Bellman Equation
• The Bellman optimality equation estimates “how good” it is to be in a state s.
V*(s) = max_a Q^π*(s,a)
Q*(s,a) = Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ max_{a'} Q*(s',a') ]
V^π(s) = Σ_a π(s,a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]
“What actions are available?” “How good are those actions?”
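To make the Bellman optimality backup concrete, here is a small value-iteration sketch for a tabular MDP. The transition probabilities `P`, rewards `R`, discount `gamma`, and threshold `theta` are assumed inputs; the code simply iterates the Q*(s,a) and V*(s) equations above until the values stop changing.

```python
def value_iteration(states, actions, P, R, gamma=0.9, theta=1e-6):
    """Tabular value iteration using the Bellman optimality backup.

    P[s][a] is a dict {s_next: probability}; R[s][a][s_next] is the reward.
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Q*(s,a) = sum_{s'} P^a_{ss'} [ R^a_{ss'} + gamma * V(s') ]
            q = [sum(p * (R[s][a][s2] + gamma * V[s2]) for s2, p in P[s][a].items())
                 for a in actions]
            v_new = max(q)                       # V*(s) = max_a Q*(s,a)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    # greedy policy with respect to the converged value function
    policy = {s: max(actions,
                     key=lambda a: sum(p * (R[s][a][s2] + gamma * V[s2])
                                       for s2, p in P[s][a].items()))
              for s in states}
    return V, policy
```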
Summary: Key Features of RL
Learner is not told which actions to take
Trial-and-error search
Possibility of delayed reward (sacrifice short-term gains for greater long-term gains)
The need to explore and exploit
Considers the whole problem of a goal-directed agent interacting with an uncertain environment
Interactive Exercise:
Help me to annotate the example “a dog catching a stick” with concepts from RL.
Explain: How would an artificial dog learn to catch the stick using RL?
Part 3: Application for CoLi
Diane J. Litman, Michael S. Kearns, Satinder Singh, and Marilyn A. Walker:
Automatic Optimization of Dialogue Management.
In: Proceedings of the 18th International Conference on Computational Linguistics (COLING-2000), Saarbrücken, 2000.
Dialogue Management
Motivation:
• Agent wants to achieve some goal
• Non-trivial choices based on the internal state
• Usability should be guaranteed by iterative prototyping
DM is costly!
Why not “simply” learn the optimal choices?
• Formulate dialogue as MDP
• Represent the environment (= states)
• Define a set of possible dialogue strategies (= actions)
• Evaluate actions (= reward)
The NJFun System
1) Represent a dialogue strategy as a mapping from the state space S to a set of dialogue acts
2) Deploy an initial training system which generates exploratory training data w.r.t. S
3) Construct an MDP model from the training data (see the sketch below)
4) Use value iteration to learn the optimal strategy
5) Evaluate the system w.r.t. a hand-coded strategy
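A hedged sketch of step 3 (step 4 would then run value iteration over the estimated model, as sketched in Part 2): count-based estimation of transition probabilities and average rewards from logged (state, action, next state, reward) tuples. The data format and all names are illustrative, not taken from the NJFun implementation.

```python
from collections import defaultdict

def estimate_mdp(transitions):
    """Estimate P(s'|s,a) and average reward R(s,a,s') from logged dialogue data.

    `transitions` is a list of (state, action, next_state, reward) tuples
    collected with the exploratory training system.
    """
    counts = defaultdict(lambda: defaultdict(int))  # (s,a) -> {s': count}
    reward_sums = defaultdict(float)                # (s,a,s') -> summed reward
    for s, a, s2, r in transitions:
        counts[(s, a)][s2] += 1
        reward_sums[(s, a, s2)] += r

    P, R = {}, {}
    for (s, a), nexts in counts.items():
        total = sum(nexts.values())
        P[(s, a)] = {s2: n / total for s2, n in nexts.items()}
        R[(s, a)] = {s2: reward_sums[(s, a, s2)] / n for s2, n in nexts.items()}
    return P, R
```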
NJFun: Action Space
Initiative
• User: the system asks open questions with an unrestricted grammar for recognition
• System: the system uses directed prompts with restricted grammars
• Mixed: the system uses directed prompts with non-restricted grammars
Confirmation
• Explicit: the system asks the user to verify an attribute
• No confirmation: the system does not generate a confirmation prompt
NJFun: State Space
{Greet}: whether the system has greeted the user or not (0,1)
{Attr}: which attr the system is trying to obtain or verify (1=activity, 2=location, 3=time, 4=done)
{Conf}: ASR confidence after obtaining value for an attribute (0,1,2,3,4)
{Val}: whether system has obtained a value for an attribute (0,1)
{Times}: number of times the system has asked for the attribute
{Gram}: type of grammar most recently used to obtain the attribute
{Hist}: “trouble-in-past”
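For concreteness, one way to bundle the seven features above into a single state object; a hypothetical sketch in which the field names follow the list above and the example values are made up.

```python
from collections import namedtuple

# One NJFun dialogue state as described above: greet, attr, conf, val, times, gram, hist
NJFunState = namedtuple("NJFunState",
                        ["greet", "attr", "conf", "val", "times", "gram", "hist"])

# e.g. a state after greeting the user and obtaining the first attribute
# with medium ASR confidence (values are illustrative)
example = NJFunState(greet=1, attr=1, conf=2, val=1, times=1, gram=0, hist=0)

# A dialogue strategy is then simply a mapping from such states to actions,
# e.g. {example: "explicit_confirmation", ...}
```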
Example
S1: Welcome to NJFun. How may I help you?
    s[greet=1] - a[user initiative]
U1: I’d like to find *um* wine tasting in Lambertville.
    s[conf=2, val=1]
S2a: Did you say you are interested in wine tasting in Lambertville?
    s'[attr=(1,2), times=1] - a[explicit confirmation]
S2b: At what time?
    s'[attr=3] - a[no confirmation]
NJFun: Optimizing the strategy
NJFun’s initial strategy: “Exploratory for Initiative and Confirmation” (EIC); it chooses randomly between the possible actions in each state
Data: 54 subjects for training, 21 for testing
Binary reward function: 1 if the system queries the DB with all specified attributes, 0 otherwise
Results: a large and significant improvement for expert users and a non-significant degradation for novices
Discussion
How general are the features? What about dialogues in other domains (e.g. information seeking vs. tutorial dialogue)?
What about the algorithm? Why can’t we use supervised learning?
Do we really save costs? Stochastic user models for training; “bootstrap” an initial system from training data
Simple Learning Taxonomy
Supervised Learning: a “teacher” provides the required response to inputs; the desired behaviour is known.
Unsupervised Learning: the learner looks for patterns in the input; there is no “right” answer.
Reinforcement Learning: the learner is not told which actions to take, but gets reward/punishment from the environment and learns which action to pick the next time.
RL vs. SL
The main problem facing an SL system is to construct a mapping from situations to actions that mimics the correct actions specified by the environment and that generalizes correctly to new situations.
An SL system cannot be said to learn to control its environment because it follows, rather than influences, the instructive information it receives.
Instead of trying to make its environment behave in a certain way, it tries to make itself behave as instructed by its environment.
RL vs. US
US: make some decision *now* which satisfies the immediate constraints (e.g. clustering: clusters should not be smaller than n).
RL: plan your decisions to achieve some goal in the future; delayed rewards.
A More Formal Definition of the RL Framework...
POLICY: π(s,a) = P{a_t = a | s_t = s}
Given that the situation at time t is s, the policy gives the probability that the agent’s action will be a.
Reward function
Defines goal, and immediate good or bad experience
Value function
Estimate of total future long-term reward.
(We want actions that lead to states of high value, not necessarily high immediate reward!)
Model of environment
Maps states and actions onto states (S × A → S). If in state s1 we take action a2, the model predicts the next state s2 (and sometimes a reward r2).
Markov Property
A state signal that succeeds in retaining all relevant information is said to be Markov, or to have the Markov property.
For example: the current position and velocity of a cannonball are all that matter for its future flight. It doesn't matter how that position and velocity came about.
This is sometimes also referred to as an "independence of path" property because all that matters is in the current state signal; its meaning is independent of the "path," or history, of signals that have led up to it.
MDPs vs. POMDPs
Major difference: how they represent uncertainty.
In MDPs the state space is in general represented as a vector of information slots, each associated with a discrete value.
POMDPs explicitly model uncertainty by maintaining a belief state - a distribution over MDP states - in the absence of knowing the state exactly.
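A minimal sketch of what maintaining a belief state amounts to: a Bayes-style update of the distribution over MDP states after taking an action and receiving an observation. The transition model `P` and observation model `O` are assumed inputs and are not part of the slides.

```python
def belief_update(belief, action, observation, P, O):
    """Update a belief state b(s) after taking `action` and seeing `observation`.

    belief: dict state -> probability
    P[s][action]: dict next_state -> transition probability
    O[next_state][action]: dict observation -> observation probability
    """
    new_belief = {}
    for s2 in {s2 for s in belief for s2 in P[s][action]}:
        # b'(s') is proportional to O(o | s', a) * sum_s P(s' | s, a) * b(s)
        prior = sum(belief[s] * P[s][action].get(s2, 0.0) for s in belief)
        new_belief[s2] = O[s2][action].get(observation, 0.0) * prior
    z = sum(new_belief.values())
    return {s2: p / z for s2, p in new_belief.items()} if z > 0 else new_belief
```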
Some Notable RL Applications
TD-Gammon: Tesauro – world’s best backgammon program
Elevator Control: Crites & Barto – high-performance down-peak elevator controller
Dynamic Channel Assignment: Singh & Bertsekas, Nie & Haykin – high-performance assignment of radio channels to mobile telephone calls
In general, RL is applicable to all (?) goal-oriented optimization tasks.