An Introduction to Reinforcement Learning
Presenter:
Verena Rieser, [email protected]
Course:
Classification and Clustering, WS 2005
Contents
Part 1: The main ideas of RL
Part 2: The general framework of RL
Part 3: Automatic Optimization of Dialogue Management (Application)
Reinforcement Learning
RL draws on ideas from psychology, artificial intelligence, control theory and operations research, artificial neural networks, and neuroscience.
Part 1: The Idea of Reinforcement Learning
Learning from interaction with an environment to achieve some goal.
Example 1 - A baby playing: no teacher; a sensorimotor connection to the environment. It learns cause-effect / action-consequence relations - how to achieve some goal.
Example 2 - Learning to hold a conversation, etc.: we find out the effects of our actions only later.
Supervised Learning
Inputs → Supervised Learning System → Outputs
Training Info = desired (target) outputs
Error = (target output – actual output)
Reinforcement Learning
Inputs → RL System → Outputs (“actions”)
Training Info = evaluations (“rewards” / “penalties”)
Objective: get as much reward as possible
RL - How does it work?
Learning a mapping from situations to actions in order to maximize a scalar reward/reinforcement signal.
How?
Try out actions to learn which produces the highest reward - trial-and-error search.
Actions affect the immediate reward and all subsequent rewards - delayed effects, delayed rewards.
Exploration/Exploitation Trade-off
High rewards come from trying previously well-rewarded actions - EXPLOITATION (= greedy).
BUT: which actions are best? We must also try actions not tried before - EXPLORATION (= ε, i.e. occasional random choices).
Must do both!
The exploitation/exploration trade-off also depends on the lifetime of the agent.
ε-Greedy Methods on the 10-Armed Testbed [Sutton and Barto, 2002]
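Not from the slides, but to make the trade-off concrete: a minimal Python sketch of ε-greedy action selection on a k-armed bandit testbed. All constants (k = 10, ε = 0.1, the number of steps, Gaussian rewards) are illustrative assumptions, not values taken from the figure.

```python
import random

def epsilon_greedy_bandit(k=10, epsilon=0.1, steps=1000, seed=0):
    """Run epsilon-greedy on a k-armed bandit with Gaussian rewards."""
    rng = random.Random(seed)
    true_means = [rng.gauss(0, 1) for _ in range(k)]  # hidden reward means
    q_estimates = [0.0] * k                           # estimated action values
    counts = [0] * k                                  # how often each arm was tried
    total_reward = 0.0

    for _ in range(steps):
        if rng.random() < epsilon:                    # EXPLORE: random action
            a = rng.randrange(k)
        else:                                         # EXPLOIT: greedy action
            a = max(range(k), key=lambda i: q_estimates[i])
        reward = rng.gauss(true_means[a], 1)          # noisy reward for arm a
        counts[a] += 1
        # incremental sample-average update of the action-value estimate
        q_estimates[a] += (reward - q_estimates[a]) / counts[a]
        total_reward += reward

    return total_reward / steps

if __name__ == "__main__":
    print(epsilon_greedy_bandit())
```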
Part 2: Framework of RL
Temporally situated
Continual learning and planning
Object is to affect the environment
Environment is stochastic and uncertain
Agent-environment loop: the agent sends an action to the environment; the environment returns the new state and a reward.
Elements of RL
Policy: what to do
Reward: what is good
Value: what is good because it predicts reward
Model: what follows what
General RL Algorithm
i. Initialise learner’s internal state
ii. Do forever (!?):
a. Observe current state s
b. Choose action a using some evaluation function
c. Execute action a
d. Let r be immediate reward, s’ new state
e. Update internal state based on s,a,r,s’
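A minimal sketch of the loop above in Python. `env`, `policy`, and `update` are hypothetical placeholders standing in for the abstract components (environment, evaluation function, learning update), not a particular library API.

```python
def run_agent(env, policy, update, episodes=100, max_steps=1000):
    """Generic RL interaction loop: observe, act, receive reward, learn."""
    for _ in range(episodes):
        s = env.reset()                    # (a) observe current state s
        for _ in range(max_steps):
            a = policy(s)                  # (b) choose action a via some evaluation function
            s_next, r, done = env.step(a)  # (c, d) execute a; get reward r and new state s'
            update(s, a, r, s_next)        # (e) update internal state based on (s, a, r, s')
            s = s_next
            if done:
                break
```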
To solve the problem mathematically:
• Formulate it as a Markov Decision Process (MDP) or a Partially Observable Markov Decision Process (POMDP)
• Maximize the state-value and action-value functions using the Bellman optimality equation
• Use approximate solution methods for the Bellman equation, such as dynamic programming, Monte Carlo methods, and temporal-difference learning.
The Bellman Equation
• The Bellman optimality equation estimates “how good” it is to be in a state s.
V*(s) = max_a Q^π*(s,a)
Q*(s,a) = Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ max_{a'} Q*(s',a') ]
V^π(s) = Σ_a π(s,a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]
“What actions are available?” “How good are those actions?”
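To make the Bellman optimality backup concrete, here is a small value-iteration sketch for a tabular MDP. The transition probabilities `P`, rewards `R`, discount `gamma`, and threshold `theta` are assumed inputs; the code simply iterates the Q*(s,a) and V*(s) equations above until the values stop changing.

```python
def value_iteration(states, actions, P, R, gamma=0.9, theta=1e-6):
    """Tabular value iteration using the Bellman optimality backup.

    P[s][a] is a dict {s_next: probability}; R[s][a][s_next] is the reward.
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Q*(s,a) = sum_{s'} P^a_{ss'} [ R^a_{ss'} + gamma * V(s') ]
            q = [sum(p * (R[s][a][s2] + gamma * V[s2]) for s2, p in P[s][a].items())
                 for a in actions]
            v_new = max(q)                       # V*(s) = max_a Q*(s,a)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    # greedy policy with respect to the converged value function
    policy = {s: max(actions,
                     key=lambda a: sum(p * (R[s][a][s2] + gamma * V[s2])
                                       for s2, p in P[s][a].items()))
              for s in states}
    return V, policy
```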
Summary: Key Features of RL
Learner is not told which actions to take
Trial-and-error search
Possibility of delayed reward (sacrifice short-term gains for greater long-term gains)
The need to explore and exploit
Considers the whole problem of a goal-directed agent interacting with an uncertain environment
Interactive Exercise:
Help me to annotate the example “a dog catching a stick” with concepts from RL.
Explain: How would an artificial dog learn to catch the stick using RL?
Part 3: Application for CoLi
Diane J. Litman, Michael S. Kearns, Satinder Singh, and Marilyn A. Walker:
Automatic Optimization of Dialogue Management.
In: Proceedings of the 18th International Conference on Computational Linguistics (COLING-2000), Saarbrücken, 2000.
Dialogue Management
Motivation:
• Agent wants to achieve some goal
• Non-trivial choices based on the internal state
• Usability should be guaranteed by iterative prototyping
DM is costly!
Why not “simply” learn the optimal choices?
• Formulate dialogue as MDP
• Represent the environment (= states)
• Define a set of possible dialogue strategies (= actions)
• Evaluate actions (= reward)
The NJFun System
1) Represent a dialogue strategy as a mapping from the state space S to a set of dialogue acts
2) Deploy an initial training system which generates exploratory training data w.r.t. S
3) Construct an MDP model from the training data (see the sketch below)
4) Use value iteration to learn the optimal strategy
5) Evaluate the system w.r.t. a hand-coded strategy
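A hedged sketch of step 3 (step 4 would then run value iteration over the estimated model, as sketched in Part 2): count-based estimation of transition probabilities and average rewards from logged (state, action, next state, reward) tuples. The data format and all names are illustrative, not taken from the NJFun implementation.

```python
from collections import defaultdict

def estimate_mdp(transitions):
    """Estimate P(s'|s,a) and average reward R(s,a,s') from logged dialogue data.

    `transitions` is a list of (state, action, next_state, reward) tuples
    collected with the exploratory training system.
    """
    counts = defaultdict(lambda: defaultdict(int))  # (s,a) -> {s': count}
    reward_sums = defaultdict(float)                # (s,a,s') -> summed reward
    for s, a, s2, r in transitions:
        counts[(s, a)][s2] += 1
        reward_sums[(s, a, s2)] += r

    P, R = {}, {}
    for (s, a), nexts in counts.items():
        total = sum(nexts.values())
        P[(s, a)] = {s2: n / total for s2, n in nexts.items()}
        R[(s, a)] = {s2: reward_sums[(s, a, s2)] / n for s2, n in nexts.items()}
    return P, R
```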
NJFun: Action Space
Initiative
• User: the system asks open questions with an unrestricted grammar for recognition
• System: the system uses directed prompts with restricted grammars
• Mixed: the system uses directed prompts with non-restricted grammars
Confirmation
• Explicit: the system asks the user to verify an attribute
• No confirmation: the system does not generate a confirmation prompt
NJFun: State Space
{Greet}: whether the system has greeted the user or not (0,1)
{Attr}: which attr the system is trying to obtain or verify (1=activity, 2=location, 3=time, 4=done)
{Conf}: ASR confidence after obtaining value for an attribute (0,1,2,3,4)
{Val}: whether system has obtained a value for an attribute (0,1)
{Times}: number of times the system has asked for the attribute
{Gram}: type of grammar most recently used to obtain the attribute
{Hist}: “trouble-in-past”
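For concreteness, one way to bundle the seven features above into a single state object; a hypothetical sketch in which the field names follow the list above and the example values are made up.

```python
from collections import namedtuple

# One NJFun dialogue state as described above: greet, attr, conf, val, times, gram, hist
NJFunState = namedtuple("NJFunState",
                        ["greet", "attr", "conf", "val", "times", "gram", "hist"])

# e.g. a state after greeting the user and obtaining the first attribute
# with medium ASR confidence (values are illustrative)
example = NJFunState(greet=1, attr=1, conf=2, val=1, times=1, gram=0, hist=0)

# A dialogue strategy is then simply a mapping from such states to actions,
# e.g. {example: "explicit_confirmation", ...}
```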
Example
S1: Welcome to NJFun. How may I help you?
    s[greet=1] - a[user initiative]
U1: I’d like to find *um* wine tasting in Lambertville.
    s[conf=2, val=1]
S2a: Did you say you are interested in wine tasting in Lambertville?
    s'[attr=(1,2), times=1] - a[explicit confirmation]
S2b: At what time?
    s'[attr=3] - a[no confirmation]
NJFun: Optimizing the strategy
NJFun’s initial strategy: “Exploratory for Initiative and Confirmation” (EIC); it chooses randomly between the possible actions in each state
Data: 54 subjects for training, 21 for testing
Binary reward function: 1 if the system queries the DB with all specified attributes, 0 otherwise
Results: a large and significant improvement for expert users and a non-significant degradation for novices
Discussion
How general are the features? What about dialogues in other domains (e.g. information seeking vs. tutorial dialogue)?
What about the algorithm? Why can’t we use supervised learning?
Do we really save costs? Stochastic user models for training; “bootstrap” an initial system from training data
Simple Learning Taxonomy
Supervised Learning: a “teacher” provides the required response to inputs; the desired behaviour is known.
Unsupervised Learning: the learner looks for patterns in the input; there is no “right” answer.
Reinforcement Learning: the learner is not told which actions to take, but gets reward/punishment from the environment and learns which action to pick the next time.
RL vs. SL
The main problem facing an SL system is to construct a mapping from situations to actions that mimics the correct actions specified by the environment and that generalizes correctly to new situations.
An SL system cannot be said to learn to control its environment because it follows, rather than influences, the instructive information it receives.
Instead of trying to make its environment behave in a certain way, it tries to make itself behave as instructed by its environment.
RL vs. US
US: make some decision *now* which satisfies the immediate constraints (e.g. clustering: clusters should not be smaller than n).
RL: plan your decisions to achieve some goal in the future; delayed rewards.
A More Formal Definition of the RL Framework...
POLICY: π(s,a) = P{a_t = a | s_t = s}
Given that the situation at time t is s, the policy gives the probability that the agent’s action will be a.
Reward function
Defines goal, and immediate good or bad experience
Value function
Estimate of total future long-term reward.
(We want actions that lead to states of high value, not necessarily high immediate reward!)
Model of environment
Maps states and actions onto states (S × A → S). If in state s1 we take action a2, the model predicts the next state s2 (and sometimes a reward r2).
Markov Property
A state signal that succeeds in retaining all relevant information is said to be Markov, or to have the Markov property.
For example: the current position and velocity of a cannonball are all that matter for its future flight. It doesn't matter how that position and velocity came about.
This is sometimes also referred to as an "independence of path" property because all that matters is in the current state signal; its meaning is independent of the "path," or history, of signals that have led up to it.
MDPs vs. POMDPs
Major difference: how they represent uncertainty.
In MDPs the state space is in general represented as a vector of information slots, each associated with a discrete value.
POMDPs explicitly model uncertainty by maintaining a belief state - a distribution over MDP states - in the absence of knowing the state exactly.
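A minimal sketch of what maintaining a belief state amounts to: a Bayes-style update of the distribution over MDP states after taking an action and receiving an observation. The transition model `P` and observation model `O` are assumed inputs and are not part of the slides.

```python
def belief_update(belief, action, observation, P, O):
    """Update a belief state b(s) after taking `action` and seeing `observation`.

    belief: dict state -> probability
    P[s][action]: dict next_state -> transition probability
    O[next_state][action]: dict observation -> observation probability
    """
    new_belief = {}
    for s2 in {s2 for s in belief for s2 in P[s][action]}:
        # b'(s') is proportional to O(o | s', a) * sum_s P(s' | s, a) * b(s)
        prior = sum(belief[s] * P[s][action].get(s2, 0.0) for s in belief)
        new_belief[s2] = O[s2][action].get(observation, 0.0) * prior
    z = sum(new_belief.values())
    return {s2: p / z for s2, p in new_belief.items()} if z > 0 else new_belief
```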
Some Notable RL Applications
TD-Gammon: Tesauro – world’s best backgammon program
Elevator Control: Crites & Barto – high-performance down-peak elevator controller
Dynamic Channel Assignment: Singh & Bertsekas, Nie & Haykin – high-performance assignment of radio channels to mobile telephone calls
In general, RL is applicable to all (?) goal-oriented optimization tasks.