
Reinforcement Learning

Machine Learning: Jordan Boyd-Graber
University of Colorado Boulder
Lecture 25

Slides adapted from Tom Mitchell and Pieter Abbeel


Reinforcement Learning

• Control learning

• Control policies that choose optimal actions

• Q learning

• Feature-based representations

• Policy Search


Control Learning

Plan

Control Learning

Q-Learning

Policy Search


Control Learning

Control Learning

Consider learning to choose actions, e.g.,

• Roomba learning to dock on battery charger

• Learning to choose actions to optimize factory output

• Learning to play Backgammon

Note several problem characteristics:

• Delayed reward

• Opportunity for active exploration

• Possibility that state only partially observable

• Possible need to learn multiple tasks with same sensors/effectors


Control Learning

One Example: TD-Gammon

Learn to play Backgammon [Tesauro, 1995]

Immediate reward:

• +100 if win

• -100 if lose

• 0 for all other states

Trained by playing 1.5 million games against itself. Now approximately equal to the best human players.


Control Learning

Reinforcement Learning Problem

[Figure: the agent–environment interaction loop. The agent observes state st and reward rt from the environment and emits action at, producing the sequence s0, a0, r0, s1, a1, r1, s2, a2, r2, ...]

Goal: Learn to choose actions that maximize

r0 + γ r1 + γ² r2 + ..., where 0 ≤ γ < 1


Control Learning

Markov Decision Processes

Assume

• finite set of states S

• set of actions A

• at each discrete time, agent observes state st ∈ S and chooses action at ∈ A

• then receives immediate reward rt

• and state changes to st+1

• Markov assumption: st+1 = δ(st, at) and rt = r(st, at)
◦ i.e., rt and st+1 depend only on the current state and action
◦ functions δ and r may be nondeterministic
◦ functions δ and r not necessarily known to agent

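To make δ and r concrete, here is a minimal sketch in Python (the 2×3 grid layout and the absorbing goal G are assumptions for illustration, chosen to echo the grid-world example on later slides):

ACTIONS = ["up", "down", "left", "right"]
GOAL = (0, 2)                                   # absorbing goal state G
STATES = [(r, c) for r in range(2) for c in range(3)]

def delta(s, a):
    """Successor function δ(s, a): move one cell, staying in bounds."""
    if s == GOAL:
        return s                                # G absorbs all actions
    r, c = s
    moves = {"up": (r - 1, c), "down": (r + 1, c),
             "left": (r, c - 1), "right": (r, c + 1)}
    nr, nc = moves[a]
    return (nr, nc) if 0 <= nr < 2 and 0 <= nc < 3 else s

def reward(s, a):
    """Immediate reward r(s, a): 100 for an action that enters G, else 0."""
    return 100 if s != GOAL and delta(s, a) == GOAL else 0

Both functions are deterministic here; the slides stress that in general they may be nondeterministic and unknown to the agent.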

Control Learning

Agent’s Learning Task

Execute actions in environment, observe results, and

• learn action policy π : S → A that maximizes

E[rt + γ rt+1 + γ² rt+2 + ...]

from any starting state in S

• here 0 ≤ γ < 1 is the discount factor for future rewards

Note something new:

• Target function is π : S → A

• but we have no training examples of the form 〈s, a〉
• training examples are of the form 〈〈s, a〉, r〉

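As a quick sanity check on this objective, a few lines that evaluate the discounted sum for a finite rollout (a sketch; γ = 0.9 matches the worked example later in the deck):

def discounted_return(rewards, gamma=0.9):
    """Compute r_t + γ r_{t+1} + γ² r_{t+2} + ... for a finite reward list."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

print(discounted_return([0, 0, 100]))   # 81.0: a reward two steps away is scaled by γ²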

Q-Learning

Plan

Control Learning

Q-Learning

Policy Search


Q-Learning

Value Function

To begin, consider deterministic worlds...

For each possible policy π the agent might adopt, we can define an evaluation function over states

V^π(s) ≡ rt + γ rt+1 + γ² rt+2 + ... ≡ Σ_{i=0..∞} γ^i rt+i

where rt, rt+1, ... are from following policy π starting at state s


Q-Learning

Q-learning

Restated, the task is to learn the optimal policy π∗:

π∗ ≡ arg max_π V^π(s), (∀s)

[Figure, built up over three slides on a six-cell grid world with absorbing goal G:
• r(s, a) (immediate reward) values: 100 for each action entering G, 0 everywhere else
• Q(s, a) values: 100, 90, 81, 72, ... decaying by γ = 0.9 with each step away from G
• One optimal policy: every action moves along a shortest path toward G]


Q-Learning

What to Learn

We might try to have the agent learn the evaluation function V^π∗ (which we write as V∗). It could then do a lookahead search to choose the best action from any state s because

π∗(s) = arg max_a [r(s, a) + γ V∗(δ(s, a))]

A problem:

• This works well if the agent knows δ : S × A → S and r : S × A → ℝ
• But when it doesn't, it can't choose actions this way


Q-Learning

Q Function

Define new function very similar to V ∗

Q(s, a) ≡ r(s, a) + γV ∗(δ(s, a))

If the agent learns Q, it can choose the optimal action even without knowing δ!

π∗(s) = arg max_a [r(s, a) + γ V∗(δ(s, a))]
      = arg max_a Q(s, a)

Q is the evaluation function the agent will learn


Q-Learning

Training Rule to Learn Q

Note Q and V∗ are closely related:

V∗(s) = max_{a′} Q(s, a′)

which allows us to write Q recursively as

Q(st, at) = r(st, at) + γ V∗(δ(st, at))
          = r(st, at) + γ max_{a′} Q(st+1, a′)

Nice! Let Q̂ denote the learner's current approximation to Q. Consider the training rule

Q̂(s, a) ← r + γ max_{a′} Q̂(s′, a′)

where s′ is the state resulting from applying action a in state s


Q-Learning

Q Learning for Deterministic Worlds

For each s, a initialize table entry Q̂(s, a) ← 0
Observe current state s
Do forever:

• Select an action a and execute it

• Receive immediate reward r

• Observe the new state s′

• Update the table entry for Q̂(s, a) as follows:

Q̂(s, a) ← r + γ max_{a′} Q̂(s′, a′)

• s ← s′

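A direct transcription of this loop, as a Python sketch reusing the hypothetical grid world defined after the MDP slide (ε-greedy selection is one common way to realize "select an action"; the slide itself leaves the choice open):

import random
from collections import defaultdict

def q_learning(gamma=0.9, episodes=2000, epsilon=0.2):
    """Tabular Q-learning for the deterministic grid world."""
    Q = defaultdict(float)                         # Q̂(s, a) initialized to 0
    for _ in range(episodes):
        s = random.choice(STATES)                  # restart after absorbing G
        while s != GOAL:
            if random.random() < epsilon:          # select an action a ...
                a = random.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda act: Q[(s, act)])
            r, s_next = reward(s, a), delta(s, a)  # ... execute; observe r, s'
            # Q̂(s, a) <- r + γ max_a' Q̂(s', a')
            Q[(s, a)] = r + gamma * max(Q[(s_next, a2)] for a2 in ACTIONS)
            s = s_next
    return Q

Q = q_learning()
print(Q[((0, 1), "right")], Q[((0, 0), "right")])  # approaches 100.0 and 90.0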

Q-Learning

Updating Q

[Figure: one step in the grid world. In the initial state s1 the right-pointing arrow carries Q̂ = 72 (nearby arrows carry 63, 81, and 100). The agent takes aright, reaching next state s2, and the update below revises that arrow's value to 90.]

Q̂(s1, aright) ← r + γ max_{a′} Q̂(s2, a′)
             ← 0 + 0.9 max{63, 81, 100} = 90

If rewards are non-negative, then

(∀s, a, n) Q̂n+1(s, a) ≥ Q̂n(s, a)

and

(∀s, a, n) 0 ≤ Q̂n(s, a) ≤ Q(s, a)

Q̂ converges to Q.


Q-Learning

Nondeterministic Case

What if reward and next state are nondeterministic?

We redefine V and Q by taking expected values:

V^π(s) ≡ E[rt + γ rt+1 + γ² rt+2 + ...] ≡ E[Σ_{i=0..∞} γ^i rt+i]

Q(s, a) ≡ E[r(s, a) + γ V∗(δ(s, a))]


Q-Learning

Nondeterministic Case

Q learning generalizes to nondeterministic worlds. Alter the training rule to

Q̂n(s, a) ← (1 − αn) Q̂n−1(s, a) + αn [r + γ max_{a′} Q̂n−1(s′, a′)]

where

αn = 1 / (1 + visitsn(s, a))

Can still prove convergence of Q̂ to Q [Watkins and Dayan, 1992]

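In code, this is a small change to the update (a sketch; the visit counter supplies the decaying learning rate αn, and ACTIONS comes from the earlier grid-world sketch):

from collections import defaultdict

visits = defaultdict(int)                    # visits_n(s, a)

def nondeterministic_update(Q, s, a, r, s_next, gamma=0.9):
    """Q̂_n <- (1 - α_n) Q̂_{n-1} + α_n [r + γ max_a' Q̂_{n-1}(s', a')]."""
    visits[(s, a)] += 1
    alpha = 1.0 / (1.0 + visits[(s, a)])
    target = r + gamma * max(Q[(s_next, a2)] for a2 in ACTIONS)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target

Early visits move Q̂ a lot; later visits barely move it, which is what averages out the noise in r and s′.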

Q-Learning

Temporal Difference Learning

Q learning: reduce the discrepancy between successive Q estimates.

One-step time difference:

Q^(1)(st, at) ≡ rt + γ max_a Q̂(st+1, a)

Why not two steps?

Q^(2)(st, at) ≡ rt + γ rt+1 + γ² max_a Q̂(st+2, a)

Or n?

Q^(n)(st, at) ≡ rt + γ rt+1 + ... + γ^(n−1) rt+n−1 + γ^n max_a Q̂(st+n, a)

Blend all of these:

Q^λ(st, at) ≡ (1 − λ) [Q^(1)(st, at) + λ Q^(2)(st, at) + λ² Q^(3)(st, at) + ...]


Q-Learning

Temporal Difference Learning

Q^λ(st, at) ≡ (1 − λ) [Q^(1)(st, at) + λ Q^(2)(st, at) + λ² Q^(3)(st, at) + ...]

Equivalent recursive expression:

Q^λ(st, at) = rt + γ [(1 − λ) max_a Q̂(st+1, a) + λ Q^λ(st+1, at+1)]

TD(λ) algorithm uses above training rule

• Sometimes converges faster than Q learning

• converges for learning V ∗ for any 0 ≤ λ ≤ 1 (Dayan, 1992)

• Tesauro’s TD-Gammon uses this algorithm

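For concreteness, a sketch of the n-step target Q^(n) computed from a recorded rollout (rewards holds r_t, r_{t+1}, ...; s_n is the state reached after n steps; Q and ACTIONS are the table and actions from the earlier sketches):

def n_step_target(rewards, s_n, n, gamma=0.9):
    """r_t + γ r_{t+1} + ... + γ^(n-1) r_{t+n-1} + γ^n max_a Q̂(s_{t+n}, a)."""
    partial = sum((gamma ** i) * rewards[i] for i in range(n))
    return partial + (gamma ** n) * max(Q[(s_n, a)] for a in ACTIONS)

TD(λ)'s blend is then a geometrically weighted average of these targets over all n.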

Q-Learning

What if the number of states is huge and/or structured?

• Let's say we discover that a state is bad

• In Q learning, we know nothing about similar states

• Solution: feature-based representation
◦ Distance to closest ghost
◦ Distance to closest dot
◦ Number of ghosts
◦ Is Pacman in a tunnel?


Q-Learning

Function Approximation

• Q̂(s, a) ≈ w1 f1(s, a) + ...

• Q-learning now has perceptron-style updates (least-squares regression)

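A sketch of that update (features(s, a) is a hypothetical feature map returning the list f1(s, a), f2(s, a), ...; each weight moves along its feature to shrink the one-step TD error, a perceptron/least-squares-style correction):

def q_approx(w, s, a):
    """Q̂(s, a) ≈ w1 f1(s, a) + w2 f2(s, a) + ..."""
    return sum(wi * fi for wi, fi in zip(w, features(s, a)))

def linear_td_update(w, s, a, r, s_next, gamma=0.9, lr=0.05):
    """Nudge each weight toward the target r + γ max_a' Q̂(s', a')."""
    target = r + gamma * max(q_approx(w, s_next, a2) for a2 in ACTIONS)
    error = target - q_approx(w, s, a)
    return [wi + lr * error * fi for wi, fi in zip(w, features(s, a))]

States that share features now share what the agent learns, which is exactly what the Pacman example needs.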

Policy Search

Plan

Control Learning

Q-Learning

Policy Search


Policy Search

Policy Search

• Problem: often the feature-based policies that work well aren't those that approximate V/Q best

• Solution: Find policies that maximize rewards rather than the value that predicts rewards

• Successful in practice


Policy Search

Example: Imitation Learning

• Take examples from experts {(s1, a1), ...}

• Learn a classifier mapping s → a

• Create loss as the negative reward

• Problems?

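A minimal behavioral-cloning sketch of those bullets (demos is a hypothetical list of (feature_vector, expert_action) pairs; scikit-learn's LogisticRegression stands in for "a classifier"):

from sklearn.linear_model import LogisticRegression

def behavioral_cloning(demos):
    """Fit a policy s -> a from expert state-action pairs."""
    X = [f for f, a in demos]
    y = [a for f, a in demos]
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    return lambda state_features: clf.predict([state_features])[0]

One answer to "Problems?": states the expert never visits are off-distribution for the classifier, so small mistakes compound once the learned policy drifts away from the demonstrations.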


Policy Search

How do we find a good policy?

• Find optimal policies through dynamic programming: π0 ≡ π∗

• Represent states s through a feature vector f(s)

• Until convergence (sketched in code below):
◦ Generate examples of state–action pairs: (πt(s), s)
◦ Create a classifier that maps states to actions (an apprentice policy) ht : f(s) ↦ A (the loss of the classifier is the negative reward)
◦ Interpolate the learned classifier: πt+1 = λ πt + (1 − λ) ht

searn: Searching to Learn (Daumé & Marcu, 2006)

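The sketch promised above (all helpers are hypothetical stand-ins: rollout(pi) returns the states a policy visits, features(s) is the feature map f(s), and fit_classifier is any cost-sensitive learner; the random mixture realizes πt+1 = λ πt + (1 − λ) ht):

import random

def searn_style_training(pi_star, fit_classifier, rollout, lam=0.7, iterations=10):
    """Iteratively blend an initial (oracle) policy with learned apprentices."""
    pi = pi_star                                  # π_0 ≡ π*
    for _ in range(iterations):
        examples = [(features(s), pi(s)) for s in rollout(pi)]
        h = fit_classifier(examples)              # apprentice policy h_t
        prev = pi
        # Stochastic interpolation: act with prev w.p. λ, otherwise with h
        pi = lambda s, prev=prev, h=h: prev(s) if random.random() < lam else h(s)
    return pi

Each round, more of the behavior comes from learned classifiers and less from the oracle, while training examples come from states the current mixture actually reaches.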

Policy Search

Applications of Imitation Learning

• Car driving

• Flying helicopters

• Question answering

• Machine translation


Policy Search

Question Answering

• State: The words seen, opponent

• Action: Buzz or wait

• Reward: Points


Policy Search

Why machine translation really hard is

• State: The words you've seen, output of the machine translation system

• Action: Take the translation, predict the verb

• Reward: Translation quality

[Figure, repeated with incremental reveals across many slides: a simultaneous-translation example. The source sentence "Er ist zum Laden gegangen" arrives word by word along the time axis; the reference translation is "He went to the store". Four policies are compared on the partial translations they commit to:
• Psychic: outputs "He went to the store" immediately, a good translation, but it requires knowing the verb-final source in advance
• Batch: waits for the whole sentence, then outputs "He went to the store", a good translation, but only after the full wait
• Monotone: translates as words arrive ("He", "He to the", "He to the store", "He to the store went"), a bad translation that mirrors German word order
• Policy (learned prediction): commits early where it can ("He went", "He went to the"), recovering the good translation "He went to the store" before the sentence ends]

Page 67: placeholder removed

Policy Search

Applying SEARN

[Figure sequence, built up across several slides: an oracle trajectory for the simultaneous-translation task takes the actions Wait, Predict, Wait, Commit, Wait, plotted against the reward it accumulates over time.]

• From the oracle example, extract training examples x = f(s1), f(s2), f(s3) with labels y = Wait, Predict, Wait

• Train a classifier π : s ↦ a on them with cost-sensitive classification

• Rolled out on its own, the classifier drifts (Wait, Wait, Wait, ..., Commit) and earns low reward: NO GOOD!

• So generate new training examples from the states the interpolated policy actually visits (e.g., x = f(s1), f(s2), f(s3) with y = Wait, Wait, Predict), retrain the classifier, and repeat


Policy Search

Recap

• Reinforcement learning: interacting with environments

• Important to scale to large state spaces

• Connection with classification

• Lets computers observe and repeat
