
A Unifying Framework for Computational Reinforcement Learning Theory


Page 1: A Unifying Framework for Computational Reinforcement Learning Theory

A Unifying Framework for Computational Reinforcement Learning Theory

Lihong Li
Rutgers Laboratory for Real-Life Reinforcement Learning (RL3)
Department of Computer Science, Rutgers University

PhD Defense Committee: Michael Littman, Michael Pazzani, Robert Schapire, Mario Szegedy

Joint work with Michael Littman, Alex Strehl, Tom Walsh, …

Page 2: A Unifying Framework for Computational Reinforcement Learning Theory


$ponsored $earch

Are these better alternatives? Need to EXPLORE!

Page 3: A Unifying Framework for Computational Reinforcement Learning Theory


Thesis

The KWIK (Knows What It Knows) learning model provides a flexible, modularized, and unifying way for creating and analyzing RL algorithms with provably efficient exploration.

Page 4: A Unifying Framework for Computational Reinforcement Learning Theory


Outline

• Reinforcement Learning (RL)
• The KWIK Framework
• Provably Efficient RL
  – Model-based Approaches
  – Model-free Approaches
• Conclusions

Page 5: A Unifying Framework for Computational Reinforcement Learning Theory


Reinforcement Learning Example

[Figure: the AT&T Dialer spoken-dialog system [Li & Williams & Balakrishnan 09]. A caller wants to reach someone at AT&T, e.g., "May I speak to John Smith?". States: a 100K-dimensional dialog state built from speech recognition, NLP, belief tracking, and other features. Actions: responses to the user, produced by language generation and text-to-speech, e.g., Confirm("John Smith") → "So you want to call John Smith, is that right?". Reward: -1 per response, +20 if the call succeeds, -20 if it fails. Dialog design objective, optimized by RL: succeed in the conversation with the fewest responses.]

Page 6: A Unifying Framework for Computational Reinforcement Learning Theory


RL Summary

European Workshop on Reinforcement Learning 2008

Define reward and let the agent chase it!

Page 7: A Unifying Framework for Computational Reinforcement Learning Theory


Markov Decision Process

• Environment is often modeled as an MDP

• An MDP is a tuple $M = \langle S, A, T, R, \gamma \rangle$:
  – Set of states $S$
  – Set of actions $A$
  – Transition probabilities $T$: $s_{t+1} \sim T(\cdot \mid s_t, a_t)$
  – Reward function $R$: $r_t = R(s_t, a_t)$
  – Discount factor $\gamma$ in $(0,1)$
• Trajectory over time: $s_1 \xrightarrow{a_1} s_2 \to \cdots \to s_t \xrightarrow{a_t} s_{t+1}$, receiving rewards $r_1 = R(s_1, a_1), \ldots, r_t = R(s_t, a_t)$
• Regularity assumption: $0 \le R(s,a) \le 1$
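For concreteness, here is a minimal Python container for the finite-MDP tuple defined above; the field names and array layout are illustrative choices, not from the slides.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class FiniteMDP:
    """M = <S, A, T, R, gamma> with S = {0, ..., n_states-1}, A = {0, ..., n_actions-1}."""
    n_states: int
    n_actions: int
    T: np.ndarray        # shape (S, A, S): T[s, a, s'] = Pr(s' | s, a)
    R: np.ndarray        # shape (S, A): rewards in [0, 1] (regularity assumption)
    gamma: float         # discount factor in (0, 1)

    def step(self, s: int, a: int, rng: np.random.Generator):
        """Sample r_t = R(s_t, a_t) and s_{t+1} ~ T(. | s_t, a_t)."""
        r = self.R[s, a]
        s_next = int(rng.choice(self.n_states, p=self.T[s, a]))
        return r, s_next
```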

Page 8: A Unifying Framework for Computational Reinforcement Learning Theory


Policies and Value Functions

• Policy: $\pi: S \to A$
• Value function: $Q^\pi(s,a) = \mathbb{E}\!\left[ r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots \right]$
• Optimal value function: $Q^*(s,a) = \max_\pi Q^\pi(s,a)$
• Optimal policy: $\pi^*(s) = \arg\max_a Q^*(s,a)$
• Solving an MDP: find $Q \approx Q^*$, so that $\pi(s) = \arg\max_a Q(s,a)$ is (near-)optimal

Page 9: A Unifying Framework for Computational Reinforcement Learning Theory


Solving an MDP
• Planning (when $T$ and $R$ are known)
  – Dynamic programming, linear programming, …
  – Relatively easy to analyze
• Learning (when $T$ or $R$ are unknown)
  – Q-learning [Watkins 89], …
  – Fundamentally harder
  – Exploration/exploitation dilemma
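As a sketch of the planning case (T and R known), here is value iteration computing $Q^*$, reusing the FiniteMDP container sketched earlier; the tolerance and iteration cap are arbitrary illustrative values.

```python
import numpy as np

def value_iteration(mdp: "FiniteMDP", tol: float = 1e-6, max_iters: int = 10_000) -> np.ndarray:
    """Iterate the Bellman optimality operator until convergence; returns Q* (shape S x A)."""
    Q = np.zeros((mdp.n_states, mdp.n_actions))
    for _ in range(max_iters):
        V = Q.max(axis=1)                         # V(s) = max_a Q(s, a)
        Q_new = mdp.R + mdp.gamma * (mdp.T @ V)   # R(s,a) + gamma * E_{s'}[V(s')]
        if np.abs(Q_new - Q).max() < tol:
            return Q_new
        Q = Q_new
    return Q

# The greedy policy pi*(s) = argmax_a Q*(s, a) is then value_iteration(mdp).argmax(axis=1).
```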

Page 10: A Unifying Framework for Computational Reinforcement Learning Theory


Exploration/Exploitation Dilemma

• Similar to
  – active learning (like selective sampling)
  – bandit problems (ad ranking)
• But different/harder
  – Many heuristics may fail
• The "dual control" tension: "exploitation" (take optimal actions; reward maximization) vs. "exploration" (try suboptimal actions; knowledge acquisition — need to estimate $T$ and $R$)

Page 11: A Unifying Framework for Computational Reinforcement Learning Theory


Combination Lock

[Figure: a 100-state "combination lock" MDP (states 1, 2, 3, …, 98, 99, 100) with rewards 0, 0.001, and 1000, and a plot of total rewards versus time comparing poor (insufficient) exploration, active (efficient) exploration, and the optimal policy.]

Page 12: A Unifying Framework for Computational Reinforcement Learning Theory


PAC-MDP RL
• RL algorithm viewed as a non-stationary policy: $A: (s_1, a_1, r_1, \ldots, r_{t-1}, s_t) \mapsto a_t$
• Sample complexity [Kakade 03] (given $\epsilon > 0$): the number of timesteps $t = 1, 2, 3, \ldots$ with $Q^{A_t}(s_t, a_t) < Q^*(s_t, a_t) - \epsilon$
• $A$ is PAC-MDP (Probably Approximately Correct in MDPs) [Strehl, Li, Wiewiora, Langford & Littman 06] if, for any $\epsilon, \delta \in (0,1)$:
  – with probability at least $1 - \delta$,
  – the sample complexity is $\mathrm{poly}\!\left(\frac{1}{\epsilon}, \frac{1}{\delta}, |M|, \frac{1}{1-\gamma}\right)$

In words: we want the algorithm to act near-optimally except in a small number of steps.

Page 13: A Unifying Framework for Computational Reinforcement Learning Theory


Why PAC-MDP?

• Sample complexity
  – number of steps where learning/exploration happens
  – related to "learning speed" or "exploration efficiency"
• Roles of parameters
  – $\epsilon$: allow small sub-optimality
  – $\delta$: allow failure due to unlucky data
  – $|M|$: measures problem complexity
  – $1/(1-\gamma)$: larger $\gamma$ makes the problem harder
• Generality
  – No assumption on ergodicity
  – No assumption on mixing
  – No need for reset or generative model

Page 14: A Unifying Framework for Computational Reinforcement Learning Theory


Rmax [Brafman & Tennenholtz 02]
• Rmax is for finite-state, finite-action MDPs
• Learns $T$ and $R$ by counting/averaging: $\hat{T}(s' \mid s, a)$, $\hat{R}(s, a)$
• In $s_t$, takes the optimal action in the known MDP $\hat{M} = \langle S, A, \hat{T}, \hat{R}, \gamma \rangle$:
  – the state-action space $S \times A$ is partitioned into known and unknown state-actions
  – known state-actions use the estimates $\hat{T}$ and $\hat{R}$
  – unknown state-actions are given the optimistic value $Q_{\max} = \frac{1}{1-\gamma} \ge Q^*(s,a)$ (since $0 \le R(s,a) \le 1$)
• "Optimism in the face of uncertainty": either explore the "unknown" region, or exploit the "known" region
• Thm: Rmax is PAC-MDP [Kakade 03]
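A simplified sketch of the Rmax construction just described, reusing the FiniteMDP and value_iteration sketches from earlier; the known-count threshold m and re-planning on every step are simplifications for illustration, not the exact algorithm from the slides.

```python
import numpy as np

class Rmax:
    """Count transitions/rewards; act greedily in an optimistic 'known-state' MDP."""

    def __init__(self, n_states: int, n_actions: int, gamma: float, m: int = 100):
        self.S, self.A, self.gamma, self.m = n_states, n_actions, gamma, m
        self.counts = np.zeros((n_states, n_actions))
        self.trans_counts = np.zeros((n_states, n_actions, n_states))
        self.reward_sums = np.zeros((n_states, n_actions))

    def observe(self, s: int, a: int, r: float, s_next: int) -> None:
        self.counts[s, a] += 1
        self.trans_counts[s, a, s_next] += 1
        self.reward_sums[s, a] += r

    def act(self, s: int) -> int:
        # Empirical model where (s,a) is "known" (count >= m); otherwise a self-loop
        # with reward 1, so the unknown pair gets the optimistic value 1/(1-gamma) >= Q*.
        T_hat = np.zeros((self.S, self.A, self.S))
        R_hat = np.ones((self.S, self.A))
        known = self.counts >= self.m
        for si, ai in zip(*np.nonzero(known)):
            T_hat[si, ai] = self.trans_counts[si, ai] / self.counts[si, ai]
            R_hat[si, ai] = self.reward_sums[si, ai] / self.counts[si, ai]
        for si, ai in zip(*np.nonzero(~known)):
            T_hat[si, ai, si] = 1.0
        Q = value_iteration(FiniteMDP(self.S, self.A, T_hat, R_hat, self.gamma))
        return int(Q[s].argmax())
```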

Page 15: A Unifying Framework for Computational Reinforcement Learning Theory


Outline

• Reinforcement Learning (RL)
• The KWIK Framework
• Provably Efficient RL
  – Model-based Approaches
  – Model-free Approaches
• Conclusions

Page 16: A Unifying Framework for Computational Reinforcement Learning Theory


KWIK Notation
• KWIK: Knows What It Knows [Li & Littman & Walsh 08]
• A self-aware, supervised-learning model
  – Input set: X
  – Output set: Y
  – Observation set: Z
  – Hypothesis class: H ⊆ (X → Y)
  – Target function: h* ∈ H (the "realizability assumption")
  – Special symbol: ? ("I don't know")

Page 17: A Unifying Framework for Computational Reinforcement Learning Theory


KWIK Definition

Given: $\epsilon$, $\delta$, H.

Protocol:
• Env: picks $h^* \in H$ secretly & adversarially
• Repeat:
  – Env: picks $x$ adversarially
  – Learner: outputs "ŷ" ("I know") or "?" ("I don't know")
  – On "?": observe $y = h^*(x)$ [deterministic case] or a measurement $z$ with $\mathbb{E}[z] = h^*(x)$ [stochastic case]

Learning succeeds if, with probability at least $1 - \delta$:
• all predictions are accurate: $|ŷ - h^*(x)| \le \epsilon$
• the total number of ?'s is small: at most $\mathrm{poly}(1/\epsilon, 1/\delta, \mathrm{dim}(H))$
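To make the protocol concrete, here is a minimal Python interface that the learners sketched later in this deck could implement; the class and method names are illustrative, not from the thesis.

```python
from abc import ABC, abstractmethod
from typing import Any, Optional

class KWIKLearner(ABC):
    """A KWIK learner either predicts h*(x) accurately or admits 'I don't know' (None)."""

    @abstractmethod
    def predict(self, x: Any) -> Optional[Any]:
        """Return a prediction y_hat with |y_hat - h*(x)| <= eps, or None for '?'."""

    @abstractmethod
    def observe(self, x: Any, z: Any) -> None:
        """Feedback after a '?': z = h*(x) (deterministic) or E[z] = h*(x) (stochastic)."""
```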

Page 18: A Unifying Framework for Computational Reinforcement Learning Theory


Related Frameworks
• PAC (Probably Approximately Correct) [Valiant 84]: i.i.d. inputs, labels up front, no mistakes
• MB (Mistake Bound) [Littlestone 87]: adversarial inputs, labels when wrong
• KWIK (Knows What It Knows) [Li & Littman & Walsh 08]: adversarial inputs, labels on request, no mistakes

KWIK is the most demanding of the three: a KWIK learner can be converted into an MB learner, and an MB learner into a PAC learner, but not conversely — MB can be harder than PAC if one-way functions exist [Blum 94], and KWIK may be exponentially harder than MB [Li & Littman & Walsh 08].

Page 19: A Unifying Framework for Computational Reinforcement Learning Theory


Deterministic / Finite Case (X or H is finite)

Thought experiment: you own a bar frequented by n patrons…
• One is an instigator. When he shows up, there is a fight, unless
• another patron, the peacemaker, is also there.
• We want to predict, for a subset of patrons, {fight or no-fight}.

Alg. 1: Memorization
• Memorize the outcome for each subgroup of patrons
• Predict ? if unseen before
• #? ≤ |X|
• Bar-fight: #? ≤ 2^n

Alg. 2: Enumeration
• Enumerate all consistent (instigator, peacemaker) pairs
• Say ? when they disagree
• #? ≤ |H| - 1
• Bar-fight: #? ≤ n(n-1)

Can make accurate predictions before complete identification of h*.
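A minimal sketch of Alg. 2 (Enumeration) for a finite class of deterministic hypotheses, assuming the KWIKLearner interface sketched earlier; hypotheses are passed in as plain Python callables.

```python
class EnumerationLearner(KWIKLearner):
    """KWIK-learns a deterministic target from a finite class H; #? <= |H| - 1."""

    def __init__(self, hypotheses):
        self.version_space = list(hypotheses)   # hypotheses still consistent with all data

    def predict(self, x):
        predictions = {h(x) for h in self.version_space}
        # If every surviving hypothesis agrees, the answer is safe; otherwise say '?'.
        return predictions.pop() if len(predictions) == 1 else None

    def observe(self, x, y):
        # Observed only after a '?': at least one disagreeing hypothesis is eliminated.
        self.version_space = [h for h in self.version_space if h(x) == y]
```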

Page 20: A Unifying Framework for Computational Reinforcement Learning Theory


Stochastic / Finite Case: Dice-Learning
• Problem
  – Learn a multinomial distribution over N outcomes
  – Same input at all times
  – Observe outcomes, not actual probabilities
• Algorithm
  – Predict ? for the first $O\!\left(\frac{N}{\epsilon^2} \ln \frac{N}{\delta}\right)$ times
  – Use the empirical estimate afterwards
  – Correctness follows from Chernoff's bound
• Building block for many other stochastic cases
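A minimal sketch of dice-learning, again against the KWIKLearner interface from earlier; the constant in the sample threshold is illustrative and not tuned to the formal bound.

```python
import numpy as np

class DiceLearner(KWIKLearner):
    """Say '?' until enough outcomes are seen, then report the empirical distribution."""

    def __init__(self, n_outcomes: int, eps: float, delta: float):
        self.counts = np.zeros(n_outcomes)
        # Threshold of order (N / eps^2) * log(N / delta); accuracy then follows
        # from a Chernoff-style concentration argument.
        self.m = int(np.ceil(n_outcomes / eps**2 * np.log(n_outcomes / delta)))

    def predict(self, x=None):
        n = self.counts.sum()
        return self.counts / n if n >= self.m else None   # '?' while under-sampled

    def observe(self, x, outcome: int) -> None:
        self.counts[outcome] += 1
```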

Page 21: A Unifying Framework for Computational Reinforcement Learning Theory


More Examples

• Distance to an unknown point in $\mathbb{R}^n$ [Li & Littman & Walsh 08]
• Linear functions with white noise [Strehl & Littman 08] [Walsh & Szita & Diuk & Littman 09]
• Gaussian distributions [Brunskill & Leffler & Li & Littman & Roy 08]

Page 22: A Unifying Framework for Computational Reinforcement Learning Theory


Outline

• Reinforcement Learning (RL)
• The KWIK Framework
• Provably Efficient RL
  – Model-based Approaches
  – Model-free Approaches
• Conclusions

Page 23: A Unifying Framework for Computational Reinforcement Learning Theory


Model-based RL
• Model-based RL (in $M = \langle S, A, T, R, \gamma \rangle$)
  – First learn $T$ and $R$, giving a model $\hat{M} = \langle S, A, \hat{T}, \hat{R}, \gamma \rangle$
  – Then use $\hat{M}$ to compute $\hat{Q}^* \approx Q^*$
• Simulation lemma [Kearns & Singh 02]: if $\left\| \hat{T}(\cdot \mid s,a) - T(\cdot \mid s,a) \right\|_1 = O\!\left(\epsilon (1-\gamma)^2\right)$ and $\left| \hat{R}(s,a) - R(s,a) \right| = O\!\left(\epsilon (1-\gamma)\right)$ for all $(s,a)$, then $\left\| \hat{Q}^* - Q^* \right\|_\infty \le \epsilon$
• Building a model often makes more efficient use of training data in practice

Page 24: A Unifying Framework for Computational Reinforcement Learning Theory


KWIK-Rmax [Li et al. 09]
• Generalizes Rmax [Brafman & Tennenholtz 02], which is for finite-state, finite-action MDPs and learns $T$ and $R$ by counting/averaging, to general MDPs
• KWIK-learns $T$ and $R$ simultaneously: $\hat{T}(s' \mid s, a)$, $\hat{R}(s, a)$
• In $s_t$, takes the optimal action in the known MDP $\hat{M} = \langle S, A, \hat{T}, \hat{R}, \gamma \rangle$:
  – known state-actions (where the KWIK learners make predictions) use $\hat{T}$ and $\hat{R}$
  – unknown state-actions (where they say "?") get the optimistic value $Q_{\max} = \frac{1}{1-\gamma} \ge Q^*(s,a)$
• "Optimism in the face of uncertainty": either explore the "unknown" region, or exploit the "known" region

Page 25: A Unifying Framework for Computational Reinforcement Learning Theory


KWIK-Rmax Analysis

• Explore-or-Exploit Lemma [Li et al. 09]
  – KWIK-Rmax either follows an $\epsilon$-optimal policy, or
  – explores an unknown state,
  – allowing the KWIK learners to learn $T$ and $R$!
• Theorem [Li et al. 09]: KWIK-Rmax is PAC-MDP with sample complexity
  $\tilde{O}\!\left( \frac{ B_T\!\left(\epsilon (1-\gamma)^2, \delta\right) + B_R\!\left(\epsilon (1-\gamma), \delta\right) }{ \epsilon (1-\gamma)^2 } \right)$,
  where $B_T$ and $B_R$ are the KWIK bounds for learning $T$ and $R$

Page 26: A Unifying Framework for Computational Reinforcement Learning Theory


KWIK-Learning Finite MDPs by Input-Partition
• $T(\cdot \mid s,a)$ is a multinomial distribution
  – There are $|S||A|$ many of them
  – Each indexed by $(s,a)$
• Input-Partition: the environment presents an input $x = (s, a)$, which is routed to a dedicated dice-learning sub-learner for that pair — one each for $T(\cdot \mid s_1, a_1), T(\cdot \mid s_1, a_2), \ldots, T(\cdot \mid s_n, a_m)$
• $\#? = O\!\left( \frac{|S|^2 |A|}{\epsilon^2} \ln \frac{|S||A|}{\delta} \right)$ [Brafman & Tennenholtz 02] [Kakade 03] [Strehl & Li & Littman 06]
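A minimal sketch of the input-partition combinator: one independent sub-learner per index, with the ?-counts simply adding up across the partition. It reuses the KWIKLearner interface and DiceLearner sketched earlier; the factory argument is an illustrative design choice.

```python
class InputPartitionLearner(KWIKLearner):
    """Route each input to its own sub-learner, created on demand."""

    def __init__(self, make_sublearner):
        self.make_sublearner = make_sublearner   # e.g. lambda: DiceLearner(n_states, eps, delta)
        self.sublearners = {}                    # partition key -> KWIK sub-learner

    def _learner(self, key):
        if key not in self.sublearners:
            self.sublearners[key] = self.make_sublearner()
        return self.sublearners[key]

    def predict(self, key):
        return self._learner(key).predict(key)   # a '?' (None) propagates from the sub-learner

    def observe(self, key, z):
        self._learner(key).observe(key, z)

# KWIK-learning T(.|s,a) for a finite MDP: keys are (s, a) pairs, each mapped to a
# DiceLearner over next states, e.g. InputPartitionLearner(lambda: DiceLearner(n_states, eps, delta)).
```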

Page 27: A Unifying Framework for Computational Reinforcement Learning Theory


Factored-State MDPs
• DBN representation [Dean & Kanazawa 89]
• [Figure: example network topologies from [Guestrin & Koller & Parr & Venkataraman 03]: Bidirectional Ring, Star, Ring and Star, 3 Legs, Ring of Rings]

Page 28: A Unifying Framework for Computational Reinforcement Learning Theory


Factored-State MDPs
• DBN representation [Dean & Kanazawa 89]
  – State is a vector of factors: $S = (S_1, S_2, \ldots, S_n)$
  – [Figure: two-slice DBN with factors $S_1, S_2, S_3, \ldots, S_n$ at time $t$, action $a$, and $S'_1, S'_2, S'_3, \ldots, S'_n$ at time $t+1$]
  – The transition model factors as
    $T(s' \mid s_1, s_2, \ldots, s_n, a) = \prod_i T_i(s_i' \mid s_1, s_2, \ldots, s_n, a) = \prod_i T_i(s_i' \mid \mathrm{parents}(s_i'), a)$
  – Assuming #parents is bounded by a constant $D$
• Challenges:
  – How to estimate $T_i(s_i' \mid \mathrm{parents}(s_i'), a)$?
  – How to discover the parents of each $s_i'$?
  – How to combine learners $L(s_i')$ and $L(s_j')$?
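A small sketch of the factored transition probability above, computed as a product of per-factor CPT lookups; the data layout (a list of CPT arrays indexed by parent values, action, and next factor value) is an illustrative assumption.

```python
def factored_transition_prob(s, a, s_next, parents, cpts) -> float:
    """T(s'|s,a) = prod_i T_i(s_i' | parents(s_i'), a) for a DBN-factored MDP.

    s, s_next : tuples of factor values (s_1, ..., s_n)
    parents   : parents[i] = indices of the factors that s_i' depends on
    cpts      : cpts[i][parent values + (a, s_next[i])] = T_i(s_i' | parents(s_i'), a)
    """
    prob = 1.0
    for i, s_i_next in enumerate(s_next):
        parent_values = tuple(s[j] for j in parents[i])
        prob *= cpts[i][parent_values + (a, s_i_next)]
    return prob
```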

Page 29: A Unifying Framework for Computational Reinforcement Learning Theory


KWIK-Learning DBNs with Unknown Structure

[Figure: learning a DBN by composing KWIK learners — Dice-Learning for the entries in a CPT, Input-Partition and Cross-Product to assemble the CPTs for $T(s_i' \mid \mathrm{parents}(s_i'), a)$ across factors, and Noisy-Union for the discovery of the parents of each $s_i'$.]

From [Kearns & Koller 99]: "This paper leaves many interesting problems unaddressed. Of these, the most intriguing one is to allow the algorithm to learn the model structure as well as the parameters. The recent body of work on learning Bayesian networks from data [Heckerman, 1995] lays much of the foundation, but the integration of these ideas with the problems of exploration/exploitation is far from trivial."

First solved by [Strehl & Diuk & Littman 07]; the KWIK-based approach shown here is from [Li & Littman & Walsh 08] and [Diuk & Li & Leffler 09].

Page 30: A Unifying Framework for Computational Reinforcement Learning Theory


Experiment: "System Administrator"

[Figure: learning curves comparing Met-Rmax [Diuk & Li & Leffler 09], SLF-Rmax [Strehl & Diuk & Littman 07], and Factored Rmax [Guestrin & Patrascu & Schuurmans 02] on a ring network with 8 machines and 9 actions.]

Page 31: A Unifying Framework for Computational Reinforcement Learning Theory


MDPs with Gaussian Dynamics
• Examples: robot navigation, transportation planning
• State offset is a multivariate normal distribution:
  $T(\cdot \mid s, a) = s + N\!\left( \mu_{\mathrm{type}(s), \mathrm{type}(a)}, \Sigma_{\mathrm{type}(s), \mathrm{type}(a)} \right)$
• CORL [Brunskill & Leffler & Li & Littman & Roy 08]
• RAM-Rmax [Leffler & Littman & Edmunds 07]
• (video by Leffler)
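A minimal sketch of sampling from the Gaussian-offset dynamics above: the next state is the current state plus a multivariate-normal offset whose mean and covariance depend on the types of the state and action. All parameter names here are illustrative assumptions.

```python
import numpy as np

def sample_next_state(s, a, state_type, action_type, mu, sigma, rng: np.random.Generator):
    """s' = s + offset, with offset ~ N(mu[type(s), type(a)], Sigma[type(s), type(a)])."""
    ts, ta = state_type(s), action_type(a)          # e.g. terrain class and motion primitive
    offset = rng.multivariate_normal(mu[ts, ta], sigma[ts, ta])
    return np.asarray(s, dtype=float) + offset
```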

Page 32: A Unifying Framework for Computational Reinforcement Learning Theory


Outline

• Reinforcement Learning (RL)
• The KWIK Framework
• Provably Efficient RL
  – Model-based Approaches
  – Model-free Approaches
• Conclusions

Page 33: A Unifying Framework for Computational Reinforcement Learning Theory


Model-free RL

• Estimate $Q^*$ directly
  – Implying $\pi^*(s) = \arg\max_a Q^*(s,a)$
  – No need to estimate $T$ or $R$
• Benefits
  – Tractable computation complexity
  – Tractable space complexity
• Drawbacks
  – Seems to make inefficient use of data
  – Are there PAC-MDP model-free algorithms?

Page 34: A Unifying Framework for Computational Reinforcement Learning Theory


PAC-MDP Model-free RL

• Bellman equation: $Q^*(s,a) = R(s,a) + \gamma\, \mathbb{E}_{s'}\!\left[ \max_{a'} Q^*(s', a') \right]$
• Bellman error: $E(s,a) = R(s,a) + \gamma\, \mathbb{E}_{s'}\!\left[ \max_{a'} Q(s', a') \right] - Q(s,a)$
  – The Bellman error can be KWIK-learned
• Maintain optimistic Q-functions, initialized to $\frac{1}{1-\gamma} \ge Q^*(s,a)$
• The state-action space $S \times A$ splits into a known set $K_t$ of pairs $(s,a)$ with small $E(s,a)$, where acting greedily is near-optimal (exploit), and the remaining pairs, which get explored
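For intuition, a small sketch of the Bellman error defined above, evaluated exactly under a known model (the model-free algorithms instead estimate this expectation from samples); it reuses the FiniteMDP container from earlier.

```python
import numpy as np

def bellman_error(mdp: "FiniteMDP", Q: np.ndarray) -> np.ndarray:
    """E(s,a) = R(s,a) + gamma * E_{s'}[max_a' Q(s',a')] - Q(s,a), for all (s,a) at once."""
    V = Q.max(axis=1)                         # max_a' Q(s', a')
    return mdp.R + mdp.gamma * (mdp.T @ V) - Q

# Optimistic initialization used by the PAC-MDP model-free algorithms:
# Q0 = np.full((mdp.n_states, mdp.n_actions), 1.0 / (1.0 - mdp.gamma))
```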

Page 35: A Unifying Framework for Computational Reinforcement Learning Theory


Delayed Q-learning

• Delayed Q-learning (for finite MDPs): the first known PAC-MDP model-free algorithm [Strehl & Li & Wiewiora & Langford & Littman 06]
• Similar to Q-learning [Watkins 89]
  – Minimal computation complexity
  – Minimal space complexity
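For reference, here is the classic Q-learning update [Watkins 89] that Delayed Q-learning resembles; Delayed Q-learning itself batches several samples per state-action before updating and adds an exploration bonus, which this sketch omits.

```python
import numpy as np

def q_learning_update(Q: np.ndarray, s: int, a: int, r: float, s_next: int,
                      alpha: float, gamma: float) -> None:
    """Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)] (in place)."""
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])
```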

Page 36: A Unifying Framework for Computational Reinforcement Learning Theory


Comparison

Page 37: A Unifying Framework for Computational Reinforcement Learning Theory


Improved Lower Bound for Finite MDPs
• Lower bound for $N = 1$ [Mannor & Tsitsiklis 04]: $\Omega\!\left( \frac{|A|}{\epsilon^2} \log \frac{1}{\delta} \right)$
• Theorem: a new lower bound of $\Omega\!\left( \frac{|S||A|}{\epsilon^2} \log \frac{|S|}{\delta} \right)$
• Delayed Q-learning's upper bound: $\tilde{O}\!\left( \frac{|S||A|}{\epsilon^4 (1-\gamma)^8} \right)$

Page 38: A Unifying Framework for Computational Reinforcement Learning Theory


KWIK with Linear Function Approximation
• Linear FA: $Q(s,a) = \sum_{i=1}^{k} w_i\, \phi_i(s,a) = w^\top \phi(s,a)$
• LSPI-Rmax [Li & Littman & Mansley 09]
  – LSPI [Lagoudakis & Parr 03] with online exploration
  – $(s,a)$ is unknown if under-represented in the training set
  – Includes Rmax as a special case
• REKWIRE [Li & Littman 08]
  – For finite-horizon MDPs
  – Learns Q in a bottom-up manner
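A minimal sketch of the linear value-function approximation above, with a placeholder feature map phi; the names are illustrative.

```python
import numpy as np

def linear_q(w: np.ndarray, phi, s, a) -> float:
    """Q(s,a) = sum_i w_i * phi_i(s,a) = w . phi(s,a)."""
    return float(np.dot(w, phi(s, a)))

# Example usage with a hypothetical k-dimensional feature map phi(s, a):
# w = np.zeros(k); q_value = linear_q(w, phi, s, a)
```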

Page 39: A Unifying Framework for Computational Reinforcement Learning Theory


Outline

• Reinforcement Learning (RL)
• The KWIK Framework
• Provably Efficient RL
  – Model-based Approaches
  – Model-free Approaches
• Conclusions

Page 40: A Unifying Framework for Computational Reinforcement Learning Theory


Open Problems

• Agnostic learning [Kearns & Schapire & Sellie 94] in KWIK
  – Hypothesis class H may not include h*
  – "Unrealizable" KWIK [Li & Littman 08]
• Prior information in RL
  – Bayesian priors [Asmuth & Li & Littman & Nouri & Wingate 09]
  – Heuristics/shaping [Asmuth & Littman & Zinkov 08] [Strehl & Li & Littman 09]
• Approximate RL with KWIK
  – Least-squares policy iteration [Li & Littman & Mansley 09]
  – Fitted value iteration [Brunskill & Leffler & Li & Littman & Roy 08]
  – Linear function approximation [Li & Littman 08]

Page 41: A Unifying Framework for Computational Reinforcement Learning Theory


Conclusions: A Unification

[Diagram: the KWIK framework [Li & Littman & Walsh 08] unifies provably efficient RL algorithms.]

Model-based:
• Finite MDP [Kearns & Singh 02] [Brafman & Tennenholtz 02] [Kakade 03] [Strehl & Li & Littman 06]
• Linear MDP [Strehl & Littman 08]
• RAM-MDP [Leffler & Littman & Edmunds 07]
• Gaussian-Offset MDP [Brunskill & Leffler & Li & Littman & Roy 08]
• Factored MDP [Kearns & Koller 99] [Strehl & Diuk & Littman 07] [Li & Littman & Walsh 08] [Diuk & Li & Leffler 09]
• Delayed-Observation MDP [Walsh & Nouri & Li & Littman 07]

Model-free:
• Finite MDP [Strehl & Li & Wiewiora & Langford & Littman 06] (matching lower bound)
• KWIK-based VFA [Li & Littman 08] [Li & Mansley & Littman 09]

The KWIK (Knows What It Knows) learning model provides a flexible, modularized, and unifying way for creating and analyzing RL algorithms with provably efficient exploration.

Page 42: A Unifying Framework for Computational Reinforcement Learning Theory


References

1. Li, Littman, & Walsh: “Knows what it knows: A framework for self-aware learning”. In ICML 2008.

2. Diuk, Li, & Leffler: “The adaptive k-meteorologist problem and its applications to structure discovery and feature selection in reinforcement learning”. In ICML 2009.

3. Brunskill, Leffler, Li, Littman, & Roy: “CORL: A continuous-state offset-dynamics reinforcement learner”. In UAI 2008.

4. Walsh, Nouri, Li, & Littman: “Planning and learning in environments with delayed feedback”. In ECML 2007.

5. Strehl, Li, & Littman: “Incremental model-based learners with formal learning-time guarantees”. In UAI 2006.

6. Li, Littman, & Mansley: “Online exploration in least-squares policy iteration”. In AAMAS 2009.

7. Li & Littman: “Efficient value-function approximation via online linear regression”. In AI&Math 2008.

8. Strehl, Li, Wiewiora, Langford, & Littman: “PAC model-free reinforcement learning”. In ICML 2006.
