Structural Return Maximization for Reinforcement Learning
Josh Joseph, Alborz Geramifard, Javier Velez, Jonathan How, Nicholas Roy



How should we act in the presence of complex, unknown dynamics?


What do I mean by complex dynamics?

• Can’t derive from first principles / intuition
• Any dynamics model will be approximate
• Limited data
  – Otherwise just do nearest neighbors
• Batch data
  – Trying to keep it as simple as possible for now
  – Fairly straightforward to extend to active learning


How does RL solve these problems?

• Assume some representation class for:
  – Dynamics model
  – Value function
  – Policy
• Collect some data
• Find the “best” representation based on the data


How does RL solve these problems?

• The “best” representation based on the data
• This defines the best policy…not the best representation

$V^{\pi}(s_0) = \mathbb{E}\!\left[\sum_{t} \gamma^{t} R(s_t, a_t)\right]$ (the value, i.e. return, of policy $\pi$ from starting state $s_0$, under the reward $R$ and the unknown dynamics model)


…but does RL actually solve this problem?

• Policy Search
  – Policy directly parameterized by $\theta$


$\hat{V}(\theta) = \frac{1}{K}\sum_{k=1}^{K} R(\tau_k)$ (empirical estimate of the return, averaged over the $K$ episodes)
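To make the empirical estimate concrete, here is a minimal policy-search sketch; the toy 1-D environment, the linear policy, and the episode counts are illustrative assumptions, not from the talk:

```python
import numpy as np

def rollout(theta, rng, horizon=20):
    """One episode in a toy 1-D MDP: the state drifts, the action pushes
    it back, and the reward penalizes distance from the origin."""
    s, ret = rng.normal(), 0.0
    for _ in range(horizon):
        a = -theta * s                      # linear policy parameterized by theta
        s = s + a + 0.1 * rng.normal()      # dynamics, unknown to the agent
        ret += -(s ** 2)                    # reward
    return ret

def empirical_return(theta, num_episodes=100, seed=0):
    """V_hat(theta): average return over K episodes."""
    rng = np.random.default_rng(seed)
    return np.mean([rollout(theta, rng) for _ in range(num_episodes)])

# Policy search: pick the theta with the highest empirical return.
thetas = np.linspace(0.0, 1.5, 16)
best = max(thetas, key=empirical_return)
print(f"best theta: {best:.2f}, V_hat: {empirical_return(best):.3f}")
```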


…but does RL actually solve this problem?

• Model-based RL
  – Dynamics model = $\hat{T} = \arg\max_{T} p(\mathcal{D} \mid T)$ (the maximum-likelihood fit to the data)


Maximizing likelihood != maximizing return


…similar story for value-based methods
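Written as two objectives (our paraphrase, with $\pi_T$ denoting the policy that is optimal in model $T$):

```latex
% Maximum likelihood and return maximization generally pick different models:
\hat{T}_{\mathrm{ML}} = \arg\max_{T \in \mathcal{T}} \; p(\mathcal{D} \mid T)
\qquad \neq \qquad
\hat{T}_{\mathrm{return}} = \arg\max_{T \in \mathcal{T}} \; V(\pi_{T})
```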


ML model selection in RL

• So why do we do it?
  – It’s easy
  – It sometimes works really well
  – Intuitively it feels like finding the most likely model should result in a high-performing policy
• Why does it fail?
  – Chooses an “average” model based on the data
  – Ignores the reward function

• What do we do then?


Our Approach

• Model-based RL
  – Dynamics model = $\hat{T} = \arg\max_{T \in \mathcal{T}} \hat{V}(\pi_{T})$ (choose the model whose optimal policy has the highest empirical return estimate)


Planning with Misspecified Model Classes (us)


We can do the same thing in a value-based setting.
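A minimal sketch of the idea: score each candidate model by the empirical return of the policy that is optimal in that model. The toy environment and stand-in planner below are illustrative assumptions; in the batch setting the return estimate would come from the logged episodes rather than fresh rollouts:

```python
import numpy as np

rng = np.random.default_rng(1)

def true_rollout(gain, horizon=20):
    """Ground-truth environment, unknown to the agent (illustrative):
    s' = 0.9*s + a + noise, with action a = -gain * s."""
    s, ret = rng.normal(), 0.0
    for _ in range(horizon):
        s = 0.9 * s - gain * s + 0.1 * rng.normal()
        ret += -(s ** 2)
    return ret

def optimal_gain_under(drift):
    """Stand-in planner: if the model says s' = drift*s + a, the
    return-maximizing linear policy sets a = -drift*s."""
    return drift

def score(drift, episodes=200):
    """Empirical return of the policy that is optimal in model `drift`."""
    gain = optimal_gain_under(drift)
    return np.mean([true_rollout(gain) for _ in range(episodes)])

candidate_models = [0.5, 0.9, 1.2]   # candidate drift parameters
best = max(candidate_models, key=score)
print("model chosen by empirical return:", best)
```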


…but

• We are indirectly choosing a policy representation
• The win of this indirect representation is that it can be “small”
• Small = less data?
  – Intuitively you’d think so
  – Empirical evidence from toy problems
• But all of our guarantees rely on infinite data
• …maybe there’s a way to be more concrete


What we want

• How does the representation space relate to true return?

• …they’ve been doing this in classification since the 60s
  – Relationship between the “size” of the representation space and the amount of data



How to get there

Model-based, value-based, policy search
Map RL to classification → Empirical Risk Minimization
Measuring function class size → Bound on true risk
Structure of function classes → Structural risk minimization


Classification


$f\!\left(\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}\right) = \operatorname{sign}\!\left(\begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix}^{T} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}\right)$ (a linear classifier $f$, illustrated on labeled points plotted over axes $x_1$, $x_2$)

Risk:

$R(f) = \mathbb{E}_{(x,y)\sim P}\big[L\big(f(x), y\big)\big]$ (loss/cost $L$; unknown data distribution $P$)

Empirical Risk Minimization

$R_{\mathrm{emp}}(f) = \frac{1}{N}\sum_{i=1}^{N} L\big(f(x_i), y_i\big)$ (empirical estimate of the risk from $N$ samples of the unknown data distribution)

ERM: choose the $f \in \mathcal{F}$ that minimizes $R_{\mathrm{emp}}(f)$.
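A minimal ERM sketch for the linear sign classifier above, with 0-1 loss and a random search standing in for the minimization (the dataset and the search are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset from an "unknown" distribution: N labeled points in R^2.
N = 200
X = rng.normal(size=(N, 2))
y = np.sign(X @ np.array([1.0, -2.0]) + 0.3 * rng.normal(size=N))

def empirical_risk(theta, X, y):
    """R_emp(f) under 0-1 loss for f(x) = sign(theta^T x)."""
    preds = np.sign(X @ theta)
    return np.mean(preds != y)

# ERM over a finite random sample of linear classifiers.
candidates = rng.normal(size=(500, 2))
risks = [empirical_risk(th, X, y) for th in candidates]
best = candidates[int(np.argmin(risks))]
print("best theta:", best, "empirical risk:", min(risks))
```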


Mapping RL to Classification
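The figures on this slide did not survive extraction; one plausible reading of the mapping, treating episodes as data samples and negative return as the loss (a reconstruction of ours, not verbatim from the deck):

```latex
% A plausible reconstruction of the mapping (slide figures lost):
%   classifier  f        <->  policy  \pi
%   data point  (x, y)   <->  episode  \tau
%   loss  L(f(x), y)     <->  negative return  -R(\tau)
\underbrace{\mathbb{E}_{(x,y)\sim P}\big[L(f(x),y)\big]}_{\text{risk of a classifier}}
\quad\longleftrightarrow\quad
\underbrace{-\,\mathbb{E}_{\tau \sim p(\tau \mid \pi)}\big[R(\tau)\big]}_{\text{negative return of a policy}}
```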


Measuring the size of a function class: VC Dimension

• Introduces a notion of “shattering”
  – I pick the inputs
  – You pick the labels
  – VC Dim = max number of points I can perfectly decide


$\mathrm{VCdim}(\text{2-D linear classifiers}) = 3$
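A brute-force check of the shattering game for 2-D halfplane classifiers $\mathrm{sign}(w^{T}x + b)$; the point sets and the random search over classifiers are illustrative:

```python
import itertools
import numpy as np

def can_shatter(points):
    """Check whether 2-D halfplane classifiers sign(w.x + b) can realize
    every labeling of the given points. The search over (w, b) is a dense
    random sample, so this is an approximate check (illustrative)."""
    rng = np.random.default_rng(0)
    ws = rng.normal(size=(20000, 2))
    bs = rng.normal(size=20000)
    preds = np.sign(points @ ws.T + bs)        # (n_points, n_classifiers)
    realized = {tuple(col) for col in preds.T}  # labelings actually achieved
    labelings = itertools.product([-1.0, 1.0], repeat=len(points))
    return all(lab in realized for lab in labelings)

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])    # not collinear
print("3 points shattered:", can_shatter(three))           # True
four = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], float)   # XOR layout
print("4 points shattered:", can_shatter(four))            # False
```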

• Magically, shattering (VC Dim) can be used to bound true risk
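The bound itself is not in the surviving text; the classic Vapnik form this refers to states that, with probability at least $1-\eta$:

```latex
% h = VC dimension of the function class, N = number of samples
R(f) \;\le\; R_{\mathrm{emp}}(f)
  + \sqrt{\frac{h\left(\ln\frac{2N}{h} + 1\right) + \ln\frac{4}{\eta}}{N}}
```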


For those of you familiar with statistical learning theory…

• VC Dim
  – Only known for a few function classes
  – Difficult to estimate, bound
• Rademacher complexity
  – Use the data to estimate the “volume” of the function class
  – This volume can then be used in a similar bound
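A sketch of estimating empirical Rademacher complexity by Monte Carlo, with the sup over the function class approximated by a finite random sample of linear classifiers (all of these choices are illustrative):

```python
import numpy as np

def empirical_rademacher(X, n_sigma=200, n_funcs=2000, seed=0):
    """Estimate R_hat(F) = E_sigma[ sup_{f in F} (1/N) sum_i sigma_i f(x_i) ]
    for F = {x -> sign(theta.x)}. The sup is approximated by sampling
    thetas, so this is an under-estimate (illustrative)."""
    rng = np.random.default_rng(seed)
    N = len(X)
    thetas = rng.normal(size=(n_funcs, X.shape[1]))
    F = np.sign(X @ thetas.T)                  # (N, n_funcs): f(x_i) values
    sigmas = rng.choice([-1.0, 1.0], size=(n_sigma, N))
    sups = (sigmas @ F / N).max(axis=1)        # sup over sampled f, per sigma
    return sups.mean()

X = np.random.default_rng(1).normal(size=(50, 2))
print("estimated Rademacher complexity:", empirical_rademacher(X))
```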


Measuring the size of a function class

• Now we can say concrete things about why we may prefer one representation over another with limited data



Empirical Risk Minimization and Limited Data

• If the bound is large, we cannot expect small empirical risk to result in small true risk
• …so what do we do?
• Choose the function class which minimizes the bound!


Structural Risk Minimization

• Using a “structure” of nested function classes: $\mathcal{F}_1 \subseteq \mathcal{F}_2 \subseteq \cdots$
• For $N$ data, we choose the function class that minimizes the risk bound:

$\mathcal{F}^{*} = \arg\min_{\mathcal{F}_k} \Big[ \min_{f \in \mathcal{F}_k} R_{\mathrm{emp}}(f) + \Omega(\mathcal{F}_k, N) \Big]$ (empirical risk of the best member plus a capacity term $\Omega$)

• Many natural structures of policy classes!
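A sketch of the structural-return-maximization loop over a nested family of polynomial policy classes; the toy environment, the inner random search, and the capacity penalty $\Omega$ are stand-ins, not the talk's actual bound:

```python
import numpy as np

rng = np.random.default_rng(2)

def rollout(policy_fn, horizon=20):
    """One episode in a toy 1-D environment (illustrative)."""
    s, ret = rng.normal(), 0.0
    for _ in range(horizon):
        s = 0.9 * s + policy_fn(s) + 0.1 * rng.normal()
        ret += -(s ** 2)
    return ret

def best_in_class(degree, n_candidates=300, episodes=30):
    """ERM analogue within one class: polynomial policies of given degree."""
    best_ret, best_coefs = -np.inf, None
    for _ in range(n_candidates):
        coefs = rng.normal(size=degree + 1)
        pi = lambda s, c=coefs: np.polyval(c, s)
        ret = np.mean([rollout(pi) for _ in range(episodes)])
        if ret > best_ret:
            best_ret, best_coefs = ret, coefs
    return best_ret, best_coefs

# Structure: nested policy classes F_1 (linear) within F_2 within F_3 (cubic).
N = 30                                   # episodes per evaluation
def penalty(k):                          # stand-in capacity term Omega(k, N)
    return np.sqrt(k / N)

scores = {}
for k in [1, 2, 3]:
    emp_return, _ = best_in_class(k, episodes=N)
    scores[k] = emp_return - penalty(k)  # return analogue of the SRM bound
print("chosen class degree:", max(scores, key=scores.get))
```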


Is this Bayesian?

• Prior knowledge
  – Structure encodes prior knowledge
• Robust to over-fitting
  – Choose the function class based on risk bound
• No Bayes update
• No assumptions about the true function lying in the structure
  – Breaks most (all?) Bayesian nonparametrics


Contribution

• Classification to RL mapping
• Transferred probabilistic bounds from statistical learning theory to RL
• Applied structural risk minimization to RL


Backup Slides


From last time…


$\{m_c, m_p, l\}$


Measuring the size of a function class

• Rademacher complexity