
Robot Reinforcement Learning


Page 1: Robot Reinforcement Learning

Robot Reinforcement Learning

1

Page 2: Robot Reinforcement Learning

2

Robot Decision Making and Planning

Page 3: Robot Reinforcement Learning

Robots need to make various decisions and construct different plans, for example:

• Decision making
• Task planning
• Motion planning (e.g., for robotic arms)
• Path planning (e.g., for mobile robots)

3

Robot Decision Making and Planning

Page 4: Robot Reinforcement Learning

Perspectives for considering robot planning methods:

• Reactive (one-time) decision making versus sequential planning
• Certain versus uncertain scenarios
• Observable versus partially observable space

4

Robot Decision Making and Planning

We will focus on sequential decision making

Page 5: Robot Reinforcement Learning

• Deterministic, fully observable
  • The agent knows exactly which state it will be in
  • Agent actions are executed as expected

• Dynamic, partially observable
  • Observations provide new information about the current state, with uncertainty
  • Robot actions may not be successfully executed

• Non-observable
  • The agent may have no idea where it is

5

Decision Making/Planning Types

Page 6: Robot Reinforcement Learning

6

Example: Vacuum World

Page 7: Robot Reinforcement Learning

7

Example: Vacuum World (observable)

(Also called reward, penalty, or utility)

Page 8: Robot Reinforcement Learning

8

Example: Vacuum World (non-observable)

Page 9: Robot Reinforcement Learning

• How about in an uncertain scenario?
  • Uncertainty in action outcomes
  • Uncertainty in state of knowledge
  • Any combination of the two

9

Planning with Uncertainty

Page 10: Robot Reinforcement Learning

• Solutions based on decision trees

10

Planning with Uncertainty

Page 11: Robot Reinforcement Learning

• A utility (i.e., reward or cost) function associates a real-valued utility with each outcome (state or state-action pair)

• With utilities, we can compute and optimize expected utilities for planning under uncertainty

• For example, the expected utility of decision d in state s is defined as shown below

• The principle of maximum expected utility states that the optimal decision under uncertainty is the one with the greatest expected utility
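The formula on the original slide is an image; a standard form, assuming a transition model P(s' | s, d) over resulting states s' and a utility function U, is:

    EU(d \mid s) = \sum_{s'} P(s' \mid s, d)\, U(s'), \qquad d^{*} = \arg\max_{d} EU(d \mid s)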

11

Planning with Uncertainty

Page 12: Robot Reinforcement Learning

Two fundamental problems in sequential decision making:

• Reinforcement Learning:
  • The environment is initially unknown
  • The agent interacts with the environment
  • The agent improves its policy

• Planning:
  • A model of the environment is known
  • The agent performs computations with its model (without any external interaction)

12

Reinforcement Learning for Planning

Page 13: Robot Reinforcement Learning

13

Reinforcement Learning

• Branches of Machine Learning

Page 14: Robot Reinforcement Learning

14

Characteristics of Reinforcement Learning

• What makes reinforcement learning different from other machine learning paradigms?
  • There is no supervisor, only a reward signal
  • Feedback is delayed, not instantaneous
  • Time really matters (sequential, non-i.i.d. data)
  • The agent's actions affect the subsequent data it receives

Page 15: Robot Reinforcement Learning

15

Applications of Reinforcement Learning

Page 16: Robot Reinforcement Learning

16

Applications of Reinforcement Learning

Page 17: Robot Reinforcement Learning

17

Reinforcement Learning

•Reinforcement learning is based on the reward hypothesis

• Definition (Reward Hypothesis): All goals can be described by the maximization of expected cumulative reward
  • A reward 𝑅𝑡 is a scalar feedback signal
  • It indicates how well the agent is doing at step 𝑡
  • The agent's job is to maximize cumulative reward

• Actions may have long-term consequences, thus reward may be delayed
  • It may be better to sacrifice immediate reward to gain more long-term reward
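As a sketch of what "cumulative reward" means formally (the discount factor γ is an assumption here; it is introduced later with Markov reward processes), the return from step t can be written as:

    G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}, \qquad \gamma \in [0, 1]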

Page 18: Robot Reinforcement Learning

18

Agent and Environment

The RL slides are from Dr. David Silver: http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html

Page 19: Robot Reinforcement Learning

19

History and State

Page 20: Robot Reinforcement Learning

20

Environment State

Page 21: Robot Reinforcement Learning

21

Agent State

Page 22: Robot Reinforcement Learning

22

Information State

Page 23: Robot Reinforcement Learning

23

Fully Observable Environments

Page 24: Robot Reinforcement Learning

24

Markov Property

Page 25: Robot Reinforcement Learning

25

State Transition Matrix

Page 26: Robot Reinforcement Learning

26

Markov Process

Page 27: Robot Reinforcement Learning

27

Markov Process: Example

Page 28: Robot Reinforcement Learning

28

Markov Process: Example

Page 29: Robot Reinforcement Learning

29

Markov Reward Process

Page 30: Robot Reinforcement Learning

30

Markov Reward Process: Example

Page 31: Robot Reinforcement Learning

31

Markov Decision Process

Page 32: Robot Reinforcement Learning

32

Markov Decision Process: Example

Page 33: Robot Reinforcement Learning

33

MDP and Reinforcement Learning

Page 34: Robot Reinforcement Learning

34

Components of an RL Agent

Policy

(and greedy)

Page 35: Robot Reinforcement Learning

35

Components of an RL Agent

Model

Value Function

Page 36: Robot Reinforcement Learning

36

Maze Example

Page 37: Robot Reinforcement Learning

37

Maze Example: Policy

Page 38: Robot Reinforcement Learning

38

Maze Example: Value Function

Page 39: Robot Reinforcement Learning

39

Maze Example: Model

Page 40: Robot Reinforcement Learning

40

Model-Based and Model-Free RL

Model-based RL Model-free RL

Page 41: Robot Reinforcement Learning

41

Model-Based and Model-Free RL

Model-based RL Model-free RL

Page 42: Robot Reinforcement Learning

42

Q-Learning: Integrating Learning and Planning

• We are going to learn model-free RL (although knowing a model also works)

• We will focus on finding a way to estimate the value function directly
  • The value function does not need to explicitly represent or model the world

• The value function we use is the Q-function
  • A recursive way to approximate a value function

• The process of estimating the Q-function is called Q-learning
  • Q-learning integrates learning and planning

Page 43: Robot Reinforcement Learning

43

Q-Learning Basics

• Given a sequence of states, actions, and rewards defined by an MDP, we define a unit experience as shown below

• At each state s, choose the action a which maximizes the Q-function Q(s, a)
  • Q is the estimated value function
  • It tells us how good an action is given a state
  • Q(s, a) = immediate reward for taking an action + best value (Q) from the resulting (future) states
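The two formulas on the original slide are images; a common notation, assumed here, writes the MDP trajectory and the unit experience as:

    s_0, a_0, r_1, s_1, a_1, r_2, s_2, \dots \qquad e = (s, a, r, s')

where s' is the state reached after taking action a in state s and receiving reward r.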

Page 44: Robot Reinforcement Learning

44

Q-Learning Formal Definition

• The Q-function has a recursive definition (shown below)

• Q-learning is about maintaining and updating the table of Q-values, called the Q-table
  • It only updates Q-values for the state-action pairs that are visited
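The recursive definition on the slide is an image; the standard form, assuming a discount factor γ, is:

    Q(s, a) = r(s, a) + \gamma \max_{a'} Q(s', a')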

Page 45: Robot Reinforcement Learning

45

Q-Learning Algorithm

• The Q-learning algorithm is also recursive:
  • Consider the unit experience (s, a, r, s') and apply the update sketched below
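The per-experience update on the slide is shown as an image; consistent with the later note about completely overwriting the old Q-value, a plausible reconstruction is:

    Q(s, a) \leftarrow r + \gamma \max_{a'} Q(s', a')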

Page 46: Robot Reinforcement Learning

46

Q-Learning Example

Page 47: Robot Reinforcement Learning

47

Q-Learning Example

Page 48: Robot Reinforcement Learning

48

Q-Learning Example

Page 49: Robot Reinforcement Learning

49

Q-Learning Example

Page 50: Robot Reinforcement Learning

50

Q-Learning Example

Page 51: Robot Reinforcement Learning

51

Q-Learning Example

Page 52: Robot Reinforcement Learning

52

Q-Learning Example

Page 53: Robot Reinforcement Learning

53

Q-Learning Example

Page 54: Robot Reinforcement Learning

54

Q-Learning Example

Page 55: Robot Reinforcement Learning

55

Q-Learning Example

New episode

Page 56: Robot Reinforcement Learning

56

Q-Learning Example

Page 57: Robot Reinforcement Learning

57

Q-Learning Example

Page 58: Robot Reinforcement Learning

58

Q-Learning

• The Q-learning algorithm is also recursive:
  • Consider the unit experience

(Greedy selection of the action)

(Completely overwrites the old Q-value)

Page 59: Robot Reinforcement Learning

59

Q-Learning: Temporal Difference Learning

•Temporal Difference (TD) algorithms enable the agent to incrementally update its Q-table through every single action it takes.

▪ The value Target − OldEstimate is called the TD error (also called the target error).

▪ The step size, usually denoted by 𝛼 and also called the learning rate, lies between 0 and 1; 𝛼 = 1 completely overwrites the old Q-value.
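The general incremental update (an image on the slide) follows the standard form from Sutton and Barto:

    NewEstimate \leftarrow OldEstimate + StepSize \, (Target - OldEstimate)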

Page 60: Robot Reinforcement Learning

60

Q-Learning: Temporal Difference Learning

• The algorithm of Q-learning becomes:

• Q-Learning is an off-policy learning algorithm, because it directly finds the optimal Q-value without any dependency on the policy being followed (due to the max operation).

Ref: Reinforcement Learning: An Introduction by Sutton and Barto, Chapter 6.8
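The algorithm box on the slide is an image; as a hedged sketch under the definitions above (the environment interface env.reset()/env.step(), env.actions, and the hyperparameter values are assumptions for illustration), tabular Q-learning with the TD update might look like:

    import random
    from collections import defaultdict

    def q_learning(env, num_episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
        # Tabular Q-learning sketch. Assumed environment interface:
        # env.reset() -> state, env.step(action) -> (next_state, reward, done),
        # env.actions is the list of discrete actions.
        Q = defaultdict(float)  # Q-table keyed by (state, action), default 0

        for _ in range(num_episodes):
            s = env.reset()
            done = False
            while not done:
                # epsilon-greedy action selection (detailed on a later slide)
                if random.random() < epsilon:
                    a = random.choice(env.actions)
                else:
                    a = max(env.actions, key=lambda act: Q[(s, act)])

                s_next, reward, done = env.step(a)

                # Off-policy TD target: max over next actions, regardless of
                # the action the policy will actually take next
                if done:
                    target = reward
                else:
                    target = reward + gamma * max(Q[(s_next, act)] for act in env.actions)
                Q[(s, a)] += alpha * (target - Q[(s, a)])
                s = s_next
        return Q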

Page 61: Robot Reinforcement Learning

61

Q-Learning Example

Page 62: Robot Reinforcement Learning

62

𝜺-greedy policy for Action Selection

• The 𝜀-greedy policy is the most widely used (for both on- and off-policy learning) to choose an action given a state:

• The value of 𝜀 determines the exploration-exploitation trade-off of the agent
  ▪ A larger 𝜀 results in more exploration and less exploitation
  ▪ As a rule of thumb, 𝜀 is often initialized to 0.9 and decreased over time

𝜺-greedy policy:

1. Generate a random number 𝑟 ∈ [0, 1]
2. If 𝑟 > 𝜀, choose the action derived from the Q-values (i.e., the action with the maximum Q-value for the current state)
3. Else, choose a random action
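A minimal sketch of this procedure in Python, assuming a Q-table keyed by (state, action) and a list of candidate actions:

    import random

    def epsilon_greedy(Q, state, actions, epsilon=0.9):
        # Follows the slide's convention: explore with probability epsilon,
        # otherwise exploit the current Q-values.
        r = random.random()                  # step 1: random number in [0, 1]
        if r > epsilon:                      # step 2: exploit
            return max(actions, key=lambda a: Q.get((state, a), 0.0))
        return random.choice(actions)        # step 3: explore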

Page 63: Robot Reinforcement Learning

63

SARSA

• SARSA is an acronym for State-Action-Reward-State-Action.

• SARSA is an on-policy TD learning algorithm, because it evaluates and improves the same policy that is being used to select actions.

Page 64: Robot Reinforcement Learning

64

SARSA

•The final algorithm of SARSA:
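The algorithm box on the slide is an image; the core on-policy update, written here as a hedged reconstruction, replaces Q-learning's max over next actions with the Q-value of the next action a' actually selected by the policy:

    Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \, Q(s', a') - Q(s, a) \right]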

Ref: Reinforcement Learning: An Introduction by Sutton and Barto, Chapter 6.7

Page 65: Robot Reinforcement Learning

65

SARSA VS. Q-Learning

Page 66: Robot Reinforcement Learning

66

RL Challenges: A Pole-Balancing Example

• Problem: Pole-balancing
  • Move the cart left/right to keep the pole balanced

• State representation
  • Position and velocity of the cart
  • Angle and angular velocity of the pole
  • These are all continuous variables

• Action representation
  • Move the cart left or right with a certain speed
  • Speed is also a continuous variable

• Straightforward solution (see the sketch below)
  • Coarse discretization of the state variables into 3 values: left, center, right
  • Coarse discretization of the speed as well
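A hedged sketch of such a coarse discretization in Python (the bin edges are illustrative assumptions, not values from the slide):

    import numpy as np

    def discretize_state(position, velocity, angle, angular_velocity):
        # Map the 4 continuous cart-pole variables to coarse integer bins.
        # Each variable is split into 3 regions (roughly left/center/right or
        # negative/near-zero/positive); the bin edges are illustrative only.
        pos_bin = int(np.digitize(position, [-1.0, 1.0]))
        vel_bin = int(np.digitize(velocity, [-0.5, 0.5]))
        ang_bin = int(np.digitize(angle, [-0.1, 0.1]))        # radians
        ang_vel_bin = int(np.digitize(angular_velocity, [-0.5, 0.5]))
        return (pos_bin, vel_bin, ang_bin, ang_vel_bin)       # discrete state tuple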

Page 67: Robot Reinforcement Learning

67

RL Challenges

• Rewards
  • Indicate what we want to accomplish
  • NOT how we want to accomplish it

• Robot in a maze
  • Episodic task
  • Big reward when out, -1 for each step

• Chess
  • GOOD: +1 for winning, -1 for losing
  • BAD: +0.25 for taking an opponent's piece

• It is hard to design rewards for general tasks

• An active research topic is reward learning

Page 68: Robot Reinforcement Learning

68

Deep Q-Learning

•Q-learning is a simple yet powerful algorithm

•However, Q-learning easily suffers from the curse of dimensionality

•When the number of states and actions becomes larger, the Q-table becomes intractable:

1. The amount of memory required to save and update the Q-table would increase as the number of states and actions increases

2. The amount of time required to explore each state to create the required Q-table would be unrealistic

Page 69: Robot Reinforcement Learning

69

Deep Q-Learning

• Idea of Deep Q-Learning: approximating the Q-table with a neural network

▪ The state is given as the input, and the Q-values of all possible actions are generated as the output.
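A minimal sketch of such a network in PyTorch (the layer sizes and the use of torch are assumptions for illustration; the slides do not specify an architecture):

    import torch
    import torch.nn as nn

    class QNetwork(nn.Module):
        # Maps a state vector to one Q-value per possible action.
        def __init__(self, state_dim, num_actions, hidden_dim=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, num_actions),  # one output per action
            )

        def forward(self, state):
            return self.net(state)  # shape: (batch, num_actions)

    # Example usage: greedy action for a single (hypothetical) 4-dimensional state
    # q_net = QNetwork(state_dim=4, num_actions=2)
    # q_values = q_net(torch.randn(1, 4))
    # action = int(q_values.argmax(dim=1))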