
Reinforcement Learning for Autonomous Quadrotor Helicopter Control
Michael Koval, Christopher Mansley, and Michael Littman

Rutgers, the State University of New Jersey

Abstract

▶ Quadrotor helicopters are rapidly finding use in a number of applications
▶ Programming a new behavior means implementing a control algorithm
▶ Tuning a new algorithm is costly and requires expertise in control theory
▶ Supervised learning is not a viable alternative: training data is hard to obtain
▶ Reinforcement learning can learn directly from the quadrotor’s raw sensors
▶ Goal: Use reinforcement learning to teach a quadrotor helicopter how to perform complex tasks with minimal human interaction

Experimental Setup

1. Parrot AR.Drone [4]
   – Inexpensive quadrotor helicopter with impressive technical specifications
   – Outfitted with two cameras, an altitude sensor, and an IMU
   – On-board computer running a variant of embedded Linux
   – Communicates with a controlling computer over a dedicated wifi network

2. Vicon Motion Tracking System
   – Quadrotor is tagged with an asymmetric pattern of reflective spheres
   – Configure the Vicon system to track the pattern as a rigid body
   – Track the quadrotor’s flight using the Vicon system’s infrared cameras

3. Control Software
   – Exerts direct, low-level control over the AR.Drone’s flight
   – Closes the loop between the motion tracking system and the AR.Drone
   – Uses reinforcement learning to find an optimal control algorithm

4. Robot Operating System (ROS)
   – Inter-process communication framework developed by Willow Garage
   – Allows the system to be easily extended for more complex experiments (a minimal control-loop sketch follows this list)
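
The control loop in items 3 and 4 can be sketched as a single ROS node. This is a minimal illustration rather than the actual implementation: the topic names (/cmd_vel, /vicon/ardrone/pose), the message types, and the hand-tuned proportional rule are assumptions, and the learned policy would replace that rule.

    # Minimal rospy sketch of the closed-loop controller (topic names assumed).
    import rospy
    from geometry_msgs.msg import PoseStamped, Twist

    class HoverController(object):
        def __init__(self):
            # Velocity commands sent to the AR.Drone driver (assumed topic name).
            self.cmd_pub = rospy.Publisher('/cmd_vel', Twist, queue_size=1)
            # Pose estimates derived from the Vicon system (assumed topic name).
            rospy.Subscriber('/vicon/ardrone/pose', PoseStamped, self.pose_callback)
            self.target_altitude = 1.0  # hover setpoint in meters

        def pose_callback(self, msg):
            # Hand-tuned proportional rule on the altitude error; the reinforcement
            # learning agent would instead supply this mapping from state to action.
            cmd = Twist()
            cmd.linear.z = 0.5 * (self.target_altitude - msg.pose.position.z)
            self.cmd_pub.publish(cmd)

    if __name__ == '__main__':
        rospy.init_node('quadrotor_controller')
        HoverController()
        rospy.spin()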

[System architecture diagram: joystick teleoperation, the AR.Drone, the Vicon system, a Kalman filter, and the learning algorithm]

Reinforcement Learning

▶ The agent learns from rewards earned while interacting with a stochastic environment

▶ Does not require annotated examples of the agent’s correct actions
▶ Ideal for situations where rewards can be assigned automatically
▶ The environment is modeled as a Markov decision process where, at time t:
   1. the agent is in state s_t ∈ S,
   2. chooses action a_t ∼ π(s_t, a) using policy π, and
   3. receives reward r_t

▶ Goal is to find the policy π* that maximizes the agent’s expected discounted reward (computed per roll-out in the sketch below):

   J|_\pi = E\left\{ \sum_{t=0}^{\infty} \gamma^t r_t \right\}

▶ Popular algorithms include: Q-learning, TD learning, and actor–critic
▶ Policy gradient and natural actor–critic are the most popular in robotics
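
As a concrete reading of the objective above, the discounted sum can be computed directly from the reward sequence of a single roll-out (the discount factor and example rewards here are arbitrary):

    # Discounted return of one roll-out: J = sum over t of gamma^t * r_t.
    def discounted_return(rewards, gamma=0.95):
        return sum((gamma ** t) * r for t, r in enumerate(rewards))

    # Example: three steps of reward.
    print(discounted_return([-1.0, -0.5, -0.25]))  # prints -1.700625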

Simulation: Linear Quadratic Regulator

[Figure: surface plot of the long-term expected reward J(θ) over the policy parameters θ_1 and θ_2, alongside the estimated gradient of the expected reward over (w_1, w_2)]

▶ Simple one-dimensional control theory problem with s ∈ ℝ, a ∈ ℝ, and the environment: [2]

   s_{t+1} = s_t + a_t + \text{noise}, \qquad R_t = -s_t^2 - a_t^2

▶ The stochastic controller draws actions from a_t ∼ N(μ_θ, σ_θ), where μ_θ = θ_1 s_t and σ_θ = (1 + e^{−θ_2})^{−1} [2]
▶ Vanilla policy gradient was simulated using a central-difference estimator to compute gradients [3] (environment and policy sketched below)
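
A sketch of the simulated environment and stochastic controller described above, written in plain Python; the horizon, noise level, discount factor, and roll-out count are illustrative assumptions.

    # LQR test problem: s_{t+1} = s_t + a_t + noise,  R_t = -s_t^2 - a_t^2.
    import math
    import random

    def rollout(theta, horizon=20, gamma=0.95, noise_std=0.1):
        """Run one episode with the stochastic policy; return its discounted reward."""
        theta1, theta2 = theta
        sigma = 1.0 / (1.0 + math.exp(-theta2))   # sigma_theta = (1 + e^{-theta_2})^{-1}
        s, ret = random.gauss(0.0, 1.0), 0.0
        for t in range(horizon):
            a = random.gauss(theta1 * s, sigma)   # a_t ~ N(theta_1 * s_t, sigma_theta)
            ret += (gamma ** t) * (-(s ** 2) - (a ** 2))
            s = s + a + random.gauss(0.0, noise_std)
        return ret

    def estimate_J(theta, n_rollouts=100):
        """Monte Carlo estimate of the expected return J(theta)."""
        return sum(rollout(theta) for _ in range(n_rollouts)) / n_rollouts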

Results: Linear Quadratic Regulator Simulation

[Figure: the policy parameters θ_1 and θ_2 plotted against the time step t over 400 iterations]

▶ Control theory proves that the theoretical optimum is at θ*_1 = −0.58 and θ*_2 = −∞ [2]
▶ Vanilla gradient ascent quickly converges to θ_1 = θ*_1 and continues moving towards θ_2 = θ*_2

Policy Gradient

▶ Assumes the policy π(s, a | θ) is uniquely parameterized by θ
▶ Learning the optimal policy π* is now simplified to learning the optimal θ*
▶ For any given θ, ∇J|_θ points in the direction of maximum increase
▶ Step the parameters by a small amount in the direction of ∇J|_{θ_i} (see the ascent loop sketched below):

   \theta_{i+1} = \theta_i + \alpha \nabla J|_{\theta_i}

▶ Gradually decrease α until the algorithm converges to a local maximum
▶ Problem: this requires an empirical estimate of the gradient ∇J|_θ
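
The update rule above reduces to a short gradient-ascent loop once a gradient estimator is available; estimate_gradient here is a stand-in for either estimator described below, and the decaying step-size schedule is an assumption.

    # Gradient ascent on the policy parameters: theta_{i+1} = theta_i + alpha * grad J.
    def policy_gradient_ascent(theta, estimate_gradient, alpha=0.05, iterations=400):
        for i in range(iterations):
            grad = estimate_gradient(theta)          # empirical estimate of grad J at theta
            step = alpha / (1.0 + 0.01 * i)          # gradually decreasing step size
            theta = [th + step * g for th, g in zip(theta, grad)]
        return theta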

Finite Difference Gradient Estimate

▶ Simple form of gradient approximation that requires no knowledge of π
▶ Slightly perturb θ with several small changes Δθ_1, Δθ_2, ...
▶ Approximate J(θ + Δθ_i) for each Δθ_i by averaging multiple roll-outs [3]
▶ Construct ΔΘ and ΔJ such that row ΔΘ_i = Δθ_i and ΔJ_i = J(θ + Δθ_i) − J(θ), then solve the least-squares problem (sketched below):

   \nabla J|_\theta \approx \left( \Delta\Theta^T \Delta\Theta \right)^{-1} \Delta\Theta^T \Delta J
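
A numpy sketch of this least-squares estimator; it assumes an estimate_J function such as the one sketched in the simulation section, and the perturbation size and count are arbitrary choices.

    # Finite-difference gradient estimate from random perturbations of theta:
    # grad J ~= (dTheta^T dTheta)^{-1} dTheta^T dJ, solved here by least squares.
    import numpy as np

    def finite_difference_gradient(theta, estimate_J, n_perturbations=20, eps=0.05):
        theta = np.asarray(theta, dtype=float)
        J_ref = estimate_J(theta)                            # unperturbed reference value
        dTheta = eps * np.random.randn(n_perturbations, theta.size)
        dJ = np.array([estimate_J(theta + dt) - J_ref for dt in dTheta])
        grad, *_ = np.linalg.lstsq(dTheta, dJ, rcond=None)   # least-squares solution
        return grad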

Likelihood Ratio Gradient Estimate

▶ Requires specific knowledge of the policy in order to evaluate ∇ log π(s, a | θ) [2]
▶ Estimates ∇J|_θ without requiring the policy to be perturbed:

   \nabla J|_\theta = E\left\{ \sum_{t=0}^{\infty} (r_t - b) \nabla \log \pi(s_t, a_t \mid \theta) \right\}

▶ Faster convergence than finite-difference methods in noisy environments (one variant sketched below)
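
A sketch of a likelihood-ratio estimator for the Gaussian controller from the simulation (μ_θ = θ_1 s, σ_θ = (1 + e^{−θ_2})^{−1}), with ∇ log π worked out analytically. It uses the common REINFORCE variant that weights each ∇ log π term by the reward-to-go minus a baseline; the episode count, horizon, and noise level mirror the earlier sketch and are assumptions.

    # Likelihood-ratio (REINFORCE-style) gradient estimate for the Gaussian policy.
    import math
    import random

    def grad_log_pi(theta, s, a):
        """Analytic grad of log pi(a | s, theta) for mu = theta_1*s, sigma = sigmoid(theta_2)."""
        theta1, theta2 = theta
        sigma = 1.0 / (1.0 + math.exp(-theta2))
        mu = theta1 * s
        d_theta1 = (a - mu) * s / sigma ** 2
        d_theta2 = ((a - mu) ** 2 - sigma ** 2) * (1.0 - sigma) / sigma ** 2
        return [d_theta1, d_theta2]

    def likelihood_ratio_gradient(theta, n_episodes=50, horizon=20, baseline=0.0):
        grad = [0.0, 0.0]
        sigma = 1.0 / (1.0 + math.exp(-theta[1]))
        for _ in range(n_episodes):
            s = random.gauss(0.0, 1.0)
            states, actions, rewards = [], [], []
            for _ in range(horizon):
                a = random.gauss(theta[0] * s, sigma)
                states.append(s)
                actions.append(a)
                rewards.append(-(s ** 2) - (a ** 2))
                s = s + a + random.gauss(0.0, 0.1)
            for t in range(horizon):
                weight = sum(rewards[t:]) - baseline       # reward-to-go minus baseline
                g1, g2 = grad_log_pi(theta, states[t], actions[t])
                grad[0] += weight * g1 / n_episodes
                grad[1] += weight * g2 / n_episodes
        return grad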

Natural Policy Gradient

▶ The gradient is defined in terms of Euclidean distance:

   \lim_{\Delta\theta \to 0} \frac{\left\| J(\theta + \Delta\theta) - J(\theta) - \nabla J(\theta) \cdot \Delta\theta \right\|^2}{\left\| \Delta\theta \right\|^2} = 0

▶ Policy parameters have arbitrary units, making the parameter space highly non-Euclidean
▶ The vanilla gradient therefore no longer points in the direction of maximum increase [3]
▶ The natural gradient is rotated to face the direction of maximum increase, where G_θ is the Fisher information matrix of the policy: [3]

   \tilde{\nabla} J|_\theta = G_\theta^{-1} \nabla J|_\theta

▶ Dramatically better convergence rates than vanilla policy gradient [1] (sketched below)
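
A numpy sketch of the natural-gradient correction: G_θ is approximated by the empirical Fisher information matrix (the average outer product of ∇ log π over sampled state–action pairs), and the vanilla gradient is rescaled by its inverse. The damping term and the grad_log_pi samples (for example, from the helper in the previous sketch) are assumptions.

    # Natural gradient: rescale the vanilla gradient by the inverse Fisher matrix.
    import numpy as np

    def natural_gradient(vanilla_grad, grad_log_pi_samples, damping=1e-3):
        g = np.asarray(vanilla_grad, dtype=float)
        scores = np.asarray(grad_log_pi_samples, dtype=float)   # one row of grad log pi per sample
        fisher = scores.T @ scores / len(scores)                # empirical Fisher information matrix
        fisher += damping * np.eye(len(g))                      # damping keeps the inverse well-conditioned
        return np.linalg.solve(fisher, g)                       # G^{-1} * grad J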

References

[1] S.I. Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.

[2] H. Kimura and S. Kobayashi. Reinforcement learning for continuous action using stochastic gradient ascent. Intelligent Autonomous Systems (IAS-5), pages 288–295, 1998.

[3] J. Peters and S. Schaal. Policy gradient methods for robotics. In Proceedings of the 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 2219–2225. IEEE, 2006.

[4] Stephane Piskorski. AR.Drone Developers Guide SDK 1.5. Parrot, 2010.

[email protected], [email protected], [email protected]