Reinforcement Learning for Autonomous Quadrotor Helicopter Control

Michael Koval, Christopher Mansley, and Michael Littman

Rutgers, the State University of New Jersey

Abstract

- Quadrotor helicopters are rapidly finding use in a number of applications
- Programming a new behavior means implementing a control algorithm
- Tuning a new algorithm is costly and requires expertise in control theory
- Supervised learning is not a viable alternative: training data is hard to get
- Reinforcement learning can learn directly from the quadrotor's raw sensors
- Goal: Use reinforcement learning to teach a quadrotor helicopter how to perform complex tasks with minimal human interaction

Experimental Setup

1. Parrot AR.Drone [4]. Inexpensive quadrotor helicopter with impressive technical specifications. Outfitted with two cameras, an altitude sensor, and an IMU. On-board computer running a variant of embedded Linux. Communicates with a controlling computer over a dedicated Wi-Fi network

2. Vicon Motion Tracking System. Quadrotor is tagged with an asymmetric pattern of reflective spheres. The Vicon system is configured to track this pattern as a rigid body and follows the quadrotor's flight with its infrared cameras

3. Control Software. Exerts direct, low-level control over the AR.Drone’s flight. Closes the loop between the motion tracking system and the AR.Drone. Uses reinforcement learning to find an optimal control algorithm

4. Robot Operating System (ROS). Inter-process communication framework developed by Willow Garage. Allows the system to be easily extended for more complex experiments (a minimal control-node sketch follows below)
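As a rough illustration of how the control software could close the loop in ROS, the Python node below subscribes to a pose estimate and publishes velocity commands to the AR.Drone driver. The topic names, the pose message type, the setpoint, and the simple proportional hover law are illustrative assumptions, not the exact interfaces used in this system; in the actual system the learned policy replaces the hand-tuned feedback.

#!/usr/bin/env python
# Minimal closed-loop controller sketch: motion-capture pose in, velocity command out.
# Topic names and the proportional hover law are illustrative assumptions.
import rospy
from geometry_msgs.msg import Twist, PoseStamped

class HoverController(object):
    def __init__(self):
        self.setpoint = (0.0, 0.0, 1.0)  # hover 1 m above the origin (assumed)
        self.cmd_pub = rospy.Publisher('cmd_vel', Twist, queue_size=1)
        rospy.Subscriber('vicon/ardrone/pose', PoseStamped, self.pose_callback)

    def pose_callback(self, msg):
        # Proportional feedback on position error; a learned policy would replace this.
        k = 0.5
        cmd = Twist()
        cmd.linear.x = k * (self.setpoint[0] - msg.pose.position.x)
        cmd.linear.y = k * (self.setpoint[1] - msg.pose.position.y)
        cmd.linear.z = k * (self.setpoint[2] - msg.pose.position.z)
        self.cmd_pub.publish(cmd)

if __name__ == '__main__':
    rospy.init_node('hover_controller')
    HoverController()
    rospy.spin()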

[System diagram connecting the joystick teleoperation interface, the AR.Drone, the Vicon system, the Kalman filter, and the learning algorithm.]

Reinforcement Learning

- Agent learns from rewards earned while interacting with a stochastic environment
- Does not require annotated examples of the agent's correct actions
- Ideal for situations where rewards can be assigned automatically
- Environment is modeled as a Markov decision process where, at time t (see the interaction-loop sketch after this list):
  1. the agent is in state s_t ∈ S,
  2. chooses action a_t ∼ π(s_t, a) using policy π, and
  3. receives reward r_t
- Goal is to find the policy π* that maximizes the agent's expected reward:

  J|π = E{ Σ_{t=0}^∞ γ^t r_t }

- Popular algorithms include Q-learning, TD learning, and actor-critic
- Policy gradient and natural actor-critic are the most popular in robotics
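To make the interaction loop concrete, the sketch below rolls out one episode of a toy MDP and accumulates the discounted return J. The dynamics, policy, horizon, and discount factor here are hypothetical placeholders for illustration, not the quadrotor system.

import random

def step(state, action):
    """Toy stochastic dynamics: drift plus noise, reward penalizes distance from zero."""
    next_state = state + action + random.gauss(0.0, 0.1)
    reward = -next_state ** 2
    return next_state, reward

def policy(state):
    """Toy stochastic policy: Gaussian action centered on -0.5 * state."""
    return random.gauss(-0.5 * state, 0.2)

def discounted_return(horizon=100, gamma=0.95):
    """Roll out one episode and accumulate J = sum_t gamma^t * r_t."""
    state, J = 1.0, 0.0
    for t in range(horizon):
        action = policy(state)               # a_t ~ pi(s_t, a)
        state, reward = step(state, action)  # environment transition yields r_t
        J += (gamma ** t) * reward
    return J

print(discounted_return())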

Simulation: Linear Quadratic Regulator

[Figure: a surface plot of the long-term expected reward J(θ) over the policy parameters θ1 and θ2, and a plot of the estimated gradient of the expected reward over the weights w1 and w2.]

- Simple one-dimensional control theory problem with s ∈ R, a ∈ R, and the environment [2]:

  s_{t+1} = s_t + a_t + noise
  R_t = −s_t^2 − a_t^2

- Stochastic controller draws actions from a_t ∼ N(µ_θ, σ_θ), where µ_θ = θ1·s_t and σ_θ = (1 + e^{−θ2})^{−1} [2]
- Vanilla policy gradient was simulated using a central-difference estimator to compute gradients [3] (a small simulation sketch follows below)
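The simulated problem is small enough to sketch directly. The code below implements the dynamics and the sigmoid-parameterized Gaussian controller described above; the noise scale, horizon, initial state, and roll-out count are arbitrary choices for illustration.

import math
import random

def rollout(theta, horizon=20):
    """Total reward of one episode of the LQR problem under parameters theta = (theta1, theta2)."""
    theta1, theta2 = theta
    s, total = 1.0, 0.0
    for _ in range(horizon):
        mu = theta1 * s                           # mean of the Gaussian policy
        sigma = 1.0 / (1.0 + math.exp(-theta2))   # sigmoid-parameterized standard deviation
        a = random.gauss(mu, sigma)               # a_t ~ N(mu_theta, sigma_theta)
        total += -s ** 2 - a ** 2                 # R_t = -s_t^2 - a_t^2
        s = s + a + random.gauss(0.0, 0.1)        # s_{t+1} = s_t + a_t + noise
    return total

def expected_reward(theta, rollouts=200):
    """Average multiple noisy roll-outs to estimate J(theta)."""
    return sum(rollout(theta) for _ in range(rollouts)) / rollouts

print(expected_reward((-0.5, 0.0)))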

Results: Linear Quadratic Regulator Simulation

[Figure: the policy parameters θ1 and θ2 plotted against iteration t over 400 iterations.]

- Control theory proves that the theoretical optimum is at θ1* = −0.58 and θ2* = −∞ [2]
- Vanilla gradient ascent quickly converges to θ1 = θ1* and continues moving toward θ2 = θ2*

Policy Gradient

- Assumes the policy π(s, a | θ) is uniquely parameterized by θ
- Learning the optimal policy π* is now simplified to learning the optimal θ*
- For any given θ, ∇J|θ points in the direction of maximum increase
- Step the parameters by a small amount in the direction of ∇J|θ_i (a concrete ascent loop is sketched below):

  θ_{i+1} = θ_i + α·∇J|θ_i

- Gradually decrease α until the algorithm converges to a local maximum
- Problem: this requires an empirical estimate of the gradient ∇J|θ
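A minimal sketch of this ascent loop, assuming some gradient estimator such as the finite-difference method described in the next section; the step size, decay schedule, and iteration count are illustrative choices.

def gradient_ascent(theta, estimate_gradient, alpha=0.1, decay=0.99, iters=400):
    """Repeatedly step theta in the direction of an estimated gradient of J."""
    for _ in range(iters):
        grad = estimate_gradient(theta)                       # empirical estimate of grad J at theta
        theta = [t + alpha * g for t, g in zip(theta, grad)]  # theta_{i+1} = theta_i + alpha * grad
        alpha *= decay                                        # gradually decrease the step size
    return theta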

Finite Difference Gradient Estimate

- Simple form of gradient approximation that requires no knowledge of π
- Slightly perturb θ with several small changes Δθ_1, Δθ_2, ...
- Approximate J(θ + Δθ_i) for each Δθ_i by averaging multiple roll-outs [3]
- Construct ΔΘ and ΔJ such that row i of ΔΘ is Δθ_i, ΔJ_i = J(θ + Δθ_i) − J(θ), and solve the least-squares system (a code sketch follows below):

  ∇J|θ = (ΔΘ^T ΔΘ)^{−1} ΔΘ^T ΔJ
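A sketch of this estimator, assuming a function J that averages multiple roll-outs for a given parameter vector (for example, expected_reward from the LQR sketch above); the perturbation scale and the number of perturbations are arbitrary choices.

import numpy as np

def finite_difference_gradient(J, theta, n_perturbations=20, scale=0.05):
    """Estimate grad J at theta by regressing reward changes on random perturbations."""
    theta = np.asarray(theta, dtype=float)
    J_ref = J(theta)
    dTheta = []   # rows are the perturbations delta-theta_i
    dJ = []       # entries are J(theta + delta-theta_i) - J(theta)
    for _ in range(n_perturbations):
        delta = np.random.normal(0.0, scale, size=theta.shape)
        dTheta.append(delta)
        dJ.append(J(theta + delta) - J_ref)
    dTheta, dJ = np.array(dTheta), np.array(dJ)
    # Least-squares solution: grad J = (dTheta^T dTheta)^{-1} dTheta^T dJ
    grad, _, _, _ = np.linalg.lstsq(dTheta, dJ, rcond=None)
    return grad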

Likelihood Ratio Gradient Estimate

- Requires specific knowledge of the policy in order to estimate ∇π(s, a | θ) [2]
- Estimates ∇J|θ without requiring the policy to be perturbed:

  ∇J|θ = E{ Σ_{t=0}^∞ (r_t − b)·∇ log π(s_t, a_t | θ) }

- Faster convergence than finite-difference methods in noisy environments (a sketch for the Gaussian LQR controller follows below)
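For the Gaussian LQR controller above, ∇ log π has a closed form, so the estimator can be sketched directly. The horizon, episode count, noise scale, and constant baseline b are illustrative assumptions.

import math
import random

def likelihood_ratio_gradient(theta, episodes=100, horizon=20, baseline=0.0):
    """Estimate grad J for the LQR policy mu = theta1*s, sigma = sigmoid(theta2)."""
    theta1, theta2 = theta
    g1, g2 = 0.0, 0.0
    for _ in range(episodes):
        s = 1.0
        for _ in range(horizon):
            sigma = 1.0 / (1.0 + math.exp(-theta2))
            mu = theta1 * s
            a = random.gauss(mu, sigma)
            r = -s ** 2 - a ** 2
            # Gradient of log N(a; mu, sigma) with respect to theta1 and theta2.
            dlog_dtheta1 = (a - mu) * s / sigma ** 2
            dlog_dtheta2 = ((a - mu) ** 2 / sigma ** 2 - 1.0) * (1.0 - sigma)
            g1 += (r - baseline) * dlog_dtheta1
            g2 += (r - baseline) * dlog_dtheta2
            s = s + a + random.gauss(0.0, 0.1)
    return g1 / episodes, g2 / episodes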

Natural Policy Gradient

- The gradient is defined in terms of Euclidean distance:

  lim_{Δθ→0} ( ||J(θ + Δθ) − J(θ) − ∇J(θ)·Δθ||_2 / ||Δθ||_2 ) = 0

- Parameters have arbitrary units, making the parameter space highly non-Euclidean
- Vanilla gradient no longer points in the direction of maximum increase [3]
- Natural gradient is rotated to face the direction of maximum increase [3], where G_θ is the Fisher information matrix of the policy:

  ∇̃J|θ = G_θ^{−1}·∇J|θ

- Dramatically better convergence rates than vanilla policy gradient [1] (a short sketch of the correction follows below)
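A sketch of the natural-gradient correction, assuming G_θ is estimated as the empirical Fisher information matrix from sampled ∇ log π vectors (such as those computed inside the likelihood-ratio estimator above); the ridge term added before inversion is a numerical-stability assumption.

import numpy as np

def natural_gradient(grad_J, grad_log_samples, eps=1e-6):
    """Premultiply the vanilla gradient by the inverse empirical Fisher information matrix."""
    grads = np.asarray(grad_log_samples)           # one grad log pi(s, a | theta) per row
    G = grads.T @ grads / len(grads)               # empirical Fisher information matrix G_theta
    G += eps * np.eye(G.shape[0])                  # regularize before inverting
    return np.linalg.solve(G, np.asarray(grad_J))  # G^{-1} grad J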

References

[1] S. I. Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.

[2] H. Kimura and S. Kobayashi. Reinforcement learning for continuous action using stochastic gradient ascent. In Intelligent Autonomous Systems (IAS-5), pages 288–295, 1998.

[3] J. Peters and S. Schaal. Policy gradient methods for robotics. In Intelligent Robots and Systems, 2006 IEEE/RSJ International Conference on, pages 2219–2225. IEEE, 2006.

[4] Stephane Piskorski. AR.Drone Developers Guide SDK 1.5. Parrot, 2010.

mkoval@eden.rutgers.edu, cmansley@cs.rutgers.edu, mlittman@cs.rutgers.edu
