
LQR-RRT∗: Optimal Sampling-Based Motion Planning with Automatically Derived Extension Heuristics

Anonymous

Abstract— Recently, the RRT∗ algorithm has been proposed as an optimal extension to standard RRT [4]. However, it is difficult to use RRT∗ to find optimal plans for problems with complex or underactuated dynamics because it is not clear what distance metric should be used or how to perform node extensions. In this paper, we propose a new algorithm, LQR-RRT∗, that addresses this problem. Our key contribution is an application of linear quadratic regulation (LQR) to the problem of identifying a set of “near” nodes in the graph that are candidates for the RRT∗ rewiring step. We also propose using the LQR control policy to perform the node extensions. The resulting algorithm, LQR-RRT∗, finds optimal plans in domains with complex or underactuated dynamics. We characterize the algorithm in the context of a series of examples that are successively torque-limited, underactuated, and in belief space.

I. INTRODUCTION

The problem of planning dynamically feasible trajectories from an initial state to a goal state has been the subject of extensive study over the last few decades. Because this problem is PSPACE-hard [1], any complete algorithm (i.e., one that finds a solution if one exists) is computationally intractable in the worst case. In an effort to solve problems of practical interest, much recent research has been devoted to developing sampling-based planners such as the Rapidly-exploring Randomized Tree (RRT) [2] and the Probabilistic Road Map (PRM) [3]. In principle, these algorithms only achieve probabilistic completeness: the probability that a solution is found (if one exists) converges to one as the number of samples approaches infinity. However, they have good average-case performance and have been found to work well on many problems of practical interest.

One of the key drawbacks of the sampling-based algorithms is that they often produce suboptimal solutions that “wander” through state space toward a solution. In fact, Karaman and Frazzoli have recently shown that RRT almost surely does not find an optimal solution [4]. They introduced RRT∗, an algorithm which extends RRT to achieve the asymptotic optimality property, i.e., almost-sure convergence to the optimal solution. Compared with the RRT, RRT∗ requires only a constant factor more computation. Like RRT, RRT∗ requires the user to specify a distance metric and a node extension procedure. These two components play a significant role in the algorithm; naive implementations (e.g., using Euclidean distance) may not reflect the dynamics of a particular domain, and therefore result in poor exploration and slow convergence to the optimal solution. Indeed, it has been shown that the performance of RRT-based algorithms is sensitive to the design or selection of such a metric [5].

Prior research has considered the question of how a good distance metric should behave. Cheng and LaValle [6] showed that RRTs are able to efficiently explore the state space only when this metric reflects the true cost-to-go. LaValle and Kuffner have previously noted that cost-to-go functions could be based on, among other options, optimal control for linearized systems [2]. Glassman and Tedrake [7] derived a cost-to-go pseudometric based on minimum-time linear quadratic regulators (LQR), which was used to grow a standard RRT in domains with constrained and underactuated dynamics.

We propose a variation of the RRT∗ algorithm that is well suited for systems with complex or underactuated dynamics. Our key contribution is an approach to incorporating LQR into RRT∗. Following Glassman and Tedrake [7], we use the LQR cost function as a distance pseudo-metric in order to select the “nearest” and “near” nodes considered for extension. In addition, we follow the LQR control policy in order to extend to the sampled node. Finally, we use the LQR cost function in order to determine which nodes in the tree are candidates for the “rewiring” step of RRT∗. We demonstrate the effectiveness of our approach for three problems that would be difficult to solve using an unmodified version of RRT∗: the torque-limited inverted pendulum problem, the acrobot problem, and a belief space planning problem.

II. BACKGROUND

A. Problem Statement

Given a system with known (possibly non-linear) process dynamics, the optimal motion planning problem is to find a minimum-cost feasible trajectory from an initial configuration to a goal configuration. Let the configuration space be a compact set, X ⊆ 𝒳, that describes all possible configurations of the system. Let U ⊆ 𝒰 be a compact control input space. We assume that we are given the non-linear process dynamics,

ẋ = f(x, u).

A dynamically feasible trajectory is a continuous function, σ : [0, 1] → X, for which there exists a control function, u : [0, 1] → U, such that ∀t ∈ [0, 1], σ̇(t) = f(σ(t), u(t)). The set of all dynamically feasible trajectories is denoted Σtraj. Given an initial configuration, xinit, and a goal region, Xgoal, the motion planning problem is to find a dynamically feasible trajectory, σ, and a control, u, that start in the initial configuration and reach the goal region: σ(0) = xinit and σ(1) ∈ Xgoal. Let c : Σtraj → ℝ⁺ be a cost functional that maps each feasible trajectory to a non-negative cost. The optimal motion planning problem is to find a solution to the motion planning problem that minimizes c(σ).
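To make the feasibility condition concrete, here is a minimal Python sketch (our illustration; none of these names come from the paper) that rolls a discrete control signal through the dynamics ẋ = f(x, u) with Euler integration, producing a trajectory that is dynamically feasible by construction:

```python
import numpy as np

def rollout(f, x0, controls, dt):
    """Integrate x_dot = f(x, u) forward with explicit Euler steps.
    The returned state sequence satisfies the dynamics by construction."""
    traj = [np.asarray(x0, dtype=float)]
    for u in controls:
        traj.append(traj[-1] + dt * f(traj[-1], u))
    return np.array(traj)
```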

B. RRT∗

Rapidly exploring random trees (RRTs) are a standard randomized approach to motion planning [2]. RRT∗ [4] is a variant of this algorithm that has the asymptotic optimality property, i.e., almost-sure convergence to an optimal solution. The algorithm is shown in Algorithm 1.

Algorithm 1: RRT∗((V, E), N)
1:  for i = 1, . . . , N do
2:      xrand ← Sample();
3:      Xnear ← Near(V, xrand);
4:      (xmin, σmin) ← ChooseParent(Xnear, xrand);
5:      if CollisionFree(σmin) then
6:          V ← V ∪ {xrand};
7:          E ← E ∪ {(xmin, xrand)};
8:          (V, E) ← Rewire((V, E), Xnear, xrand);
9:  return G = (V, E);

Algorithm 2: ChooseParent(Xnear, xrand)
1:  minCost ← ∞; xmin ← NULL; σmin ← NULL;
2:  for xnear ∈ Xnear do
3:      σ ← Steer(xnear, xrand);
4:      if Cost(xnear) + Cost(σ) < minCost then
5:          minCost ← Cost(xnear) + Cost(σ);
6:          xmin ← xnear; σmin ← σ;
7:  return (xmin, σmin);

Algorithm 3: Rewire((V, E), Xnear, xrand)
1:  for xnear ∈ Xnear do
2:      σ ← Steer(xrand, xnear);
3:      if Cost(xrand) + Cost(σ) < Cost(xnear) then
4:          if CollisionFree(σ) then
5:              xparent ← Parent(xnear);
6:              E ← E \ {(xparent, xnear)};
7:              E ← E ∪ {(xrand, xnear)};
8:  return (V, E);

The major components of the algorithm are outlined below:

• Random sampling: The Sample procedure provides independent, uniformly distributed random samples of states from the configuration space.

• Near nodes: Given a set V of vertices in the tree and a state x, the Near(V, x) procedure provides a set of states in V that are close to x. Here, closeness is measured with respect to some metric as follows:

Near(V, x) := {x′ ∈ V : ‖x − x′‖ ≤ γ (log n / n)^{1/d}},

where ‖·‖ is the distance under the metric, n is the number of vertices in the tree, d is the dimension of the space, and γ is a constant (see the sketch after this list).

• Steering: Given two states x, x′, the Steer(x, x′) procedure returns a path σ that connects x and x′. Most implementations connect two given states with a straight path.

• Collision checking: Given a path σ, the CollisionFree(σ) procedure returns true if the path lies in the obstacle-free portion of the configuration space.
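As a concrete reference for the shrinking-ball computation above, here is a minimal Python sketch (our illustration; dist stands for whatever metric the planner uses, and the names are ours):

```python
import math

def near(V, x, dist, gamma, d):
    """Vertices within radius gamma * (log n / n)^(1/d) of x under dist."""
    n = max(len(V), 2)                 # guard against log(1) = 0 for tiny trees
    radius = gamma * (math.log(n) / n) ** (1.0 / d)
    return [v for v in V if dist(v, x) <= radius]
```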

The algorithm proceeds as follows. First, a state, denoted xrand, is sampled from the configuration space (Line 2), and its near neighbors are found using the Near function (Line 3). (If Near returns an empty set, the algorithm simply proceeds to the next iteration; we omit this check from the listing to keep the discussion uncluttered.) Next, RRT∗ calls ChooseParent (Algorithm 2) to find a candidate parent node for xrand (Line 4). ChooseParent searches through the nodes in the set Xnear of near nodes and returns the node, denoted xmin, that reaches xrand with minimum cost, along with the path σmin that connects xmin to xrand. After this node is found (possibly via a path that is not collision-free), the algorithm checks the path for collisions (Line 5). If the path σmin collides with an obstacle, the algorithm proceeds to the next iteration. Otherwise, the algorithm adds xrand to the tree, connects xmin to xrand, and attempts to “rewire” the nodes in Xnear using the Rewire procedure (Algorithm 3). The Rewire procedure attempts to connect xrand with each node in the set Xnear of near nodes. If the path that connects xrand with a near node xnear reaches xnear with cost less than that of its current parent, then xnear is “rewired” to xrand by connecting xrand and xnear.

C. LQR

Linear quadratic regulation (LQR) is an approach to computing optimal control policies for linear systems with Gaussian process noise. In general, the problem of solving for optimal control policies for arbitrary stochastic systems is P-complete and typically requires numerically performing “backups” of the Bellman equation. However, the Bellman equation has a closed-form solution for linear-Gaussian systems, known as LQR. LQR is a standard method in control theory that has been studied for nearly fifty years. Suppose that we have a continuous-time linear system,

ẋ = Ax + Bu,

and a cost function,

J = ∫₀^∞ (xᵀQx + uᵀRu) dt.  (1)

Suppose that we are interested in solving for a control policy that minimizes the expected value of J. One way to accomplish this is to solve the following equation (the algebraic Riccati equation) for the matrix, P:

AᵀP + PA − PBR⁻¹BᵀP + Q = 0.



It is possible to solve this equation for P efficiently using one of several standard functions, including the lqr function included in MATLAB. Given a solution, P, the optimal value function is:

V(x) = xᵀPx.

In particular, the value function can be used to compute a cost-to-go from any point to the goal. We can use this cost function as a distance metric: states with a high cost-to-go are farther from the goal than those with a lower cost-to-go. Using cost-to-go as a distance metric results in efficient exploration of the state space [6].

The optimal policy is:

π∗(x) = −R⁻¹BᵀPx.  (2)

The above procedure is optimal for linear-Gaussian systems. However, LQR can also be used to control non-linear systems by “linearizing” the system about an operating point. The locally linear system is:

ẋ = A(x)x + B(x)u,

where

A(x) = ∂f(x, u)/∂x  and  B(x) = ∂f(x, u)/∂u.

It is these locally linearized dynamics that we use in LQR-RRT∗ to estimate distances and perform extensions. Initially in our algorithm, when the distances between the new node and the “near” nodes in the tree are large, this distance measure will be inaccurate. However, as the tree covers the state space more densely, the distances shrink and the local LQR policy becomes more nearly optimal. In the following, we will use the notation

[K,S] = LQR(A,B,Q,R),

to denote the function that calculates the LQR gain matrix, K, and the cost matrix, S.
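For concreteness, here is a minimal Python sketch of this primitive (our illustration; the paper's own implementation was in MATLAB). It uses SciPy's continuous-time algebraic Riccati solver, which solves the same Riccati equation given above; the finite-difference linearization is our assumption about one simple way to obtain A(x) and B(x):

```python
import numpy as np
from scipy.linalg import solve_continuous_are

def linearize(f, x0, u0, eps=1e-6):
    """Finite-difference Jacobians A = df/dx, B = df/du at (x0, u0)."""
    n, m = len(x0), len(u0)
    A = np.zeros((n, n))
    B = np.zeros((n, m))
    for i in range(n):
        dx = np.zeros(n); dx[i] = eps
        A[:, i] = (f(x0 + dx, u0) - f(x0 - dx, u0)) / (2 * eps)
    for j in range(m):
        du = np.zeros(m); du[j] = eps
        B[:, j] = (f(x0, u0 + du) - f(x0, u0 - du)) / (2 * eps)
    return A, B

def lqr(A, B, Q, R):
    """[K, S] = LQR(A, B, Q, R): gain K and cost matrix S from the CARE."""
    S = solve_continuous_are(A, B, Q, R)
    K = np.linalg.solve(R, B.T @ S)    # K = R^{-1} B^T S
    return K, S
```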

III. LQR-RRT∗

LQR-RRT∗ is a variation of the RRT∗ algorithm that incorporates LQR for the “near” and “nearest” node calculations as well as for extensions between nodes. The algorithm is given in Algorithm 4. In each iteration, our algorithm first samples a state, denoted xrand, from the configuration space (Line 2) and finds the nearest node in the tree using the LQRNearest function (Line 3). The algorithm then calls the LQRSteer function to create a new state, xnew (Line 4). This state is the endpoint of a trajectory that follows the LQR policy computed about xnearest for a pre-specified step size based on the cost. The LQRNear procedure is then called on xnew to find a neighborhood of nodes around this state (Line 5). Then, ChooseParent finds a candidate parent node for xnew (Line 6). The ChooseParent procedure, presented in detail in Algorithm 5, searches through all the nodes in the set Xnear of near nodes and returns the node, denoted xmin, that reaches xnew with minimum LQR S cost, along with the trajectory σmin that connects xmin to xnew. The algorithm then adds xnew to the tree, connects xmin to xnew, and attempts to “rewire” the nodes in Xnear using the Rewire procedure, provided in Algorithm 6. The Rewire procedure attempts to connect xnew with each node in the set Xnear of near nodes. If the trajectory that connects xnew with a near node xnear reaches xnear with cost less than that of its current parent, then xnear is “rewired” to xnew by connecting xnew and xnear.

Algorithm 4: LQR-RRT∗((V, E), N)
1:  for i = 1, . . . , N do
2:      xrand ← Sample();
3:      xnearest ← LQRNearest(V, xrand);
4:      xnew ← LQRSteer(xnearest, xrand);
5:      Xnear ← LQRNear(V, xnew);
6:      (xmin, σmin) ← ChooseParent(Xnear, xnew);
7:      V ← V ∪ {xnew}; E ← E ∪ {(xmin, xnew)};
8:      (V, E) ← Rewire((V, E), Xnear, xnew);
9:  return G = (V, E);

Algorithm 5: ChooseParent(Xnear, xnew)
1:  minCost ← ∞; xmin ← NULL; σmin ← NULL;
2:  for xnear ∈ Xnear do
3:      σ ← LQRSteer(xnear, xnew);
4:      if Cost(xnear) + Cost(σ) < minCost then
5:          minCost ← Cost(xnear) + Cost(σ);
6:          xmin ← xnear; σmin ← σ;
7:  return (xmin, σmin);

Algorithm 6: Rewire((V, E), Xnear, xnew)
1:  for xnear ∈ Xnear do
2:      σ ← LQRSteer(xnew, xnear);
3:      if Cost(xnew) + Cost(σ) < Cost(xnear) then
4:          xparent ← Parent(xnear);
5:          E ← E \ {(xparent, xnear)};
6:          E ← E ∪ {(xnew, xnear)};
7:  return (V, E);

The algorithm primitives are outlined in detail below:

• Nearest vertex: Given a set of vertices in the tree, V, and a state x, the LQRNearest(V, x) function returns the nearest node in the tree in terms of the LQR cost-to-go function. We locally linearize the system about x and evaluate the LQR gain matrix, K(x), and cost matrix, S(x):

[K(x), S(x)] = LQR(A(x), B(x), Q, R).

We use the cost matrix to evaluate the distance between vertices in the tree and the sample point:

xnearest = arg min_{v ∈ V} (v − x)ᵀ S (v − x).

(A Python sketch of these primitives follows the list.)

• Near vertices: The LQRNear(V, x) function returns the set of vertices that are within a certain distance of a state, x, calculated by evaluating the LQR cost function about that state. If

[K(x), S(x)] = LQR(A(x), B(x), Q, R),

then

LQRNear(V, x) := {v ∈ V : (v − x)ᵀ S (v − x) ≤ γ (log n / n)^{1/d}},

where n is the number of vertices in the tree, d is the dimension of the space, and γ is a constant.

• Steering: Given two states, x and x′, the LQRSteer(x, x′) function “connects” state x with x′ using the local LQR policy calculated by linearizing about x.
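As an illustration, here is a minimal sketch of the LQRNearest and LQRNear primitives, reusing the linearize and lqr helpers from the sketch in Section II-C; the nominal input u0, the vertex container, and all names are our assumptions, not the paper's API:

```python
import math

def lqr_dist(S, v, x):
    """LQR cost-to-go pseudo-metric (v - x)^T S (v - x)."""
    d = v - x
    return float(d @ S @ d)

def lqr_nearest(V, x, f, u0, Q, R):
    """Nearest vertex to x under the cost matrix S(x) from linearizing about x."""
    A, B = linearize(f, x, u0)         # helpers from the Section II-C sketch
    _, S = lqr(A, B, Q, R)
    return min(V, key=lambda v: lqr_dist(S, v, x))

def lqr_near(V, x, f, u0, Q, R, gamma, d):
    """Vertices inside the shrinking LQR-cost ball around x."""
    A, B = linearize(f, x, u0)
    _, S = lqr(A, B, Q, R)
    n = max(len(V), 2)                 # guard against log(1) = 0 for tiny trees
    radius = gamma * (math.log(n) / n) ** (1.0 / d)
    return [v for v in V if lqr_dist(S, v, x) <= radius]
```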

Although it is not explicitly noted in the algorithm descriptions, we also employ a conservative version of the branch-and-bound technique presented in [8]. The branch-and-bound technique prunes the parts of the tree that cannot possibly reach the goal with a cost better than what is currently available in the tree. For this purpose, the cost of the current best trajectory in the tree is used as an upper bound, and all nodes whose total cost exceeds this value are removed. The branch-and-bound method concentrates the search on finding a trajectory that reaches the goal optimally.
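A minimal sketch of this pruning step under our reading of [8] (cost and cost_to_go_lb are hypothetical bookkeeping: the cost-to-come of each vertex and an optimistic cost-to-go estimate, respectively):

```python
def branch_and_bound(V, E, cost, cost_to_go_lb, best_cost):
    """Drop vertices whose optimistic total cost already exceeds the incumbent."""
    keep = {v for v in V if cost[v] + cost_to_go_lb(v) <= best_cost}
    pruned_edges = {(a, b) for (a, b) in E if a in keep and b in keep}
    return keep, pruned_edges
```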

IV. RESULTS

In this section we evaluate LQR-RRT∗ in three domains with complex dynamics: one which is torque-limited, one which is underactuated, and one which requires planning in belief space.

A. Torque-limited Pendulum

Our first domain is a torque-limited pendulum. The objective is to swing the pendulum up from rest to a vertically balanced position. This domain has a two-dimensional state space, (θ, θ̇), where θ denotes the orientation of the pendulum. The equation of motion for this system is

θ̈ = u − bθ̇ − g cos(θ),

where gravity is g = 9.81, the coefficient of viscous friction is b = 0.1, torque is constrained to u ∈ [−3, 3], and we assume unit mass and unit length. Since this problem is low-dimensional, we are able to compare the solution found using LQR-RRT∗ with the solution found using dynamic programming (which is optimal for a given discretization) on a 100 × 100 grid with actions discretized into 10 bins.
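For illustration, here are the pendulum dynamics above written in the f(x, u) form consumed by our earlier linearize/lqr sketch; the constants follow the text, the clipping encodes the torque limit, and the function name is ours:

```python
import numpy as np

def pendulum_f(x, u):
    """State x = (theta, theta_dot); returns x_dot for the torque-limited pendulum."""
    g, b = 9.81, 0.1                   # gravity, viscous friction (unit mass, unit length)
    theta, theta_dot = x
    tau = np.clip(u[0], -3.0, 3.0)     # torque constrained to [-3, 3]
    return np.array([theta_dot, tau - b * theta_dot - g * np.cos(theta)])
```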

We tested our algorithm in two scenarios with different action costs. In the first (low action cost), we set Q equal to the identity and R = 1 (in Equation 1). In the second (high action cost), we set Q equal to the identity and R = 50. Figure 1 illustrates the results.

Fig. 1. The search tree and solution trajectories (thick red lines) found by LQR-RRT∗ after 5000 iterations, plotted in the (θ, θ̇) plane. (a) shows the solution found by the algorithm when torque is cheap (R = 1); (b) shows the solution found when the action costs are high (R = 50).

In Figure 1(a), we see that the optimal trajectory when action cost is small involves only two or three swings to reach the upright position. Figure 1(b) shows that the optimal trajectory for a higher action cost involves many more swings. Essentially, the algorithm finds a trajectory that takes longer to reach the upright position but expends less action cost in the process. All solutions were simulated, and we observed that the goal was reached without violating the torque constraint during all runs. Figure 2 shows the solution cost over time for both problem instances. We also show the average solution cost found by RRT (which is cost-insensitive) using the same extension heuristics, and the solution cost found by dynamic programming. These graphs show that the solution cost found by LQR-RRT∗ decreases rapidly over time, improving significantly on the solution cost found by the RRT and converging to the approximately optimal solution cost obtained by dynamic programming.

Although the pendulum is a low-dimensional domain, solving it using RRT∗ requires good extension heuristics: our attempt to do so using extensions based on Euclidean distance failed, running out of memory before a solution was found. By contrast, LQR-RRT∗ was able to effectively solve the problem without requiring problem-specific extension heuristics.

B. Acrobot

In the second domain, the planner must bring an acrobot (a double inverted pendulum actuated only at the middle joint) from its initial rest configuration to a balanced vertical pose. The two joints are denoted q1 and q2. The state space is four-dimensional: (q1, q2, q̇1, q̇2). Since the first joint is completely unactuated, the acrobot is a mechanically underactuated system.


Fig. 2. Solution cost over time for LQR-RRT (blue) and LQR-RRT∗ (green) for the torque-limited simple pendulum with R = 1 (a) and R = 50 (b). Averages are over 100 runs, and error bars show standard deviation. An approximately optimal solution cost, obtained by dynamic programming, is shown in red. Average times to initial solutions for LQR-RRT and LQR-RRT∗ were 4.8 and 22.2 seconds, respectively. All algorithms were implemented in MATLAB and tested on an Intel Core 2 Duo (2.26 GHz) computer.

The equations of motion for the acrobot can be expressed in the standard manipulator form:

H(q)q̈ + C(q, q̇)q̇ + G(q) = B(q)u,

where the inertia matrix is

H(q) = ( 2 + 2cos(q2)   1 + cos(q2)
         1 + cos(q2)    1 ),

the matrix describing centrifugal and Coriolis forces is

C(q, q̇) = ( 0             (2q̇1 + q̇2) sin(q2)
            q̇1 sin(q2)    0 ),

gravity is G(q) = (sin(q1) + sin(q1 + q2), sin(q1 + q2))ᵀ, and

B = ( 0  0
      0  1 ).
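For illustration, here are these equations of motion in the f(x, u) form used by our earlier sketches, solving H q̈ = Bu − C q̇ − G for the accelerations; this is a sketch under the unit-parameter model above, and the function name is ours:

```python
import numpy as np

def acrobot_f(x, u):
    """State x = (q1, q2, q1_dot, q2_dot); u is the torque at the second joint."""
    q1, q2, qd1, qd2 = x
    qd = np.array([qd1, qd2])
    H = np.array([[2.0 + 2.0 * np.cos(q2), 1.0 + np.cos(q2)],
                  [1.0 + np.cos(q2),       1.0]])
    C = np.array([[0.0,               (2.0 * qd1 + qd2) * np.sin(q2)],
                  [qd1 * np.sin(q2),  0.0]])
    G = np.array([np.sin(q1) + np.sin(q1 + q2), np.sin(q1 + q2)])
    B = np.array([[0.0], [1.0]])       # actuation only at the middle joint
    qdd = np.linalg.solve(H, B @ np.atleast_1d(u) - C @ qd - G)
    return np.concatenate([qd, qdd])
```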

Figure 3 shows the optimal solution found using LQR-RRT∗, and Figure 4 shows the solution cost over time. Again, LQR-RRT∗ rapidly improves, starting from a cost similar to that of RRT (with the same extension heuristic) and converging to the optimal cost.

The acrobot is a difficult domain for which to design extension heuristics, both because it is underactuated and because it has interesting dynamics. Our attempts to use standard Euclidean distance-based heuristics again failed, almost always running out of memory before a solution could be found. By contrast, LQR-RRT∗ was able to effectively find the optimal solution without requiring problem-specific extension heuristics.

Fig. 3. Visualization of an LQR-RRT∗ solution after 5000 iterations for the acrobot. Earlier state configurations are shown in lighter gray. Torque is presented as a false-colored arc where red represents higher values.

C. Light-Dark Domain

The Light-Dark domain (adapted from Platt et al. [9]) is a partially observable problem where the agent must move into a goal region with high confidence. Initially, the agent is uncertain of its true position. On each time step, the agent makes noisy state measurements. The covariance of these measurements is a quadratic function of the minimum distance to each of two light sources located at (5, 2) and (5, −2) (see Figure 5).

A key concept in solving partially observable problems is that of the belief state, a probability distribution over the underlying state of the system. Since the agent is unable to sense state directly, it is necessary to plan in the space of beliefs regarding the underlying state of the system rather than in the underlying state space itself. In this example, it is assumed that the belief state is always an isotropic Gaussian. In this case, the belief state is three-dimensional: two dimensions describing the mean, x, and one dimension describing the variance, σ². The dynamics of the belief system are derived from the Kalman filter equations and describe how the belief state is expected to change as the system gains more information. The belief space dynamics are

ṡ = (u1, u2, σ̇)ᵀ, where σ̇ = −σ² / (σ + w(x)),

and w(x) = min((x − b1)², (x − b2)²), with b1 and b2 being the locations of the first and second light sources, respectively. The goal of planning is to move from an initial high-variance belief state to a low-variance belief state where the mean of the distribution is at the goal.
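A minimal sketch of these belief dynamics (our illustration; the light-source locations follow the text, and the function name is ours):

```python
import numpy as np

def belief_f(s, u, b1=(5.0, 2.0), b2=(5.0, -2.0)):
    """Belief state s = (x1, x2, sigma); returns s_dot = (u1, u2, sigma_dot)."""
    x, sigma = np.asarray(s[:2]), s[2]
    w = min(np.sum((x - np.asarray(b1)) ** 2),   # light-dependent measurement noise
            np.sum((x - np.asarray(b2)) ** 2))
    sigma_dot = -sigma ** 2 / (sigma + w)
    return np.array([u[0], u[1], sigma_dot])
```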

Fig. 4. Acrobot solution cost over time for LQR-RRT (blue) and LQR-RRT∗ (green). Error bars indicate standard deviation over 50 runs. Average times to initial solutions for LQR-RRT and LQR-RRT∗ were 109 and 136 seconds, respectively.


This objective corresponds to a situation where the agent is highly confident that the true state is in the goal region. In order to achieve this, the agent must move toward one of the lights to obtain a good estimate of its position before proceeding to the goal. The domain, along with a sample solution trajectory, is depicted in Figure 5.

Fig. 5. The Light-Dark domain, where the noise in the agent's location sensing depends upon the amount of light present at its location. Here the agent moves from its start location (marked by a black circle) to its goal (a black X), first passing through a well-lit area to reduce its localization uncertainty (variance shown using gray circles).

Figure 6 shows the solution cost for the Light-Dark domain over time. As in the other examples, LQR-RRT∗ finds a solution quickly and rapidly converges to the optimal solution. Figure 7 shows an example tree. Notice that it would be difficult to create heuristics by hand that would be effective in this domain: it is unclear how much weight should be attached to differences in variance against differences in real-world coordinates. LQR-RRT∗ avoids this difficult question because it obtains its extension heuristics directly from the dynamics of the problem, and is therefore able to effectively find an optimal plan.

Fig. 6. Light-Dark solution cost over time for LQR-RRT (blue) and LQR-RRT∗ (green). Error bars indicate standard deviation over 100 runs. Average times to initial solutions for LQR-RRT and LQR-RRT∗ were 0.2 and 2.1 seconds, respectively.

V. SUMMARY AND CONCLUSION

The previous section has shown that LQR-RRT∗ is able to successfully find plans in domains with interesting dynamics and to improve the cost of those plans, eventually converging to an optimal solution.

Fig. 7. The search tree for Light-Dark after 5000 iterations of belief space planning with LQR-RRT∗. The x and y axes represent the respective means in the plane, and height represents the variance of the belief state. The solution is shown with a thick yellow line.

While RRT∗ is in general also guaranteed to find the optimal solution asymptotically, in practice its performance depends crucially on the careful design of appropriate extension heuristics. By contrast, LQR-RRT∗ solved all three problems using the same automatically derived extension heuristics, without the need for significant design effort. The use of such automatically derived heuristics eliminates one of the major obstacles to the application of sampling-based planners to problems with complex or underactuated dynamics.

REFERENCES

[1] J. T. Schwartz and M. Sharir, “On the ‘piano movers’ problem: II. General techniques for computing topological properties of real algebraic manifolds,” Advances in Applied Mathematics, vol. 4, pp. 298–351, 1983.
[2] S. M. LaValle and J. J. Kuffner, “Randomized kinodynamic planning,” International Journal of Robotics Research, vol. 20, no. 5, pp. 378–400, May 2001.
[3] L. Kavraki, P. Svestka, J. Latombe, and M. Overmars, “Probabilistic roadmaps for path planning in high-dimensional configuration spaces,” IEEE Transactions on Robotics and Automation, vol. 12, no. 4, pp. 566–580, 1996.
[4] S. Karaman and E. Frazzoli, “Sampling-based algorithms for optimal motion planning,” International Journal of Robotics Research (to appear), June 2011.
[5] S. M. LaValle, “From dynamic programming to RRTs: Algorithmic design of feasible trajectories,” in Control Problems in Robotics. Springer-Verlag, 2002.
[6] P. Cheng and S. M. LaValle, “Reducing metric sensitivity in randomized trajectory design,” in IEEE International Conference on Intelligent Robots and Systems, 2001, pp. 43–48.
[7] E. Glassman and R. Tedrake, “A quadratic regulator-based heuristic for rapidly exploring state space,” in Proceedings of the IEEE International Conference on Robotics and Automation, May 2010.
[8] S. Karaman, M. Walter, A. Perez, E. Frazzoli, and S. Teller, “Anytime motion planning using the RRT*,” in IEEE International Conference on Robotics and Automation, 2011.
[9] R. Platt, R. Tedrake, L. Kaelbling, and T. Lozano-Perez, “Belief space planning assuming maximum likelihood observations,” in Proceedings of Robotics: Science and Systems, June 2010.
