Simultaneous Localization and Planning for Physical Mobile Robots via Enabling Dynamic Replanning in Belief Space



    Ali-akbar Agha-mohammadi, Saurav Agarwal, Suman Chakravorty, and Nancy M. Amato

Abstract—Simultaneous planning while localizing is a crucial ability for an autonomous robot operating under uncertainty. This paper addresses this problem by designing methods to dynamically replan while the localization uncertainty or environment map is updated. In particular, relying on sampling-based methods, the proposed online planning scheme can cope with challenging situations, including when the localization update or changes in the environment alter the homotopy class of trajectories in which the optimal plan resides. The proposed algorithm eliminates the need for stabilization in the state-of-the-art FIRM (Feedback-based Information RoadMap) method, and outperforms it in performance and success probability. Applying belief space planning to physical systems brings with it a plethora of challenges. Thus, in addition to designing computationally tractable algorithms, a key focus and contribution of this paper is to implement the proposed planner on a physical robot and show the SLAP (simultaneous localization and planning) performance under uncertainty, in changing environments, and in the presence of large disturbances such as a kidnapped robot situation.

Index Terms—Motion planning, belief space, robust, POMDP, uncertainty, mobile robots, rollout

I. INTRODUCTION

Simultaneous Localization and Planning (SLAP) refers to the ability of autonomous robots to (re)plan dynamically every time the localization module updates the probability distribution on the robot's state. For autonomous robots to reliably operate in real-world environments, online (re)planning under uncertainty is a key requisite to enable SLAP and cope with uncertainties and changes in the environment. For example, consider a low-cost mobile robot working in an office environment, where its goals (tasks) are assigned online based on user requests, while coping with changes in the obstacle map (e.g., office doors switch state between open and closed). Such changes in the obstacle map or in the goal location often call for global replanning, as they switch the optimal homotopy class of solutions from one to another. Therefore, the robot needs to change its motion plan in real-time as it encounters changes in its goal or in the environment. What makes such online replanning challenging is that low-cost robots are not able to follow the control commands exactly due to motion noise, and they do not have perfect measurements due to sensor noise. Therefore, this problem calls for online planning algorithms in uncertain, partially observable environments. In a broader sense, this problem is an instance of the problem of decision making and control under uncertainty.

In general, decision making and control under uncertainty is a ubiquitous challenge in many robotic applications. For an autonomous robot to operate reliably, it is crucial to be able to perceive sensory measurements, infer its situation (state) in the environment, and plan and take actions accordingly. However, in partially-observable environments the state of the system cannot be determined exactly due to imperfect and noisy measurements. Based on the system's model, a filtering module (e.g., a Kalman filter) can provide an estimate of the state, i.e., a probability distribution function (pdf) over all possible system states. This pdf describing the localization uncertainty is also referred to as the belief or information-state. Based on the belief of the system, actions are chosen at every time step. To formalize this problem of finding the optimal mapping between perceived observations and the taken action, we rely on the most general formulation, i.e., Partially-Observable Markov Decision Processes (POMDPs) [1], [2].

A. Agha is with Qualcomm Research, San Diego, CA, 92121. S. Agarwal and S. Chakravorty are with the Dept. of Aerospace Eng. and N. Amato is with the Dept. of Computer Science & Eng. at Texas A&M University, College Station, TX 77843, USA. Emails: [email protected], {sauravag,schakrav,amato}@tamu.edu

There are a number of challenges in dealing with POMDPs, including the curse of dimensionality and the curse of history. The curse of dimensionality refers to the high dimension of the belief space. If the underlying robotic system evolves in a discrete grid world with n cells, the corresponding belief space is an n-dimensional continuous space. Moreover, if the underlying state space is continuous (which is the case for most real robotic applications), then the belief space is an infinite-dimensional space. Methods such as [3]-[11] alleviate these issues and aim at taking POMDPs to more challenging and realistic problems. In this paper, we consider a class of POMDPs that commonly arise in modeling the SLAP problem. The settings are similar to the ones used in the KF-based localization literature: (i) the system model is given as closed-form nonlinear equations, (ii) the state/action/observation spaces are continuous, and (iii) the belief is unimodal, and thus well-approximated by a Gaussian.

In addition to the above-mentioned challenges associated with POMDPs, when dealing with real-world physical systems there is another important challenge: the discrepancy between the real models and the models used for computation. These include discrepancies in the environment map, the system and sensor models, and the noise models. Such discrepancies can lead to deviations of the system from the desired plan. A plausible solution for this problem is the ability to carry out planning simultaneously with localization, i.e., the ability to replan dynamically to cope with changes in the environment and deviations resulting from model discrepancies and intermittent sensing failures.

To enable an online replanning scheme for SLAP, we rely on multi-query methods in belief space, and specifically on the Feedback-based Information Roadmap (FIRM) method, as discussed below. The main body of POMDP literature, in particular sampling-based methods, proposes single-query solvers, i.e., the computed solution depends on the initial belief [7], [12], [13]. Therefore, in replanning (planning from a new initial belief) almost all the computations need to be reproduced, which limits their usage in solving SLAP, where dynamic replanning is required. However, multi-query methods such as FIRM provide a construction mechanism that is independent of the initial belief of the system, which makes them suitable methods for SLAP.

The original FIRM framework provided a reliable methodology for solving the problem of motion planning under uncertainty by reducing the intractable dynamic programming (DP) to a tractable DP over the nodes of the FIRM graph. In this paper, we extend our previous work on FIRM by proposing a dynamic replanning scheme in belief space that enables online replanning for real-world applications in mobile robotics. This extension leads to intelligent behaviors that provably take the solution closer to the optimal and guarantee that the success probability only increases. In addition to theoretical results on the online generalization of FIRM, the main emphasis of this paper is on the implementation of the proposed SLAP solution on a physical robot. We investigate how dynamic online replanning can generate a feedback plan that leads to higher performance and success probability. Also, we demonstrate its unique ability to cope with discrepancies between real models and computational models, such as changes in the environment and large deviations which can globally change the plan by changing the optimal homotopy class of solutions. We believe these results lay the groundwork for advancing the theoretical POMDP framework towards practical SLAP applications and achieving long-term autonomy in robotic systems.

    A. Related Work

Online replanning in the belief space is a vital capability for solving the SLAP problem for two main reasons. First, the belief dynamics are usually more random than the state dynamics because the belief is directly affected by the measurement noise. Therefore, a noisy measurement or spurious data association can cause large changes in the belief. Second, in practical applications, discrepancies between real and computational models are a significant source of error and cause the belief to occasionally behave differently than the expected nominal behavior. Thus, simultaneously replanning while localizing gives a system the ability to recover from such deviations. In addition, enabling SLAP can help the robot cope with changes in the environment as well as recover from large unexpected deviations in the robot's pose.

Solving the planning problem alongside localization and mapping has been the topic of several recent works (e.g., [14], [15], [16], [17]). The method in [18] presents an approach to uncertainty-constrained simultaneous planning, localization and mapping for unknown environments, and in [19] the authors propose an approach to actively explore unknown maps while imposing a hard constraint on the localization uncertainty at the end of the planning horizon. Our work assumes the environment map is known, formulates the problem in its most general form (POMDP), and focuses on online replanning in the presence of (possibly changing) obstacles.

The majority of planners designed to solve POMDPs are not equipped with online replanning capabilities. Sampling-based methods (e.g., [12], [13], [20], [21]) and point-based methods (e.g., [3], [6], [22], [23], [24], [25], [26]) are single-query methods; thus, in case of replanning from a new belief, most of the computation needs to be redone, in particular when the planner has to switch the plan from one homotopy class to another. The method proposed in [27] reuses offline computations and demonstrates online replanning in simulation, where it chooses between two heuristics (with adaptive weights) to ascertain whether the planner needs to explore new actions or exploit its knowledge. In our work, the rollout value/cost-to-go beyond the horizon is computed using the cost-to-go of a base policy (e.g., the FIRM policy). Thus, the choice of exploration versus exploitation is a direct consequence of the underlying policy and the cost-to-connect values that are computed online.

Another class of planners relies on trajectory optimization-based methods for replanning. These mainly follow the Receding Horizon Control (RHC) scheme, where at every step the locally optimal trajectory is computed within a limited horizon. Then, only the first step of the trajectory is executed and the rest of it is discarded. From the new point, the optimal trajectory is recomputed, and this process is repeated until the system reaches the goal region. The RHC framework was originally designed for deterministic systems, and its extension to stochastic systems and belief space planning is still an open problem. A direct approach is to replace the uncertain quantities (such as future observations) with their nominal values (e.g., most likely observations), and then treat the stochastic system as a deterministic one and use it in an RHC framework (e.g., [28], [29], [30], [31], [32]). However, in such an approach the optimization is carried out only within a limited horizon, and therefore the system may locally choose good actions but after a while may find itself in a high-cost region.

From an experimental point of view, a few recent works have focused on applying belief space planning to real-world robots. [33] implements a belief planner on a mobile manipulator with time of traversal as a cost metric. [34] is an integrated task and motion planner in belief space, utilizing symbolic abstraction, whose performance is demonstrated on a PR2 robot tasked with picking and placing household objects. In [35], the authors develop a stochastic motion planner and show its performance on a physical manipulation task where unknown obstacles are placed in the robot's operating space and the task-relevant objects are disturbed by external agents.

    B. Highlights and Contributions

This paper extends our previous work in [36]. Compared to [36], we discuss in detail the concept of rollout-based belief space planning and policy execution, and present extensive simulation and experimental results to demonstrate the performance improvements made by using the proposed method. We also present analyses that guarantee a lower execution cost and failure probability as compared to the nominal FIRM policy. The main contributions, highlights, and impact of this work can be summarized as follows.

Online belief planner to enable SLAP: We propose an online planning method in belief space. The method lends itself to the class of rollout-based methods [37] and extends them to the belief space. Compared to belief space RHC methods, this method is not limited to a horizon, does not get stuck in local minima, and does not assume deterministic future observations.

Online switching between homotopy classes: In motion planning, homotopy classes of paths refer to sets of paths that can be deformed into each other by continuous transformation (bending and stretching) without passing through obstacles [38]. A unique feature of the presented method is that it is capable of updating the plan globally online, even when the homotopy class of the optimal solution has changed due to deviations or environment changes. This feature allows the proposed SLAP


method to work in changing environments and cope with large deviations.

Stabilization-free policy: The proposed method supersedes a state-of-the-art method, FIRM [39], in performance, success probability, and ability to cope with changing environments. It builds upon FIRM and inherits the desired features of the FIRM framework such as robustness, scalability, and the feedback nature of the solution. However, it also eliminates the need for belief node stabilization in the original FIRM method, and thus can be viewed as a stabilization-free FIRM. In the original FIRM framework, at every node the system needs to steer its sensory information to reach the belief node (each graph node is a belief, i.e., a particular localization uncertainty). In this paper, by embedding an online local planning module in the FIRM framework, we achieve a locally optimal tradeoff between stabilization to a node (i.e., exploring the information space to reach the exact belief node) and moving forward towards the goal (exploiting the gradient of the local cost function), while global optimality on the graph is still guaranteed by solving the dynamic programming problem. As a result of this optimal tradeoff, interesting behaviors emerge out of the algorithm without encoding any heuristic. These behaviors trade off information and energy. For example, consider a case where the desired cost is to reach a goal while minimizing the probability of colliding with obstacles. In that case, in open areas where there are no narrow passages, the system bypasses the belief node stabilizations. It speeds up and does not waste any time gathering information and reducing its uncertainty, as there is not much benefit in doing so in obstacle-free regions. However, once it faces obstacles or sufficiently narrow passages, it automatically decides to perform stabilization (partially) until the uncertainty has shrunk enough to safely traverse the narrow passage. Fig. 1 shows this phenomenon pictorially.

Performance guarantees: We provide lower bounds on

the performance of the method and show that in stationary environments, the performance and success probability of the proposed method always exceed those of the FIRM method.

Applications to physical systems: Among the set of methods that cope with continuous state/action/observation POMDPs, only a few (e.g., [33], [34]) have been applied to physical systems, due to the computational intractability of these methods in real-world robotics problems. An important contribution of this work is to implement a continuous state/action/observation POMDP on a physical robot in a real-world office-like environment with changes in the environment. We explain this procedure in detail and discuss the theoretical tools and methods designed during this process to help with the real-world implementation, including the lazy feedback approach.

II. BELIEF SPACE PLANNING FOR MOBILE ROBOTS

In this section, we briefly describe the abstract framework of the Feedback-based Information RoadMap (FIRM), followed by a description of its concrete implementation in our experiments. We refer the reader to [39], [40] for a more in-depth description of the abstract FIRM method.

A. Planning Under Uncertainty

In motion planning under uncertainty, we are looking for a policy $\pi_k$ at time step $k$ that generates a control action $u_k$ based on the available information about the robot. We start by defining some key terms. Consider a system whose state is denoted by $x_k$ at the $k$-th time step, and let $u_k$ and $w_k$ be the control action and motion noise, respectively, at time $k$. Let us denote the state evolution model by $x_{k+1} = f(x_k, u_k, w_k)$. In a partially observable system, we do not have access to the true state of the system. Rather, we get some noisy observations of the robot state. Let us denote the sensor measurement (or observation) vector by $z_k$ at the $k$-th time step and the measurement model by $z_k = h(x_k, v_k)$, where $v_k$ denotes the sensing noise. The only data available for decision making at the $k$-th time step (i.e., for generating $u_k$) is the history of observations and controls: $\mathcal{H}_k = \{z_{0:k}, u_{0:k-1}\} = \{z_0, z_1, \dots, z_k, u_0, \dots, u_{k-1}\}$.

The conditional probability distribution over all possible system states, $b_k = p(x_k \,|\, \mathcal{H}_k)$, which is referred to as the belief or information-state, compresses the data $\mathcal{H}_k$. It is well known that in Bayesian filtering the belief can be computed recursively based on the last action and the current observation, $b_{k+1} = \tau(b_k, u_k, z_{k+1})$ [37], [41]:
$$b_{k+1} = \alpha \, p(z_{k+1} \,|\, x_{k+1}) \int p(x_{k+1} \,|\, x_k, u_k)\, b_k \, dx_k =: \tau(b_k, u_k, z_{k+1}),$$
where $\alpha = p(z_{k+1} \,|\, \mathcal{H}_k, u_k)^{-1}$ is the normalization constant. As a result of filtering, the action $u_k$ can be taken based on the belief $b_k$ using a policy (planner) $\pi_k$, i.e., $u_k = \pi_k(b_k)$. It is well known that $\pi_k$ is the solution of a POMDP, which is intractable over continuous state/action/observation spaces.
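Under the Gaussian-belief assumption adopted in this paper, the belief transition $\tau$ can be realized with a Kalman-type filter. The following is a minimal sketch of one such update in extended Kalman filter form; the function names and the `jacobians` interface are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def ekf_belief_update(mean, cov, u, z, f, h, jacobians, Q, R):
    """One step of the belief transition b_{k+1} = tau(b_k, u_k, z_{k+1})
    in EKF form, for a Gaussian belief (mean, cov).
    `jacobians(x, u)` is assumed to return A = df/dx and H = dh/dx at x."""
    # Prediction: propagate the mean through the motion model with zero noise.
    mean_pred = f(mean, u, np.zeros(Q.shape[0]))
    A, H = jacobians(mean_pred, u)
    cov_pred = A @ cov @ A.T + Q

    # Correction: fold in the actual observation z.
    S = H @ cov_pred @ H.T + R                 # innovation covariance
    K = cov_pred @ H.T @ np.linalg.inv(S)      # Kalman gain
    mean_new = mean_pred + K @ (z - h(mean_pred, np.zeros(R.shape[0])))
    cov_new = (np.eye(len(mean)) - K @ H) @ cov_pred
    return mean_new, cov_new
```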

B. A Brief Overview of FIRM

FIRM is a framework designed to reduce the mentioned intractable POMDP problem to a tractable problem by generating a representative graph (analogous to a PRM: Probabilistic Roadmap Method) in the belief space. Similar to PRMs, where the solution path is a concatenation of local paths, in FIRM the solution policy is a concatenation of local policies. Every node in FIRM is a small region $B = \{b : \|b - \bar{b}\| \le \epsilon\}$ around a sampled belief $\bar{b}$. We denote the $i$-th node by $B^i$ and the set of nodes by $\mathbb{V} = \{B^i\}$. Each edge in FIRM is a closed-loop local controller whose goal is to steer the belief into the target node of the edge. An edge controller connecting nodes $i$ and $j$ is denoted by $\mu^{ij}$ and the set of edges by $\mathbb{M} = \{\mu^{ij}\}$. Thus, a policy $\pi^g$ on the graph is a mapping from graph nodes to edges, i.e., $\pi^g : \mathbb{V} \to \mathbb{M}$. $\Pi^g$ denotes the set of all possible graph policies.

Having such a graph in belief space, we can form a tractable POMDP on the FIRM graph (the so-called FIRM MDP):
$$\pi^g = \arg\min_{\Pi^g} \; \mathbb{E} \sum_{n=0}^{\infty} C^g(B_n, \pi^g(B_n)) \qquad (1)$$
where $B_n$ is the $n$-th visited node and $\mu_n$ is the edge taken at $B_n$. $C^g(B, \mu) := \sum_{k=0}^{\mathcal{T}} c(b_k, \mu(b_k))$ is the generalized cost of taking the local controller $\mu$ at node $B$ centered at $b_0$. We incorporate the failure set into planning by adding a hypothetical FIRM node $B^0 = F$ to the list of FIRM nodes. Since the FIRM MDP in Eq. (1) is defined over a finite set of nodes, it can be solved by computing the graph node cost-to-gos through the following dynamic programming problem:

$$J^g(B^i) = \min_{\mu} \Big\{ C^g(B^i, \mu) + \sum_{\gamma=0}^{N} P^g(B^\gamma \,|\, B^i, \mu)\, J^g(B^\gamma) \Big\} \qquad (2)$$
and $\pi^g(B^i) = \mu^*$, where $\mu^*$ is the argument of the above minimization. $P^g(B^\gamma \,|\, B^i, \mu)$ is the probability of reaching $B^\gamma$ from $B^i$ under $\mu$.


Fig. 1. A representational scenario depicting how rollout-based FIRM achieves higher performance compared to the standard FIRM algorithm while guaranteeing robustness. The nine scenes depict different stages of task execution as the robot moves from the start to the goal location. (a) A simple scenario with a FIRM roadmap, robot, and environment as depicted. (b) The rollout policy is computed periodically; four candidate edges are compared (three of them are shown as dashed lines and the last one is the remainder of the current edge). (c) In clutter-free regions, rollout takes a shorter route (edge 3), increasing performance and speed while losing certainty (i.e., skipping node stabilization). (d) While completing edge 3, the new rollout further cuts down the task execution time by taking a shorter route through a newly computed rollout edge 2. (e) The robot approaches the cluttered region; as needed, the planner slows the robot down to trade performance for certainty. (f) Stabilization reduces localization uncertainty (the covariance shrinks), thus guaranteeing a high success probability. (g) Stabilization occurs again as the robot is still passing through the narrow passage. (h) New rollout connections allow bypassing stabilization. (i) The robot approaches the goal.

The failure and goal cost-to-gos (i.e., $J^g(B^0)$ and $J^g(B_{\text{goal}})$) are set to a suitably high positive value and zero, respectively.
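For concreteness, the graph DP in Eq. (2) can be solved by standard value iteration over the finite set of FIRM nodes. The sketch below assumes dictionary-based containers for edge costs and transition probabilities; the data structures and the failure cost value are illustrative choices, not the paper's implementation.

```python
def solve_firm_dp(nodes, edges, edge_cost, trans_prob, goal, fail,
                  fail_cost=1e4, tol=1e-6):
    """Value iteration for the FIRM-graph DP of Eq. (2).
    edges[i]   : list of controller ids available at node i
    edge_cost  : dict (i, mu) -> C^g(B^i, mu)
    trans_prob : dict (i, mu) -> {target node: P^g(B^target | B^i, mu)}
    Returns the cost-to-gos J and the graph policy (best controller per node)."""
    J = {i: 0.0 if i == goal else fail_cost for i in nodes}
    policy = {}
    while True:
        delta = 0.0
        for i in nodes:
            if i in (goal, fail):
                continue  # boundary conditions stay fixed
            costs = {mu: edge_cost[(i, mu)]
                         + sum(p * J[g] for g, p in trans_prob[(i, mu)].items())
                     for mu in edges[i]}
            mu_star = min(costs, key=costs.get)
            delta = max(delta, abs(costs[mu_star] - J[i]))
            J[i], policy[i] = costs[mu_star], mu_star
        if delta < tol:
            return J, policy
```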

    C. Concrete FIRM instance in our implementation

Here we discuss the concrete realization of the FIRM graph constructed for conducting the experiments.

One-step cost: Although the objective function can be general, the cost function we use in our experiments includes the localization uncertainty, control effort, and elapsed time:
$$c(b_k, u_k) = \xi_p\, \mathrm{tr}(P_k) + \xi_u \|u_k\| + \xi_T, \qquad (3)$$
where $\mathrm{tr}(P_k)$ is the trace of the estimation covariance, used as a measure of localization uncertainty. The norm of the control signal $\|u_k\|$ denotes the control effort, and $\xi_T$ is present in the cost to penalize each time lapse. The coefficients $\xi_p$, $\xi_u$, and $\xi_T$ are user-defined, task-dependent scalars that combine these costs to achieve a desirable behavior. In the presence of constraints (such as obstacles in the environment), we assume the task fails if the robot violates these constraints (e.g., collides with obstacles). Therefore, in case of failure, the running sum of costs (cost-to-go), i.e., $J^g(B^0) = J^g(F)$, is set to a suitably high value.
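A minimal sketch of this one-step cost, assuming the belief is summarized by its estimation covariance; the coefficient values are placeholders, not the ones used in the paper's experiments.

```python
import numpy as np

def one_step_cost(cov, u, xi_p=1.0, xi_u=0.1, xi_T=1.0):
    """One-step cost of Eq. (3): penalizes localization uncertainty
    (trace of the estimation covariance), control effort, and elapsed time.
    Coefficient values here are illustrative placeholders."""
    return xi_p * np.trace(cov) + xi_u * np.linalg.norm(u) + xi_T
```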

Steering localization covariance: To construct a FIRM graph, we first need to sample a set of stabilizers (belief steering functions). Each stabilizer is a closed-loop controller whose role is to drive the localization uncertainty (or belief) to a FIRM node. A stabilizer consists of a filter and a separated controller [42]. The filter governs the belief evolution, and the separated controller generates control signals based on the available belief at each time step [42]. To design these steering laws, we first sample a set of points $V = \{v^i\}$ in the robot's state space, and then, associated with each point, we construct a stabilizer [39]. In the vicinity of each node $v^j$, we rely on a Stationary Kalman Filter (SKF), constructed by linearizing the system about the target point $v^j$, as the stabilizer's filter. It can be shown that for an observable system, the covariance under the $j$-th SKF approaches a stationary covariance $P^{j+}_s$, which can be efficiently computed by solving a corresponding Discrete Algebraic Riccati Equation [43].
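The stationary covariance $P^{j+}_s$ can be obtained from the estimator form of the Discrete Algebraic Riccati Equation, for instance via SciPy. The sketch below assumes the state and observation Jacobians $A$ and $H$ from linearizing the system about $v^j$, and the noise covariances $Q$ and $R$; it illustrates the computation and is not the paper's code.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def stationary_covariance(A, H, Q, R):
    """Steady-state estimation covariance of the Stationary Kalman Filter
    obtained by linearizing the system about a target node v^j.
    Uses the estimator (dual) form of the DARE."""
    P_prior = solve_discrete_are(A.T, H.T, Q, R)   # steady-state prediction covariance
    S = H @ P_prior @ H.T + R
    K = P_prior @ H.T @ np.linalg.inv(S)
    P_post = P_prior - K @ H @ P_prior             # stationary posterior covariance P_s^{j+}
    return P_post
```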

Steering localization mean: While steering the belief toward node $B^j$, the separated controller $\mu^{ij}$ is responsible for generating the control signals based on the available belief, i.e., $u_k = \mu^{ij}(b_k)$. The iRobot Create is a nonholonomic robot and is modeled as a unicycle (see Section V-A2); thus, to steer the estimation mean toward the target node $v^j$, one needs to use a controller designed for stabilizing nonholonomic systems (e.g., [44], [45], [46]). However, the randomness of the estimation mean (resulting from the randomness of observations) calls for a controller that can perform such stabilization under uncertainty. To this end, we implemented different controllers, including a polar coordinate-based controller [47] and a Dynamic Feedback Linearization-based controller [48]. Observing the behavior of the different controllers, we adopted a variant of the Open-Loop Feedback Control (OLFC) scheme [37] for stabilization purposes.


In this variant of OLFC, for a given $v^j$, we compute an open-loop control sequence starting from the current estimation mean and ending at $v^j$. Then, we apply a truncated sequence of the first $l$ controls ($l = 5$ in our experiments)¹. This process repeats every $l$ steps until we reach the vicinity of $v^j$.

FIRM graph: Associated with each sample $v^j$, we can define the belief node $\bar{b}^j \equiv (v^j, P^{j+}_s)$. Defining the FIRM node as a ball around this point, $B^j = \{b : \|b - \bar{b}^j\| \le \epsilon\}$, we can steer the Gaussian localization uncertainty to this ball with the combination of OLFC and SKF. Accordingly, we sample $N$ FIRM nodes $\{B^j\}_{j=1}^N$. The SKF/OLFC combination between nodes $i$ and $j$ forms the FIRM edge (local controller) and is denoted by $\mu^{ij}$. We connect each node to its $k$ nearest neighbors, and the set of constructed edges is denoted by $\mathbb{M} = \{\mu^{ij}\}$. Then, we compute and store the costs and transition probabilities associated with each edge via offline simulations. Finally, we solve the DP in Eq. (2) to obtain the optimal graph cost-to-gos $J^g(B^i)$ and policy $\pi^g(B^i)$ for all $i$.
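The offline edge evaluation mentioned above can be sketched as a simple Monte Carlo routine: each local controller is simulated repeatedly to estimate its expected execution cost and the landing probabilities on graph nodes (including the failure node). The `simulate` callback and the sample count below are assumptions for illustration only.

```python
import numpy as np

def evaluate_edge(mu_ij, b_start, num_nodes, simulate, n_mc=100):
    """Monte Carlo evaluation of a FIRM edge (local controller) mu_ij:
    estimates the expected cost C^g and the transition probabilities
    P^g(B^gamma | B^i, mu_ij) by running the closed loop n_mc times.
    `simulate(mu, b)` runs filter + controller from b until a node (or the
    failure set) is reached and returns (accumulated_cost, absorbing_index),
    where absorbing_index in [0, num_nodes] and index 0 is the failure node."""
    costs = []
    landings = np.zeros(num_nodes + 1)
    for _ in range(n_mc):
        cost, landed_in = simulate(mu_ij, b_start)
        costs.append(cost)
        landings[landed_in] += 1
    return np.mean(costs), landings / n_mc
```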

III. SLAP VIA ROLLOUT-BASED DYNAMIC REPLANNING IN BELIEF SPACE

To enable SLAP and handle changes in the environment and goal location, large deviations in the robot's location, and discrepancies between real and computational models, we resort to dynamic replanning in belief space. In this section, we discuss the extension of the RHC and rollout policy [37] to the belief space to design a principled scheme for online replanning and to increase the performance of FIRM by smart selective stabilization.

To make the connection with the rollout policy, we restate the POMDP problem in the more general setting of a time-varying policy:
$$\pi_{0:\infty}(\cdot) = \arg\min_{\Pi_{0:\infty}} \; \sum_{k=0}^{\infty} \mathbb{E}\left[ c(b_k, \pi_k(b_k)) \right]$$
$$\text{s.t.} \quad b_{k+1} = \tau(b_k, \pi_k(b_k), z_k), \quad z_k \sim p(z_k \,|\, x_k)$$
$$x_{k+1} = f(x_k, \pi_k(b_k), w_k), \quad w_k \sim p(w_k \,|\, x_k, \pi_k(b_k)) \qquad (4)$$

In the above problem, we seek a sequence of policies $\pi_{0:\infty} = \{\pi_0(\cdot), \pi_1(\cdot), \pi_2(\cdot), \dots\}$, where $\pi_k$ maps any given $b_k$ to the optimal action $u_k$, and $\Pi_k$ is the space of all possible policies at time step $k$, i.e., $\pi_k \in \Pi_k$. In the infinite horizon case, it can be shown that the solution is a stationary policy $\pi_s$, i.e., $\pi_1 = \pi_2 = \dots = \pi_s$, and the problem is reduced to the one introduced earlier in this paper. However, we keep the time-varying format for reasons that will become clear further below.

As discussed earlier, solving the above POMDP problem is computationally intractable over continuous state, action, and observation spaces. The even more difficult problem is to solve the SLAP problem, which requires re-solving the above POMDP online every time the localization pdf is updated. We handle this problem by reusing computations in an efficient way, as will be explained in the next subsection. However, we first start with RHC, which is a natural way of thinking about such repeated online solutions.

RHC in belief space: Receding horizon control (often referred to as rolling horizon or model predictive control) was originally designed for deterministic systems [49] to cope with model discrepancies.

¹Only one control (i.e., $l = 1$) is not enough due to the nonholonomicity of the system.

For stochastic systems, where a closed-loop (feedback) control law is needed, the formulation of the RHC scheme is up for debate [30], [50]-[52]. In the most common form of RHC [37], the stochastic system is approximated with a deterministic system by replacing the uncertain quantities with their typical values (e.g., the maximum likelihood value). In belief space planning, the quantity that injects randomness into the belief dynamics is the observation. Thus, one can replace the random observations $z_k$ with their deterministic maximum likelihood values $z^{ml}$, where $z^{ml}_k := \arg\max_z p(z_k \,|\, x^d_k)$, in which $x^d$ is the nominal deterministic state that results from replacing the motion noise $w$ by zero, i.e., $x^d_{k+1} = f(x^d_k, \pi_k(b^d_k), 0)$.

The deterministic belief $b^d$ is then used for planning in the receding horizon window. At every time step, the RHC scheme performs a two-stage computation. In the first stage, the RHC scheme for deterministic systems solves an open-loop control problem (i.e., returns a sequence of actions $u_{0:T}$) over a fixed finite horizon $T$ as follows:
$$u_{0:T} = \arg\min_{U_{0:T}} \; \sum_{k=0}^{T} c(b^d_k, u_k)$$
$$\text{s.t.} \quad b^d_{k+1} = \tau(b^d_k, u_k, z^{ml}_{k+1})$$
$$z^{ml}_{k+1} = \arg\max_z p(z \,|\, x^d_{k+1})$$
$$x^d_{k+1} = f(x^d_k, u_k, 0) \qquad (5)$$

In the second stage, it executes only the first action $u_0$ and discards the remaining actions in the sequence $u_{0:T}$. However, since the actual observation is noisy and is not equal to $z^{ml}$, the belief $b_{k+1}$ will be different from $b^d_{k+1}$. Subsequently, RHC performs these two stages from the new belief $b_{k+1}$. In other words, RHC computes an open-loop sequence $u_{0:T}$ from this new belief, and this process continues until the belief reaches the desired belief location. Algorithm 1 recaps this procedure. State-of-the-art methods such as [53] and [32] utilize RHC in belief space. [32] refers to the method as partially-closed-loop RHC (PCLRHC), as it exploits partial information about future observations (i.e., $z^{ml}$) and does not ignore them.

Algorithm 1: RHC with most likely observations for partially-observable stochastic systems
1 input: Initial belief $b_{\text{current}} \in \mathbb{B}$, goal region $B_{\text{goal}} \subset \mathbb{B}$
2 while $b_{\text{current}} \notin B_{\text{goal}}$ do
3   $u_{0:T}$ = Solve the optimization in Eq. (5) starting from $b^d_0 = b_{\text{current}}$;
4   Apply the action $u_0$ to the system;
5   Observe the actual $z$;
6   Compute the belief $b_{\text{current}} \leftarrow \tau(b_{\text{current}}, u_0, z)$;
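A compact sketch of this loop in Python follows; the `solve_open_loop`, `apply_control`, `sense`, and `tau` callbacks, as well as the `B_goal.contains` membership test, are assumed interfaces used only for illustration.

```python
def rhc_most_likely(b0, B_goal, solve_open_loop, apply_control, sense, tau):
    """RHC with most-likely observations (cf. Algorithm 1). `solve_open_loop`
    solves the deterministic problem of Eq. (5) from the current belief and
    returns u_{0:T}; only the first action is executed, then the belief is
    re-filtered with the actual measurement."""
    b = b0
    while not B_goal.contains(b):
        u_seq = solve_open_loop(b)   # open-loop plan under z = z^ml
        apply_control(u_seq[0])      # execute only the first action
        z = sense()                  # actual (noisy) observation
        b = tau(b, u_seq[0], z)      # belief update with the real measurement
    return b
```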

A known shortcoming of the stated RHC formulation is its limited horizon, which might lead the system to local minima by choosing actions that guide the robot toward favorable states (with low cost) in the near future, followed by a set of unfavorable states (with high cost) in the long run. To improve the basic RHC, different variants have been proposed, including the rollout policy [37]. Here, we discuss how they can be extended and realized in belief space.

Rollout policy in belief space: Another class of methods that aims to reduce the complexity of the stochastic planning problem in Eq. (4) is the class of rollout policies [37], which are more powerful than the described version of RHC in the


following sense: First, they do not approximate the system with a deterministic one. Second, they avoid local minima using a suboptimal policy that approximates the true cost-to-go beyond the horizon. This suboptimal policy is referred to as the base policy, and its cost-to-go is denoted by $\tilde{J}$. Formally, at each step of the rollout policy scheme, the following closed-loop optimization is solved:
$$\pi_{0:T}(\cdot) = \arg\min_{\Pi_{0:T}} \; \mathbb{E}\left[ \sum_{k=0}^{T} c(b_k, \pi_k(b_k)) + \tilde{J}(b_{T+1}) \right]$$
$$\text{s.t.} \quad b_{k+1} = \tau(b_k, \pi_k(b_k), z_k), \quad z_k \sim p(z_k \,|\, x_k)$$
$$x_{k+1} = f(x_k, \pi_k(b_k), w_k), \quad w_k \sim p(w_k \,|\, x_k, \pi_k(b_k)) \qquad (6)$$
Then, only the first control law $\pi_0$ is used to generate the

control signal $u_0$, and the remaining policies are discarded. Similar to RHC, after applying the first control, a new sequence of policies is computed from the new point. The rollout algorithm is described in Algorithm 2.

Algorithm 2: Rollout algorithm in belief space
1 input: Initial belief $b_{\text{current}} \in \mathbb{B}$, goal region $B_{\text{goal}} \subset \mathbb{B}$
2 while $b_{\text{current}} \notin B_{\text{goal}}$ do
3   $\pi_{0:T}$ = Solve the optimization in Eq. (6) starting from $b_0 = b_{\text{current}}$;
4   Apply the action $u_0 = \pi_0(b_0)$ to the system;
5   Observe the actual $z$;
6   Compute the belief $b_{\text{current}} \leftarrow \tau(b_{\text{current}}, u_0, z)$;

Although the rollout policy in the belief space efficiently reduces the computational cost compared to the original POMDP problem, it is still formidable to solve, since the optimization is carried out over the policy space. Moreover, there should be a base policy that provides a reasonable cost-to-go $\tilde{J}$. In the following, we propose a rollout policy in the belief space based on the FIRM-based cost-to-go.

A. Enabling SLAP via FIRM-based rollout in belief space

In this section, we discuss how a rollout policy in belief space (and hence SLAP) can be realized using the FIRM framework. As explained briefly, in FIRM the system transitions between two nodes² $B^i$ and $B^j$ at sampled beliefs $b^i$ and $b^j$ using a controller $\mu^{ij}$. The global-level decision making is limited to when the system is in the region $B^i$; the rest of the time, the local controls are executed according to $\mu^{ij}$. In FIRM-based rollout, we lift this limitation by forcing the system to globally replan at every step to enable SLAP. In particular, suppose that at time $t$ the belief state of the system is $b_t$. Then we solve the following problem online for $b_t$:

1) We connect $b_t$ to all its FIRM neighbors within some radius $R$ using suitable controllers $\mu^{tj}$, designed in a similar way to the ones used as FIRM edges.

2) We evaluate the transition costs $C(b_t, \mu^{tj})$ and the probability of landing in nodes $B^\gamma$ under the influence of the controller $\mu^{tj}$ at $b_t$, i.e., $P(B^\gamma \,|\, b_t, \mu^{tj})$.

3) We evaluate the best edge outgoing from $b_t$ by solving:

$$j^* = \arg\min_{j} \Big\{ C(b_t, \mu^{tj}) + \sum_{\gamma=0}^{N} P(B^\gamma \,|\, b_t, \mu^{tj})\, J^g(B^\gamma) \Big\} \qquad (7)$$

²In dense graphs the belief under $\mu^{ij}$ might reach a different node before reaching $B^j$. The summation over $\gamma$ in the equations takes that into account.

where $J^g(B^\gamma)$ is the nominal cost-to-go under the FIRM policy from node $B^\gamma$, and $J^g(B^0)$ is the failure cost-to-go as discussed in Section II-B.

4) We choose $\mu^{tj^*}$ as the local controller at $b_t$ if its expected success probability exceeds the current one. In other words, if $\mu^{ij}$ is the current local controller at time $t$, we only switch to $\mu^{tj^*}$ if the following condition holds:

$$\mathbb{E}[\text{success} \,|\, b_t, \mu^{tj^*}] > \mathbb{E}[\text{success} \,|\, b_t, \mu^{ij}] \qquad (8)$$
where the expected success probability is
$$\mathbb{E}[\text{success} \,|\, b_t, \mu] = \sum_{\gamma=1}^{N} P(B^\gamma \,|\, b_t, \mu)\, P_{\text{success}}(B^\gamma) \qquad (9)$$
and $P_{\text{success}}(B^\gamma)$ is the probability of success for reaching the goal from FIRM node $B^\gamma$ under the nominal FIRM policy.

Algorithm 3 describes planning with the proposed rollout process. We split the computation into offline and online phases. In the offline phase, we carry out the expensive computation of graph edge costs and transition probabilities. Then, we handle deviations and changes in the start and goal locations by repeated online replanning, while reusing the offline computations.

Algorithm 3: Rollout algorithm with FIRM as base policy
1 input: Initial belief $b_t$ and goal belief region $B_{\text{goal}}$
2 Construct a FIRM graph and store nodes $\mathbb{V} = \{B^i\}$, edges $\mathbb{M} = \{\mu^{ij}\}$, cost-to-gos $J^g(\cdot)$, and success probabilities $P_{\text{success}}(\cdot)$; // offline phase
3 while $b_t \notin B_{\text{goal}}$ // online phase
4 do
5   Find the $r$ neighboring nodes $\mathbb{V}_r = \{B^i\}_{i=1}^{r}$ of $b_t$;
6   Set $B_t = \{b_t\}$, $J(B_t) = \infty$, and $S = 0$;
7   forall the $B^j \in \mathbb{V}_r$ do
8     $\mu^{tj}$ = Generate controller from $b_t$ to $B^j$;
9     $C(b_t, \mu^{tj})$, $P(B^\gamma \,|\, b_t, \mu^{tj})$ = Simulate $\mu^{tj}$ to compute the transition probabilities and expected cost;
10    Compute the expected success $\mathbb{E}[\text{success} \,|\, b_t, \mu^{tj}]$;
11    if $\mathbb{E}[\text{success} \,|\, b_t, \mu^{tj}] \ge S$ then
12      Compute the candidate cost-to-go as $J_{\text{cand}} = C(b_t, \mu^{tj}) + \sum_{\gamma=0}^{N} P(B^\gamma \,|\, b_t, \mu^{tj}) J^g(B^\gamma)$;
13      if $J_{\text{cand}} < J(B_t)$ then
14        $J(B_t) = J_{\text{cand}}$ and $S = \mathbb{E}[\text{success} \,|\, b_t, \mu^{tj}]$;
15        $\mu^{tj^*} = \mu^{tj}$;
16  Apply the action $u_t = \mu^{tj^*}(b_t)$ to the system;
17  Get the actual measurement $z_{t+1}$;
18  Compute the next belief $b_t \leftarrow \tau(b_t, u_t, z_{t+1})$;
19  if the user submits a new goal state $v_{\text{goal}}$ then
20    $B_{\text{goal}}$ = Sample the corresponding FIRM node;
21    Add $B_{\text{goal}}$ to the FIRM graph; $\mathbb{V} \leftarrow \mathbb{V} \cup \{B_{\text{goal}}\}$;
22    Connect $B_{\text{goal}}$ to its $r$ nearest neighbors using edges $\{\mu^{(i,\text{goal})}\}$; also, $\mathbb{M} \leftarrow \mathbb{M} \cup \{\mu^{(i,\text{goal})}\}$;
23    $[J^g(\cdot), P_{\text{success}}(\cdot)]$ = DynamicProgramming($\mathbb{V}, \mathbb{M}$);
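The online core of Algorithm 3 (candidate generation, the success-probability check of Eq. (8), and the cost comparison of Eq. (7)) can be sketched as follows; the `connect` callback and the dictionary-based containers are illustrative assumptions rather than the paper's implementation.

```python
def rollout_step(b_t, neighbors, connect, J_g, P_success):
    """One online rollout decision. `connect(b_t, B_j)` generates the candidate
    controller mu^{tj} and returns its simulated cost C(b_t, mu^{tj}), the landing
    probabilities {B_gamma: P(B_gamma | b_t, mu^{tj})}, and the controller itself.
    Among candidates whose expected success (Eq. (9)) is not worse than the best
    seen so far, the one minimizing the rollout cost of Eq. (7) is returned."""
    best_J, best_success, best_mu = float("inf"), 0.0, None
    for B_j in neighbors:
        cost, landing_probs, mu_tj = connect(b_t, B_j)
        success = sum(p * P_success[B] for B, p in landing_probs.items())
        if success >= best_success:                       # success check, cf. Eq. (8)
            J_cand = cost + sum(p * J_g[B] for B, p in landing_probs.items())
            if J_cand < best_J:                           # cost comparison, cf. Eq. (7)
                best_J, best_success, best_mu = J_cand, success, mu_tj
    return best_mu, best_J, best_success
```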

Below, we discuss how Algorithm 3 realizes a variant of Eq. (7) and extends the rollout policy methods [37] to belief space. Following the concepts and terminology in [37], the nominal FIRM policy plays the role of the base policy. Accordingly, the cost-to-go of the FIRM policy is used to approximate the cost-to-go beyond the horizon. Given a dense FIRM graph, where the nodes partition the belief space, i.e., $\cup_i B^i = \mathbb{B}$, at the end of the time horizon $T$ the belief $b_{T+1}$ belongs to a FIRM node with a known cost-to-go. With a sparse FIRM graph, where nodes do not cover the entire belief space, we design local policies that drive the belief into a FIRM node at the end of the horizon. However, since the belief evolution is random, reaching a FIRM node at a deterministic time horizon $T$ may not be guaranteed. Therefore, we propose a new variant of rollout, defining the horizon based on the belief (instead of time):

$$\pi_{0:\infty}(\cdot) = \arg\min_{\Pi^\mu} \; \mathbb{E}\left[ \sum_{k=0}^{\mathcal{T}} c(b_k, \pi_k(b_k)) + \tilde{J}(b_{\mathcal{T}+1}) \right]$$
$$\text{s.t.} \quad b_{k+1} = \tau(b_k, \pi_k(b_k), z_k), \quad z_k \sim p(z_k \,|\, x_k)$$
$$x_{k+1} = f(x_k, \pi_k(b_k), w_k), \quad w_k \sim p(w_k \,|\, x_k, \pi_k(b_k))$$
$$b_{\mathcal{T}+1} \in \cup_j B^j \qquad (10)$$
where for $b_{\mathcal{T}+1} \in B^j$ we have
$$\tilde{J}(b_{\mathcal{T}+1}) = J^g(B^j) \qquad (11)$$
and $\Pi^\mu$ is a restricted set of policies under which the belief will reach a FIRM node $B^j$ in finite time. In other words, if $\pi \in \Pi^\mu$ and $\pi = \{\pi_1, \pi_2, \dots\}$, then for a finite $\mathcal{T}$ we have $P(b_{\mathcal{T}+1} \in \cup_j B^j \,|\, \pi) = 1$. Thus, the last constraint in Eq. (10) is redundant and automatically satisfied for a suitably constructed $\Pi^\mu$. Also, it is worth noting that the FIRM-based cost-to-go $J^g(\cdot)$ plays the role of the cost-to-go beyond the horizon $\tilde{J}(\cdot)$ (Eq. (11)).

Note that based on Algorithm 3, we can provide guarantees on the performance of the proposed method. Before formally stating the results, recall that at each instance of rollout computation, the current belief $b_t$ is added as a virtual node $B^{\text{virtual}}_t$ to the FIRM graph to generate the augmented FIRM graph $\mathcal{G}^a_t$. A virtual node is a temporary node with no incoming edges, which is removed from the graph as soon as the system departs its vicinity.

Proposition 1: The performance and the success probability of the FIRM-rollout policy are strictly lower bounded by those of the nominal FIRM policy at any belief state during execution of the planner.

Proof 1: As discussed, to compute the rollout at time $t$, the belief $b_t$ is added to the FIRM graph as a virtual node with no incoming edges that drive the system into it. Therefore, the dynamic programming solution remains unchanged. Thus, the optimal cost-to-go from the virtual node $B^{\text{virtual}}_t$ is given by the minimum of the sum of the rollout edge cost and the cost-to-go from the target of the rollout edge, i.e.,
$$J(B^{\text{virtual}}_t) = \min_{j} \Big\{ C(b_t, \mu^{tj}) + \sum_{\gamma=0}^{N} P(B^\gamma \,|\, b_t, \mu^{tj})\, J^g(B^\gamma) \Big\}$$

Since the current FIRM edge is one of the edges over which the above minimization is carried out, the cost-to-go (performance) with rollout is strictly upper (lower) bounded by the nominal FIRM policy cost (performance). Furthermore, due to the check in Eq. (8), it can further be assured that the probability of success of the rollout policy is strictly greater than that of the FIRM policy.

Once the rollout is computed and the target node is chosen, the robot starts executing the controller $\mu^{tj^*}$ and leaves the vicinity of node $B_t$. This node then gets removed from the graph; hence it is called a virtual node. Further, it should be noted that as the robot moves along the virtual edge (the edge from node $B^{\text{virtual}}_t$ to $B^{j^*}$), the rollout process is repeated, which leads the robot to skip belief stabilization as needed. Consequently, as the robot moves, due to rollout, it chooses actions that are never worse than the nominal FIRM policy. We refer the reader to Fig. 1 for a visual explanation of the process.

Remark: If the desired factor were merely the success probability, one could ignore the cost-to-go comparison condition in Algorithm 3 and only maximize the success probability.

In addition to improving the performance while not compromising safety, the rollout procedure is particularly helpful in handling changes in the environment map. We discuss this aspect in the following section.

B. Enabling SLAP in changing environments

In this section, we investigate the ability of the proposed SLAP framework to handle changes in the obstacle map. Here, we focus on a challenging case, where changes in the obstacle map are persistent and can possibly eliminate a homotopy class of solutions. Doors are an important example of this class. If the robot observes that a door is closed (which was expected to be open), it might have to globally change the plan to get to the goal through a different passage. This is not feasible with the state-of-the-art methods in the belief space planning literature. We will discuss other types of changes in the environment in the experiments section.

To handle such changes in the obstacle map and replan accordingly, inspired by the lazy evaluation methods for PRM frameworks [54], we propose a method for lazy evaluation of the generated feedback tree, referred to as the lazy feedback evaluation algorithm. The basic idea is that at every node the robot re-evaluates only the next edge (or the next few edges up to a fixed horizon) that it will most likely take. By re-evaluation, we mean that it re-computes the collision probabilities along these edges. If there is a significant change in the collision probabilities (measured by the threshold $\delta$ in Alg. 4), the dynamic programming problem is re-solved and new cost-to-gos are computed. Otherwise, the cost-to-gos remain unchanged and the robot keeps following its rollout policy. Such lazy evaluation (computing the collision probabilities for a single edge or a small number of edges) can be performed online. The method is detailed in Algorithm 4.

Remark: Another challenge with these persistent changes is that they stay in memory. Imagine a case where the robot is in a room with two doors. Suppose that after checking both doors, the robot realizes they are closed. In those cases where no homotopy class of solutions to the goal remains, the door state is reset to "open" after a specific amount of time to persuade the robot to recheck the state of the doors.

It is important to note that it is the particular structure of the proposed SLAP framework that makes such replanning feasible online. The graph structure of the underlying FIRM allows us to locally change the collision probabilities in the environment without affecting the collision probabilities of the rest of the graph (i.e., the properties of different edges on the graph are independent of each other). Such a property is not present in the state-of-the-art sampling-based belief space planners (e.g., [12], [13]), where


the collision probabilities and costs on all edges are dependent on each other and hence need to be re-computed.

Algorithm 4: Lazy Feedback Evaluation (Lazy Replanning)
1 input: FIRM graph
2 output: Updated cost-to-gos $J^g(\cdot)$ and success probabilities $P_{\text{success}}(\cdot)$
3 Perceive the obstacle map;
4 if there is a change in the map then
5   $F$ = Retrieve the sequence of nominal edges returned by the feedback up to horizon $l$; set ctr = 0;
6   forall the edges $\mu \in F$ do
7     Re-compute the collision probabilities $P_{\text{new}}(B, \mu)$ from the start node $B$ of edge $\mu$;
8     if $|P_{\text{new}}(B, \mu) - P(B, \mu)| > \delta$ then
9       $P(B, \mu) \leftarrow P_{\text{new}}(B, \mu)$;
10      ctr++;
11  Update the edge set $\mathbb{M}$ based on the new transition probabilities;
12  if ctr > 0 then
13    $[J^g(\cdot), P_{\text{success}}(\cdot)]$ = DynamicProgramming($\mathbb{V}, \mathbb{M}$);
14  return $J^g(\cdot)$ and $P_{\text{success}}(\cdot)$;
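A minimal sketch of the lazy feedback evaluation follows; the threshold, horizon, and callback interfaces are placeholders rather than the values or code used in the experiments.

```python
def lazy_feedback_evaluation(feedback_edges, recompute_collision_prob,
                             stored_prob, solve_dp, delta=0.05, horizon=3):
    """Lazy re-evaluation of the feedback (cf. Algorithm 4): only the next few
    edges the robot will most likely take are re-checked against the newly
    perceived obstacle map. The DP is re-solved only if some collision
    probability changed by more than the threshold delta."""
    changed = 0
    for edge in feedback_edges[:horizon]:
        p_new = recompute_collision_prob(edge)
        if abs(p_new - stored_prob[edge]) > delta:
            stored_prob[edge] = p_new
            changed += 1
    if changed > 0:
        return solve_dp()   # returns updated J^g and P_success
    return None             # cost-to-gos remain unchanged
```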

IV. SIMULATION RESULTS

In this section, we present simulation results comparing the standard FIRM algorithm with FIRM with rollout in a 2D navigation problem. The simulation represents a motion planning scenario wherein the robot is tasked to go from a start location to multiple goal locations sequentially in an environment with clutter, narrow passages, and asymmetrically placed landmarks/beacons. We conduct two simulations, the first with the standard FIRM algorithm and the second with the rollout-based policy framework presented in this work. All simulations were carried out on a Dell Precision desktop with a quad-core Intel(R) Xeon(R) E5-2609 CPU running at 2.40 GHz and 16 GB of RAM, running Ubuntu 14.04.

1) Motion Model: We represent the motion of the robot by a simple planar kinematic model in which each component of the state can be independently controlled. The state of the robot at time $k$ is denoted by $x_k = (\mathrm{x}_k, \mathrm{y}_k, \theta_k)^T$ (2D position and heading angle). We denote the control by $u_k = (v_{x,k}, v_{y,k}, \omega_k)^T$ and the process noise by $w_k = (n_{v_x}, n_{v_y}, n_\omega)^T \sim \mathcal{N}(0, Q_k)$. Let $f$ denote the kinematics of the robot, such that $x_{k+1} = f(x_k, u_k, w_k) = x_k + u_k\, \delta t + w_k \sqrt{\delta t}$.
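A minimal sketch of this motion model, with placeholder values for the time step and noise covariance:

```python
import numpy as np

def propagate(x, u, dt=0.1, Q=np.diag([0.01, 0.01, 0.005])):
    """Planar kinematic model used in the simulations:
    x_{k+1} = x_k + u_k*dt + w_k*sqrt(dt), with w_k ~ N(0, Q_k).
    State x = (x, y, theta), control u = (v_x, v_y, omega).
    The dt and Q values here are illustrative placeholders."""
    w = np.random.multivariate_normal(np.zeros(3), Q)
    return np.asarray(x) + np.asarray(u) * dt + w * np.sqrt(dt)
```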

2) Observation Model: We use a range-bearing landmark-based observation model in which the landmarks are assumed to be passive beacons placed at various locations in the environment. Each landmark has a unique ID associated with it, and this ID can be read by the sensor along with the range and bearing relative to the robot. Let ${}^iL$ be the location of the $i$-th landmark; then the displacement vector ${}^id$ from the robot to ${}^iL$ is given by ${}^id = [{}^id_x, {}^id_y]^T := {}^iL - p$, where $p = [\mathrm{x}, \mathrm{y}]^T$ is the position of the robot. Therefore, the observation ${}^iz$ of the $i$-th landmark can be described as ${}^iz = {}^ih(x, {}^iv) = [\|{}^id\|, \mathrm{atan2}({}^id_y, {}^id_x) - \theta]^T + {}^iv$.

The observation noise is assumed to be zero-mean Gaussian, such that ${}^iv \sim \mathcal{N}(0, {}^iR)$, where ${}^iR = \mathrm{diag}\big((\eta_r \|{}^id\| + \sigma^r_b)^2,\; (\eta_\theta \|{}^id\| + \sigma^\theta_b)^2\big)$. The measurement quality decreases as the robot gets farther from the landmarks, and the parameters $\eta_r$ and $\eta_\theta$ determine this dependency; $\sigma^r_b$ and $\sigma^\theta_b$ are the bias standard deviations.

Fig. 2. (a) The simulation environment. The black diamonds depict the landmarks, the grey polygons are the obstacles, and the white space represents the free space. The locations of interest that the robot is tasked to visit are marked by red crosses. The two narrow passages P1 and P2 are marked; these represent regions of high collision probability (risky) due to the small clearance. (b) The underlying FIRM roadmap: the grey lines depict edges and the cyan disks are the nodes; the dashed ellipses represent the stationary covariances of the FIRM nodes.
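A minimal sketch of this range-bearing observation model with distance-dependent noise; parameter values are left to the caller and the landmark-ID handling is omitted:

```python
import numpy as np

def observe_landmark(x, landmark, eta_r, eta_th, sigma_rb, sigma_thb):
    """Range-bearing observation of one landmark:
    z = [||d||, atan2(d_y, d_x) - theta]^T + v,  v ~ N(0, R),
    R = diag((eta_r*||d|| + sigma_rb)^2, (eta_th*||d|| + sigma_thb)^2)."""
    d = np.asarray(landmark) - np.asarray(x[:2])      # displacement to the landmark
    rng = np.linalg.norm(d)
    R = np.diag([(eta_r * rng + sigma_rb) ** 2,
                 (eta_th * rng + sigma_thb) ** 2])    # distance-dependent noise
    v = np.random.multivariate_normal(np.zeros(2), R)
    return np.array([rng, np.arctan2(d[1], d[0]) - x[2]]) + v
```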

3) Environment and Scenario: The environment in Fig. 2(a) represents a 2D office space measuring 21 m × 21 m with obstacles and beacons placed in it. The robot is represented by a circular disk of radius 1 m. There are two narrow passages, P1 and P2, which represent paths or edges of low transition probability/high collision probability. The narrow passages are 1.25 m wide, thus offering a very small clearance for the robot to pass through. The robot is placed at starting location A and tasked to visit 4 different locations in a sequential manner; these are marked as B, C, D, and E in the environment. Thus, we split the task into 4 sequential phases: 1) A → B, 2) B → C, 3) C → D, and 4) D → E. The simulation is carried out twice, once with the standard FIRM algorithm and once with the proposed rollout-based method in belief space. In the following sections, we explain each simulation and then present a comparative analysis.

4) Planning with the Standard FIRM Algorithm: Here we give a brief explanation of how the FIRM algorithm works, introduce some key terminology, and then explain the individual steps of the simulation itself.

Offline phase: First, we construct the underlying FIRM roadmap as depicted in Fig. 2(b). This roadmap is constructed by uniformly sampling configurations in the free space and then building the stationary beliefs on these configurations, which are the nodes of our FIRM roadmap. At each node, there exists a stabilization controller (stabilizer) which drives beliefs from a region around the node to the stationary belief. The edges of the FIRM roadmap are generated by first finding valid (collision-free) straight-line connections between neighboring nodes (nodes within a neighborhood of fixed radius R) and then generating edge controllers which drive the belief from the start belief node to the vicinity of the goal belief node. For each edge in the graph, we run Monte Carlo simulations to compute the expected execution cost and transition probability. Once we have constructed the underlying FIRM roadmap, we store it for use in the online phase.

Online phase: In the online phase, the planner has access to the stored roadmap that was computed offline and receives a start and a goal configuration. These configurations are added to the existing roadmap by computing the appropriate stationary


Fig. 3. Phase 1 of policy execution with FIRM, starting at A and going to B. The locations A and B are marked in Fig. 2(a). (a) The FIRM feedback is shown by the green edges for goal location B. The robot is depicted by the blue disk and the red disk depicts the current belief. (b) The most likely path under FIRM, marked in yellow.

belief, stabilizer, and edge controllers. Now, using the fact that FIRM preserves the optimal sub-structure property (edges are independent of each other), we solve the dynamic programming problem on the graph for the given goal location. Before we proceed, we define a few terms that will be used frequently in the subsequent text:

FIRM Feedback Tree: The solution of the dynamic programming problem, i.e., $\pi^g$, is visualized with a feedback tree. Recall that $\pi^g$ is a mapping (look-up table) that returns the next best edge for any given graph node. Therefore, for each node, the feedback tree contains only one outgoing edge ($\mu = \pi^g(B^i)$) that pulls the robot towards the goal. The feedback tree is rooted at the goal node.

Most-Likely Path (MLP): The most likely path is generated

by concatenating the sequence of edges to take for the motion plan concerning the given start/goal combination, i.e., beginning at the start node and then adding the subsequent edges as given by the FIRM feedback. Thus, it depicts the solution as a path, which is instructive to us. Note that the robot does not follow the most-likely path exactly, due to the noise.

In phase 1, the planner is given A and B as the start and goal locations, respectively, and computes the FIRM feedback; Fig. 3(a) shows the feedback on the graph. Once the feedback is computed, the robot starts executing the feedback policy. We see in Fig. 3(b) that the robot stays close to the most-likely path. Note that the robot does not exactly follow the most-likely path due to noise; however, in this case, the noise is not high enough to change the homotopy class. Figure 3(b) shows this path passing through the narrow passage P2. On its way to the goal, the robot follows the edge controllers returned by the feedback policy and stabilizes to the FIRM nodes at the end of the edges. Once the robot reaches the first goal location B, we begin phase 2 and the planner is given the new goal location C. A new feedback policy is computed online (Fig. 4(a)) and the robot moves to C along the MLP (Fig. 4(b)), again passing through the narrow passage P2. Similarly, a new feedback policy is computed for each subsequent goal (Fig. 5(a) and Fig. 6(a)), and the robot eventually reaches its final destination E successfully.

5) Planning with Rollout-based FIRM: Here again, we first begin with the underlying FIRM roadmap that is constructed offline as explained in the previous section and then find the FIRM feedback in phase 1 for start and goal locations A and B, respectively. Once the feedback policy is computed (same as that in Fig. 3(a)), the rollout-based FIRM starts by following

Fig. 4. Phase 2 of policy execution with FIRM, starting at B and going to C. The locations B and C are marked in Fig. 2(a). (a) The FIRM feedback for goal location C. (b) The most likely path under FIRM from B to C, marked in yellow.

Fig. 5. Phase 3 of policy execution with FIRM, starting at C and going to D. (a) The FIRM feedback for goal location D. (b) The most likely path under FIRM from C to D (yellow).

the feedback tree. However, from here on, the rollout behavior emerges. Let the rollout update time interval be defined as $T_{\text{rollout}}$. Every $T_{\text{rollout}}$ seconds, the planner locally computes connections to existing FIRM nodes in a neighborhood of radius R centered at the robot's belief, i.e., the planner locally generates edge controllers with their associated cost-to-connect and transition probability. Since FIRM gives us the cost-to-go to the goal from each and every FIRM node, by finding the local connection costs, the planner checks which connection provides the lowest sum of the cost to connect to a neighborhood node and the cost-to-go from that node (Eq. 7). The connection with the lowest sum is chosen as the next edge to follow. Fig. 7(a) shows the planner checking connections (red edges) locally to neighboring FIRM nodes. An important behavior emerges in phase 1: as the robot proceeds, the rollout is able to find a shorter path through the relatively open area by skipping unnecessary stabilizations (as shown in Fig. 7(b) and 7(c)). As the robot

Fig. 6. Phase 4 of policy execution with FIRM, starting at D and going to E. (a) The FIRM feedback for goal location E. (b) The most likely path under FIRM from D to E (yellow).


Fig. 7. Phase 1 with rollout: starting at A and going to B. (a) The robot checks for a new rollout policy by locally checking connections with neighbors. The rollout connections are shown in red. The yellow path depicts the most-likely path under the nominal FIRM feedback. (b) Rollout guides the robot away from the MLP through a shorter path through the relatively open area. (c) The robot approaches narrow passage 2 through a more direct path compared to the MLP. (d) The robot stabilizes at a FIRM node while passing through the narrow passage. (e) The robot approaches goal location B. (f) The green path indicates the actual path taken by the robot under rollout; it is a geometrically shorter path compared to the nominal FIRM policy. Further, by skipping unnecessary stabilizations, the robot moves faster.

traverses the narrow passage, the rollout changes the behavior by forcing the robot to stabilize to the node, as it concludes that it is optimal to further reduce uncertainty while proceeding through the narrow passage (shown in Fig. 7(d)). Eventually the robot reaches location B through the path marked in green in Fig. 7(f). It is clear that the rollout gives the robot a distinct advantage over the nominal FIRM plan, as it guides the robot through a direct, much shorter route. Further, it should be noted that although the end segments of the two paths (i.e., after exiting the narrow passage) look similar, they differ significantly in their velocity profiles. Along the yellow path, the robot stabilizes to each and every FIRM node along the way, while along the green path (rollout), it skips stabilizations and only stabilizes when necessary. This helps the robot maintain a higher average velocity while executing the plan.

Fig. 8. Phase 2: starting at B and going to C. (a) The robot stabilizes to a FIRM node before approaching the narrow passage P2. (b) The robot has reached goal location C; its actual path is shown in green.

Fig. 9. Phase 3: starting at C and going to D. (a) Rollout connections during execution while approaching narrow passage P2. (b) The robot has reached the goal location D.

At point B, a new FIRM feedback is computed online for the next goal location C (phase 2). Prior to approaching the narrow passage, rollout drives the robot to stabilize to a FIRM node, as seen in Fig. 8(a). This stabilization helps the robot become more certain of its state before passing through a region with low transition probability. The robot then passes through the narrow passage P2 and reaches C through the path (green) shown in Fig. 8(b). In phase 3 (goal location D), the plan again takes the robot through the narrow passage P2, and during this phase, the planner drives the robot to stabilize to three FIRM nodes on the way to D (Fig. 9 shows phase 3).

Finally, in phase 4, the robot is given goal location E. Here, we notice that while returning to the narrow passage from D, the rollout causes the robot to divert slightly from the MLP, as marked in Fig. 10(a). The same was not observed when the robot was moving towards D. The reason is that the cost to reach D through the MLP was lower than the computed cost through a shorter but riskier path. However, on the way to E, while moving away from D, the MLP has a higher cost than the more direct path found by rollout. Finally, the mission is completed and the robot arrives at E, as shown in Fig. 10(b).

6) Analysis of Simulation Results: In this section, we present the analysis of the simulation results from the previous section. The key finding is that rollout-based FIRM policy execution significantly increases the performance of the standard FIRM implementation while preserving its robustness and scalability. We analyze the results in terms of three key criteria.

Time of Execution: One of the indicators of performance

is how much time a planner takes to complete the task while guaranteeing a high likelihood of success. The time taken to complete the task is of prime importance to physical


(a) The robot passing through narrow passage P2 while moving to goal E.

(b) The robot has completed its mission.

    Fig. 10. Phase 4: Starting at D and going to E.

applications, since the robot may have physical constraints, such as battery life limitations, apart from user-imposed requirements that may dictate a certain time limit for the robot to succeed. From the simulations we note that, for the same task, the time taken to complete the task with rollout is 2520 timesteps (252 s), whereas for the standard FIRM implementation it is 3606 timesteps (360.6 s), a 30% reduction in completion time compared to the standard FIRM algorithm. Thus, rollout yields a significant performance improvement, which makes it a good candidate for real-time physical applications. Moreover, as we increase the frequency of rollouts, we get closer to optimality compared to the standard FIRM implementation.

Cost of Execution: The FIRM cost metric is a numerical

value that measures the amount of uncertainty along the robot's path and how much time the robot took to traverse that path. Fig. 11(a) shows the evolution of cost with time for the same task using rollout-based FIRM and standard FIRM. We can see that the cost rises linearly for the rollout-based policy, whereas for FIRM the cost rises significantly faster, in a non-linear manner. The difference between the two costs grows non-linearly with time, indicating that as the planning horizon increases, rollout offers increasing returns in performance.

Stabilization to FIRM Nodes: One of the key performance

issues for the standard FIRM algorithm is also one of its major strengths, i.e., the node stabilization process of reaching nodes in belief space. Node stabilization makes FIRM robust and scalable while maintaining the optimal sub-structure of the graph (all the edges are independent of each other). Thus, though stabilization allows FIRM to provide certain guarantees, it might lead to slower robot motion in general, as the robot needs to reach each belief node along the way, increasing the time to complete the task and adding cost during the stabilization process at each node. Rollout-based FIRM brings a level of intelligence to this process: it performs stabilization as and when required and bypasses it when possible. By bypassing stabilization when it is not required, rollout allows the robot to complete the task faster and with less execution cost. Fig. 11(b) shows the number of nodes the robot has stabilized to with the passage of time. We note in this example that, at the end of the task, the robot stabilizes to 48 nodes under the FIRM plan, whereas under rollout-based FIRM it stabilizes to only 11 nodes. Further, rollout offers increasing benefits as the task becomes longer.

V. EXPERIMENTAL RESULTS FOR A PHYSICAL SYSTEM

This section includes the main contribution of this paper:

demonstrating an online POMDP-based solution for the


Fig. 11. Performance comparison of the original FIRM algorithm and rollout-based FIRM. (a) Cost of execution for FIRM vs. rollout-based FIRM. The execution cost for FIRM rises rapidly; for longer planning tasks, the benefit of using a rollout-based policy increases. (b) Comparison of the number of FIRM nodes that the robot stabilizes to during plan execution.

simultaneous localization and planning problem on a physical robot in a real-world setting. We discuss the details of the implementation and report the important lessons learned while investigating the application of FIRM-based SLAP to physical systems. A video demonstrating the system is available at https://parasol.tamu.edu/.

A. Target Application and System Set Up

Consider a scenario where the robot needs to operate and

reach a goal in an office environment. Each time the robot reaches a goal, a new goal is submitted by a higher-level application (e.g., manually by a user or different users). During a set of long runs, we investigate the performance of online replanning in SLAP and the robustness of the method to (i) changing obstacles, such as doors and moving people, (ii) changes in the goal location, (iii) deviations due to intermittent sensory failures, and (iv) kidnap situations (significant sudden deviations in the robot's location). During the run, there are many situations where the robot needs to replan: it needs to replan each time a new goal is submitted, so that it can move toward the new goal. The robot also encounters changes in the obstacle map; for example, it may expect a door to be open while planning a path, but sense that it is closed upon reaching it. Similarly, it encounters moving people and must avoid bumping into them. Observing these changes, the robot updates its map and replans online to update its policy. Moreover, the robot might deviate from its nominal path. As an extreme case of this situation, we look into the kidnapped situation, where a person might move the robot to an unknown location during the run; the robot then needs to recover from this catastrophic situation. Finally, the robot may deviate from its nominal location due to temporary failures in the sensing system. In all these cases, an online replanning scheme can help the robot recover from the aforementioned situations and accomplish its goal. It is worth noting that the main focus of all the different experiments in this section is to demonstrate the performance and feasibility of SLAP by enabling online belief space planning on physical robots.

1) Environment: The specific environment for conducting our experiments is the fourth floor of the Harvey Bum Bright (HRBB) building at the Texas A&M University campus in College Station, TX. The floor-plan is shown in Fig. 12. The floor spans almost 40 meters of hallways whose average width is approximately 2 meters, distinguished in yellow and blue in Fig. 12. The particular set of experiments reported in this paper is conducted in the region highlighted in blue in Fig. 12, part of which contains a large cluttered office area (the 407 area). This area has interesting properties that make the


planning more challenging: (i) the 407 area is highly cluttered inside and includes many chairs and more than 15 workstations; (ii) as seen in Fig. 12, there are several doors in this area which may be opened or closed, two of which (the front door and the back door) are labeled in Fig. 12; (iii) there are objects such as chairs and trash cans in this environment which usually get displaced; and (iv) there are moving people, who are avoided using a reactive behavior and who may displace the robot from its planned path. This further introduces challenges for the high-level planner.

Fig. 12. Floor-plan of the environment in which experiments are conducted. The front-door, inner-door, and back-door of the 407 region are labeled in the figure.

2) Robot Model: The physical platform utilized in our experiments is an iRobot Create mobile robot (see Fig. 13). The robot can be modeled as a unicycle whose kinematics are as follows:

$$x_{k+1}= f(x_k,u_k,w_k)=\begin{bmatrix} \mathrm{x}_k +(V_k\,\delta t+n_v\sqrt{\delta t})\cos\theta_k \\ \mathrm{y}_k +(V_k\,\delta t+n_v\sqrt{\delta t})\sin\theta_k \\ \theta_k +\omega_k\,\delta t+n_\omega\sqrt{\delta t} \end{bmatrix}, \qquad (12)$$

where $x_k = (\mathrm{x}_k,\mathrm{y}_k,\theta_k)^T$ describes the robot state, in which $(\mathrm{x}_k,\mathrm{y}_k)^T$ is the 2D position of the robot and $\theta_k$ is the heading angle of the robot at time step $k$. Control commands are the linear and angular velocities $u_k = (V_k,\omega_k)^T$. We use the Player robot interface [55] to send these control commands to the robot.

Motion noise: The motion noise vector is denoted by $w_k = (n_v,n_\omega)^T \sim \mathcal{N}(0,Q_k)$, which mostly arises from uneven tiles on the floor, wheel slippage, and inaccuracy in the duration over which control signals are applied. Experimentally, we found that in addition to the fixed uncertainty associated with the control commands there exists a portion of the noise that is proportional to the signal strength. Thus, we model the variance of the process noise at the $k$-th time step as $Q_k = \mathrm{diag}((\eta V_k +\sigma_b^V)^2,(\eta\omega_k + \sigma_b^\omega)^2)$, where in our implementations we have $\eta = 0.03$, $\sigma_b^V = 0.01\,\mathrm{m/s}$, and $\sigma_b^\omega = 0.001\,\mathrm{rad}$.
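As an illustration of this motion model and its state-dependent noise, the following sketch propagates the unicycle state one step according to Eq. (12). This is a minimal sketch, not the authors' implementation; the parameter names (eta, sigma_b_v, sigma_b_w) are our labels for the constants quoted above, and taking the absolute value of the commanded velocities inside the noise standard deviation is an assumption of this sketch.

```python
import numpy as np

def propagate_unicycle(x, u, dt, eta=0.03, sigma_b_v=0.01, sigma_b_w=0.001, rng=None):
    """One step of the noisy unicycle model of Eq. (12).

    x = (x, y, theta), u = (V, omega). The process-noise standard deviation
    grows linearly with the commanded speed, mirroring
    Q_k = diag((eta*V_k + sigma_b_v)^2, (eta*omega_k + sigma_b_w)^2).
    """
    rng = np.random.default_rng() if rng is None else rng
    px, py, theta = x
    v_cmd, w_cmd = u
    n_v = rng.normal(0.0, eta * abs(v_cmd) + sigma_b_v)   # translational noise
    n_w = rng.normal(0.0, eta * abs(w_cmd) + sigma_b_w)   # rotational noise
    px += (v_cmd * dt + n_v * np.sqrt(dt)) * np.cos(theta)
    py += (v_cmd * dt + n_v * np.sqrt(dt)) * np.sin(theta)
    theta += w_cmd * dt + n_w * np.sqrt(dt)
    return np.array([px, py, theta])

# Example: drive forward at 0.2 m/s for one 0.1 s step from the origin.
# next_state = propagate_unicycle(np.array([0.0, 0.0, 0.0]), (0.2, 0.0), dt=0.1)
```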

3) Sensing Model: For sensing purposes, we use the on-board camera on the laptop. We perform vision-based landmark detection based on ArUco (a minimal library for Augmented Reality applications) [56]. Each landmark is a black and white pattern printed on letter-size paper. The pattern on each landmark follows a slight modification of the Hamming code and has a unique id, so that it can be detected robustly and uniquely. Landmarks are placed on the walls in the environment (see Fig. 13). The absolute position and orientation of each landmark in the environment is known. The ArUco library performs the detection process and provides the relative

Fig. 13. A picture of the robot (an iRobot Create) in the operating environment. Landmarks can be seen on the walls.

range and bearing to each visible landmark along with its id. Therefore, if we denote the $j$-th landmark position in the 2D global coordinates by $L^j$, we can model the observation as a range-bearing sensing system:

$${}^j z_k = [\,\|{}^j d_k\|,\ \mathrm{atan2}({}^j d_k^2, {}^j d_k^1) - \theta_k\,]^T + {}^j v, \qquad {}^j v \sim \mathcal{N}(0,{}^j R),$$

where ${}^j d_k = [{}^j d_k^1, {}^j d_k^2]^T := [\mathrm{x}_k,\mathrm{y}_k]^T - L^j$.

Measurement noise: A random vector ${}^j v$ models the measurement noise associated with the measurement of the $j$-th landmark. Experimentally, we found that the intensity of the measurement noise increases with the distance to the landmark and with the incident angle. The incident angle refers to the angle between the line connecting the camera to the landmark and the wall on which the landmark is mounted. Denoting the incident angle by $\phi \in [-\pi/2,\pi/2]$, we model the sensing noise associated with the $j$-th landmark as a zero-mean Gaussian whose covariance is

$${}^j R_k = \mathrm{diag}\big((\eta_{r_d} \|{}^j d_k\| + \eta_{r_\phi} |\phi_k| + \sigma_b^r)^2,\ (\eta_{\theta_d} \|{}^j d_k\| + \eta_{\theta_\phi} |\phi_k| + \sigma_b^\theta)^2\big) \qquad (13)$$

where, in our implementations, we have $\eta_{r_d} = 0.1$, $\eta_{r_\phi} = 0.01$, $\sigma_b^r = 0.05\,\mathrm{m}$, $\eta_{\theta_d} = 0.001$, $\eta_{\theta_\phi} = 0.01$, and $\sigma_b^\theta = 2.0\,\mathrm{deg}$.

Full vector of measurements: At every step, the robot observes a subset of the landmarks that fall into its field of view. Suppose at a particular step the robot can see $r$ landmarks $\{L^{i_1},\cdots,L^{i_r}\}$. The concatenation of the visible landmark measurements is the total measurement vector, denoted by $z = [{}^{i_1}z^T,\cdots,{}^{i_r}z^T]^T$. Due to the independence of measurements of different landmarks, the observation model for all landmarks can be written as $z = h(x) + v$, where $v = [{}^{i_1}v^T,\cdots,{}^{i_r}v^T]^T \sim \mathcal{N}(0,R)$ and $R = \mathrm{diag}({}^{i_1}R,\cdots,{}^{i_r}R)$.
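The sketch below illustrates the range-bearing observation model and the distance- and incident-angle-dependent noise of Eq. (13) for a single landmark. It is an illustration, not the authors' code: the incident angle is passed in as a known quantity (computing it requires the wall orientation, an implementation detail not spelled out here), and converting the 2-degree bias term to radians while treating the remaining coefficients as producing radians is an assumption of this sketch.

```python
import numpy as np

def observe_landmark(x, landmark, phi, rng=None):
    """Noisy range-bearing observation of one landmark (cf. Eq. 13).

    x = (x, y, theta) is the robot state, `landmark` the landmark's known 2D
    position, and `phi` the incident angle between the camera-landmark ray and
    the wall the landmark is mounted on (assumed to be given).
    """
    rng = np.random.default_rng() if rng is None else rng
    d = np.asarray(landmark, dtype=float) - np.asarray(x[:2], dtype=float)
    r_true = np.linalg.norm(d)
    bearing_true = np.arctan2(d[1], d[0]) - x[2]

    # Noise grows with distance and incident angle (coefficients from the text).
    sigma_r = 0.1 * r_true + 0.01 * abs(phi) + 0.05                  # meters
    sigma_th = 0.001 * r_true + 0.01 * abs(phi) + np.deg2rad(2.0)    # radians (assumed units)

    return np.array([r_true + rng.normal(0.0, sigma_r),
                     bearing_true + rng.normal(0.0, sigma_th)])

# Example: robot at the origin facing +x, landmark 3 m ahead, viewed head-on.
# z = observe_landmark(np.array([0.0, 0.0, 0.0]), (3.0, 0.0), phi=0.0)
```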

    B. SLAP versus Regular Localization and Planning

In this section, we contrast the results of regular localization and planning with the proposed SLAP solution. Regular localization and planning here refers to a method where the planner (ignoring the localizer) first generates a path for the robot to follow; then, in the execution phase, we run the localization algorithm to compute the mean and follow the pre-computed trajectory using a closed-loop controller. In the proposed SLAP solution, by contrast, the planner takes the localizer into account in the planning phase and replans simultaneously as the localizer updates its estimate.

The environment is shown in Fig. 14. Blue regions are obstacles and black regions are free space. Landmarks are shown by small white diamonds. The start and goal locations for the motion planning problem are marked in Fig. 14. The goal


location is inside room 407 (see Fig. 12) and the start is close to the front door.

Regular planning and localization: To select a suitable

planner, we tried a variety of traditional planners such as PRM, RRT, and their variants. We observed that, due to the motion noise of this low-cost robot and the sparsity of information in certain parts of this environment, many of these variants lead to collisions with obstacles and cannot reach the goal point. The best results we obtained were with the MAPRM (Medial-Axis PRM) method. This planner is computationally more expensive than the other variants but is very powerful in dealing with collisions, since it samples points and constructs a PRM with maximal clearance from obstacles, leading to plans with minimal collisions in our experiments. An MAPRM graph (in white) for this environment is shown in Fig. 14.

As can be seen, there exist two homotopy classes of paths between the start and goal nodes: through the front door of room 407 and through the back door of the room. From Fig. 14, it is obvious that the path through the front door is shorter. Moreover, the path through the front door has a larger obstacle clearance (larger minimum distance from obstacles along the path) compared to the path through the back door (since the back door is half open). Therefore, based on conventional metrics in deterministic settings, such as shortest path or maximum clearance, MAPRM chooses the path through the front door over the path through the back door. The feedback tree that results from solving the DP in this case is shown in Fig. 15. As expected, the DP guides the robot through the front door.

To execute the plan generated from the PRM, we use time-varying LQG controllers to keep the robot close to the nominal path returned as the solution of the PRM-based planning. However, due to the lack of sufficient information along the nominal path, the success rate of this plan is low, and the robot frequently collides with obstacles along the path because it is prone to drift. The success probability along the nominal path is computed by a Monte Carlo simulation (100 runs) and is equal to 27% (27 runs out of 100 were collision-free).

FIRM-based SLAP: As can be seen in Fig. 14, the

distribution of information is not uniform in the environment. The density of landmarks (information sources) along the path through the back door is higher than along the path through the front door. In FIRM-based SLAP, though, the landmark map is automatically incorporated into the planning phase in a principled way, which leads to a better judgment of how narrow the passages really are. For example, in this experiment, although the path through the front door is shorter than the path through the back door, considering the information sources, the success probability of going through the back door is much higher than that of going through the front door. Such knowledge about the environment is reflected in the FIRM cost-to-go and success probability in a principled framework. As a result, it generates a policy that suits the application, taking into account the uncertainty and the available information in the environment. Solving the DP on the FIRM graph gives the feedback shown in Fig. 16, which results in an 88% success probability.
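The success probabilities quoted in this subsection (27% for the PRM/LQG execution and 88% for the FIRM feedback) are Monte Carlo estimates over closed-loop runs. A generic sketch of such an estimate is shown below; `execute_policy_once` is a hypothetical stand-in for one simulated run (noisy motion, noisy sensing, and the controller under test) that reports whether the robot reached the goal without collision.

```python
def estimate_success_probability(execute_policy_once, num_runs=100):
    """Monte Carlo estimate: the fraction of closed-loop runs that reach the
    goal collision-free. `execute_policy_once` stands in for one simulated run
    (noisy motion, noisy sensing, controller) and returns True on success."""
    successes = sum(1 for _ in range(num_runs) if execute_policy_once())
    return successes / num_runs

# Example with a dummy simulator that succeeds roughly 27% of the time.
if __name__ == "__main__":
    import random
    print(estimate_success_probability(lambda: random.random() < 0.27))
```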

    C. Online Replanning aspect of SLAP

In this section, we focus on the simultaneous part of SLAP, which emphasizes the ability of the robot to replan after every localization update. In other words, in SLAP, the robot

Fig. 14. The environment including obstacles (blue regions), free space (black region), and landmarks (white diamonds). An MAPRM graph approximating the connectivity of the free space is also shown (white graph). The starting point, goal point, half-open back door, shortest path, and safest path are annotated in the figure.

Fig. 15. The feedback tree generated by solving the DP on MAPRM is shown in yellow. From each node there is only one outgoing edge, guiding the robot toward the goal defined in Fig. 14. Arrows in pink coarsely represent the direction in which the feedback guides the robot.

Fig. 16. The feedback tree generated by solving the DP on FIRM is shown. As can be seen, the computed feedback guides the robot through the more informative regions, which leads to more accurate localization and lower collision probabilities. Arrows in pink coarsely represent the direction in which the feedback guides the robot.

dynamically replans based on the new information coming from its sensors.

We look into two important cases to illustrate the effect of online replanning. We first look into a challenging case where the obstacle map changes and possibly eliminates a homotopy class of solutions. This means a slight deviation in the plan is not sufficient, and the planner has to switch to a different homotopy class in real-time, which is not feasible with the state-of-the-art methods in the belief space planning literature. Second, we look into deviations from the path. There, we focus on the kidnapped robot problem as the most severe case of deviation, the solution to which can be applied to smaller deviations as well. Finally, we design a complex scenario that includes changes in the obstacles, small and large deviations, online changes in the goal location, etc., and demonstrate the performance of the proposed method on it.

1) Changes in the obstacle map: Here, we show how enabling simultaneous planning and


localization in an online manner can handle changes in the obstacle map. We consider three types of obstacles: static, l-type, and s-type. Static refers to obstacles that are not moving, such as walls. l-type refers to the class of moving objects that lead to local changes in the plan, for example, moving people in the environment. We assume there exists a lower-level reactive behavior (e.g., stopping or dodging) in a subsumption-like architecture [57] that suppresses the belief space planner in the vicinity of l-type obstacles. Once control is back at the SLAP layer, the robot might have deviated from its nominal plan, and thus the SLAP layer has to replan to recover from such deviations.

s-type obstacles are the most severe obstacles to deal with in the SLAP setting. s-type obstacles include those whose changes can globally update the map by eliminating or creating homotopy classes of solutions. Doors are an important example of this class. If the robot observes that a door is closed (which was expected to be open), it might have to globally change the plan to reach the goal through a different passage. This is a very challenging problem for today's belief space planners.

As the first experiment, we consider the environment shown in Fig. 12. The start and goal locations are shown in Fig. 17(a). We construct a PRM in the environment ignoring the changing obstacles (assuming all doors are open and there are no people in the environment). Then we construct a corresponding FIRM and solve the dynamic programming on it. As a result, we get the feedback tree shown in Fig. 17(a) that guides the robot toward the goal through the back door of room 407. However, the challenge is that the door may be closed when the robot reaches it, and there may be people moving in the environment. Moreover, for different reasons (such as motion blur in the image or landmarks blocked by people or other objects), the robot might temporarily mis-detect landmarks during the run.3 To handle such a change in the obstacle map and replan accordingly, we use the lazy feedback evaluation method outlined in Algorithm 4.

Results on physical robots: Figure 17(b) shows a snapshot

of our run when the robot detects the change signal, i.e., detects that the door is in a different state than its recorded state in the map. As a result, the robot updates the obstacle map, as can be seen in Fig. 17(b) (the door is closed). Accordingly, the robot replans; Figure 17(b) shows the feedback tree resulting from the replanning. The new feedback guides the robot through the front door since it detects that the back door is closed. The full video of this run, which provides much more detail, is available in [58].

Comparison with the state-of-the-art: It is important to

note that it is the particular structure of the proposed SLAP framework that makes such replanning feasible online. The graph structure of the underlying FIRM allows us to locally change the collision probabilities in the environment without affecting the collision probabilities of the rest of the graph (i.e., the properties of different edges on the graph are independent of each other). Such a property is not present in any other state-of-the-art belief space planner, such as the BRM (Belief Roadmap Method) [12] or LQG-MP [13]. In those methods, collision probabilities and costs on all edges (the number of possible edges is exponential in the size of the underlying PRM) need to be re-computed.
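A minimal sketch of this property is shown below: after a map change, only the edges that intersect the changed region are re-evaluated before the graph DP is re-solved, while all other edge costs are reused unchanged. The helper names (`edge_intersects`, `evaluate_edge`, `solve_dp`) are placeholders for illustration, not the authors' API.

```python
def lazy_map_update(edges, edge_cost, changed_region, edge_intersects,
                    evaluate_edge, solve_dp):
    """Re-evaluate only the edges overlapping the changed region, then re-solve
    the graph DP; untouched edges keep their stored costs because FIRM edge
    properties are independent of one another."""
    affected = [e for e in edges if edge_intersects(e, changed_region)]
    for e in affected:
        edge_cost[e] = evaluate_edge(e)      # local recomputation only
    return solve_dp(edges, edge_cost), affected

# Toy usage with scalar "regions": edges 2 and 3 overlap the map change.
if __name__ == "__main__":
    edges = [1, 2, 3, 4]
    costs = {1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0}
    plan, touched = lazy_map_update(
        edges, costs, changed_region={2, 3},
        edge_intersects=lambda e, region: e in region,
        evaluate_edge=lambda e: 5.0,          # pretend the edge got more expensive
        solve_dp=lambda es, cs: min(cs, key=cs.get))
    print(plan, touched)                      # -> 1 [2, 3]
```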

2) Deviations in the robot's location: In this subsection, we demonstrate how online replanning

3Designing perception mechanisms for obstacle detection is not a concern of this research; thus we circumvent the need for this module by sticking small markers with specific IDs on moving objects (doors or people's shoes).

Fig. 17. (a) The back door is open at this snapshot. The feedback guides the robot toward the goal through the back door. (b) The back door is closed at this snapshot. The robot detects that the door is closed and updates the obstacle map (adds the door). Accordingly, the robot replans and computes the new feedback, which guides the robot through the front door. Annotations in the figure mark the robot's location, the goal point, the robot's and external camera views, and the obstacle added to the doorway.

enables SLAP in the presence of large deviations in the robot's position. As the most severe form of this problem, we consider the kidnapped robot problem. In the following, we discuss this problem and the challenges it introduces.

Kidnapped robot problem: An autonomous robot is said to

be in the kidnapped situation if it is carried to an unknown location while it is in operation. The problem of recovering from this situation is referred to as the kidnapped robot problem [59]. This problem is commonly used to test a robot's ability to recover from catastrophic localization failures. It introduces different challenges, such as (i) how to detect kidnapping, (ii) how to re-localize the robot, and (iii) how to control the robot to accomplish its goal. Our main focus here is on the third part, i.e., how to replan in belief space from the new point in the belief space after recovering from being kidnapped. This is particularly challenging because large deviations in the robot's pose can globally change the plan and the optimal homotopy class of solutions. Therefore, the planner should be able to change the global plan online.

Detecting the kidnapped situation: To detect the

kidnapped situation, we constantly monitor the innovation signal $\widetilde{z}_k = z_k - z_k^-$ (the difference between the actual and predicted observations). To define the specific measure of innovation we use in our implementation, recall that the observation at time step $k$ from the $j$-th landmark is the relative range and bearing of the robot to the $j$-th landmark, i.e., ${}^j z_k = ({}^j r_k, {}^j\theta_k)$. The predicted version of this measurement is denoted by ${}^j z_k^- = ({}^j r_k^-, {}^j\theta_k^-)$. We monitor the following measures of the innovation signal:

$$\widetilde{r}_k = \max_j(|{}^j r_k - {}^j r_k^-|), \qquad \widetilde{\theta}_k = \max_j\big(d_\theta({}^j\theta_k, {}^j\theta_k^-)\big) \qquad (14)$$


where $d_\theta(\cdot,\cdot)$ returns the absolute value of the smallest angle that maps one angle onto the other. Passing these signals through a low-pass filter, we filter out the outliers (temporary failures in the sensory reading). Denoting the filtered signals by $\bar{r}_k$ and $\bar{\theta}_k$, we monitor the conditions $\bar{r}_k < r_{\max}$ and $\bar{\theta}_k < \theta_{\max}$. If both are satisfied, we follow the current SLAP feedback policy (i.e., we are in the Feedback Following Mode (FFM)). However, violation of either of these conditions means that the robot is constantly observing high innovations, and thus it is not at the location it believes itself to be (i.e., it is kidnapped).

Information Gathering Mode (IGM): Once it is detected

that the robot is kidnapped, we first replace the estimation covariance with a large covariance (to obtain an approximately uniform distribution over the state space). Then, we enter the Information Gathering Mode (IGM), where we take very small and conservative steps (e.g., turning in place or taking random actions with small velocities) to obtain some known measurements. Once the robot gathers a few measurements, the localization module corrects the estimate and the innovation signal drops. When the conditions $\bar{r}_k < r_{\max}$ and $\bar{\theta}_k < \theta_{\max}$ are satisfied again, we quit the information gathering mode.
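A compact sketch of this monitoring logic is given below: the per-step innovation measures of Eq. (14) are low-pass filtered and compared against the thresholds, and violating either threshold switches the system from the Feedback Following Mode (FFM) to the Information Gathering Mode (IGM). The filter constant `alpha` is an illustrative choice, not a value reported in the paper; the default thresholds mirror the 1 m and 50 deg values reported later in this section.

```python
import numpy as np

class KidnapMonitor:
    """Low-pass filtered innovation monitoring for kidnap detection (cf. Eq. 14)."""

    def __init__(self, r_max=1.0, th_max=np.deg2rad(50.0), alpha=0.9):
        self.r_max, self.th_max, self.alpha = r_max, th_max, alpha
        self.r_filt, self.th_filt = 0.0, 0.0
        self.mode = "FFM"   # Feedback Following Mode by default

    @staticmethod
    def _angle_diff(a, b):
        """Absolute value of the smallest angle mapping b onto a."""
        return abs(np.arctan2(np.sin(a - b), np.cos(a - b)))

    def update(self, ranges, ranges_pred, bearings, bearings_pred):
        """Feed one step of actual/predicted measurements; returns 'FFM' or 'IGM'."""
        if not ranges:                       # no visible landmarks this step
            return self.mode
        r_innov = max(abs(r - rp) for r, rp in zip(ranges, ranges_pred))
        th_innov = max(self._angle_diff(b, bp)
                       for b, bp in zip(bearings, bearings_pred))
        # First-order low-pass filter rejects outliers and temporary sensor failures.
        self.r_filt = self.alpha * self.r_filt + (1 - self.alpha) * r_innov
        self.th_filt = self.alpha * self.th_filt + (1 - self.alpha) * th_innov
        ok = self.r_filt < self.r_max and self.th_filt < self.th_max
        self.mode = "FFM" if ok else "IGM"
        return self.mode
```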

Post-IGM replanning: After recovering from being kidnapped, controlling the robot in belief space is a significant challenge, as the system can be far from where it was supposed to be. As a result, the optimal homotopy class might have changed and thus the plan should be updated globally. Using the proposed SLAP framework, the robot just needs to go to a neighboring FIRM node from this deviated point. Since the underlying FIRM graph is spread in the belief space, the only required computation is to evaluate the cost of the edges that connect the new start point to the neighboring FIRM nodes.

Results on physical robots: Figure 18 shows a snapshot of a

run that contains two kidnappings and illustrates the robustness of the planning algorithm to the kidnapping situation. The start and goal positions are shown in Fig. 18. The feedback tree (shown in yellow) guides the robot toward the goal through the front door. However, before reaching the goal point, the robot is kidnapped in the hallway (see Fig. 18) and placed at an unknown location within room 407 (see Fig. 18). In our implementations, we consider $r_{\max} = 1$ (meter) and $\theta_{\max} = 50$ (degrees). The first jump in Fig. 19 shows this deviation. Once the robot recovers from being kidnapped (i.e., when both innovation signals in Fig. 19 fall below their corresponding thresholds), replanning from the new point is performed. This time, the feedback guides the robot toward the goal point from within room 407. However, again before the robot reaches the goal point, it is kidnapped and placed at an unknown location (see Fig. 18). The second jump in the innovation signals in Fig. 19 corresponds to this kidnapping.

D. Longer and more complex experiments: Robustness to changing goals, obstacles, and large deviations

In this section, we emphasize the ability of the system to perform long-term SLAP consisting of visiting several goals. The replanning ability allows us to change the plan online as the goal location changes. In this experiment, we consider a scenario in which the user(s) submit a new goal for the robot to reach after it reaches its currently assigned goal. While the robot needs to change the plan each time a new goal is submitted, it frequently encounters changes in the obstacle map (open/closed

Fig. 18. This figure shows the setup for the experiment containing two kidnappings. The first and second kidnap points and the corresponding placement points are annotated in the figure.

[Fig. 19 plots the maximum range innovation (meters) and the maximum bearing innovation (degrees) against the step number.]

Fig. 19. This figure shows the innovation signals $\widetilde{r}_k$ and $\widetilde{\theta}_k$ during this run. When both of the signals are below their specified thresholds $r_{\max}$ and $\theta_{\max}$ (dashed red lines), the robot follows the FIRM feedback. Otherwise, the system enters the information gathering mode.

doors and moving people) as well as intermittent sensor failures and kidnapping situations. Thus, the ability to simultaneously replan online while localizing is necessary to cope with these changes. The video in [58] shows the robot's performance in this long and complex scenario.

Fig. 20. This figure shows the setup for the longer experiment with a sequence of goals as well as intermediate events and changes in the environment map. The annotations mark goals 1-3, the front, inner, and back doors, the narrow passage P1, standing people, a person blocking the door, a trash can, and the three kidnap points with their corresponding placement points.

In the following, we provide an itemized description of the specific steps involved in this run, based on Fig. 20. We also discuss the different changes in the environment with which the robot needs to cope along the way to accomplish its goals.


All of the following steps can be seen more clearly in the accompanying video [58].

1) The robot begins at the starting point shown in Fig. 20 and aims to reach goal 1, which is inside room 407. FIRM returns a feedback tree that guides the robot through the back door of 407.
2) The robot goes through the narrow passage introduced by the back door (it is half-open). However, before reaching the goal it gets kidnapped (at the first kidnap point shown in Fig. 20) and is placed at an unknown location (shown in Fig. 20 by the first placement point).
3) Observing new landmarks, the robot detects that it has been kidnapped. Accordingly, it adds a new node to the graph and replans online. As a result, the feedback guides the robot toward the goal point through the back door again.
4) However, in the meantime the back door has been closed. When the robot reaches the vicinity of the back door, it detects that it is closed. Therefore, it updates its map by adding an obstacle at the doorway. Note that the robot will open the door (remove the obstacle) in its map after a forgetting time of 10 minutes. Accordingly, the robot replans a feedback tree that guides it through the front door toward the goal point.
5) Along the way, people are moving in the hallway and inside the 407 area, so the robot replans accordingly as it encounters them. Moving people are reactively avoided, and standing people and static obstacles such as a trash can (see Fig. 20) are temporarily added to the map as obstacles. Replanning several times to cope with such changes, the robot goes through the front and inner doors and reaches the goal point inside the 407 area.
6) After reaching the goal point, another goal (the second goal in Fig. 20) is assigned to the robot.
7) Replanning for this goal leads to a feedback tree that guides the robot through the inner door and front door toward goal 2.
8) However, as the robot reaches the vicinity of the inner door, it detects that the door has been closed. Therefore, it updates its map and replans accordingly. The replanning leads to a feedback tree that guides the robot toward goal 2 through the back door. Again, along the way, the robot encounters moving people in office 407 and in the hallway.
9) However, before reaching the goal point, the robot gets kidnapped at the second kidnap point shown in Fig. 20 and is placed at a very far-off point (the second placement point). Once the robot detects it is kidnapped, it replans and moves more slowly to gather information. Detecting landmarks, it reduces its uncertainty and continues toward the goal point.
10) After reaching the goal point, the next goal (i.e., the third goal) is assigned to the robot (see Fig. 20). Replanning for this goal leads to a feedback that guides the robot through the front door.
11) However, when the robot reaches the front door, it encounters a person standing in the doorway. Accordingly, it replans and decides to go through the back door.
12) On the way to the back door, it is again displaced, at the third kidnap point, and placed at the third placement point.
13) This time, due to the forgetting time, the replanning leads to a path through the front door (the person is not there anymore).
14) Again, the robot follows the feedback and achieves its goal.

This long and complicated scenario demonstrates how simultaneous planning and localization can lead to robust behavior in the presence of intermittent model discrepancies, changes in the environment, and large deviations in the robot's location. It is worth noting that online replanning in belief space is a challenge for the state-of-the-art methods in belief space planning, as they mainly rely on structures that depend on the system's initial belief. Hence, when

the system's localization pdf encounters a significant deviation, replanning from the new localization pdf requires the structure to be re-built, which is not a feasible operation online. However, constructing a query-independent graph (the underlying FIRM) allows us to embed