
ONLINE TRAJECTORY PLANNING ALGORITHMS FOR ROBOTIC SYSTEMS UNDER UNCERTAINTY IN INTERACTIVE ENVIRONMENTS

A DISSERTATION SUBMITTED TO THE DEPARTMENT OF AERONAUTICS AND ASTRONAUTICS AND THE COMMITTEE ON GRADUATE STUDIES OF STANFORD UNIVERSITY IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

Haruki Nishimura

August 2021


© 2021 by Haruki Nishimura. All Rights Reserved.

Re-distributed by Stanford University under license with the author. This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License: http://creativecommons.org/licenses/by-nc/3.0/us/

This dissertation is online at: https://purl.stanford.edu/jp145pv0596

Includes supplemental files:

1. IEEE Copyright Statement (ieee_copyright_statement.txt)


I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Mac Schwager, Primary Adviser

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Grace Gao

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Mykel Kochenderfer

Approved for the Stanford University Committee on Graduate Studies.

Stacey F. Bent, Vice Provost for Graduate Education

This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.


Abstract

The mission of this thesis is to develop algorithms for planning and control of intelligent mobile robots that operate autonomously in open, interactive environments. The presence of other agents and objects in such environments makes planning significantly more challenging, as they inevitably introduce environmental and dynamic uncertainty that the robot must properly handle. Despite recent advances in perception, planning, and control, many robotic systems to date lack the capability to consider and address uncertainty, which demands that the robots be caged or confined to a dedicated, structured workspace. For example, the success of the thousands of mobile robots now deployed in logistics centers relies heavily on their closed and controlled operating environments. In this thesis, we propose a series of computationally efficient algorithms that collectively overcome uncertainty from various sources, toward reliable autonomy for “cage-free” robotic operations. The methods presented in the thesis leverage probability theory to quantify the amount of present and future uncertainty. Based on this quantification, we develop planning and control algorithms that either mitigate, avoid the risk of, or are robust against uncertainty, so that the robot can successfully accomplish a given task. We take a model-based approach in developing these algorithms, which allows us to exploit physical properties of dynamical systems and onboard sensors when possible. Another crucial aspect of the proposed methods is their online nature, meaning that control signals are computed in situ based on the currently available information. This is enabled by the fast, efficient computation of our algorithms, and is advantageous in that the robot can quickly react to rapidly changing environments.

In the first part of the thesis, we address challenges associated with state uncertainty, which represents unknowns about the current state of the system of interest. This can include the unknown intent of other interacting agents, or the positions of targets to locate. We propose and employ recursive Bayesian inference frameworks to keep track of evolving state uncertainty over time. The proposed planning algorithms further assist the inference frameworks by actively mitigating state uncertainty as appropriate, so that the robot can execute suitable control actions with certainty. We leverage tools from sequential decision-making and optimal control to develop these algorithms. We demonstrate the effectiveness of our approach in a multitude of tasks that involve state uncertainty, with different combinations of dynamical systems and sensing modalities. These include vision-based active intent inference, active target tracking with range-only observations, and simultaneous object manipulation and parameter estimation.

We then turn our attention to transition uncertainty, which governs the unpredictability of future states of the system. We especially focus on safety-critical problems where transition uncertainty must not be ignored. For instance, a robot navigating in close proximity to humans has to plan carefully so that collisions are avoided with high confidence. We take a risk-aware planning approach, in which we optimize a risk metric that takes into account the variance of the uncertainty. While being computationally efficient, our proposed method does not require knowledge of the analytical form of the underlying probability distribution that quantifies transition uncertainty, nor is it limited to a certain class of distributions such as Gaussians. This atypical feature enables us to leverage modern data-driven generative models for uncertainty quantification. We demonstrate the applicability of our approach to the aforementioned robot navigation task, where we show that the proposed framework can safely navigate the robot toward its goal while interacting with more than 50 humans simultaneously in real time. Moreover, our risk-aware formulation is demonstrated to promote safety in both simulation and a real-world experiment, by inducing a proactive robot behavior that avoids risky situations where high variance of uncertainty could lead to imminent collision.

The last part of this thesis considers model uncertainty, which is attributed to imperfect modeling of the underlying stochastic phenomena. Our approach makes the planner distributionally robust: the planner selects a control policy that acts against a worst-case distribution within an offline-computed set of plausible distributions that could quantify transition uncertainty. We develop a tractable algorithm leveraging the mathematical equivalence between risk-aware planning and distributionally robust planning. We show in simulation that the proposed planning framework can safely avoid collision despite imperfect knowledge of the stochastic human motion model. Furthermore, our approach lets the risk-aware planner dynamically adjust its level of risk-sensitivity online, which further improves the flexibility of conventional risk-aware planning methods.

The algorithms developed in this thesis will ultimately allow intelligent mobile robots to operate in considerably more uncertain and dynamic workspaces than the current industrial standard. This will open up possibilities for various practical applications, including autonomous field robots for persistent environmental monitoring, fully automated driving on urban roads, and autonomous drone flights in densely populated areas for logistics services. We believe that such “cage-free” robotic operations will be enabled by proper consideration and treatment of uncertainty, and that our methods will pave the way towards more reliable robotic autonomy in open and interactive environments.


Acknowledgments

This thesis would not have been possible without the tremendous support of my family, friends, colleagues, and mentors.

First and foremost, I thank my research advisor, Professor Mac Schwager, for his generous and invaluable guidance over the course of my doctoral and master’s study. I would not have decided to pursue a doctoral degree without his advice to always trust and follow one’s own scientific curiosity in conducting research. Ever since I joined the Multi-Robot Systems Lab (MSL), his precise and insightful feedback has helped me overcome many obstacles in solving challenging robotics problems. Not only is he a great academic supervisor, but he is also always caring and respectful of fellow researchers, students, and their families. I also thank the rest of my thesis reading committee, Professor Grace Gao and Professor Mykel Kochenderfer, for their willingness to serve on the committee and their continued guidance in organizing and writing this thesis. I have had the fortunate opportunity to learn from them inside and outside classrooms as well as at conferences. I additionally thank Professor Dorsa Sadigh and Professor Monroe Kennedy for being on my thesis defense committee and taking their valuable time to evaluate my research.

I have had the pleasure of working with highly skilled and talented researchers through a multitude of collaborative research projects. I thank Professor Marco Pavone and Boris Ivanovic in the Autonomous Systems Lab for all the constructive feedback and discussion on utilizing data-driven trajectory prediction methods in planning. I especially thank Boris Ivanovic for developing and improving Trajectron++ to seamlessly integrate it with the downstream planner. The project would not have been completed without his unparalleled expertise in deep generative modeling, coding, and mathematical analysis. I also thank Professor Negar Mehr at the University of Illinois at Urbana-Champaign and Doctor Adrien Gaidon at Toyota Research Institute for their tremendous guidance and support throughout multiple projects. Their unique and fascinating perspectives on control theory and machine learning have greatly helped to enrich the scientific quality and impact of the research.

None of the research would have been possible without the generous financial support that I received over the years. I am grateful to Toyota Research Institute, ONR grants N00014-16-1-2787 and N00014-18-1-2830, NSF grants CMMI-1562335 and NRI-1830402, the Japan Student Services Organization Scholarship, and finally the Masason Foundation Fellowship.

I extend my sincere gratitude to all the current and former MSL members, who are not only extremely skillful and diligent but also always open-minded and friendly. I thank Doctor Zijian Wang and Doctor Eric Cristofalo for being great mentors. I thank Doctor David Fridovich-Keil for giving me professional advice on research as well as career choices. I thank Doctor Alex Koufas, Adam Caccavale, Kunal Shah, Ravi Haksar, and Jun En Low for working together to make quadrotors more reliable, and for hosting enjoyable lab happy hours. I thank Mingyu Wang and Simon Le Cleac’h for fruitful discussions on the Toyota project. I thank Preston Culbertson, Keiko Nagami, and Joe Vincent for sharing their expertise on reinforcement learning and deep learning theory. I thank Trevor Halsted for answering the many silly questions I asked from the next desk. I thank the rest of the MSL for all the interesting discussions and conversations over the years.

I am extremely fortunate to have such a caring and wonderful family. I thank my father, Michinari, for encouraging me to apply to graduate schools outside Japan and always supporting my life decisions. I thank my mother, Kazuyo, for being all ears whenever I needed someone to talk to about study-abroad problems. I thank my brother, Yuki, for his constant support and for always taking care of the cats while I was away from home. I also thank Carin Pacifico, Yvonne Schmidt, Brian Schmidt, Cindy Fulchiron, Peter Fulchiron, Nancy Sherlock, Leo Redmond, Gayle Keck, and Paul Herman for helping me settle down in the Bay Area and get accustomed to life in the United States, especially in the first two years.


Contents

Abstract iv

Acknowledgments vi

1 Introduction 1

1.1 Definition and Taxonomy of Uncertainty . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.2 Approach and Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.3 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

1.4 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

1.5 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2 Active Vision-Based Intent Inference 9

2.1 Introduction and Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.2.1 Trajectory generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.2.2 State transition model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2.2.3 Observation model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

2.2.4 Repeated trajectory execution . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2.3 Bayesian Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2.3.1 Recursive Bayesian Inference Formula . . . . . . . . . . . . . . . . . . . . . . 16

2.3.2 Multi-hypothesis Extended Kalman Filter . . . . . . . . . . . . . . . . . . . . 17

2.4 Belief Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

2.4.1 Levenberg-Marquardt algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 18

2.4.2 Prior estimate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

2.5 Entropy-based Active Vision Control . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

2.5.1 Belief MDP for motion-based communication . . . . . . . . . . . . . . . . . . 20

2.5.2 Upper Confidence Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

2.6 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

2.6.1 Simulation setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23


2.6.2 Simulation results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

2.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

3 Fast Approximate Belief Space Planning 28

3.1 Introduction and Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

3.1.1 Related Work in Belief Space Planning . . . . . . . . . . . . . . . . . . . . . . 29

3.1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

3.2 SACBP Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

3.2.1 Problems with Mixed Observability . . . . . . . . . . . . . . . . . . . . . . . 33

3.2.2 General Belief Space Planning Problems . . . . . . . . . . . . . . . . . . . . . 40

3.2.3 Closed-loop Nominal Policy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

3.2.4 Computation Time Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . 42

3.3 Analysis of Mode Insertion Gradient for Stochastic Hybrid Systems . . . . . . . . . . 43

3.3.1 Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

3.3.2 Main Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

3.4 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

3.4.1 Active Multi-Target Tracking with Range-only Observations . . . . . . . . . . 49

3.4.2 Object Manipulation under Model Uncertainty . . . . . . . . . . . . . . . . . 53

3.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

4 Fast Risk-Sensitive Planning with Probabilistic Generative Models 56

4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

4.2.1 Dynamic Collision Avoidance and Safe Robot Navigation . . . . . . . . . . . 59

4.2.2 Stochastic Sequential Action Control . . . . . . . . . . . . . . . . . . . . . . . 60

4.2.3 Multi-Agent Trajectory Modeling from Data . . . . . . . . . . . . . . . . . . 60

4.3 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

4.3.1 Dynamics Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

4.3.2 Optimal Control Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

4.3.3 Entropic Risk Measure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

4.4 Risk-Sensitive Sequential Action Control . . . . . . . . . . . . . . . . . . . . . . . . . 64

4.4.1 Review of Stochastic Sequential Action Control . . . . . . . . . . . . . . . . . 64

4.4.2 Generalized Mode Insertion Gradient . . . . . . . . . . . . . . . . . . . . . . . 65

4.4.3 Extension to Entropic Risk Measure . . . . . . . . . . . . . . . . . . . . . . . 65

4.4.4 RSSAC Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

4.4.5 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

4.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

4.5.1 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69


4.5.2 Effects of Risk Sensitivity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

4.5.3 Real-World Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

4.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

5 Distributionally Robust Planning under Model Uncertainty 76

5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

5.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

5.2.1 Distributional Robustness and Risk-Sensitivity . . . . . . . . . . . . . . . . . 79

5.2.2 Approximate Methods for Optimal Feedback Control . . . . . . . . . . . . . . 80

5.3 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

5.3.1 Distributionally Robust Optimal Control . . . . . . . . . . . . . . . . . . . . 80

5.3.2 Equivalent Risk-Sensitive Optimal Control . . . . . . . . . . . . . . . . . . . 82

5.4 RAT iLQR Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

5.4.1 Iterative Linear-Exponential-Quadratic-Gaussian . . . . . . . . . . . . . . . . 83

5.4.2 Cross-Entropy Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

5.4.3 RAT iLQR as MPC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

5.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

5.5.1 Problem Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

5.5.2 Comparison with Baseline MPC Algorithms . . . . . . . . . . . . . . . . . . . 90

5.5.3 Benefits of Risk-Sensitivity Parameter Optimization . . . . . . . . . . . . . . 91

5.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

6 Conclusions and Future Research 95

6.1 Summary of Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95

6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

A Mode Insertion Gradient for Stochastic Hybrid Systems 98

A.1 Nominal Trajectory under Specific Observations . . . . . . . . . . . . . . . . . . . . . 98

A.2 Perturbed Trajectory under Specific Observations . . . . . . . . . . . . . . . . . . . . 104

A.3 Expected Total Cost under Stochastic Observations . . . . . . . . . . . . . . . . . . 120

B Mode Insertion Gradient for Crowd-Robot Interaction 125

B.1 Statement of Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125

B.2 Preliminary Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129

B.3 Main Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133

Bibliography 137


List of Tables

2.1 Parameter values for MHEKF and MCTS. . . . . . . . . . . . . . . . . . . . . . . . . 25

5.1 Statistics summarizing histogram plots presented in Figure 5.3. RAT iLQR achieved the largest average value for the minimum separation distance with the smallest standard deviation, which contributed to safe robot navigation without a single collision. Note that PETS had multiple collisions despite its access to the true Gaussian mixture distribution. . . . 93

5.2 Our comparative study between RAT iLQR with θ∗ and iLEQG with θmax (i.e. maximum feasible risk-sensitivity) reveals that RAT iLQR’s optimal choice of the risk-sensitivity parameter θ∗ results in more efficient robot navigation with smaller trajectory tracking errors, while still achieving collision avoidance under the model mismatch. With RAT iLQR, the average tracking error was reduced by 39%, 34%, and 16% for 3 true distributions with different KL divergences of 1.34, 7.78, and 32.02, respectively. . . . 94


List of Figures

1.1 Roadmap illustrating the organization of this thesis. . . . . . . . . . . . . . . . . . . 8

2.1 (Left) A sample 2D trajectory codebook with potential ambiguities between entries. (Right) A noisy camera projection of a complete trajectory. The trajectory may be either an “L” or a “V” from the image. Further inference requires the receiver to move to obtain a more accurate estimate. . . . 11

2.2 Bayesian network structure for our trajectory classification and pose estimation problem. The receiver robot observes the camera image z of the sender’s position x to estimate the relative pose r and the sender’s message m, and applies control u. . . . 13

2.3 (Left) Simulation environment with the receiver in the front and the sender in the back. (Right) An image obtained by the forward-facing camera mounted on the receiver. The smooth yellow trajectory is overlaid for clarity. . . . 23

2.4 2D trajectory codebook used in the experiment. The sender robot moves clockwise. The trajectories are intentionally ambiguous from different observer angles, to make trajectory classification difficult for the receiver. . . . 24

2.5 (Left) A set of waypoint measurements used to initialize the prior. The smooth yellow trajectory and the waypoints are overlaid for clarity. (Right) Prior distribution obtained from the left image. . . . 24

2.6 Classification error (left) and entropy (right) averaged over 30 cases. The MCTS active vision policy outperforms greedy and random in both criteria. . . . 25

2.7 Attitude error (left) and position error (right) averaged over 30 cases. The MCTS active vision policy performs the best for pose estimation, even though it is not explicitly trying to optimize the relative pose estimate. . . . 26

2.8 Attitude error (top) and position error (bottom) for a single long episode with class 1 trajectories. The MCTS active vision policy was used to move the receiver. Eventually, the relative pose estimation converges even though the receiver’s controller optimizes for trajectory class, not pose error. . . . 27


3.1 (Left) Simulation environment with 20 targets and a surveillance robot. (Middle) The history of the worst entropy value among the targets averaged over 20 random runs, plotted with the standard deviation. With the budget of 10 Monte Carlo samples, SACBP showed small variance for the performance curve and resulted in the fastest reduction of the worst entropy value compared to every other baseline. (Right) Computation times: Greedy, MCTS-DPW, and SACBP achieved real-time performance, taking less time than the simulated tcalc = 0.15 s. . . . 50

3.2 Sample robot trajectories (depicted in red) generated by each algorithm. Greedy, MCTS-DPW, and Ergodic did not result in a trajectory that fully covers the two groups of the targets. Ergodic (Open-Loop) mostly strayed around one group because the updated beliefs were not reflected in the robot’s trajectory. T-LQG failed to reduce the estimation uncertainty even after 200 s, due to insufficient time to solve the NLP with high-dimensional joint states in an online manner. SACBP successfully explored the space and periodically revisited both of the two target groups. With SACBP, the robot traveled in one of the four diagonal directions for most of the time. This is due to the fact that SACBP optimizes a convex quadratic under a box saturation constraint, which tends to find optimal solutions at the corners. In all the figures, the blue lines represent the target trajectories and the shaded ellipses are 99% error ellipses at t = 200 s. . . . 51

3.3 (Left) The robot is attached to the rectangular object. (Middle) The history of the l2 norm of the residual between the goal state and the true object state averaged over 20 runs. SACBP with N = 10 samples successfully brought the object close to the goal. The reduction of the residual norm was much slower for MCTS-DPW. T-LQG was not as successful either, regardless of whether the policy was derived offline (without re-planning) or online (with re-planning). . . . 53

4.1 The proposed RSSAC-Trajectron++ framework is effective for safe robot navigation in a social environment densely populated with humans. (Left) A simulation environment with real human trajectories from the UCY/UNIV scene [90], overlaid with predictions sampled from Trajectron++. (Right) A risk-sensitive robot safely navigating itself alongside 5 humans. . . . 58

4.2 Quantitative results from 100 runs show that risk-neutral (i.e. θ = 0) RSSAC further improves the performance of Nominal Search Only as the theory suggests, achieving both safety and efficiency. Note that the farther up and to the right, the better, as the x-axis is flipped. Exhaustive Search could not scale to the UNIV scene with more than 50 humans. BIC resulted in multiple collisions. Error bars show standard deviation. . . . 69


4.3 Compared to the risk-neutral case in Fig. 4.2, RSSAC with θ = 1.0 significantly reduces the standard deviation of the minimum robot-human distance by 11%, 12%, and 24% in (a), (b), and (c), respectively. The risk sensitivity trades off the stochastic collision cost and the deterministic tracking cost, which results in increased standard deviation along the x-axis in (a) and (b), and an overall distance increase in (c), where the scene was most densely populated. . . . 70

4.4 Qualitative comparison of RSSAC with θ = 0 (left) and θ = 1.0 (right) in the HOTEL scene. These results differ in the minimum robot-human distance by only 3 cm and the normalized goal distance by 0.01, but the risk-sensitive robot (right) yields to potentially conflicting humans as opposed to the risk-neutral robot (left). Both simulations used the same random seed. Sampled predictions from Trajectron++ are also depicted. . . . 72

4.5 A synthetic intersection scenario with a human. The prediction is drawn from a linear Gaussian model with a constant mean velocity. . . . 73

4.6 Minimum robot-human distance (top) and empirical probability of yielding (bottom) for the synthetic intersection scenario. Changing the risk-sensitivity (left) consistently affected whether or not the robot yields, while the other two cost tuning parameters (middle and right) did not. . . . 73

4.7 Quantitative results of the real-world experiment with 5 human subjects. Using robot-future-conditional predictions with θ = 1.0 achieves the best average performance. Error bars show the standard deviation of 5 runs with a randomized robot goal and human start-goal assignment. . . . 74

5.1 Model-based stochastic control methods often require a Gaussian noise assumption, such as the one on the left, which represents process noise in pedestrian motion under a collision avoidance scenario (see Section 5.5). However, the true stochastic model can be highly multi-modal and better captured by a more complex distribution as shown on the right, which we may not exactly know. The proposed MPC effectively handles such a model mismatch without knowledge of the true distribution, except for a bound on the KL divergence between the two. . . . 78


5.2 A unicycle robot avoiding collision with a road-crossing pedestrian. (Left) When the KL bound is set to d = 0, RAT iLQR ignores this model error and reduces to iLQG. (Right) With the correct information on the KL, RAT iLQR is aware of the prediction error and optimally adjusts the risk-sensitivity parameter for iLEQG, planning a trajectory that stays farther away from the pedestrian. The figures are overlaid with predictions drawn from the model distribution and closed-loop motion plans of the robot. Note that the prediction for the pedestrian is erroneous since the actual pedestrian motion follows the Gaussian mixture distribution. The model distribution and the true Gaussian mixture are both illustrated in Figure 5.1. . . . 91

5.3 Histograms of the minimum separation distance between the robot and the pedestrian. A negative value indicates that a collision has occurred in that run. For each control algorithm, we performed 30 runs of the simulation with randomized pedestrian start positions. RAT iLQR consistently maintained a sufficient safety margin to avoid collision, while iLQG and PETS both failed. See Table 5.1 for the summary statistics of these data. . . . 92

5.4 Time-averaged ratio of the optimal θ∗ found by RAT iLQR to the maximum feasible θ before the neurotic breakdown occurs, plotted for three distinct KL divergence values. As the KL bound increases from 1.34 to 32.02, the ratio also consistently increased from 0.66 to 0.93. Note also that the standard deviation decreased from 0.29 to 0.10. This suggests that the robot becomes more risk-sensitive as the KL bound increases, and yet it does not choose the maximum θ value all the time. . . . 93

5.5 Trade-off between collision avoidance and trajectory tracking for different methods. Error bars show the standard deviations divided by the square root of the number of samples. Note that the farther up and to the right, the better, as the x-axis is flipped. All the methods were tested under the true distribution with KL divergence d = 32.02 from the Gaussian model. As expected, RAT iLQR lies between the risk-neutral iLQG and the maximally risk-sensitive iLEQG. . . . 94


Chapter 1

Introduction

Autonomous robots have made a huge leap in the first two decades of the 21st century. In warehouses and logistics centers, we now see thousands of robots that coordinate with each other to fulfill part of the shipment tasks [155]. Robots have even served as kitchen staff in some restaurants, equipped with skills to pick up and flip burger patties without detailed human instructions [128]. There is no doubt that these achievements are attributable to the advanced integration of perception, prediction, motion planning, and control, as they all provide modern robots with essential capabilities to fulfill complex real-world tasks. Nevertheless, the current reliable applications of autonomous robots are largely limited to dedicated, closed spaces. If we look at public roads, for example, most of the cars are still driven by humans. In fact, a recent white paper [140] even expects that it may not be until the 2060s that fully autonomous vehicles completely replace human-driven ones. This in turn highlights that cutting-edge perception, planning, and control are not mature enough for such applications. What are the major differences between warehouse robots and self-driving vehicles? Why is the current technology not mature enough to deploy autonomous robots into general open environments?

One key factor is uncertainty, which prevents reliable robotic applications in open and interactive environments. In the case of warehouse robots, there is little or no uncertainty about the operating space or the other robots to coordinate with, as the robots are usually monitored and operated by a central computing system. On the other hand, uncertainty is inevitable for autonomous vehicles on public roads; vehicles are not explicitly communicating, have to deal with partial and full occlusions in the surrounding environment, and must interact with pedestrians and bicyclists whose states of mind or future motion are not completely known. The fact that those systems are safety-critical further complicates the situation, as we cannot simply deploy an existing autonomy stack and hope that it works on average. Therefore, uncertainty has to be systematically addressed in designing perception, prediction, planning, and control algorithms, as well as throughout their integration into a unified autonomy stack.


In this thesis, we identify and address challenges associated with environmental and dynamic uncertainty in order to effectively operate autonomous robotic systems in open environments, where those systems must also interact with other agents or objects. These include scenarios that are both safety-critical and non-safety-critical. In particular, this thesis focuses on designing planning and control algorithms for intelligent mobile robots that can both 1) incorporate different types of uncertainty that are present in real-world applications, and 2) either mitigate, avoid the risk of, or be robust against such uncertainty to increase the likelihood of successfully accomplishing a given task. As will become clear later in the thesis, designing such algorithms often requires us to also consider perception and prediction as part of the entire autonomy stack. This is because information about the uncertainty must first be provided to the downstream planner by the upstream perception/prediction modules. In this sense, perception and prediction are both tightly coupled with planning and control.

1.1 Definition and Taxonomy of Uncertainty

The term uncertainty is used in this thesis to refer to two distinct types of phenomena. For one, uncertainty means things that a robot could know perfectly but does not in practice, for example due to a limited sensing capability of the robot. For the other, uncertainty signifies outcomes of inherently random or noisy phenomena, regardless of whether the robot could perfectly observe them. In the machine learning and statistical inference literature, these two types of uncertainty are often referred to as epistemic uncertainty and aleatoric uncertainty, respectively [34, 141]. For the purposes of this thesis, however, we choose to categorize uncertainty based on its source in a robotic system rather than its nature. Specifically, we define and hereafter use the following terminology.

1. State Uncertainty. The robot does not perfectly know all of the system state at the current time, which possibly includes the state of the surrounding environment as well as the state of mind or intent of other agents. This is due to limited knowledge, lack of communication, or imperfect sensing capability, which collectively leads to epistemic uncertainty about the current state.

2. Transition Uncertainty. Even if the state of a system of interest is completely known at the current time, uncertainty is incurred as the state transitions to a different one over time, due to aleatoric uncertainty that exists in the dynamics of the system.

3. Model Uncertainty. The robot may be aware that the state uncertainty and/or the transition uncertainty exist, but the robot’s internal model to quantify the uncertainty is still imprecise, due to its limited capability to describe the behavior of complex dynamical systems.


A similar taxonomy is employed in [79] to explain how uncertainty arises in sequential decision-making problems. For us, this categorization is convenient, since we can introduce different objectives and problem formulations to address each root cause of uncertainty. For instance, state uncertainty could be actively mitigated by strategically interacting with other agents and/or utilizing onboard sensors. While transition uncertainty is inherent in the dynamics, the robot could still proactively avoid risky situations in which it finds itself in a high-stakes state that could damage itself or surrounding agents. For model uncertainty, we could attempt to replace a low-fidelity model with a higher-fidelity one, or make our planning algorithms robust to such incompleteness.

1.2 Approach and Related Work

The objective of this thesis is to propose and analyze novel algorithms for planning and control of mobile robots such that they collectively address the aforementioned challenges of uncertainty, especially in open and interactive environments. We take a probabilistic approach to model and quantify uncertainty. This is in contrast to conventional robust control approaches [37] or Hamilton-Jacobi-Bellman reachability methods [157, 26], wherein uncertainty is treated as a bounded disturbance term that does not necessarily have an associated probability distribution. Utilizing probability theory is advantageous in several respects. First, probability theory lets us formally estimate and quantify state uncertainty, which we can then reduce through sequential decision-making and control as appropriate. We will discuss this in detail in Chapters 2 and 3. Second, probabilistic treatment of uncertainty naturally leads to a notion of risk that can be represented as a mathematical expression, which is useful for the robot in order to recognize and avoid risky situations. We will address this in Chapters 4 and 5.

Throughout the thesis, we are primarily focused on problems with continuous states and controls, and optionally continuous time. This is in contrast to finite discrete formulations that are often assumed in sequential decision-making problems [93, 30, 142]. We have made this choice intentionally, since the open and interactive environments that concern us in the thesis are most naturally modeled as a continuous world, especially when they are highly dynamic. Although some discrete algorithms can be extended to handle continuous spaces, as we will detail in Chapter 2, we also demonstrate in Chapter 3 that our continuous formulation leads to a solution method that outperforms such an extended discrete algorithm.

We take a model-based approach in developing our planning/control algorithms so that we can exploit physical properties of dynamical systems and onboard sensors when possible. This is contrary to black-box data-driven methods for deriving a control policy, such as model-free reinforcement learning (RL) [106, 28]. Although such RL methods are agnostic to the underlying analytical properties of the system, the derived policy is fixed after training and cannot be easily modified at run-time without re-training. In Chapter 4, we demonstrate that it is trivial for our model-based control algorithm to induce various control policies that have different risk-sensitivity levels as appropriate, without changing cost function definitions (or re-training). It is also worth noting that our algorithms are developed to run online with frequent re-planning. Compared to offline model-based methods, our online approach is capable of considering large state spaces and adjusting the robot’s behavior to rapidly changing environments. For instance, a pedestrian who suddenly appears in the robot’s sensor footprint can be handled by online methods via re-planning, whereas it is more difficult for offline methods to address such a case and avoid collision in advance. Of course, this capability of online methods comes at some computation cost during deployment due to frequent re-planning, as opposed to RL approaches or offline model-based methods. To address this potential issue, we develop our algorithms so that they are efficient enough for real-time or near-real-time execution. Specifically, our proof-of-concept implementations already run within at most two to three times the computation time budget required for real-time deployment.
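To make this online re-planning pattern concrete, the following is a minimal, self-contained Python sketch of a generic receding-horizon loop on a toy 1D double integrator. It illustrates only the shared pattern, not any specific algorithm from this thesis; the plan() heuristic and all constants are hypothetical.

    import numpy as np

    # Toy 1D double integrator: state x = [position, velocity].
    rng = np.random.default_rng(0)
    dt, horizon, goal = 0.1, 10, 5.0
    x = np.array([0.0, 0.0])

    def plan(x, goal, horizon):
        # Hypothetical one-parameter "plan": the constant acceleration
        # that would land on the goal over the planning horizon.
        t = horizon * dt
        return 2.0 * (goal - x[0] - x[1] * t) / t**2

    for step in range(100):
        u = np.clip(plan(x, goal, horizon), -2.0, 2.0)  # re-plan in situ
        w = 0.05 * rng.standard_normal()                # transition noise
        x = np.array([x[0] + dt * x[1], x[1] + dt * (u + w)])
        # Only the first control of each plan is executed; the next
        # iteration re-plans from the newly observed state.

    print(f"position after 10 s: {x[0]:.2f} (goal: {goal})")

Because only the first control of each plan is applied before re-planning, the loop automatically absorbs disturbances and sudden changes in the environment, which is the advantage of online methods discussed above.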

The methods proposed in the thesis are related to and developed based on the following domains from probabilistic inference, sequential decision-making, and control.

1. Recursive Bayesian Inference. We use techniques from Bayesian inference to recursively update the posterior distribution over unknown states, in order to quantify the state uncertainty given a history of sensory observations. It is known that exact posterior inference is intractable in many practical cases [154], since it involves marginalization and normalization. Thus, we employ existing approximate inference techniques to tractably compute the posterior distribution, including the Extended Kalman Filter (EKF) [177], the Unscented Kalman Filter (UKF) [154], and the Multi-hypothesis Extended Kalman Filter (MHEKF) [115, 101].

2. Belief Space Planning. Given a recursive Bayesian inference algorithm, we can quantify the state uncertainty with a posterior distribution, which is also referred to as a belief state [153]. Belief space planning, wherein a control policy is sought that optimizes a user-defined cumulative cost objective in the belief space, is a principled framework for making optimal sequential decisions and reducing state uncertainty as necessary [114]. We show in Chapters 2 and 3 that our belief space planning algorithms, coupled with various perception models, can effectively mitigate the state uncertainty as well as balance uncertainty reduction and task accomplishment, depending on the problem specification.

3. Stochastic Optimal Control. Some problems may come with the assumption that the perception module is so precise that state uncertainty can be effectively ignored. We use tools from stochastic optimal control [20, 14] to address such problems, wherein transition uncertainty is now a major concern. Furthermore, we show in Chapter 3 that stochastic optimal control is also applicable to belief space planning problems where state uncertainty does exist.

4. Risk-Aware and Distributionally-Robust Planning. In Chapters 4 and 5, we further extend stochastic optimal control methods to consider risk associated with the underlying stochasticity. Risk-aware planning is a mathematical scheme that endows robots with the capability of quantifying and managing risk. This is achieved either through optimizing a risk metric or enforcing a risk-aware constraint [97]. This thesis focuses on the former, in particular on a risk metric known as the entropic risk measure [64, 174, 97] (defined below). Furthermore, in Chapter 5 we employ a connection between entropic risk and distributional robustness [117], wherein the robot makes robust decisions against an imprecise characterization of the distributions that govern stochastic phenomena. This enables the robot to handle model uncertainty. We note that connections between other risk metrics and distributional robustness are also known in the literature [30, 97, 159].
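For concreteness, the entropic risk measure referenced in the last item admits the following standard definition for a scalar cost J and risk-sensitivity parameter θ > 0 (a textbook form consistent with [64, 174, 97]; the notation in later chapters may differ slightly):

\[
R_\theta(J) \;=\; \frac{1}{\theta}\,\log \mathbb{E}\!\left[e^{\theta J}\right] \;=\; \mathbb{E}[J] \;+\; \frac{\theta}{2}\,\mathrm{Var}(J) \;+\; O(\theta^2).
\]

The Taylor expansion on the right makes explicit how entropic risk accounts for the variance of the cost: larger θ penalizes high-variance outcomes more heavily, while θ → 0 recovers the risk-neutral expectation.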

1.3 Applications

The methods presented in this thesis have broad applications in mobile robotics and beyond. Our belief space planning methods presented in Chapters 2 and 3 enable the reduction of state uncertainty, and thus can be applied to robots tasked with gathering information about their surrounding environments using onboard sensors. For instance, our algorithms can be used to derive a desirable trajectory for a field robot performing a persistent monitoring task to keep track of a spatio-temporal phenomenon. Such information gathering is also applicable to active intent inference of other interacting agents that are not explicitly communicating, as we will discuss in detail in Chapter 2. Furthermore, belief space planning can also tackle problems where information gathering is necessary but not the primary goal. As an example, a manipulation robot attempting to move an object to a given goal configuration may not know the object’s mass and moments of inertia beforehand, as is likely in practice. In Chapter 3, we will show that our belief space planning algorithm is capable of handling this scenario, by simultaneously estimating those initially unknown properties and manipulating the object. Similar issues appear in autonomous driving and aerial robot navigation as well, where the primary objective is to move from a start to a goal location safely, for which uncertainty reduction about the surrounding environment is crucial.

In particular, autonomous driving and aerial robot navigation are two representative applications where we must not ignore the safety-critical aspect of the task. This is where our risk-aware and distributionally robust planning methods discussed in Chapters 4 and 5 become applicable and can complement the aforementioned belief space planning approaches. As a specific problem instance, Chapter 4 will focus on ground robot navigation in close proximity to humans, where the risk of collision with humans has to be quantified and managed. Chapter 5 will add additional complexity to the problem by introducing uncertainty on the (already) stochastic motion model for humans. Our distributionally robust planner can accomplish the navigation task despite this imperfect knowledge of the human model.


1.4 Contributions

The novel contributions of this thesis are summarized as follows. The references accompanying each item represent the author’s own work published or under review in refereed conference proceedings and journals.

1. Active vision-based intent inference [110]. We employ monocular vision as the perception model, which is widely used for various robotic applications such as autonomous drone navigation [94] and simultaneous localization and mapping (SLAM) [44]. In particular, we consider an active vision [33, 179] scenario, in which a robot observing the motion of another robot is tasked with inferring or “decoding” the intent of that robot. Such a problem can be viewed as a motion-based communication [125, 126, 127, 8, 73] task, where messages between robots are sent and received using their whole-body motion. We propose a novel belief space planning approach to actively reducing the state uncertainty about the other interacting robot’s intent, even when there is no direct (e.g. wireless) communication between the robots. The proposed framework is based on the Multi-Hypothesis Extended Kalman Filter (MHEKF) [115, 101] for tracking a multi-modal belief, as well as a variant of Monte Carlo Tree Search (MCTS) [19] for reducing the entropy of the belief over time. In simulation with two interacting drones, we demonstrate that our method can correctly identify the hidden intent of the other robot, with higher accuracy and confidence than baseline policies.

2. Fast approximate belief space planning [111, 112]. We propose a novel online algorithm for fast, general-purpose belief space planning. Unlike other existing works, our approach is based on the perturbation theory of differential equations. Specifically, we extend Sequential Action Control (SAC) [6], a framework for approximate optimal control of nonlinear deterministic systems, to stochastic dynamics. The resulting Stochastic SAC algorithm is run in the belief space and is named SACBP. Despite being an approximate method, SACBP does not require discretization of spaces or time and synthesizes control signals in near real-time. Thus, SACBP is suitable for various tasks that require agile robot control with high-frequency observations, which are common in open and interactive environments. Our algorithm is useful in different application domains coupled with various perception models. In continuous robot control problems where state uncertainty is a major obstacle, we show that the algorithm significantly outperforms other online and offline methods, including an MCTS approach.

3. Fast risk-sensitive planning with probabilistic generative models [108]. We propose a fast online algorithm for risk-sensitive optimal control, which adopts the entropic risk measure as the objective function. Entropic risk considers the variance of the underlying distribution and thus can mitigate “rare but catastrophic” events. Our method extends the Stochastic SAC algorithm to this risk-sensitive setting, and the algorithm is named Risk-Sensitive Sequential Action Control (RSSAC). Crucially, RSSAC does not require knowledge of the analytical form of the distribution, nor is it limited to a certain class of distributions such as Gaussians; a black-box probabilistic generator can be used to model transition uncertainty. Leveraging this desirable property, we apply RSSAC to safe autonomous robot navigation in human crowds, by incorporating a deep-learned generative model for probabilistic human motion prediction recently proposed by Salzmann, Ivanovic et al. [134]. Our robot navigation framework constitutes a prediction-control pipeline, which is shown to be capable of promoting the safety of crowd-robot interaction by inducing a robot behavior that proactively avoids risky situations.

4. Distributionally Robust Planning under Model Uncertainty [109]. We present a novel online method for distributionally robust control, which plans a locally optimal feedback policy against a worst-case distribution within a given set of plausible distributions. This set is defined by an offline-estimated KL divergence bound between a user-specified Gaussian model and the unknown ground-truth distribution. Leveraging the mathematical equivalence between distributionally robust control and risk-sensitive optimal control (sketched below), the proposed approach achieves efficient local optimization to tractably compute the control policy for nonlinear systems. This equivalence also suggests that our method serves as an algorithm to dynamically adjust the risk-sensitivity level online for risk-sensitive control. The benefits of the distributional robustness as well as the automatic risk-sensitivity adjustment are demonstrated in a dynamic collision avoidance scenario with model uncertainty, in which the predictive distribution of human motion is erroneous.
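To make the equivalence invoked in the last contribution concrete, the underlying result is the standard duality between KL-constrained worst-case expectation and entropic risk (cf. [117]); the exact statement and conditions used in Chapter 5 may differ in detail:

\[
\sup_{Q\,:\,\mathrm{KL}(Q\,\|\,P)\,\le\,d} \mathbb{E}_Q[J]
\;=\;
\inf_{\theta > 0}\left\{\frac{1}{\theta}\,\log \mathbb{E}_P\!\left[e^{\theta J}\right] + \frac{d}{\theta}\right\},
\]

where P is the user-specified (e.g., Gaussian) model and d is the offline-estimated KL divergence bound. The inner objective is the entropic risk of the cost J under P plus a penalty d/θ, so minimizing over θ is precisely what yields the dynamically adjusted risk-sensitivity parameter θ∗.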

1.5 Organization

The organization of this thesis is summarized in Figure 1.1. As can be seen, each of the subsequent chapters addresses one or more of the uncertainty types discussed in Section 1.1. We begin with problems where state uncertainty is a major concern. In Chapter 2, we specifically focus on the problem of active intent inference, wherein a robot equipped with a monocular camera is tasked with reducing the intent uncertainty of another interacting agent by observing its motion. We derive a recursive Bayesian inference algorithm for simultaneous estimation of the initially unknown relative pose and the intent. We then build on top of the inference algorithm a non-myopic active vision framework; a belief space planning approach is taken to derive a control policy that aims at reducing the entropy of the belief. In Chapter 3, we abstract the belief space planning framework and discuss its theory rather than a specific application. An extensive literature review of belief space planning is given. Then, we address the challenge of performing fast decision-making with high-frequency observations, a crucial capability for autonomous robots in open and interactive environments. To that end, a novel online framework for efficient, general-purpose belief space planning is developed based on Stochastic SAC. Note that Stochastic SAC itself is a contribution of this thesis, whose rigorous analysis is given in Appendix A. Overall, the methods presented in Chapters 2 and 3 are suited for reduction of state uncertainty through onboard perception.

Figure 1.1: Roadmap illustrating the organization of this thesis. The three uncertainty types in open and interactive environments (state, transition, and model uncertainty) map to the approaches and applications: belief space planning for active intent inference (Chapter 2) and general applications (Chapter 3), risk-sensitive planning (Chapter 4), and distributionally robust planning (Chapter 5), with the latter two applied to autonomous navigation and human-robot interaction.

Chapter 4 shifts focus from state uncertainty to transition uncertainty. We discuss how belief space planning alone is insufficient to address difficulties arising from transition uncertainty, especially in safety-critical scenarios; for the robotic system to remain safe under transition uncertainty, it is crucial to take “rare but catastrophic events” into consideration. To achieve this computationally, we formulate risk-sensitive optimal control and propose an efficient solution method. This is derived as a natural extension of the Stochastic SAC framework to the entropic risk measure. The resulting RSSAC algorithm is applied to safe autonomous robot navigation in human crowds, to demonstrate the superior performance of the algorithm and the effectiveness of risk-sensitivity in safety-critical problems.

Chapter 5 considers model uncertainty, which arises due to imperfect modeling of the stochastic robotic system. Model uncertainty leads to a mismatch between the assumed system model and the ground truth, which is often inevitable. To address this model mismatch, we formulate a distributionally robust control problem, which seeks a feedback policy that acts against a worst-case distribution within a given set of plausible models. We then show that such a problem can be made equivalent to risk-sensitive optimal control, referring to prior work in control theory [117]. Leveraging this equivalence, we propose an efficient online algorithm for distributionally robust control of nonlinear stochastic systems. Similarly to Chapter 4, we apply our algorithm to safe autonomous robot navigation.

Finally, Chapter 6 concludes this thesis, presenting lessons learned as well as future research

directions.


Chapter 2

Active Vision-Based Intent Inference

In this chapter, we consider a motion-based communication [125, 126, 127, 8, 73] problem wherein

whole-body motion is used as a means of sending messages between robots. We focus on an active

vision [33, 179] scenario in which a receiving robot equipped with a monocular camera is tasked

with decoding a message (or more generally, hidden intent) that is encoded in a sending robot’s

trajectory. The difficulty in this problem arises from the monocular vision model of the receiver

and the unknown relative pose between robots, which brings inherent ambiguity into the trajectory

identification, and hence the message decoding. Assuming that there is a finite set of possible

messages that the sender can express, we first introduce an online Bayesian inference method for the

receiving robot to simultaneously estimate its relative pose to the sender, and the trajectory class

of the sender. We show that the Multi-hypothesis Extended Kalman Filter (MHEKF) [115, 101]

becomes a natural choice to tractably implement this inference algorithm. Furthermore, we formulate

a belief space planning problem to derive a non-myopic active vision policy for the receiver to actively

disambiguate similar trajectory class hypotheses. The policy is constructed online by a variant

of Monte Carlo Tree Search (MCTS) algorithm [19, 23] and aims at reducing the entropy of the

trajectory class distribution. Our simulation results using two interacting quadrotors demonstrate

that the proposed control policy results in an accurate trajectory classification and thus effective

intent inference. The proposed framework for active vision has broad applications where intent

ambiguity of other agents is of a concern for the robot. This includes autonomous driving as well

as autonomous drone operations in open and interactive environments. The materials presented in

this chapter are also reported in [110].




2.1 Introduction and Related Work

Robots that interact with one another are often assumed to directly communicate over a wireless

network. However, in many instances a wireless network is not available, or cannot be relied upon.

This introduces state uncertainty about the intent of other robots or agents, which could significantly

impede reliable operation of robots in interactive environments. To discuss and address such an issue,

this chapter considers motion-based communication, in which a sender robot encodes a message in

its trajectory, which is decoded by a receiver robot. The receiver robot has only a monocular camera

with which to observe the motion, and the relative pose between the two robots is unknown. This can

make the identification of the trajectory difficult or impossible without an active vision strategy. We

present a recursive Bayesian inference algorithm by which the receiving robot classifies the trajectory

from the sending robot, and simultaneously estimates the relative pose between the two robots. We

also propose a non-myopic active vision algorithm based on Monte Carlo Tree Search (MCTS) with

Double Progressive Widening (DPW) [32]. The planning is performed in the belief space and drives

the receiving robot to move around so that it can disambiguate between similar trajectory classes.

We use entropy as the measure of uncertainty in the trajectory class distribution, which the control

policy seeks to reduce over time in a non-myopic manner. We present theoretical results as well as

experimental results evaluated in a realistic simulation environment.

Our estimation and control framework may be used for robots to literally exchange messages

through their trajectories, or more generally, as an abstraction for motion classification, prediction,

and intent inference. The sender (whether intentionally or not) sends a “message” with its trajectory,

which the receiving robot must “decode.” For example, an autonomous car must categorize an

observed vehicle as aggressive or defensive based on its observed trajectory. Thus our algorithm has

applications to autonomous driving, where vehicles have to infer the intent and predict the motion

of other vehicles, pedestrians, and bikers, and to autonomous drones, which have to avoid collisions

with other drones and pedestrians in their airspace by observing and predicting their trajectories.

The algorithm may also model motion-based communication in animals, for example the “waggle

dance” of honeybees studied in zoology [163].

In robotics, previous work [8, 73] has employed control theoretic approaches for the sending

robot to design a set of distinguishable and energy-optimal trajectories. In contrast, here we consider

algorithms for the receiver to decode and infer the message. This problem is particularly complicated

by the monocular camera of the receiver (which gives only a 2D projection of the sender’s trajectory)

and the unknown relative pose between the two robots. In this scenario, two different trajectories can

look identical from the point of view of the receiver, e.g., an “L” trajectory looks like a “V” trajectory

from an oblique angle (see Figure 2.1). Although recent work [8, 73, 125, 126, 127] has explored the

emerging field of motion-based communication, an effective way to resolve the ambiguity inherent in the monocular vision and the unknown relative pose has not yet been proposed, to our knowledge.

More specifically, the main contributions of this study are twofold. First, we present an online



Figure 2.1: (Left) A sample 2D trajectory codebook with potential ambiguities between entries. (Right) A noisy camera projection of a complete trajectory. The trajectory may be either an “L” or a “V” from the image. Further inference requires the receiver to move to obtain a more accurate estimate.

estimation framework in which both the sender’s message and the receiver’s relative pose are se-

quentially estimated. Based on the Markov assumption, we provide a recursive Bayesian inference

algorithm to handle the multi-modal distribution over the joint state of the receiver and the sender.

A Gaussian approximation and a model linearization yield a Gaussian mixture representation of the

robots’ joint state, naturally leading to the Multi-hypothesis Extended Kalman Filter (MHEKF)

[115, 101] algorithm. Second, we actively control the receiver to move around the sender in order to

disambiguate between similar trajectory hypotheses. An information-theoretic approach is taken in

which we control the receiver to maximally decrease the entropy of the trajectory class distribution.

Furthermore, we plan over multiple future poses of the receiver by formulating the optimization

problem as belief space planning and employing the MCTS algorithm with DPW. The performance

of this control policy is compared to that of a greedy Monte Carlo algorithm and a random policy

in simulations.

Our algorithms draw inspiration from several existing works in the areas of state estimation,

computer vision, and sequential decision-making under uncertainty. For example, general nonlinear

filtering with a Gaussian mixture model is introduced by Alspach and Sorenson in [4]. Jochmann et

al. discuss its applicability to robot localization [72]. Chen and Liu [27] introduce mixture Kalman

filters to handle conditional dynamic linear models. The MHEKF algorithm specifically is applied

to various state estimation problems such as spacecraft attitude estimation [101] and bearings-only

object tracking [115]. In computer vision, simultaneous active vision classification and camera pose

estimation is proposed by Sipe and Casasent in [144]. Although their estimation scheme has some

similarities with ours, it does not deal with motion-based communication, and it does not use an

entropy-based control objective. Denzler et al. [33] and Zobel et al. [179] present an entropy-based

camera control framework, and derive a closed-form control objective for Gaussian distributions.



Seekircher et al. [139] employ Monte Carlo exploration to implement an entropy-based active vision

control. These works also do not address communication through motion, but active vision. In

terms of the planning algorithms, MCTS is an online method for finding optimal decisions [19] and

is widely used in various decision processes that involve uncertainty, ranging from game-playing [50]

to robotics [15]. The algorithm is anytime, and can also be applied to continuous or large state and

action spaces by introducing DPW [32]. There are several existing works that apply Monte Carlo

Tree Search to active perception and belief space planning [15, 145]. Our work has similar goals,

but we are particularly interested in applications in motion-based communication and active intent

inference.

The rest of the chapter is organized as follows. In Section 2.2 we define the robot models

and formalize the problem. In Section 2.3 we derive the recursive Bayesian inference formula and

provide the MHEKF algorithm. A belief initialization algorithm is discussed in Section 2.4. The

active-vision control policy for the receiver is formulated in Section 2.5. Simulation results in a ROS-

Gazebo environment are presented in Section 2.6 and conclusions with future directions are given

in Section 2.7.

2.2 Preliminaries

The sender robot chooses a pre-specified trajectory to encode the corresponding message. The

sender then executes this trajectory, while the receiver observes the motion and controls its viewing

positions. In this work we let the receiver move and estimate the message simultaneously while the

sender executes a single trajectory.

This leads to a sequential state estimation and decision making problem represented by the

Bayesian network structure in Figure 2.2. We formalize this stochastic process in what follows,

dropping the time index for convenience when it does not cause confusion.

2.2.1 Trajectory generation

A trajectory of the sender is represented by an initial position $x_0 \in \mathbb{R}^3$ and a set of $n$ stochastic transitions $x_{k+1} = f_k(x_k, m) + v_k$, $k = 0, \dots, n-1$, where $f_k$ is a deterministic, time-indexed state transition function of the sender, $m$ is a message or a trajectory class and $v_k$ is zero-mean Gaussian white noise. The initial position $x_0$ is also assumed to be Gaussian-distributed. Given a trajectory class $m \in \{1, \dots, M\}$, the knowledge of the corresponding noiseless trajectory can be completely described by a set of $n+1$ waypoints $x_0, x_1^{(m)}, \dots, x_n^{(m)}$. This is expressed in a

frame of reference named the trajectory frame and is fixed in inertial space. We assume that the

receiver has already acquired the set of ideal trajectories as a “trajectory codebook.” This can be

done through modeling or learning the deterministic transition function fk of the sending agent, and

is an interesting topic for future research. We also assume that the receiver can identify the time



Figure 2.2: Bayesian network structure for our trajectory classification and pose estimation problem. The receiver robot observes the camera image z of the sender's position x to estimate the relative pose r and the sender's message m, and applies control u.

when the sender initiates the trajectory execution.
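To make the trajectory generation model concrete, the following Python sketch builds a hypothetical codebook of noiseless waypoints and simulates one noisy execution. The elliptic waypoint geometry and the noise scale are illustrative assumptions only, not the codebook or parameters used in the experiments of Section 2.6.

```python
import numpy as np

def make_codebook(n=6):
    """Hypothetical codebook: M noiseless trajectories, each a set of n+1
    waypoints expressed in the trajectory frame. The semi-axes below are
    illustrative; the actual codebook is depicted in Figure 2.4."""
    thetas = np.linspace(0.0, 2.0 * np.pi, n + 1)
    semi_axes = [(1.0, 0.5), (1.0, 0.8), (0.7, 0.5)]  # assumed per-class shapes
    codebook = []
    for a, b in semi_axes:
        waypoints = np.stack(
            [a * np.cos(thetas), b * np.sin(thetas), np.zeros(n + 1)], axis=1)
        codebook.append(waypoints)          # shape (n+1, 3)
    return codebook

def execute_trajectory(waypoints, noise_std=0.01, rng=None):
    """Simulate x_{k+1} = f_k(x_k, m) + v_k by stepping through the noiseless
    waypoints of class m and adding zero-mean Gaussian white noise v_k."""
    rng = rng or np.random.default_rng()
    noisy = [waypoints[0]]                  # x_0
    for k in range(len(waypoints) - 1):
        noisy.append(waypoints[k + 1] + noise_std * rng.standard_normal(3))
    return np.asarray(noisy)
```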

2.2.2 State transition model

Let $s_k \triangleq (r_k^\top, x_k^\top)^\top \in \mathbb{R}^9$ be the joint robot state of the receiver and the sender at timestep k, where r denotes the pose. Conditional on a trajectory class m, the state transition for the sender is as described in the previous section. The receiver's pose r specifies its relative position and attitude in the trajectory frame as
\[
r \triangleq (\omega^\top, t^\top)^\top \in \mathbb{R}^6, \tag{2.1}
\]

where $\omega = (\omega_x, \omega_y, \omega_z)^\top \in \mathbb{R}^3$ gives the angle-axis representation of the 3D rotation group SO(3). The direction of ω specifies the axis around which the receiver's body frame is rotated with respect to the trajectory frame, and the $\ell_2$ norm $\|\omega\|_2$ gives the magnitude of the rotation angle [89]. Similarly, $t = (t_x, t_y, t_z)^\top \in \mathbb{R}^3$ represents the translation of the receiver's body in the trajectory frame. The

pose r is not known to the receiver or the sender.

The new position $t_{k+1}$ is given by the current position $t_k$, attitude $\omega_k$ and the translational control input $u_{\mathrm{trans}} \in \mathbb{R}^3$ as
\[
t_{k+1} = t_k + \exp([\omega_k]_\times)\, u_{\mathrm{trans}}, \tag{2.2}
\]



where $[\cdot]_\times : \mathbb{R}^3 \to \mathfrak{so}(3)$ is the skew-symmetric matrix operator:
\[
[\omega]_\times \triangleq \begin{pmatrix} 0 & -\omega_z & \omega_y \\ \omega_z & 0 & -\omega_x \\ -\omega_y & \omega_x & 0 \end{pmatrix} \tag{2.3}
\]

and so(3) is the Lie algebra of SO(3). The matrix exponential in (2.2) can be evaluated using the

Rodrigues formula [89]:

\[
\exp([\omega]_\times) = I_{3\times 3} + \frac{\sin\theta}{\theta}\,[\omega]_\times + \frac{1-\cos\theta}{\theta^2}\,[\omega]_\times^2, \tag{2.4}
\]
where $\theta = \|\omega\|_2$.

The new attitude $\omega_{k+1}$ is related to the current attitude $\omega_k$ and the rotational control input $u_{\mathrm{rot}} \in \mathbb{R}^3$ as
\[
\exp([\omega_{k+1}]_\times) = \exp([\omega_k]_\times)\,\exp([u_{\mathrm{rot}}]_\times). \tag{2.5}
\]

The concatenated control input $u_k = (u_{\mathrm{rot}}^\top, u_{\mathrm{trans}}^\top)^\top \in \mathbb{R}^6$ is represented in the body coordinates of the receiver.

Finally, the state transition for the receiver is also corrupted by zero-mean Gaussian white noise.

Along with the state transition noise for the sender, we denote $Q \in \mathbb{S}^{9\times 9}_{+}$ as the symmetric, positive semi-definite covariance matrix for the joint state transition.

Note that the receiver will apply an input uk only after the sender has moved from xk−1 to xk,

and the state transition to rk+1 is assumed to be finished by the time the sender reaches the next

position xk+1.
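As a concrete illustration of (2.2)–(2.5), the sketch below implements the skew-symmetric operator, the Rodrigues formula, and one step of the receiver's pose transition in Python. Recovering $\omega_{k+1}$ from the composed rotation matrix via SciPy's rotation-vector log map is an implementation choice of this sketch, not something prescribed by the text.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def skew(w):
    """Skew-symmetric matrix [w]_x from (2.3)."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def exp_so3(w, eps=1e-9):
    """Matrix exponential exp([w]_x) via the Rodrigues formula (2.4)."""
    theta = np.linalg.norm(w)
    if theta < eps:
        return np.eye(3)        # near-zero rotation: identity
    K = skew(w)
    return (np.eye(3) + (np.sin(theta) / theta) * K
            + ((1.0 - np.cos(theta)) / theta**2) * (K @ K))

def step_pose(omega_k, t_k, u_rot, u_trans):
    """One receiver pose transition: position update (2.2) and attitude
    composition (2.5), with the new angle-axis vector recovered by a log map."""
    t_next = t_k + exp_so3(omega_k) @ u_trans            # (2.2)
    R_next = exp_so3(omega_k) @ exp_so3(u_rot)           # (2.5)
    omega_next = Rotation.from_matrix(R_next).as_rotvec()
    return omega_next, t_next
```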

2.2.3 Observation model

The monocular vision of the receiver can be approximated by a simple pinhole camera model. We

will provide a brief review of the model that relates a sender’s position to an observation. Interested

readers are referred to [89, 60] for further details.

Pinhole camera projection

Let z ∈ R2 represent the observation expressed in the pixel coordinates on the image plane and

x ∈ R3 be the Cartesian coordinates representing the sender’s position. The relationship between a

3D point x and the corresponding 2D point z is given by

\[
\lambda \begin{pmatrix} z \\ 1 \end{pmatrix} = P \begin{pmatrix} x \\ 1 \end{pmatrix}, \tag{2.6}
\]



where λ is a scale factor and P ∈ R3×4 is the projection matrix. P can be further decomposed into a

product of the camera calibration matrix K ∈ R3×3 and the external parameters matrix T ∈ R3×4.

The camera calibration matrix contains intrinsic parameters of the camera and we assume that it is

known to the receiver.

External parameters matrix

The external parameters matrix T defines the rotation and the translation of the camera frame with

respect to the trajectory frame:

\[
T \triangleq \begin{pmatrix} R^{c}_{\mathrm{traj}} & t^{c}_{\mathrm{traj}} \end{pmatrix}, \tag{2.7}
\]
where $t^{c}_{\mathrm{traj}}$ is the translational vector and $R^{c}_{\mathrm{traj}}$ is the 3D rotation matrix represented in the camera coordinates. These parameters can be expressed in terms of the receiver's state $r = (\omega^\top, t^\top)^\top$ as
\[
R^{c}_{\mathrm{traj}} = (R^{b}_{c})^\top [\exp([\omega]_\times)]^\top \tag{2.8}
\]
\[
t^{c}_{\mathrm{traj}} = -(R^{b}_{c})^\top [\exp([\omega]_\times)]^\top t - (R^{b}_{c})^\top t^{b}_{c}, \tag{2.9}
\]
where $R^{b}_{c}$ and $t^{b}_{c}$ define the relative pose between the body and the camera coordinates, and are determined by the physical position and attitude of the camera on the receiver robot.

The pinhole camera projection (2.6) along with the external parameters matrix (2.7) comprise the

nonlinear observation function h(s). The actual observation is also subject to zero-mean Gaussian

white noise with covariance $R \in \mathbb{S}^{2\times 2}_{+}$, hence the distribution is given by
\[
p(z \mid s) = \mathcal{N}\left(z; h(s), R\right). \tag{2.10}
\]
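Putting (2.6)–(2.9) together, the following sketch evaluates the noiseless observation function h(s) for a given receiver pose and sender position. Here exp_so3 is the Rodrigues map from the previous sketch, and the camera calibration matrix and body-to-camera extrinsics are assumed to be supplied by the caller; the function name and signature are illustrative only.

```python
import numpy as np

def observation_mean(r, x, K_cam, R_bc, t_bc):
    """Noiseless pinhole observation h(s) built from (2.6)-(2.9).
    r = (omega, t): receiver pose in the trajectory frame; x: sender position;
    K_cam: 3x3 camera calibration matrix; (R_bc, t_bc): camera pose in the
    receiver's body frame. exp_so3 is the Rodrigues map defined earlier."""
    omega, t = r[:3], r[3:]
    R_wb = exp_so3(omega)                                 # exp([omega]_x)
    R_ctraj = R_bc.T @ R_wb.T                             # (2.8)
    t_ctraj = -R_bc.T @ R_wb.T @ t - R_bc.T @ t_bc        # (2.9)
    P = K_cam @ np.hstack([R_ctraj, t_ctraj.reshape(3, 1)])  # projection matrix
    z_h = P @ np.append(x, 1.0)                           # (2.6), homogeneous
    return z_h[:2] / z_h[2]    # divide out the scale factor lambda
```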

2.2.4 Repeated trajectory execution

The stochastic process depicted in Figure 2.2 represents a single execution of a trajectory. In fact, the

process is repeated multiple times, with rn+1 in the current iteration being r1 in the next iteration.

Similarly, the sender transitions from xn to x1 through x0. Note that the posterior at the end of

the current iteration becomes the prior for the next iteration.

2.3 Bayesian Filtering

As illustrated in Figure 2.2, the trajectory class m and the receiver’s state r are correlated via the

observation z and the sender’s position x. We therefore need to simultaneously estimate m and

s = (rT, xT)T even though eventually we are more interested in m than s in this problem. Formally,

this is equivalent to describing the joint distribution of s and m in the Bayesian framework, which

is discussed in this section.



2.3.1 Recursive Bayesian Inference Formula

We are now set to derive the recursive Bayesian inference formula to determine the posterior distri-

bution over the joint robot state s and the trajectory class m.

Theorem 2.1 (Recursive Bayesian Update). The posterior of the joint distribution $p(s_{k+1}, m \mid z_{1:k+1}, u_{1:k})$ can be obtained recursively as a function of the class-dependent state transition model

p(sk+1 | sk, uk,m) and the observation model p(zk+1 | sk+1), given the Bayesian network structure

depicted in Figure 2.2 and the prior belief p(sk,m | z1:k, u1:k−1).

Proof. From the definition of conditional probability, the posterior can be factored into a product

of a continuous probability density function p of the joint state s and a discrete probability mass

function P of the trajectory class m:

\[
p(s_{k+1}, m \mid z_{1:k+1}, u_{1:k}) = p(s_{k+1} \mid z_{1:k+1}, u_{1:k}, m)\, P(m \mid z_{1:k+1}, u_{1:k}). \tag{2.11}
\]

The first term can be decomposed using Bayes’ rule

\[
p(s_{k+1} \mid z_{1:k+1}, u_{1:k}, m) \propto p(z_{k+1} \mid s_{k+1})\, p(s_{k+1} \mid z_{1:k}, u_{1:k}, m). \tag{2.12}
\]

The first term $p(z_{k+1} \mid s_{k+1})$ in (2.12) is the observation model. The second term in (2.12) can be expressed as a marginalization integral with the class-dependent state transition model $p(s_{k+1} \mid s_k, u_k, m) = p(r_{k+1} \mid r_k, u_k)\, p(x_{k+1} \mid x_k, m)$ as
\[
p(s_{k+1} \mid z_{1:k}, u_{1:k}, m) = \int p(s_{k+1} \mid s_k, u_k, m)\, p(s_k \mid z_{1:k}, u_{1:k-1}, m)\, \mathrm{d}s_k. \tag{2.13}
\]

Substituting (2.13) into (2.12), we see that the first term of the factored posterior in (2.11) can be

specified by the observation model, the state transition model and p(sk | z1:k, u1:k−1,m), which is

part of the prior belief factored in the same manner as in (2.11).

On the other hand, the second term of the factored posterior in (2.11) can also be decoupled according to Bayes' rule:
\[
P(m \mid z_{1:k+1}, u_{1:k}) \propto p(z_{k+1} \mid m, z_{1:k}, u_{1:k})\, P(m \mid z_{1:k}, u_{1:k-1}). \tag{2.14}
\]

$P(m \mid z_{1:k}, u_{1:k-1})$ is the prior belief on the trajectory class m. Now, $p(z_{k+1} \mid m, z_{1:k}, u_{1:k})$ can be expressed by a marginal density function over $s_{k+1}$ as
\[
p(z_{k+1} \mid m, z_{1:k}, u_{1:k}) = \int p(z_{k+1} \mid s_{k+1})\, p(s_{k+1} \mid z_{1:k}, u_{1:k}, m)\, \mathrm{d}s_{k+1}. \tag{2.15}
\]

The first term in the integral above is the observation model. The second term is equivalent to

(2.13).



Remark 2.1. This result suggests that we can separately update 1) the belief of the joint state s

conditional on the trajectory class m and 2) the belief of m itself, given the state transition model and

the observation model. The factorization presented in the proof will allow us to derive closed-form

parameter update equations, with certain assumptions imposed on the model as explained below.

2.3.2 Multi-hypothesis Extended Kalman Filter

The problem of estimating the joint robot state s and the message m given past observations

and control inputs can be viewed as a multi-hypothesis filtering problem. We employ the Multi-

hypothesis Extended Kalman Filter (MHEKF) [115, 101] algorithm as our filter. The nonlinear

state transition and observation functions are linearized around the current estimate and the prior

distribution on the joint robot state is assumed to be a mixture of Gaussians. In what follows, we

derive the Kalman update equations based on the recursive Bayesian inference formulae discussed

above.

Model linearization

As in the usual EKF update equations, we approximate the nonlinear state transition and observation functions with their first-order Taylor expansions. First, let $g_k(s_k, u_k, m)$ represent the joint robot state transition function defined by (2.2), (2.5) and $f_k(x_k, m)$, which is the state transition function for the sender. The linearization of $g_k$ around $\mathbb{E}[s_k \mid z_{1:k}, u_{1:k-1}, m] \triangleq \mu_k^{(m)}$ yields
\[
g_k(s_k, u_k, m) \approx g_k(\mu_k^{(m)}, u_k, m) + F_k^{(m)} (s_k - \mu_k^{(m)}) \tag{2.16}
\]
with Jacobian $F_k^{(m)} \triangleq \frac{\partial}{\partial s} g_k(s, u_k, m)\big|_{\mu_k^{(m)}} \in \mathbb{R}^{9\times 9}$.

In a similar manner, the observation function is linearized around $\mathbb{E}[s_{k+1} \mid z_{1:k}, u_{1:k}, m] \triangleq \mu_{k+1}^{(m)}$. Assuming that the trajectory class is m, the linearized observation function is given by
\[
h(s_{k+1}) \approx h(\mu_{k+1}^{(m)}) + G_{k+1}^{(m)} (s_{k+1} - \mu_{k+1}^{(m)}), \tag{2.17}
\]
where $G_{k+1}^{(m)} \triangleq \frac{\partial}{\partial s} h(s)\big|_{\mu_{k+1}^{(m)}} \in \mathbb{R}^{2\times 9}$ is the Jacobian matrix.

Parameter update equations

In order to derive the Kalman update equations, we assume that the prior belief of s conditioned on m is a Gaussian distribution:
\[
p(s_k \mid z_{1:k}, u_{1:k-1}, m) \triangleq \mathcal{N}(s_k; \mu_k^{(m)}, \Sigma_k^{(m)}). \tag{2.18}
\]



Algorithm 2.1 Multi-hypothesis Extended Kalman Filter algorithm for trajectory classification and state estimation. (Bars denote predicted quantities.)

INPUT: {(µ_k^(m), Σ_k^(m), φ_k^(m)) | m = 1, . . . , M}, u_k, z_{k+1}
OUTPUT: {(µ_{k+1}^(m), Σ_{k+1}^(m), φ_{k+1}^(m)) | m = 1, . . . , M}

1: for each m ∈ {1, . . . , M} do
2:     F_k^(m) ← ∂g_k(s, u_k, m)/∂s |_{µ_k^(m)}
3:     µ̄_{k+1}^(m) ← g_k(µ_k^(m), u_k, m)
4:     Σ̄_{k+1}^(m) ← F_k^(m) Σ_k^(m) (F_k^(m))^T + Q
5:     G_{k+1}^(m) ← ∂h(s)/∂s |_{µ̄_{k+1}^(m)}
6:     H_{k+1}^(m) ← R + G_{k+1}^(m) Σ̄_{k+1}^(m) (G_{k+1}^(m))^T
7:     K_{k+1}^(m) ← Σ̄_{k+1}^(m) (G_{k+1}^(m))^T (H_{k+1}^(m))^{-1}
8:     µ_{k+1}^(m) ← µ̄_{k+1}^(m) + K_{k+1}^(m) (z_{k+1} − h(µ̄_{k+1}^(m)))
9:     Σ_{k+1}^(m) ← (I − K_{k+1}^(m) G_{k+1}^(m)) Σ̄_{k+1}^(m)
10:    φ_{k+1}^(m) ← N(z_{k+1}; h(µ̄_{k+1}^(m)), H_{k+1}^(m)) · φ_k^(m)
11: end for
12: φ_{k+1} ← normalize(φ_{k+1})

We also denote by $\phi_k^{(m)}$ the probability of the trajectory class being m given $z_{1:k}$ and $u_{1:k-1}$:
\[
P(m \mid z_{1:k}, u_{1:k-1}) \triangleq \phi_k^{(m)}. \tag{2.19}
\]
As a result, the marginal distribution over s takes the form of a mixture of Gaussians:
\[
p(s_k \mid z_{1:k}, u_{1:k-1}) = \sum_{m=1}^{M} \phi_k^{(m)}\, \mathcal{N}(s_k; \mu_k^{(m)}, \Sigma_k^{(m)}). \tag{2.20}
\]

Given the prior distribution (2.20), the Bayesian filtering (2.11)–(2.15) is equivalent to the pa-

rameter update equations given in Algorithm 2.1. This is consistent with similar Gaussian Mixture

Filtering algorithms presented in [4] and [72]. Note that the categorical distribution parameters

$\phi_{k+1}^{(m)}$ must be normalized so that $\sum_{m=1}^{M} \phi_{k+1}^{(m)} = 1$ is satisfied.
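For concreteness, here is a minimal Python sketch of one step of Algorithm 2.1 for caller-supplied transition and observation models; the function names and signatures are illustrative assumptions, not the dissertation's implementation. The sketch keeps predicted and updated parameters separate so that the class likelihood is evaluated at the predicted observation, as in line 10 of the algorithm.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mhekf_step(hypotheses, u, z, g, h, jac_g, jac_h, Q, R):
    """One MHEKF update (Algorithm 2.1). `hypotheses` is a list of
    (mu, Sigma, phi) tuples, one per trajectory class m; g/h are the
    transition and observation functions with Jacobians jac_g/jac_h."""
    updated, weights = [], []
    for m, (mu, Sigma, phi) in enumerate(hypotheses):
        F = jac_g(mu, u, m)
        mu_pred = g(mu, u, m)                        # predicted mean
        Sigma_pred = F @ Sigma @ F.T + Q             # predicted covariance
        G = jac_h(mu_pred)
        H = R + G @ Sigma_pred @ G.T                 # innovation covariance
        K = Sigma_pred @ G.T @ np.linalg.inv(H)      # Kalman gain
        mu_new = mu_pred + K @ (z - h(mu_pred))
        Sigma_new = (np.eye(len(mu)) - K @ G) @ Sigma_pred
        updated.append((mu_new, Sigma_new))
        # class likelihood evaluated at the predicted observation
        weights.append(phi * multivariate_normal.pdf(z, mean=h(mu_pred), cov=H))
    weights = np.asarray(weights)
    weights /= weights.sum()                         # line 12: normalize
    return [(mu, S, w) for (mu, S), w in zip(updated, weights)]
```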

2.4 Belief Initialization

2.4.1 Levenberg-Marquardt algorithm

Convergence performance of the Extended Kalman Filter algorithm is sensitive to the accuracy of

the prior belief. The initial estimate of the receiver's pose is of particular importance; we



observed in simulation that a randomized initial value for the pose estimate leads to poor convergence

performance of the filter. This stems from the nonlinearity in the state transition model (2.2),

(2.5). Thus we resort to the Levenberg-Marquardt algorithm [89], a well-known nonlinear least

squares method in the computer vision and optimization literature. At the very beginning of the

communication process, we let the sender perform a complete trajectory once while the receiver

remains stationary at the initial pose and have the receiver collect a set of observations. Based on

these observations and the noiseless trajectory waypoints $x_0, x_1^{(m)}, \dots, x_n^{(m)}$ for trajectory class m in the codebook, the algorithm seeks a minimizer of the following least squares problem:
\[
\underset{\rho^{(m)}}{\arg\min} \sum_{k=0}^{n} \left\| z_k - h\big( (\rho^{(m)\top}, x_k^{(m)\top})^\top \big) \right\|_2^2, \tag{2.21}
\]
where $\rho^{(m)}$ is the receiver's pose assuming the trajectory class m.

Each run of the Levenberg-Marquardt algorithm itself requires an initial estimate $\rho_0^{(m)}$ and the algorithm runs until a local minimum is found or a desired number of iterations is finished. For better performance, it is preferred that the initial value $\rho_0^{(m)}$ be a reasonable estimate. We employ the Direct Linear Transformation algorithm to compute this initial estimate from the same set of initial observations and the waypoint data from the codebook. The Direct Linear Transformation algorithm directly estimates the projection matrix P by solving a linear system of equations, from which we can extract the pose information $\rho_0^{(m)}$ using (2.7), (2.8) and (2.9). The details are omitted for brevity and readers are referred to [89, 33] for further information.

2.4.2 Prior estimate

After evaluating $\{\rho^{(m)} \mid m = 1, \dots, M\}$, the prior is specified by the mean $\mu_0^{(m)} = (\rho^{(m)\top}, x_0^\top)^\top$, covariance matrix $\Sigma_0^{(m)}$, and the initial trajectory class probability $\phi_0^{(m)} = P(m)$ for each $m \in \{1, \dots, M\}$. We could utilize the information of how well the nonlinear least squares performed to initialize $\phi_0^{(m)}$. Under the assumption that the noise is independent and Gaussian-distributed, $\phi_0^{(m)}$ can be evaluated by taking the likelihood of the set of observations using Bayes' rule:
\[
\phi_0^{(m)} \propto \exp\left( -\frac{1}{2\sigma^2} \sum_{k=0}^{n} \left\| z_k - h\big( (\rho^{(m)\top}, x_k^{(m)\top})^\top \big) \right\|_2^2 \right), \tag{2.22}
\]
in which the residual of the least squares appears and σ is a parameter. In the experiment we used σ = 1 × 10².
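The initialization procedure of (2.21)–(2.22) could be sketched as follows in Python, using SciPy's Levenberg-Marquardt solver. The observation function h and the DLT-based initial guesses are assumed to come from elsewhere, and the helper names are hypothetical.

```python
import numpy as np
from scipy.optimize import least_squares

def initialize_prior(zs, codebook, h, rho0_list, sigma=1e2):
    """Solve (2.21) per class with Levenberg-Marquardt and weight each class
    by the residual likelihood (2.22). zs: the n+1 initial observations;
    codebook[m]: noiseless waypoints of class m; h(rho, x): observation
    function; rho0_list[m]: initial pose guess (e.g. from the DLT)."""
    poses, sq_residuals = [], []
    for m, waypoints in enumerate(codebook):
        def residual(rho, wps=waypoints):
            # stacked reprojection errors z_k - h((rho, x_k))
            return np.concatenate([z - h(rho, x) for z, x in zip(zs, wps)])
        sol = least_squares(residual, rho0_list[m], method="lm")
        poses.append(sol.x)
        sq_residuals.append(np.sum(sol.fun ** 2))
    log_phi = -0.5 * np.asarray(sq_residuals) / sigma**2    # (2.22)
    phi0 = np.exp(log_phi - log_phi.max())                  # stable exponentiation
    return poses, phi0 / phi0.sum()
```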



2.5 Entropy-based Active Vision Control

Given the current estimate of s and m, our goal is to control the position and the attitude of the

receiver in order to correctly classify the trajectory and decode the message. One information-

theoretic approach to this active vision control is to select uk from the pre-specified control space U ,

so that the entropy over the trajectory class distribution is maximally decreased over time. To this

end, we formulate the optimal control problem as a belief space planning problem, more specifically

a Markov decision process (MDP) [14, 79] in the belief space. To compute the policy, we adapt the

Monte Carlo Tree Search (MCTS) algorithm, an approximate dynamic programming method that

computes near-optimal solutions in a given amount of time. In our implementation, we employ its

most well-known variant called Upper Confidence Trees [81] with an extension known as Double

Progressive Widening [32].

2.5.1 Belief MDP for motion-based communication

MDPs provide a general framework for an autonomous agent to act optimally in a fully observable

environment in order to maximize its expected cumulative reward, or the value function [79]. In

a belief MDP, the state space is the entire belief space. A belief in our problem is defined by the

following tuple:

\[
b \triangleq \big( (\phi^{(1)}, \mu^{(1)}, \Sigma^{(1)}), \dots, (\phi^{(M)}, \mu^{(M)}, \Sigma^{(M)}) \big). \tag{2.23}
\]

Belief transition model

We first need to define the state transition model for the belief b and the reward model related to

the state transition. Recall from Algorithm 2.1 that the belief transition is completely described

by the Multi-hypothesis Extended Kalman Filter equations once given the control input uk and the

observation zk+1. There is stochasticity involved in this transition due to the observation. However,

we can still specify the distribution from which a new observation is drawn based on the current

prior belief b that has z1:k, u1:k as evidence variables. Namely,

\[
p(z_{k+1} \mid z_{1:k}, u_{1:k}) = \sum_{m=1}^{M} p(z_{k+1} \mid m, z_{1:k}, u_{1:k})\, P(m \mid z_{1:k}, u_{1:k}), \tag{2.24}
\]
where $p(z_{k+1} \mid m, z_{1:k}, u_{1:k}) = \mathcal{N}\big(z_{k+1}; h(\mu_{k+1}^{(m)}), H_{k+1}^{(m)}\big)$ and $P(m \mid z_{1:k}, u_{1:k}) = \phi_k^{(m)}$. Therefore, for

from the following Gaussian mixture distribution:

zk+1 ∼M∑m=1

φ(m)k N

(zk+1;h(µ

(m)k+1), H

(m)k+1

)(2.25)



Algorithm 2.2 Sample a new belief and a reward.

INPUT: current belief b, control input u.
OUTPUT: new belief b′, reward r.

1: Compute µ^(m) and H^(m) for each m from b.
2: Sample z ∼ Σ_{m=1}^{M} φ^(m) N(z; h(µ^(m)), H^(m)).
3: Obtain b′ from u and z with Algorithm 2.1.
4: Compute r = Reward(b, b′) (2.27).
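A minimal sketch of Algorithm 2.2, assuming helper functions wired to the earlier snippets: a predictor returning h(µ^(m)) and H^(m) per class, the MHEKF step, and the reward of (2.27). The argument names are illustrative.

```python
import numpy as np

def sample_belief_transition(hypotheses, u, predict, mhekf_step, reward, rng):
    """Algorithm 2.2: draw z from the Gaussian mixture (2.25) by first picking
    a component with probability phi^(m), then run the MHEKF and score the
    resulting belief with the entropy-based reward (2.27)."""
    phis = np.array([phi for (_, _, phi) in hypotheses])
    m = rng.choice(len(hypotheses), p=phis)          # pick a mixture component
    z_mean, H = predict(hypotheses[m], u)            # h(mu^(m)), H^(m)
    z = rng.multivariate_normal(z_mean, H)           # sampled observation
    new_hypotheses = mhekf_step(hypotheses, u, z)    # Algorithm 2.1
    return new_hypotheses, reward(hypotheses, new_hypotheses)
```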

Reward model

The reward needs to be designed so that the receiver robot exhibits desired behavior. In the motion-

based communication problem, we would like to reduce the uncertainty in the trajectory class dis-

tribution. Entropy provides a measure of uncertainty, which is defined as follows:

\[
H[\mathsf{M}] = -\sum_{m=1}^{M} P(m) \log P(m), \tag{2.26}
\]

where M is the random variable representing the trajectory class.

Since our MDP is defined in the belief space, we can directly use the entropy as our reward. The

reward function Reward(b, b′) is determined by the change in entropy between the current belief b

and a new belief b′:
\[
\mathrm{Reward}(b, b') = l\left(H[\mathsf{M} \mid b]\right) - l\left(H[\mathsf{M} \mid b']\right), \tag{2.27}
\]

where l : R+ → R is a strictly monotonically decreasing function. Note that the reward is positive

when the new belief b′ results in a smaller entropy than b since the receiver aims at maximizing the

reward.

In the simulation experiments we chose l(·) = log(·). This is to prevent the reward from effectively

getting discounted as the entropy is reduced, because the entropy of a categorical distribution (2.26)

is always lower-bounded by zero.

The resulting sampling algorithm is summarized in Algorithm 2.2.
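In code, the entropy objective (2.26) and the reward (2.27) with l = log amount to just a few lines; the small epsilon guarding log(0) below is an implementation detail of this sketch, not part of the formulation.

```python
import numpy as np

def class_entropy(phis, eps=1e-12):
    """Entropy (2.26) of the categorical trajectory class distribution."""
    p = np.asarray(phis)
    return float(-np.sum(p * np.log(p + eps)))

def entropy_reward(phis_before, phis_after, eps=1e-12):
    """Reward (2.27) with l = log: positive iff the new belief has lower
    class entropy than the current one."""
    return (np.log(class_entropy(phis_before) + eps)
            - np.log(class_entropy(phis_after) + eps))
```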

2.5.2 Upper Confidence Trees

The Upper Confidence Trees algorithm approximates the state-action value function using Monte

Carlo simulations with a generative model that gives reward and state transition samples. It is an

online method that only considers a certain depth from the current state. The depth and the number

of iterations are limited by the amount of computational resources available before selecting the next

action. Starting from the current state as the root node, in each iteration the algorithm gradually

grows a search tree. A node in the tree represents a reachable state and an edge corresponds to an



action. In our belief MDP, a state is a belief b and an action is a control input u.

One iteration consists of a simulation with four different stages listed below [23].

1. Selection. This stage recursively selects a child node in the tree according to a child selection

policy called UCB1 until a leaf node is reached. A detailed explanation on UCB1 is provided

below.

2. Expansion. When a new state is observed that did not exist in the tree at the selection step,

the state is added to the tree as a new leaf node.

3. Simulation. If a new leaf node is added in the expansion stage, a simulation is run for the

rest of the depth based on a rollout policy to compute the initial approximate value of the

node.

4. Backpropagation. Lastly, the simulation result is back-propagated from the leaf node to

the root node through the selected nodes, updating the visit counts and the approximated

state-action values for each node.

UCB1

The key aspect of the algorithm is the use of a search heuristic called UCB1, which well balances

the exploration and exploitation trade-off. At node b, control input u is chosen so that it maximizes

the following score [32]:

\[
\mathrm{score}_t(b, u) = \frac{\mathrm{totalReward}_t(b, u)}{n_t(b, u) + 1} + k_{\mathrm{ucb}} \sqrt{\frac{\log(t)}{n_t(b, u) + 1}}, \tag{2.28}
\]

where t is the number of times the node b has been visited, $n_t(b, u)$ is the number of times the control input u has been applied at b, and $\mathrm{totalReward}_t(b, u)$ is the total cumulative reward obtained so far by executing u at b. $k_{\mathrm{ucb}}$ is a parameter that controls the amount of exploration.
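For reference, the UCB1 selection rule (2.28) could look as follows; the node layout (visit count t, per-action counts and totals stored in dictionaries) is an assumed data structure, and the default kucb = 10.0 matches Table 2.1.

```python
import math

def ucb1_select(node, k_ucb=10.0):
    """Pick the control input maximizing the UCB1 score (2.28). `node` is
    assumed to expose node.t (visit count), node.n[u], node.total_reward[u],
    and node.actions (the currently available control inputs)."""
    def score(u):
        exploit = node.total_reward[u] / (node.n[u] + 1)
        explore = k_ucb * math.sqrt(math.log(node.t) / (node.n[u] + 1))
        return exploit + explore
    return max(node.actions, key=score)
```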

Double Progressive Widening

The conventional Upper Confidence Trees algorithm assumes small discrete action and state spaces.

Nevertheless, the belief MDP defined above consists of stochastic transitions in a continuous domain

and the probability of reaching the same belief state twice is zero. Couetoux et al. [32] point out

that the algorithm fails in such a case and propose a modification as follows. Instead of always

transitioning to a new state using the generative model, a new state is added to the tree only if the number of children is less than or equal to $\lceil C\, n_t(b, u)^\alpha \rceil$, where $C > 0$ and $\alpha \in (0, 1)$ are constants; otherwise an existing child is chosen with probability $n_t(b, u, b')/n_t(b, u)$, where $n_t(b, u, b')$ is the number of transitions observed from b to b′ with u. A similar strategy is used to limit the



Figure 2.3: (Left) Simulation environment with the receiver in the front and the sender in the back. (Right) An image obtained by the forward-facing camera mounted on the receiver. The smooth yellow trajectory is overlaid for clarity.

number of actions available if the action space is also large. The k-th action becomes available when $k = \lceil C_u\, t^{\alpha_u} \rceil$ with constants $C_u > 0$ and $\alpha_u \in (0, 1)$. This restriction is known as Progressive

Widening, and the algorithm is called Double Progressive Widening as it is applied to both the

(belief) states and the actions.
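Both widening conditions translate directly into code. The sketch below uses the constants of Table 2.1 as defaults and leaves the tree data structures as assumptions.

```python
import math

def may_add_child(n_bu, num_children, C=3.0, alpha=0.1):
    """State widening: sample a new child belief only while the number of
    children does not exceed ceil(C * n_t(b, u)^alpha); otherwise an existing
    child should be revisited in proportion to its transition counts."""
    return num_children <= math.ceil(C * n_bu ** alpha)

def num_available_actions(t, C_u=10.0, alpha_u=0.3):
    """Action widening: the k-th action becomes available once
    k = ceil(C_u * t^alpha_u), where t is the node's visit count."""
    return math.ceil(C_u * t ** alpha_u)
```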

2.6 Experimental Results

2.6.1 Simulation setup

We designed our experiment in Gazebo [82], a multi-robot simulator compatible with ROS [122].

Two Hummingbird [66] quadrotors are used in the simulation, whose dynamics are simulated with

RotorS [49]. Figure 2.3 illustrates the simulation environment1.

The trajectory codebook presented in Figure 2.4 consists of three different elliptic trajectories.

A trajectory is represented as an ordered collection of 6 waypoints. The sender takes 7 seconds to

transition between successive waypoints, which corresponds to a single timestep. The codebook was

deliberately designed to make the classification challenging. Indeed, the prior initialization with the

Levenberg-Marquardt algorithm resulted in an ambiguous classification as depicted in Figure 2.5.

The parameters used in the Multi-hypothesis Extended Kalman Filter and the Double Progressive

Widening algorithms are summarized in Table 2.1. The horizontal and the vertical field-of-views

(FoVs) of the camera are 80.1 [deg] and 64.5 [deg], respectively. Initially the two quadrotors are 4.2

[m] apart in the horizontal direction. In controlling the receiver, we only allow lateral translational

motion and yaw rotation as possible control inputs: utrans = (0, uy, 0)T and urot = (0, 0, uyaw)T with

1The supplemental video is available at https://youtu.be/u4HxiA2d2IQ.



Figure 2.4: 2D trajectory codebook used in the experiment. The sender robot moves clockwise. The trajectories are intentionally ambiguous from different observer angles, to make trajectory classification difficult for the receiver.

Figure 2.5: (Left) A set of waypoint measurements used to initialize the prior. The smooth yellow trajectory and the waypoints are overlaid for clarity. (Right) Prior distribution obtained from the left image.



Table 2.1: Parameter values for MHEKF and MCTS.

Parameter   Value                                    Parameter   Value
Q           diag(Qr, Qt, Qx)                         kucb        10.0
Qr          diag(1 × 10⁻³, 1 × 10⁻³, 1 × 10⁻²)       C           3.0
Qt          diag(5 × 10⁻², 5 × 10⁻², 1 × 10⁻³)       α           0.1
Qx          diag(1 × 10⁻⁴, 1 × 10⁻³, 1 × 10⁻³)       Cu          10.0
R           diag(1 × 10¹, 1 × 10¹)                   αu          0.3

Figure 2.6: Classification error (left) and entropy (right) averaged over 30 cases. The MCTS active vision policy outperforms greedy and random in both criteria.

$u_y \in \{-1, -0.5, 0, 0.5, 1\}$ m and $u_{\mathrm{yaw}} \in \{-45, -25, -10, 0, 10, 25, 45\}$ deg. The resulting size of the control space is $|\mathcal{U}| = 35$.

We compared the performance of the proposed control policy with a greedy algorithm and a

random policy. The greedy algorithm still employs the Monte Carlo simulation to estimate the

reward, but it only considers the immediate next action and performs Algorithm 2.2 ten times per

action to select one that has the maximum average reward.

2.6.2 Simulation results

The classification and the pose estimation performance was evaluated through 30 simulation episodes

with different trajectory classes and priors. In each episode the sender repeated the trajectory

execution 6 times, with an interval time of 7 seconds between each execution. We observed that the

MCTS and the greedy policies both actively controlled the yaw rotation as well as the horizontal

position, trying to keep the sender inside the FoV of the camera. As can be seen in Figure 2.6,

the MCTS control policy resulted in the lowest classification error, which is defined as 1−P (mtrue)

where mtrue is the true trajectory class. Even though all the policies achieved similar errors after



Figure 2.7: Attitude error (left) and position error (right) averaged over 30 cases. The MCTS active vision policy performs the best for pose estimation, even though it is not explicitly trying to optimize the relative pose estimate.

step 22, there are still notable differences in the entropy, indicating that with the proposed approach

the receiver is most confident in its classification.

However, this performance improvement was obtained at the expense of computation time. On

average, MCTS with Double Progressive Widening took 4.0 seconds to compute the next action,

with the maximum tree depth of 3 and 400 iterations. The greedy algorithm only took 1.6 seconds

to perform 350 iterations. Obviously, the random policy ran much faster than the other two, taking

0.0024 seconds. All the computation was performed on a laptop with Intel Core i7-6700HQ CPU

and 16GB RAM. Considering this runtime and the classification performance discussed above, there

is a trade-off between the proposed approach and the greedy policy since the performance of the

greedy policy is not significantly worse than the proposed method.

The MCTS active vision policy also outperformed the other two policies in the pose estimate,

even though the filtering algorithm was not able to correct the pose error within 30 steps regardless

of the control policy. In Figure 2.7, the attitude error is represented by the magnitude of the rotation

angle between the true attitude and the mean estimate. The position error is given by the distance

between the true position and the mean estimate, which is divided by the initial distance of the

two robots to give a normalized unit-less distance error. Interestingly, the results suggest that the

classification can become correct when the pose estimate is still erroneous.

In order to confirm that the filter essentially improves the pose estimate over time, we ran an

additional episode for 100 steps with the proposed control policy. The result is presented in Figure

2.8. Both the attitude error and the position error converged to near-zero values after 100 steps.



Figure 2.8: Attitude error (top) and position error (bottom) for a single long episode with class 1 trajectories. The MCTS active vision policy was used to move the receiver. Eventually, the relative pose estimation converges even though the receiver's controller optimizes for trajectory class, not pose error.

2.7 Conclusions

This chapter presents an online active vision algorithm via belief space planning for motion-based

communication between two robots. Intent inference difficulties arising from the monocular vision

model and the unknown relative pose are addressed. We present a recursive Bayesian inference algo-

rithm for simultaneous trajectory classification and relative pose estimation. We further propose an

information-theoretic approach to active vision, in which an entropy-based objective is optimized via

the belief MDP formulation and the MCTS algorithm with DPW. Through the extensive simulation

study, we have demonstrated that the proposed method significantly improves both the classification

accuracy and confidence compared to baseline policies. Future research directions include improving

our active vision method to take account of certain constraints such as collision avoidance. We are

also interested in generalizing this vision-based belief space planning framework and applying it to other applications such as autonomous driving.


Chapter 3

Fast Approximate Belief Space Planning

In Chapter 2, we discussed how to formulate a belief space planning problem to derive an active

vision policy for inferring intent of other interacting agents. In this chapter, we focus on the theory

of belief space planning rather than a specific application, which includes a holistic review of various

belief space planning algorithms that exist in the literature. We address the challenge of performing

fast decision-making in continuous spaces with high-frequency observations, a crucial capability for

autonomous robots operating in open and interactive environments.

Specifically, we propose a novel online framework for fast, general-purpose belief space planning.

Unlike other existing works, our approach presented in this chapter is based on the perturbation

theory of differential equations and extends Sequential Action Control (SAC) [6] to stochastic belief

dynamics. The resulting algorithm, which we name SACBP, is a stochastic optimal control algo-

rithm for belief space planning and does not require discretization of spaces or time. SACBP is an

anytime algorithm that synthesizes control signals in near real-time, and can handle general para-

metric1 belief dynamics under certain assumptions. Despite being an approximate (i.e. suboptimal)

method, we demonstrate the effectiveness of SACBP in an active sensing scenario and a model-based

Bayesian reinforcement learning problem. In these challenging problems, we show that the algorithm

significantly outperforms other online and offline solution methods including approximate dynamic

programming and local trajectory optimization. The materials reported in this chapter are based

on our conference paper [111] and the extended journal version [112].

1In this chapter, we use the term “parametric” to mean that the functional form of the belief distribution is fixed. Many of the known belief dynamics (i.e. Bayesian filters) are parametric, except the particle filter [154] that is usually considered non-parametric. Section 3.2.1 gives a detailed explanation on the possible choice of filters.




3.1 Introduction and Related Work

Planning under uncertainty remains a challenge for robotic systems. Various sources of

uncertainty, including unknown states, stochastic disturbances, and imperfect sensing, significantly

complicate problems that are otherwise easy. For example, suppose that a robot needs to manipulate

an object from some initial state to a desired goal. If the mass properties of the object are not known

beforehand, the robot needs to simultaneously estimate these parameters and perform control, while

taking into account the effects of their uncertainty; the exploration and exploitation trade-off needs to

be resolved [145]. On the other hand, uncertainty is quite fundamental in motivating some problems.

For instance, a noisy sensor may encourage the robot to carefully plan a trajectory so the observations

taken along it are sufficiently informative. This type of problem concerns pure information gathering

and is often referred to as active sensing [103], active perception [9], or informative motion planning

[63], wherein the sole objective is to reduce state uncertainty using onboard perception.

A principled approach to address all those problems is to form plans in the belief space, where

the planner chooses sequential control inputs based on the evolution of the belief state, to optimize a

belief-based cost accumulated over time. This approach enables the robot to appropriately execute

controls under state and transition uncertainties since they are both naturally incorporated into the

belief. Belief space planning is well suited also for generating information gathering actions [119].

This chapter proposes a novel online algorithm for fast, general-purpose belief space planning.

It does not require discretization of the state space or the action space, and can directly han-

dle continuous-time system dynamics. The algorithm provides an approximate solution to belief

space planning; instead of seeking for a globally optimal control policy, it constantly perturbs a

given nominal policy to optimize the expected value of a first-order cost reduction, proceeding in

a receding-horizon fashion. We are inspired by the Sequential Action Control (SAC) algorithm

recently proposed in [6] for model-based deterministic optimal control problems. SAC is an online

method to synthesize control signals in real time for challenging (but deterministic) physical systems

such as a cart pendulum and a spring-loaded inverted pendulum. Based on the concept of SAC, this

chapter develops an algorithmic framework to control stochastic belief systems whose dynamics are

governed by parametric Bayesian filters.

3.1.1 Related Work in Belief Space Planning

There is a large body of literature in belief space planning. Below we briefly review relevant work

in three major types of solution methods: greedy strategies, trajectory optimization methods, and

belief MDP and POMDP approaches. We then change our perspective and discuss advantages and

drawbacks of closed-loop and open-loop planning schemes, while also introducing other relevant

works that do not necessarily fall into the three categories mentioned above.



Greedy Strategies. Belief space planning is known to be challenging for a couple of reasons.

First, the belief state is continuous and can be high-dimensional even if the underlying state space

is small or discrete. Second, the dynamics that govern the belief state transitions are stochastic due

to unknown future observations. Greedy approaches alleviate the complexity by ignoring long-term

effects and solve single-shot decision making problems sequentially. Despite their suboptimality for

long-term planning, these methods are often employed to find computationally tractable solutions

and achieve reasonable performance in different problems [18, 139, 138], especially in the active

sensing domain.

Trajectory Optimization Methods. In contrast to the greedy approaches, trajectory optimiza-

tion methods take into account multiple timesteps at once and find non-myopic solutions. In doing

so, it is often assumed that the maximum likelihood observation (MLO) will always occur during

planning [119, 45, 114]. This heuristic assumption yields a deterministic optimal control problem,

to which various nonlinear trajectory optimization algorithms are applicable. However, ignoring the

effects of stochastic future observations can degrade the performance [161]. Other methods [161, 123]

that do not rely on the MLO assumption are advantageous in that regard. In particular, belief iLQG

[161] performs iterative local optimization in a Gaussian belief space by quadratically approximat-

ing the value function and linearizing the dynamics to obtain a time-varying affine feedback policy.

However, this method as well as many other solution techniques in this category result in multiple

iterations of intensive computation and can require a significant amount of time until convergence.

Belief MDP and POMDP Approaches. Belief space planning can be modeled as a Markov

decision process (MDP) in the belief space, given that the belief state transition is Markovian. If

the reward (or cost) is defined as an explicit function of the state and the control, the problem

is equivalent to a partially observable Markov decision process (POMDP) [74]. A key challenge in

POMDPs and belief MDPs has been to address problems with large state spaces. This is particularly

important in belief MDPs since the state space for a belief MDP is a continuous belief space. To

handle continuous spaces, Couetoux et al. [32] introduce double progressive widening (DPW) for

MCTS [19]. In [145], this MCTS-DPW algorithm is run in a Gaussian belief space to solve the object

manipulation problem mentioned earlier. We too have applied MCTS-DPW to active vision-based

intent inference in Chapter 2, in which a non-myopic policy is derived to reduce the entropy of a

multi-modal belief. Sunberg and Kochenderfer [150] study suboptimality of DPW in POMDPs to

propose POMCPOW, which effectively handles continuous state, action, and observation spaces by

maintaining weighted particle mixture beliefs.

While MCTS-DPW as well as other general purpose POMDP methods [150, 147] are capable

of handling continuous state spaces, their algorithmic concepts are rooted in dynamic programming

and tree search, requiring a sufficient amount of exploration in the tree. (Indeed, our MCTS-DPW

implementation in Chapter 2 assumes that the robot has up to 7 seconds to compute its next action.)



The tree search technique also implicitly assumes discrete-time transition models. In fact, most

prior works reviewed in this section are intended for discrete-time systems. A notable exception

is [24], in which a sequence of discrete-time POMDP approximations is constructed such that it

converges to a true continuous-time model. However, this approach still relies on an existing POMDP

solver [84] that is designed for discrete systems. There still remains a need for an efficient and

high-performance belief space planning algorithm that is capable of directly handling systems with

inherently continuous-space, continuous-time dynamics, such as maneuvering micro-aerial vehicles,

or autonomous cars at freeway speeds.

Closed-loop and Open-loop Planning. An important aspect of belief space planning is the

stochastic belief dynamics, which essentially demand that one plan over closed-loop (i.e. feedback)

policies. This is a primary assumption in belief MDP and POMDP frameworks where an optimal

mapping from beliefs to control actions is sought [65, 79]. However, computing exact optimal policies in general is intractable due to the curse of dimensionality and the curse of history [113, 96, 147].

Therefore, in practice we need to make certain approximations to achieve a tractable solution within

a reasonable computation time, whether the solution method is online or offline. For POMDP

methods, those approximations are often done by discretization and/or sampling [84, 24, 147, 150].

Belief iLQG [161] makes an approximation by restricting the policy to a time-varying affine feedback

controller and performing local optimization. Other methods that provide closed-loop policies in-

clude Feedback-based Information RoadMap (FIRM) [2], its extension with online local re-planning

[1], and T-LQG [123]. FIRM-based methods, which are developed primarily for motion planning

problems, construct a graph in the belief space whose edges correspond to local LQG controllers.

T-LQG approximately decouples belief space planning into local trajectory optimization and LQG

tracking via a use of the separation principle. The local trajectory optimization part achieves one of

the lowest asymptotic computational complexities among existing trajectory optimization methods,

and thus can be solved efficiently with a commercial nonlinear programming (NLP) solver [123].

Even though closed-loop methods are appealing, their computation cost can be prohibitive in

some challenging planning and control problems, especially when the state space is large and yet

real-time performance is required. In such cases, a practical strategy to alleviate the computa-

tional burden is open-loop planning, wherein one seeks to optimize a static sequence of control

actions. Oftentimes feedback is provided through re-planning of a fixed-horizon, open-loop control

sequence after making a new observation. This combination of open-loop planning and feedback

through re-planning is called receding horizon control (RHC) or model predictive control (MPC)

[65]. Although they do not consider feedback at each planning time, MPC methods in general

have successfully solved challenging planning and control problems with high efficiency where fast

closed-loop policy computation can be impractical. Examples of such real-time applications include

trajectory tracking with quadrotors [10] and agile autonomous driving on dirt [175], although they

are both in the optimal control literature. Within belief space planning, some methods based on the



MLO assumption [118, 114] are run as MPC.

3.1.2 Contributions

Our approach presented in this chapter takes the form of MPC, which is reasonable for high-

dimensional and continuous belief space planning problems that also demand highly-dynamic, real-

time maneuvers with high frequency observation updates. However, the proposed method is signif-

icantly different from the previous approaches discussed in Section 3.1.1. We view the stochastic

belief dynamics as a hybrid system with time-driven switching [61], where the controls are applied in

continuous time and the observations are made in discrete time. A discrete-time observation creates

a jump discontinuity in the belief state trajectory due to a sudden Bayesian update of the belief

state. This view of belief space planning yields a continuous-time optimal control problem of a high-

dimensional hybrid system. We then propose a model-based control algorithm to efficiently compute

control signals in a receding-horizon fashion. The algorithm is based on Sequential Action Control

(SAC) [6]. SAC in its original form is a deterministic, model-based hybrid control algorithm, which

“perturbs” a nominal control trajectory in a structured way so that the cost functional is optimally

reduced up to the first order. The key to this approach is a careful use of the perturbation theory

of differential equations that is often discussed in the mode scheduling literature [41, 172]. As a

result, SAC derives the optimal perturbation in closed form and synthesizes control signals at a high

frequency to achieve a significant improvement over other optimal control methods that are based

on local trajectory optimization [6].

We apply the perturbation theory to parametric Bayesian filters and derive the optimal control

perturbation using the framework of SAC. Even though each control perturbation is small, high-

frequency control synthesis and online re-planning yield a series of control actions that is significantly

different from the nominal control, reacting to stochastic observations collected during execution or

online changing conditions. Furthermore, we extend the original SAC algorithm to also take account

of stochasticity in the future observations during planning, by incorporating Monte Carlo sampling

of nominal belief trajectories. Our key contribution is the resulting continuous belief space planning

algorithm, which we name SACBP. The algorithm has the following desirable properties:

1. Although the form of control perturbation is open-loop with on-line re-planning, the perturba-

tion computed by SACBP in near real-time is optimized for better average performance over

the planning horizon than a given nominal control, whether it is open-loop or closed-loop.

2. SACBP does not require discretization of the state space, the observation space, or the control

space. It also does not require discretization of time other than for numerical integration

purposes.

3. General nonlinear parametric Bayesian filters can be used for state estimation as long as the

system is control-affine and the control cost is quadratic.


4. Stochasticity in the future observations is fully considered.

5. SACBP is an anytime algorithm. Furthermore, the Monte Carlo sampling part of the algorithm

is naturally parallelizable.

6. Even though SACBP is inherently suboptimal for the original stochastic optimal control prob-

lem, empirical results suggest that it is highly sample-efficient and outperforms other open-loop

and closed-loop methods when near real-time performance is required.

There exists prior work [98] that uses SAC for active sensing, but its problem formulation re-

lies on the ergodic control framework, which is significantly different from the belief space planning

framework we propose here. We show that our SACBP outperforms projection-based ergodic trajec-

tory optimization, MCTS-DPW, T-LQG, and a greedy method on an active multi-target tracking

example. We also show that SACBP outperforms belief iLQG, MCTS-DPW, and T-LQG on a

manipulation scenario.

In the next section we derive relevant equations and present the SACBP algorithm along with

a discussion on computational complexity. Section 3.3 provides key results of our mathematical

analysis that lead to a performance guarantee for SACBP. Section 3.4 summarizes the simulation

results. Conclusions and future work are presented in Section 3.5.

3.2 SACBP Algorithm

We first consider the case where some components of the state are fully observable. We begin with

this mixed observability case as it is simpler to explain, yet still practically relevant. For example,

this is a common assumption in various active sensing problems [138, 86, 121] where the state of

the robot is perfectly known, but some external variable of interest (e.g. a target’s location) is

stochastic. In addition, deterministic state transitions are often assumed for the robot. Therefore,

in Section 3.2.1 we derive the SACBP control update formulae for this case. The general belief space

planning where none of the state is fully observable or deterministically controlled is discussed in

Section 3.2.2. An extension to use a closed-loop policy as the nominal control is presented in Section

3.2.3. The computation time complexity is discussed in Section 3.2.4.

3.2.1 Problems with Mixed Observability

Suppose that a robot can fully observe and deterministically control its own state $p(t) \in \mathbb{R}^{n_p}$. Other
external states over which the robot does not have direct control are not known and are estimated
with the belief vector $b(t) \in \mathbb{R}^{n_b}$. This belief vector characterizes a probability distribution that the

robot uses for state estimation. If the belief is Gaussian, for example, the covariance matrix can be

vectorized column-wise and stacked all together with the mean to form the belief vector. We define

the augmented state as $s \triangleq (p^\top, b^\top)^\top \in \mathbb{R}^{n_s}$.


Dynamics Model

The physical state p is described by the following ordinary differential equation (ODE):

\[
\dot{p}(t) = f(p(t), u(t)), \tag{3.1}
\]

where $u(t) \in \mathbb{R}^m$ is the control signal. On the other hand, suppose that the belief state only changes

in discrete time upon arrival of a new observation from the sensors. In the case of target tracking,

for example, this means that the change of the location of the target is described by a discrete

time model from the robot’s perspective. The behavior of such a system can be estimated by a

discrete-time Bayesian filter. We will discuss the more general continuous-discrete time filtering

case in Section 3.2.2. Let tk be the time when the k-th observation becomes available to the robot.

The belief state transition is given by
\[
\begin{cases}
b(t_k) = g\big(p(t_k^-),\, b(t_k^-),\, y_k\big) \\[2pt]
b(t) = b(t_k) \quad \forall t \in [t_k, t_{k+1}),
\end{cases} \tag{3.2}
\]

where $t_k^-$ is infinitesimally smaller than $t_k$. The nonlinear function $g$ corresponds to a discrete-time,

parametric Bayesian filter (e.g., Kalman filter, extended Kalman filter, discrete Bayesian filter,

etc.²) that forward-propagates the belief for prediction, takes the new observation $y_k \in \mathbb{R}^q$, and

returns the updated belief state. The concrete choice of the filter depends on the instance of the

problem. Note that the belief state stays constant for most of the time in (3.2). This is because

we are viewing a discrete time model in a continuous time framework.

Equations (3.1) and (3.2) constitute a hybrid system with time-driven switching [61]. This hybrid

system representation is practical since it captures the fact that the observation updates occur less

frequently than the control actuation in general, due to expensive information processing of sensor

readings. Furthermore, with this representation one can naturally handle agile robot dynamics as

they are without coarse discretization in time.

Given the initial state $s_0 \triangleq (p(t_0)^\top, b(t_0)^\top)^\top$ and some control $u(t)$ for $t \in [t_0, t_f]$, the system

evolves stochastically according to the hybrid dynamics equations. The stochasticity is due to a

sequence of stochastic future observations that will be taken by the end of the planning horizon tf .

In this chapter we assume that the observation interval $t_{k+1} - t_k \triangleq \Delta t_o$ is fixed, and that the control

signals are recomputed when a new observation is incorporated in the belief, although one can also

use a variable observation interval.

²We do not assume the particle filter here, which is non-parametric, since we later require g to be differentiable, whereas the re-sampling process [154] in the particle filter is non-differentiable. Note, however, that a recent work [76] makes it differentiable by replacing the re-sampling process with a soft approximation. Consideration of such differentiable approximations in SACBP is an interesting future research direction.
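To make the hybrid-system view concrete, the following is a minimal Julia sketch of forward-simulating (3.1)–(3.2) with Euler integration. The callbacks `f`, `g`, and `sample_obs` are illustrative assumptions, not the thesis implementation.

```julia
# Minimal sketch: forward-simulate the hybrid dynamics (3.1)-(3.2).
# `f(p, u)` is the physical ODE right-hand side, `g(p, b, y)` is a
# discrete-time Bayesian filter update, and `sample_obs(p, b)` draws an
# observation from the current belief; all three are user-supplied.
function simulate_hybrid(p0, b0, u, f, g, sample_obs; t0=0.0, tf=2.0, Δto=0.2, Δtc=0.01)
    p, b, t = copy(p0), copy(b0), t0
    t_next_obs = t0 + Δto
    while t < tf - 1e-12
        p += Δtc * f(p, u(t))         # continuous-time actuation (3.1)
        t += Δtc
        if t >= t_next_obs - 1e-12    # observation arrives: belief jump (3.2)
            y = sample_obs(p, b)
            b = g(p, b, y)            # Bayesian update creates the jump discontinuity
            t_next_obs += Δto
        end
    end
    return p, b
end
```

Between observations the belief vector stays constant, exactly as (3.2) states; only the physical state integrates continuously.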


Perturbed Dynamics

The control synthesis of SACBP begins with a given nominal control schedule u(t) for t ∈ [t0, tf ]. For

simplicity we assume here that the nominal control schedule is open-loop, but we remind the reader

that SACBP can also handle closed-loop nominal policies, which we discuss in Section 3.2.3. Suppose

that the nominal control is applied to the system and a sequence of T observations (y1, . . . , yT ) is

obtained. Conditioned on the observation sequence, the augmented state evolves deterministically.

Let $s = (p^\top, b^\top)^\top$ be the nominal trajectory of the augmented state induced by $(y_1, \ldots, y_T)$.

Now let us consider perturbing the nominal trajectory at a fixed time τ < t1 for a short duration

$\varepsilon$. The perturbed control $u^\varepsilon$ is defined as
\[
u^\varepsilon(t) \triangleq
\begin{cases}
v & \text{if } t \in (\tau - \varepsilon, \tau] \\
u(t) & \text{otherwise.}
\end{cases} \tag{3.3}
\]

Therefore, the control perturbation is determined by the nominal control u(t), the tuple (τ, v), and

$\varepsilon$. Given $(\tau, v)$, the resulting perturbed system trajectory can be written as
\[
p^\varepsilon(t) \triangleq p(t) + \varepsilon\Psi_p(t) + o(\varepsilon), \qquad
b^\varepsilon(t) \triangleq b(t) + \varepsilon\Psi_b(t) + o(\varepsilon), \tag{3.4}
\]

where Ψp(t) and Ψb(t) are the state variations that are linear in the perturbation duration ε:

\[
\Psi_p(t) = \frac{\partial^+}{\partial\varepsilon} p^\varepsilon(t) \bigg|_{\varepsilon=0} \triangleq \lim_{\varepsilon\to 0^+} \frac{p^\varepsilon(t) - p(t)}{\varepsilon} \tag{3.5}
\]
\[
\Psi_b(t) = \frac{\partial^+}{\partial\varepsilon} b^\varepsilon(t) \bigg|_{\varepsilon=0} \triangleq \lim_{\varepsilon\to 0^+} \frac{b^\varepsilon(t) - b(t)}{\varepsilon}. \tag{3.6}
\]

The notation $\frac{\partial^+}{\partial\varepsilon}$ represents the right derivative with respect to $\varepsilon$. The state variations at perturbation time $\tau$ satisfy
\[
\Psi_p(\tau) = f(p(\tau), v) - f(p(\tau), u(\tau)), \qquad \Psi_b(\tau) = 0. \tag{3.7}
\]

The initial belief state variation Ψb(τ) is zero because the control perturbation uε has no effect

on the belief state until the first Bayesian update is performed at time t1, according to the hybrid

system model (3.1)(3.2). For t ≥ τ , the physical state variation Ψp evolves according to the following


first-order ODE:

\begin{align}
\dot{\Psi}_p(t) &= \frac{d}{dt}\left(\frac{\partial^+}{\partial\varepsilon} p^\varepsilon(t) \bigg|_{\varepsilon=0}\right) \tag{3.8} \\
&= \frac{\partial^+}{\partial\varepsilon} \dot{p}^\varepsilon(t) \bigg|_{\varepsilon=0} \tag{3.9} \\
&= \frac{\partial^+}{\partial\varepsilon} f(p^\varepsilon(t), u(t)) \bigg|_{\varepsilon=0} \tag{3.10} \\
&= \frac{\partial}{\partial p} f(p(t), u(t))\, \Psi_p(t), \tag{3.11}
\end{align}

where the chain rule of differentiation and $p^\varepsilon(t)|_{\varepsilon=0} = p(t)$ are used in (3.11). For a more rigorous

analysis, see Appendix A. The dynamics of the belief state variation Ψb in the continuous region

$t \in [t_k, t_{k+1})$ satisfy $\dot{\Psi}_b(t) = 0$ since the belief vector $b(t)$ is constant according to (3.2). However,

across the jumps the belief state variation Ψb changes discontinuously and satisfies

\begin{align}
\Psi_b(t_k) &= \frac{\partial^+}{\partial\varepsilon} b^\varepsilon(t_k) \bigg|_{\varepsilon=0} \tag{3.12} \\
&= \frac{\partial^+}{\partial\varepsilon} g\big(p^\varepsilon(t_k^-), b^\varepsilon(t_k^-), y_k\big) \bigg|_{\varepsilon=0} \tag{3.13} \\
&= \frac{\partial}{\partial p} g\big(p(t_k^-), b(t_k^-), y_k\big)\, \Psi_p(t_k^-) + \frac{\partial}{\partial b} g\big(p(t_k^-), b(t_k^-), y_k\big)\, \Psi_b(t_k^-). \tag{3.14}
\end{align}
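In code, the variational dynamics (3.7), (3.11), and (3.14) integrate forward alongside the nominal trajectory. Below is a minimal Julia sketch assuming the Jacobians of f and g are supplied by the user (e.g., analytically or via automatic differentiation); all names are illustrative.

```julia
# Sketch: propagate (Ψp, Ψb) from the perturbation grid index τidx to the
# end of the horizon. `dfdp(p, u)` is ∂f/∂p; `dgdp(p, b, y)` and
# `dgdb(p, b, y)` are the Jacobians of the Bayesian update g. `obs_idx`
# holds the grid indices of the observation times t_k, with samples `ys`.
function propagate_variations(ps, bs, us, obs_idx, ys, τidx, v, f, dfdp, dgdp, dgdb, Δtc)
    Ψp = f(ps[τidx], v) - f(ps[τidx], us[τidx])    # initial condition (3.7)
    Ψb = zero(bs[τidx])                            # belief variation starts at zero
    k = searchsortedfirst(obs_idx, τidx)           # first observation after τ
    for j in τidx+1:length(ps)
        Ψp += Δtc * (dfdp(ps[j-1], us[j-1]) * Ψp)  # continuous variation (3.11)
        if k <= length(obs_idx) && j == obs_idx[k] # belief jump at t_k (3.14)
            Ψp_pre = Ψp                            # values at t_k^-
            Ψb = dgdp(ps[j-1], bs[j-1], ys[k]) * Ψp_pre +
                 dgdb(ps[j-1], bs[j-1], ys[k]) * Ψb
            k += 1
        end
    end
    return Ψp, Ψb
end
```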

Perturbed Cost Functional

Let us consider a total cost of the form
\[
\int_{t_0}^{t_f} c\big(p(t), b(t), u(t)\big)\, dt + h\big(p(t_f), b(t_f)\big), \tag{3.15}
\]

where c is the running cost and h is the terminal cost. Following the discussion above on the

perturbed dynamics, let J denote the total cost of the nominal trajectory conditioned on the given

observation sequence (y1, . . . , yT ). Under the fixed (τ, v), we can represent the perturbed cost Jε in

terms of J as

\[
J^\varepsilon \triangleq J + \varepsilon\nu(t_f) + o(\varepsilon), \tag{3.16}
\]

where $\nu(t_f) \triangleq \frac{\partial^+}{\partial\varepsilon} J^\varepsilon \big|_{\varepsilon=0}$ is the variation of the total cost with respect to the perturbation. For further analysis it is convenient to express the running cost in the Mayer form [92]. Let $\tilde{s}(t)$ be a new state variable defined by $\dot{\tilde{s}}(t) = c\big(p(t), b(t), u(t)\big)$ and $\tilde{s}(t_0) = 0$. Then the total cost is a function of the appended augmented state $\bar{s} \triangleq (\tilde{s}, s^\top)^\top \in \mathbb{R}^{1+n_s}$ at time $t_f$, which is given by
\[
J = \tilde{s}(t_f) + h\big(s(t_f)\big). \tag{3.17}
\]


Using this form of the total cost $J$, the perturbed cost (3.16) becomes
\[
J^\varepsilon = J + \varepsilon
\begin{pmatrix}
1 \\[2pt] \frac{\partial}{\partial p} h\big(p(t_f), b(t_f)\big) \\[2pt] \frac{\partial}{\partial b} h\big(p(t_f), b(t_f)\big)
\end{pmatrix}^{\!\top}
\bar{\Psi}(t_f) + o(\varepsilon), \tag{3.18}
\]
where $\bar{\Psi}(t_f) \triangleq \big(\tilde{\Psi}(t_f), \Psi_p(t_f)^\top, \Psi_b(t_f)^\top\big)^\top$ and $\tilde{\Psi}$ is the variation of $\tilde{s}$. Note that the dot product in (3.18) corresponds to $\nu(t_f)$ in (3.16). The variation $\tilde{\Psi}$ follows the variational equation for $t \geq \tau$:

\begin{align}
\dot{\tilde{\Psi}}(t) &= \frac{d}{dt}\left(\frac{\partial^+}{\partial\varepsilon} \tilde{s}^\varepsilon(t) \bigg|_{\varepsilon=0}\right) \tag{3.19} \\
&= \frac{\partial^+}{\partial\varepsilon} \dot{\tilde{s}}^\varepsilon(t) \bigg|_{\varepsilon=0} \tag{3.20} \\
&= \frac{\partial^+}{\partial\varepsilon} c\big(p^\varepsilon(t), b^\varepsilon(t), u(t)\big) \bigg|_{\varepsilon=0} \tag{3.21} \\
&= \frac{\partial}{\partial p} c\big(p(t), b(t), u(t)\big)^\top \Psi_p(t) + \frac{\partial}{\partial b} c\big(p(t), b(t), u(t)\big)^\top \Psi_b(t), \tag{3.22}
\end{align}

where the initial condition is given by $\tilde{\Psi}(\tau) = c\big(p(\tau), b(\tau), v\big) - c\big(p(\tau), b(\tau), u(\tau)\big)$.

The perturbed cost equation (3.18), especially the dot product expressing ν(tf ), is consequential;

it tells us how the total cost changes due to the perturbation applied at some time τ , up to the first

order with respect to the perturbation duration ε. At this point, one could compute the value of

ν(tf ) for a control perturbation with a specific value of (τ, v) by simulating the nominal dynamics

and integrating the variational equations (3.11), (3.14), and (3.22) from $\tau$ up to $t_f$.

Adjoint Equations

Unfortunately, this forward integration of ν(tf ) is not so useful by itself since we are interested

in finding the value of $(\tau, v)$ that achieves the most negative $\nu(t_f)$, if it exists; it would

be computationally intensive to apply control perturbation at different application times τ with

different values of v and re-simulate the state variation Ψ. To avoid this computationally expensive

search, we mirror the approach presented in [6] and introduce the adjoint system $\bar{\rho} \triangleq (\tilde{\rho}, \rho_p^\top, \rho_b^\top)^\top$ with which the dot product remains invariant:

\[
\frac{d}{dt}\left(\bar{\rho}(t)^\top \bar{\Psi}(t)\right) = 0 \quad \forall t \in [t_0, t_f]. \tag{3.23}
\]

If we let

\[
\bar{\rho}(t_f) \triangleq \left(1,\; \frac{\partial}{\partial p} h\big(p(t_f), b(t_f)\big)^\top,\; \frac{\partial}{\partial b} h\big(p(t_f), b(t_f)\big)^\top\right)^{\!\top} \tag{3.24}
\]


so that its dot product with $\bar{\Psi}(t_f)$ equals $\nu(t_f)$ as in (3.18), the time invariance gives
\begin{align}
\nu(t_f) &= \bar{\rho}(t_f)^\top \bar{\Psi}(t_f) \tag{3.25} \\
&= \bar{\rho}(\tau)^\top \bar{\Psi}(\tau) \tag{3.26} \\
&= \bar{\rho}(\tau)^\top
\begin{pmatrix}
c\big(p(\tau), b(\tau), v\big) - c\big(p(\tau), b(\tau), u(\tau)\big) \\
f(p(\tau), v) - f(p(\tau), u(\tau)) \\
0
\end{pmatrix}. \tag{3.27}
\end{align}

Therefore, we can compute the first-order cost change ν(tf ) for different values of τ once the adjoint

trajectory is derived. For $t \in [t_k, t_{k+1})$ the time derivative of $\bar{\Psi}$ exists, and the invariance property leads to the following equation:
\[
\dot{\bar{\rho}}(t)^\top \bar{\Psi}(t) + \bar{\rho}(t)^\top \dot{\bar{\Psi}}(t) = 0. \tag{3.28}
\]

It can be verified that the following system satisfies (3.28) with $\bar{\rho}(t) = \big(\tilde{\rho}(t), \rho_p(t)^\top, \rho_b(t)^\top\big)^\top$:
\[
\begin{cases}
\dot{\tilde{\rho}}(t) = 0 \\[2pt]
\dot{\rho}_p(t) = -\frac{\partial}{\partial p} c\big(p(t), b(t), u(t)\big) - \frac{\partial}{\partial p} f\big(p(t), u(t)\big)^\top \rho_p(t) \\[2pt]
\dot{\rho}_b(t) = -\frac{\partial}{\partial b} c\big(p(t), b(t), u(t)\big).
\end{cases} \tag{3.29}
\]

Analogously, across discrete jumps we can still enforce the invariance by setting $\bar{\rho}(t_k)^\top \bar{\Psi}(t_k) = \bar{\rho}(t_k^-)^\top \bar{\Psi}(t_k^-)$, which holds for the following adjoint equations:
\[
\begin{cases}
\tilde{\rho}(t_k^-) = \tilde{\rho}(t_k) \\[2pt]
\rho_p(t_k^-) = \rho_p(t_k) + \frac{\partial}{\partial p} g\big(p(t_k^-), b(t_k^-), y_k\big)^\top \rho_b(t_k) \\[2pt]
\rho_b(t_k^-) = \frac{\partial}{\partial b} g\big(p(t_k^-), b(t_k^-), y_k\big)^\top \rho_b(t_k).
\end{cases} \tag{3.30}
\]

Note that the adjoint system integrates backward in time as it has the boundary condition (3.24)

defined at tf . More importantly, the adjoint dynamics (3.29)(3.30) only depend on the nominal

trajectory of the system $(p, b)$ and the observation sequence $(y_1, \ldots, y_T)$. Considering that $\tilde{\rho}(t) = 1$ at all times, the cost variation term $\nu(t_f)$ is finally given by
\[
\nu(t_f) = c\big(p(\tau), b(\tau), v\big) - c\big(p(\tau), b(\tau), u(\tau)\big) + \rho_p(\tau)^\top \big(f(p(\tau), v) - f(p(\tau), u(\tau))\big). \tag{3.31}
\]
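The backward pass implied by (3.24) and (3.29)–(3.30) can be sketched in a few lines of Julia; the gradient and Jacobian callbacks (`dcdp`, `dcdb`, `dfdp`, `dgdp`, `dgdb`, `dhdp`, `dhdb`) are illustrative assumptions.

```julia
# Sketch: backward-integrate (ρp, ρb) from tf to t0 along a nominal
# trajectory, applying the jump update (3.30) at each observation index.
function backward_adjoint(ps, bs, us, obs_idx, ys, dcdp, dcdb, dfdp,
                          dgdp, dgdb, dhdp, dhdb, Δtc)
    ρp = dhdp(ps[end], bs[end])        # boundary condition (3.24)
    ρb = dhdb(ps[end], bs[end])
    ρp_hist = Vector{typeof(ρp)}(undef, length(ps))
    ρp_hist[end] = ρp
    k = length(obs_idx)
    for j in length(ps)-1:-1:1
        if k >= 1 && j + 1 == obs_idx[k]          # jump ρ(t_k) -> ρ(t_k^-), (3.30)
            ρp = ρp + dgdp(ps[j], bs[j], ys[k])' * ρb
            ρb = dgdb(ps[j], bs[j], ys[k])' * ρb
            k -= 1
        end
        # reverse one Euler step of the continuous adjoint ODE (3.29)
        ρp -= Δtc * (-dcdp(ps[j], bs[j], us[j]) - dfdp(ps[j], us[j])' * ρp)
        ρb -= Δtc * (-dcdb(ps[j], bs[j], us[j]))
        ρp_hist[j] = ρp
    end
    return ρp_hist                      # ρp(τ) feeds directly into (3.31)
end
```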

Control Optimization

In order to efficiently optimize (3.31) with respect to (τ, v), we assume hereafter that the control

cost is additive quadratic $\frac{1}{2}u^\top C_u u$ and the dynamics model $f(p, u)$ is control-affine with linear term


$H(p)u$, where $H : \mathbb{R}^{n_p} \to \mathbb{R}^{n_p \times m}$ can be any nonlinear mapping. Although the control-affine assumption

may appear restrictive, many physical systems possess this property in engineering practice. As a

result of these assumptions, (3.31) becomes

\[
\nu(t_f) = \frac{1}{2}v^\top C_u v + \rho_p(\tau)^\top H(p(\tau))\big(v - u(\tau)\big) - \frac{1}{2}u(\tau)^\top C_u u(\tau). \tag{3.32}
\]
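For completeness, here is the one-step substitution behind (3.32). Write the running cost as $c = \tilde{c}(p, b) + \frac{1}{2}u^\top C_u u$ and the dynamics as $f(p, u) = a(p) + H(p)u$, where $\tilde{c}$ and $a$ are symbols introduced only for this derivation (arguments at time $\tau$ suppressed):

\begin{align*}
\nu(t_f) &= c\big(p(\tau), b(\tau), v\big) - c\big(p(\tau), b(\tau), u(\tau)\big) + \rho_p(\tau)^\top\big(f(p(\tau), v) - f(p(\tau), u(\tau))\big) \\
&= \tfrac{1}{2}v^\top C_u v - \tfrac{1}{2}u(\tau)^\top C_u u(\tau) + \rho_p(\tau)^\top\big(H(p(\tau))v - H(p(\tau))u(\tau)\big) \\
&= \tfrac{1}{2}v^\top C_u v + \rho_p(\tau)^\top H(p(\tau))\big(v - u(\tau)\big) - \tfrac{1}{2}u(\tau)^\top C_u u(\tau),
\end{align*}

which is exactly (3.32); the state-dependent terms $\tilde{c}$ and $a$ cancel because the perturbation changes only the control value near time $\tau$.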

So far we have treated the observation sequence (y1, . . . , yT ) as given and fixed. However, in

practice it is a random process that we have to take into account. Fortunately, our control opti-

mization is all based on the nominal control schedule u(t), with which we can both simulate the

augmented dynamics and sample the observations. To see this, let us consider the observations as

a sequence of random vectors (Y1, . . . , YT ) and rewrite ν(tf ) in (3.32) as ν(tf , Y1, . . . , YT ) to clarify

the dependence on it. The expected value of the first order cost variation is given by

E[ν(tf )] =

∫ν(tf , Y1, . . . , YT )dP, (3.33)

where P is the probability measure associated with these random vectors. Although we do not

know the specific form of $\mathbb{P}$, we have the generative model; we can simulate the augmented state

trajectory using the nominal control and sequentially sample the stochastic observations from the

belief states along the trajectory.

Using the linearity of expectation for (3.32), we have

\[
\mathbb{E}[\nu(t_f)] = \frac{1}{2}v^\top C_u v + \mathbb{E}[\rho_p(\tau)]^\top H(p(\tau))\big(v - u(\tau)\big) - \frac{1}{2}u(\tau)^\top C_u u(\tau). \tag{3.34}
\]

Notice that only the adjoint trajectory is stochastic. We can employ Monte Carlo sampling to

sample a sufficient number of observation sequences to approximate the expected adjoint trajectory.

Now (3.34) becomes a convex quadratic in v for a positive definite Cu. Assuming that Cu is also

diagonal, analytical solutions are available to the following convex optimization problem with an

input saturation constraint:

\[
\begin{aligned}
\underset{v}{\text{minimize}} \quad & \mathbb{E}[\nu(t_f)] \\
\text{subject to} \quad & a \preceq v \preceq b,
\end{aligned} \tag{3.35}
\]
where $a, b \in \mathbb{R}^m$ are some saturation vectors and $\preceq$ is an elementwise inequality. This optimization

is solved for different values of τ ∈ (t0 + tcalc + ε, t0 + tcalc + ∆to), where tcalc is a pre-allocated

computation time budget and ∆to is the time interval between two successive observations as well

as control updates. We then search over (τ, v∗(τ)) for the optimal perturbation time τ∗ to globally

minimize E[ν(tf )]. There is only a finite number of such τ to consider since in practice we use

numerical integration such as the Euler scheme with some step size ∆tc to compute the trajectories.

In [6] the finite perturbation duration ε is also optimized using line search, but in this work we set ε

as a tunable parameter to reduce the computation time. The complete algorithm is summarized in Algorithm 3.1.


Algorithm 3.1 SACBP Control Update for Problems with Mixed Observability

INPUT: Current augmented state $s_0 = (p(t_0)^\top, b(t_0)^\top)^\top$, nominal control schedule $u(t)$ for $t \in [t_0, t_f]$, perturbation duration $\varepsilon$
OUTPUT: Optimally perturbed control schedule $u^\varepsilon(t)$ for $t \in [t_0, t_f]$
1: for $i = 1{:}N$ do
2:   Forward-simulate the nominal augmented state trajectory (3.1), (3.2) and sample an observation sequence $(y_1^i, \ldots, y_T^i)$ along the augmented state trajectory.
3:   Backward-simulate the nominal adjoint trajectory $\rho_p^i, \rho_b^i$ (3.29), (3.30) with the sampled observations.
4: end for
5: Compute the Monte Carlo estimate: $\mathbb{E}[\rho_p] \approx \frac{1}{N}\sum_{i=1}^N \rho_p^i$.
6: for ($\tau = t_0 + t_{\mathrm{calc}} + \varepsilon$; $\tau \leq t_0 + t_{\mathrm{calc}} + \Delta t_o$; $\tau \leftarrow \tau + \Delta t_c$) do
7:   Solve the quadratic minimization (3.35) with (3.34). Store the optimal value $\nu^*(\tau)$ and optimizer $v^*(\tau)$.
8: end for
9: $\tau^* \leftarrow \arg\min_\tau \nu^*(\tau)$, $v^* \leftarrow v^*(\tau^*)$
10: $u^\varepsilon \leftarrow$ PerturbControlTrajectory($u, v^*, \tau^*, \varepsilon$) (3.3)
11: return $u^\varepsilon$

The call to the algorithm occurs every $\Delta t_o$ s in a receding-horizon fashion, after the

new observation is incorporated in the belief.
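Because $C_u$ is diagonal and positive definite, the quadratic minimization in Line 7 decouples across control dimensions, so the box-constrained minimizer of (3.34) is the unconstrained minimizer clipped elementwise to $[a, b]$. A minimal Julia sketch, with all names illustrative:

```julia
using LinearAlgebra

# Sketch: closed-form solution of the saturated quadratic program (3.35)
# for one candidate perturbation time τ. `Cu_diag` is the diagonal of Cu,
# `Hp` = H(p(τ)), `Eρp` is the Monte Carlo estimate of E[ρp(τ)], and
# `u_τ` = u(τ); `a` and `b` are the saturation vectors.
function optimal_perturbation(Cu_diag, Hp, Eρp, u_τ, a, b)
    v = clamp.(-(Hp' * Eρp) ./ Cu_diag, a, b)   # clipped unconstrained minimizer
    ν = 0.5*sum(Cu_diag .* v.^2) + dot(Eρp, Hp*(v - u_τ)) -
        0.5*sum(Cu_diag .* u_τ.^2)              # objective value from (3.34)
    return v, ν
end
```

Clipping is exact here because the objective is separable in $v$; this is what keeps the search over $\tau$ in Lines 6–8 inexpensive.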

3.2.2 General Belief Space Planning Problems

If none of the state is fully observable, the same stochastic SAC framework still applies almost as is

to the belief state $b$. In this case we consider a continuous-discrete filter [177] where the prediction

step follows an ODE and the update step provides an instantaneous discrete jump. The hybrid

dynamics for the belief vector are given by
\[
\begin{cases}
b(t_k) = g\big(b(t_k^-), y_k\big) \\[2pt]
\dot{b}(t) = f\big(b(t), u(t)\big) \quad \forall t \in [t_k, t_{k+1}).
\end{cases} \tag{3.36}
\]

Letting $\Psi(t) = \frac{\partial^+}{\partial\varepsilon} b^\varepsilon(t) \big|_{\varepsilon=0}$, the variational equation yields
\[
\begin{cases}
\Psi(t_k) = \frac{\partial}{\partial b} g\big(b(t_k^-), y_k\big)\, \Psi(t_k^-) \\[2pt]
\dot{\Psi}(t) = \frac{\partial}{\partial b} f\big(b(t), u(t)\big)\, \Psi(t) \quad \forall t \in [t_k, t_{k+1})
\end{cases} \tag{3.37}
\]
with initial condition $\Psi(\tau) = f\big(b(\tau), v\big) - f\big(b(\tau), u(\tau)\big)$.

Let the total cost be of the form:
\[
\int_{t_0}^{t_f} c\big(b(t), u(t)\big)\, dt + h\big(b(t_f)\big). \tag{3.38}
\]


Under the given $(\tau, v)$ and $(y_1, \ldots, y_T)$, the variation $\nu(t_f)$ of the total cost can be computed as
\[
\nu(t_f) = c\big(b(\tau), v\big) - c\big(b(\tau), u(\tau)\big) + \int_{\tau}^{t_f} \frac{\partial}{\partial b} c\big(b(t), u(t)\big)^\top \Psi(t)\, dt + \frac{\partial}{\partial b} h\big(b(t_f)\big)^\top \Psi(t_f). \tag{3.39}
\]

This is equivalent to
\[
\nu(t_f) = c\big(b(\tau), v\big) - c\big(b(\tau), u(\tau)\big) + \rho(\tau)^\top \big(f(b(\tau), v) - f(b(\tau), u(\tau))\big), \tag{3.40}
\]

where $\rho$ is the adjoint system that follows the dynamics:
\[
\begin{cases}
\rho(t_k^-) = \frac{\partial}{\partial b} g\big(b(t_k^-), y_k\big)^\top \rho(t_k) \\[2pt]
\dot{\rho}(t) = -\frac{\partial}{\partial b} c\big(b(t), u(t)\big) - \frac{\partial}{\partial b} f\big(b(t), u(t)\big)^\top \rho(t)
\end{cases} \tag{3.41}
\]

with the boundary condition $\rho(t_f) = \frac{\partial}{\partial b} h\big(b(t_f)\big)$. Under the control-affine assumption for $f$ and the additive quadratic control cost, the expected first-order cost variation (3.40) yields
\[
\mathbb{E}[\nu(t_f)] = \frac{1}{2}v^\top C_u v + \mathbb{E}[\rho(\tau)]^\top H(b(\tau))\big(v - u(\tau)\big) - \frac{1}{2}u(\tau)^\top C_u u(\tau), \tag{3.42}
\]
where $H(b(\tau))$ is the control coefficient term in $f$.

Although it is difficult to state the general conditions under which this control-affine assumption

holds, for instance one can verify that the continuous-discrete EKF [177] satisfies this property if

the underlying system dynamics $f_{\mathrm{sys}}$ is control-affine:
\[
\begin{cases}
\dot{\mu}(t) = f_{\mathrm{sys}}\big(\mu(t), u(t)\big) \\[2pt]
\dot{\Sigma}(t) = A\Sigma + \Sigma A^\top + Q.
\end{cases} \tag{3.43}
\]
In the above continuous-time prediction equations, $A$ is the Jacobian of the dynamics function $f_{\mathrm{sys}}(x(t), u(t))$ evaluated at the mean $\mu(t)$ and $Q$ is the process noise covariance. If $f_{\mathrm{sys}}$ is control-affine, so is $A$ and therefore so is $\dot{\Sigma}$. Obviously $\dot{\mu}$ is control-affine as well. As a result, the dynamics for the belief vector $b = (\mu^\top, \mathrm{vec}(\Sigma)^\top)^\top$ satisfy the control-affine assumption.
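To illustrate the structure of (3.43), here is a minimal Julia sketch of one Euler step of the continuous-time EKF prediction for a control-affine system $f_{\mathrm{sys}}(x, u) = a(x) + H_{\mathrm{sys}}(x)u$; the callback names are assumptions for illustration.

```julia
# Sketch: one Euler step of the continuous-discrete EKF prediction (3.43)
# for a control-affine system fsys(x, u) = a(x) + Hsys(x) * u.
# `dadx(μ)` returns the Jacobian of a at μ; `dHdx(μ)` returns a collection
# of Jacobians, one per column of Hsys, so A is affine in u.
function ekf_predict_step(μ, Σ, u, a, Hsys, dadx, dHdx, Q, Δtc)
    A = dadx(μ)
    for (j, dHj) in enumerate(dHdx(μ))
        A += u[j] * dHj                       # A is affine in u
    end
    μ_next = μ + Δtc * (a(μ) + Hsys(μ) * u)   # mean ODE, affine in u
    Σ_next = Σ + Δtc * (A*Σ + Σ*A' + Q)       # covariance ODE, affine in u
    return μ_next, Σ_next
end
```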

Mirroring the approach in Section 3.2.1, we can use Monte Carlo sampling to estimate the

expected value in (3.42). The resulting algorithm is presented in Algorithm 3.2.

3.2.3 Closed-loop Nominal Policy

In Sections 3.2.1 and 3.2.2 we assumed that the nominal control was an open-loop control schedule.

However, one can think of a scenario where a nominal control is a closed-loop policy computed

offline, such as a discrete POMDP policy that maps beliefs to actions [84]. Indeed, SACBP can also

handle closed-loop nominal policies. Let π be a closed-loop nominal policy, which is a mapping from


Algorithm 3.2 SACBP Control Update for General Belief Space Planning Problems

INPUT: Current belief state $b_0 = b(t_0)$, nominal control schedule $u(t)$ for $t \in [t_0, t_f]$, perturbation duration $\varepsilon$
OUTPUT: Optimally perturbed control schedule $u^\varepsilon(t)$ for $t \in [t_0, t_f]$.
1: for $i = 1{:}N$ do
2:   Forward-simulate the nominal belief state trajectory (3.36) and sample an observation sequence $(y_1^i, \ldots, y_T^i)$ along the belief trajectory.
3:   Backward-simulate the nominal adjoint trajectory $\rho^i$ (3.41) with the sampled observations.
4: end for
5: Compute the Monte Carlo estimate: $\mathbb{E}[\rho] \approx \frac{1}{N}\sum_{i=1}^N \rho^i$.
6: for ($\tau = t_0 + t_{\mathrm{calc}} + \varepsilon$; $\tau \leq t_0 + t_{\mathrm{calc}} + \Delta t_o$; $\tau \leftarrow \tau + \Delta t_c$) do
7:   Solve the quadratic minimization (3.35) with (3.42). Store the optimal value $\nu^*(\tau)$ and optimizer $v^*(\tau)$.
8: end for
9: $\tau^* \leftarrow \arg\min_\tau \nu^*(\tau)$, $v^* \leftarrow v^*(\tau^*)$
10: $u^\varepsilon \leftarrow$ PerturbControlTrajectory($u, v^*, \tau^*, \varepsilon$) (3.3)
11: return $u^\varepsilon$

either an augmented state s(t) or a belief state b(t) to a control value u(t). Due to the stochastic

belief dynamics, the control values returned by $\pi$ in the future are also stochastic for $t \geq t_1$. This is

reflected when we forward-propagate the nominal dynamics. Specifically, each sampled trajectory

has a different control trajectory in addition to a different observation sequence. However, the

equations are still convex quadratic in v as shown below. For problems with mixed observability, we

have
\[
\mathbb{E}[\nu(t_f)] = \frac{1}{2}v^\top C_u v + \mathbb{E}[\rho_p(\tau)]^\top H(p(\tau))\big(v - \pi(s(\tau))\big) - \frac{1}{2}\pi(s(\tau))^\top C_u \pi(s(\tau)). \tag{3.44}
\]

The general belief space planning case also yields a similar equation:
\[
\mathbb{E}[\nu(t_f)] = \frac{1}{2}v^\top C_u v + \mathbb{E}[\rho(\tau)]^\top H(b(\tau))\big(v - \pi(b(\tau))\big) - \frac{1}{2}\pi(b(\tau))^\top C_u \pi(b(\tau)). \tag{3.45}
\]

Note that s(τ) and b(τ) are both deterministic since the first observation y1 is not yet taken at τ < t1.

The expectations in (3.44) and (3.45) can be estimated using Monte Carlo sampling. The forward-

simulation of the nominal trajectory in Line 2 of Algorithms 3.1 and 3.2 is now performed with the closed-

loop policy π, and the equations in Line 7 need to be replaced with (3.44) and (3.45), respectively.

However, the rest remains unchanged.

3.2.4 Computation Time Complexity

Let us analyze the time complexity of the SACBP algorithm. The bottleneck of the computation

is when the forward-backward simulation is performed multiple times (lines 1–4 of Algorithms 3.1

and 3.2). The asymptotic complexity of this part is given by $O\big(N \frac{t_f - t_0}{\Delta t_o}(M_{\mathrm{forward}} + M_{\mathrm{backward}})\big)$,


where Mforward and Mbackward are the times to respectively integrate the forward and backward

dynamics between two successive observations. For a more concrete analysis let us use the Gaussian

belief dynamics given by EKF as an example. For simplicity we assume the same dimension n

for the state, the control, and the observation. The belief state has dimension O(n2). Using the

Euler scheme, the forward integration takes $M_{\mathrm{forward}} = O\big((\frac{\Delta t_o}{\Delta t_c} + 1)n^3\big)$ since evaluating the continuous and discrete EKF equations each costs $O(n^3)$. Computation of the continuous part of the adjoint dynamics (3.41) is dominated by the evaluation of the Jacobian $\frac{\partial f}{\partial b}$, which is $O(n^5)$ because $O(n^3)$ operations to evaluate $f$ are carried out $O(n^2)$ times. The discrete part is also $O(n^5)$. Therefore, $M_{\mathrm{backward}} = O\big((\frac{\Delta t_o}{\Delta t_c} + 1)n^5\big)$. Overall, the time complexity is $O\big(N \frac{t_f - t_0}{\Delta t_o}(\frac{\Delta t_o}{\Delta t_c} + 1) n^5\big)$. This is asymptotically smaller in $n$ than belief iLQG, which is $O(n^6)$. See [123] for a comparison of time complexity among different belief space planning algorithms. We also remind the reader that

SACBP is an online method and a naive implementation already achieves near real-time performance,

computing control over a 2 s horizon in about 0.1 s to 0.4 s. By near real-time we mean that a

naive implementation of SACBP requires approximately $0.7 \times t_{\mathrm{calc}}$ to $3 \times t_{\mathrm{calc}}$ time to compute an action that must be applied $t_{\mathrm{calc}}$ s in the future. We expect that parallelization on a GPU and a

more efficient implementation will result in real-time computation for SACBP.

3.3 Analysis of Mode Insertion Gradient for Stochastic Hybrid Systems

The SACBP algorithm presented in Section 3.2 as well as the original SAC algorithm [6] both rely

on the local sensitivity analysis of the cost functional with respect to the control perturbation. This

first-order sensitivity term (i.e. ν(tf ) in our notation) is known as the mode insertion gradient

in the mode scheduling literature [41, 172]. In [6] the notion of the mode insertion gradient has

been generalized to handle a broader class of hybrid systems than discussed before. What remains

to be seen is a further generalization of the mode insertion gradient to stochastic hybrid systems,

such as the belief dynamics discussed in this chapter. Indeed, the quantity we can optimize in

(3.35) is essentially the expected value of the first-order sensitivity of the total cost. This is not to

be confused with the first-order sensitivity of the expected total cost, which would be the natural

generalization of the mode insertion gradient to stochastic systems. In general, those two quantities

can be different, since the order of expectation and differentiation may not be swapped arbitrarily.

In this section, we provide a set of sufficient conditions under which the order can be exchanged.

By doing so we show that 1) the notion of mode insertion gradient can be generalized to stochastic

hybrid systems, and 2) the SACBP algorithm optimizes this generalized mode insertion gradient.

Through this analysis we will see that the SACBP algorithm has a guarantee that, in expectation

it performs at least as well as the nominal policy with an appropriate choice of $\varepsilon$.


3.3.1 Assumptions

Let us begin with a set of underlying assumptions for the system dynamics, the control, and the

cost functions. Without loss of generality, we assume that the system starts at time t = 0 and ends

at t = T , with a sequence of T observations (y1, . . . , yT ) made every unit time. For generality, we

use notation x to represent the state variable of the system in this section, in place of b or s that

respectively represented the belief state or the augmented state in Section 3.2. This means that

the analysis presented here is not restricted to belief systems where the dynamics are governed by

Bayesian filters, but rather applies to a broader class of stochastic systems.

Assumption 3.1 (Control Model). The controls are in $C^{0,m}[0, T]$, the space of piecewise continuous functions from $[0, T]$ into $\mathbb{R}^m$. We further assume that there exists some $\rho_{\max} < \infty$ such that for all $t \in [0, T]$, we have $u(t) \in B(0, \rho_{\max})$, where $B(0, \rho_{\max})$ is the closed Euclidean ball of radius $\rho_{\max}$ centered at $0$, i.e. $\|u(t)\|_2 \leq \rho_{\max}$. We denote this admissible control set by $\mathcal{U} \triangleq \{u \in C^{0,m}[0, T] \mid \forall t \in [0, T],\ u(t) \in B(0, \rho_{\max})\}$.

Remark 3.1. For the sake of analysis, the control model described above takes the form of an open-

loop control schedule, where time t determines the control signal u(t). However the analysis can be

extended to closed-loop nominal policies in a straightforward manner, which is discussed in Appendix

A. (See Remark A.2.)

Assumption 3.2 (Dynamics Model). Let $x_0 \in \mathbb{R}^{n_x}$ be the given initial state value at $t = 0$. Given a control $u \in \mathcal{U}$ and a sequence of observations $(y_1, \ldots, y_T) \in \mathbb{R}^{n_y} \times \cdots \times \mathbb{R}^{n_y}$, the dynamics model is the following hybrid system with time-driven switching:
\[
x(t) \triangleq x_i(t) \quad \forall t \in [i-1, i) \quad \forall i \in \{1, 2, \ldots, T\}, \tag{3.46}
\]
where $x_i$ is the $i$-th “mode” of the system state defined on $[i-1, i]$ as:
\begin{align}
x_i(i-1) &= g\big(x_{i-1}(i-1), y_{i-1}\big) \tag{3.47} \\
\dot{x}_i(t) &= f\big(x_i(t), u(t)\big) \quad \forall t \in [i-1, i], \tag{3.48}
\end{align}
with $x(0) = x_1(0) = x_0$. We also define the final state as $x(T) \triangleq g\big(x_T(T), y_T\big)$.

For the transition functions f and g we assume the following:

(3.2a) the function $f : \mathbb{R}^{n_x} \times \mathbb{R}^m \to \mathbb{R}^{n_x}$ is continuously differentiable;

(3.2b) the function $g : \mathbb{R}^{n_x} \times \mathbb{R}^{n_y} \to \mathbb{R}^{n_x}$ is continuous. It is also differentiable in $x$;

(3.2c) for function $f$, there exists a constant $K_1 \in [1, \infty)$ such that $\forall x', x'' \in \mathbb{R}^{n_x}$ and $\forall u', u'' \in B(0, \rho_{\max})$, the following condition holds:
\[
\|f(x', u') - f(x'', u'')\|_2 \leq K_1 \big(\|x' - x''\|_2 + \|u' - u''\|_2\big) \tag{3.49}
\]


(3.2d) for function $g$, there exist finite non-negative constants $K_2, K_3, K_4, K_5$ and positive integers $L_1, L_2$ such that $\forall x \in \mathbb{R}^{n_x}$ and $\forall y \in \mathbb{R}^{n_y}$, the following relations hold:
\begin{align}
\|g(x, y)\|_2 &\leq K_2 + K_3\|x\|_2^{L_1} + K_4\|y\|_2^{L_2} + K_5\|x\|_2^{L_1}\|y\|_2^{L_2} \tag{3.50} \\
\left\|\frac{\partial}{\partial x} g(x, y)\right\|_2 &\leq K_2 + K_3\|x\|_2^{L_1} + K_4\|y\|_2^{L_2} + K_5\|x\|_2^{L_1}\|y\|_2^{L_2} \tag{3.51}
\end{align}

Remark 3.2. Assumptions (3.2a) and (3.2c) are used to show existence and uniqueness of the solu-

tion to the differential equation (3.48) as well as the variational equation under control perturbation.

(See Propositions A.1 and A.14 in Appendix A.) Note that Assumption (3.2c) is essentially a Lips-

chitz continuity condition, which is a quite common assumption in the analysis of nonlinear ODEs,

as can be seen in Assumption 5.6.2 in [43] and Theorem 2.3 in [77]. Assumptions (3.2b) and (3.2d)

are the growth conditions on x across adjacent modes. Recall that in belief space planning where

the system state x is the belief state b, the jump function g corresponds to the observation update of

the Bayesian filter. The form of the bound in (3.50) and (3.51) allows a broad class of continuous

functions to be considered as g, and is inspired by a few examples of the Bayesian update equations

as presented below.

Proposition 3.1 (Bounded Jump for Univariate Gaussian Distribution). Let $b = (\mu, s)^\top \in \mathbb{R}^2$ be the belief state, where $\mu$ is the mean parameter and $s > 0$ is the variance. Suppose that the observation $y$ is the underlying state $x \in \mathbb{R}$ corrupted by additive Gaussian white noise $v \sim \mathcal{N}(0, 1)$. Then, the Bayesian update function $g$ for this belief system satisfies Assumption (3.2d).

Proof. The Bayesian update formula for this system is given by $g(b, y) = \bar{b} \triangleq (\bar{\mu}, \bar{s})^\top$, where
\begin{align}
\bar{\mu} &= \mu + \frac{s}{s+1}(y - \mu) \tag{3.52} \\
\bar{s} &= s - \frac{s^2}{s+1} \tag{3.53}
\end{align}
is the update step of the Kalman filter. Rearranging the terms, we have
\[
g(b, y) = \frac{1}{s+1}\begin{pmatrix} \mu + sy \\ s \end{pmatrix} \tag{3.54}
\]
and consequently,
\[
\frac{\partial}{\partial b} g(b, y) = \frac{1}{(s+1)^2}\begin{pmatrix} s+1 & y \\ 0 & 1 \end{pmatrix}. \tag{3.55}
\]


We will show that the function $g$ satisfies Assumption (3.2d). For the bound on $g(b, y)$,
\begin{align}
\|g(b, y)\|_2^2 &= \frac{1}{(s+1)^2}\big((\mu + sy)^2 + s^2\big) \tag{3.56} \\
&\leq (\mu + sy)^2 + s^2 \tag{3.57} \\
&\leq \|b\|_2^2 + \left(b^\top \begin{pmatrix} 1 \\ y \end{pmatrix}\right)^{\!2} \tag{3.58} \\
&\leq \|b\|_2^2 \big(2 + \|y\|_2^2\big), \tag{3.59}
\end{align}
where we have used $(s+1)^2 \geq 1$ and the Cauchy-Schwarz inequality. Thus,
\begin{align}
\|g(b, y)\|_2 &\leq \|b\|_2 \sqrt{2 + \|y\|_2^2} \tag{3.60} \\
&\leq \sqrt{2}\|b\|_2 + \|b\|_2\|y\|_2. \tag{3.61}
\end{align}

Similarly, the bound on the Jacobian yields
\begin{align}
\left\|\frac{\partial}{\partial b} g(b, y)\right\|_2^2 &\leq \left\|\frac{\partial}{\partial b} g(b, y)\right\|_F^2 \tag{3.62} \\
&= \frac{1}{(s+1)^4}\big((s+1)^2 + y^2 + 1\big) \tag{3.63} \\
&= \frac{1}{(s+1)^2} + \frac{1}{(s+1)^4}\big(y^2 + 1\big) \tag{3.64} \\
&\leq 2 + \|y\|_2^2. \tag{3.65}
\end{align}
Therefore,
\[
\left\|\frac{\partial}{\partial b} g(b, y)\right\|_2 \leq \sqrt{2} + \|y\|_2. \tag{3.66}
\]

This shows that the jump function $g$ for the above univariate Gaussian model satisfies Assumption (3.2d) with $(K_2, K_3, K_4, K_5) = (\sqrt{2}, \sqrt{2}, 1, 1)$ and $(L_1, L_2) = (1, 1)$.
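As a quick sanity check of Proposition 3.1, the bound (3.61) can be verified numerically; a minimal Julia sketch, purely illustrative:

```julia
using LinearAlgebra

# Kalman update for the univariate Gaussian belief b = (μ, s), per (3.52)-(3.53).
g(b, y) = [b[1] + b[2]/(b[2]+1)*(y - b[1]), b[2] - b[2]^2/(b[2]+1)]

# Check ‖g(b,y)‖₂ ≤ √2‖b‖₂ + ‖b‖₂‖y‖₂ (3.61) on random beliefs and observations.
for _ in 1:10_000
    μ, s, y = randn(), rand() + 1e-3, randn()
    b = [μ, s]
    @assert norm(g(b, y)) <= sqrt(2)*norm(b) + norm(b)*abs(y) + 1e-9
end
```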

Proposition 3.2 (Bounded Jump for Categorical Distribution). Let $b = (b_1, \ldots, b_n)^\top \in \mathbb{R}^n$ be the $n$-dimensional belief state representing the categorical distribution over the underlying state $x \in \{1, \ldots, n\}$. We choose the unnormalized form where the probability of $x = i$ is given by $b_i / \sum_{i=1}^n b_i$. Let the observation $y \in \{1, \ldots, m\}$ be modeled by a conditional probability mass function $p(y \mid x) \in [0, 1]$. Then, the Bayesian update function $g$ for this belief system satisfies Assumption (3.2d).


Proof. The Bayes rule gives $g(b, y) = \bar{b} \triangleq (\bar{b}_1, \ldots, \bar{b}_n)$, where
\[
\begin{pmatrix} \bar{b}_1 \\ \bar{b}_2 \\ \vdots \\ \bar{b}_n \end{pmatrix}
=
\begin{pmatrix} p(y \mid 1)\, b_1 \\ p(y \mid 2)\, b_2 \\ \vdots \\ p(y \mid n)\, b_n \end{pmatrix}. \tag{3.67}
\]

Therefore, we can easily bound the norm of the posterior belief $\bar{b}$ by
\[
\|g(b, y)\|_2 = \|\bar{b}\|_2 \leq \|b\|_2, \tag{3.68}
\]
as $p(y \mid x) \leq 1$. The Jacobian is simply the diagonal matrix $\mathrm{diag}\big(p(y \mid 1), \ldots, p(y \mid n)\big)$, and hence
\[
\left\|\frac{\partial}{\partial b} g(b, y)\right\|_2 \leq 1. \tag{3.69}
\]

This shows that the jump function $g$ for the categorical belief model above satisfies Assumption (3.2d) with $(K_2, K_3, K_4, K_5) = (1, 1, 0, 0)$ and $(L_1, L_2) = (1, 1)$.

Assumption 3.3 (Cost Model). The instantaneous cost $c : \mathbb{R}^{n_x} \times \mathbb{R}^m \to \mathbb{R}$ is continuous. It is also continuously differentiable in $x$. The terminal cost $h : \mathbb{R}^{n_x} \to \mathbb{R}$ is differentiable. Furthermore, we

assume that there exist finite non-negative constants K6,K7 and a positive integer L3 such that for

all x ∈ Rnx and u ∈ B(0, ρmax), the following relations hold:

\begin{align}
|c(x, u)| &\leq K_6 + K_7\|x\|_2^{L_3} \tag{3.70} \\
\left\|\frac{\partial}{\partial x} c(x, u)\right\|_2 &\leq K_6 + K_7\|x\|_2^{L_3} \tag{3.71} \\
|h(x)| &\leq K_6 + K_7\|x\|_2^{L_3} \tag{3.72} \\
\left\|\frac{\partial}{\partial x} h(x)\right\|_2 &\leq K_6 + K_7\|x\|_2^{L_3}. \tag{3.73}
\end{align}

Remark 3.3. Assumption 3.3 is to guarantee that the cost function is integrable with respect to

stochastic observations, which are introduced in Assumption 3.4. Note that even though the above

bound is not general enough to apply to all analytic functions, it does include all finite order poly-

nomials of ‖x(t)‖2 and ‖u(t)‖2, for example, since ‖u(t)‖2 is bounded by Assumption 3.1.

Assumption 3.4 (Stochastic Observations). Let (Ω,F ,P) be a probability space. Let (Y1, . . . , YT ) be

a sequence of random vectors in Rny defined on this space, representing the sequence of observations.

Assume that for each $Y_i$ all the moments of the $\ell_2$ norm are finite. That is,
\[
\forall i \in \{1, \ldots, T\} \quad \forall k \in \mathbb{N} \quad \mathbb{E}\big[\|Y_i\|_2^k\big] < \infty. \tag{3.74}
\]


Definition 3.1 (Perturbed Control). Let u ∈ U be a control. For τ ∈ (0, 1) and v ∈ B(0, ρmax),

define the perturbed control uε by

\[
u^\varepsilon(t) \triangleq
\begin{cases}
v & \text{if } t \in (\tau - \varepsilon, \tau] \\
u(t) & \text{otherwise,}
\end{cases} \tag{3.75}
\]

where $\varepsilon \in [0, \tau]$. By definition, if $\varepsilon = 0$ then $u^\varepsilon$ is the same as $u$. We assume that the nominal control

u(t) is left continuous in t at t = τ .
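Definition 3.1 translates directly into code; a one-line Julia sketch of the needle perturbation, purely illustrative:

```julia
# The needle perturbation (3.75): replace u on (τ-ε, τ] with the constant v.
perturb(u, v, τ, ε) = t -> (τ - ε < t <= τ) ? v : u(t)
```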

3.3.2 Main Results

The main result of the analysis is the following theorem.

Theorem 3.1 (Mode Insertion Gradient). Suppose that Assumptions 3.1 – 3.4 are satisfied. For a

given (τ, v), let uε denote the perturbed control of the form (3.75). The perturbed control uε and the

stochastic observations (Y1, . . . , YT ) result in the stochastic perturbed state trajectory xε. For such

uε and xε, let us define the mode insertion gradient of the expected total cost as

∂+

∂εE

[∫ T

0

c(xε(t), uε(t))dt+ h(xε(T ))

] ∣∣∣∣∣ε=0

. (3.76)

Then, this right derivative exists and we have
\begin{multline}
\frac{\partial^+}{\partial\varepsilon} \mathbb{E}\left[\int_0^T c\big(x^\varepsilon(t), u^\varepsilon(t)\big)\, dt + h\big(x^\varepsilon(T)\big)\right] \Bigg|_{\varepsilon=0} \\
= c\big(x(\tau), v\big) - c\big(x(\tau), u(\tau)\big) + \mathbb{E}\left[\int_\tau^T \frac{\partial}{\partial x} c\big(x(t), u(t)\big)^\top \Psi(t)\, dt + \frac{\partial}{\partial x} h\big(x(T)\big)^\top \Psi(T)\right], \tag{3.77}
\end{multline}
where $\Psi(t) = \frac{\partial^+}{\partial\varepsilon} x^\varepsilon(t) \big|_{\varepsilon=0}$ is the state variation.

The proof of the theorem is deferred to Appendix A (see Theorem A.1). One can see that the

mode insertion gradient (3.76) is a natural generalization of the ones discussed in [41, 172, 6] to

stochastic hybrid systems. Furthermore, by comparing (3.77) with (3.39) it is apparent that the

right hand side of (3.77) is mathematically equivalent to E[ν(tf )], the quantity to be optimized with

the SACBP algorithm in Section 3.2.

The fact that SACBP optimizes (3.76) leads to a certain performance guarantee of the algorithm.

In the open-loop nominal control case, the term E[ν(tf )] as in (3.34) or (3.42) becomes 0 if the

control perturbation v is equal to the nominal control u(τ). Therefore, as long as u(τ) is a feasible

solution to (3.35) the optimal value is guaranteed to be less than or equal to zero. Furthermore, in

expectation the actual value of E[ν(tf )] matches the one approximated with samples, since the Monte


Carlo estimate is unbiased. In other words, the perturbation (τ∗, v∗) computed by the algorithm

is expected to result in a non-positive mode insertion gradient. If the mode insertion gradient is

negative, there always exists a sufficiently small ε > 0 such that the expected total cost is decreased

by the control perturbation. In the corner case that the mode insertion gradient is zero, one can

set ε = 0 to not perturb the control at all. Therefore, for an appropriate choice of ε the expected

performance of the SACBP algorithm over the planning horizon is at least as good as that of the

nominal control.

The same discussion holds for the case of closed-loop nominal control policies, when the expression

for $\mathbb{E}[\nu(t_f)]$ is given by (3.44) or (3.45). This is because Theorem 3.1 still holds if the nominal control

u(t) is a closed-loop policy as stated in Remark A.2 (Appendix A). Therefore, the expected worst-

case performance of the algorithm is lower-bounded by that of the nominal policy. This implies that

if a reasonable nominal policy is known, at run-time SACBP is expected to further improve it while

synthesizing continuous-time control inputs efficiently.

3.4 Simulation Results

We evaluated the performance of SACBP in the following two simulation studies: (i) active multi-

target tracking with range-only observations; (ii) object manipulation under model uncertainty.

SACBP as well as other baseline methods were implemented in Julia3, except for T-LQG [123]

whose NLP problems were modeled by CasADi [5] in Python and then solved by Ipopt [164], a

standard NLP solver based on interior-point methods. All the computation was performed on a

desktop computer with Intel® Core™ i7-8750H CPU and 32.1GB RAM. The Monte Carlo sampling

of SACBP was parallelized on multiple cores of the CPU.

3.4.1 Active Multi-Target Tracking with Range-only Observations

This problem focuses on pure information gathering, namely identifying where the moving targets

are in the environment. In doing so, the surveillance robot modeled as a single integrator can only

use relative distance observations. The robot’s position p is fully observable and the transitions are

deterministic. Assuming perfect data association, the observation for target $i$ is $d_i = \|q_i - p + v_i\|_2$, where $q_i$ is the true target position and $v_i$ is zero-mean Gaussian white noise with state-dependent covariance $R(p, q_i) = R_0 + \|q_i - p\|_2 R_1$. We use $0.01 I_{2\times 2}$ for the nominal noise $R_0$. The range-dependent noise $R_1 = 0.001 I_{2\times 2}$ degrades the observation quality as the robot gets farther from the

target. The discrete-time UKF is employed for state estimation in tracking 20 independent targets.

The target dynamics are modeled by a 2D Brownian motion with covariance $Q = 0.1 I_{2\times 2}$. Similar

to [148], an approximated observation covariance R(p, µi) is used in the filter to obtain tractable

³The code for SACBP as well as the baseline methods is publicly available at https://github.com/StanfordMSL/SACBP.jl.


Figure 3.1: (Left) Simulation environment with 20 targets and a surveillance robot. (Middle) The history of the worst entropy value among the targets averaged over 20 random runs, plotted with the standard deviation. With the budget of 10 Monte Carlo samples, SACBP showed small variance for the performance curve and resulted in the fastest reduction of the worst entropy value compared to every other baseline. (Right) Computation times for Greedy, MCTS-DPW, and SACBP achieved real-time performance, taking less time than the simulated $t_{\mathrm{calc}} = 0.15$ s.

estimation results, where $\mu_i$ is the most recent mean estimate of $q_i$.

The SACBP algorithm produces a continuous robot trajectory over 200 s with planning horizon

tf − t0 = 2 s, update interval ∆to = 0.2 s, perturbation duration ε = 0.16 s, and N = 10 Monte

Carlo samples. The Euler scheme is used for integration with ∆tc = 0.01 s. The Jacobians and

the gradients are computed either analytically or using an automatic differentiation tool [129] to

retain both speed and precision. In this simulation tcalc = 0.15 s is assumed no matter how long

the actual control update takes. We use c(p, b, u) = 0.05uTu for the running cost and h(p, b) =∑20i=1 exp(entropy(bi)) for the terminal cost, with an intention to reduce the worst-case uncertainty

among the targets. This expression for h(p, b) is equivalent to:

h(p, b) =

20∑i=1

√det(2πeΣi), (3.78)

where $\Sigma_i$ is the covariance for the $i$-th target. The nominal control $u(t)$ is constantly zero.
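For reference, (3.78) is cheap to evaluate from the per-target covariances; a minimal Julia sketch with illustrative names:

```julia
using LinearAlgebra

# Terminal cost (3.78): sum of exponentiated Gaussian differential entropies,
# which penalizes the worst-estimated targets most heavily.
terminal_cost(Σs::Vector{<:AbstractMatrix}) = sum(sqrt(det(2π * ℯ * Σ)) for Σ in Σs)

# Example: 20 targets, each with a 2x2 covariance.
Σs = [Matrix(0.1I, 2, 2) for _ in 1:20]
h = terminal_cost(Σs)
```

The exponentiated entropy grows like the volume of a target's error ellipse, so a single poorly tracked target dominates the sum, matching the worst-case intention stated above.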

We compared SACBP against four baseline methods: (i) a greedy algorithm based on the gradient

descent of terminal cost h, similar to [138]; (ii) MCTS-DPW [32, 42] in the Gaussian belief space; (iii)

projection-based trajectory optimization for ergodic exploration [104, 105, 38]; (iv) T-LQG [123].

We also attempted to implement the belief iLQG [161] algorithm, but the policy did not converge for

this problem. We suspect that the non-convex terminal cost h contributed to this behavior, which

in fact violates one of the underlying assumptions made in the paper [161].

MCTS-DPW uses the same planning horizon as SACBP; however, it draws $N = 25$ samples from the belief tree so that the computation time of the two algorithms matches approximately.


Figure 3.2: Sample robot trajectories (depicted in red) generated by each algorithm. Greedy, MCTS-DPW, and Ergodic did not result in a trajectory that fully covers the two groups of the targets. Ergodic (Open-Loop) mostly strayed around one group because the updated beliefs were not reflected into the robot’s trajectory. T-LQG failed to reduce the estimation uncertainty even after 200 s, due to insufficient time to solve the NLP with high-dimensional joint states in an online manner. SACBP successfully explored the space and periodically revisited both of the two target groups. With SACBP, the robot traveled into one of the four diagonal directions for most of the time. This is due to the fact that SACBP optimizes a convex quadratic under a box saturation constraint, which tends to find optimal solutions at the corners. In all the figures, the blue lines represent the target trajectories and the shaded ellipses are 99% error ellipses at t = 200 s.

For T-LQG, an NLP is formulated so that the optimization objective is a discrete-time equivalent

of the SACBP objective, with the discrete time interval of ∆to = 0.2 s. Unfortunately, the Ipopt

solver takes a long time to solve this NLP; on average the optimization over 10 timesteps (thus 2

s) takes 43.9 s to converge to a local minimum, with the worst case of over 250 s, due to the high

dimensionality of the joint state space. This high computation cost makes it prohibitive to derive

an offline policy over the entire 200 s simulation horizon prior to the execution. Therefore, we only


test T-LQG in an online manner for this problem, with the same planning horizon as SACBP. For

a fair comparison with the other methods, we adjust the max_cpu_time parameter of the solver so

that the Ipopt iterations terminate within a computation time that is the same order of magnitude

as SACBP, MCTS-DPW, and ergodic exploration. Note also that T-LQG is a closed-loop planning

method that comes with a local LQG controller to track the optimized nominal belief trajectory.

Designed to require minimal online computation, re-planning for T-LQG only happens if the end of

the planning horizon is reached or the symmetric KL-divergence between the nominal belief state

and the actual belief state exceeds a certain threshold $d_{\mathrm{th}}$. We set $d_{\mathrm{th}}$ to $10^6$ for this problem, since

we noticed that a smaller value would result in re-planning after almost each observation update.

With our choice, re-planning almost only happens at the end of the planning horizon, which allows

for efficient execution of the policy. More details about the re-planning for T-LQG can be found in

[124, 123].

Ergodic exploration is not a belief space planning approach but has been used in the active

sensing literature. Beginning with the nominal control of zero, it locally optimizes the ergodicity

of the trajectory with respect to the spatial information distribution based on Fisher information.

This optimization is open-loop since the effect of future observations is not considered. As a new

observation becomes available, the distribution and the trajectory are recomputed.

All the controllers were saturated at the same limit. The results presented in Figure 3.1 clearly

indicate superior performance of SACBP while achieving real-time computation. More notably,

SACBP generated a trajectory that periodically revisited the two groups whereas other methods

failed to do so (Figure 3.2). With SACBP the robot was moving into one of the four diagonal

directions for most of the time. This is plausible, as SACBP solves the quadratic program with

a box input constraint (3.35), which tends to find optimal solutions at the corners. MCTS-DPW

resulted in a highly non-smooth trajectory and failed to fully explore the environment. The greedy

approach improved the smoothness, but the robot eventually followed a cyclic trajectory in a small

region of the environment. To our surprise, the ergodic method did not generate a trajectory that

covers the two groups of the targets. This is likely due to the use of a projection-based trajectory

optimization method, which has been recently found to perform rather poorly with rapid re-planning

[39]. To verify our hypothesis, we performed an additional simulation run for the ergodic method, in which

the entire trajectory over 200 s was computed and executed without any re-planning. The result

is presented in Figure 3.2 as Ergodic (Open-Loop). With this open-loop computation, the robot

was no longer stuck. However, it strayed into the lower-left corner where there did not exist

any target. This is because the updated, refined beliefs were not reflected into the trajectory. As a

result, the worst entropy value among the targets at t = 200 s was 2.37, which is worse than SACBP

according to Figure 3.1. Among all the baselines implemented, T-LQG performed the worst, leaving

large estimation covariances even after the full 200 s. This is due to the insufficient time budget to

solve the NLP, which indicates that T-LQG is not suited for high-dimensional belief space planning


Figure 3.3: (Left) The robot is attached to the rectangular object. (Middle) The history of the $\ell_2$ norm of the residual between the goal state and the true object state averaged over 20 runs. SACBP with $N = 10$ samples successfully brought the object close to the goal. The reduction of the residual norm was much slower for MCTS-DPW. T-LQG was not as successful either, regardless of whether the policy was derived offline (without re-planning) or online (with re-planning), although it eventually achieved similar residual norms to SACBP. Belief iLQG resulted in large overshoots at around 2 s and 11 s. (Right) Computation time of SACBP was increased from the multi-target tracking problem due to increased complexity related to the continuous-discrete belief dynamics, but still achieved a reasonable value. Note that the computation times of the offline algorithms were significantly longer and are not shown in this plot.

problems such as this active multi-target tracking task.

3.4.2 Object Manipulation under Model Uncertainty

This problem is identical to the model-based Bayesian reinforcement learning problem studied in

[145]; therefore, a detailed description of the nonlinear dynamics and the observation models is

omitted. See Figure 3.3 for the illustration of the environment. A 2D robot attached to a rigid body

object applies forces and torques to move the object to the origin. The object’s mass, moment of

inertia, moment arm lengths, and linear friction coefficient are unknown. These parameters as well

as the object’s 2D state need to be estimated using EKF, with noisy sensors which measure the

robot’s position, velocity, and acceleration in the global frame. The same values for tf − t0, ∆to,

∆tc, tcalc as in the previous problem are assumed. Each simulation is run for 20 s. SACBP uses

ε = 0.04 s and N = 10. The nominal control for SACBP is a position controller whose input is the

mean x-y position and the rotation estimates of the object. The cost function is quadratic in the

true state $x$ and control $u$, given by $\frac{1}{2}x^\top C_x x + \frac{1}{2}u^\top C_u u$. Taking expectations yields the equivalent cost in the Gaussian belief space $c(b, u) = \frac{1}{2}\mu^\top C_x \mu + \frac{1}{2}\mathrm{tr}(C_x\Sigma) + \frac{1}{2}u^\top C_u u$, where $\Sigma$ is the covariance matrix. We let the terminal cost $h$ be the same as $c$ except without the control term.
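The identity behind this belief space cost is $\mathbb{E}[x^\top C_x x] = \mu^\top C_x \mu + \mathrm{tr}(C_x \Sigma)$ for $x \sim \mathcal{N}(\mu, \Sigma)$. A minimal Julia sketch, with illustrative names:

```julia
using LinearAlgebra

# Expected quadratic cost under a Gaussian belief N(μ, Σ), as stated above:
# E[xᵀ Cx x] = μᵀ Cx μ + tr(Cx Σ), plus the deterministic control term.
belief_cost(μ, Σ, u, Cx, Cu) = 0.5*dot(μ, Cx*μ) + 0.5*tr(Cx*Σ) + 0.5*dot(u, Cu*u)
```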

We compared SACBP against (i) MCTS-DPW in the Gaussian belief space, (ii) belief iLQG, and


(iii) T-LQG. MCTS-DPW uses the same planning horizon as SACBP, and is set to draw N = 240

samples so that the computation time is approximately equal to that of SACBP. Furthermore,

MCTS-DPW uses the position controller mentioned above as the rollout policy, which is suggested

in [145].

The policy computation with belief iLQG is performed offline over the entire simulation horizon

of 20 s. The solver is initialized with a nominal trajectory generated by the same position controller.

The entire computation of the policy takes 5945 s, a significant amount of time until convergence

to a locally optimal affine feedback policy. Note however that the online policy execution can be

performed instantaneously.

For T-LQG, we test the algorithm in both the online and the offline modes. The online mode is

equivalent to the implementation for the active multi-target tracking problem. Re-planning happens

when the end of the planning horizon is reached or the symmetric KL-divergence surpasses dth = 25,

which was rarely exceeded during the simulation for this problem. Furthermore, the max_cpu_time

parameter of Ipopt is adjusted so the computation time is comparable to SACBP and MCTS-DPW.

This is because the full optimization over 10 timesteps (i.e. 2 s) takes 2.5 s on average and 7 s in the

worst case, which is better than in the active multi-target tracking problem but still prohibitively

slow for online control computation, taking about 15× tcalc to 45× tcalc with tcalc = 0.15 s. On the

other hand, the offline mode computes the closed-loop policy once for the entire simulation horizon

without online re-planning or limiting max cpu time. This setup is identical to belief iLQG. In this

mode, it takes T-LQG 1065 s to compute the policy, which is about 5 to 6 times faster than belief

iLQG. This improved efficiency is congruous with the complexity analysis provided in [123]. Note

also that we use the aforementioned position controller to initialize the NLP solver in both the online

and the offline modes.

Overall, the results presented in Figure 3.3 demonstrate that SACBP outperformed all the base-

lines in this task with only 10 Monte Carlo samples, bringing the object close to the goal within 10 s.

Although the computation time increased from the previous problem due to the continuous-discrete

filtering, it still achieved near real-time performance with less than 3× tcalc s on average. Compared

to SACBP, reduction of the residual norm was slower for MCTS-DPW and online T-LQG. The two

offline algorithms tested, belief iLQG and offline T-LQG, both had a large overshoot at around 2 s

and 3 s, respectively. We suppose that this was caused by the offline nature of the policies, as well

as a mismatch between the discrete-time model used for planning and the continuous-time model

employed for dynamics simulation. Another large overshoot for belief iLQG at 11 s was likely due

to a locally optimal behavior of the iLQG solver.

3.5 Conclusions

In this chapter we present SACBP, a novel online algorithm for fast, general-purpose belief


space planning. We view the stochastic belief dynamics as a hybrid system with time-driven switch-

ing and derive the optimal control perturbation based on the perturbation theory of differential

equations. The resulting algorithm extends the framework of SAC to stochastic belief dynamics and

is highly parallelizable to run in near real-time. The rigorous mathematical analysis shows that the

notion of mode insertion gradient can be generalized to stochastic hybrid systems, which leads to

the property of SACBP that the algorithm is expected to perform at least as well as the nominal

policy with an appropriate choice of the perturbation duration. Through an extensive simulation

study, we have confirmed that SACBP outperforms other online and offline algorithms, including

a greedy approach and non-myopic closed-loop planners that are based on approximate dynamic

programming and/or local trajectory optimization. In future work, we are interested in considering

a distributed multi-robot version of SACBP as well as problems with hard state constraints. Other

research directions include providing additional case studies for more complex systems with efficient

implementation.


Chapter 4

Fast Risk-Sensitive Planning with Probabilistic Generative Models

In Chapters 2 and 3, we addressed challenges concerning state uncertainty for robots operating in

open and interactive workspaces. Our coherent strategy was to form plans in the belief space that

optimally reduce state uncertainty through sequential control actions and recursive Bayesian infer-

ence. The belief space planning methods proposed in the previous chapters are effective to overcome

challenges arising from epistemic uncertainty about the current state of the system. However, belief

space planning alone does not address all the difficulties of the future state uncertainty that origi-

nates from transition uncertainty, as discussed in Chapter 1 (Section 1.1); transition uncertainty is

difficult to diminish since it is often inherent in the dynamics, especially when the system involves

other decision-making agents (e.g. humans or human-driven vehicles). Even worse, such systems are

often deemed safety-critical, meaning that any failure that the robot makes can cause severe physi-

cal damage to itself or the other agents. Nevertheless, they are of practical importance in the robotics

industry, including self-driving vehicles at highway speeds and autonomous drones performing agile

outdoor flights. Therefore, it is crucial for planning and control algorithms to be designed so that

the transition uncertainty is properly addressed and that the system remains safe under various

operating conditions.

In this chapter, we shift our attention to transition uncertainty in safety-critical systems. We

propose an online stochastic control algorithm that approximately optimizes a risk-sensitive objective

called the entropic risk measure. Our method is an extension of the Stochastic SAC algorithm

presented in Chapter 3; therefore, it inherits many desirable properties of Stochastic SAC, such

as fast computation and high performance. Compared to Stochastic SAC, which only considers the

expected objective value, however, the proposed method is aware of the variance of the underlying


distribution and thus can mitigate “rare but catastrophic” events. We name this new algorithm Risk-

Sensitive Sequential Action Control (RSSAC). RSSAC achieves this extension without increasing

computational complexity from Stochastic SAC. Furthermore, RSSAC does not require knowledge

of the analytical form of the distribution, nor is it limited to a certain class of distributions such as

Gaussian; Monte Carlo samples from a probabilistic generative model suffice to run this algorithm.

As a representative application, we choose to focus on an autonomous navigation task for a ground

robot among many mutually-interacting humans, where the robot must smoothly reach its goal

without a collision. To model transition uncertainty in the humans, we employ Trajectron++ [134],

a deep-learned generative model that performs multimodal probabilistic trajectory forecasting. We

demonstrate that the combined RSSAC-Trajectron++ framework can promote safety by inducing

a proactive robot behavior that avoids risky situations. The methods proposed in this chapter are

based on our prior work [108].

4.1 Introduction

As autonomous robots expand their workspace to cage-free, social environments, they must be

designed as safety-critical systems; failures in avoiding collisions with humans sharing the workspace

result in catastrophic accidents. Safe and efficient robot navigation alongside many humans is still a

challenging problem in robotics, especially due to the potentially unpredictable and uncooperative

nature of human motion. We propose an effective solution to this problem via risk-sensitive stochastic

optimal control, wherein desired collision avoidance and goal reaching motion is achieved by a

cost function, a risk-sensitivity parameter, and dynamic optimization. Specifically, we extend the

Stochastic Sequential Action Control (SAC) algorithm [111, 112] to a risk-sensitive setting through

the use of exponential disutility [174], the objective often referred to as the entropic risk measure [97].

The proposed sampling-based algorithm, which we name Risk-Sensitive Sequential Action Control

(RSSAC), is a stochastic nonlinear model predictive control (NMPC) algorithm that optimally

improves upon a given nominal control with a series of control perturbations. Unlike many other

control-theoretic stochastic NMPCs [156, 52, 175], RSSAC is not limited to a particular class of

distributions such as Gaussian, nor does it need to know the analytical form; it only requires a

black-box probabilistic generative model. Leveraging this property and recent advances in machine

learning for modeling multi-agent behavior, we combine RSSAC with Trajectron++ [134], a state-of-

the-art generative model for predicting the many possible future trajectories of multiple interacting

agents accurately and efficiently. The overall framework constitutes a prediction-control pipeline for

safe robot navigation, which is shown to be capable of interacting with more than 50 pedestrians

simultaneously while avoiding collisions and steering the robot towards its goal.

The contributions of the present work are two-fold. First, the extension of the Stochastic SAC

to risk-sensitive optimal control is presented as a theoretical contribution. This is achieved by


Figure 4.1: The proposed RSSAC-Trajectron++ framework is effective for safe robot navigation in a social environment densely populated with humans. (Left) A simulation environment with real human trajectories from the UCY/UNIV scene [90], overlaid with predictions sampled from Trajectron++. (Right) A risk-sensitive robot safely navigating itself alongside 5 humans.

generalizing the mode insertion gradient [172, 6] to the entropic risk measure. Second, the combined

RSSAC-Trajectron++ framework is presented as a novel modular approach to safe crowd-robot

interaction with dynamic collision avoidance. RSSAC takes advantage of the parallelizability of

Monte Carlo sampling, leading to an efficient GPU implementation with real-time performance.

We conduct simulation studies as well as a real-world experiment to demonstrate that the explicit

probabilistic prediction of Trajectron++ and risk-sensitive optimization of RSSAC allow a robot to

interact with many humans safely and efficiently. They also reveal that the risk sensitivity parameter

adds an additional degree of freedom in determining the type of interaction behavior exhibited by

the robot. In particular, high risk-sensitivity induces a proactive robot behavior that avoids high-

stakes states, for example by yielding to oncoming humans. We further show that such a behavioral

shift is not achieved by tuning conventional cost function parameters.

The rest of the chapter is organized as follows. Section 4.2 summarizes the relevant prior work.

We define the crowd-robot interaction problem for safe navigation in Section 4.3. The RSSAC

algorithm is described in Section 4.4 along with implementation details and a run-time analysis.

Section 4.5 presents simulation and experimental results. The chapter concludes in Section 4.6 with

future work.


4.2 Related Work

We first review relevant work in dynamic collision avoidance and safe robot navigation in Section

4.2.1. We then focus on Stochastic SAC (Section 4.2.2) and data-driven trajectory forecasting

methods (Section 4.2.3), on which our RSSAC-Trajectron++ framework is built.

4.2.1 Dynamic Collision Avoidance and Safe Robot Navigation

There exists a vast body of literature in the field of dynamic collision avoidance and safe robot

navigation. Coordinated collision avoidance is widely studied in the multi-robot systems literature,

with methods including mixed integer programming [102], reciprocal velocity obstacles [160], buffered

Voronoi cells [178, 170], Nash-equilibria-based differential games [107], and safety barrier certificates

[168]. These methods do not explicitly model uncoordinated agents or motion uncertainty, which

are both crucial aspects in human motion; later in Section 4.5 we observe collisions for [170] due

to humans breaking the coordination assumption. A notable exception is [95], where the authors

propose probabilistic safety barrier certificates for bounded motion uncertainty. An alternative

approach suited to bounded uncertainty is the Hamilton-Jacobi (HJ) reachability analysis [157, 26],

although it suffers from the curse of dimensionality in solving the Hamilton-Jacobi-Isaacs equation.

Another common approach to modeling motion uncertainty is via the use of probability distribu-

tions. MDP and POMDP frameworks have been applied to robot navigation and collision avoidance

[152, 21] and are best suited to discrete action spaces. Stochastic Optimal Control (SOC), especially

sampling-based model predictive control (MPC), is also gaining increasing attention due to its abil-

ity to handle stochastic uncertainty and incorporate prediction of possible future outcomes. Recent

methods applied to safe robot navigation and dynamic collision avoidance include exhaustive search

with motion primitives [137], path-integral control [52], and information-theoretic control [176].

While the latter two approaches can find approximately optimal solutions under certain assump-

tions, they require the form of the distribution to be Gaussian. In addition, it is often not satisfactory to

only optimize for the expected cost, especially for safety-critical systems. There exist many possible

ways to endow robots with risk-awareness [97], such as chance-constrained optimization [16], condi-

tional value at risk (CVaR) [136], and entropic risk measure [99]. All of these methods come with

advantages and drawbacks, but in general there is a trade-off between the complexity of modeling

assumptions and the statistical safety assurances. Our work employs the entropic risk measure which

has been extensively studied in control theory [173, 51]. Albeit not yielding statistical safety assur-

ances, our algorithm can compute control inputs in real-time for highly dynamic environments with

multimodal predictive distributions, which is quite challenging for chance constrained optimization

or CVaR-based methods.

Lastly, we briefly highlight similarities and differences of our work from policy learning methods

such as imitation learning [120] or end-to-end deep reinforcement learning [46, 25]. Similar to these


methods, our framework applies machine learning to extract features from multi-agent interactions.

However, we use these features to make explicit probabilistic predictions about other agents which

are then incorporated into optimization by the model-based controller. Our modular approach has

two advantages. First, explicit prediction brings about better interpretability and explainability

of the resulting interaction. Second, our model-based controller allows the robot’s behavior to be

changed at run time by simply modifying the cost function design or the risk sensitivity parameter,

while other methods would need re-training to do so as the policy itself is data-driven.

4.2.2 Stochastic Sequential Action Control

SAC in its original form is a deterministic model predictive control algorithm for nonlinear hybrid

systems [6, 158]. In contrast to local trajectory optimization, at each time the algorithm perturbs

a given (open-loop or closed-loop) nominal control to optimize a local improvement of the total

receding-horizon cost, a quantity known as the mode insertion gradient [172]. SAC is shown to yield

highly efficient control computation with high-frequency re-planning, which often achieves better

performance than local trajectory optimization methods. In Chapter 3, we generalized the mode

insertion gradient to stochastic systems and solved challenging belief space planning problems, where

our approach significantly outperformed other state-of-the-art methods under both state and state

transition uncertainties. The proposed RSSAC algorithm is an extension of this algorithm to risk-

sensitive optimal control in which the cost functional is the entropic risk measure. This makes the

algorithm more suitable for problems where safety is concerned, such as the crowd-robot interaction

problem that we address in this chapter.

4.2.3 Multi-Agent Trajectory Modeling from Data

The problem of multi-agent trajectory forecasting has been substantially studied in the literature.

Many earlier works focus on producing the best single output trajectory from a history of state

observations. They include the earliest works that view humans as physical objects whose mo-

tions are determined by imaginary Newtonian forces [62], as well as other approaches that solve

a time-series regression problem by Gaussian processes [167], inverse reinforcement learning [88],

or recurrent neural networks (RNNs) [3]. However, since human behavior is rarely deterministic

or unimodal, deep learning-based generative approaches have emerged as state-of-the-art trajectory

forecasting methods, due to recent advancements in deep generative models [146, 53]. Notably, they

have caused a shift from focusing on predicting the single best trajectory to producing a distribution

of potential future trajectories. This is advantageous in autonomous systems as full distribution

information is more useful for downstream tasks, e.g. motion planning and decision making where

information such as variance can be used to make safer decisions. Most works in this category

use a recurrent neural network architecture with a latent variable model, such as a Conditional

Variational Autoencoder (CVAE) [146], to explicitly encode multimodality (e.g. [87, 69, 130]), or a


Generative Adversarial Network (GAN) [53] to do so implicitly (e.g. [56, 133, 83]). Trajectron++

[134] belongs to the former, making use of a CVAE to explicitly model the multimodality of human

behavior with a latent variable. Specifically, it uses a discrete Categorical latent variable which aids

in performance as well as interpretability, since different behavior modes are explicitly represented

and able to be visualized. Designed for downstream robotic motion planning and control tasks,

Trajectron++ can also produce predictions that are conditioned on candidate future motion plans

of the ego robot. We leverage this capability in a real-world experiment in Section 4.5 and show

that the robot-future-conditional prediction significantly improves both safety and efficiency of the

robot navigation.

4.3 Problem Statement

4.3.1 Dynamics Model

Mirroring Chapter 3 (Sections 3.2.1 and 3.4.1), we assume a discrete-time dynamical system model

for other agents (i.e. humans) while employing a continuous-time one for the robot. Recall that this

modeling assumption is practical, as 1) in general the robot receives information of its surrounding

environment at a much lower frequency than actuation; and 2) with this representation one can

naturally handle potentially agile robot dynamics without coarse time discretization. Let x_r(t) ∈ R^n denote the fully-observable state of the robot at time t, whose dynamics are modeled by

ẋ_r(t) = f_r(x_r(t)) + H(x_r(t))u(t),   (4.1)

where u(t) ∈ U ⊆ R^m is the control and U is a bounded convex set. We assume that the dynamics are deterministic and control-affine. As the robot navigates in the environment, it receives position information on humans in the scene and updates it at discrete times {t_k}_{k≥0}, with interval time ∆t_o. Similar to related works [178, 170] we do not assume velocity information to be available to the robot. Let p_i(t_k) ∈ R^2 denote the position of the human with label i ∈ {1, . . . , N}. From t_{k−1} to t_k, the position p_i(t_{k−1}) changes to p_i(t_k) with difference y_k^i. Note that in this chapter we assume full observability of the human positions. From the robot's continuous-time perspective, this is viewed as a periodic jump discontinuity given by

p_i(t_k) = p_i(t_k^−) + y_k^i,
p_i(t) = p_i(t_k) ∀t ∈ [t_k, t_{k+1}),   (4.2)

where t_k^− is infinitesimally smaller than t_k. The dynamics of the joint state

x(t) ≜ (x_r(t)^T, p_1(t)^T, . . . , p_N(t)^T)^T ∈ R^{n+2N}   (4.3)


are specified by (4.1) and (4.2), which together constitute a hybrid dynamical system with time-

driven switching. We treat N as a fixed number in this section, but relax this assumption later and

handle a time-varying number of humans via re-planning.

The robot plans its control actions over a finite horizon [t_0, t_T] while the humans make stochastic transitions {y_k^i} (1 ≤ i ≤ N, 1 ≤ k ≤ T). Stacking these transition variables, we obtain a random vector with distribution D. This distribution can be dependent on other variables such as the past history of

interactions and/or future motion plans of the robot. For the sake of planning, we only require that

samples drawn from D are available. In this work we choose to model D with Trajectron++ [134],

due to its superior performance over other trajectory forecasting methods as well as its capability

to produce robot-future-conditional predictions. The reader is referred to [134] for further details

on Trajectron++.
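To make the hybrid structure concrete, the following is a minimal simulation sketch in Julia (the language of our implementation; see Section 4.4.5). The functions fr and Hm stand in for f_r and H in (4.1), and y holds one sampled set of human transitions from D; all names here are illustrative rather than part of the actual codebase.

    # Illustrative sketch: Euler integration of the robot ODE (4.1) between
    # measurement times, with the human position jumps (4.2) applied every Δto.
    function simulate_hybrid(xr, ps, u, y, fr, Hm; Δtc=0.02, Δto=0.4, T=12)
        t = 0.0
        for k in 1:T
            for _ in 1:round(Int, Δto / Δtc)   # robot flows over [t_{k-1}, t_k)
                xr = xr + Δtc * (fr(xr) + Hm(xr) * u(t))
                t += Δtc
            end
            ps = [ps[i] + y[i][k] for i in eachindex(ps)]   # jumps (4.2) at t_k
        end
        return xr, ps
    end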

4.3.2 Optimal Control Problem

Consider a finite horizon optimal control cost of the form

J =

∫ tT

t0

c(x(t), u(t))dt+ h(x(tT )), (4.4)

where c(·) is the instantaneous cost function and h(·) is the terminal cost function. In this work we

assume that c(·) is the well-known LQ tracking cost plus a collision cost term:

c(x(t), u(t)) = (1/2)(x_r(t) − r(t))^T Q (x_r(t) − r(t)) + (1/2) u(t)^T R u(t) + c_col(x(t)),   (4.5)

where Q = Q^T ⪰ 0 and R = R^T ≻ 0 are weight matrices, r is a reference trajectory for the robot, and c_col(·) ≥ 0 is a collision penalty function that is continuously differentiable, bounded, and has a bounded gradient with respect to x_r. The reference trajectory r is assumed to be given, possibly

from a high level global planner. Similarly, the terminal cost h(·) is defined by

h(x(t_T)) = (β/2)(x_r(t_T) − r(t_T))^T Q (x_r(t_T) − r(t_T)) + β c_col(x(t_T)),   (4.6)

where β ≥ 0 determines the relative weight between the terminal and instantaneous costs.
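For concreteness, these costs translate directly into a few lines of Julia; this is a minimal sketch, with Q, R, β, the reference point rt = r(t), and the collision penalty ccol supplied by the caller rather than fixed here.

    using LinearAlgebra

    # Illustrative sketch of the stage cost (4.5) and terminal cost (4.6).
    stage_cost(xr, u, rt, Q, R, ccol, x) =
        0.5 * dot(xr - rt, Q, xr - rt) + 0.5 * dot(u, R, u) + ccol(x)
    terminal_cost(xr, rt, Q, β, ccol, x) =
        0.5β * dot(xr - rt, Q, xr - rt) + β * ccol(x)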

Having specified the cost functional J , we define the (risk-neutral) stochastic optimal control

problem as follows.


Problem 4.1 (Risk-Neutral Stochastic Optimal Control).

    minimize_u    E_D[J]
    subject to    (4.1), (4.2) ∀i ∈ {1, . . . , N}, ∀k ∈ {0, . . . , T}
                  x_r(t_0) = x_{r0}, p_i(t_0) = p_{i0} ∀i ∈ {1, . . . , N}
                  u(t) ∈ U ∀t ∈ [t_0, t_T)

Remark 4.1. Note that the definition above yields an open-loop optimal control problem. Even

though this is suboptimal compared to closed-loop planning that seeks a feedback policy, in practice

we will solve Problem 4.1 repeatedly as MPC, which has been found to work well in prior work

(e.g. [137]). This open-loop assumption also makes the subsequent analysis presented in Section 4.4

straightforward.

4.3.3 Entropic Risk Measure

The formulation of Problem 4.1 ignores the safety-critical aspect of collision avoidance, as simply

optimizing the expected value fails to take into account the shape of the distribution. A remedy is

to introduce the following entropic risk measure with risk sensitivity parameter θ > 0:

R_{D,θ}(J) ≜ (1/θ) log(E_D[e^{θJ}]).   (4.7)

As long as the robot trajectory xr is bounded for all admissible control, J becomes a bounded random

variable and RD,θ(J) is finite. It is known that this transformation approximately decouples the

mean and variance:

R_{D,θ}(J) ≈ E_D[J] + (θ/2) Var_D(J)   (4.8)

for small θ Var_D(J) [174]. The meaning of θ is now clear; it is a parameter that determines how much we care about the variability of the cost in addition to the mean. A larger θ increases risk sensitivity, while the risk-neutral objective E_D[J] is recovered as θ → 0⁺.
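As a quick numerical illustration of (4.7) and (4.8), the entropic risk of sampled costs can be estimated in a few lines of Julia; the cost samples below are synthetic and only meant to show that the mean-variance approximation is tight when θ Var_D(J) is small.

    using Statistics

    # Illustrative sketch: Monte Carlo estimate of the entropic risk (4.7)
    # and its mean-variance approximation (4.8).
    entropic_risk(J, θ) = log(mean(exp.(θ .* J))) / θ

    J = 10.0 .+ randn(100_000)     # synthetic sampled costs with Var ≈ 1
    θ = 0.1
    entropic_risk(J, θ)            # ≈ 10.05
    mean(J) + θ / 2 * var(J)       # ≈ 10.05 as well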

Replacing the expectation in Problem 4.1 with (4.7), we obtain the following risk-sensitive optimal

control problem:

Problem 4.2 (Risk-Sensitive Stochastic Optimal Control).

    minimize_u    R_{D,θ}(J)
    subject to    (4.1), (4.2) ∀i ∈ {1, . . . , N}, ∀k ∈ {0, . . . , T}
                  x_r(t_0) = x_{r0}, p_i(t_0) = p_{i0} ∀i ∈ {1, . . . , N}
                  u(t) ∈ U ∀t ∈ [t_0, t_T)


4.4 Risk-Sensitive Sequential Action Control

4.4.1 Review of Stochastic Sequential Action Control

Even solving Problem 4.1 is intractable due to potential non-convexity in c_col and complexity in

D. An efficient, approximate MPC solution can be obtained via Stochastic SAC introduced in

Chapter 3. In this framework, we seek the optimal perturbation of a given nominal control such

that the expected value of the mode insertion gradient [172, 6], which quantifies local effects of

the perturbation on the cost functional J , is minimized. We assume that the nominal control is

an open-loop control schedule u (although it is straightforward to extend our method to deal with

closed-loop nominal control policies [111, 112]). The perturbed control is defined as

u_ε(t) ≜ { v       if t ∈ (τ − ε, τ]
        { u(t)    otherwise,          (4.9)

where v ∈ U, τ ∈ (t_0, t_T), and ε ≥ 0 are the perturbation parameters. The perturbation u_ε yields a deterministic, perturbed state trajectory x_ε and cost J^ε under a specific sample from D.
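In code, the perturbed schedule is just a thin wrapper around the nominal one; a one-line Julia sketch (with hypothetical names):

    # Illustrative sketch of (4.9): replace the nominal input on (τ − ε, τ] by v.
    perturb(u, v, τ, ε) = t -> (τ - ε < t <= τ) ? v : u(t)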

The mode insertion gradient is the sensitivity of the cost J to the perturbation duration ε, with v and τ fixed:

∂_+J/∂ε |_{ε=0} ≜ lim_{ε→0⁺} (J^ε − J)/ε.   (4.10)

The value of the mode insertion gradient is given by

∂_+J/∂ε |_{ε=0} = (1/2) v^T R v + ρ_r(τ)^T H(x_r(τ))(v − u(τ)) − (1/2) u(τ)^T R u(τ),   (4.11)

where x_r is the robot state trajectory induced by the nominal control u, and ρ_r is the adjoint variable

matching the dimensionality of the robot state. Specifically, it follows the ordinary differential

equation (ODE):

ρ̇_r(t) = −(∂/∂x_r) c(x(t), u(t)) − ((∂/∂x_r) f_r(x_r(t)) + (∂/∂x_r)[H(x_r(t))u(t)])^T ρ_r(t),   (4.12)

with boundary condition ρ_r(t_T) = (∂/∂x_r) h(x(t_T)). One can obtain (4.11) and (4.12) by following the

same derivation as presented in Chapter 3 (Section 3.2.1). As the joint state x(t) is in fact a random

vector due to stochasticity in human motion, so is ρ_r(t), and the deterministic ODE (4.12) is only

valid under a specific sample from D. Taking the expectation of (4.11) yields the expected mode

insertion gradient:

E_D[∂_+J/∂ε |_{ε=0}] = (1/2) v^T R v + E_D[ρ_r(τ)]^T H(x_r(τ))(v − u(τ)) − (1/2) u(τ)^T R u(τ).   (4.13)


4.4.2 Generalized Mode Insertion Gradient

Under a weak regularity condition on the dynamics function (4.1), and for the cost J defined by (4.4), (4.5), (4.6), one can show that the robot trajectory x_r is bounded for all admissible control, and that the perturbed cost J^ε is globally Lipschitz continuous in ε ≥ 0:

|J^ε − J| ≤ εK,   (4.14)

under the same random sample from D for J and J^ε. The value of the constant K is not dependent on

the samples. Therefore, the dominated convergence theorem allows us to interchange the derivative

and the expectation:

E_D[∂_+J/∂ε |_{ε=0}] = ∂_+E_D[J]/∂ε |_{ε=0}.   (4.15)

The statement of the regularity condition and the proof of the key results (i.e. (4.14) and (4.15))

are deferred to Appendix B. The right-hand side of (4.15) is the mode insertion gradient generalized

to the case of risk-neutral stochastic optimal control (i.e. Problem 4.1). If the optimal value with

respect to v is negative for some τ , then there exists a sufficiently small ε for which the perturbation

defined by (v, τ, ε) will reduce the expected cost. If instead it is zero, we can think of the nominal

control as satisfying a local optimality condition.

4.4.3 Extension to Entropic Risk Measure

Sections 4.4.1 and 4.4.2 have provided a brief summary of the Stochastic SAC framework presented

in Chapter 3. Now we are set to derive the generalized mode insertion gradient for the entropic risk

measure RD,θ(J), which is a novel contribution of this chapter.

Lemma 4.1. Suppose that the total cost J satisfies (4.14). Then, for θ > 0 the following relation

holds:

E_D[∂_+e^{θJ}/∂ε |_{ε=0}] = ∂_+E_D[e^{θJ}]/∂ε |_{ε=0}.   (4.16)

Proof. Let J^ε and J be the perturbed and the nominal cost, respectively. We have

|e^{θJ^ε} − e^{θJ}| = |e^{θ(J^ε − J)} − 1| e^{θJ}   (4.17)
≤ (e^{θ|J^ε − J|} − 1) e^{θJ},   (4.18)


where we used |e^x − 1| ≤ e^{|x|} − 1. Since J is a bounded random variable, we also have e^{θJ} ≤ C for some finite, deterministic constant C ≥ 1. Substituting (4.14) and dividing by ε > 0, we obtain

|e^{θJ^ε} − e^{θJ}| / ε ≤ C (e^{θKε} − 1) / ε.   (4.19)

Let g(ε) denote the right-hand side of this inequality, which is a strictly monotone increasing function for ε > 0. Now let |f_ε(J)| denote the left-hand side as a function of J parameterized by ε. Take some positive ε_0 and a sequence {ε_n}_{n≥0} converging to 0⁺ as n → ∞. It follows that ∀n ≥ 0, |f_{ε_n}(J)| ≤ g(ε_0). The dominated convergence theorem applies as g(ε_0) is a finite, deterministic constant:

E_D[lim_{n→∞} f_{ε_n}(J)] = lim_{n→∞} E_D[f_{ε_n}(J)],   (4.20)

which is equivalent to (4.16).

Theorem 4.1 (Mode Insertion Gradient of Entropic Risk). Suppose that the regularity condition

mentioned in Section 4.4.2 is met so that the robot trajectory x_r is bounded under admissible control and the cost J satisfies (4.14). Then, for fixed v and τ the mode insertion gradient of the entropic risk measure ∂_+R_{D,θ}(J)/∂ε |_{ε=0} exists and is given by

∂_+R_{D,θ}(J)/∂ε |_{ε=0} = (1/2) v^T R v + (E_D[e^{θJ} ρ_r(τ)]^T / E_D[e^{θJ}]) H(x_r(τ))(v − u(τ)) − (1/2) u(τ)^T R u(τ),   (4.21)

where J inside the expectations is the cost value under the nominal control u.

Proof. As the robot trajectory is bounded, J is a bounded random variable and the value of R_{D,θ}(J) is finite. The chain rule gives

∂_+R_{D,θ}(J)/∂ε |_{ε=0} = (1/(θ E_D[e^{θJ}])) ∂_+E_D[e^{θJ}]/∂ε |_{ε=0}   (4.22)
= (1/E_D[e^{θJ}]) E_D[e^{θJ} ∂_+J/∂ε |_{ε=0}],   (4.23)

where we also used Lemma 4.1. Substituting (4.11) and simplifying the terms completes the proof.

The mode insertion gradient of entropic risk (4.21) is a generalization of the stochastic mode

insertion gradient (4.15). Indeed, for the risk-neutral case (i.e. θ = 0) the two equations match.

This enables us to extend the Stochastic SAC algorithm to risk-sensitive optimal control problems

without changing the structure of the algorithm.
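Numerically, the weighting E_D[e^{θJ}ρ_r(τ)]/E_D[e^{θJ}] in (4.21) is a softmax-style average of the sampled adjoint states. A minimal Julia sketch is given below; subtracting the maximum of the sampled costs before exponentiating is our own safeguard against overflow (the standard log-sum-exp trick), not something prescribed by the derivation.

    # Illustrative sketch: risk-sensitive weighting of sampled adjoints in (4.21).
    # ρ[j] is the adjoint ρ_r(τ) and J[j] the cost from the j-th Monte Carlo sample.
    function tilted_adjoint(ρ::Vector{<:AbstractVector}, J::Vector{<:Real}, θ::Real)
        w = exp.(θ .* (J .- maximum(J)))   # ∝ e^{θJ_j}; the shift avoids overflow
        w ./= sum(w)
        return sum(w .* ρ)                 # ≈ E_D[e^{θJ} ρ_r(τ)] / E_D[e^{θJ}]
    end

For θ = 0 the weights are uniform and the plain Monte Carlo average in (4.13) is recovered; as θ grows, the estimate concentrates on the highest-cost samples.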

4.4.4 RSSAC Algorithm

The core of RSSAC is the following optimization problem, which substitutes Problem 4.1 (if θ = 0)

and 4.2 (if θ > 0).


Problem 4.3 (Mode Insertion Gradient Optimization).

    minimize_{v,τ}    ∂_+R_{D,θ}(J)/∂ε |_{ε=0}
    subject to        τ ∈ (t_0 + t_calc, t_T)
                      v(τ) ∈ U ∀τ ∈ (t_0, t_T),

where t_calc is a computation time budget. The objective can be evaluated by Monte Carlo sampling of the joint dynamics (4.1), (4.2) under the nominal control and backward integration of the adjoint dynamics (4.12). Fixing τ, Problem 4.3 is a quadratic minimization over v under a convex constraint. The optimal value v*(τ) can be obtained analytically for simple constraint sets U, such as a box constraint or a norm inequality constraint with R being a scaled identity matrix. Optimizing τ is achieved by solving for v*(τ) multiple times and searching for the minimum value. There is only a finite number of τ to consider¹ since in practice we use numerical integration, such as the Euler scheme with some discrete step size, to integrate the robot dynamics. The optimal mode insertion gradient is always non-positive, and if it is negative then there exists some ε > 0 such that the entropic risk measure is reduced as a result of the control perturbation. The value of ε can be either specified or searched. If instead the mode insertion gradient is zero, we set ε = 0 and choose not to perturb.
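For instance, with R = rI and the norm-ball constraint U = {v : ‖v‖₂ ≤ u_max} used in our implementation, the minimizer over v for a fixed τ is the unconstrained optimum projected onto the ball. A minimal Julia sketch, where ρ̃ denotes the weighted adjoint average appearing in (4.21):

    using LinearAlgebra

    # Illustrative sketch: analytic v*(τ) for (4.21) with R = r*I and ‖v‖₂ ≤ umax.
    function optimal_perturbation(Hτ, ρ̃, r, umax)
        v = -(Hτ' * ρ̃) ./ r     # unconstrained minimizer of (1/2)r‖v‖² + ρ̃ᵀHτ v
        nv = norm(v)
        return nv > umax ? v .* (umax / nv) : v
    end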

The pseudo-code of the RSSAC algorithm is presented in Algorithm 4.1. Importantly, the Monte

Carlo sampling part is naturally parallelizable. Note that the risk sensitivity has no effect unless we

draw at least M = 2 samples. The re-planning happens every ∆t_r seconds, allowing for a variable

number of humans to be considered over time. The complexity of the algorithm is O(NMT), where

N denotes the number of humans, M the number of samples, and T the planning horizon.

4.4.5 Implementation Details

In our implementation, we use the double integrator model for the robot's dynamics: x_r(t) = (x_p(t)^T, x_v(t)^T)^T ∈ R^4, ẋ_p(t) = x_v(t), ẋ_v(t) = u(t). These dynamics are integrated using the Euler scheme with step size ∆t_c = 0.02 s. The cost weight matrices are Q = Diag(0.5, 0.5, 0, 0) and

R = 0.2 I_{2×2}. The collision cost is the following sum of Gaussian functions centered at each human:

c_col(x(t)) = Σ_{i=1}^{N} α exp(−‖x_p(t) − p_i(t)‖₂² / λ),   (4.24)

¹In this chapter, our search space for τ extends to t_T, unlike t_0 + t_calc + ∆t_o in Chapter 3. That is, we allow the controller to perturb the system at a much later time than assumed in the original Stochastic SAC algorithm. We note that this larger search space is in agreement with the SAC algorithm presented in [6], and have found that this modification improves the performance in practice.


Algorithm 4.1 Control Computation with RSSAC

INPUT: Initial joint state x(t_0), reference trajectory r, nominal control schedule u(t) for t ∈ [t_0, t_T].
OUTPUT: Optimally perturbed control schedule u_ε

1: for j = 1:M do
2:   Forward-simulate the joint dynamics (4.1), (4.2) while sampling human transitions {y_k^i}_j (1 ≤ i ≤ N, 1 ≤ k ≤ T) from D
3:   Compute sampled cost J^j
4:   Backward-simulate adjoint robot trajectory ρ_r^j with sampled human transitions (4.12)
5: end for
6: Compute Monte Carlo estimates: E_D[e^{θJ}ρ_r] ≈ (1/M) Σ_{j=1}^{M} e^{θJ^j} ρ_r^j and E_D[e^{θJ}] ≈ (1/M) Σ_{j=1}^{M} e^{θJ^j}
7: Solve Problem 4.3
8: Specify ε or search for it by re-simulating the dynamics
9: u_ε ← PerturbControlSchedule(u, v*, τ*, ε) (4.9)
10: return u_ε

with peak parameter α = 100 and bandwidth parameter λ = 0.2. The relative weight of the terminal cost is set to β = 0.1. The reference trajectory r is a straight line connecting the initial robot position to the goal position at a constant target speed, and is re-planned whenever ‖x_r(t) − r(t)‖₂ > 2 m. The control constraint is ‖u(t)‖₂ ≤ u_max with u_max = 5.0 m/s² in the simulation and 2.5 m/s² in the real-world experiment. Monte Carlo sampling is parallelized on a GPU with sample size M = 30. The planning horizon is 4.8 s, which corresponds to T = 12 steps of human motion prediction with measurement interval ∆t_o = 0.4 s.
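A minimal Julia sketch of (4.24) with these parameters, where xp is the robot position and ps the list of current human positions:

    # Illustrative sketch of the collision cost (4.24).
    ccol(xp, ps; α=100.0, λ=0.2) =
        sum(α * exp(-sum(abs2, xp .- p) / λ) for p in ps; init=0.0)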

In contrast to prior work [111, 112] and Chapter 3, we searched for ε from {0, 1×10⁻³, 2×10⁻³, 4×10⁻³, 8×10⁻³, 1.6×10⁻², 2×10⁻², 4×10⁻², 8×10⁻²} s by re-simulating the perturbed dynamics. Furthermore, the nominal control u is determined by a simple MPC-style search algorithm and takes the form

u(t) = { u_s(t)     if t ∈ [t_0 + t_calc, t_0 + t_calc + ∆t_o]
       { u_pr(t)    otherwise,                                    (4.25)

where u_pr is the perturbed control from the previous iteration, and u_s is either the same as u_pr or chosen from a set of constant control inputs (a cos(b), a sin(b)) with a ∈ {0.4u_max, 0.8u_max} and b ∈ {0, π/4, π/2, . . . , 2π}. The best u_s is chosen based on the evaluation of R_{D,θ}(J) for each nominal

control candidate using the Monte Carlo samples. This nominal search is similar to the iterative update scheme presented in [158] in that the previously-computed perturbation is used in the next iteration, but is different in that we also insert u_s prior to running Algorithm 4.1. The nominal search can alternatively be considered as a simplified version of a tree search with motion primitives [137] with only 17 control choices and tree depth 1 (i.e. u_s only lasts for one discrete-time interval ∆t_o). We found this simple nominal search to be effective compared to a constant nominal control [6, 158] while retaining low computational cost. Note that if robot-future-conditional prediction is


[Figure 4.2: three panels, (a) ETH, (b) HOTEL, (c) UNIV, plotting the minimum distance to a human [m] against the normalized distance to goal for Nominal Search, Exhaustive Search, BIC, and RSSAC (Ours), with a collision line.]

Figure 4.2: Quantitative results from 100 runs show that risk-neutral (i.e. θ = 0) RSSAC further improves the performance of Nominal Search Only as the theory suggests, achieving both safety and efficiency. Note that the farther up and to the right, the better, as the x-axis is flipped. Exhaustive Search could not scale to the UNIV scene with more than 50 humans. BIC resulted in multiple collisions. Error bars show standard deviation.

used, the distribution D is different for each of the 17 candidate nominal controls (4.25), allowing

the robot to consider the effect of various future robot trajectories on reactive human behavior.

We implemented all the control code in Julia and achieved real-time performance with re-planning

interval ∆t_r = t_calc = 0.1 s.
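The candidate set for the nominal search is equally compact in code; a sketch with hypothetical names, producing the 16 constant inputs that, together with the previous perturbed schedule u_pr, make up the 17 candidates:

    # Illustrative sketch: constant nominal candidates (a cos b, a sin b) over
    # 2 magnitudes × 8 headings; u_pr from the last iteration is the 17th choice.
    const_candidates(umax) =
        [[a * cos(b), a * sin(b)] for a in (0.4umax, 0.8umax) for b in 0:π/4:7π/4]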

4.5 Results

4.5.1 Simulation Results

We evaluated the performance of RSSAC in simulation and compared it against three baseline collision

avoidance algorithms: 1) LQ-Tracking with Buffered Input Cell (BIC) [170], 2) Nominal Search

(Section 4.4.5) Only, 3) Exhaustive Tree Search with Motion Primitives [137]. BIC is a reciprocal

collision avoidance method that has similar computational complexity to the Velocity Obstacle (VO)

approaches. Unlike VO, BIC does not require the velocity information of other agents to be available

and is applicable to high-order linear dynamics. As our robot is a double-integrator and only obtains


[Figure 4.3: three panels, (a) ETH, (b) HOTEL, (c) UNIV, plotting the minimum distance to a human [m] against the normalized distance to goal for RSSAC with θ = 0.0 and θ = 1.0.]

Figure 4.3: Compared to the risk-neutral case in Fig. 4.2, RSSAC with θ = 1.0 significantly reduces the standard deviation of the minimum robot-human distance by 11%, 12%, and 24% in (a), (b), and (c), respectively. The risk sensitivity trades off the stochastic collision cost and the deterministic tracking cost, which results in increased standard deviation in the x-axis in (a) and (b), and overall distance increase in (c) where the scene was most densely populated.

position measurements, BIC is a more suitable reciprocal collision avoidance approach than VO. BIC

is minimally invasive in that an arbitrary control input is projected onto a convex polygon in the

control space. Collision avoidance is guaranteed as long as all the agents respect their own BIC

constraint. We use the LQ-Tracking cost (i.e. (4.5) without c_col) to solve a deterministic optimal

control problem and the BIC constraint adjusts the control input based on the current positions of

humans. The robot is said to be in collision if it is within 40 cm of a human, but we set this

threshold to 80 cm for planning with BIC only to give it an extra safety margin. The exhaustive

search is similar to the nominal search in that it uses samples from D, but the tree depth is T = 4

and at each depth the constant control is chosen from (a cos(b), a sin(b)) with a ∈ {0, 0.6u_max} and b ∈ {0, π/4, π/2, . . . , 2π}. This results in 9⁴ = 6561 sequences of control inputs to be considered.

Despite parallelization on a GPU and a long replanning interval of ∆tr = tcalc = 0.4 s, Exhaustive

Search never achieved real-time performance. RSSAC, Nominal Search, and Exhaustive Search all

used M = 30 prediction samples (per human) drawn from the Trajectron++ model [134].

For this evaluation, we used three distinct sequences from the publicly available ETH [116]

and UCY [90] pedestrian motion datasets. They consist of real pedestrian trajectories with rich


multi-human interaction scenarios. Each sequence that we used is a series of consecutive frames

clipped from the ETH/ETH Test scene, the ETH/HOTEL Test scene, and the UCY/UNIV Test

scene. For brevity, we refer to these as the “ETH,” “HOTEL,” and “UNIV” sequences, respectively.

In each scene, the robot starts from a collision-free configuration and moves towards a specified

goal while avoiding collisions. The ETH sequence is 10 s long and has 16 total pedestrians. The

HOTEL sequence is also 10 s long and has 8 total pedestrians. The UNIV sequence is the most

challenging with 95 total pedestrians over a horizon of 20 s; there are always 36 to 54 pedestrians

simultaneously in each frame (see Fig. 4.1). Note that none of those pedestrians recognize or react

to the robot as their motion is a replay from the data. This is the so-called “invisible robot” setting

[25] where the robot has to predict Human-Human interaction only. For this reason, we did not

condition the prediction of Trajectron++ on the nominal control candidates of the robot.

Trajectron++ was trained following the same procedure as detailed in [134]. The model was

trained for 2000 steps with the Adam [78] optimizer and a leave-one-out approach. That is, the

model was trained on all datasets but the one used for evaluation. This training was done off-line

prior to running RSSAC. At run time, RSSAC fetches prediction samples from the trained model

every ∆to = 0.4 s and uses them for the mode insertion gradient optimization. All the online

computation was performed on a desktop computer with an Intel Xeon(R) CPU E5-2650 v3, 62.7

GiB memory, and a GeForce GTX 1080 graphics card.

Fig. 4.2 shows the statistical simulation results of RSSAC and the baseline methods. RSSAC,

Nominal Search, and Exhaustive Search were implemented with θ = 0. The x-axis shows the final

distance between the robot and its goal normalized by the initial distance. The y-axis is the minimum

distance between the robot and a human. These performance metrics measure the safety and

efficiency of robot navigation, and are often in conflict. For each method we performed 100 runs

with different random seeds. The goal of the robot was also randomized. As can be seen, RSSAC

was both the safest and the most efficient among those methods that ran in real-time. BIC ended up

in multiple collisions in all three cases as the humans violated the reciprocity assumption. We

did not test Exhaustive Search for the UNIV case due to its poor scalability; the computation took

about 10× longer than the allocated time budget. We also note that RSSAC did further improve the

performance of the Nominal Search, which itself was already achieving reasonably safe navigation.

4.5.2 Effects of Risk Sensitivity

Next, we studied the effects of risk sensitivity in the context of safe robot navigation among humans.

Fig. 4.3 compares RSSAC with θ = 0 (the same data as in Fig. 4.2) and θ = 1.0. Our risk-sensitive

optimization significantly reduced the standard deviation of the minimum robot-human distance

by 11%, 12%, and 24% for the ETH, HOTEL, and UNIV scenes, respectively. The risk sensitivity

trades off the stochastic collision cost and the deterministic tracking cost, which appears in the ETH

and HOTEL results where the standard deviation of the normalized goal distance slightly increased


[Figure 4.4: two simulation snapshots at time 6.00 s in x–y coordinates [m], each showing the ego robot, its velocity, target trajectory, goal, and past trajectory.]

Figure 4.4: Qualitative comparison of RSSAC with θ = 0 (left) and θ = 1.0 (right) in the HOTEL scene. These results differ in the minimum robot-human distance by only 3 cm and the normalized goal distance by 0.01, but the risk-sensitive robot (right) yields to potentially conflicting humans as opposed to the risk-neutral robot (left). Both simulations used the same random seed. Sampled predictions from Trajectron++ are also depicted.

by 3.9% and 2.3%. On the other hand, the UNIV scene was so densely populated with humans

that the risk-sensitive robot was unable to approach the goal. Although not necessarily captured by

the quantitative metrics, overall the risk-sensitive robot also exhibited another intriguing behavioral

difference as depicted in Fig. 4.4; the risk-sensitive robot tends to yield to potentially conflicting

humans.

To gain a better understanding of the effects of risk-sensitivity, we designed a simplified intersec-

tion scenario as illustrated in Fig. 4.5. The human in this toy example followed a linear Gaussian

model with a constant mean velocity and was on a potential collision course with the robot. We ran 100 simulations for each θ ∈ {0.0, 0.5, 1.0}. The goal position was not randomized in this study. We

also ran two additional sets of 100 simulations wherein we kept θ = 0.0 but varied the collision cost

peak parameter α and the bandwidth λ in (4.24), respectively. This was to elucidate the difference

between risk sensitivity tuning and cost function design. The results are summarized in Fig. 4.6 and

confirm that the more risk-sensitive the robot becomes, the more likely it is to yield to the human.

This consistent trend was not present when changing α or λ only. On the other hand, there was a

positive correlation between the minimum robot-human distance and all three parameters. This


[Figure 4.5: snapshot of the synthetic intersection scenario at time 0.00 s in x–y coordinates [m], showing the ego robot, its velocity, target trajectory, and goal.]

Figure 4.5: A synthetic intersection scenario with a human. The prediction is drawn from a linear Gaussian model with a constant mean velocity.

[Figure 4.6: six panels plotting the minimum distance to a human [m] (top row) and the empirical probability of yielding (bottom row) against risk sensitivity θ (with α = 100, λ = 0.2), cost peak α (with θ = 0.0, λ = 0.2), and cost bandwidth λ (with θ = 0.0, α = 100).]

Figure 4.6: Minimum robot-human distance (top) and empirical probability of yielding (bottom) for the synthetic intersection scenario. Changing the risk-sensitivity (left) consistently affected whether or not the robot yields, while the other two cost tuning parameters (middle and right) did not.

observation suggests that the risk sensitivity parameter θ affects the global behavior of the robot

that determines the “type” of interaction, whereas cost function tuning affects the local behavior

only (i.e. minimum robot-human distance). Thus, the risk sensitivity parameter can be considered

as an additional degree of freedom in choosing desirable robot behavior.

4.5.3 Real-World Experiment

Our simulation study was accompanied by a real-world experiment with a holonomic robot [171] and

5 human subjects in an indoor environment (see Fig. 4.1). Each subject was assigned specific start and goal positions, and was instructed to walk to their individual goal at a normal speed.

Although the start-goal pairs remained the same, we changed their assignment to the subjects after

each run so the type of interaction remained diverse. The robot started at a known position, but

its exact goal was randomized and not known to the subjects. The positions of all the subjects as

well as the robot were measured by a motion capture system. We used the same parameters as in

the simulation study except umax. A major difference between the simulation and the real-world

experiment is that the robot is “visible” to humans in the experiment, which requires the robot to

take into account resulting human-robot interaction in addition to human-human interaction. This


[Figure 4.7: plot of the minimum distance to a human [m] against the normalized distance to goal for RSSAC and RSSAC-Conditional, each with θ = 0.0 and θ = 1.0, with a collision line.]

Figure 4.7: Quantitative results of the real-world experiment with 5 human subjects. Using robot-future-conditional predictions with θ = 1.0 achieves the best average performance. Error bars show the standard deviation of 5 runs with a randomized robot goal and human start-goal assignment.

was achieved by conditioning the prediction of Trajectron++ on the nominal control candidates

for RSSAC, similar to [137]. As explained in Section 4.4.5, this robot-future-conditional prediction

lets the robot reason about the effect of different nominal control candidates on human behavior,

prior to RSSAC perturbation on the best one. We compared the performance of the unconditional

prediction (as in the simulation study) to the conditional prediction for both risk-neutral and risk-

sensitive cases.

The results of the experiment are presented in Fig. 4.7. For each setting we performed 5 runs.

The robot with unconditional predictions was either too conservative (θ = 1.0) or unsafe (θ = 0.0).

Using conditional predictions improved overall performance; we did not observe a single collision

with conditional predictions. This supports our hypothesis that the robot-future-conditional pre-

diction facilitates appropriate human-robot interaction. Of the four cases tested, using conditional

predictions with θ = 1.0 achieved the best average performance, as well as the smallest variance in

minimum robot-human distance.

4.6 Conclusions

This chapter focuses on transition uncertainty in safety-critical systems, in particular systems that

involve other decision-making agents such as humans. We propose a novel online algorithm for

risk-sensitive optimal control to take account of risk associated with uncertain human motion. Our

theoretical contribution is a derivation of the mode insertion gradient for the entropic risk measure

under nonlinear, non-Gaussian dynamics. This theory leads to an efficient and high-performing

implementation of RSSAC, an extension of Stochastic SAC that is more suited for safety-critical


applications. Crucially, RSSAC does not require knowledge of the analytical form of the distribu-

tion. To demonstrate its effectiveness, we combine RSSAC with Trajectron++ and apply it to a

crowd-robot interaction problem with dynamic collision avoidance. As a major practical contribu-

tion, we achieve safe and efficient robot navigation among many mutually interacting humans by the

probabilistic, robot-future-conditional prediction of Trajectron++ and the risk-sensitive optimiza-

tion of RSSAC. Furthermore, the risk sensitivity parameter is found to play a key role in determining

the type of interaction behavior, such as yielding. In future work, we plan to apply the RSSAC-

Trajectron++ framework to vehicle dynamics and scenes with heterogeneous agents. We are also

interested in automatically adapting the risk sensitivity parameter so that any specifically-desired

robot behavior emerges.


Chapter 5

Distributionally Robust Planning

under Model Uncertainty

In the previous chapters, we presented online planning and control methods that address two types

of uncertainty—state uncertainty and transition uncertainty—defined in Chapter 1 (Section 1.1).

It remains to discuss the one remaining type of uncertainty, namely model uncertainty. Model

uncertainty arises due to imperfect knowledge of complex dynamical systems, which affects reliability

of robot autonomy. In stochastic environments that we consider in the thesis, this limited knowledge

leads to inaccurate characterization of the underlying probability distribution that describes, for

example, future motion of other interacting agents. In practice, such a mismatch between the model

employed by the robot and the ground-truth most likely exists, and can have serious consequences for

safety-critical systems [29]. This chapter presents a control algorithm that is capable of handling the

distributional mismatch. Specifically, we propose a novel nonlinear MPC for distributionally robust

control, which plans a locally optimal feedback policy against a worst-case distribution within a

given set of plausible distributions. This set is defined by an offline-estimated KL divergence bound

between a user-specified Gaussian distribution and the unknown ground-truth model. Leveraging

mathematical equivalence between distributionally robust control and risk-sensitive optimal control,

we propose an efficient local optimization method to tractably compute the control policy. This

equivalence also indicates that our method serves as an algorithm to dynamically adjust the risk-

sensitivity level online for risk-sensitive control. The benefits of the distributional robustness as

well as the automatic risk-sensitivity adjustment are demonstrated in a dynamic collision avoidance

scenario where the predictive distribution of human motion is erroneous. The materials presented

in this chapter are also reported in [109].


5.1 Introduction

Proper modeling of a stochastic system of interest is a key step towards successful control and deci-

sion making under uncertainty. In particular, accurate characterization of the underlying probability

distribution is crucial, as it encodes how we expect the system to behave unexpectedly over time.

However, such a modeling process can pose significant challenges in real-world problems. On the

one hand, we may have only limited knowledge of the underlying system, which would force us to

use an erroneous model. On the other hand, even if we can perfectly model a complicated stochastic

phenomenon, such as a complex multi-modal distribution, it may still not be appropriate for the

sake of real-time control or planning. Indeed, many model-based stochastic control methods require

a Gaussian noise assumption, and many of the others need computationally intensive sampling. In

Chapter 4, we circumvented this computational issue by proposing the RSSAC algorithm, which can

efficiently process samples drawn from data-driven, probabilistic prediction models. Although the

method has been shown effective, recent work [29] also suggests that even state-of-the-art methods

for probabilistic prediction are not accurate enough to be deployed on safety-critical systems. There-

fore, we must design our planner/controller so that the inevitable model uncertainty is considered

appropriately.

The present work addresses this problem via distributionally robust control, wherein a poten-

tial distributional mismatch is considered between a baseline Gaussian process noise and the true,

unknown model that lies within a certain Kullback-Leibler (KL) divergence bound. The use of

the Gaussian distribution is advantageous to retain computational tractability without the need for

sampling in the state space. Our contribution is a novel model predictive control (MPC) method

for nonlinear, non-Gaussian systems with non-convex costs. This controller would be useful, for

example, to safely navigate a robot among human pedestrians while the stochastic transition model

for humans is not perfect.

It is important to note that our contribution is built on the mathematical equivalence between

distributionally robust control and risk-sensitive optimal control [117]. As reviewed in Chapter 4,

risk-sensitive optimal control seeks to optimize the following entropic risk measure [97]:

R_{p,θ}(J) ≜ (1/θ) log E_p[exp(θJ)],   (5.1)

where p is a probability distribution characterizing any source of randomness in the system, θ > 0

is a user-defined scalar parameter called the risk-sensitivity parameter, and J is an optimal control

cost. The risk-sensitivity parameter θ determines a relative weight between the expected cost and

other higher-order moments such as the variance [174]. Loosely speaking, the larger θ becomes, the

more the objective cares about the variance and is thus more risk-sensitive.

Our distributionally robust control algorithm can alternatively be viewed as an algorithm for

Page 93: ONLINE TRAJECTORY PLANNING ALGORITHMS FOR ROBOTIC …

CHAPTER 5. DISTRIBUTIONALLY ROBUST PLANNING 78

Figure 5.1: Model-based stochastic control methods often require a Gaussian noise assumption, such as the one on the left that represents process noise in pedestrian motion under a collision avoidance scenario (see Section 5.5). However, the true stochastic model can be highly multi-modal and better captured by a more complex distribution as shown on the right, which we may not exactly know. The proposed MPC effectively handles such a model mismatch without the knowledge of the true distribution, except for a bound on the KL divergence between the two.

automatic online tuning of the risk-sensitivity parameter in applying risk-sensitive control. Risk-

sensitive optimal control has been shown effective and successful in many robotics applications

[99, 100, 11, 108]. However, in prior work (as well as in Chapter 4) the user has to specify a fixed

risk-sensitivity parameter offline. This would require an extensive trial and error process until a

desired robot behavior is observed. Furthermore, a risk-sensitivity parameter that works in a certain

state can be infeasible in another state, as we will see in Section 5.4. Ideally, the risk-sensitivity

should be adapted online depending on the situation to obtain a specifically desired robot behavior

[100, 108], yet this is highly nontrivial as no simple general relationship is known between the risk-

sensitivity parameter and the performance of the robot. Our algorithm addresses this challenge

as a secondary contribution. Due to the fundamental equivalence between distributionally robust

control and risk-sensitive control, it serves as a nonlinear risk-sensitive control that can dynamically

adjust the risk-sensitivity parameter depending on the state of the robot as well as the surrounding

environment.

The rest of the chapter is organized as follows. Section 5.2 reviews the related work in controls

and robotics literature. Section 5.3 summarizes the theoretical results originally presented in [117]

that connect distributionally robust control to risk-sensitive optimal control. Section 5.4 develops

this theory into an algorithm that provides a locally optimal solution for general nonlinear systems

with non-convex cost functions, which is a novel contribution of our work. In Section 5.5, we test our

method in a collision avoidance scenario wherein the predictive distribution of pedestrian motion is


erroneous. We further show its benefits as a risk-sensitive optimal controller that can automatically

adjust its risk-sensitivity parameter in this section. The chapter concludes in Section 5.6 with

potential future research directions.

5.2 Related Work

5.2.1 Distributional Robustness and Risk-Sensitivity

Distributionally robust control seeks to optimize control actions against a worst-case distribution

within a given set of probability distributions, often called the ambiguity set [162, 135]. There

exist various formulations to take account of distributional robustness in optimal control. Some

works are concerned with minimizing the worst-case expectation of a cost objective [135, 143], while

others enforce risk-based or chance constraint satisfaction under a worst-case distribution [59, 162].

The present work belongs to the former class. Existing methods also differ in the formulation of

the ambiguity set. Moment-based ambiguity sets require knowledge of moments of the ground-truth

distribution up to a finite order [162, 135], which is often overly conservative [59]. Statistical distance-

based ambiguity sets are also gaining attention. The authors of [59] use a Wasserstein metric to define

the ambiguity set for motion planning with collision avoidance, but their MPC formulation is not

suited for nonlinear systems. The χ²-divergence and more general φ-divergences (to which the KL divergence belongs) are employed in [143], similar to the present work. However, the ambiguity set considered

in [143] is restricted to categorical distributions, while our work requires no assumption on the class

of the ground-truth distributions. Furthermore, we make use of risk-sensitive optimal control to

obtain planned robot trajectories with feedback, unlike sampling in their implementation.

Optimization of the entropic risk measure has been an active research topic in economics and

controls literature since 1970s [70, 173, 174, 47]. The concept of risk-sensitive optimal control has

been successfully applied to robotics in various domains, including haptic assistance [99, 100], model-

based reinforcement learning (RL) [11], and safe robot navigation [108, 169], to name a few. In all

these works, the risk-sensitivity parameter is introduced as a user-specified constant, and is found

to significantly affect the behavior of the robot. For instance, our study of safe robot navigation

in Chapter 4 reveals that a robot with higher risk-sensitivity tends to yield more to oncoming

human pedestrians. However, how to find a desirable risk-sensitivity parameter still remains an

open research question; in the robot navigation problem, the robot simply freezes if it is too risk-

sensitive when the scene is crowded. As the authors of [100] point out, the robot should adapt its

risk-sensitivity level depending on the situation, yet there still does not exist an effective algorithmic

framework to automate it due to the issues discussed in Section 5.1. In this work, we provide such

an algorithm for nonlinear, non-Gaussian stochastic systems. As mentioned earlier, our approach is

built on previously-established theoretical results that link risk-sensitive and distributionally robust

control [117].


5.2.2 Approximate Methods for Optimal Feedback Control

The theory of optimal control lets us derive an optimal feedback control law via dynamic pro-

gramming (DP) [13]. For linear systems with additive Gaussian white noise and quadratic cost

functions, the exact DP solution is tractable and is known as Linear-Quadratic-Gaussian (LQG) [7]

or Linear-Exponential-Quadratic-Gaussian (LEQG) [173]. They are different in that LQG optimizes

the expected cost while LEQG optimizes the entropic risk measure, although both DP recursions

are quite similar.

However, solving general optimal control problems for nonlinear systems remains a challenge

due to lack of analytical tractability. Hence, approximate local optimization methods have been

developed, including Differential Dynamic Programming (DDP) [71], iterative Linear-Quadratic

Regulator (iLQR) [91], and iterative Linear-Quadratic-Gaussian (iLQG) [156, 151]. While both DDP

and iLQR are designed for deterministic systems with quadratic cost functions, iLQG can locally

optimize the expected cost objective for Gaussian stochastic systems with non-convex cost functions.

Similarly, the iterative Linear-Exponential-Quadratic-Gaussian (iLEQG) has been recently proposed

to locally optimize the entropic risk for Gaussian systems with non-convex costs [48, 169, 131]. Note

however that they are not designed to be robust to model mismatches that we consider in this

chapter. In fact, it is known that even LQG does not possess guaranteed robustness [36].

5.3 Problem Statement

5.3.1 Distributionally Robust Optimal Control

Consider the following stochastic nonlinear system:

xk+1 = f(xk, uk) + g(xk, uk)wk, (5.2)

where $x_k \in \mathbb{R}^n$ denotes the state, $u_k \in \mathbb{R}^m$ the control, and $w_k \in \mathbb{R}^r$ the noise input to the system at time $k$. For some finite horizon $N$, let $w_{0:N} \triangleq (w_0, \dots, w_N)$ denote the joint noise vector with probability distribution $q(w_{0:N})$. This distribution is assumed to be a known Gaussian white noise process, i.e., $w_i$ is independent of $w_j$ for all $i \neq j$, and we call (5.2) the reference system. Ideally, we would like the model distribution $q$ to perfectly characterize the noise in the dynamical system. However, in reality the noise may come from a different, more complex distribution which we may not know exactly. Let $w_{0:N} \triangleq (w_0, \dots, w_N)$ denote a perturbed noise vector that is distributed according to $p(w_{0:N})$. We define the following perturbed system that characterizes the true but unknown dynamics:

$x_{k+1} = f(x_k, u_k) + g(x_k, u_k)w_k$.   (5.3)


Note that we make no assumptions on the Gaussianity or whiteness of $p$. One could also attribute it to potentially unmodeled dynamics. The true, unknown probability distribution $p$ is contained in the set $\mathcal{P}$ of all probability distributions on the support $\mathbb{R}^{r(N+1)}$. We assume that $p$ is not "too different" from $q$. This is encoded as the following constraint on the KL divergence between $p$ and $q$:

$D_{\mathrm{KL}}(p \,\|\, q) \leq d$,   (5.4)

where $D_{\mathrm{KL}}(\cdot \,\|\, \cdot)$ is the KL divergence and $d > 0$ is a given constant. Note that $D_{\mathrm{KL}}(p \,\|\, q) \geq 0$ always holds, with equality if and only if $p \equiv q$. The set of all possible probability distributions $p \in \mathcal{P}$ satisfying (5.4) is denoted by $\Xi$, which we define as our ambiguity set. This set is a convex subset of $\mathcal{P}$ for a fixed $q$ (Lemma 1.4.3, [40]).
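Although $p$ is in general non-Gaussian, it is instructive to note the closed form this constraint takes in the special case where $p$ is also Gaussian: for $p = \mathcal{N}(\mu_p, \Sigma_p)$ and $q = \mathcal{N}(\mu_q, \Sigma_q)$ on $\mathbb{R}^{r(N+1)}$,

$D_{\mathrm{KL}}(p \,\|\, q) = \frac{1}{2}\left(\operatorname{tr}\left(\Sigma_q^{-1}\Sigma_p\right) + (\mu_q - \mu_p)^\top \Sigma_q^{-1}(\mu_q - \mu_p) - r(N+1) + \log\frac{\det\Sigma_q}{\det\Sigma_p}\right),

so (5.4) simultaneously limits how far the mean, the covariance, and the overall shape of $p$ may drift from the model $q$.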

We are interested in controlling the perturbed system (5.3) with a state feedback controller of

the form uk = K(k, xk). The operator K(k, ·) defines a mapping from Rn into Rm. The class of all

such controllers is denoted Λ.

The cost function considered in this chapter is given by

$J(x_{0:N+1}, u_{0:N}) \triangleq \sum_{k=0}^{N} c(k, x_k, u_k) + h(x_{N+1})$,   (5.5)

where $c$ is the stage cost function and $h$ is the terminal cost. We assume that the above objective satisfies the following non-negativity assumption.

Assumption 5.1 (Assumption 3.1, [117]). The functions $h(\cdot)$ and $c(k, \cdot, \cdot)$ satisfy $h(x) \geq 0$ and $c(k, x, u) \geq 0$ for all $k \in \{0, \dots, N\}$, $x \in \mathbb{R}^n$, and $u \in \mathbb{R}^m$.

Under the dynamics model (5.3), the cost model (5.5), and the KL divergence constraint (5.4) on $p$, we are interested in finding an admissible controller $K \in \Lambda$ that minimizes the worst-case expected value of the cost objective (5.5). In other words, we are concerned with the following distributionally robust optimal control problem:

$\inf_{K \in \Lambda} \sup_{p \in \Xi} \mathbb{E}_p\left[J(x_{0:N+1}, u_{0:N})\right]$,   (5.6)

where $\mathbb{E}_p[\cdot]$ indicates that the expectation is taken with respect to the true, unknown distribution $p$. In this formulation, the robustness arises from the ability of the controller to plan against a worst-case distribution $p$ in the ambiguity set $\Xi$.

Remark 5.1. If the KL divergence bound $d$ is zero, then $p \equiv q$ is necessary. In this degenerate case, (5.6) reduces to the standard stochastic optimal control problem:

$\inf_{K \in \Lambda} \mathbb{E}_q\left[J(x_{0:N+1}, u_{0:N})\right]$.   (5.7)

5.3.2 Equivalent Risk-Sensitive Optimal Control

Unfortunately, the distributionally robust optimal control problem (5.6) is intractable as it involves

maximization with respect to a probability distribution p. To circumvent this, Petersen et al. [117]

prove that problem (5.6) is equivalent to a bilevel optimization problem involving risk-sensitive

optimal control with respect to the model distribution q. We refer the reader to [117] for the

derivation and only re-state the main results in this section for self-containedness. Before doing so,

we impose an additional assumption on the worst-case expected cost.

Assumption 5.2 (Assumption 3.2, [117]). For any admissible controller $K \in \Lambda$, the resulting closed-loop system satisfies

$\sup_{p \in \mathcal{P}} \mathbb{E}_p\left[J(x_{0:N+1}, u_{0:N})\right] = \infty$.   (5.8)

This assumption states that, without the KL divergence constraint, some adversarially-chosen noise could make the expected cost objective arbitrarily large in the worst case. It amounts to a controllability-type assumption with respect to the noise input and an observability-type assumption with respect to the cost objective [117].

Under Assumptions 5.1 and 5.2, the following theorem holds.

Theorem 5.1. Consider the stochastic systems (5.2), (5.3) with the KL divergence constraint (5.4) and the cost model (5.5). Under Assumptions 5.1 and 5.2, the following equivalence holds for the distributionally robust optimal control problem (5.6):

$\inf_{K \in \Lambda} \sup_{p \in \Xi} \mathbb{E}_p\left[J(x_{0:N+1}, u_{0:N})\right] = \inf_{\tau \in \Gamma} \inf_{K \in \Lambda}\ \tau \log \mathbb{E}_q\left[\exp\left(\frac{J(x_{0:N+1}, u_{0:N})}{\tau}\right)\right] + \tau d$,   (5.9)

provided that the set

$\Gamma \triangleq \left\{\tau > 0 : \inf_{K \in \Lambda} \tau \log \mathbb{E}_q\left[\exp(J/\tau)\right] \text{ is finite}\right\}$   (5.10)

is non-empty.

Proof. See Theorems 3.1 and 3.2 in [117].

Remark 5.2. Notice that the first term on the right-hand side of (5.9) is the entropic risk measure $\mathcal{R}_{q, \frac{1}{\tau}}(J)$, where the risk is computed with respect to the model distribution $q$ and $\tau > 0$ serves as the inverse of the risk-sensitivity parameter. Rewriting the equation in terms of the risk-sensitivity parameter $\theta = 1/\tau > 0$, we see that the right-hand side of (5.9) is equivalent to

$\inf_{\theta \in \bar{\Gamma}} \left( \inf_{K \in \Lambda} \mathcal{R}_{q,\theta}\left(J(x_{0:N+1}, u_{0:N})\right) + \frac{d}{\theta} \right)$,   (5.11)

where $\bar{\Gamma} \triangleq \{\theta > 0 : \inf_{K \in \Lambda} \mathcal{R}_{q,\theta}(J) \text{ is finite}\}$. The non-emptiness of $\bar{\Gamma}$ (and equivalently, of $\Gamma$) is satisfied if there exists some non-zero risk-sensitivity $\theta$ that gives a finite entropic risk value. This is almost always satisfied in practical situations where risk-sensitive optimal control can be applied, as otherwise the problem would be ill-formed. Theorem 5.1 shows that the original distributionally robust optimal control problem (5.6) is mathematically equivalent to a bilevel optimization problem (5.11) involving risk-sensitive optimal control. Note that the new problem does not involve any optimization with respect to the unknown distribution $p$.
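As a reminder, the entropic risk measure appearing above is defined by $\mathcal{R}_{q,\theta}(J) \triangleq \frac{1}{\theta} \log \mathbb{E}_q\left[\exp(\theta J)\right]$, consistent with the first term of (5.9) under $\theta = 1/\tau$. Its standard small-$\theta$ expansion,

$\mathcal{R}_{q,\theta}(J) = \mathbb{E}_q[J] + \frac{\theta}{2}\operatorname{Var}_q[J] + O(\theta^2),$

makes explicit why a larger $\theta$ penalizes the variance of the cost more heavily, i.e., yields more risk-sensitive behavior.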

5.4 RAT iLQR Algorithm

Even though the mathematical equivalence shown in [117] and summarized in Section 5.3.2 is general,

it does not immediately lead to a tractable method to efficiently solve (5.11) for general nonlinear

systems. There are two major challenges to be addressed. First, exact optimization of the entropic

risk with a state feedback control law is intractable, except for linear systems with quadratic costs.

Second, the optimal risk-sensitivity parameter has to be searched efficiently over the feasible space Γ, which is not only unknown but also varies depending on the initial state x0. A novel contribution

of this chapter is a tractable algorithm that approximately solves both of the problems for general

nonlinear systems with non-convex cost functions. In what follows, we detail how we solve both the

inner and the outer loop of (5.11) to develop a distributionally-robust, risk-sensitive MPC.

5.4.1 Iterative Linear-Exponential-Quadratic-Gaussian

Let us first consider the inner minimization of (5.11):

$\inf_{K \in \Lambda} \mathcal{R}_{q,\theta}\left(J(x_{0:N+1}, u_{0:N})\right)$,   (5.12)

where we omitted the extra term $d/\theta$ as it is constant with respect to the controller $K$. This amounts

to solving a risk-sensitive optimal control problem for a nonlinear Gaussian system. Recently, a

computationally-efficient, local optimization method called iterative Linear-Exponential-Quadratic-

Gaussian (iLEQG) has been proposed for both continuous-time systems [48] and the discrete-time

counterpart [131, 169]. Both versions locally optimize the entropic risk measure with respect to a


receding horizon, affine feedback control law for general nonlinear systems with non-convex costs.

We adopt a variant of the discrete-time iLEQG algorithm [169] to obtain a locally optimal solution

to (5.12). In what follows, we assume that the noise coefficient function g(xk, uk) in (5.2) is the

identity mapping for simplicity, but it is straightforward to handle nonlinear functions in a similar

manner as discussed in [156]. The algorithm starts by applying a given nominal control sequence $l_{0:N}$ to the noiseless dynamics to obtain the corresponding nominal state trajectory $\bar{x}_{0:N+1}$. In each iteration, the algorithm maintains and updates a locally optimal controller $K$ of the form

$K(k, x_k) = L_k(x_k - \bar{x}_k) + l_k$,   (5.13)

where $L_k \in \mathbb{R}^{m \times n}$ denotes the feedback gain matrix. The $i$-th iteration of our iLEQG implementation consists of the following four steps:

1. Local Approximation. Given the nominal trajectory $\{l^{(i)}_{0:N}, \bar{x}^{(i)}_{0:N+1}\}$, we compute the following linear approximation of the dynamics as well as the quadratic approximation of the cost functions:

$A_k = D_x f(\bar{x}^{(i)}_k, l^{(i)}_k)$   (5.14)
$B_k = D_u f(\bar{x}^{(i)}_k, l^{(i)}_k)$   (5.15)
$q_k = c(k, \bar{x}^{(i)}_k, l^{(i)}_k)$   (5.16)
$\mathbf{q}_k = D_x c(k, \bar{x}^{(i)}_k, l^{(i)}_k)$   (5.17)
$Q_k = D_{xx} c(k, \bar{x}^{(i)}_k, l^{(i)}_k)$   (5.18)
$r_k = D_u c(k, \bar{x}^{(i)}_k, l^{(i)}_k)$   (5.19)
$R_k = D_{uu} c(k, \bar{x}^{(i)}_k, l^{(i)}_k)$   (5.20)
$P_k = D_{ux} c(k, \bar{x}^{(i)}_k, l^{(i)}_k)$   (5.21)

for $k = 0$ to $N$, where $D$ is the differentiation operator. We also let $q_{N+1} = h(\bar{x}^{(i)}_{N+1})$, $\mathbf{q}_{N+1} = D_x h(\bar{x}^{(i)}_{N+1})$, and $Q_{N+1} = D_{xx} h(\bar{x}^{(i)}_{N+1})$.

2. Backward Pass. We perform approximate DP using the current feedback gain matrices $L^{(i)}_{0:N}$ as well as the approximated model obtained in the previous step. Suppose that the noise vector $w_k$ is Gaussian-distributed according to $\mathcal{N}(0, W_k)$ with $W_k \succ 0$. Let $s_{N+1} \triangleq q_{N+1}$, $\mathbf{s}_{N+1} \triangleq \mathbf{q}_{N+1}$, and $S_{N+1} \triangleq Q_{N+1}$. Given these terminal conditions, we recursively compute the following quantities:

$M_k = W_k^{-1} - \theta S_{k+1}$   (5.22)
$g_k = r_k + B_k^\top (I + \theta S_{k+1} M_k^{-1})\, \mathbf{s}_{k+1}$   (5.23)
$G_k = P_k + B_k^\top (I + \theta S_{k+1} M_k^{-1})\, S_{k+1} A_k$   (5.24)
$H_k = R_k + B_k^\top (I + \theta S_{k+1} M_k^{-1})\, S_{k+1} B_k$,   (5.25)

and

$s_k = q_k + s_{k+1} - \frac{1}{2\theta} \log\det(I - \theta W_k S_{k+1}) + \frac{\theta}{2} \mathbf{s}_{k+1}^\top M_k^{-1} \mathbf{s}_{k+1} + \frac{1}{2} l^{(i)\top}_k H_k l^{(i)}_k + l^{(i)\top}_k g_k$   (5.26)
$\mathbf{s}_k = \mathbf{q}_k + A_k^\top (I + \theta S_{k+1} M_k^{-1})\, \mathbf{s}_{k+1} + L^{(i)\top}_k H_k l^{(i)}_k + L^{(i)\top}_k g_k + G_k^\top l^{(i)}_k$   (5.27)
$S_k = Q_k + A_k^\top (I + \theta S_{k+1} M_k^{-1})\, S_{k+1} A_k + L^{(i)\top}_k H_k L^{(i)}_k + L^{(i)\top}_k G_k + G_k^\top L^{(i)}_k$,   (5.28)

from $k = N$ down to $0$. Note that $M_k \succ 0$ is necessary so that it is invertible, which may not hold if $\theta$ is too large. This is called "neurotic breakdown," where the optimizer is so pessimistic that the cost-to-go approximation becomes infinity [174]. Otherwise, the approximated cost-to-go for this optimal control (under the controller $\{L^{(i)}_{0:N}, l^{(i)}_{0:N}\}$) is given by $s_0$. (A code sketch of this recursion is given after this procedure.)

3. Regularization and Control Computation. Having derived the DP solution, we compute new control gains $L^{(i+1)}_{0:N}$ and offset updates $dl_{0:N}$ as follows:

$L^{(i+1)}_k = -(H_k + \mu I)^{-1} G_k$   (5.29)
$dl_k = -(H_k + \mu I)^{-1} g_k$,   (5.30)

where $\mu \geq 0$ is a regularization parameter to prevent $(H_k + \mu I)$ from having negative eigenvalues. We adaptively change $\mu$ across multiple iterations as suggested in [151], so the algorithm enjoys fast convergence near a local minimum while ensuring the positive-definiteness of $(H_k + \mu I)$ at all times.

4. Line Search for Ensuring Convergence. It is known that the update could lead to increased cost or even divergence if a new trajectory strays too far from the region where the local approximation is valid [151]. Thus, the new nominal control trajectory $l^{(i+1)}_{0:N}$ is computed by backtracking line search with line search parameter $\epsilon$. Initially, $\epsilon = 1$ and we derive a new candidate nominal trajectory as follows:

$\hat{l}_k = L^{(i+1)}_k(\hat{x}_k - \bar{x}^{(i)}_k) + l^{(i)}_k + \epsilon\, dl_k$   (5.31)
$\hat{x}_{k+1} = f(\hat{x}_k, \hat{l}_k)$.   (5.32)

If this candidate trajectory $\{\hat{l}_{0:N}, \hat{x}_{0:N+1}\}$ results in a lower cost-to-go than the current nominal trajectory, then the candidate trajectory is accepted and returned as $\{l^{(i+1)}_{0:N}, \bar{x}^{(i+1)}_{0:N+1}\}$. Otherwise, the trajectory is rejected and re-derived with $\epsilon \leftarrow \epsilon/2$ until it is accepted. More details on this line search can be found in [161].

The above procedure is iterated until the nominal control $l_k$ does not change beyond some threshold in a norm. Once converged, the algorithm returns the nominal trajectory $\{l_{0:N}, \bar{x}_{0:N+1}\}$ as well as the feedback gains $L_{0:N}$ and the approximate cost-to-go $s_0$.
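To make the recursion above concrete, the following is a minimal Julia sketch of the backward pass (5.22)–(5.28), including the neurotic breakdown check. It is an illustration under assumed container names (`approx`, `W`, `L`, `l`, `term`), not the thesis implementation; 1-based indices stand in for k = 0, ..., N.

```julia
using LinearAlgebra

# Sketch of the iLEQG backward pass (5.22)-(5.28). `approx[k]` is a named
# tuple (A, B, q, qx, Q, r, R, P) from Step 1, `W[k]` the noise covariance,
# `L[k]`, `l[k]` the current gains and nominal controls, and `term` the
# terminal-cost quadratization (q, qx, Q). Returns the g, G, H terms and the
# approximate cost-to-go s0, or `nothing` on neurotic breakdown.
function backward_pass(approx, W, L, l, θ, term)
    s, sx, S = term.q, term.qx, term.Q           # s_{N+1}, 𝐬_{N+1}, S_{N+1}
    n = length(l)
    gs = Vector{Vector{Float64}}(undef, n)
    Gs = Vector{Matrix{Float64}}(undef, n)
    Hs = Vector{Matrix{Float64}}(undef, n)
    for k in n:-1:1
        (; A, B, q, qx, Q, r, R, P) = approx[k]
        M = inv(W[k]) - θ * S                     # (5.22)
        isposdef(Symmetric(M)) || return nothing  # neurotic breakdown
        T = I + θ * S * inv(M)                    # factor shared by (5.23)-(5.28)
        g = r + B' * T * sx                       # (5.23)
        G = P + B' * T * S * A                    # (5.24)
        H = R + B' * T * S * B                    # (5.25)
        s = q + s - logdet(I - θ * W[k] * S) / (2θ) +
            (θ / 2) * sx' * inv(M) * sx +
            l[k]' * H * l[k] / 2 + l[k]' * g      # (5.26)
        sx = qx + A' * T * sx + L[k]' * H * l[k] +
             L[k]' * g + G' * l[k]                # (5.27)
        S = Q + A' * T * S * A + L[k]' * H * L[k] +
            L[k]' * G + G' * L[k]                 # (5.28)
        gs[k], Gs[k], Hs[k] = g, G, H
    end
    return (gs = gs, Gs = Gs, Hs = Hs, s0 = s)
end
```

The returned `gs`, `Gs`, `Hs` feed directly into the regularized gain updates (5.29)–(5.30), and `s0` is the cost-to-go used both by the line search and by the outer-loop optimization below.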

5.4.2 Cross-Entropy Method

Having implemented the iLEQG algorithm as a local approximation method¹ for the inner-loop

optimization of (5.11), it remains to solve the outer-loop optimization for the optimal risk-sensitivity

parameter θ∗. This is a one-dimensional optimization problem in which the function evaluation is

done by solving the corresponding risk-sensitive optimal control problem (5.12). One might consider using a

simple grid search over a fixed interval specified a priori. Unfortunately, this may not work since the

feasible space Γ is dynamically changing due to re-planning; we empirically observed that the size

of Γ varied up to an order of magnitude in the collision avoidance scenario in Section 5.5. In this

work, we choose to adapt the cross entropy method [132, 80] to derive an approximately optimal

value for θ∗. This method is favorable for online optimization due to its any-time nature and high

parallelizability of the Monte Carlo sampling. As a byproduct, it is also effective in approximately

finding the maximum feasible θ ∈ Γ within a few iterations, which is detailed later in this section.

We note however that it is possible to use other methods for the outer-loop optimization as well.

The cross entropy method is a stochastic method that maintains an explicit probability distribution

over the design space. At each step, a set of ms Monte Carlo samples is drawn from the distribution,

out of which a subset of me “elite samples” that achieve the best performance is retained. The

parameters of the distribution is then updated according to the maximum likelihood estimate on

the elite samples. The algorithm stops after a desired number of steps M .

In our implementation we model the distribution² as a univariate Gaussian $\mathcal{N}(\mu, \sigma^2)$. A remaining

issue is that the iLEQG may return the cost-to-go of infinity if a sampled θ is too large, due to

neurotic breakdown. Since our search space is limited to Γ where θ yields a finite cost-to-go, we

have to ensure that each iteration has enough samples in Γ.

To address this problem, we augment the cross entropy method with rejection and re-sampling.

Out of the ms samples drawn from the univariate Gaussian, we first discard all non-positive samples.

¹ Although the solutions obtained by iLEQG are local optima and differ for varied risk-sensitivity θ, we have empirically observed that they do not change drastically as a function of θ. This indicates that iLEQG is a sensible choice for the inner-loop optimization problem, since the solutions do not hop around local optima that are too different as we vary θ.

² Note that this distribution is defined in the space of θ and has nothing to do with p or q that define stochastic noise in the dynamical system.


For each of the remaining samples, we evaluate the objective (5.11) by a call to iLEQG, and then

count the number of samples that obtained a finite cost-to-go. Let mv be the number of such valid

samples. If mv ≥ max(me,ms/2), we proceed and fit the distribution. Otherwise, we redo the

sampling procedure as there are not sufficiently many valid samples to choose the elites from.

In practice, re-sampling is not likely to occur after the first iteration of the cross entropy method.

At the same time, we empirically found that the first iteration has a risk of re-sampling multi-

ple times, hence degrading the efficiency. We therefore also perform an adaptive initialization of

the Gaussian parameters µinit and σinit in the first iteration as follows. If the first iteration with

N(µinit, σ²init) results in re-sampling, we not only re-sample but also halve µinit and σinit. If

all of the ms samples are valid, on the other hand, we accept them but double µinit and σinit, since

it implies that the initial set of samples is not wide-spread enough to cover the whole feasible set

Γ. The parameters µinit and σinit are stored internally in the cross entropy solver and carried over

to the next call to the algorithm. We have empirically found that this adaptive initialization is also

useful for approximately finding the maximum feasible θ, which we exploited in a comparative study

in Section 5.5.3.
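The following Julia sketch condenses this rejection-augmented cross entropy loop. The helper `objective` and all parameter defaults are assumptions for illustration (`objective(θ)` is taken to run iLEQG and return s₀ + d/θ, or infinity upon neurotic breakdown); the adaptive initialization described above is omitted for brevity.

```julia
using Distributions, Statistics

# Cross entropy over the scalar risk-sensitivity θ with rejection and
# re-sampling: keep drawing until enough samples yield a finite objective.
function cross_entropy(objective, μ, σ; M = 5, ms = 10, me = 3)
    for _ in 1:M
        θs = Float64[]; rs = Float64[]
        while true
            θs = filter(>(0), rand(Normal(μ, σ), ms))  # discard θ ≤ 0
            rs = map(objective, θs)
            # proceed only with sufficiently many valid (finite) samples
            count(isfinite, rs) ≥ max(me, ms ÷ 2) && break
        end
        elite = θs[sortperm(rs)[1:me]]   # me samples with lowest objective
        μ, σ = mean(elite), std(elite)   # maximum likelihood Gaussian fit
    end
    return μ                             # θ* is taken to be the final mean
end
```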

5.4.3 RAT iLQR as MPC

We name the proposed bilevel optimization algorithm RAT iLQR. The pseudo-code is given in

Algorithm 5.1. At run time, it is executed as an MPC in a receding-horizon fashion; the control

is re-computed after executing the first control input u0 = l0 and transitioning to a new state. A

previously-computed control trajectory l0:N is reused for the initial nominal control trajectory at

the next time step to warm-start the computation.

5.5 Results

This section presents qualitative and quantitative results of the simulation study that we conducted

to show the effectiveness of the RAT iLQR algorithm. We provide the problem setup as well as

implementation details in Section 5.5.1. The goals of this study are two-fold. First, we demonstrate

that the robot controlled by RAT iLQR can successfully accomplish its task under the presence of

stochastic disturbance, without access to the ground-truth distribution but only with knowledge of the KL

divergence bound. This is presented in Section 5.5.2 with comparisons to (non-robust) iLQG and a

model-based MPC with sampling from the true generative model. Second, Section 5.5.3 focuses on

the nonlinear risk-sensitive optimal control aspect of RAT iLQR to show its value as an algorithm

that can optimally adjust the risk-sensitivity parameter online, which itself is a novel contribution.


Algorithm 5.1 RAT iLQR Algorithm

INPUT: Initial state x0, controls l0:N, L0:N (can be zero), KL divergence bound d
OUTPUT: New nominal trajectory {l0:N, x̄0:N+1}, control gains L0:N, risk-sensitivity parameter θ*

1: Compute initial nominal trajectory x̄0:N+1 using l0:N
2: i ← 1
3: while i ≤ M do   /* outer-loop optimization */
4:   while True do
5:     if i = 1 then
6:       θ_sampled ← drawSamples(ms, µinit, σinit)
7:     else
8:       θ_sampled ← drawSamples(ms, µ, σ)
9:     end if
10:    array r   ▷ Empty array of size ms
11:    for j ← 1 : ms do   /* inner-loop optimization */
12:      Solve iLEQG with l0:N, x̄0:N+1, θ_sampled[j]
13:      Obtain approximate cost-to-go s0
14:      r[j] ← s0 + d/θ_sampled[j]
15:    end for
16:    mv ← countValidSamples(θ_sampled, r)
17:    if i = 1 and mv < max(me, ms/2) then
18:      µinit ← µinit/2, σinit ← σinit/2
19:    else if i = 1 and mv = ms then
20:      µinit ← 2µinit, σinit ← 2σinit
21:      break
22:    else if mv ≥ max(me, ms/2) then
23:      break
24:    end if
25:  end while
26:  θ_elite ← selectElite(me, θ_sampled, r)
27:  µ, σ ← fitGaussian(θ_elite)
28:  i ← i + 1
29: end while
30: θ* ← µ
31: Solve iLEQG with l0:N, x̄0:N+1, θ*
32: return new {l0:N, x̄0:N+1} with L0:N and θ*


5.5.1 Problem Setup

We consider a dynamic collision avoidance problem where a unicycle robot has to avoid a pedestrian

on a collision course, as illustrated in Figure 5.2. Collision avoidance problems are often modeled by

stochastic optimal control in the autonomous systems literature [85] and the human-robot interaction

literature [137, 68], including our work presented in Chapter 4.

The state of the robot is defined by $(r_x, r_y, v, \theta) \in \mathbb{R}^4$, where $(r_x, r_y)$ [m] denotes the position, $v$ [m/s] the velocity, and $\theta$ [rad] the heading angle. The robot's control input is $u = (a, b) \in \mathbb{R}^2$, where $a$ [m/s²] is the acceleration and $b$ [rad/s] is the angular velocity. The pedestrian is modeled as a single integrator, whose position is given by $(p_x, p_y)$ [m]. We assume that the position of the pedestrian is known to the robot through onboard sensing. We also employ a constant nominal velocity model $(u_x, u_y) = (0.0, 1.0)$ [m/s] for the pedestrian. The joint system $x \in \mathbb{R}^6$ consists of the state of the robot and the position of the pedestrian. The dynamics are propagated by Euler integration with time interval $dt = 0.1$ [s] and additive noise $w_k$ to the joint state. The model distribution for $w_k$ is a zero-mean Gaussian $\mathcal{N}(0, W)$ with $W = \mathrm{diag}(1 \times 10^{-10},\ 1 \times 10^{-10},\ 1 \times 10^{-3},\ 1 \times 10^{-4},\ 2 \times 10^{-2},\ 2 \times 10^{-2}) \times dt$. This covariance matrix $W$ is chosen to encode our modeling assumption that the pedestrian motion is the main source of uncertainty in this joint system, and that the magnitude of the slip is almost negligible for the robot. The ground-truth distribution for the robot is the same Gaussian as in the model, but the pedestrian's distribution is a mixture of Gaussians that is independent of the robot's noise. Both the model and the true distributions for the pedestrian are illustrated in Figure 5.1. Gaussian mixtures are favored by many recent papers in machine learning to account for multi-modality in human decision making [22, 165, 134].
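For reference, a minimal Julia sketch of this joint dynamics under Euler integration (an illustrative reconstruction from the description above, not the thesis code) is:

```julia
# Joint state x = (rx, ry, v, θ, px, py); control u = (a, b); the pedestrian
# follows the constant nominal velocity (ux, uy) = (0.0, 1.0) m/s.
const dt = 0.1  # [s]

function f(x, u; uped = (0.0, 1.0))
    rx, ry, v, θ, px, py = x
    a, b = u
    [rx + v * cos(θ) * dt,      # unicycle position update
     ry + v * sin(θ) * dt,
     v + a * dt,                # velocity and heading updates
     θ + b * dt,
     px + uped[1] * dt,         # single-integrator pedestrian
     py + uped[2] * dt]
end

# One step of the reference system (5.2) with g ≡ I is then
# x_{k+1} = f(x_k, u_k) + w_k, with w_k ~ N(0, W) under the model q.
```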

RAT iLQR requires an upper-bound on the KL divergence between the model and the true

distribution. For the sake of this work we assume that there is a separate module that provides

an estimate. In this specific simulation study, we performed Monte Carlo integration with samples

drawn from the true distribution offline. During the simulation, however, we did not reveal any

information about the true distribution to RAT iLQR other than the estimated KL value³ of 32.02. This offline computation was possible due to our time-invariant assumption on the Gaussian mixture. To use more realistic data-driven prediction instead (such as Trajectron++ discussed in Chapter 4), it would be necessary to estimate the KL divergence online, since the predictive distribution may change over time as the human-robot interaction evolves. Even though RAT iLQR works with time-varying KL bounds owing to its MPC formulation, we limit our attention to a static KL bound in this thesis, as real-time computation of the KL divergence can be challenging. Note that efficient and accurate estimation of information measures (including the KL divergence) is still an active area of research in information theory and machine learning [12, 67], which is one of our future research directions.

³ One could test RAT iLQR under a distribution that has stronger multi-modality with a much larger KL bound than the one used in this simulation study, but it could introduce over-conservatism and lead to poor mean performance as the ambiguity set becomes too large. This is a common property of distributionally robust optimization [75].
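For concreteness, such an offline Monte Carlo estimate of $D_{\mathrm{KL}}(p\|q) = \mathbb{E}_p[\log p(w) - \log q(w)]$ can be sketched in Julia as follows; the mixture and model parameters below are hypothetical placeholders, not the distributions used in the study.

```julia
using Distributions, LinearAlgebra, Statistics

# Hypothetical 2D pedestrian-noise distributions (placeholders only).
p = MixtureModel(MvNormal[
        MvNormal([ 0.05, 0.0], Matrix(0.01I, 2, 2)),
        MvNormal([-0.05, 0.0], Matrix(0.01I, 2, 2))],
        [0.5, 0.5])                          # "true" Gaussian mixture
q = MvNormal(zeros(2), Matrix(0.02I, 2, 2))  # Gaussian model distribution

# Monte Carlo estimate of D_KL(p || q) with samples drawn from p.
function kl_monte_carlo(p, q; n = 100_000)
    ws = rand(p, n)   # columns are i.i.d. samples from p
    mean(logpdf(p, ws[:, j]) - logpdf(q, ws[:, j]) for j in 1:n)
end

d_hat = kl_monte_carlo(p, q)   # the KL bound handed to RAT iLQR
```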

The cost functions for this problem are given by

$c(k, x_k, u_k) = c_{\mathrm{track}}(k, r_{x,k}, r_{y,k}, v_k, \theta_k) + c_{\mathrm{coll}}(r_{x,k}, r_{y,k}, p_{x,k}, p_{y,k}) + c_{\mathrm{ctrl}}(a_k, b_k)$,   (5.33)

where $c_{\mathrm{track}}$ denotes a quadratic cost that penalizes the deviation from a given target robot trajectory, $c_{\mathrm{coll}}$ a collision penalty that incurs high cost when the robot is too close to the pedestrian, and $c_{\mathrm{ctrl}}$ a small quadratic cost on the control input. Mirroring the formulation in [169], we used the following collision cost:

$c_{\mathrm{coll}} = \frac{10}{\left(0.2\sqrt{(r_x - p_x)^2 + (r_y - p_y)^2} + 0.9\right)^{10}}$.   (5.34)
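In code, this collision penalty amounts to a small helper like the following Julia sketch:

```julia
# Collision penalty (5.34): large when the robot-pedestrian separation is
# small, decaying rapidly (tenth power) as the separation grows.
function c_coll(rx, ry, px, py)
    dist = sqrt((rx - px)^2 + (ry - py)^2)   # separation [m]
    return 10.0 / (0.2 * dist + 0.9)^10
end
```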

RAT iLQR was implemented in Julia and the Monte Carlo sampling of the cross entropy method

was distributed across multiple CPU cores. Our implementation with N = 19, M = 5, ms = 10,

and me = 3 yielded the average computation time of 0.27 s. This is 2.7 times slower than real time,

with dt = 0.1 s used to measure the real-time property. We expect to achieve improved efficiency

by further parameter tuning as well as more careful parallelization.

5.5.2 Comparison with Baseline MPC Algorithms

We compared the performance of RAT iLQR against two baseline MPC algorithms, iLQG [156] and

PETS [31]. iLQG corresponds to RAT iLQR with the KL bound of d = 0, i.e. no distributional

robustness is considered. In exchange, it is more computationally efficient than RAT iLQR, taking only 0.01 s. PETS is a state-of-the-art, model-based stochastic MPC algorithm with sampling and was originally proposed in a model-based reinforcement learning (RL) context. We chose PETS as our

baseline since it also relies on the cross entropy method for online control optimization and is not

limited to Gaussian distributions, similar to RAT iLQR. However, there are three major differences

between PETS and RAT iLQR. First, PETS performs the cross entropy optimization directly in the

high-dimensional control sequence space, which is far less sample-efficient than RAT iLQR, which uses the cross entropy method only to optimize the scalar risk-sensitivity parameter. Second, PETS

does not consider feedback during planning as opposed to RAT iLQR. Third, PETS requires access

to the exact ground-truth Gaussian mixture distribution to perform sampling, while RAT iLQR

only relies on the KL divergence bound and the Gaussian distribution that we have modeled. We

let PETS perform M = 5 iterations of the cross entropy optimization, each with 25 samples for the

control sequence coupled with 50 samples for the joint state trajectory prediction, which resulted in

the average computation time of 0.67 s.

We performed 30 runs of the simulation for each algorithm, with randomized pedestrian start

positions and stochastic transitions. To measure the performance, we computed the minimum


Figure 5.2: A unicycle robot avoiding collision with a road-crossing pedestrian. (Left) When the KL bound is set to d = 0, RAT iLQR ignores this model error and reduces to iLQG. (Right) With the correct information on the KL divergence, RAT iLQR is aware of the prediction error and optimally adjusts the risk-sensitivity parameter for iLEQG, planning a trajectory that stays farther away from the pedestrian. The figures are overlaid with predictions drawn from the model distribution and closed-loop motion plans of the robot. Note that the prediction for the pedestrian is erroneous since the actual pedestrian motion follows the Gaussian mixture distribution. The model distribution and the true Gaussian mixture are both illustrated in Figure 5.1.

separation distance between the robot and the pedestrian in each run, assuming that both agents are circular with the same diameter. The histogram plots presented in Figure 5.3 clearly indicate the failure of iLQG and PETS, as well as RAT iLQR's capability to maintain a sufficient

safety margin for collision avoidance despite the distributional model mismatch. As summarized

in Table 5.1, RAT iLQR achieved the largest minimum separation distance on average with the

smallest standard deviation, which contributed to safe robot navigation. Note that even iLQG had

one collision under this large model mismatch. Figure 5.2 provides a qualitative explanation of this

failure; the planned trajectories by iLQG tend to be much closer to the passing pedestrian than

those by the risk-sensitive RAT iLQR. This difference is congruous with our earlier observations in

Chapter 4 where risk-sensitivity is shown to affect the global behavior of the robot.

5.5.3 Benefits of Risk-Sensitivity Parameter Optimization

We also performed two additional sets of 30 simulations for RAT iLQR, with two different ground-

truth distributions that are closer to the model distribution, with estimated KL divergences of 7.78 and 1.34, respectively. This is to study how different KL bounds affect the optimal choice of the risk-sensitivity parameter θ∗. The results are shown in Figure 5.4. As the KL bound increases from 1.34 to 32.02, the ratio of the optimal θ∗ found by RAT iLQR to the maximum feasible θ also


Figure 5.3: Histograms of the minimum separation distance between the robot and the pedestrian. A negative value indicates that a collision has occurred in that run. For each control algorithm, we performed 30 runs of the simulation with randomized pedestrian start positions. RAT iLQR consistently maintained a sufficient safety margin to avoid collision, while iLQG and PETS both failed. See Table 5.1 for the summary statistics of these data.

increases. This matches our intuition that the larger the model mismatch, the more risk-sensitive

the robot becomes. However, we also note that the robot does not saturate θ all the time even under

the largest KL bound of 32.02. This raises a fundamental question on the benefits of RAT iLQR as

a risk-sensitive optimal control algorithm: how favorable is RAT iLQR with optimal θ∗ compared

to iLEQG with the highest risk-sensitivity?

To answer this question, we performed a comparative study between RAT iLQR with θ∗ and

iLEQG with θmax (i.e. maximum feasible θ found during the cross entropy sampling of RAT iLQR)

under the same simulation setup as before. The results are reported in Table 5.2. In terms of collision

avoidance, both algorithms were equally safe with collision count of 0. However, RAT iLQR achieved

significantly more efficient robot navigation compared to iLEQG with θmax, reducing the average

tracking error by 39%, 34%, and 16% for the KL values of 1.34, 7.78, and 32.02, respectively. The

efficiency and the safety of robot navigation are often in conflict in dynamic collision avoidance (see

Figure 5.5), and prior work [108] struggles to find the right balance by manually tuning a fixed θ.

With RAT iLQR, the need for such manual tuning is eliminated since the algorithm dynamically adjusts θ to the value best suited to handling the potential model mismatch specified by the KL bound.


Method            Min. Sep. Dist. [m]   Total Collision Count
RAT iLQR (Ours)   1.26 ± 0.51           0
iLQG              0.94 ± 0.62           1
PETS              0.87 ± 0.69           4

Table 5.1: Statistics summarizing the histogram plots presented in Figure 5.3. RAT iLQR achieved the largest average value for the minimum separation distance with the smallest standard deviation, which contributed to safe robot navigation without a single collision. Note that PETS had multiple collisions despite its access to the true Gaussian mixture distribution.

Figure 5.4: Time-averaged ratio of the optimal θ∗ found by RAT iLQR to the maximum feasible θ before neurotic breakdown occurs, plotted for three distinct KL divergence values. As the KL bound increases from 1.34 to 32.02, the ratio also consistently increases from 0.66 to 0.93. Note also that the standard deviation decreases from 0.29 to 0.10. This suggests that the robot becomes more risk-sensitive as the KL bound increases, and yet it does not choose the maximum θ value all the time.

5.6 Conclusions

In this chapter we propose RAT iLQR, a novel nonlinear MPC algorithm for distributionally robust

control under a given KL divergence bound. Our method is based on the mathematical equiva-

lence between distributionally robust control and risk-sensitive optimal control. A locally optimal

solution to the resulting bilevel optimization problem is derived with iLEQG and the cross entropy

method. The simulation study shows that RAT iLQR successfully takes into account the distribu-

tional mismatch during collision avoidance. It also shows the effectiveness of dynamic adjustment of

the risk-sensitivity parameter by RAT iLQR, which overcomes a limitation of existing risk-sensitive

optimal control methods. Future work will focus on accurate online estimation of the KL divergence

from a stream of data. We are also interested in exploring applications of RAT iLQR, including

control of learned dynamical systems and perception-aware control.


KL Bound: d = 1.34
Method            Total Collision Count   Tracking Error [m]
RAT iLQR (Ours)   0                       0.22 ± 0.29
iLEQG with θmax   0                       0.36 ± 0.45

KL Bound: d = 7.78
Method            Total Collision Count   Tracking Error [m]
RAT iLQR (Ours)   0                       0.25 ± 0.35
iLEQG with θmax   0                       0.38 ± 0.52

KL Bound: d = 32.02
Method            Total Collision Count   Tracking Error [m]
RAT iLQR (Ours)   0                       0.32 ± 0.40
iLEQG with θmax   0                       0.38 ± 0.46

Table 5.2: Our comparative study between RAT iLQR with θ∗ and iLEQG with θmax (i.e., the maximum feasible risk-sensitivity) reveals that RAT iLQR's optimal choice of the risk-sensitivity parameter θ∗ results in more efficient robot navigation with smaller trajectory tracking errors, while still achieving collision avoidance under the model mismatch. With RAT iLQR, the average tracking error was reduced by 39%, 34%, and 16% for the 3 true distributions with KL divergences of 1.34, 7.78, and 32.02, respectively.

Figure 5.5: Trade-off between collision avoidance and trajectory tracking for different methods. Error bars show the standard deviations divided by the square root of the number of samples. Note that the farther up and to the right, the better, as the x-axis is flipped. All the methods were tested under the true distribution with KL divergence d = 32.02 from the Gaussian model. As expected, RAT iLQR lies between the risk-neutral iLQG and the maximally risk-sensitive iLEQG.


Chapter 6

Conclusions and Future Research

6.1 Summary of Contributions

In this thesis, we study the problem of online planning/control for intelligent mobile robots that

are operating autonomously in open and interactive environments. Challenges associated with en-

vironmental and dynamic uncertainty have been identified, and categorized based on the sources of

uncertainty as state uncertainty, transition uncertainty, and model uncertainty. We have taken an

online, model-based approach to propose computationally efficient solution methods that overcome

those uncertainties. State uncertainty is addressed via recursive Bayesian inference and belief space

planning, which together let the robot keep track of evolving state uncertainty and appropriately

mitigate it over time. Our online belief space planning frameworks are shown effective in various

applications with different combinations of dynamical systems and sensing modalities. Transition

uncertainty is handled by our nonlinear risk-sensitive MPC algorithm, whose ability to incorporate

modern generative models and consider the variance of uncertainty is advantageous in safety-critical

applications. This is demonstrated in safe autonomous navigation of a ground robot among many

mutually interacting humans. Finally, model uncertainty is addressed by our distributionally robust

MPC framework, which extends conventional risk-sensitive optimal control by dynamically and opti-

mally adjusting the risk-sensitivity parameter. This extension not only enables the robot to operate

without precise knowledge of the underlying probability distribution that characterizes transition

uncertainty, but also prevents the planner to become overly conservative; in simulation, the robot

controlled by our algorithm achieved more efficient navigation compared to typical risk-sensitive

optimal control, while not sacrificing safety. The methods developed in this thesis collectively make

the robotic system more reliable in open and interactive environments, by either mitigating, avoiding

the risk of, or becoming robust against inevitable uncertainty.


6.2 Future Work

There exist several open questions for future research. The active vision-based intent inference

framework proposed in Chapter 2 is found to effectively resolve message ambiguity in motion-based

communication. However, the current formulation is limited in that it requires a set of hand-designed

trajectories represented by ordered collections of discrete waypoints. In order for us to apply our

active vision framework to general intent inference problems, it is of practical importance to be able

to identify how distinct intent of an agent leads to different statistical behaviors described by general

stochastic dynamical systems. This most likely requires domain specific, labeled data of the agent’s

motion, from which we need to perform regression to fit a stochastic dynamical system model that

corresponds to each intent type of the agent.

For fast belief space planning presented in Chapter 3, our method applies local perturbation to a

given nominal control policy, which requires that the system is differentiable. Thus our algorithm is

not suited for perception models that are not differentiable, such as a monocular camera with a field-

of-view limit. It is likely not effective either for problems that involve inherently discrete decisions

as part of the optimization variables. For the former issue, there exist recent works [76, 149] that

approximate non-differentiable processes with differentiable surrogate models. For the latter, we

could combine the local perturbation with a discrete search algorithm so that the search is followed

by the perturbation. This is indeed similar to the approach that we have taken in Chapter 4,

in which a discrete search method with motion primitives precedes control perturbation by our

optimization algorithm. Another interesting question is how to apply our belief space planning

algorithm to complex perception systems for which an analytical model is not available. To that

end, representation learning that encodes a series of high dimensional observations (such as a stream

of RGB images) into learned Bayesian posteriors can be useful. Learning hidden state representations

along with their belief dynamics is gaining attention in the machine learning literature [57, 55, 58],

and could be incorporated into the model-based belief space planning framework developed in this

thesis.

For the risk-sensitive and distributionally robust planning presented in Chapters 4 and 5, it is worth

considering other risk metrics than the entropic risk measure employed in this thesis. In particular,

chance constrained optimization would be a beneficial framework for safety-critical applications,

since it can give a statistical safety assurance by limiting the probability of an undesirable outcome.

However, chance constrained optimization is highly intractable for nonlinear systems under general

multimodal distributions, with a notable exception of recent work by Wang et al. [166]. Furthermore,

Cheng et al. [29] have suggested that modern data-driven methods for quantifying uncertainty are

too inaccurate for chance constrained optimization to the extent that the safety assurances are no

longer reliable. Therefore, the notion of distributional robustness is still necessary to take account

of model uncertainty. We think that distributionally robust, chance constrained optimization is an

interesting topic for future research.


We believe that these open research questions can eventually lead to more effective solution

methods that overcome fundamental issues associated with uncertainty. We hope that the methods

and the findings presented in this thesis will become the basis for future research towards truly

reliable robotic autonomy in open and interactive environments.


Appendix A

Mode Insertion Gradient for Stochastic Hybrid Systems

In this appendix, we provide a thorough analysis of the stochastic hybrid systems with time-driven

switching that satisfy Assumptions 3.1 – 3.4. Our goal is to prove Theorem 3.1, which guarantees

that the mode insertion gradient exists and is well-defined. The results presented in this appendix

also appear in our journal paper [112] and serve as a generalization of the mode insertion gradient

presented in [41, 172, 6] to also take account of stochasticity in the dynamical system.

A.1 Nominal Trajectory under Specific Observations

First, we analyze the properties of the system x(t) for t ∈ [0, T ] under a given initial condition x0,

control u ∈ U , and a specific sequence of observations (y1, . . . , yT ) sampled from (Y1, . . . , YT ).

Proposition A.1 (Existence and Uniqueness of Solutions). Given a control $u \in \mathcal{U}$ and a sequence of observations $(y_1, \dots, y_T)$, the system $x(t)$ starting at $x_0$ has a unique solution for $t \in [0, T]$.

Proof. We will show that each $x_i$ for $i \in \{1, \dots, T\}$ has a unique solution, and thus $x$ is uniquely determined as a whole. First, by Assumptions (3.2a), (3.2c) and the Picard Lemma (Lemma 5.6.3 in [43]), the differential equation

$\dot{x}_1(t) = f(x_1(t), u(t))$   (A.1)

with initial condition $x_1(0) = x_0$ has a solution for $t \in [0, 1]$. Furthermore, Proposition 5.6.5 in [43] assures that the solution $x_1$ is unique under Assumptions (3.2a) and (3.2c). This guarantees that the initial condition for $x_2$ defined by $x_2(1) = g(x_1(1), y_1)$ is unique. Therefore, proceeding by induction, each of $x_1, \dots, x_T$ has a unique solution, which completes the proof.


Corollary A.2 (Right Continuity). Given a control $u \in \mathcal{U}$ and a sequence of observations $(y_1, \dots, y_T)$, the system $x(t)$ starting at $x_0$ is right continuous in $t$ on $[0, T]$.

Proof. By Proposition A.1, each $x_i$ has a unique solution that follows $\dot{x}_i = f(x_i, u)$. Clearly each $x_i$ is continuous on $[i-1, i]$, which proves that $x(t) \triangleq x_i(t)\ \forall t \in [i-1, i)\ \forall i \in \{1, 2, \dots, T\}$ with $x(T) \triangleq g(x_T(T), y_T)$ is right continuous on $[0, T]$.

Lemma A.3. Let $\xi_i$ denote the initial condition for $x_i$. Then, there exists a constant $K_8 < \infty$ such that for all $i \in \{1, \dots, T\}$,

$\forall t \in [i-1, i] \quad \|x_i(t)\|_2 \leq (1 + \|\xi_i\|_2)\, e^{K_8}$.   (A.2)

Proof. Using Assumptions (3.2a) and (3.2c), the claim follows directly from Proposition 5.6.5 in [43].

Proposition A.4 (Lipschitz Continuity). For each $i \in \{1, \dots, T\}$, let $\xi'_i$ and $\xi''_i$ be two distinct initial conditions for $x_i$. Furthermore, let $u'$ and $u''$ be two controls from $\mathcal{U}$. The pairs $(\xi'_i, u')$ and $(\xi''_i, u'')$ respectively define two solutions $x'_i$ and $x''_i$ to the ODE $\dot{x}_i = f(x_i, u)$ over $[i-1, i]$. Then, there exists an $L < \infty$, independent of $\xi'_i$, $\xi''_i$, $u'$ and $u''$, such that

$\forall i \in \{1, \dots, T\}\ \forall t \in [i-1, i] \quad \|x'_i(t) - x''_i(t)\|_2 \leq L\left(\|\xi'_i - \xi''_i\|_2 + \int_{i-1}^{i} \|u'(t) - u''(t)\|_2\, dt\right)$.   (A.3)

Proof. The proof is similar to that of Lemma 5.6.7 in [43]. Making use of the Picard Lemma (Lemma 5.6.3 in [43]) and Assumption (3.2c), we obtain

$\|x'_i(t) - x''_i(t)\|_2 \leq e^{K_1}\left(\|\xi'_i - \xi''_i\|_2 + K_1 \int_{i-1}^{i} \|u'(t) - u''(t)\|_2\, dt\right)$.   (A.4)

As $K_1 \geq 1$,

$\|x'_i(t) - x''_i(t)\|_2 \leq K_1 e^{K_1}\left(\|\xi'_i - \xi''_i\|_2 + \int_{i-1}^{i} \|u'(t) - u''(t)\|_2\, dt\right)$.   (A.5)

Defining $L \triangleq K_1 e^{K_1} < \infty$ completes the proof.

Corollary A.5 (Uniform Continuity in Initial Conditions). Let $u \in \mathcal{U}$ be a given control. For each $i \in \{1, \dots, T\}$, let $\xi'_i$ and $\xi''_i$ be two distinct initial conditions for $x_i$. The pairs $(\xi'_i, u)$ and $(\xi''_i, u)$ respectively define two solutions $x'_i$ and $x''_i$ to the ODE $\dot{x}_i = f(x_i, u)$ over $[i-1, i]$. Note that they share the same control but have different initial conditions, unlike Proposition A.4. Then, for any $\epsilon > 0$ there exists $\delta > 0$ such that

$\forall i \in \{1, \dots, T\}\ \forall t \in [i-1, i] \quad \|\xi'_i - \xi''_i\|_2 < \delta \implies \|x'_i(t) - x''_i(t)\|_2 < \epsilon$.   (A.6)

Proof. By Proposition A.4, we find

$\|x'_i(t) - x''_i(t)\|_2 \leq L \|\xi'_i - \xi''_i\|_2$   (A.7)

for all $i \in \{1, \dots, T\}$ and $t \in [i-1, i]$, where $L < \infty$. Take any $\delta < \epsilon / L$ to prove the claim.

Proposition A.6 (Continuity in Observations). Given a control $u \in \mathcal{U}$, the map $(y_1, \dots, y_T) \mapsto x(t)$ is continuous for all $t \in [0, T]$, where $x$ represents the solution to the system under Assumption 3.2 starting at $x(0) = x_0$.

Proof. We will show the continuity of $(y_1, \dots, y_T) \mapsto x_i(t)$ for each $i \in \{1, \dots, T\}$ by mathematical induction. First, by Assumption 3.2 the value of $x_1(t)$ is solely determined by $x_0$ and $u$. Therefore, for any $t \in [0, 1]$ the function $(y_1, \dots, y_T) \mapsto x_1(t)$ is a constant map, which is continuous. Next, suppose that $\forall t \in [i-1, i]$ $(y_1, \dots, y_T) \mapsto x_i(t)$ is continuous for some $i \in \{1, \dots, T-1\}$. Now consider $x_{i+1}$. Let $F_{i+1}(\xi_{i+1}, t)$ be the map from an initial condition $x_{i+1}(i) = \xi_{i+1}$ to the solution $x_{i+1}$ at $t \in [i, i+1]$ under the given $u$. In other words, $F_{i+1}(\xi_{i+1}, t)$ is equivalent to the integral equation

$F_{i+1}(\xi_{i+1}, t) \triangleq \xi_{i+1} + \int_{i}^{t} f(x_{i+1}(a), u(a))\, da$   (A.8)

and takes the initial condition $\xi_{i+1}$ as well as the time $t$ as its arguments. Substituting $\xi_{i+1} = g(x_i(i), y_i)$, $F_{i+1}(g(x_i(i), y_i), t)$ gives the actual value of $x_{i+1}(t)$. We will prove the continuity of $F_{i+1}(g(x_i(i), y_i), t)$ as follows. First, note that the map from $(y_1, \dots, y_T)$ to $F_{i+1}(g(x_i(i), y_i), t)$ is the result of the composition:

$(y_1, \dots, y_T) \mapsto (x_i(i), y_i) \mapsto g(x_i(i), y_i) \mapsto F_{i+1}(g(x_i(i), y_i), t)$.   (A.9)

The first map is continuous since $x_i(i)$ is continuous in $(y_1, \dots, y_T)$ by the induction hypothesis. The second map is also continuous by Assumption (3.2b). Lastly, Corollary A.5 shows that $F_{i+1}(\xi_{i+1}, t)$ is (uniformly) continuous in $\xi_{i+1}$ for $t \in [i, i+1]$. Therefore, $(y_1, \dots, y_T) \mapsto x_{i+1}(t)$ is continuous for all $t \in [i, i+1]$.

Proposition A.7 (Bounded State Trajectory). Given a control $u \in \mathcal{U}$ and a sequence of observations $(y_1, \dots, y_T)$, the system $x(t)$ starting at $x(0) = x_1(0) = x_0$ has the following bound:

$\forall i \in \{2, \dots, T\}\ \forall t \in [i-1, i] \quad \|x_i(t)\|_2 \leq \sum_{(j_1, \dots, j_{i-1}) \in \mathcal{K}_i} \alpha_i^{(j_1, \dots, j_{i-1})}(x_0) \prod_{m=1}^{i-1} \|y_m\|_2^{j_m}$,   (A.10)

where $\mathcal{K}_i$ is a finite set of sequences of non-negative integers of length $i-1$, and $\alpha_i^{(j_1, \dots, j_{i-1})}(x_0)$ is a finite positive constant that depends on $x_0$ and $(j_1, \dots, j_{i-1})$ but not on any of the observations or the control.

For $i = 1$ the bound is given by $\forall t \in [0, 1]\ \|x_1(t)\|_2 \leq \alpha_1(x_0)$ for some finite positive constant $\alpha_1(x_0)$.

Proof. For $i = 1$, Lemma A.3 gives $\forall t \in [0, 1]\ \|x_1(t)\|_2 \leq (1 + \|x_0\|_2)\, e^{K_8} \triangleq \alpha_1(x_0)$. For $i = 2$, by Assumption (3.2d) and the case for $i = 1$, we have

$\|x_2(1)\|_2 = \|g(x_1(1), y_1)\|_2$   (A.11)
$\leq K_2 + K_3 \|x_1(1)\|_2^{L_1} + K_4 \|y_1\|_2^{L_2} + K_5 \|x_1(1)\|_2^{L_1} \|y_1\|_2^{L_2}$   (A.12)
$\leq K_2 + K_3 \alpha_1(x_0)^{L_1} + K_4 \|y_1\|_2^{L_2} + K_5 \alpha_1(x_0)^{L_1} \|y_1\|_2^{L_2}$.   (A.13)

Then, by Lemma A.3, $\forall t \in [1, 2]$

$\|x_2(t)\|_2 \leq (1 + \|x_2(1)\|_2)\, e^{K_8}$   (A.14)
$\leq e^{K_8}\left(1 + K_2 + K_3 \alpha_1(x_0)^{L_1} + K_4 \|y_1\|_2^{L_2} + K_5 \alpha_1(x_0)^{L_1} \|y_1\|_2^{L_2}\right)$   (A.15)
$\triangleq \sum_{(j_1) \in \mathcal{K}_2} \alpha_2^{(j_1)}(x_0) \|y_1\|_2^{j_1}$,   (A.16)

where $\mathcal{K}_2 = \{(0), (L_2)\}$, and

$\alpha_2^{(0)}(x_0) = e^{K_8}\left(1 + K_2 + K_3 \alpha_1(x_0)^{L_1}\right)$   (A.17)
$\alpha_2^{(L_2)}(x_0) = e^{K_8}\left(K_4 + K_5 \alpha_1(x_0)^{L_1}\right)$   (A.18)

are both finite positive constants that depend on $x_0$ but not on any of the observations or the control.

Next, suppose that the claim holds for some $i \geq 2$. That is, $\forall t \in [i-1, i]$

$\|x_i(t)\|_2 \leq \sum_{(j_1, \dots, j_{i-1}) \in \mathcal{K}_i} \alpha_i^{(j_1, \dots, j_{i-1})}(x_0) \prod_{m=1}^{i-1} \|y_m\|_2^{j_m}$,   (A.19)

where $\mathcal{K}_i$ and $\alpha_i^{(j_1, \dots, j_{i-1})}(x_0)$ are as defined in the statement of the proposition. Making use of this assumption, Assumption (3.2d) and Lemma A.3, we find that for all $t \in [i, i+1]$,

$\|x_{i+1}(t)\|_2 \leq (1 + \|g(x_i(i), y_i)\|_2)\, e^{K_8}$   (A.20)
$\leq e^{K_8}(1 + K_2) + e^{K_8} K_4 \|y_i\|_2^{L_2} + e^{K_8} \|x_i(i)\|_2^{L_1}\left(K_3 + K_5 \|y_i\|_2^{L_2}\right)$.   (A.21)


The first two terms in the above sum can be rewritten as

$e^{K_8}(1 + K_2) = e^{K_8}(1 + K_2)\, \|y_1\|_2^0 \times \cdots \times \|y_i\|_2^0$   (A.22)
$e^{K_8} K_4 \|y_i\|_2^{L_2} = e^{K_8} K_4\, \|y_1\|_2^0 \times \cdots \times \|y_{i-1}\|_2^0 \times \|y_i\|_2^{L_2}$.   (A.23)

For the last term, we can use (A.19) and the multinomial theorem to write

$\|x_i(i)\|_2^{L_1} \leq \left(\sum_{(j_1, \dots, j_{i-1}) \in \mathcal{K}_i} \alpha_i^{(j_1, \dots, j_{i-1})}(x_0) \prod_{m=1}^{i-1} \|y_m\|_2^{j_m}\right)^{L_1}$   (A.24)
$= \sum_{k_1 + \cdots + k_{|\mathcal{K}_i|} = L_1} \binom{L_1}{k_1, \dots, k_{|\mathcal{K}_i|}} \times \left(\prod_{l=1}^{|\mathcal{K}_i|} \alpha_i^{(j_1^{(l)}, \dots, j_{i-1}^{(l)})}(x_0)^{k_l}\right) \times \prod_{m=1}^{i-1} \|y_m\|_2^{\sum_{l=1}^{|\mathcal{K}_i|} k_l j_m^{(l)}}$,   (A.25)

where $(j_1^{(l)}, \dots, j_{i-1}^{(l)})$ is the $l$-th element in $\mathcal{K}_i$. Note that $k_l$ is non-negative for all $l \in \{1, \dots, |\mathcal{K}_i|\}$. By the induction hypothesis (A.19), the exponent $\sum_{l=1}^{|\mathcal{K}_i|} k_l j_m^{(l)}$ is also non-negative for all $m \in \{1, \dots, i-1\}$. Thus, substituting (A.22), (A.23), and (A.25) into (A.21), rearranging the sums and re-labeling the sequences of integer exponents, we can write

$\|x_{i+1}(t)\|_2 \leq \sum_{(j_1, \dots, j_i) \in \mathcal{K}_{i+1}} \alpha_{i+1}^{(j_1, \dots, j_i)}(x_0) \prod_{m=1}^{i} \|y_m\|_2^{j_m}$   (A.26)

for all $t \in [i, i+1]$, where $\mathcal{K}_{i+1}$ is a set of sequences of non-negative integers of length $i$, and each $\alpha_{i+1}^{(j_1, \dots, j_i)}(x_0)$ does not depend on any of the observations or the control. Here the cardinality of $\mathcal{K}_{i+1}$ is at most finite, since

$|\mathcal{K}_{i+1}| \leq 2 + 2\binom{L_1 + |\mathcal{K}_i| - 1}{|\mathcal{K}_i| - 1}$   (A.27)

by (A.21) and the multinomial theorem.

Finally, proceeding by mathematical induction over $i \in \{2, \dots, T\}$ completes the proof.

Proposition A.8 (Bounded Cost Functions). Given a control $u \in \mathcal{U}$ and a sequence of observations $(y_1, \dots, y_T)$, the instantaneous cost $c(x(t), u(t))$ induced by the state trajectory $x(t)$ has the following bound:

$\forall i \in \{2, \dots, T\}\ \forall t \in [i-1, i] \quad |c(x_i(t), u(t))| \leq \sum_{(j_1, \dots, j_{i-1}) \in \mathcal{K}'_i} \alpha_i'^{(j_1, \dots, j_{i-1})}(x_0) \prod_{m=1}^{i-1} \|y_m\|_2^{j_m}$,   (A.28)

where $\mathcal{K}'_i$ is a finite set of sequences of non-negative integers of length $i-1$, and $\alpha_i'^{(j_1, \dots, j_{i-1})}(x_0)$ is a finite positive constant that depends on $x_0$ and $(j_1, \dots, j_{i-1})$ but not on any of the observations or the control.

For $i = 1$ the bound is given by

$\forall t \in [0, 1] \quad |c(x_1(t), u(t))| \leq \alpha'_1(x_0)$   (A.29)

for some finite positive constant $\alpha'_1(x_0)$.

Similarly, the terminal cost $h(x(T))$ is bounded by

$|h(x(T))| \leq \sum_{(j_1, \dots, j_T) \in \mathcal{K}'_{T+1}} \alpha_{T+1}'^{(j_1, \dots, j_T)}(x_0) \prod_{m=1}^{T} \|y_m\|_2^{j_m}$.   (A.30)

Proof. For the instantaneous cost function $c(x_i(t), u(t))$, Assumption 3.3 along with Proposition A.7 yields the following bound:

$\forall i \in \{2, \dots, T\}\ \forall t \in [i-1, i] \quad |c(x_i(t), u(t))| \leq K_6 + K_7 \left(\sum_{(j_1, \dots, j_{i-1}) \in \mathcal{K}_i} \alpha_i^{(j_1, \dots, j_{i-1})}(x_0) \prod_{m=1}^{i-1} \|y_m\|_2^{j_m}\right)^{L_3}$.   (A.31)

Making use of the multinomial expansion formula in the same manner as in the proof of Proposition A.7, we conclude that

$\forall i \in \{2, \dots, T\}\ \forall t \in [i-1, i] \quad |c(x_i(t), u(t))| \leq \sum_{(j_1, \dots, j_{i-1}) \in \mathcal{K}'_i} \alpha_i'^{(j_1, \dots, j_{i-1})}(x_0) \prod_{m=1}^{i-1} \|y_m\|_2^{j_m}$   (A.32)

for some finite set $\mathcal{K}'_i$ of sequences of non-negative integers and finite positive constants $\alpha_i'^{(j_1, \dots, j_{i-1})}(x_0)$. Similarly, for $i = 1$ we obtain

$|c(x_1(t), u(t))| \leq K_6 + K_7 \alpha_1(x_0)^{L_3} \triangleq \alpha'_1(x_0)$   (A.33)

for all $t \in [0, 1]$.


To bound the terminal cost, note that

$|h(x(T))| \leq K_6 + K_7 \|x(T)\|_2^{L_3}$   (A.34)
$= K_6 + K_7 \|g(x_T(T), y_T)\|_2^{L_3}$   (A.35)
$\leq K_6 + K_7 \left(K_2 + K_4 \|y_T\|_2^{L_2} + \|x_T(T)\|_2^{L_1}\left(K_3 + K_5 \|y_T\|_2^{L_2}\right)\right)^{L_3}$   (A.36)

by Assumptions (3.2d) and 3.3. Since $L_3 < \infty$, (A.36) yields a polynomial of $\|x_T(T)\|_2$ and $\|y_T\|_2$ with finitely many terms, for each of which we can use Proposition A.7 and apply the multinomial expansion formula to show

$|h(x(T))| \leq \sum_{(j_1, \dots, j_T) \in \mathcal{K}'_{T+1}} \alpha_{T+1}'^{(j_1, \dots, j_T)}(x_0) \prod_{m=1}^{T} \|y_m\|_2^{j_m}$   (A.37)

for some finite set $\mathcal{K}'_{T+1}$ of sequences of non-negative integers as well as finite positive constants $\alpha_{T+1}'^{(j_1, \dots, j_T)}(x_0)$.

A.2 Perturbed Trajectory under Specific Observations

Next, we will perturb the nominal state x of the system while assuming the same initial condition x0

and the specific observations (y1, . . . , yT ) as in Section A.1. The perturbed control is an open-loop

perturbation as defined in Definition 3.1, which we restate below.

Definition A.1 (Perturbed Control). Let $u \in \mathcal{U}$ be a control. For $\tau \in (0, 1)$ and $v \in B(0, \rho_{\max})$, define the perturbed control $u^\epsilon$ by

$u^\epsilon(t) \triangleq \begin{cases} v & \text{if } t \in (\tau - \epsilon, \tau] \\ u(t) & \text{otherwise,} \end{cases}$   (A.38)

where $\epsilon \in [0, \tau]$. By definition, if $\epsilon = 0$ then $u^\epsilon$ is the same as $u$. We assume that the nominal control $u(t)$ is left continuous in $t$ at $t = \tau$.

Remark A.1. It is obvious that $u^\epsilon(t)$ is piecewise continuous on $[0, T]$. Therefore, for $v \in B(0, \rho_{\max})$ we have $u^\epsilon \in \mathcal{U}$, i.e., $u^\epsilon$ is an admissible control. Thus, Proposition A.1 assures that there exists a unique solution $x^\epsilon$ for the trajectory of the system under the control perturbation. In the remainder of the analysis, we assume that $(\tau, v)$ is given and fixed.

Lemma A.9. Let $\epsilon, \epsilon' \in [0, \tau]$, and let $u^\epsilon, u^{\epsilon'} \in \mathcal{U}$ be two perturbed controls of the form (A.38). Let $x_1^\epsilon$, $x_1^{\epsilon'}$ be the solutions of $x_1$ for $t \in [0, 1]$ obtained by applying $u^\epsilon$ and $u^{\epsilon'}$, respectively, to the initial condition $x_0$. Then, there exists an $L' < \infty$, independent of $\epsilon$, $\epsilon'$, $x_1^\epsilon$ and $x_1^{\epsilon'}$, such that

$\forall \epsilon, \epsilon' \in [0, \tau]\ \forall t \in [0, 1] \quad \|x_1^\epsilon(t) - x_1^{\epsilon'}(t)\|_2 \leq L' |\epsilon - \epsilon'|$.   (A.39)

Proof. By Proposition A.4, we find that

∀t ∈ [0, 1]  ‖x^ε_1(t) − x^{ε′}_1(t)‖_2 ≤ L ∫_0^1 ‖u^ε(t) − u^{ε′}(t)‖_2 dt   (A.40)

for some L < ∞. Let us derive an upper bound on the integral on the right hand side. If ε ≥ ε′,

∫_0^1 ‖u^ε(t) − u^{ε′}(t)‖_2 dt = ∫_{τ−ε}^{τ−ε′} ‖v − u(t)‖_2 dt,   (A.41)

where u(t) is the nominal control that both u^ε and u^{ε′} are based on. Since u(t) ∈ B(0, ρ_max), we obtain

∫_0^1 ‖u^ε(t) − u^{ε′}(t)‖_2 dt ≤ sup_{u ∈ B(0, ρ_max)} ‖v − u‖_2 (ε − ε′).   (A.42)

Similarly, if ε < ε′ we have

∫_0^1 ‖u^ε(t) − u^{ε′}(t)‖_2 dt ≤ sup_{u ∈ B(0, ρ_max)} ‖v − u‖_2 (ε′ − ε).   (A.43)

Put these two cases together and substitute into (A.40) to get

∀t ∈ [0, 1]  ‖x^ε_1(t) − x^{ε′}_1(t)‖_2 ≤ L′|ε − ε′|,   (A.44)

where L′ ≜ L sup_{u ∈ B(0, ρ_max)} ‖v − u‖_2 ≤ 2Lρ_max < ∞.

Lemma A.10. Let u^ε and x^ε_1 be as in Lemma A.9. Then,

lim_{ε→0+} (1/ε) ∫_{τ−ε}^{τ} (f(x^ε_1(t), u^ε(t)) − f(x_1(t), u(t))) dt = f(x_1(τ), v) − f(x_1(τ), u(τ)),   (A.45)

where x_1 denotes the solution under the nominal control u ∈ U.

Proof. We will show that the difference norm

‖(1/ε) ∫_{τ−ε}^{τ} (f(x^ε_1(t), u^ε(t)) − f(x_1(t), u(t))) dt − f(x_1(τ), v) + f(x_1(τ), u(τ))‖_2   (A.46)

converges to 0 as ε → 0+. Indeed, (A.46) becomes

‖(1/ε) ∫_{τ−ε}^{τ} (f(x^ε_1(t), u^ε(t)) − f(x_1(t), u(t)) − f(x_1(τ), v) + f(x_1(τ), u(τ))) dt‖_2   (A.47)
≤ (1/ε) ∫_{τ−ε}^{τ} ‖f(x^ε_1(t), u^ε(t)) − f(x_1(t), u(t)) − f(x_1(τ), v) + f(x_1(τ), u(τ))‖_2 dt   (A.48)
≤ (1/ε) ∫_{τ−ε}^{τ} (‖f(x^ε_1(t), u^ε(t)) − f(x_1(τ), v)‖_2 + ‖f(x_1(t), u(t)) − f(x_1(τ), u(τ))‖_2) dt.   (A.49)

We used the triangle inequality in (A.49). Now, making use of the fact that ∀t ∈ (τ − ε, τ] u^ε(t) = v, Assumption (3.2c) yields

‖f(x^ε_1(t), u^ε(t)) − f(x_1(τ), v)‖_2 ≤ K_1‖x^ε_1(t) − x_1(τ)‖_2   (A.50)
= K_1‖x^ε_1(t) − x_1(t) + x_1(t) − x_1(τ)‖_2   (A.51)
≤ K_1‖x^ε_1(t) − x_1(t)‖_2 + K_1‖x_1(t) − x_1(τ)‖_2   (A.52)
≤ K_1 L′ε + K_1‖x_1(t) − x_1(τ)‖_2   (A.53)

for any t ∈ (τ − ε, τ], where we applied Lemma A.9 in (A.53) with ε′ = 0. Similarly,

‖f(x_1(t), u(t)) − f(x_1(τ), u(τ))‖_2 ≤ K_1‖x_1(t) − x_1(τ)‖_2 + K_1‖u(t) − u(τ)‖_2.   (A.54)

Therefore, (A.46) is upper-bounded by

(1/ε) ∫_{τ−ε}^{τ} (K_1 L′ε + 2K_1‖x_1(t) − x_1(τ)‖_2 + K_1‖u(t) − u(τ)‖_2) dt   (A.55)
≤ K_1 L′ε + 2K_1 sup_{t ∈ [τ−ε, τ]} ‖x_1(t) − x_1(τ)‖_2 + K_1 sup_{t ∈ [τ−ε, τ]} ‖u(t) − u(τ)‖_2,   (A.56)

which converges to 0 as ε → 0+, since

0 ≤ sup_{t ∈ [τ−ε, τ]} ‖x_1(t) − x_1(τ)‖_2 → 0   (A.57)

and

0 ≤ sup_{t ∈ [τ−ε, τ]} ‖u(t) − u(τ)‖_2 → 0   (A.58)

as ε → 0+.

Lemma A.11. Let x^ε_1 and x_1 be as in Lemma A.10. Let Ψ_1(t) be the right derivative of x^ε_1(t) with respect to ε evaluated at ε = 0. That is,

Ψ_1(τ) = (∂^+/∂ε) x^ε_1(τ) |_{ε=0} ≜ lim_{ε→0+} (x^ε_1(τ) − x_1(τ))/ε.   (A.59)

Then, we have

Ψ_1(τ) = f(x_1(τ), v) − f(x_1(τ), u(τ)).   (A.60)

Proof. Let us express both x^ε_1(τ) and x_1(τ) in the integral form:

x^ε_1(τ) = x_0 + ∫_0^τ f(x^ε_1(t), u^ε(t)) dt   (A.61)
x_1(τ) = x_0 + ∫_0^τ f(x_1(t), u(t)) dt.   (A.62)

Note that u^ε(t) = u(t) and x^ε_1(t) = x_1(t) for t ∈ [0, τ − ε], since no perturbation is applied to the system until t > τ − ε. Therefore,

x^ε_1(τ) − x_1(τ) = ∫_{τ−ε}^{τ} (f(x^ε_1(t), u^ε(t)) − f(x_1(t), u(t))) dt.   (A.63)

Making use of Lemma A.10, we conclude that

lim_{ε→0+} (x^ε_1(τ) − x_1(τ))/ε = f(x_1(τ), v) − f(x_1(τ), u(τ)).   (A.64)

Lemma A.12. Suppose that Assumption (3.2c) is satisfied. Then, we have the following bound on the (matrix) norm of the Jacobian of the function f:

‖(∂/∂x) f(x′, u′)‖_2 ≤ K_1,   (A.65)

for any x′ ∈ R^{n_x} and u′ ∈ B(0, ρ_max).

Proof. Let u′′ be equal to u′ in Assumption (3.2c). Then, we have

‖f(x′, u′) − f(x′′, u′)‖_2 ≤ K_1‖x′ − x′′‖_2   (A.66)

for any x′, x′′ ∈ R^{n_x} and any u′ ∈ B(0, ρ_max). Now, take any non-zero scalar t and any unit vector v ∈ R^{n_x} satisfying ‖v‖_2 = 1, and let

x′′ ≜ x′ + tv.   (A.67)


Substituting (A.67) into (A.66), we get

‖f(x′ + tv, u′) − f(x′, u′)‖_2 / |t| ≤ K_1.   (A.68)

Note that this holds for any t and v as defined above. From multivariate calculus, on the other hand, the directional derivative of f with respect to v is given by

lim_{t→0} (f(x′ + tv, u′) − f(x′, u′))/t = ((∂/∂x) f(x′, u′)) v.   (A.69)

Therefore, we have

‖((∂/∂x) f(x′, u′)) v‖_2 ≤ K_1,   (A.70)

which implies

‖(∂/∂x) f(x′, u′)‖_2 ≜ sup_{‖v‖_2=1} ‖((∂/∂x) f(x′, u′)) v‖_2 ≤ K_1.   (A.71)

Lemma A.13. Let x^ε_1 and x_1 be as in Lemma A.10. Then, Ψ_1(t) = (∂^+/∂ε) x^ε_1(t)|_{ε=0} uniquely exists for t ∈ [τ, 1] and follows the ODE:

Ψ̇_1(t) = (∂/∂x_1) f(x_1(t), u(t)) Ψ_1(t),   (A.72)

with the initial condition Ψ_1(τ) given by Lemma A.11.

Proof. Taking some a ∈ (τ, 1], let us express x^ε_1(a) and x_1(a) in the integral form:

x^ε_1(a) = x^ε_1(τ) + ∫_τ^a f(x^ε_1(t), u^ε(t)) dt   (A.73)
x_1(a) = x_1(τ) + ∫_τ^a f(x_1(t), u(t)) dt.   (A.74)

Thus, we have

Ψ_1(a) = Ψ_1(τ) + lim_{ε→0+} ∫_τ^a (1/ε)(f(x^ε_1(t), u(t)) − f(x_1(t), u(t))) dt,   (A.75)

where we used ∀t ∈ [τ, a] u^ε(t) = u(t). We will take a measure-theoretic approach to prove that the order of the limit and the integration can be switched in (A.75). First, think of the integral as a Lebesgue integral on the measure space ([τ, a], B([τ, a]), λ), where B([τ, a]) is the Borel σ-algebra on [τ, a] and λ is the Lebesgue measure. Furthermore, consider the integrand as a function from [τ, a] into the Banach space (R^{n_x}, ‖·‖_2), i.e. the Euclidean space endowed with the ℓ2 norm. By the piecewise continuity of u^ε, u and the continuity of x^ε_1, x_1, and f, the integrand is a piecewise continuous function with respect to t, which is λ-measurable. In fact it is also Bochner-integrable, since for t ∈ [τ, a] we have the constant bound:

(1/ε)‖f(x^ε_1(t), u(t)) − f(x_1(t), u(t))‖_2 ≤ (1/ε) K_1‖x^ε_1(t) − x_1(t)‖_2   (A.76)
≤ K_1 L′   (A.77)

by Assumption (3.2c) and Lemma A.9. Furthermore, the chain rule gives

lim_{ε→0+} (1/ε)(f(x^ε_1(t), u(t)) − f(x_1(t), u(t))) = (∂/∂x_1) f(x_1(t), u(t)) Ψ_1(t),   (A.78)

assuming that Ψ_1(t) exists for t ∈ [τ, a]. Therefore, by the Bochner-integral version of the dominated convergence theorem (Theorem 3 in [35], Chapter II), we obtain

Ψ_1(a) = Ψ_1(τ) + ∫_τ^a (∂/∂x_1) f(x_1(t), u(t)) Ψ_1(t) dt.   (A.79)

This is equivalent to the ordinary differential equation:

Ψ̇_1(t) = (∂/∂x_1) f(x_1(t), u(t)) Ψ_1(t).   (A.80)

It remains to show that the solution Ψ_1(t) that satisfies (A.80) does exist and is unique. First, let Ψ′_1 and Ψ′′_1 be two systems that follow (A.80) and share the same initial condition Ψ_1(τ). Then, by Lemma A.12 we have

‖Ψ̇′_1(t) − Ψ̇′′_1(t)‖_2 = ‖(∂/∂x_1) f(x_1(t), u(t)) (Ψ′_1(t) − Ψ′′_1(t))‖_2   (A.81)
≤ ‖(∂/∂x_1) f(x_1(t), u(t))‖_2 · ‖Ψ′_1(t) − Ψ′′_1(t)‖_2   (A.82)
≤ K_1 ‖Ψ′_1(t) − Ψ′′_1(t)‖_2.   (A.83)

Existence follows from this inequality in conjunction with the Picard Lemma (Lemma 5.6.3 in [43]). To show the uniqueness, apply the Bellman-Gronwall Lemma (Lemma 5.6.4 in [43]) to the following integral inequality:

∀a ∈ [τ, 1]  ‖Ψ′_1(a) − Ψ′′_1(a)‖_2 ≤ K_1 ∫_τ^a ‖Ψ′_1(t) − Ψ′′_1(t)‖_2 dt.   (A.84)


Proposition A.14 (Variational Equation). Let u ∈ U and x_1 be the nominal control and the resulting state trajectory. Let x^ε_1 be the perturbed state induced by the perturbed control u^ε of the form (A.38). Propagating x^ε_1(t) through the hybrid dynamics, we get a series of modes x^ε_2, …, x^ε_T that constitutes the entire trajectory x^ε(t) for t ∈ [τ, T]. Define the state variation Ψ(t) for t ∈ [τ, T] by

Ψ(t) = (∂^+/∂ε) x^ε(t) |_{ε=0} ≜ lim_{ε→0+} (x^ε(t) − x(t))/ε.   (A.85)

Then, Ψ(t) exists for t ∈ [τ, T] and follows the hybrid system with time-driven switching:

Ψ(t) = Ψ_1(t) ∀t ∈ [τ, 1),  Ψ_i(t) ∀t ∈ [i − 1, i) ∀i ∈ {2, …, T}   (A.86)

with Ψ(T) = (∂/∂x_T) g(x_T(T), y_T) Ψ_T(T), where Ψ_1 is defined on [τ, 1] as in Lemma A.13, and Ψ_i for i ≥ 2 is defined on [i − 1, i] with

Ψ_i(i − 1) = (∂/∂x_{i−1}) g(x_{i−1}(i − 1), y_{i−1}) Ψ_{i−1}(i − 1)   (A.87)
Ψ̇_i(t) = (∂/∂x_i) f(x_i(t), u(t)) Ψ_i(t) ∀t ∈ [i − 1, i].   (A.88)

Proof. The case for t ∈ [τ, 1) follows from Lemma A.13, since in this case we have

Ψ(t) ≜ lim_{ε→0+} (x^ε(t) − x(t))/ε   (A.89)
= lim_{ε→0+} (x^ε_1(t) − x_1(t))/ε   (A.90)
= Ψ_1(t).   (A.91)

At t = 1, we obtain

Ψ(1) ≜ lim_{ε→0+} (x^ε(1) − x(1))/ε   (A.92)
= lim_{ε→0+} (x^ε_2(1) − x_2(1))/ε   (A.93)
= lim_{ε→0+} (g(x^ε_1(1), y_1) − g(x_1(1), y_1))/ε   (A.94)
= (∂/∂x) g(x_1(1), y_1) Ψ_1(1)   (A.95)

by (3.47) and the chain rule. Let us define Ψ_2 by Ψ_2(t) ≜ lim_{ε→0+} (x^ε_2(t) − x_2(t))/ε. Similarly to the proof of Lemma A.13, one can show that Ψ_2 follows the integral equation:

Ψ_2(a) = Ψ_2(1) + ∫_1^a (∂/∂x_2) f(x_2(t), u(t)) Ψ_2(t) dt   (A.96)

with Ψ_2(1) = Ψ(1), and that the Ψ_2(t) that satisfies (A.87) and (A.88) uniquely exists. This proves the case for t ∈ [1, 2). Proceeding by induction completes the proof.
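Numerically, the variational equation of Proposition A.14 is just a linear time-varying ODE interleaved with linear jumps. The sketch below mirrors (A.86)–(A.88) under the assumption that the user supplies the nominal trajectory and hypothetical Jacobian callables dfdx and dgdx for f and g; it illustrates the propagation order, and is not an implementation from this thesis.

import numpy as np
from scipy.integrate import solve_ivp

def propagate_state_variation(Psi_tau, tau, T, x_nom, u_nom, y_seq, dfdx, dgdx):
    """Propagate Psi(t) per (A.86)-(A.88), with unit-length modes.

    Psi_tau: Psi_1(tau) = f(x_1(tau), v) - f(x_1(tau), u(tau))  (Lemma A.11)
    x_nom, u_nom: callables t -> nominal state / control; x_nom(i) is taken
                  to be the pre-jump mode value x_i(i)
    y_seq: observations (y_1, ..., y_T); y_seq[i-1] closes mode i
    dfdx(x, u), dgdx(x, y): Jacobians of f and g with respect to x
    """
    def rhs(t, Psi):
        # Flow (A.88): Psi_dot = (d f / d x)(x(t), u(t)) Psi.
        return dfdx(x_nom(t), u_nom(t)) @ Psi

    Psi, t0 = np.asarray(Psi_tau, dtype=float), tau
    for i in range(1, T + 1):
        Psi = solve_ivp(rhs, (t0, float(i)), Psi).y[:, -1]    # Psi_i(i)
        # Jump (A.87); at i = T this yields the terminal variation Psi(T).
        Psi = dgdx(x_nom(float(i)), y_seq[i - 1]) @ Psi
        t0 = float(i)
    return Psi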

So far we have focused entirely on the right derivative (∂^+/∂ε) x^ε(t) evaluated at ε = 0. The next proposition shows that x^ε_1 is in fact right differentiable with respect to ε at all ε ∈ [0, τ).

Proposition A.15 (Right Differentiability of State Perturbation). Let x^ε(t) be the perturbed state trajectory under the perturbed control u^ε defined by (A.38). Let Ψ^ε(t) denote the right derivative (∂^+/∂ε) x^ε(t) evaluated at a particular ε ∈ [0, τ). (Note that when ε = 0 we have Ψ^ε = Ψ.) Then, Ψ^ε(t) exists for t ∈ [τ − ε, T] and follows the hybrid system with time-driven switching:

Ψ^ε(t) = Ψ^ε_1(t) ∀t ∈ [τ − ε, 1),  Ψ^ε_i(t) ∀t ∈ [i − 1, i) ∀i ∈ {2, …, T}   (A.97)

with Ψ^ε(T) = (∂/∂x^ε_T) g(x^ε_T(T), y_T) Ψ^ε_T(T). Ψ^ε_1 is defined on [τ − ε, 1], where

Ψ^ε_1(τ − ε) = f(x^ε_1(τ − ε), v) − f(x^ε_1(τ − ε), u^ε(τ − ε))   (A.98)

and

Ψ̇^ε_1(t) = (∂/∂x^ε_1) f(x^ε_1(t), u^ε(t)) Ψ^ε_1(t) ∀t ∈ [τ − ε, 1].   (A.99)

Ψ^ε_i for i ≥ 2 is defined on [i − 1, i] with

Ψ^ε_i(i − 1) = (∂/∂x^ε_{i−1}) g(x^ε_{i−1}(i − 1), y_{i−1}) Ψ^ε_{i−1}(i − 1)   (A.100)
Ψ̇^ε_i(t) = (∂/∂x^ε_i) f(x^ε_i(t), u^ε(t)) Ψ^ε_i(t) ∀t ∈ [i − 1, i].   (A.101)

Proof. The proposition can be proven by considering the perturbed control u^ε as the new nominal control, and defining a further perturbation based on this nominal control. Formally, define u^{ε′} by

u^{ε′}(t) ≜ v if t ∈ (τ − ε − ε′, τ − ε],  u^ε(t) otherwise   (A.102)

with ε′ ∈ [0, τ − ε), where (τ, v) is the same pair of values as for u^ε(t). Since u^ε(t) is left continuous in t at t = τ − ε, u^{ε′} is a valid perturbed control of the form (A.38) with u^ε being the nominal control.


Here ε is considered as fixed and the parameters defining this new perturbation are v, τ − ε, and ε′. This perturbation yields the new perturbed state trajectory x^{ε′} based on the nominal trajectory x^ε. Note that when ε′ = 0 we have u^{ε′} = u^ε and x^{ε′} = x^ε. We can define the new state variation:

Ψ̄(t) = (∂^+/∂ε′) x^{ε′}(t) |_{ε′=0} ≜ lim_{ε′→0+} (x^{ε′}(t) − x^ε(t))/ε′.   (A.103)

Applying Proposition A.14 to this new setting, we find that Ψ̄(t) exists for t ∈ [τ − ε, T] and follows the hybrid system with time-driven switching:

Ψ̄(t) = Ψ̄_1(t) ∀t ∈ [τ − ε, 1),  Ψ̄_i(t) ∀t ∈ [i − 1, i) ∀i ∈ {2, …, T}   (A.104)

with Ψ̄(T) = (∂/∂x^ε_T) g(x^ε_T(T), y_T) Ψ̄_T(T). Ψ̄_1 is defined on [τ − ε, 1], where

Ψ̄_1(τ − ε) = f(x^ε_1(τ − ε), v) − f(x^ε_1(τ − ε), u^ε(τ − ε))   (A.105)

and

Ψ̄̇_1(t) = (∂/∂x^ε_1) f(x^ε_1(t), u^ε(t)) Ψ̄_1(t) ∀t ∈ [τ − ε, 1].   (A.106)

Ψ̄_i for i ≥ 2 is defined on [i − 1, i] with

Ψ̄_i(i − 1) = (∂/∂x^ε_{i−1}) g(x^ε_{i−1}(i − 1), y_{i−1}) Ψ̄_{i−1}(i − 1)   (A.107)
Ψ̄̇_i(t) = (∂/∂x^ε_i) f(x^ε_i(t), u^ε(t)) Ψ̄_i(t) ∀t ∈ [i − 1, i].   (A.108)

On the other hand, notice that the new perturbed control u^{ε′} is actually equivalent to the perturbed control u^{ε+ε′} that is based on the original nominal control u. Namely,

u^{ε′}(t) = u^{ε+ε′}(t) ≜ v if t ∈ (τ − (ε + ε′), τ],  u(t) otherwise.   (A.109)


Consequently, the new perturbed state x^{ε′} is equal to x^{ε+ε′}, and thus

Ψ̄(t) = lim_{ε′→0+} (x^{ε′}(t) − x^ε(t))/ε′   (A.110)
= lim_{ε′→0+} (x^{ε+ε′}(t) − x^ε(t))/ε′   (A.111)
= Ψ^ε(t).   (A.112)

This completes the proof.

Lemma A.16. Let Ψ^ε_1, …, Ψ^ε_T be as given by Proposition A.15. Then, for Ψ^ε_1 the following holds:

∀t ∈ [τ − ε, 1]  ‖Ψ^ε_1(t)‖_2 ≤ 2K_1 ρ_max e^{K_1}.   (A.113)

Similarly, for all i ∈ {2, …, T} we have

∀t ∈ [i − 1, i]  ‖Ψ^ε_i(t)‖_2 ≤ ‖Ψ^ε_i(i − 1)‖_2 e^{K_1}.   (A.114)

Proof. We begin with the integral equation:

Ψ^ε_1(a) = Ψ^ε_1(τ − ε) + ∫_{τ−ε}^a (∂/∂x^ε_1) f(x^ε_1(t), u^ε(t)) Ψ^ε_1(t) dt.   (A.115)

Therefore,

‖Ψ^ε_1(a)‖_2 ≤ ‖Ψ^ε_1(τ − ε)‖_2 + ∫_{τ−ε}^a ‖(∂/∂x^ε_1) f(x^ε_1(t), u^ε(t))‖_2 · ‖Ψ^ε_1(t)‖_2 dt.   (A.116)

Using

Ψ^ε_1(τ − ε) = f(x^ε_1(τ − ε), v) − f(x^ε_1(τ − ε), u^ε(τ − ε)),   (A.117)

Assumption (3.2c), and Lemma A.12, we get

‖Ψ^ε_1(a)‖_2 ≤ K_1‖v − u^ε(τ − ε)‖_2 + K_1 ∫_{τ−ε}^a ‖Ψ^ε_1(t)‖_2 dt   (A.118)
≤ K_1 sup_{u ∈ B(0, ρ_max)} ‖v − u‖_2 + K_1 ∫_{τ−ε}^a ‖Ψ^ε_1(t)‖_2 dt   (A.119)
≤ 2K_1 ρ_max + K_1 ∫_{τ−ε}^a ‖Ψ^ε_1(t)‖_2 dt   (A.120)

for all a ∈ [τ − ε, 1]. Thus, by the Bellman-Gronwall Lemma (Lemma 5.6.4 in [43]) it follows that

∀t ∈ [τ − ε, 1]  ‖Ψ^ε_1(t)‖_2 ≤ 2K_1 ρ_max e^{K_1}.   (A.121)

For general i ≥ 2, apply the Bellman-Gronwall Lemma to the similar integral inequality:

∀a ∈ [i − 1, i]  ‖Ψ^ε_i(a)‖_2 ≤ ‖Ψ^ε_i(i − 1)‖_2 + K_1 ∫_{i−1}^a ‖Ψ^ε_i(t)‖_2 dt   (A.122)

to get the result.

Proposition A.17 (Bounded State Variation). Given u^ε and (y_1, …, y_T), Ψ^ε defined in Proposition A.15 has the following bound:

∀ε ∈ [0, τ) ∀i ∈ {2, …, T} ∀t ∈ [i − 1, i]
‖Ψ^ε_i(t)‖_2 ≤ ∑_{(j_1,…,j_{i−1}) ∈ L_i} β_i^{(j_1,…,j_{i−1})}(x_0) ∏_{m=1}^{i−1} ‖y_m‖_2^{j_m},   (A.123)

where L_i is a finite set of sequences of non-negative integers of length i − 1, and β_i^{(j_1,…,j_{i−1})}(x_0) is a finite positive constant that depends on x_0 and (j_1, …, j_{i−1}) but not on ε, u^ε, or (y_1, …, y_T).

Proof. The proof of this proposition is similar to that of Proposition A.7. Take any ε ∈ [0, τ). For i = 2, we have ∀t ∈ [1, 2]

‖Ψ^ε_2(t)‖_2 ≤ ‖Ψ^ε_2(1)‖_2 e^{K_1}   (A.124)
≤ ‖(∂/∂x^ε_1) g(x^ε_1(1), y_1) Ψ^ε_1(1)‖_2 e^{K_1}   (A.125)
≤ ‖(∂/∂x^ε_1) g(x^ε_1(1), y_1)‖_2 · ‖Ψ^ε_1(1)‖_2 e^{K_1}   (A.126)
≤ 2K_1 ρ_max e^{2K_1} (K_2 + K_4‖y_1‖_2^{L_2} + (K_3 + K_5‖y_1‖_2^{L_2})‖x^ε_1(1)‖_2^{L_1})   (A.127)

by Assumption (3.2d), Proposition A.15, and Lemma A.16. Using Proposition A.7, we can bound x^ε_1(1) by

‖x^ε_1(1)‖_2 ≤ ∑_{(j_1) ∈ K_2} α_2^{(j_1)}(x_0)‖y_1‖_2^{j_1}.   (A.128)

Substituting (A.128) into (A.127) and using the multinomial theorem, one can verify that

∀t ∈ [1, 2]  ‖Ψ^ε_2(t)‖_2 ≤ ∑_{(j_1) ∈ L_2} β_2^{(j_1)}(x_0)‖y_1‖_2^{j_1}   (A.129)

for some finite set L_2 and finite β_2^{(j_1)}(x_0).

Next, suppose that the claim holds for some i ≥ 2. That is,

∀t ∈ [i − 1, i]  ‖Ψ^ε_i(t)‖_2 ≤ ∑_{(j_1,…,j_{i−1}) ∈ L_i} β_i^{(j_1,…,j_{i−1})}(x_0) ∏_{m=1}^{i−1} ‖y_m‖_2^{j_m},   (A.130)

where L_i and β_i^{(j_1,…,j_{i−1})}(x_0) are as defined in the statement of the proposition. Similar to the case for i = 2, we have ∀t ∈ [i, i + 1]

‖Ψ^ε_{i+1}(t)‖_2 ≤ e^{K_1} (∑_{(j_1,…,j_{i−1}) ∈ L_i} β_i^{(j_1,…,j_{i−1})}(x_0) ∏_{m=1}^{i−1} ‖y_m‖_2^{j_m}) × (K_2 + K_4‖y_i‖_2^{L_2} + (K_3 + K_5‖y_i‖_2^{L_2})‖x^ε_i(i)‖_2^{L_1}).   (A.131)

Proposition A.7 gives the following bound:

‖x^ε_i(i)‖_2 ≤ ∑_{(j_1,…,j_{i−1}) ∈ K_i} α_i^{(j_1,…,j_{i−1})}(x_0) ∏_{m=1}^{i−1} ‖y_m‖_2^{j_m}.   (A.132)

Substituting (A.132) into (A.131) and using the multinomial theorem, we conclude that

∀t ∈ [i, i + 1]  ‖Ψ^ε_{i+1}(t)‖_2 ≤ ∑_{(j_1,…,j_i) ∈ L_{i+1}} β_{i+1}^{(j_1,…,j_i)}(x_0) ∏_{m=1}^{i} ‖y_m‖_2^{j_m}   (A.133)

for some finite set L_{i+1} and finite β_{i+1}^{(j_1,…,j_i)}(x_0).

Finally, proceeding by mathematical induction over i ∈ {2, …, T} completes the proof.

Lemma A.18. Let u^ε and x^ε_1 be as in Lemma A.9. Then, similarly to Lemma A.10 we have

lim_{ε→0+} (1/ε) ∫_{τ−ε}^{τ} (c(x^ε_1(t), u^ε(t)) − c(x_1(t), u(t))) dt = c(x_1(τ), v) − c(x_1(τ), u(τ)).   (A.134)

Proof. As u^ε(t) = v ∀t ∈ (τ − ε, τ], we will show that

lim_{ε→0+} (1/ε) ∫_{τ−ε}^{τ} (c(x^ε_1(t), v) − c(x_1(t), u(t))) dt = c(x_1(τ), v) − c(x_1(τ), u(τ)).   (A.135)

By Assumption 3.3 and the continuity of x^ε_1(t), c(x^ε_1(t), v) is continuous with respect to t on [τ − ε, τ]. Thus, the mean value theorem yields

(1/ε) ∫_{τ−ε}^{τ} c(x^ε_1(t), v) dt = c(x^ε_1(t̄), v)   (A.136)

for some t̄ ∈ [τ − ε, τ]. From the triangle inequality and Lemma A.9 it follows that

‖x^ε_1(t̄) − x_1(τ)‖_2 ≤ ‖x^ε_1(t̄) − x_1(t̄)‖_2 + ‖x_1(t̄) − x_1(τ)‖_2   (A.137)
≤ L′ε + ‖x_1(t̄) − x_1(τ)‖_2.   (A.138)

Therefore, lim_{ε→0+} x^ε_1(t̄) = x_1(τ) and

lim_{ε→0+} (1/ε) ∫_{τ−ε}^{τ} c(x^ε_1(t), v) dt = c(x_1(τ), v).   (A.139)

On the other hand, u(t) is continuous on [τ − ε, τ] for all sufficiently small ε, since u is left continuous at τ by Definition A.1. Therefore, c(x_1(t), u(t)) is continuous with respect to t on [τ − ε, τ] for ε small. The mean value theorem gives

(1/ε) ∫_{τ−ε}^{τ} c(x_1(t), u(t)) dt = c(x_1(t̄), u(t̄))   (A.140)

for some t̄ ∈ [τ − ε, τ]. Taking the limit ε → 0+, the right hand side converges to c(x_1(τ), u(τ)). Combining this result with (A.139), we conclude that

lim_{ε→0+} (1/ε) ∫_{τ−ε}^{τ} (c(x^ε_1(t), v) − c(x_1(t), u(t))) dt = c(x_1(τ), v) − c(x_1(τ), u(τ)).   (A.141)

Lemma A.19. Given ε ∈ [0, τ), u^ε and (y_1, …, y_T), the right derivative of the instantaneous cost function with respect to ε is given by

∀t ∈ [i − 1, i]  (∂^+/∂ε) c(x^ε_i(t), u^ε(t)) = (∂/∂x^ε_i) c(x^ε_i(t), u(t))^T Ψ^ε_i(t)   (A.142)

for each i ∈ {2, …, T}. For i = 1 we have

∀t ∈ (τ, 1]  (∂^+/∂ε) c(x^ε_1(t), u^ε(t)) = (∂/∂x^ε_1) c(x^ε_1(t), u(t))^T Ψ^ε_1(t).   (A.143)

Similarly, the right derivative of the terminal cost function with respect to ε is given by

(∂^+/∂ε) h(x^ε(T)) = (∂/∂x^ε) h(x^ε(T))^T Ψ^ε(T).   (A.144)

Proof. To prove the claim for the instantaneous cost, note that uε(t) = u(t) for all t ∈ (τ, T ] and

use the chain rule. The case for the terminal cost also follows from the chain rule.


Proposition A.20 (Bounded Cost Variations). Given ε ∈ [0, τ), u^ε and (y_1, …, y_T), the right derivative of the instantaneous cost function with respect to ε has the following uniform bound:

∀t ∈ [i − 1, i]  |(∂^+/∂ε) c(x^ε_i(t), u^ε(t))| ≤ ∑_{(j_1,…,j_{i−1}) ∈ L′_i} β′_i^{(j_1,…,j_{i−1})}(x_0) ∏_{m=1}^{i−1} ‖y_m‖_2^{j_m}   (A.145)

for each i ∈ {2, …, T}, where L′_i is a finite set of sequences of non-negative integers of length i − 1, and β′_i^{(j_1,…,j_{i−1})}(x_0) is a finite positive constant that depends on x_0 and (j_1, …, j_{i−1}) but not on ε, u^ε, or (y_1, …, y_T).

For i = 1 the bound is given by

∀t ∈ (τ, 1]  |(∂^+/∂ε) c(x^ε_1(t), u^ε(t))| ≤ β′_1(x_0)   (A.146)

for some finite positive constant β′_1(x_0).

Similarly, the right derivative of the terminal cost function with respect to ε has the following bound:

|(∂^+/∂ε) h(x^ε(T))| ≤ ∑_{(j_1,…,j_T) ∈ L′_{T+1}} β′_{T+1}^{(j_1,…,j_T)}(x_0) ∏_{m=1}^{T} ‖y_m‖_2^{j_m}   (A.147)

for some finite set L′_{T+1} of sequences of non-negative integers as well as finite positive constants β′_{T+1}^{(j_1,…,j_T)}(x_0).

Proof. The proof of this proposition is similar to that of Proposition A.8. Take any ε ∈ [0, τ). For i ∈ {2, …, T}, Assumption 3.3 along with Lemma A.19 yields

|(∂^+/∂ε) c(x^ε_i(t), u^ε(t))| = |(∂/∂x^ε_i) c(x^ε_i(t), u(t))^T Ψ^ε_i(t)|   (A.148)
≤ ‖(∂/∂x^ε_i) c(x^ε_i(t), u(t))‖_2 · ‖Ψ^ε_i(t)‖_2   (A.149)
≤ (K_6 + K_7‖x^ε_i(t)‖_2^{L_3}) ‖Ψ^ε_i(t)‖_2.   (A.150)

By Propositions A.7 and A.17, we have

‖x^ε_i(t)‖_2 ≤ ∑_{(j_1,…,j_{i−1}) ∈ K_i} α_i^{(j_1,…,j_{i−1})}(x_0) ∏_{m=1}^{i−1} ‖y_m‖_2^{j_m}   (A.151)
‖Ψ^ε_i(t)‖_2 ≤ ∑_{(j_1,…,j_{i−1}) ∈ L_i} β_i^{(j_1,…,j_{i−1})}(x_0) ∏_{m=1}^{i−1} ‖y_m‖_2^{j_m}   (A.152)

for all t ∈ [i − 1, i]. Substituting these into (A.150) and using the multinomial expansion formula,


we conclude that

∀i ∈ {2, …, T} ∀t ∈ [i − 1, i]  |(∂^+/∂ε) c(x^ε_i(t), u^ε(t))| ≤ ∑_{(j_1,…,j_{i−1}) ∈ L′_i} β′_i^{(j_1,…,j_{i−1})}(x_0) ∏_{m=1}^{i−1} ‖y_m‖_2^{j_m}   (A.153)

for some finite set L′_i of sequences of non-negative integers as well as finite positive constants β′_i^{(j_1,…,j_{i−1})}(x_0). Similarly, for i = 1 we have ∀t ∈ (τ, 1]

|(∂^+/∂ε) c(x^ε_1(t), u^ε(t))| ≤ (K_6 + K_7‖x^ε_1(t)‖_2^{L_3})‖Ψ^ε_1(t)‖_2   (A.154)
≤ 2(K_6 + K_7 α_1(x_0)^{L_3}) K_1 ρ_max e^{K_1}   (A.155)
≜ β′_1(x_0),   (A.156)

by Proposition A.7 and Lemma A.16.

To bound the right derivative of the terminal cost, note that

|(∂^+/∂ε) h(x^ε(T))| = |(∂/∂x^ε) h(x^ε(T))^T Ψ^ε(T)|   (A.157)
≤ (K_6 + K_7‖x^ε(T)‖_2^{L_3})‖Ψ^ε(T)‖_2   (A.158)
= (K_6 + K_7‖g(x^ε_T(T), y_T)‖_2^{L_3})‖Ψ^ε(T)‖_2   (A.159)
≤ (K_6 + K_7‖g(x^ε_T(T), y_T)‖_2^{L_3}) × ‖(∂/∂x^ε_T) g(x^ε_T(T), y_T)‖_2 · ‖Ψ^ε_T(T)‖_2   (A.160)

by Assumption 3.3, Proposition A.15, and Lemma A.19. One can apply Assumption (3.2d) to bound the norms of g and its Jacobian in terms of ‖x^ε_T(T)‖_2 and ‖y_T‖_2. Then, (A.160) becomes a polynomial of ‖x^ε_T(T)‖_2 and ‖y_T‖_2, multiplied by ‖Ψ^ε_T(T)‖_2. Finally, using (A.151) and (A.152) with i = T to replace ‖x^ε_T(T)‖_2 and ‖Ψ^ε_T(T)‖_2, one can verify that

|(∂^+/∂ε) h(x^ε(T))| ≤ ∑_{(j_1,…,j_T) ∈ L′_{T+1}} β′_{T+1}^{(j_1,…,j_T)}(x_0) ∏_{m=1}^{T} ‖y_m‖_2^{j_m}   (A.161)

for some finite set L′_{T+1} of sequences of non-negative integers as well as finite positive constants β′_{T+1}^{(j_1,…,j_T)}(x_0).

Lemma A.21. Let x^ε be the perturbed state induced by the perturbed control u^ε, and let (y_1, …, y_T) be the given observations. Then, the function ε ↦ c(x^ε(t), u^ε(t)) is continuous with respect to ε ∈ [0, τ] for all t ∈ (τ, T].

Proof. Note that for t ∈ (τ, T] we have u^ε(t) = u(t). Thus, c(x^ε(t), u^ε(t)) = c(x^ε(t), u(t)). The continuity of x^ε_1(t) with respect to ε ∈ [0, τ] follows from Lemma A.9. In particular, x^ε_1(1) is continuous with respect to ε. Next, suppose that ε ↦ x^ε_i(i) is continuous for some i ∈ {1, …, T}. Then, by Assumption (3.2b) and Corollary A.5 it follows that ε ↦ x^ε_{i+1}(t) is continuous for all t ∈ [i, i + 1]. Proceeding by mathematical induction, we conclude that x^ε(t) is continuous with respect to ε ∈ [0, τ] for all t ∈ (τ, T]. Therefore, ε ↦ c(x^ε(t), u(t)) is continuous by Assumption 3.3.

Proposition A.22. Let u ∈ U be a control, which yields the nominal state x. Let x^ε be the perturbed state induced by the perturbed control u^ε, and let (y_1, …, y_T) be the given observations. Then, the following bounds hold for all ε ∈ [0, τ]:

∀t ∈ (τ, 1]  |c(x^ε_1(t), u^ε(t)) − c(x_1(t), u(t))| ≤ εβ′_1(x_0)   (A.162)

and

∀i ∈ {2, …, T} ∀t ∈ [i − 1, i]
|c(x^ε_i(t), u^ε(t)) − c(x_i(t), u(t))| ≤ ε ∑_{(j_1,…,j_{i−1}) ∈ L′_i} β′_i^{(j_1,…,j_{i−1})}(x_0) ∏_{m=1}^{i−1} ‖y_m‖_2^{j_m},   (A.163)

where β′_1(x_0), β′_i^{(j_1,…,j_{i−1})}(x_0) and L′_i are as defined in Proposition A.20.

Proof. For ε ∈ [0, τ] and t ∈ (τ, T], the function ε ↦ c(x^ε(t), u^ε(t)) is continuous by Lemma A.21 and finite by Proposition A.8. It is also right differentiable with respect to ε for ε ∈ [0, τ) and t ∈ (τ, T] by Lemma A.19. Therefore, the mean value theorem (Corollary in [17], p. 15) along with Proposition A.20 proves the claim.

Lemma A.23. Let x^ε be the perturbed state induced by the perturbed control u^ε, and let (y_1, …, y_T) be the given observations. Then, the function ε ↦ h(x^ε(T)) is continuous with respect to ε ∈ [0, τ].

Proof. By the proof of Lemma A.21 it follows that the function ε ↦ x^ε(T) is continuous with respect to ε ∈ [0, τ]. The continuity of h by Assumption 3.3 completes the proof.

Proposition A.24. Let u ∈ U be a control, which yields the nominal state x. Let x^ε be the perturbed state induced by the perturbed control u^ε, and let (y_1, …, y_T) be the given observations. Then, the following bound holds for all ε ∈ [0, τ]:

|h(x^ε(T)) − h(x(T))| ≤ ε ∑_{(j_1,…,j_T) ∈ L′_{T+1}} β′_{T+1}^{(j_1,…,j_T)}(x_0) ∏_{m=1}^{T} ‖y_m‖_2^{j_m},   (A.164)

where β′_{T+1}^{(j_1,…,j_T)}(x_0) and L′_{T+1} are as defined in Proposition A.20.


Proof. The proof is very similar to that of Proposition A.22. Use Proposition A.8, Lemma A.23, and Lemma A.19 to show finiteness, continuity, and right differentiability of ε ↦ h(x^ε(T)). Then use the same mean value theorem with Proposition A.20 to prove the claim.

A.3 Expected Total Cost under Stochastic Observations

In this last part of the analysis, we finally let the observations (y1, . . . , yT ) take random values;

more formally, we treat them as a sequence of random variables (Y1(ω), . . . , YT (ω)) for ω ∈ Ω, where

(Ω,F ,P) is the probability space and each Yi satisfies Assumption 3.4. With (τ, v) given and fixed,

(ω, t) and ε uniquely determine the perturbed control uε and the observations, hence the resulting

state trajectory xε and the costs c(xε(t), uε(t)), h(xε(T )).

Lemma A.25. Let ([τ, T], B([τ, T]), λ) be a measure space, where B([τ, T]) is the Borel σ-algebra on [τ, T] and λ is the Lebesgue measure. Let µ ≜ λ × P be the product measure defined on the product space (Ω × [τ, T], F ⊗ B([τ, T])), where F ⊗ B([τ, T]) is the product σ-algebra. Then, the function (ω, t) ↦ c(x^ε(t), u^ε(t)) is F ⊗ B([τ, T])-measurable for every ε ∈ [0, τ].

Proof. Take any ε ∈ [0, τ]. Then, u^ε is in U and thus the function (Y_1(ω), …, Y_T(ω)) ↦ x^ε(t) is continuous for every t ∈ [τ, T] by Proposition A.6. Therefore, the map (ω, t) ↦ x^ε(t) as a function of ω is F-measurable for every t ∈ [τ, T]. By Corollary A.2, x^ε(t) is also right continuous with respect to t for every ω ∈ Ω. Therefore, from Theorem 3 in [54] it follows that (ω, t) ↦ x^ε(t) is measurable with respect to the product σ-algebra F ⊗ B([τ, T]).

On the other hand, u^ε(t) is piecewise continuous in t and is constant with respect to the sequence of random variables (Y_1(ω), …, Y_T(ω)). Therefore, (ω, t) ↦ u^ε(t) is also measurable with respect to F ⊗ B([τ, T]).

Finally, the continuity of the instantaneous cost c by Assumption 3.3 proves the claim.

Proposition A.26. For the perturbed control u^ε and the perturbed state x^ε, we have

(∂^+/∂ε) E[∫_τ^T c(x^ε(t), u^ε(t)) dt] |_{ε=0} = E[∫_τ^T (∂/∂x) c(x(t), u(t))^T Ψ(t) dt],   (A.165)

where Ψ is the state variation defined in Proposition A.14.

Proof. By definition, the left hand side of (A.165) is

(∂^+/∂ε) E[∫_τ^T c(x^ε(t), u^ε(t)) dt] |_{ε=0} = lim_{ε→0+} E[∫_τ^T (1/ε)(c(x^ε(t), u^ε(t)) − c(x(t), u(t))) dt].   (A.166)


Consider the expected value above as the equivalent Lebesgue integral:

∫_Ω (∫_{[τ,T]} (1/ε)(c(x^ε(t), u^ε(t)) − c(x(t), u(t))) dλ(t)) dP(ω).   (A.167)

By Lemma A.25 the integrand is measurable in the product space. In addition, Proposition A.22 shows that the absolute value

(1/ε)|c(x^ε(t), u^ε(t)) − c(x(t), u(t))|   (A.168)

is bounded by an integrable function for µ-a.e. (ω, t). Indeed, if we let c̄(ω, t) be the function defined by

c̄(ω, t) = β′_1(x_0) ∀t ∈ [τ, 1),  ∑_{(j_1,…,j_{i−1}) ∈ L′_i} β′_i^{(j_1,…,j_{i−1})}(x_0) ∏_{m=1}^{i−1} ‖Y_m‖_2^{j_m} ∀t ∈ [i − 1, i) ∀i ∈ {2, …, T},   (A.169)

then we have

(1/ε)|c(x^ε(t), u^ε(t)) − c(x(t), u(t))| ≤ c̄(ω, t)   (A.170)

for every non-zero ε and µ-a.e. (ω, t), and

∫_Ω (∫_{[τ,T]} c̄(ω, t) dλ(t)) dP(ω) = β′_1(x_0)(1 − τ) + ∑_{i=2}^{T} ∑_{(j_1,…,j_{i−1}) ∈ L′_i} β′_i^{(j_1,…,j_{i−1})}(x_0) E[∏_{m=1}^{i−1} ‖Y_m‖_2^{j_m}],   (A.171)

where

E[∏_{m=1}^{i−1} ‖Y_m‖_2^{j_m}] ≤ (E[‖Y_1‖_2^{2j_1}])^{1/2} (E[∏_{m=2}^{i−1} ‖Y_m‖_2^{2j_m}])^{1/2}   (A.172)
≤ … ≤ ∏_{m=1}^{i−1} (E[‖Y_m‖_2^{2^m j_m}])^{1/2^m} < ∞   (A.173)

by the Cauchy-Schwarz inequality and Assumption 3.4.


Furthermore, Lemma A.19 proves that

lim_{ε→0+} (1/ε)(c(x^ε(t), u^ε(t)) − c(x(t), u(t))) = (∂/∂x) c(x(t), u(t))^T Ψ(t)   (A.174)

for µ-a.e. (ω, t). Therefore, the dominated convergence theorem yields

(∂^+/∂ε) ∫_Ω (∫_τ^T c(x^ε(t), u^ε(t)) dλ(t)) dP(ω) |_{ε=0} = ∫_Ω (∫_τ^T (∂/∂x) c(x(t), u(t))^T Ψ(t) dλ(t)) dP(ω).   (A.175)

Proposition A.27. For the perturbed control u^ε and the perturbed state x^ε, we have

(∂^+/∂ε) E[h(x^ε(T))] |_{ε=0} = E[(∂/∂x) h(x(T))^T Ψ(T)],   (A.176)

where Ψ is the state variation defined in Proposition A.14.

Proof. The proof is similar to that of Proposition A.26. By Assumption 3.3 and Proposition A.6, we have that (Y_1(ω), …, Y_T(ω)) ↦ h(x^ε(T)) is a continuous map for every ε ∈ [0, τ]. Therefore, the function ω ↦ h(x^ε(T)) is F-measurable.

By definition, the left hand side of (A.176) is

(∂^+/∂ε) E[h(x^ε(T))] |_{ε=0} = lim_{ε→0+} ∫_Ω (1/ε)(h(x^ε(T)) − h(x(T))) dP(ω).   (A.177)

The analysis above implies that the integrand is F-measurable. In addition, Proposition A.24 shows that the absolute value

(1/ε)|h(x^ε(T)) − h(x(T))|   (A.178)

is bounded by an integrable function for every non-zero ε and every ω ∈ Ω. Furthermore, Lemma A.19 proves that

lim_{ε→0+} (1/ε)(h(x^ε(T)) − h(x(T))) = (∂/∂x) h(x(T))^T Ψ(T)   (A.179)

for every ω ∈ Ω. Therefore, the dominated convergence theorem yields

(∂^+/∂ε) ∫_Ω h(x^ε(T)) dP(ω) |_{ε=0} = ∫_Ω (∂/∂x) h(x(T))^T Ψ(T) dP(ω).   (A.180)


Now, we are finally set to show Theorem 3.1. For completeness, let us restate the theorem below

before we prove it.

Theorem A.1 (Mode Insertion Gradient). Suppose that Assumptions 3.1 – 3.4 are satisfied. For a given (τ, v), let u^ε denote the perturbed control of the form (A.38). The perturbed control u^ε and the stochastic observations (Y_1, …, Y_T) result in the stochastic perturbed state trajectory x^ε. For such u^ε and x^ε, let us define the mode insertion gradient of the expected total cost as

(∂^+/∂ε) E[∫_0^T c(x^ε(t), u^ε(t)) dt + h(x^ε(T))] |_{ε=0}.   (A.181)

Then, this right derivative exists and we have

(∂^+/∂ε) E[∫_0^T c(x^ε(t), u^ε(t)) dt + h(x^ε(T))] |_{ε=0} = c(x(τ), v) − c(x(τ), u(τ)) + E[∫_τ^T (∂/∂x) c(x(t), u(t))^T Ψ(t) dt + (∂/∂x) h(x(T))^T Ψ(T)],   (A.182)

where Ψ is the state variation defined in Proposition A.14.

Proof. We first consider the instantaneous cost c. Split the integration interval to get

E[∫_0^T c(x^ε(t), u^ε(t)) dt] = E[∫_0^{τ−ε} c(x^ε(t), u^ε(t)) dt] + E[∫_{τ−ε}^{τ} c(x^ε(t), u^ε(t)) dt] + E[∫_τ^T c(x^ε(t), u^ε(t)) dt].   (A.183)

For the first two terms in the sum, recall that the evolution of the state x^ε(t) is not affected by any observations for all t ∈ [0, τ]. Thus,

E[∫_0^{τ−ε} c(x^ε(t), u^ε(t)) dt] = ∫_0^{τ−ε} c(x^ε(t), u^ε(t)) dt   (A.184)
E[∫_{τ−ε}^{τ} c(x^ε(t), u^ε(t)) dt] = ∫_{τ−ε}^{τ} c(x^ε(t), u^ε(t)) dt.   (A.185)

Note that (A.184) is constant with respect to ε, since for all t ∈ [0, τ − ε] we have u^ε(t) = u(t) and x^ε(t) = x(t). On the other hand, for (A.185) we can apply Lemma A.18 to obtain

(∂^+/∂ε) E[∫_{τ−ε}^{τ} c(x^ε(t), u^ε(t)) dt] |_{ε=0} = c(x(τ), v) − c(x(τ), u(τ)).   (A.186)


For the last term, Proposition A.26 gives

(∂^+/∂ε) E[∫_τ^T c(x^ε(t), u^ε(t)) dt] |_{ε=0} = E[∫_τ^T (∂/∂x) c(x(t), u(t))^T Ψ(t) dt].   (A.187)

Finally, for the terminal cost h we have

(∂^+/∂ε) E[h(x^ε(T))] |_{ε=0} = E[(∂/∂x) h(x(T))^T Ψ(T)]   (A.188)

by Proposition A.27.
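Theorem A.1 can be sanity-checked by Monte Carlo: for a small ε, the sample average of the difference quotients (J^ε − J)/ε under common sampled observations should approach the right-hand side of (A.182). The sketch below is a hypothetical test harness; total_cost (a rollout computing ∫c dt + h) and sample_observations are placeholders the reader would supply for a concrete system.

import numpy as np

def mode_insertion_gradient_fd(total_cost, sample_observations, u, v, tau,
                               eps=1e-4, n_samples=1000, seed=0):
    """Finite-difference Monte Carlo estimate of the right derivative (A.181).

    total_cost(u, ys):        total cost J for control u and observations ys
    sample_observations(rng): one draw of (y_1, ..., y_T)
    """
    rng = np.random.default_rng(seed)
    # Needle perturbation of (A.38).
    u_eps = lambda t: v if tau - eps < t <= tau else u(t)
    diffs = []
    for _ in range(n_samples):
        ys = sample_observations(rng)
        # Common observations for the nominal and perturbed rollouts.
        diffs.append((total_cost(u_eps, ys) - total_cost(u, ys)) / eps)
    # For small eps this approximates (A.182); the bounds of Propositions
    # A.22 and A.24 keep each summand finite as eps -> 0+.
    return float(np.mean(diffs))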

Remark A.2 (Closed-loop Nominal Policy). As far as the control is concerned, the analysis above only requires that the nominal control u is in U (as in Assumption 3.1) and that the perturbed control u^ε is measurable with respect to F ⊗ B([τ, T]) (as in Lemma A.25). To satisfy these requirements with a closed-loop nominal policy π : R^{n_x} → R^m, it is sufficient that π is a measurable map and that the induced nominal control trajectory u(t) = π(x(t)) for t ∈ [0, T] belongs to U for any given observations (y_1, …, y_T).

Notice that even with state feedback, both the nominal control trajectory u ∈ U and the nominal

state trajectory x are uniquely determined under a fixed sequence of observations. Note also that the

type of the control perturbation considered in this sensitivity analysis is still open-loop:

u^ε(t) = v if t ∈ (τ − ε, τ],  π(x(t)) otherwise,   (A.189)

because we still follow Definition A.1 for the control perturbation model. That is, the nominal

state trajectory x is used in the control feedback. This is not to be confused with the closed-loop

perturbation:

u^ε_closed(t) = v if t ∈ (τ − ε, τ],  π(x^ε(t)) otherwise,   (A.190)

where the perturbed state trajectory xε is fed back to the perturbed control.
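The distinction drawn in Remark A.2 is perhaps easiest to see in code: the open-loop perturbation (A.189) evaluates the policy on the precomputed nominal trajectory, whereas the closed-loop perturbation (A.190) would have to be evaluated inside the simulator, because the perturbed state depends on the control itself. A minimal sketch, with a hypothetical measurable policy pi:

def open_loop_perturbation(pi, x_nominal, v, tau, eps):
    """(A.189): the policy sees the precomputed *nominal* trajectory x(t)."""
    def u_eps(t):
        return v if tau - eps < t <= tau else pi(x_nominal(t))
    return u_eps

def closed_loop_perturbation(pi, v, tau, eps):
    """(A.190): the policy would see the *perturbed* state x^eps(t), so the
    control can only be evaluated online, inside the rollout. This is not
    the perturbation model analyzed in this appendix."""
    def u_eps_closed(t, x_eps_t):
        return v if tau - eps < t <= tau else pi(x_eps_t)
    return u_eps_closed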


Appendix B

Mode Insertion Gradient for Crowd-Robot Interaction

In this appendix, we will provide the rigorous analysis required to prove that the mode insertion gradient

for the entropic risk measure exists and is well-defined for the crowd-robot interaction problem

(Chapter 4, Theorem 4.1). Specifically, we will provide a set of necessary conditions for (4.14), the

Lipschitz continuity of the perturbed cost with respect to the perturbation duration ε, to hold. We

then show how this result can be used to show (4.15), the existence of the mode insertion gradient for

stochastic systems. Note that (4.15) is essentially the same statement as Theorem 3.1 in Chapter 3,

which we spent the entire Appendix A to prove. In this appendix, we show how the same result can

be proven with a substantially simpler analysis, by imposing a set of stricter assumptions than those

introduced in Chapter 3 (Section 3.3). The key to this simplification is (4.14), which not only leads

to the stochastic mode insertion gradient (4.15) but is also the basis of the mode insertion gradient

for the entropic risk measure given in Theorem 4.1, as already discussed in Chapter 4 (Section 4.4).

B.1 Statement of Assumptions

We first formally state the set of underlying assumptions for the system dynamics, the control,

and the cost functions as they appear in Chapter 4. Without loss of generality and analogously to

Chapter 3 (Section 3.3), we assume that the system starts at time t = 0 and ends at t = T , with

a sequence of T stochastic transitions (y1, . . . , yT ) where each joint transition yi is made by the

humans every unit time.

Assumption B.1 (Control Model). The controls are in C^{0,m}[0, T], the space of piecewise continuous functions from [0, T] into R^m. We further assume that there exists some bounded convex set U ⊆ R^m such that for all t ∈ [0, T], we have u(t) ∈ U. We denote this admissible control set by U ≜ {u ∈ C^{0,m}[0, T] | ∀t ∈ [0, T] u(t) ∈ U}.

Assumption B.2 (Dynamics Model). Let N be the number of humans to be considered by the robot. Let p(t) = (p_1(t)^T, …, p_N(t)^T)^T ∈ R^{2N} denote the joint human positions at time t, and x_r(t) ∈ R^n the robot state. The joint state is defined by x(t) ≜ (x_r(t)^T, p(t)^T)^T ∈ R^{n+2N}. Now, let x_0 ∈ R^{n+2N} be the given initial state value at t = 0. Given a control u ∈ U and a sequence of joint human transitions (i.e. changes in their positions) (y_1, …, y_T) ∈ R^{2N} × · · · × R^{2N}, the joint dynamics model is the following hybrid system with time-driven switching:

x(t) ≜ x_i(t) ∀t ∈ [i − 1, i) ∀i ∈ {1, 2, …, T},   (B.1)

where x_i ≜ (x_{i,r}^T, p_i^T)^T is the i-th "mode" of the joint system state defined on [i − 1, i] as:

x_i(i − 1) = g(x_{i−1}(i − 1), y_{i−1})   (B.2)
≜ x_{i−1}(i − 1) + (0^T, y_{i−1}^T)^T   (B.3)
ẋ_i(t) = f(x_i(t), u(t))   (B.4)
≜ ((f_r(x_{i,r}(t)) + H(x_{i,r}(t))u(t))^T, 0^T)^T ∀t ∈ [i − 1, i],   (B.5)

with x(0) = x_1(0) = x_0. We also define the final state as x(T) ≜ g(x_T(T), y_T).

For the robot’s transition function, we assume the following:

(B.2a) the dynamics function for the robot

(x_r(t), u(t)) ↦ f_r(x_r(t)) + H(x_r(t))u(t)   (B.6)

is continuously differentiable.

(B.2b) there exists a constant K_1 ∈ [1, ∞) such that ∀x′_r, x′′_r ∈ R^n and ∀u′, u′′ ∈ U, the following condition holds:

‖f_r(x′_r) − f_r(x′′_r) + H(x′_r)u′ − H(x′′_r)u′′‖_2 ≤ K_1(‖x′_r − x′′_r‖_2 + ‖u′ − u′′‖_2).   (B.7)

Remark B.1. Assumptions (B.2a) and (B.2b) above are used to show existence and uniqueness of

the solution to the differential equation (B.5). Note that Assumption (B.2b) is essentially a Lipschitz

continuity condition, which is quite a common assumption in the analysis of nonlinear ODEs, as

can be seen in Assumption 5.6.2 in [43] and Theorem 2.3 in [77].
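As a concrete (and purely hypothetical) instance of Assumption B.2, take a single-integrator robot, i.e. f_r ≡ 0 and H ≡ I, so that ẋ_r = u, while the humans move only at the unit-time jumps. The sketch below simulates one horizon of the hybrid system (B.1)–(B.5) with forward Euler; none of these modeling choices are prescribed by the thesis.

import numpy as np

def simulate_hybrid(x0_robot, p0_humans, u, y_seq, T, dt=0.01):
    """Simulate (B.1)-(B.5) with f_r = 0, H = I (single-integrator robot)."""
    xr = np.array(x0_robot, dtype=float)
    p = np.array(p0_humans, dtype=float)
    steps = int(round(1.0 / dt))
    for i in range(1, T + 1):
        # Flow (B.4)-(B.5): only the robot state evolves inside mode i.
        for k in range(steps):
            xr = xr + dt * u((i - 1) + k * dt)
        # Jump (B.2)-(B.3): the humans move by y_i at the end of mode i.
        p = p + np.asarray(y_seq[i - 1], dtype=float)
    return np.concatenate([xr, p])      # x(T) = g(x_T(T), y_T)

# Usage: constant control, two humans (N = 2) that happen to stand still.
x_T = simulate_hybrid([0.0, 0.0], [1.0, 1.0, -1.0, 2.0],
                      u=lambda t: np.array([0.1, 0.0]),
                      y_seq=[np.zeros(4)] * 3, T=3)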


Assumption B.3 (Cost Model). The instantaneous cost c : R^{n+2N} × R^m → R is given by the well-known LQ tracking cost plus a collision cost term:

c(x, u) = (1/2)(x_r − r)^T Q (x_r − r) + (1/2) u^T R u + c_col(x),   (B.8)

where Q = Q^T ⪰ 0 and R = R^T ≻ 0 are weight matrices of appropriate dimensions, r : [0, T] → R^n is a given, bounded reference trajectory for the robot, and c_col(·) ≥ 0 is a collision penalty function. This collision penalty is continuously differentiable in x, bounded, and has a bounded gradient with respect to x.

Similarly, the terminal cost h : R^{n+2N} → R is defined by

h(x) = (β/2)(x_r − r)^T Q (x_r − r) + β c_col(x),   (B.9)

where β ≥ 0 determines the relative weight between the terminal and instantaneous costs.
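Transcribing Assumption B.3 is equally direct. The sketch below uses a smooth exponential bump as one admissible (hypothetical) choice of the collision penalty c_col, since it is bounded with a bounded gradient; the weights Q, R, β and the convention that the first two robot states are its planar position are likewise assumptions of the example, not of the thesis.

import numpy as np

def c_col(x, n, alpha=1.0, d0=0.5):
    """One admissible collision penalty: bounded, continuously differentiable,
    with a bounded gradient. Assumes x[:2] is the robot's 2D position."""
    pos, p = x[:2], x[n:]
    penalty = 0.0
    for j in range(0, len(p), 2):       # iterate over the N human positions
        penalty += alpha * np.exp(-np.sum((pos - p[j:j + 2]) ** 2) / d0 ** 2)
    return penalty

def cost_c(x, u, r_t, Q, R, n):
    """Instantaneous cost (B.8): LQ tracking plus collision penalty."""
    e = x[:n] - r_t
    return 0.5 * e @ Q @ e + 0.5 * u @ R @ u + c_col(x, n)

def cost_h(x, r_T, Q, n, beta=1.0):
    """Terminal cost (B.9)."""
    e = x[:n] - r_T
    return beta * (0.5 * e @ Q @ e + c_col(x, n))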

Assumption B.4 (Stochastic Human Transitions). Let (Ω, F, P) be a probability space. Let (Y_1, …, Y_T) be a sequence of random vectors in R^{2N} defined on this space, representing the sequence of 2D position transitions for the humans. Unlike Assumption 3.4, we do not impose any assumption on Y_i; these random vectors may be arbitrarily distributed in R^{2N}.

Definition B.1 (Perturbed Control). Let u ∈ U be a control. For τ ∈ (0, 1) and v ∈ U, define the perturbed control u^ε by

u^ε(t) ≜ v if t ∈ (τ − ε, τ],  u(t) otherwise,   (B.10)

where ε ∈ [0, τ]. By definition, if ε = 0 then u^ε is the same as u. We assume that the nominal control u(t) is left continuous in t at t = τ.

Remark B.2. Definition B.1 is identical to Definition 3.1 (and equivalently, Definition A.1). Although we assumed in Chapter 4 that the perturbation time τ could be virtually any time in the planning horizon, the analysis presented in this appendix only handles the case τ ∈ (0, 1), for simplicity and for consistency with Appendix A. Nevertheless, we expect that the results still hold for the more general case considered in Chapter 4.

Remark B.3. Note that Assumptions B.1 – B.4 are closely related to Assumptions 3.1 – 3.4, under

which we analyzed the mode insertion gradient for general stochastic hybrid systems with time-driven

switching. In fact, Assumptions B.1 – B.3 are sufficient conditions for Assumptions 3.1 – 3.3, as we prove below. These stricter assumptions will eventually let us loosen the assumption on the random vectors (i.e. Assumption B.4), in that the finiteness of the moments appearing in Assumption 3.4 is no longer required to show existence of the mode insertion gradient.


Proposition B.1 (Sufficiency of Assumptions). If a stochastic hybrid system satisfies Assumptions B.1, B.2, and B.3 simultaneously, it also satisfies Assumptions 3.1, 3.2, and 3.3.

Proof. Assumption 3.1 trivially follows from Assumption B.1 and the properties of bounded convex sets. That is, u(t) ∈ U implies that there exists some ρ_max < ∞ such that u(t) ∈ B(0, ρ_max), where B(0, ρ_max) is the closed Euclidean ball of radius ρ_max centered at 0.

For Assumption 3.2, one can easily verify that the continuous differentiability and the Lipschitz continuity of f(x_i, u) follow from (B.2a) and (B.2b) of Assumption B.2. We now show that the function g defined by (B.3) satisfies the remaining (3.2b) and (3.2d) of Assumption 3.2. Namely, we will show the following:

• the function g : R^{n+2N} × R^{2N} → R^{n+2N} is continuous. It is also differentiable in x ∈ R^{n+2N}.

• there exist finite non-negative constants K_2, K_3, K_4, K_5 and positive integers L_1, L_2 such that ∀x ∈ R^{n+2N} and ∀y ∈ R^{2N}, the following relations hold:

‖g(x, y)‖_2 ≤ K_2 + K_3‖x‖_2^{L_1} + K_4‖y‖_2^{L_2} + K_5‖x‖_2^{L_1}‖y‖_2^{L_2}   (B.11)
‖(∂/∂x) g(x, y)‖_2 ≤ K_2 + K_3‖x‖_2^{L_1} + K_4‖y‖_2^{L_2} + K_5‖x‖_2^{L_1}‖y‖_2^{L_2}   (B.12)

The continuity and the differentiability of g are obvious from its definition (B.3). For the norm bounds, simply substitute (B.3) into (B.11) and (B.12) to get

‖g(x, y)‖_2 ≤ ‖x‖_2 + ‖y‖_2   (B.13)
≤ 1 + ‖x‖_2 + ‖y‖_2,   (B.14)

‖(∂/∂x) g(x, y)‖_2 = ‖I_{(n+2N)×(n+2N)}‖_2   (B.15)
≜ sup_{‖v‖_2=1} ‖I_{(n+2N)×(n+2N)} v‖_2   (B.16)
= 1   (B.17)
≤ 1 + ‖x‖_2 + ‖y‖_2.   (B.18)

Thus, K_2 = 1, K_3 = 1, K_4 = 1, K_5 = 0 and L_1 = 1, L_2 = 1 yield the desired bound.

Finally, for Assumption 3.3 one can easily verify that the instantaneous cost c defined by (B.8)

is continuous in (x, u) and continuously differentiable in x. Similarly, the terminal cost h given by

(B.9) is differentiable. Therefore, it remains to show the following:


• there exist finite non-negative constants K_6, K_7 and a positive integer L_3 such that for all x ∈ R^{n+2N} and u ∈ B(0, ρ_max), the following relations hold:

|c(x, u)| ≤ K_6 + K_7‖x‖_2^{L_3}   (B.19)
‖(∂/∂x) c(x, u)‖_2 ≤ K_6 + K_7‖x‖_2^{L_3}   (B.20)
|h(x)| ≤ K_6 + K_7‖x‖_2^{L_3}   (B.21)
‖(∂/∂x) h(x)‖_2 ≤ K_6 + K_7‖x‖_2^{L_3}.   (B.22)

To prove the above, we first note that the robot's trajectory is determined solely by the (deterministic) ODE:

ẋ_r(t) = f_r(x_r(t)) + H(x_r(t))u(t) ∀t ∈ [0, T].   (B.23)

Making use of (B.2a), (B.2b), one can show by Proposition 5.6.5 in [43] that the solution x_r(t) uniquely exists and is bounded for all t ∈ [0, T], where the bound does not depend on u. Thus, the absolute cost functions

|c(x, u)| = |(1/2)(x_r − r)^T Q (x_r − r) + (1/2) u^T R u + c_col(x)|   (B.24)
|h(x)| = |(β/2)(x_r − r)^T Q (x_r − r) + β c_col(x)|   (B.25)

are both bounded, due to the boundedness of x_r, r, u, and the collision penalty c_col. Similarly,

‖(∂/∂x) c(x, u)‖_2 = ‖Q(x_r − r) + (∂/∂x) c_col(x)‖_2   (B.26)
≤ ‖Q‖_2‖x_r − r‖_2 + ‖(∂/∂x) c_col(x)‖_2   (B.27)

is bounded since the gradient of c_col is assumed to be bounded. The same result applies to the gradient of the terminal cost function ‖(∂/∂x) h(x)‖_2. Therefore, there exists a certain K_6 > 0 (with K_7 = 0 and L_3 = 1) such that (B.19) – (B.22) hold.

B.2 Preliminary Results

The fact that Assumptions B.1 – B.3 are sufficient for Assumptions 3.1 – 3.3 is of critical importance,

since it indicates that all the results shown in Appendix A.1 and A.2 hold. Moreover, we can utilize

the specific constants (e.g. K7 = 0, etc.) derived in the proof of Proposition B.1 to make stronger

claims about the properties of the system than shown in Appendix A. In what follows, we provide


preliminary steps that will eventually lead us to the proof of (4.14). Note that all the analysis given

in this section assumes that the sequence of human transitions (y1, . . . , yT ) is (deterministically)

given and fixed, similar to Appendix A.1 and A.2.

Proposition B.2 (Bounded State Variation). Let u^ε be the perturbed control given by Definition B.1. Under a specific sequence of human transitions (y_1, …, y_T), this perturbed control induces the perturbed state trajectory x^ε(t). Let Ψ^ε(t) denote the right derivative (∂^+/∂ε) x^ε(t) evaluated at a particular ε ∈ [0, τ), as defined in Proposition A.15. Recall that Ψ^ε is itself a hybrid system with time-driven switching and is given by a sequence of state variations (Ψ^ε_1, …, Ψ^ε_T), where

Ψ^ε(t) = Ψ^ε_1(t) ∀t ∈ [τ − ε, 1),  Ψ^ε_i(t) ∀t ∈ [i − 1, i) ∀i ∈ {2, …, T}.   (B.28)

For such (Ψ^ε_1, …, Ψ^ε_T), we have the following upper bounds for any ε ∈ [0, τ):

‖Ψ^ε_1(t)‖_2 ≤ 2K_1 ρ_max e^{K_1} ∀t ∈ [τ − ε, 1]   (B.29)
‖Ψ^ε_i(t)‖_2 ≤ 2K_1 ρ_max e^{iK_1} ∀t ∈ [i − 1, i] ∀i ∈ {2, …, T}.   (B.30)

Proof. The case for Ψ^ε_1 follows immediately from Lemma A.16. For general i ∈ {2, …, T}, the proof is similar to that of Proposition A.17. Take any ε ∈ [0, τ). For i = 2, we have ∀t ∈ [1, 2]

‖Ψ^ε_2(t)‖_2 ≤ ‖Ψ^ε_2(1)‖_2 e^{K_1}   (B.31)
≤ ‖(∂/∂x^ε_1) g(x^ε_1(1), y_1) Ψ^ε_1(1)‖_2 e^{K_1}   (B.32)
≤ ‖(∂/∂x^ε_1) g(x^ε_1(1), y_1)‖_2 · ‖Ψ^ε_1(1)‖_2 e^{K_1}   (B.33)
≤ ‖(∂/∂x^ε_1) g(x^ε_1(1), y_1)‖_2 · 2K_1 ρ_max e^{2K_1}   (B.34)

by Proposition A.15 and Lemma A.16. Now, recall from the proof of Proposition B.1 that we have

‖(∂/∂x) g(x, y)‖_2 = 1.   (B.35)

Therefore, the bound simplifies to

‖Ψ^ε_2(t)‖_2 ≤ 2K_1 ρ_max e^{2K_1}.   (B.36)

Proceeding by induction over i ∈ {2, …, T} completes the proof.


Remark B.4. These upper bounds in Proposition B.2 are more strict than those in Proposition

A.17 in that they are not stochastic (since they do not involve y1, . . . , yT ).

Proposition B.3 (Bounded Cost Variations). Given ε ∈ [0, τ), u^ε and (y_1, …, y_T), the right derivative of the instantaneous cost function with respect to ε has the following uniform bound:

∀i ∈ {2, …, T} ∀t ∈ [i − 1, i]  |(∂^+/∂ε) c(x^ε_i(t), u^ε(t))| ≤ 2K_1 K_6 ρ_max e^{iK_1}.   (B.37)

For i = 1 the bound is given by

∀t ∈ (τ, 1]  |(∂^+/∂ε) c(x^ε_1(t), u^ε(t))| ≤ 2K_1 K_6 ρ_max e^{K_1}.   (B.38)

Similarly, the right derivative of the terminal cost function with respect to ε has the following bound:

|(∂^+/∂ε) h(x^ε(T))| ≤ 2K_1 K_6 ρ_max e^{TK_1}.   (B.39)

Proof. This proposition is a stricter version of Proposition A.20. Take any ε ∈ [0, τ). For i ∈ {2, …, T}, Assumption 3.3 along with Lemma A.19 yields

|(∂^+/∂ε) c(x^ε_i(t), u^ε(t))| = |(∂/∂x^ε_i) c(x^ε_i(t), u(t))^T Ψ^ε_i(t)|   (B.40)
≤ ‖(∂/∂x^ε_i) c(x^ε_i(t), u(t))‖_2 · ‖Ψ^ε_i(t)‖_2   (B.41)
≤ (K_6 + K_7‖x^ε_i(t)‖_2^{L_3}) ‖Ψ^ε_i(t)‖_2.   (B.42)

Now, recall that K_7 = 0 from the proof of Proposition B.1, and that ‖Ψ^ε_i(t)‖_2 ≤ 2K_1 ρ_max e^{iK_1} ∀t ∈ [i − 1, i] from Proposition B.2. Substituting these, we obtain

|(∂^+/∂ε) c(x^ε_i(t), u^ε(t))| ≤ 2K_1 K_6 ρ_max e^{iK_1} ∀t ∈ [i − 1, i].   (B.43)

Similarly, for i = 1 we have ∀t ∈ (τ, 1]

|(∂^+/∂ε) c(x^ε_1(t), u^ε(t))| ≤ (K_6 + K_7‖x^ε_1(t)‖_2^{L_3})‖Ψ^ε_1(t)‖_2   (B.44)
≤ 2K_1 K_6 ρ_max e^{K_1}   (B.45)

by Proposition B.2.


To bound the right derivative of the terminal cost, note that

|(∂^+/∂ε) h(x^ε(T))| = |(∂/∂x^ε) h(x^ε(T))^T Ψ^ε(T)|   (B.46)
≤ (K_6 + K_7‖x^ε(T)‖_2^{L_3})‖Ψ^ε(T)‖_2   (B.47)
= (K_6 + K_7‖g(x^ε_T(T), y_T)‖_2^{L_3})‖Ψ^ε(T)‖_2   (B.48)
≤ (K_6 + K_7‖g(x^ε_T(T), y_T)‖_2^{L_3}) × ‖(∂/∂x^ε_T) g(x^ε_T(T), y_T)‖_2 · ‖Ψ^ε_T(T)‖_2   (B.49)

by Assumption 3.3, Proposition A.15, and Lemma A.19. By Propositions B.1 and B.2, this upper bound simplifies to

|(∂^+/∂ε) h(x^ε(T))| ≤ 2K_1 K_6 ρ_max e^{TK_1},   (B.50)

where we also used ‖(∂/∂x) g(x, y)‖_2 = 1.

Proposition B.4. Let u ∈ U be a control, which yields the nominal state x. Let x^ε be the perturbed state induced by the perturbed control u^ε, and let (y_1, …, y_T) be the given human transitions. Then the following bounds hold for all ε ∈ [0, τ]:

∀t ∈ (τ, 1]  |c(x^ε_1(t), u^ε(t)) − c(x_1(t), u(t))| ≤ ε(2K_1 K_6 ρ_max e^{K_1})   (B.51)

and

∀i ∈ {2, …, T} ∀t ∈ [i − 1, i]  |c(x^ε_i(t), u^ε(t)) − c(x_i(t), u(t))| ≤ ε(2K_1 K_6 ρ_max e^{iK_1}).   (B.52)

Proof. The proof of this proposition is similar to that of Proposition A.22. For ε ∈ [0, τ] and t ∈ (τ, T], the function ε ↦ c(x^ε(t), u^ε(t)) is continuous by Lemma A.21 and finite by Proposition A.8. It is also right differentiable with respect to ε for ε ∈ [0, τ) and t ∈ (τ, T] by Lemma A.19. Therefore, the mean value theorem (Corollary in [17], p. 15) along with Proposition B.3 proves the claim.

Proposition B.5. Let u ∈ U be a control, which yields the nominal state x. Let x^ε be the perturbed state induced by the perturbed control u^ε, and let (y_1, …, y_T) be the given human transitions. Then, the following bound holds for all ε ∈ [0, τ]:

|h(x^ε(T)) − h(x(T))| ≤ ε(2K_1 K_6 ρ_max e^{TK_1}).   (B.53)

Proof. The proof is very similar to those of Propositions A.24 and B.4. Use Proposition A.8, Lemma A.23, and Lemma A.19 to show finiteness, continuity, and right differentiability of ε ↦ h(x^ε(T)). Then use the same mean value theorem as in Proposition B.4 along with Proposition B.3 to prove the claim.

B.3 Main Results

We are now set to derive (4.14), a key result that shows the Lipschitz continuity of the perturbed

cost with respect to the perturbation duration ε.

Theorem B.1 (Lipschitz Continuity of Perturbed Cost). Suppose that Assumptions B.1 – B.3 are satisfied. Let u ∈ U be a control, which yields the nominal state x under a specific sequence of human transitions (y_1, …, y_T). Let u^ε be the perturbed control given by Definition B.1. Under the same (y_1, …, y_T), this control perturbation yields the perturbed state x^ε. Let J and J^ε denote the nominal and the perturbed total cost under (y_1, …, y_T), respectively, which are given by

J ≜ ∫_0^T c(x(t), u(t)) dt + h(x(T))   (B.54)
J^ε ≜ ∫_0^T c(x^ε(t), u^ε(t)) dt + h(x^ε(T)).   (B.55)

Then, the following bound holds for all ε ∈ [0, τ]:

|J^ε − J| ≤ εK,   (B.56)

for some fixed, finite positive constant K whose value does not depend on (y_1, …, y_T).

Proof. By the triangle inequality and the integral inequality for absolute values, we obtain

|J^ε − J| = |∫_0^T (c(x^ε(t), u^ε(t)) − c(x(t), u(t))) dt + h(x^ε(T)) − h(x(T))|   (B.57)
≤ ∫_0^T |c(x^ε(t), u^ε(t)) − c(x(t), u(t))| dt + |h(x^ε(T)) − h(x(T))|.   (B.58)

The first term of (B.58) can be decoupled as

∫_0^T |c(x^ε(t), u^ε(t)) − c(x(t), u(t))| dt
= ∫_0^{τ−ε} |c(x^ε_1(t), u^ε(t)) − c(x_1(t), u(t))| dt + ∫_{τ−ε}^{τ} |c(x^ε_1(t), u^ε(t)) − c(x_1(t), u(t))| dt
+ ∫_τ^1 |c(x^ε_1(t), u^ε(t)) − c(x_1(t), u(t))| dt + ∑_{i=2}^{T} ∫_{i−1}^{i} |c(x^ε_i(t), u^ε(t)) − c(x_i(t), u(t))| dt.   (B.59)


In (B.59), the first integral is zero, since u(t) = u^ε(t) ∀t ∈ [0, τ − ε] by Definition B.1 and thus x^ε_1(t) = x(t) ∀t ∈ [0, τ − ε]. For the second integral, we have

∫_{τ−ε}^{τ} |c(x^ε_1(t), u^ε(t)) − c(x_1(t), u(t))| dt ≤ ∫_{τ−ε}^{τ} 2K_6 dt   (B.60)
≤ ε(2K_6),   (B.61)

where we used |c(x, u)| ≤ K_6 < ∞ ∀x ∈ R^{n+2N}, u ∈ B(0, ρ_max), which is obtained in the proof of Proposition B.1. For the third and the fourth integrals, we can immediately apply Proposition B.4 to obtain

∫_τ^1 |c(x^ε_1(t), u^ε(t)) − c(x_1(t), u(t))| dt ≤ ∫_τ^1 ε(2K_1 K_6 ρ_max e^{K_1}) dt   (B.62)
= ε(1 − τ)(2K_1 K_6 ρ_max e^{K_1})   (B.63)

and

∑_{i=2}^{T} ∫_{i−1}^{i} |c(x^ε_i(t), u^ε(t)) − c(x_i(t), u(t))| dt ≤ ∑_{i=2}^{T} ∫_{i−1}^{i} ε(2K_1 K_6 ρ_max e^{iK_1}) dt   (B.64)
= ε (∑_{i=2}^{T} 2K_1 K_6 ρ_max e^{iK_1})   (B.65)
≤ ε(T − 1)(2K_1 K_6 ρ_max e^{TK_1}),   (B.66)

where K_1 ∈ [1, ∞) is as defined in Assumption B.2.

For the second term of (B.58), recall that Proposition B.5 yields

|h(x^ε(T)) − h(x(T))| ≤ ε(2K_1 K_6 ρ_max e^{TK_1}).   (B.67)

Therefore, we conclude that

|J^ε − J| ≤ ε(2K_6 + 2K_1 K_6 ρ_max ((1 − τ)e^{K_1} + T e^{TK_1})).   (B.68)

Define K ≜ 2K_6 + 2K_1 K_6 ρ_max ((1 − τ)e^{K_1} + T e^{TK_1}). This K is a fixed finite constant¹, and is not dependent on (y_1, …, y_T).
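The bound (B.56) can be probed empirically: the difference quotient |J^ε − J|/ε should never exceed the (typically very conservative) constant K, uniformly over the sampled transitions. A minimal check, assuming the reader supplies total_cost, a rollout of (B.54)–(B.55), and a transition sampler — both hypothetical:

import numpy as np

def check_lipschitz_constant(total_cost, sample_transitions, u, v, tau,
                             eps_grid=(1e-1, 1e-2, 1e-3), n_samples=100,
                             seed=0):
    """Return the largest observed |J^eps - J| / eps over random transitions;
    by Theorem B.1 it is at most K = 2K_6 + 2K_1 K_6 rho_max ((1 - tau) e^{K_1}
    + T e^{T K_1}), whatever the transitions are."""
    rng = np.random.default_rng(seed)
    worst = 0.0
    for _ in range(n_samples):
        ys = sample_transitions(rng)
        J = total_cost(u, ys)
        for eps in eps_grid:
            # Needle perturbation of Definition B.1 (eps bound to this loop).
            u_eps = lambda t, e=eps: v if tau - e < t <= tau else u(t)
            worst = max(worst, abs(total_cost(u_eps, ys) - J) / eps)
    return worst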

Finally, we will show (4.15), the existence of the mode insertion gradient for stochastic systems. We assume hereafter that the human transitions (y_1, …, y_T) are no longer fixed, and treat them as random vectors (Y_1, …, Y_T) on the probability space (Ω, F, P) as defined in Assumption B.4.

¹ Note that we treat the perturbation time τ as a fixed constant in this analysis. See Remark A.1.


Lemma B.6. Let (Ω, F, P) be the probability space on which the sequence of random vectors (Y_1(ω), …, Y_T(ω)) is defined². Take any ε ∈ [0, τ], and let J^ε be as in Theorem B.1. Then, the function ω ↦ J^ε is measurable in the probability space.

Proof. We will show that the function ω ↦ ∫_0^T c(x^ε(t), u^ε(t)) dt is measurable, by interpreting the Riemann integral as a Lebesgue integral. By Lemma A.25, we have that for ω ∈ Ω and t ∈ [0, T], (ω, t) ↦ c(x^ε(t), u^ε(t)) is measurable in the corresponding product space. (Note that Lemma A.25 is shown for t ∈ [τ, T], but it is straightforward to verify that the result also holds for t ∈ [0, T].) Furthermore, c is bounded as shown in the proof of Proposition B.1. Thus, c is integrable, and by Fubini's theorem the function ω ↦ ∫_0^T c(x^ε(t), u^ε(t)) dt is well-defined and measurable. Finally, recall that we showed in the proof of Proposition A.27 that the function ω ↦ h(x^ε(T)) is also measurable. Therefore,

ω ↦ J^ε = ∫_0^T c(x^ε(t), u^ε(t)) dt + h(x^ε(T))   (B.69)

is measurable.

Theorem B.2 (Mode Insertion Gradient). Suppose that Assumptions B.1 – B.4 are satisfied. For a given (τ, v), let u^ε denote the perturbed control defined by Definition B.1. The perturbed control u^ε and the random vectors (Y_1, …, Y_T) result in the stochastic, perturbed total cost J^ε. For such J^ε and the nominal total cost J, let us define the mode insertion gradient of the expected total cost as

(∂^+ E[J]/∂ε) |_{ε=0} ≜ lim_{ε→0+} E[(J^ε − J)/ε].   (B.70)

Then, this right derivative exists and we have

(∂^+ E[J]/∂ε) |_{ε=0} = E[(∂^+ J/∂ε) |_{ε=0}],   (B.71)

where

E[(∂^+ J/∂ε) |_{ε=0}] ≜ E[lim_{ε→0+} (J^ε − J)/ε]   (B.72)

is the expected value of the (deterministic) mode insertion gradient whose value is given by (4.11).

² Here we write ω ∈ Ω to clarify that random vectors are measurable (vector-valued) functions of ω, by definition.


Proof. By definition, the left hand side of (B.71) is

lim_{ε→0+} E[(J^ε − J)/ε] = lim_{ε→0+} ∫_Ω (J^ε − J)/ε dP(ω)   (B.73)
= lim_{n→∞} ∫_Ω (J^{ε_n} − J)/ε_n dP(ω),   (B.74)

where {ε_n}_{n∈N} is a sequence in [0, τ] converging to 0 from above as n → ∞. According to Lemma B.6, the integrand

(J^{ε_n} − J)/ε_n   (B.75)

is measurable in (Ω, F, P) for all n ∈ N. (Recall that J is the same as J^ε with ε = 0.) Furthermore, by Theorem B.1 we have

|(J^{ε_n} − J)/ε_n| ≤ K,   (B.76)

where K is some fixed, finite positive constant. Thus, by the dominated convergence theorem,

lim_{n→∞} ∫_Ω (J^{ε_n} − J)/ε_n dP(ω) = ∫_Ω lim_{n→∞} (J^{ε_n} − J)/ε_n dP(ω)   (B.77)
= ∫_Ω lim_{ε→0+} (J^ε − J)/ε dP(ω)   (B.78)
= E[(∂^+ J/∂ε) |_{ε=0}],   (B.79)

where (∂^+ J/∂ε) |_{ε=0} is the (deterministic) mode insertion gradient given by (4.11) under a specific ω ∈ Ω (i.e. specific (y_1, …, y_T)).
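A practical payoff of Theorem B.2 — and the reason Assumption B.4 can drop all moment conditions — is that each per-sample difference quotient is bounded by K via (B.76), no matter how heavy-tailed the transitions are. The sketch below illustrates this with Cauchy-distributed transitions, which have no finite mean; total_cost is again a hypothetical user-supplied rollout of (B.54)–(B.55).

import numpy as np

def sample_cauchy_transitions(rng, T, dim):
    # Heavy-tailed human transitions: no moments exist, yet Assumption B.4
    # permits them and Theorem B.2 still applies.
    return [0.1 * rng.standard_cauchy(dim) for _ in range(T)]

def expected_mode_insertion_gradient(total_cost, u, v, tau, T, dim,
                                     eps=1e-4, n_samples=2000, seed=0):
    """Sample-average estimate of (d^+/d eps) E[J] at eps = 0+ per (B.71)."""
    rng = np.random.default_rng(seed)
    u_eps = lambda t: v if tau - eps < t <= tau else u(t)   # Definition B.1
    quotients = []
    for _ in range(n_samples):
        ys = sample_cauchy_transitions(rng, T, dim)
        quotients.append((total_cost(u_eps, ys) - total_cost(u, ys)) / eps)
    # Each quotient obeys |.| <= K by (B.76), so the mean is well-defined
    # even though the transitions themselves have no finite moments.
    return float(np.mean(quotients))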


Bibliography

[1] A. Agha-mohammadi, S. Agarwal, S. Kim, S. Chakravorty, and N. M. Amato. SLAP: Simultaneous localization and planning under uncertainty via dynamic replanning in belief space. IEEE Transactions on Robotics, 34(5):1195–1214, 2018.

[2] Ali-Akbar Agha-Mohammadi, Suman Chakravorty, and Nancy M. Amato. FIRM: Sampling-based feedback motion-planning under motion uncertainty and imperfect measurements. The International Journal of Robotics Research, 33(2):268–304, 2014.

[3] Alexandre Alahi, Kratarth Goel, Vignesh Ramanathan, Alexandre Robicquet, Li Fei-Fei, and Silvio Savarese. Social LSTM: Human trajectory prediction in crowded spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 961–971, 2016.

[4] Daniel L. Alspach and Harold W. Sorenson. Nonlinear Bayesian estimation using Gaussian sum approximations. IEEE Transactions on Automatic Control, 17(4):439–448, 1972.

[5] Joel A. E. Andersson, Joris Gillis, Greg Horn, James B. Rawlings, and Moritz Diehl. CasADi: a software framework for nonlinear optimization and optimal control. Mathematical Programming Computation, 11(1):1–36, 2019.

[6] Alexander R. Ansari and Todd D. Murphey. Sequential Action Control: Closed-Form Optimal Control for Nonlinear and Nonsmooth Systems. IEEE Transactions on Robotics, 32(5):1196–1214, 2016.

[7] Karl J. Astrom. Introduction to Stochastic Control Theory. Academic Press, 1970.

[8] J. Baillieul and K. Ozcimder. The control theory of motion-based communication: Problems in teaching robots to dance. In 2012 American Control Conference (ACC), pages 4319–4326. IEEE, June 2012.

[9] Ruzena Bajcsy. Active perception. Proceedings of the IEEE, 76(8):966–1005, 1988.

[10] Moses Bangura and Robert Mahony. Real-time model predictive control for quadrotors. IFAC Proceedings Volumes, 47(3):11773–11780, 2014.


[11] Sarah Bechtle, Yixin Lin, Akshara Rai, Ludovic Righetti, and Franziska Meier. Curious iLQR: Resolving uncertainty in model-based RL. In Conference on Robot Learning, pages 162–171, 2020.

[12] Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeshwar, Sherjil Ozair, Yoshua Bengio, Aaron Courville, and Devon Hjelm. Mutual information neural estimation. In International Conference on Machine Learning, pages 531–540. PMLR, 2018.

[13] Dimitri P. Bertsekas. Dynamic Programming and Stochastic Control. Academic Press, Inc., USA, 1976.

[14] Dimitri P. Bertsekas et al. Dynamic Programming and Optimal Control: Vol. 1. Athena Scientific, Belmont, 2000.

[15] Graeme Best, Oliver M. Cliff, Timothy Patten, Ramgopal R. Mettu, and Robert Fitch. Decentralised Monte Carlo tree search for active perception. In Algorithmic Foundations of Robotics XII, pages 864–879. Springer, 2020.

[16] Lars Blackmore, Masahiro Ono, and Brian C. Williams. Chance-constrained optimal path planning with obstacles. IEEE Transactions on Robotics, 27(6):1080–1094, 2011.

[17] Nicolas Bourbaki and Philip Spain. Elements of Mathematics: Functions of a Real Variable: Elementary Theory. Springer Berlin Heidelberg, Berlin, Heidelberg, 2004.

[18] F. Bourgault, A. A. Makarenko, S. B. Williams, B. Grocholsky, and H. F. Durrant-Whyte. Information based adaptive robotic exploration. In 2002 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), volume 1, pages 540–545. IEEE, 2002.

[19] C. B. Browne, E. Powley, D. Whitehouse, S. M. Lucas, P. I. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, and S. Colton. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1–43, March 2012.

[20] Arthur E. Bryson Jr. and Yu-Chi Ho. Applied Optimal Control: Optimization, Estimation, and Control. Taylor & Francis Group, 1975.

[21] Panpan Cai, Yuanfu Luo, David Hsu, and Wee Sun Lee. HyP-DESPOT: A hybrid parallel algorithm for online planning under uncertainty. In Robotics: Science and Systems (RSS), 2018.

[22] Yuning Chai, Benjamin Sapp, Mayank Bansal, and Dragomir Anguelov. MultiPath: Multiple probabilistic anchor trajectory hypotheses for behavior prediction. In Conference on Robot Learning, pages 86–99. PMLR, 2019.

Page 154: ONLINE TRAJECTORY PLANNING ALGORITHMS FOR ROBOTIC …

BIBLIOGRAPHY 139

[23] Guillaume Chaslot, Sander Bakkes, Istvan Szita, and Pieter Spronck. Monte-Carlo tree search: A new framework for game AI. In Proceedings of the Fourth Artificial Intelligence and Interactive Digital Entertainment Conference, 2008.

[24] Pratik Chaudhari, Sertac Karaman, David Hsu, and Emilio Frazzoli. Sampling-based algorithms for continuous-time POMDPs. In 2013 American Control Conference (ACC), pages 4604–4610. IEEE, 2013.

[25] Changan Chen, Yuejiang Liu, Sven Kreiss, and Alexandre Alahi. Crowd-robot interaction: Crowd-aware robot navigation with attention-based deep reinforcement learning. In 2019 IEEE International Conference on Robotics and Automation (ICRA), pages 6015–6022. IEEE, 2019.

[26] Mo Chen, Jennifer C. Shih, and Claire J. Tomlin. Multi-vehicle collision avoidance via Hamilton-Jacobi reachability and mixed integer programming. In IEEE 55th Conference on Decision and Control (CDC), pages 1695–1700. IEEE, 2016.

[27] Rong Chen and Jun S. Liu. Mixture Kalman filters. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 62(3):493–508, 2000.

[28] Yu Fan Chen, Miao Liu, Michael Everett, and Jonathan P. How. Decentralized non-communicating multiagent collision avoidance with deep reinforcement learning. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 285–292. IEEE, 2017.

[29] Richard Cheng, Richard M. Murray, and Joel W. Burdick. Limits of probabilistic safety guarantees when considering human uncertainty. arXiv preprint arXiv:2103.03388, 2021.

[30] Yinlam Chow, Aviv Tamar, Shie Mannor, and Marco Pavone. Risk-sensitive and robust decision-making: A CVaR optimization approach. In Advances in Neural Information Processing Systems, volume 28, 2015.

[31] Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, volume 31, pages 4754–4765, 2018.

[32] Adrien Couetoux, Jean-Baptiste Hoock, Nataliya Sokolovska, Olivier Teytaud, and Nicolas Bonnard. Continuous upper confidence trees. In 2011 International Conference on Learning and Intelligent Optimization, pages 433–445. Springer, Berlin, Heidelberg, 2011.

[33] Joachim Denzler, Matthias Zobel, and Heinrich Niemann. On optimal camera parameter selection in Kalman filter based object tracking. Pattern Recognition, pages 17–25, 2002.

[34] Armen Der Kiureghian and Ove Ditlevsen. Aleatory or epistemic? Does it matter? Structural Safety, 31(2):105–112, 2009.

[35] J. Diestel and J. J. Uhl. Vector Measures. American Mathematical Society, 1977.

[36] John C. Doyle. Guaranteed margins for LQG regulators. IEEE Transactions on Automatic Control, 23(4):756–757, 1978.

[37] John C. Doyle, Bruce A. Francis, and Allen R. Tannenbaum. Feedback Control Theory. Macmillan, New York, 1992.

[38] Louis Dressel and Mykel J. Kochenderfer. Tutorial on the generation of ergodic trajectories with projection-based gradient descent. IET Cyber-Physical Systems: Theory & Applications, 2018.

[39] Louis Kenneth Dressel. Efficient and Low-cost Localization of Radio Sources with an Autonomous Drone. PhD thesis, Stanford University, 2018.

[40] Paul Dupuis and Richard S. Ellis. A Weak Convergence Approach to the Theory of Large Deviations. John Wiley & Sons, 1997.

[41] M. Egerstedt, Y. Wardi, and H. Axelsson. Transition-time optimization for switched-mode dynamical systems. IEEE Transactions on Automatic Control, 51(1):110–115, 2006.

[42] Maxim Egorov, Zachary N. Sunberg, Edward Balaban, Tim A. Wheeler, Jayesh K. Gupta, and Mykel J. Kochenderfer. POMDPs.jl: A framework for sequential decision making under uncertainty. Journal of Machine Learning Research, 18(26):1–5, 2017.

[43] Elijah Polak. Optimization: Algorithms and Consistent Approximations. Springer-Verlag, 1997.

[44] Jakob Engel, Thomas Schöps, and Daniel Cremers. LSD-SLAM: Large-scale direct monocular SLAM. In European Conference on Computer Vision, pages 834–849. Springer, 2014.

[45] Tom Erez and William D. Smart. A scalable method for solving high-dimensional continuous POMDPs using local approximation. In Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, UAI'10, pages 160–167, Arlington, Virginia, United States, 2010. AUAI Press.

[46] Michael Everett, Yu Fan Chen, and Jonathan P. How. Motion planning among dynamic, decision-making agents with deep reinforcement learning. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3052–3059. IEEE, 2018.

[47] Ioannis Exarchos, Evangelos A. Theodorou, and Panagiotis Tsiotras. Game-theoretic and risk-sensitive stochastic optimal control via forward and backward stochastic differential equations. In IEEE 55th Conference on Decision and Control (CDC), pages 6154–6160. IEEE, 2016.

[48] Farbod Farshidian and Jonas Buchli. Risk sensitive, nonlinear optimal control: Iterative linear exponential-quadratic optimal control with Gaussian noise. arXiv preprint arXiv:1512.07173, 2015.

[49] Fadri Furrer, Michael Burri, Markus Achtelik, and Roland Siegwart. RotorS – A modular Gazebo MAV simulator framework. In Robot Operating System (ROS), volume 625 SCI, pages 595–625. Springer, Cham, 2016.

[50] Sylvain Gelly, Levente Kocsis, Marc Schoenauer, Michèle Sebag, David Silver, Csaba Szepesvári, and Olivier Teytaud. The grand challenge of computer Go: Monte Carlo tree search and extensions. Communications of the ACM, 55(3):106–113, 2012.

[51] Keith Glover and John C. Doyle. Relations between H∞ and risk sensitive controllers. In Analysis and Optimization of Systems, pages 1–10. Springer, 1988.

[52] Vicenç Gómez, Sep Thijssen, Andrew Symington, Stephen Hailes, and Hilbert J. Kappen. Real-time stochastic optimal control for multi-agent quadrotor systems. In Proceedings of the International Conference on Automated Planning and Scheduling, volume 26, pages 468–476, 2016.

[53] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, volume 27, 2014.

[54] Kohur Gowrisankaran. Measurability of functions in product spaces. Proceedings of the American Mathematical Society, 31(2):485–488, 1972.

[55] Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Bernardo A. Pires, and Rémi Munos. Neural predictive belief representations. arXiv preprint arXiv:1811.06407, 2018.

[56] Agrim Gupta, Justin Johnson, Li Fei-Fei, Silvio Savarese, and Alexandre Alahi. Social GAN: Socially acceptable trajectories with generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2255–2264, 2018.

[57] David Ha and Jürgen Schmidhuber. Recurrent world models facilitate policy evolution. In Advances in Neural Information Processing Systems, volume 31, 2018.

[58] Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. In International Conference on Machine Learning, pages 2555–2565. PMLR, 2019.

[59] A. Hakobyan and I. Yang. Wasserstein distributionally robust motion planning and control with safety constraints using conditional value-at-risk. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pages 490–496. IEEE, 2020.

[60] Richard Hartley and Andrew Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, second edition, 2004.

[61] W. P. M. H. Heemels, D. Lehmann, J. Lunze, and B. De Schutter. Introduction to hybrid systems. In J. Lunze and F. Lamnabhi-Lagarrigue, editors, Handbook of Hybrid Systems Control – Theory, Tools, Applications, chapter 1, pages 3–30. Cambridge University Press, 2009.

[62] Dirk Helbing and Péter Molnár. Social force model for pedestrian dynamics. Physical Review E, 51(5):4282, 1995.

[63] Geoffrey A. Hollinger and Gaurav S. Sukhatme. Sampling-based robotic information gathering algorithms. The International Journal of Robotics Research, 33(9):1271–1287, 2014.

[64] Ronald A. Howard and James E. Matheson. Risk-sensitive Markov decision processes. Management Science, 18(7):356–369, 1972.

[65] Marco Huber. Probabilistic Framework for Sensor Management. PhD thesis, Universität Karlsruhe, 2009.

[66] Clearpath Robotics Inc. Hummingbird unmanned aerial vehicle. Available: https://www.clearpathrobotics.com/hummingbird-unmanned-aerial-vehicle/ (Accessed: September 13, 2017).

[67] Mortaza Noushad Iranzad. Estimation of Information Measures and its Applications in Machine Learning. PhD thesis, University of Michigan, 2019.

[68] Boris Ivanovic, Amine Elhafsi, Guy Rosman, Adrien Gaidon, and Marco Pavone. MATS: An interpretable trajectory forecasting representation for planning and control. In Conference on Robot Learning. PMLR, 2020.

[69] Boris Ivanovic and Marco Pavone. The Trajectron: Probabilistic multi-agent trajectory modeling with dynamic spatiotemporal graphs. In Proceedings of the IEEE International Conference on Computer Vision, pages 2375–2384, 2019.

[70] David Jacobson. Optimal stochastic linear systems with exponential performance criteria and their relation to deterministic differential games. IEEE Transactions on Automatic Control, 18(2):124–131, 1973.

[71] David H. Jacobson and David Q. Mayne. Differential Dynamic Programming. North-Holland, 1970.

[72] Gregor Jochmann, Sören Kerner, Stefan Tasse, and Oliver Urbann. Efficient multi-hypotheses unscented Kalman filtering for robust localization. In RoboCup 2011: Robot Soccer World Cup XV, volume 7416 LNCS, pages 222–233. Springer Berlin Heidelberg, 2012.

[73] Austin Jones and Sean Andersson. A motion-based communication system. In 2013 American Control Conference (ACC), pages 365–370. IEEE, 2013.

[74] Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1-2):99–134, 1998.

[75] Michael G. Kapteyn, Karen E. Willcox, and Andy Philpott. A distributionally robust approach to black-box optimization. In 2018 AIAA Non-Deterministic Approaches Conference, page 0666, 2018.

[76] Peter Karkus, David Hsu, and Wee Sun Lee. Particle filter networks with application to visual localization. In Conference on Robot Learning, pages 169–178. PMLR, 2018.

[77] Hassan K. Khalil and Jessy W. Grizzle. Nonlinear Systems, volume 3. Prentice Hall, Upper Saddle River, NJ, 2002.

[78] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[79] Mykel J. Kochenderfer. Decision Making Under Uncertainty: Theory and Application. MIT Press, 2015.

[80] Mykel J. Kochenderfer and Tim A. Wheeler. Algorithms for Optimization. MIT Press, 2019.

[81] Levente Kocsis and Csaba Szepesvári. Bandit based Monte-Carlo planning. In Proceedings of the 17th European Conference on Machine Learning, ECML'06, pages 282–293, Berlin, Heidelberg, 2006. Springer-Verlag.

[82] N. Koenig and A. Howard. Design and use paradigms for Gazebo, an open-source multi-robot simulator. In 2004 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), volume 3, pages 2149–2154. IEEE, 2004.

[83] Vineet Kosaraju, Amir Sadeghian, Roberto Martín-Martín, Ian Reid, Hamid Rezatofighi, and Silvio Savarese. Social-BiGAT: Multimodal trajectory forecasting using Bicycle-GAN and graph attention networks. In Advances in Neural Information Processing Systems, volume 32, 2019.

[84] Hanna Kurniawati, David Hsu, and Wee Sun Lee. SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces. In Robotics: Science and Systems (RSS), 2008.

[85] Raphael Kusumoto, Luigi Palmieri, Markus Spies, Akos Csiszar, and Kai O. Arras. Informed information theoretic model predictive control. In 2019 International Conference on Robotics and Automation (ICRA), pages 2047–2053. IEEE, 2019.

[86] Jerome Le Ny and George J. Pappas. On trajectory optimization for active sensing in Gaussian process models. In Proceedings of the 48th IEEE Conference on Decision and Control (CDC) held jointly with 2009 28th Chinese Control Conference, pages 6286–6292. IEEE, 2009.

[87] Namhoon Lee, Wongun Choi, Paul Vernaza, Christopher B. Choy, Philip H. S. Torr, and Manmohan Chandraker. DESIRE: Distant future prediction in dynamic scenes with interacting agents. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 336–345, 2017.

[88] Namhoon Lee and Kris M. Kitani. Predicting wide receiver trajectories in American football. In 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1–9. IEEE, 2016.

[89] Vincent Lepetit and Pascal Fua. Monocular model-based 3D tracking of rigid objects. Foundations and Trends in Computer Graphics and Vision, 1(1):1–89, 2005.

[90] Alon Lerner, Yiorgos Chrysanthou, and Dani Lischinski. Crowds by example. Computer Graphics Forum, 26(3):655–664, 2007.

[91] Weiwei Li and Emanuel Todorov. Iterative linear quadratic regulator design for nonlinear biological movement systems. In ICINCO (1), pages 222–229, 2004.

[92] Daniel Liberzon. Calculus of Variations and Optimal Control Theory: A Concise Introduction. Princeton University Press, 2011.

[93] Shiau Hong Lim, Huan Xu, and Shie Mannor. Reinforcement learning in robust Markov decision processes. In Advances in Neural Information Processing Systems, volume 26, 2013.

[94] Yuncheng Lu, Zhucun Xue, Gui-Song Xia, and Liangpei Zhang. A survey on vision-based UAV navigation. Geo-spatial Information Science, 21(1):21–32, 2018.

[95] Wenhao Luo, Wen Sun, and Ashish Kapoor. Multi-robot collision avoidance under uncertainty with probabilistic safety barrier certificates. Advances in Neural Information Processing Systems, 33, 2020.

[96] Omid Madani, Steve Hanks, and Anne Condon. On the undecidability of probabilistic planning and infinite-horizon partially observable Markov decision problems. In AAAI/IAAI, pages 541–548, 1999.

[97] Anirudha Majumdar and Marco Pavone. How should a robot assess risk? Towards an axiomatic theory of risk in robotics. In Robotics Research, pages 75–84. Springer, 2020.

[98] Anastasia Mavrommati, Emmanouil Tzorakoleftherakis, Ian Abraham, and Todd D. Murphey. Real-time area coverage and target localization using receding-horizon ergodic exploration. IEEE Transactions on Robotics, 34(1):62–80, 2018.

[99] Jose Ramon Medina, Dongheui Lee, and Sandra Hirche. Risk-sensitive optimal feedback control for haptic assistance. In 2012 IEEE International Conference on Robotics and Automation (ICRA), pages 1025–1031. IEEE, 2012.

[100] Jose Ramon Medina, Tamara Lorenz, Dongheui Lee, and Sandra Hirche. Disagreement-aware physical assistance through risk-sensitive optimal feedback control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3639–3645. IEEE, 2012.

[101] Raman Mehra, Sanjeev Seereeram, David Bayard, and Fred Hadaegh. Adaptive Kalman filtering, failure detection and identification for spacecraft attitude estimation. In Proceedings of International Conference on Control Applications, pages 176–181. IEEE, 1995.

[102] Daniel Mellinger, Alex Kushleyev, and Vijay Kumar. Mixed-integer quadratic program trajectory generation for heterogeneous quadrotor teams. In 2012 IEEE International Conference on Robotics and Automation (ICRA), pages 477–483. IEEE, 2012.

[103] L. Mihaylova, T. Lefebvre, H. Bruyninckx, K. Gadeyne, and J. De Schutter. Active sensing for robotics – A survey. In Proceedings of 5th International Conference on Numerical Methods and Applications, pages 316–324, 2002.

[104] Lauren M. Miller and Todd D. Murphey. Trajectory optimization for continuous ergodic exploration. In 2013 American Control Conference (ACC), pages 4196–4201. IEEE, 2013.

[105] Lauren M. Miller, Yonatan Silverman, Malcolm A. MacIver, and Todd D. Murphey. Ergodic exploration of distributed information. IEEE Transactions on Robotics, 32(1):36–52, 2016.

[106] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

[107] Thulasi Mylvaganam, Mario Sassano, and Alessandro Astolfi. A differential game approach to multi-agent collision avoidance. IEEE Transactions on Automatic Control, 62(8):4229–4235, 2017.

[108] Haruki Nishimura, Boris Ivanovic, Adrien Gaidon, Marco Pavone, and Mac Schwager. Risk-sensitive sequential action control with multi-modal human trajectory forecasting for safe crowd-robot interaction. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 11205–11212, 2020. © 2021 IEEE. Reprinted, with permission.

[109] Haruki Nishimura, Negar Mehr, Adrien Gaidon, and Mac Schwager. RAT iLQR: A risk auto-tuning controller to optimally account for stochastic model mismatch. IEEE Robotics and Automation Letters, 6(2):763–770, 2021. © 2021 IEEE. Reprinted, with permission.

[110] Haruki Nishimura and Mac Schwager. Active motion-based communication for robots with monocular vision. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 2948–2955, 2018. © 2021 IEEE. Reprinted, with permission.

[111] Haruki Nishimura and Mac Schwager. SACBP: Belief space planning for continuous-time dynamical systems via stochastic sequential action control. In Algorithmic Foundations of Robotics XIII, pages 267–283. Springer, 2020.

[112] Haruki Nishimura and Mac Schwager. SACBP: Belief space planning for continuous-time dynamical systems via stochastic sequential action control. The International Journal of Robotics Research, accepted. Available: https://journals.sagepub.com/doi/full/10.1177/02783649211037697 (Accessed: August 17, 2021).

[113] Christos H. Papadimitriou and John N. Tsitsiklis. The complexity of Markov decision processes. Mathematics of Operations Research, 12(3):441–450, 1987.

[114] Sachin Patil, Gregory Kahn, Michael Laskey, John Schulman, Ken Goldberg, and Pieter Abbeel. Scaling up Gaussian belief space planning through covariance-free trajectory optimization and automatic differentiation. In Algorithmic Foundations of Robotics XI, pages 515–533. Springer, 2015.

[115] N. Peach. Bearings-only tracking using a set of range-parameterised extended Kalman filters. IEE Proceedings – Control Theory and Applications, 142(1):73–80, 1995.

[116] Stefano Pellegrini, Andreas Ess, Konrad Schindler, and Luc Van Gool. You'll never walk alone: Modeling social behavior for multi-target tracking. In Proceedings of the IEEE International Conference on Computer Vision, pages 261–268. IEEE, 2009.

[117] Ian R. Petersen, Matthew R. James, and Paul Dupuis. Minimax optimal control of stochastic uncertain systems with relative entropy constraints. IEEE Transactions on Automatic Control, 45(3):398–412, 2000.

[118] Robert Platt. Convex receding horizon control in non-Gaussian belief space. In Algorithmic Foundations of Robotics X, pages 443–458. Springer, 2013.

[119] Robert Platt, Russell Tedrake, Leslie Kaelbling, and Tomas Lozano-Perez. Belief space planning assuming maximum likelihood observations. In Robotics: Science and Systems (RSS), 2010.

[120] Ashwini Pokle, Roberto Martín-Martín, Patrick Goebel, Vincent Chow, Hans M. Ewald, Junwei Yang, Zhenkai Wang, Amir Sadeghian, Dorsa Sadigh, Silvio Savarese, et al. Deep local trajectory replanning and control for robot navigation. In 2019 IEEE International Conference on Robotics and Automation (ICRA), pages 5815–5822. IEEE, 2019.

[121] M. Popovic, G. Hitz, J. Nieto, I. Sa, R. Siegwart, and E. Galceran. Online informative path planning for active classification using UAVs. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 5753–5758, 2017.

[122] Morgan Quigley, Ken Conley, Brian P. Gerkey, Josh Faust, Tully Foote, Jeremy Leibs, Rob Wheeler, and Andrew Y. Ng. ROS: An open-source Robot Operating System. In ICRA Workshop on Open Source Software, 2009.

[123] M. Rafieisakhaei, S. Chakravorty, and P. R. Kumar. T-LQG: Closed-loop belief space planning via trajectory-optimized LQG. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 649–656, 2017.

[124] Mohammadhussein Rafieisakhaei, Suman Chakravorty, and P. R. Kumar. Belief space planning simplified: Trajectory-optimized LQG (T-LQG). arXiv preprint arXiv:1608.03013, 2016.

[125] Dhananjay Raghunathan and John Baillieul. Relative motion of robots as a means for signaling. In Proceedings of the World Congress on Engineering and Computer Science 2008, San Francisco, USA, 2008.

[126] Dhananjay Raghunathan and John Baillieul. Exploiting information content in relative motion. In 2009 American Control Conference (ACC), pages 2166–2171. IEEE, 2009.

[127] Dhananjay Raghunathan and John Baillieul. Motion based communication channels between mobile robots – A novel paradigm for low bandwidth information exchange. In 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 702–708. IEEE, 2009.

[128] Vanessa Bates Ramirez. A new and improved burger robot's on the market – and everyone wants one, November 2020. Available: https://singularityhub.com/2020/11/06/flippy-the-fast-food-robot-just-went-commercial-and-its-selling-like-crazy/ (Accessed: August 17, 2021).

[129] J. Revels, M. Lubin, and T. Papamarkou. Forward-mode automatic differentiation in Julia. arXiv:1607.07892 [cs.MS], 2016.

[130] Nicholas Rhinehart, Rowan McAllister, Kris Kitani, and Sergey Levine. PRECOG: Prediction conditioned on goals in visual multi-agent settings. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2821–2830, 2019.

[131] Vincent Roulet, Maryam Fazel, Siddhartha Srinivasa, and Zaid Harchaoui. On the convergence of the iterative linear exponential quadratic Gaussian algorithm to stationary points. In 2020 American Control Conference (ACC), pages 132–137. IEEE, 2020.

[132] Reuven Y. Rubinstein and Dirk P. Kroese. The Cross-Entropy Method: A Unified Approach to Combinatorial Optimization, Monte-Carlo Simulation and Machine Learning. Springer Science & Business Media, 2013.

[133] Amir Sadeghian, Vineet Kosaraju, Ali Sadeghian, Noriaki Hirose, Hamid Rezatofighi, and Silvio Savarese. SoPhie: An attentive GAN for predicting paths compliant to social and physical constraints. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1349–1358, 2019.

[134] T. Salzmann, B. Ivanovic, P. Chakravarty, and M. Pavone. Trajectron++: Dynamically-feasible trajectory forecasting with heterogeneous data. In European Conference on Computer Vision, 2020.

[135] S. Samuelson and I. Yang. Data-driven distributionally robust control of energy storage to manage wind power fluctuations. In 2017 IEEE Conference on Control Technology and Applications (CCTA), pages 199–204. IEEE, 2017.

[136] Samantha Samuelson and Insoon Yang. Safety-aware optimal control of stochastic systems using conditional value-at-risk. In 2018 American Control Conference (ACC), pages 6285–6290. IEEE, 2018.

[137] Edward Schmerling, Karen Leung, Wolf Vollprecht, and Marco Pavone. Multimodal probabilistic model-based planning for human-robot interaction. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1–9. IEEE, 2018.

[138] Mac Schwager, Philip Dames, Daniela Rus, and Vijay Kumar. A multi-robot control policy for information gathering in the presence of unknown hazards. In Robotics Research: The 15th International Symposium ISRR, pages 455–472. Springer International Publishing, 2017.

[139] Andreas Seekircher, Tim Laue, and Thomas Röfer. Entropy-based active vision for a humanoid soccer robot. In RoboCup 2010: Robot Soccer World Cup XIV, volume 6556 LNCS, pages 1–12. Springer Berlin Heidelberg, 2011.

[140] Christian Seider and Christina Radis. When the car takes over – A glimpse into the future of autonomous driving. Technical report, NTT DATA Deutschland GmbH, Munich, Germany, February 2018. Available: https://de.nttdata.com/files/2018-en-wp-when-the-car-takes-over.pdf (Accessed: August 17, 2021).

[141] Robin Senge, Stefan Bösner, Krzysztof Dembczyński, Jörg Haasenritter, Oliver Hirsch, Norbert Donner-Banzhoff, and Eyke Hüllermeier. Reliable classification: Learning classifiers that distinguish aleatoric and epistemic uncertainty. Information Sciences, 255:16–29, 2014.

[142] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

[143] Aman Sinha, Matthew O'Kelly, Hongrui Zheng, Rahul Mangharam, John Duchi, and Russ Tedrake. FormulaZero: Distributionally robust online adaptation via offline population synthesis. In International Conference on Machine Learning, pages 8992–9004. PMLR, 2020.

[144] Michael A. Sipe and David P. Casasent. Best viewpoints for active vision classification and pose estimation. In David P. Casasent, editor, Proc. SPIE 3208, Intelligent Robots and Computer Vision XVI: Algorithms, Techniques, Active Vision, and Materials Handling, pages 382–393. International Society for Optics and Photonics, 1997.

[145] Patrick Slade, Preston Culbertson, Zachary Sunberg, and Mykel Kochenderfer. Simultaneous active parameter estimation and control using sampling-based Bayesian reinforcement learning. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 804–810. IEEE, 2017.

[146] Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, volume 28, 2015.

[147] Adhiraj Somani, Nan Ye, David Hsu, and Wee Sun Lee. DESPOT: Online POMDP planning with regularization. In Advances in Neural Information Processing Systems, volume 26, 2013.

[148] Davide Spinello and Daniel J. Stilwell. Nonlinear estimation with state-dependent Gaussian observation noise. IEEE Transactions on Automatic Control, 55(6):1358–1366, 2010.

[149] Ke Sun and Vijay Kumar. Belief space planning for mobile robots with range sensors using iLQG. IEEE Robotics and Automation Letters, 6(2):1902–1909, 2021.

[150] Zachary Sunberg and Mykel Kochenderfer. Online algorithms for POMDPs with continuous state, action, and observation spaces. In Proceedings of the International Conference on Automated Planning and Scheduling, volume 28, pages 259–263, 2018.

[151] Yuval Tassa, Tom Erez, and Emanuel Todorov. Synthesis and stabilization of complex behaviors through online trajectory optimization. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4906–4913. IEEE, 2012.

[152] Selim Temizer, Mykel Kochenderfer, Leslie Kaelbling, Tomas Lozano-Perez, and James Kuchar. Collision avoidance for unmanned aircraft using Markov decision processes. In AIAA Guidance, Navigation, and Control Conference, page 8040, 2010.

[153] Sebastian Thrun. Probabilistic algorithms in robotics. AI Magazine, 21(4):93–93, 2000.

[154] Sebastian Thrun, Wolfram Burgard, and Dieter Fox. Probabilistic Robotics. The MIT Press, 2005.

[155] The Seattle Times. Amazon deploys many more orange robots at warehouses, December 2016. Available: https://www.seattletimes.com/business/amazon-deploys-many-more-orange-troops-at-warehouses/ (Accessed: August 17, 2021).

[156] Emanuel Todorov and Weiwei Li. A generalized iterative LQG method for locally-optimal feedback control of constrained nonlinear stochastic systems. In 2005 American Control Conference (ACC), pages 300–306. IEEE, 2005.

[157] Claire Tomlin, George J. Pappas, and Shankar Sastry. Conflict resolution for air traffic management: A study in multiagent hybrid systems. IEEE Transactions on Automatic Control, 43(4):509–521, 1998.

[158] Emmanouil Tzorakoleftherakis and Todd D. Murphey. Iterative sequential action control for stable, model-based control of nonlinear systems. IEEE Transactions on Automatic Control, 64(8):3170–3183, 2018.

[159] Núria Armengol Urpí, Sebastian Curi, and Andreas Krause. Risk-averse offline reinforcement learning. In International Conference on Learning Representations, 2021.

[160] Jur Van Den Berg, Stephen J. Guy, Ming Lin, and Dinesh Manocha. Reciprocal n-body collision avoidance. In Robotics Research, pages 3–19. Springer, 2011.

[161] Jur van den Berg, Sachin Patil, and Ron Alterovitz. Motion planning under uncertainty using iterative local optimization in belief space. The International Journal of Robotics Research, 31(11):1263–1278, 2012.

[162] B. P. G. Van Parys, D. Kuhn, P. J. Goulart, and M. Morari. Distributionally robust control of constrained stochastic systems. IEEE Transactions on Automatic Control, 61(2):430–442, 2016.

[163] Karl von Frisch. The Dance Language and Orientation of Bees. Harvard University Press, 2013.

[164] Andreas Wächter and Lorenz T. Biegler. On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming. Mathematical Programming, 106(1):25–57, 2006.

[165] Allen Wang, Xin Huang, Ashkan Jasour, and Brian Williams. Fast risk assessment for autonomous vehicles using learned models of agent futures. In Robotics: Science and Systems (RSS), 2020.

[166] Allen Wang, Ashkan Jasour, and Brian C. Williams. Non-Gaussian chance-constrained trajectory planning for autonomous vehicles under agent uncertainty. IEEE Robotics and Automation Letters, 5(4):6041–6048, 2020.

[167] Jack M. Wang, David J. Fleet, and Aaron Hertzmann. Gaussian process dynamical models for human motion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2):283–298, 2007.

[168] L. Wang, A. D. Ames, and M. Egerstedt. Safety barrier certificates for collisions-free multirobot systems. IEEE Transactions on Robotics, 33(3):661–674, 2017.

[169] Mingyu Wang, Negar Mehr, Adrien Gaidon, and Mac Schwager. Game-theoretic planning for risk-aware interactive agents. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 6998–7005. IEEE, 2020.

[170] Mingyu Wang, Zijian Wang, Shreyasha Paudel, and Mac Schwager. Safe distributed lane change maneuvers for multiple autonomous vehicles using buffered input cells. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1–7. IEEE, 2018.

[171] Zijian Wang, Guang Yang, Xuanshuo Su, and Mac Schwager. OuijaBots: Omnidirectional robots for cooperative object transport with rotation control using no communication. In Distributed Autonomous Robotic Systems, pages 117–131. Springer, 2018.

[172] Y. Wardi and M. Egerstedt. Algorithm for optimal mode scheduling in switched systems. In 2012 American Control Conference (ACC), pages 4546–4551. IEEE, 2012.

[173] Peter Whittle. Risk-sensitive linear/quadratic/Gaussian control. Advances in Applied Probability, 13(4):764–777, 1981.

[174] Peter Whittle. Risk sensitivity, a strangely pervasive concept. Macroeconomic Dynamics, 6(1):5–18, 2002.

[175] Grady Williams, Paul Drews, Brian Goldfain, James M. Rehg, and Evangelos A. Theodorou. Information-theoretic model predictive control: Theory and applications to autonomous driving. IEEE Transactions on Robotics, 34(6):1603–1622, 2018.

[176] Grady Williams, Brian Goldfain, Paul Drews, James M. Rehg, and Evangelos A. Theodorou. Best response model predictive control for agile interactions between autonomous ground vehicles. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 2403–2410. IEEE, 2018.

[177] Lihua Xie, Dan Popa, and Frank L. Lewis. Optimal and Robust Estimation: With an Introduction to Stochastic Control Theory. CRC Press, 2007.

[178] Dingjiang Zhou, Zijian Wang, Saptarshi Bandyopadhyay, and Mac Schwager. Fast, on-line collision avoidance for dynamic vehicles using buffered Voronoi cells. IEEE Robotics and Automation Letters, 2(2):1047–1054, 2017.

[179] M. Zobel, J. Denzler, and H. Niemann. Entropy based camera control for visual object tracking. In Proceedings of the International Conference on Image Processing, volume 3, pages 901–904. IEEE, 2002.