
IEEE ROBOTICS AND AUTOMATION LETTERS. PREPRINT VERSION. ACCEPTED MONTH, YEAR 1

Learn to adapt to human walking: A Model-based Reinforcement Learning Approach for a Robotic Assistant Rollator

Georgia Chalvatzaki¹, Xanthi S. Papageorgiou², Petros Maragos¹ and Costas S. Tzafestas¹

Abstract—In this paper, we tackle the problem of adapting the motion of a robotic assistant rollator to patients with different mobility status. The goal is to achieve a coupled human-robot motion in a front-following setting, as if the patient was pushing the rollator him/herself. To this end, we propose a novel approach using Model-based Reinforcement Learning (MBRL) for adapting the control policy of the robotic assistant. This approach encapsulates our previous work on human tracking and gait analysis from RGB-D and laser streams into a human-in-the-loop decision-making strategy. We use Long Short-Term Memory (LSTM) networks for designing a Human Motion Intention Model (HuMIM) and a Coupling Parameters Forecast model, leveraging the outcome of human gait analysis. An initial LSTM-based policy network was trained via Imitation Learning (IL) from human demonstrations in a Motion Capture setup. This policy is then fine-tuned with the MBRL framework using tracking data from real patients. A thorough evaluation analysis proves the efficiency of the MBRL approach as a user-adaptive controller.

Index Terms—Human-Centered Robotics; Learning and Adaptive Systems; Automation in Life Sciences: Biotechnology, Pharmaceutical and Health Care

I. INTRODUCTION

THE development of robotic mobility assistants is a major research area with great impact on society. The constant increase of the aged population in recent years has created new challenges in the healthcare sector, making it difficult for the existing care and nursing staff to keep up with these evolving needs. The necessity for robotic assistants that will help with elderly mobility and rehabilitation is clear. It has now been close to twenty years since the first robotic rollators emerged [1], [2]. An intelligent robotic mobility assistant should serve many purposes: postural support, gait analysis, sit-to-stand transfer, navigation and cognitive assistance. Adaptation to user needs is important for seamless human-robot interaction in such applications.

Manuscript received: February 24, 2019; Revised: May 31, 2019; Accepted: June 27, 2019.

This paper was recommended for publication by Editor Allison Okamura upon evaluation of the Associate Editor and Reviewers' comments. This research work has been co-financed by the European Union and Greek national funds through the Operational Program Competitiveness, Entrepreneurship and Innovation, under the call RESEARCH CREATE INNOVATE (i-Walk, project code: T1EDK-01248 / MIS: 5030856).

¹ G. Chalvatzaki, C. S. Tzafestas and P. Maragos are with the School of Electrical and Computer Engineering, National Technical University of Athens, Greece (email: gchal@mail.ntua.gr).

² X. S. Papageorgiou is with the Institute for Language and Speech Processing (ILSP / Athena R.C.), Greece (email: xpapag@ilsp.gr).

Digital Object Identifier (DOI): see top of this page.

Fig. 1: The robotic agent observes the predicted human motion intention and learns through Model-based Reinforcement Learning to adapt its control actions accordingly.

Fig. 2: Left and middle: Prototype robotic assistant rollator equipped with an RGB-D sensor for capturing the upper-body pose and a 2D laser sensor for detecting the legs' motion. Right: Example of the MoCap markers on an elderly user and a passive rollator, from which the data for imitation learning were derived.

In this paper, we tackle the problem of adapting the motion of a robotic rollator that moves along with an elderly user while staying in front of him/her. The applied control should comply with the user's needs whether the user walks supported or unsupported by the rollator, e.g. when feeling confident enough to release the handles and walk along with the robot in front of them (Fig. 1). However, the robot should follow and remain at a close distance in front of the user, not only to provide support whenever needed, but also to prevent possible falls.

Motivated by this need, and taking into account the variability in human walking, especially in pathological gait (e.g., ataxic and freezing types of gait present different velocities and patterns), we propose a unified method for continuously monitoring each user and adapting the robotic platform's motion accordingly. We propose an MBRL method for adapting the robot's motion in front of the user. Fig. 1 gives an overview of the problem we aim to solve; the robotic assistant should infer the human's motion intention and learn a control policy using MBRL to select control actions that comply with the human's way of walking.

We build upon our previous work regarding human tracking and gait analysis, fusing 2D laser data capturing the legs' motion [3] and RGB-D streams for upper-body pose estimation using the OpenPose library [4], from sensors mounted on a robotic rollator (Fig. 2). Laser data are used to perform robust gait tracking and reliable on-line gait analysis, by exploiting the high scanning frequency and precision of the laser sensor, while RGB-D streams can provide additional information from which we can infer human gait stability [5]. In this work, we integrate the aforementioned methods into a human-in-the-loop control framework using MBRL for adapting the robot motion to each user.

In the human-robot coupled navigation context, our main contribution resides in a novel approach considering human motion intentions within an MBRL framework for the online motion adaptation of a robotic assistant in a challenging front-following scenario (Fig. 1). In this framework, we start by developing LSTM-based prediction models for estimating human motion intention using a history of motion tracking data. We then train models which associate the human motion orientation and the estimated stride length provided by gait analysis to the desired coupling parameters for the robot's heading and position, i.e. the desired separation distance and bearing in the human-robot frame. Further on, we use this information to train a policy for suggesting robot control actions according to the human motion intentions and the expected desired coupling. We developed an initial policy model trained with IL from human demonstrations, using data from motion markers (VICON system) which were placed on the human and a passive rollator frame in a series of data-collection experiments (Fig. 2). Although such a model behaves well for the demonstrated cases and gives insight on how the user wants the platform to be placed in front of him/her while walking, this policy has no experience in recovering from drift or from an unexpected detection loss of the user. To cope with such situations, the proposed MBRL framework performs fine-tuning of the initial control policy (as seen in Fig. 3), while using random-sampling Model Predictive Control (MPC) for planning [6]-[8]. Detailed experimental results are presented in the paper, showing the efficiency of the proposed MBRL framework for the motion adaptation of a robotic assistant rollator using data from real patients.

II. RELATED WORK

State-of-the-art research on robotic assistants mostly relies on admittance control schemes [9], [10]. A control strategy using human velocity and orientation as inputs was proposed in [11]. A formation control for a robot-human following scenario was presented in [12], for safely navigating blind people. In our previous work [13], we have considered a front-following problem with a kinematic controller adapting to users according to their pathological mobility class. A Reinforcement Learning (RL) shared control for a walking aid with human intention prediction from force sensors is presented in [14].

A lot of research focuses on social robot navigation [15], i.e. robot motion planning among crowds [16], using RL. Most methods for robot navigation require pedestrian motion predictions for the robot to learn how to navigate among them in a compliant way [17]. An interaction-aware motion prediction approach for pedestrians, with an LSTM-based model for learning human motion behavior, was presented in [18]. In [19], deep RL was used for navigating across crowds according to social norms, while in [20], RL is used for unfreezing the robot in the crowd by taking into account the coordination between robots and detected humans. In such cases the robot does not accompany humans; rather, it learns how to move through and avoid collisions with them.

Fig. 3: Model-based Reinforcement Learning framework for policy adaptation using human motion intention predictions.

Regarding robotic companions, a method for human-robot navigation using the social force model and a Bayesian predictor for human motion is described in [21]. A model based on social force and human motion prediction is presented in [22], for making robots capable of approaching people with a human-like behavior while walking in a side-by-side formation with a person, avoiding several pedestrians in the environment. An MPC technique that accounts for safety and comfort requirements for a robot accompanying a human in a search-and-rescue scenario is presented in [23].

The use of deep RL is prevalent in modern research aiming to plan robot motion [24] and control [25] for various tasks. Robot navigation systems which have integrated such RL decision-making schemes can be found in [26]-[28]. Approaches combining IL with RL for learning control policies are presented in [29], [30]. Although model-free RL approaches have many successful applications, they require large amounts of training data, which are often simulated, thus their applicability is limited. On the other hand, model-based RL first learns a model of the system and then trains a control policy using feedback [31]. MBRL has been used for robot control both in simulated and real-world experiments [32]-[34]. MBRL relies on MPC for planning control actions; thus, using learned models along with MPC as a control policy is an active topic in RL and IL research [8], [35], [36]. We were inspired by recent advances in adaptive control using MBRL [7], [37]. In this work, we propose a novel MBRL framework for learning and adapting the control policy of a robotic assistant rollator to human walking. To the best of our knowledge, this is the first approach aiming to solve a front-following problem using MBRL and human motion prediction models, either for a robotic assistant or a robotic companion.

III. PRELIMINARIES

In RL, the goal is to learn a policy that proposes actions for an agent so as to maximize the sum of the expected future rewards [38]. Given the current state x_t ∈ X, the agent executes an action u_t ∈ U and receives a reward r_t = r(x_t, u_t), while transitioning to the next state x_{t+1} = f(x_t, u_t) + w_t with initial state x_0 ∼ p(x_0), where f is a nonlinear function describing the system's forward dynamics, w_t is a Gaussian noise process and p(x_0) an initial state distribution. In most cases, especially for model-free RL [39], the reward function is estimated from samples, which is a data-expensive process. Model-based RL attempts to address the problem of data inefficiency by using observed data to learn the dynamics of the system. The model is used for running internal simulations of the agent's dynamics, based on which the policy is learned. The goal of MBRL is to learn an approximation of the true dynamics f. Let f_θ be the learned discrete-time function, parametrized by θ, that approximates f. The objective is to find the parametrized policy f_θ over a finite horizon that maximizes a long-term reward over a time horizon T by optimizing the parameters θ.

Since MBRL aims to learn a global dynamics model, generalization is an issue, especially for robotics applications that have to confront stochastic environments and adapt to new tasks. Thus, we resort to planning through the suggested policy actions to compensate for model errors. MPC is a suitable finite-horizon optimal control solution, which optimizes a cost function at each time step to produce a sequence of control actions. Classic MPC relies on optimizing constrained quadratic costs, requiring first- or second-order approximations of the dynamics for convexity, which is sometimes difficult to meet when the dynamics are approximated by neural networks. Thus, it is useful to employ a random-sampling shooting method for MPC [6], performing rollouts through time and simulating trajectories over a short time horizon T. In the MBRL framework, MPC is used for finding the trajectory with the minimum cumulative cost over the time horizon T, of which only the first action u_t of the optimal sequence is applied to the system, re-planning at each time step. Therefore, such an approach compensates for model inaccuracies by preventing accumulating errors and drifting from the desired trajectory. In the context of MBRL, the reward maximization can be viewed as the equivalent cost minimization problem solved through MPC.
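For concreteness, the random-sampling shooting MPC described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation; `dynamics_fn`, `cost_fn` and `sample_actions` are assumed placeholder callables.

```python
import numpy as np

def random_shooting_mpc(x0, dynamics_fn, cost_fn, sample_actions, horizon=10, n_samples=50):
    """Return the first action of the lowest-cost sampled action sequence.

    dynamics_fn(x, u) -> next state (learned model), cost_fn(x, u) -> scalar stage cost,
    sample_actions(n, h) -> array of shape (n, h, action_dim); all are placeholders.
    """
    action_seqs = sample_actions(n_samples, horizon)
    costs = np.zeros(n_samples)
    for i in range(n_samples):
        x = np.array(x0, dtype=float)
        for t in range(horizon):
            u = action_seqs[i, t]
            costs[i] += cost_fn(x, u)          # accumulate cost along the simulated rollout
            x = dynamics_fn(x, u)              # step the learned model forward
    best = int(np.argmin(costs))
    return action_seqs[best, 0]                # execute only the first action, then re-plan
```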

IV. PROBLEM STATEMENT

Our problem concerns finding the optimal control policy for adapting the robotic assistant's motion to the needs of users with different mobility status. Given an estimated current human state x^H_t = [x^H y^H υ^H_x υ^H_y]^T, where p^H_t = [x^H y^H]^T is the position and u^H_t = [υ^H_x υ^H_y]^T the velocity along the axes, and the human-related robot coupling parameters, i.e. the desired separation distance ℓ_t and relative human-robot bearing φ_t, we must find an optimal control action u^R_t that guarantees compliance with the human motion intention. In other words, we aim to find a policy f_θ(x^R_t, x^H_t) that proposes robot control actions u^R_t = [υ_t ω_t]^T, where υ_t and ω_t are the linear and angular velocities and x^R_t = [x^R y^R]^T is the robot position along the axes, following the objective of joint human-robot navigation. This includes the following optimization problem, aiming to find the optimal control sequence U^T_t = {u_t, ..., u_{t+T-1}} over a finite time horizon T by minimizing a quadratic cost:

    U^T_t = argmin_{u_t,...,u_{t+T-1}}  (1/2) Σ_{τ=t}^{t+T-1} (x^R_τ)^T C x^R_τ + c_τ^T x^R_τ
    s.t.   x^R_{t+1} = g(x^R_t, u^R_t)  with  u^R_t ∼ f_θ(x^R_t, x^H_t)
    and    u_lb ≤ u^R_t ≤ u_ub                                                        (1)

where C ∈ R^{2×2} is a diagonal positive-definite weight matrix and c_t = -(p^H_t + x^d_t) is the goal position, with x^d_t = [ℓ_t cos(φ_t)  ℓ_t sin(φ_t)]^T being the desired coupling between the human and robot positions along the axes in the local human-robot frame. The optimization problem is subject to the robot motion model g(x^R_t, u^R_t) w.r.t. the unknown policy f_θ(x^R_t, x^H_t) and is constrained by upper u_ub and lower u_lb bounds on the linear and angular velocity commands. As transition model g(x^R_t, u^R_t), we consider the well-known unicycle model.
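To make the notation of Eq. (1) concrete, the following sketch (an illustration, not code from the paper) implements the unicycle transition g and the quadratic stage cost for a single time step. The weight values in C are assumed, and the default time step dt only borrows the laser frame period reported in Sec. VI.

```python
import numpy as np

def unicycle_step(state, v, omega, dt=0.028):
    """Unicycle transition g(x^R_t, u^R_t): state = (x, y, theta)."""
    x, y, theta = state
    return np.array([x + v * np.cos(theta) * dt,
                     y + v * np.sin(theta) * dt,
                     theta + omega * dt])

def stage_cost(p_robot, p_human, l_des, phi_des, C=np.diag([1.0, 1.0])):
    """Quadratic stage cost of Eq. (1) for one time step.

    p_robot, p_human : 2D positions; l_des, phi_des : desired separation and bearing.
    The diagonal weights in C are illustrative values, not taken from the paper.
    """
    x_d = np.array([l_des * np.cos(phi_des), l_des * np.sin(phi_des)])  # desired offset x^d_t
    c = -(p_human + x_d)                                                # goal term c_t
    return 0.5 * p_robot @ C @ p_robot + c @ p_robot
```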

V. PROPOSED MBRL FRAMEWORK

The proposed method for control policy learning for a robotic assistant rollator that adapts its motion to the user's gait, while keeping a desired relative coupling formation (distance and bearing), is depicted in Fig. 3. At each time frame, we predict the human motion over a time horizon T and forecast the evolution of the desired coupling parameters. This information is used for sampling velocities from the control policy network that approximates the dynamics of the human-robot coupled motion, and the MPC selects the optimal control sequence according to (1). The observed human and robot states, along with the selected action, are aggregated in a dataset for adapting the control policy.

Specifically, the proposed framework addresses two core sub-problems. The first sub-problem is understanding the human motion intention. This encapsulates not only the prediction of the future human trajectory over a finite time horizon given some past knowledge, but also a model for forecasting the evolution of the desired separation distance and bearing over the same time horizon. Our second sub-problem concerns learning an optimal control policy. This policy depends on the robot's observation of the human motion, since we have to deal with a constantly interactive human-robot coupling problem. The robot should always be in front of the human, keeping a desired separation distance and orientation and adapting its control actions according to the human's current and predicted walking states. Incorporating a human motion forecast model helps to better decide on the best long-term cost of the control actions through MPC. We rely on IL for training an initial global approximator of the control policy network from human demonstrations, and use this trained model in the MBRL framework for online adaptation. In the following, we describe the human motion intention prediction models, the proposed control policy network, and their implementation within the proposed MBRL framework.

A. Human Motion Intention Prediction Models

Human motion intention prediction includes two main goals, as shown in Fig. 4. The first concerns the human motion prediction over a finite time horizon given the past human states. The second refers to the estimation of the coupling parameters of the robot w.r.t. the human. Before diving into the details of the predictive models, we briefly describe what the human state includes and how it is extracted. In our previous works, we have extensively studied human motion detection and tracking from 2D laser data along with real-time gait analysis [3]. Recently, in [5] we have also used the upper-body pose detection from an RGB-D sensor to perform human Center-of-Mass (CoM) tracking, by jointly using information from the pose and the legs' motion (i.e., the gait velocity) to estimate the CoM motion. In this work, we use this notion of human state x^H_t, i.e. the position and velocity of the human's CoM along the axes. We exploit Motion Capture (MoCap) data to extract ground-truth states; we initially train our models with these smooth data and then fine-tune them with data from our tracking system.

Fig. 4: Recurrent networks used for predicting future human states (HuMIM network) and the desired coupling parameters.

Human Motion Intention Model (HuMIM): HuMIM is a deep-learning network based on LSTM units [40]. The LSTM constitutes a special kind of recurrent neural network that can effectively learn long-term dependencies present in sequential data such as motion trajectories. This is accomplished by incorporating memory cells that allow the network to learn when to forget previous hidden states and when to update hidden states given new information. Our network architecture for HuMIM is depicted in Fig. 4. The input feature vector x^H_t is the current human state.¹ The network comprises a Fully Connected (FC) layer (Fig. 4, blue boxes), followed by a Rectified Linear Unit (ReLU) activation [41], two LSTM layers (Fig. 4, yellow boxes), and an FC layer that decodes the output, which is a prediction of the future human states over a time horizon T: x^H_{t+1:t+T}.

¹ We have set the initial human position to be the global reference frame.

Fig. 5: Network architecture of the proposed control policy.

Coupling Parameters Forecast Model: Another problem we need to solve is to figure out the desired coupling parameters in the human-robot frame, i.e. the relative distance and bearing that will ensure the coupled human-robot motion. This is especially important in following-mode cases, when the robot has to follow the human from the front while keeping a close distance in case assistance is needed. Those parameters are crucial for robot control, as we have already seen in Sec. IV. For computing them, we employed information from demonstrations of real patients walking with a passive rollator while wearing motion markers (Fig. 2). We have found that the human-rollator distance while walking is correlated to the human stride length. Since we can apply real-time gait analysis from [3], we can compute stride lengths and use them in a prediction network that forecasts their evolution over a time window. For the relative bearing, we predict the evolution of the human self-orientation in the local human frame. Therefore, given as input the current parameters ℓ_t, φ_t, a simple network with two LSTM layers along with an FC layer (Fig. 4) can provide the next time-step predictions.
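As a rough PyTorch sketch of the HuMIM architecture described above (FC + ReLU encoder, two stacked LSTM layers, FC decoder): the hidden width, the history length and the decoding scheme are assumptions, since the paper specifies only the layer types. The Coupling Parameters Forecast model follows the same pattern, with two LSTM layers and an FC layer operating on (ℓ_t, φ_t).

```python
import torch
import torch.nn as nn

class HuMIM(nn.Module):
    """FC + ReLU encoder, two stacked LSTM layers, FC decoder predicting future human states."""
    def __init__(self, state_dim=4, hidden=128, horizon=10):
        super().__init__()
        self.horizon = horizon
        self.encoder = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.lstm = nn.LSTM(hidden, hidden, num_layers=2, batch_first=True)
        self.decoder = nn.Linear(hidden, state_dim)

    def forward(self, x_hist):
        # x_hist: (batch, seq_len, 4) past human states [x, y, vx, vy]
        h = self.encoder(x_hist)
        out, _ = self.lstm(h)
        # one simple decoding choice: repeat the last hidden state over the horizon
        last = out[:, -1:, :].expand(-1, self.horizon, -1)
        return self.decoder(last)              # (batch, horizon, 4) future states
```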

B. Control Policy Training via Imitation Learning

We train an initial control policy for the robotic rollator following the concept of IL [42]. We benefit from the demonstrations of real patients to imitate the way they interact with the rollator while walking. The goal is to learn control actions for the robot as if the human was pushing the rollator in front of him/her. To this end, we have implemented the control policy network of Fig. 5. Following the narrative of Sec. IV, this network serves as the approximator f_θ of the true dynamics f, proposing velocity commands for the robot given the information about the human motion intention.

The proposed policy network is an LSTM-based sequence-to-sequence model using as input features the predicted human states x^H_{t:t+T} over a time window T, transformed w.r.t. the current robot state x^R_t (Fig. 5). We use two FC layers with a ReLU and a Dropout layer [43] (with probability p = 0.4) between them. The scope of the FC layers is to encode the initial features using static transformations, independently of the time dependencies modelled by the LSTM units. The main encoding-decoding is implemented by the two LSTM layers. The output is decoded by the final FC layer, which gives the control action for time t, i.e. the robot velocity vector u^R_t = [υ_t ω_t]^T.
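A minimal PyTorch sketch of the policy network of Fig. 5 follows, with the hidden sizes taken from the best 2FC+2LSTM column of Table I; the input dimension, the exact dropout placement and the output handling are assumptions.

```python
import torch
import torch.nn as nn

class ControlPolicy(nn.Module):
    """Fig. 5 sketch: 2 FC layers (ReLU + dropout between them), 2 LSTM layers, FC output head."""
    def __init__(self, in_dim=4, p_drop=0.4):
        super().__init__()
        # hidden sizes follow the best 2FC+2LSTM column of Table I; the rest is assumed
        self.fc = nn.Sequential(
            nn.Linear(in_dim, 2048), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(2048, 1024),
        )
        self.lstm1 = nn.LSTM(1024, 1024, batch_first=True)
        self.lstm2 = nn.LSTM(1024, 512, batch_first=True)
        self.head = nn.Linear(512, 2)          # output: linear and angular velocity [v_t, w_t]

    def forward(self, human_states):
        # human_states: (batch, T, in_dim) predicted human states expressed in the robot frame
        z = self.fc(human_states)
        z, _ = self.lstm1(z)
        z, _ = self.lstm2(z)
        return self.head(z[:, -1, :])          # control action for the current time step
```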

C. Control Policy Adaptation via Model-based Reinforcement Learning

Although IL can provide good results in predicting the control actions w.r.t. the ground-truth ones, its capacity is limited to the demonstrated data. Therefore, we use the policy learned via IL as an initial approximation, which is then adapted to unseen human motion patterns through RL. Fig. 3 and Algorithm 1 show an overview of the proposed MBRL scheme for policy adaptation. In this setting, we resort to the HuMIM network and the coupling parameters forecast model described in Sec. V-A to predict over a time horizon T the human states x^H_{t:t+T} and the desired coupling parameters ℓ_{t:t+T}, φ_{t:t+T}, derived from the current estimated stride length and human self-orientation. Those parameters form the desired coupling state x^d_t for the MPC controller in (1), as described in Sec. IV. At each time step we use the predicted human states provided by HuMIM, transformed in the respective robot frame, to sample N_s new velocities from the policy.


Algorithm 1 Model-based RL for coupled HR motion
Require: Training data and empty dataset D for aggregation
Require: Aggregation frequency K ∈ Z, MPC horizon T ∈ Z
Require: Pre-trained control policy f_θ
 1: for i = 1, ... do
 2:   if i mod K = 0 then
 3:     for t = 1, ..., T do
 4:       get future human states x^H_{t:t+T} via the HuMIM net
 5:       get desired coupling parameters ℓ_{t:t+T}, φ_{t:t+T}
 6:       sample N_s velocities from the policy u^R_t ∼ f_θ(x^R_t, x^H_{t:t+T}) and add exploration noise
 7:       perform MPC rollouts to find the optimal control sequence U^T_t using (1)
 8:       execute the first action u^R_t of the selected sequence U^T_t
 9:       add (x^R_t, x^H_{t:t+T}, u^R_t) to dataset D
10:   else
11:     perform fine-tuning of the policy of Fig. 5 using the aggregated data D

We use the dropout layer to apply Monte Carlo dropout sampling on the network's outputs, to take advantage of the network's uncertainty [43]. We also apply extra random exploration noise on the sampled velocities in order to widen the sample distribution.

Moreover, we apply random white noise on the estimated robot state x^R_t to simulate possible errors in real-case scenarios, such as localization errors, drifting, etc. Our aim is to learn policies for recovering the robot from false states by applying the best possible control action. The N_s velocity samples are used for the MPC rollouts by simulating trajectories over a finite time horizon T for each sample. The simulated trajectory with the minimum cost computed by (1) (i.e., the highest reward in the RL narrative) is selected, while only the first action from the selected control sequence is applied to the robot. Through re-planning at each time step, we compensate for possible model errors. The robot state, along with the applied control action and the current human states, (x^R_t, x^H_{t:t+T}, u^R_t) (Fig. 3, Alg. 1), are aggregated in a new, initially empty, dataset D intended to be used for policy adaptation by fine-tuning the network every K ∈ Z time steps. The on-policy data aggregation and retraining of the model adapts the policy to state-action tuples possibly previously unseen by the network, by augmenting the respective distributions and improving the controller's performance.

VI. EXPERIMENTAL RESULTS

A. Experimental setup & data

The data used in this work were collected at the Agaplesion Bethanien Hospital - Geriatric Center in Heidelberg, with the participation of fourteen patients. The participants presented moderate to mild mobility impairment, according to clinical evaluation. The subjects had to perform several everyday-life scenarios using a passive robotic rollator, used for the purpose of data collection, while wearing motion markers from a MoCap setup (Fig. 2). The subjects performed several experimental scenarios in a special hospital room, walking supported (i.e. holding on to the rollator) or unsupported (i.e. the rollator was "following" the human from a close distance without physical interaction). The data were collected by a Kinect v.1 sensor and a Hokuyo UBG-04LX-F01 laser sensor mounted on the rollator (Fig. 2). For the purpose of this work, we have employed 20,000 frames of MoCap data (synchronized to the laser frame rate, i.e. 0.028 sec/scan), and a dataset of approximately 5,000 frames of human CoM tracking data from four patients from the supported-mode scenarios, used for fine-tuning/testing our models and for training the MBRL. An extra dataset of 2,000 tracking frames was kept for the experimental testing of the MBRL framework. In the following, we provide detailed results that demonstrate the efficiency of the proposed models and the performance of the proposed MBRL method.

Fig. 6: (a) Example of a predicted human path with the HuMIM network. (b) Example of the predicted robot linear velocity from the IL policy network.

B. Evaluation of Human Motion Intention Prediction Models

Implementation of HuMIM network: We have trained HuMIM using the MoCap and tracking data with an 80%-20% partition for training and testing respectively, for 500 epochs with learning rate 10^-4 and weight decay 10^-4. For training, we have used the Stochastic Gradient Descent optimizer with the Mean Squared Error (MSE) loss (L2-loss) computed between the predicted x̂^H and the ground-truth x^H output features.
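For illustration, the reported optimizer settings translate roughly into the following PyTorch training loop, assuming the HuMIM sketch from Sec. V-A and a data loader yielding (past states, future states) pairs; both are assumptions, not details given in the paper.

```python
import torch

def train_humim(model, train_loader, epochs=500):
    """Train the prediction network with the reported SGD settings (lr and weight decay 1e-4)."""
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, weight_decay=1e-4)
    criterion = torch.nn.MSELoss()                     # L2 loss between predicted and ground-truth states
    for _ in range(epochs):
        for past_states, future_states in train_loader:   # assumed DataLoader of (input, target) pairs
            optimizer.zero_grad()
            loss = criterion(model(past_states), future_states)
            loss.backward()
            optimizer.step()
```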

Evaluation: To evaluate the HuMIM network, we compute the MSE loss for the training and testing datasets, which gives an indication of the overall prediction performance of our models. More specifically, the MSE training loss was 4·10^-4, while the testing loss for a T = 10 prediction horizon was 2·10^-3, meaning that our model provides a very good fit to the data and accurate future predictions of the human motion intention. Fig. 6a depicts an example of a predicted path w.r.t. the ground-truth human path, where the dashed line shows the forecast path for ten time steps.

Implementation of the Coupling Parameters forecast models: From our analysis, we have found that the stride length and the actual human-robot distance data are correlated, with correlation coefficient ρ = 0.972 and a mean difference between them of δℓ = 0.15 m. In following cases, δℓ is used as a constant bias added to the predictions of the desired separation, acting as a safety distance. For the implementation, we have used the same training parameters and loss function as for HuMIM.

Evaluation: For the desired coupling parameters, we acquired equally good model fits and MSE losses. For the relative bearing parameter φ_t, the training loss was 2.2·10^-3 rad and the testing loss 2.8·10^-3 rad, while for the relative separation (trained both on the demonstrated data and on the stride lengths extracted by our tracking framework), the training loss was 6·10^-3 m and the testing loss 10^-2 m. Those results show that our trained models can effectively predict human motion intentions in a robot-human joint walking framework.


Arch.         | 1 LSTM | 2 FC        | 1 FC + 2 LSTM     | 2 FC + 1 LSTM     | 2 FC + 2 LSTM
Hidden states | 512    | [1024, 512] | [2048, 1024, 512] | [2048, 1024, 512] | [2048, 1024, 1024, 512]
Train Loss    | 0.186  | 0.184       | 0.045             | 0.052             | 0.029
Test Loss     | 0.158  | 0.156       | 0.057             | 0.065             | 0.043

TABLE I: Evaluation results for various architecture designs for the Control Policy Network of Fig. 5 in the IL setting.

C. Evaluation of IL Control Policy

Implementation: For training the control policy network of Fig. 5 with IL, we have used the MoCap data, from which we have computed the human states and the robot's ground-truth velocities. For the IL training of the network we have not used HuMIM for human predictions, but rather packed the training data into time-overlapping feature vectors x^H_{t:t+T}. The network was trained using mini-batches of 512 clips, with an initial learning rate of 10^-3, momentum 0.9 and weight decay 10^-4. The learning rate is divided by 10 after half the epochs. We used the Adam optimizer, and as imitation loss we employed the L1-loss, i.e. the mean absolute error between the predicted û^R_t and the actual u^R_t robot velocities.
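The IL training configuration can be written down in a similarly hedged sketch; note that Adam has no explicit momentum parameter, so the reported momentum 0.9 is read here as the first-moment coefficient beta_1, which is an interpretation rather than a detail given in the paper.

```python
import torch

def make_il_optimizer(policy, epochs):
    """Optimizer, scheduler and loss matching the reported IL settings:
    Adam with lr 1e-3 and weight decay 1e-4, lr divided by 10 after half the epochs, L1 imitation loss."""
    optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3, betas=(0.9, 0.999), weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=epochs // 2, gamma=0.1)
    criterion = torch.nn.L1Loss()   # mean absolute error between predicted and ground-truth velocities
    return optimizer, scheduler, criterion
```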

Evaluation: Table I provides the training and testing L1 losses for the predicted control velocities w.r.t. the ground-truth ones for the IL setup. It is evident that the LSTM layers are a requisite for decoding the sequences of human motion. Moreover, the combination of FC and LSTM layers seems to provide the necessary encoding-decoding scheme for translating a predicted human trajectory (considering that humans move in a holonomic way) into linear and angular velocities for the robotic assistant. We choose the architecture with 2 FC and 2 LSTM layers, since it has the smallest prediction error. An example of the performance of the proposed policy network is depicted in Fig. 6b, where we compare the linear velocity predicted by the IL policy w.r.t. the ground truth from the testing dataset.

D. Evaluation of the MBRL Approach

Implementation: For the fine-tuning process of the control policy network of Fig. 5 according to the MBRL framework (Fig. 3), we use the Adam optimizer and the Huber loss, which is less sensitive to outliers in the data:

    L_ε = { (1/2) Σ ||û^R_t - u^R_t||²,     for |û^R_t - u^R_t| ≤ ε
          { ε |û^R_t - u^R_t| - (1/2) ε²,   otherwise                          (2)

where ε > 0 is a small value. Below, we evaluate different design decisions regarding the MBRL setup. For the MBRL training, we have employed 5,000 frames of tracking data from four patients, while for testing the controller's performance we have kept 2,000 frames from one patient unseen during training.
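In PyTorch terms, the loss of Eq. (2) corresponds, up to the sum and scaling convention, to the built-in smooth-L1 (Huber) loss; a minimal sketch of one fine-tuning step on an aggregated tuple follows, with all names assumed rather than taken from the authors' code.

```python
import torch

def fine_tune_step(policy, optimizer, batch):
    """One fine-tuning step on an aggregated (robot state, predicted human states, executed action) tuple.

    Uses PyTorch's SmoothL1Loss, which coincides with the Huber loss of Eq. (2) up to scaling.
    """
    criterion = torch.nn.SmoothL1Loss()
    x_robot, x_human_future, u_executed = batch
    optimizer.zero_grad()
    u_pred = policy(x_human_future)                # predicted action \hat{u}^R_t
    loss = criterion(u_pred, u_executed)
    loss.backward()
    optimizer.step()
    return loss.item()
```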

Evaluation results for human-following: For the task of human front-following, we evaluate the MBRL approach with tracking data from a new patient, i.e. unseen in all the training procedures described above. We aim to investigate how the control policy can adapt to a new patient with medium mobility impairment, as if the robot was following the user from the front, in a way such that the user was virtually pushing the robotic rollator. Fig. 8 presents graphs comparing paths, linear velocities and separation distances from the MBRL approach w.r.t. the actual data provided by the MoCap analysis.

Evaluation of MBRL training: For the MBRL training procedure, we have explored different design parameters to acquire the best possible solution to our problem. Fig. 7 presents the learning curves, which represent the cumulative costs for the task of human-robot coupled motion, for different design parameters. Since we are considering costs, the lower the cumulative cost, the better the performance of the corresponding MBRL setting. The best outcome from this evaluation is considered for testing with a new patient.

Specifically, Fig. 7a presents the impact of the number of samples N_s used for the MPC rollouts, for a range of 5-100 samples. The outcome seems reasonable, since for fewer samples (i.e. 5-20) the limited exploration by the controller leads to accumulating larger errors and thus costs. Interestingly, the learning curve for 50 samples behaves the same as the curve for 100 samples, while both parameter settings converge very quickly to a steady performance. The slopes of the curves for 50 and 100 samples show that after 250 aggregation steps we have a stabilized performance. We choose to use 50 samples as the computationally cheaper solution.

For selecting the number of epochs used at each aggregation step for adapting our policy, we have experimented with 20, 50 and 100 epochs. Fig. 7b shows the cumulative costs for 200 aggregation steps and 50 samples for the MPC. It is obvious that training for 100 epochs has the best performance; however, the 50-epoch setting follows closely, thus we select it for computational reasons.

In Fig. 7c we present the evaluation results for different aggregation frequencies K (Algorithm 1). From experimentation, we have found that aggregating and adapting the control policy every 10 time steps yields better performance for the proposed algorithm. It is important to note that those timings have been chosen to resemble timings in human gait. Specifically, we know from previous work [3] that approximately every 10 time frames a human performs a leg swing for stepping through, and about every 50 time frames a gait cycle is completed. Therefore, we notice that adaptation at each stepping yields lower cumulative costs.

In the same way, we have explored different time-horizon settings for our MPC (we have changed the settings of our prediction and policy networks accordingly for this experiment). Time horizons longer than T = 10 accumulate greater errors; longer horizons mean larger prediction errors from the human motion intention models, hence leading to greater errors in the policy estimation. Evidently, adaptation at a frequency relative to the gait's swing phase is more appropriate.

According to the above evaluation, we have decided to employ the implementation of MBRL using N_s = 50 samples for the MPC rollouts, 50 epochs for policy fine-tuning, an aggregation frequency of K = 10 time steps and a prediction horizon of T = 10. This implementation had an average Root MSE (RMSE) of 0.068 cm regarding the coupling error w.r.t. the actual one, and the average Huber loss for fine-tuning was 0.031.

Fig. 7: Learning curves for the MBRL method presenting the cumulative costs for various design parameters: (a) Number of sampled trajectories N_s. (b) Number of epochs used for fine-tuning the policy. (c) Aggregation frequency K. (d) Prediction horizon T.

Fig. 8: Testing results of the MBRL policy adaptation for a new patient. (a) Human, robot and ground-truth (GT) paths (RMSE 0.18 m, MBRL vs. rollator in data collection). (b) Comparison of human, robot and GT rollator linear velocities (RMSE 0.15 m/sec, MBRL vs. GT). (c) Comparison of the MBRL separation distance w.r.t. the GT human-rollator one (RMSE 0.22 m).

In Fig. 8a, we compare the paths performed by the tracked human for 500 time frames, the one derived by the MBRL process for the robot, and the ground-truth (GT) one derived from the MoCap data, when the patient pushed the rollator during the data-collection experiments. It is evident that the MBRL method provides a trajectory very close to the GT one. This is also demonstrated by the results in Fig. 8b, where we compare the linear velocities of the patient, the ones proposed by MBRL, and the actual rollator. The MBRL velocity decisions closely follow the human velocity patterns. It is interesting to mention that, in contrast to Fig. 6b presenting the IL results, where the policy followed the actual rollator velocities, we can now see that the policy has adapted to the actual motion pattern of the patient. There is, however, a small lag of approximately 20 time frames in detecting turning points, which might also be inherited from the HuMIM performance. Finally, we compare the separation distance in the human-robot coupling. Again, the MBRL policy follows the actual pattern, meaning also that using the stride length as inference for the desired separation distance is a valid assumption. Concluding this analysis, we provide results over all 2,000 tracking frames for the task of patient following. The average RMSE between the MBRL-proposed robot path and the GT rollator's is 0.18 m, and the RMSEs for velocities are 0.15 m/sec for the linear and 0.24 rad/sec for the angular velocity. The average RMSE for the human-robot separation distance w.r.t. the GT (i.e., human-rollator) one is 0.22 m.

For further evaluation of the proposed MBRL framework, we present a comparative analysis for the above example w.r.t. baseline methods. Those baseline methods are: (i) pure IL, as described in Sec. V-B, (ii) the MPC for the constrained optimization problem of Eq. (1), and (iii) the kinematic controller proposed in [13]. It should be noted that we have also experimented with a baseline model-free method used for continuous control [25]; however, the algorithm did not converge, given the limited amount of data. Notably, this is a significant remark, as the MBRL method overcomes the sample-inefficiency problem, being able to converge with a small amount of real data from a real problem setting.

Table II presents the results of the comparison of MBRL with the above-described baseline methods, showing the RMSE for the position of the robot w.r.t. the actual rollator in the data-collection experiments, the applied linear and angular velocities, and the separation distance between human and robot. When using IL, the policy fails to track the user closely, as it easily drifts away from the desired trajectory, causing high RMSEs. The IL network only proposes velocities, without taking into account the desired coupling in the human-robot formation. Moreover, since the IL network is limited by the training examples, it cannot generalize well to data that diverge from the training distribution. Thus, it is evident that adaptation of the model is important, especially when tracking people with pathological gait who present different mobility problems, an advantage that the MBRL method offers. The MPC presents slightly higher errors than MBRL (on average, approximately a 50% RMSE increase). MPC relies on linearization of the system's dynamics, which induces errors; therefore, user adaptation is not achieved. The proposed MBRL method, which incorporates the advantages of velocity adaptation along with the model-based controller, provides a better control policy. The kinematic controller could perform quite well for straight paths, but was unable to compensate for orientation changes and therefore did not keep the desired coupling, resulting in high errors. Despite the advantages of MBRL resulting from our previous analysis, a limitation of this method stems from the difficulty of learning a dynamics model sufficient to represent the real dynamics.


method    | RMSE position (m) | linear vel. (m/sec) | angular vel. (rad/sec) | separation distance (m)
IL        | 0.96              | 0.59                | 0.38                   | 0.78
MPC       | 0.27              | 0.23                | 0.31                   | 0.38
kinematic | 1.33              | 0.61                | 0.49                   | 0.89
MBRL      | 0.18              | 0.15                | 0.24                   | 0.22

TABLE II: Comparative evaluation of the MBRL performance w.r.t. three baseline methods.

Thus, for optimizing our framework, we plan to combine MBRL with meta-learning, which can go beyond the dependency on an accurate dynamics model to achieve adaptation.

VII. CONCLUSION & FUTURE WORK

We proposed a novel approach using MBRL for adapting the motion of a robotic assistant rollator to the walking patterns of elderly users and patients with various mobility impairments. The aim is to develop a control policy that proposes optimal control actions for coupling the robot's motion with each user, as if the user was pushing the rollator. In this setting, we consider the problem to be a front-following human-robot coupled motion. To this end, we have designed LSTM-based networks for predicting future human motion intentions and for forecasting the desired coupling parameters in the robot-human setting. An initial control policy that suggests control velocities, given the human kinematic state evolution over a short time horizon, was trained through IL. In the MBRL framework, we adapt this policy employing an MPC planner and using tracking data from patients. Through extensive experimentation with real data, we provide evidence that proves MBRL to be efficient as a decision-making approach for a user-adaptive controller of a robotic assistant rollator. In our future work, we plan to test different planning methods in the MBRL framework and a combination with model-free methods.

REFERENCES

[1] S. Dubowsky, F. Genot, S. Godding, H. Kozono, and A. Skwersky, "Pamm - a robotic aid to the elderly for mobility assistance and monitoring: a 'helping-hand' for the elderly," in ICRA 2000.
[2] A. Morris, R. Donamukkala, et al., "A robotic walker that provides guidance," in ICRA '03.
[3] G. Chalvatzaki, X. S. Papageorgiou, C. S. Tzafestas, and P. Maragos, "Augmented human state estimation using interacting multiple model particle filters with probabilistic data association," RA-L '18.
[4] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, "Realtime multi-person 2D pose estimation using part affinity fields," in CVPR '17.
[5] G. Chalvatzaki, P. Koutras, J. Hadfield, et al., "On-line human gait stability prediction using LSTMs for the fusion of deep-based pose estimation and LRF-based augmented gait state estimation in an intelligent robotic rollator," ICRA '19.
[6] A. G. Richards, "Robust constrained model predictive control," Ph.D. dissertation, 2004.
[7] A. Nagabandi, G. Kahn, R. S. Fearing, and S. Levine, "Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning," in ICRA '18.
[8] G. Williams, N. Wagener, B. Goldfain, et al., "Information theoretic MPC for model-based reinforcement learning," in ICRA '17.
[9] M. Geravand et al., "An integrated decision making approach for adaptive shared control of mobility assistance robots," Int'l J. of Social Robotics, '16.
[10] Y. Hirata et al., "Motion control of intelligent passive-type walker for fall-prevention function based on estimation of user state," in ICRA '06.
[11] C. A. Cifuentes and A. Frizera, Development of a Cognitive HRI Strategy for Mobile Robot Control, 2016.
[12] S. Scheggi et al., "Cooperative human-robot haptic navigation," in ICRA '14.
[13] G. Chalvatzaki, X. S. Papageorgiou, P. Maragos, and C. S. Tzafestas, "User-adaptive human-robot formation control for an intelligent robotic walker using augmented human state estimation and pathological gait characterization," in IROS '18.
[14] W. Xu, J. Huang, Y. Wang, C. Tao, and L. Cheng, "Reinforcement learning-based shared control for walking-aid robot and its experimental verification," Advanced Robotics '15.
[15] T. Kruse, A. K. Pandey, R. Alami, and A. Kirsch, "Human-aware robot navigation: A survey," Robotics and Autonomous Systems '13.
[16] P. Ciou, Y. Hsiao, Z. Wu, S. Tseng, and L. Fu, "Composite reinforcement learning for social robot navigation," in IROS '18.
[17] M. Fahad, Z. Chen, and Y. Guo, "Learning how pedestrians navigate: A deep inverse reinforcement learning approach," in IROS '18.
[18] M. Pfeiffer, G. Paolo, H. Sommer, et al., "A data-driven model for interaction-aware pedestrian motion prediction in object cluttered environments," in ICRA '18.
[19] Y. F. Chen, M. Everett, M. Liu, and J. P. How, "Socially aware motion planning with deep reinforcement learning," in IROS '17.
[20] T. Fan, X. Cheng, J. Pan, P. Long, et al., "Getting robots unfrozen and unlost in dense pedestrian crowds," RA-L '19.
[21] G. Ferrer, A. Garrell, and A. Sanfeliu, "Robot companion: A social-force based approach with human awareness-navigation in crowded environments," in IROS '13.
[22] E. Repiso, A. Garrell, and A. Sanfeliu, "Robot approaching and engaging people in a human-robot companion framework," in IROS '18.
[23] D. Lee, C. Liu, Y. Liao, and J. K. Hedrick, "Parallel interacting multiple model-based human motion prediction for motion planning of companion robots," IEEE T-ASE '17.
[24] C. Finn and S. Levine, "Deep visual foresight for planning robot motion," in ICRA '17.
[25] T. P. Lillicrap, J. J. Hunt, A. Pritzel, et al., "Continuous control with deep reinforcement learning." [Online]. Available: http://arxiv.org/abs/1509.02971
[26] P. Mirowski, R. Pascanu, F. Viola, et al., "Learning to navigate in complex environments." [Online]. Available: http://arxiv.org/abs/1611.03673
[27] J. Zhang, J. T. Springenberg, J. Boedecker, and W. Burgard, "Deep reinforcement learning with successor features for navigation across similar environments," in IROS '17.
[28] Y. Zhu, R. Mottaghi, E. Kolve, et al., "Target-driven visual navigation in indoor scenes using deep reinforcement learning," in ICRA '17.
[29] M. Pfeiffer, S. Shukla, M. Turchetta, et al., "Reinforced imitation: Sample efficient deep reinforcement learning for mapless navigation by leveraging prior demonstrations," RA-L '18.
[30] Y. Zhu, Z. Wang, and J. Merel et al.
[31] M. P. Deisenroth, G. Neumann, and J. Peters, "A survey on policy search for robotics," Foundations and Trends in Robotics, 2013.
[32] F. Ebert, C. Finn, S. Dasari, et al., "Visual foresight: Model-based deep reinforcement learning for vision-based robotic control." [Online]. Available: http://arxiv.org/abs/1812.00568
[33] D. Meger, J. C. G. Higuera, A. Xu, P. Giguère, and G. Dudek, "Learning legged swimming gaits from experience," in ICRA '15.
[34] H. Durrant-Whyte, N. Roy, and P. Abbeel, Learning to Control a Low-Cost Manipulator Using Data-Efficient Reinforcement Learning.
[35] K. Lee, K. Saigol, and E. Theodorou, "Safe end-to-end imitation learning for model predictive control." [Online]. Available: http://arxiv.org/abs/1803.10231
[36] B. Amos, I. Jimenez, J. Sacks, B. Boots, and J. Z. Kolter, "Differentiable MPC for end-to-end planning and control," 2018.
[37] I. Clavera, A. Nagabandi, et al., "Learning to adapt: Meta-learning for model-based control." [Online]. Available: http://arxiv.org/abs/1803.11347
[38] R. S. Sutton and A. G. Barto, Introduction to Reinforcement Learning. MIT Press, 1998.
[39] J. Peters and S. Schaal, "Reinforcement learning of motor skills with policy gradients," Neural Networks '08.
[40] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput. '97.
[41] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in ICML '10.
[42] S. Schaal, "Learning from demonstration," in NIPS '96.
[43] Y. Gal and Z. Ghahramani, "Dropout as a Bayesian approximation: Representing model uncertainty in deep learning," in ICML '16.