
Goal Inference Improves Objective and Perceived Performance in Human-Robot Collaboration*

Chang Liu†§, Jessica B. Hamrick†‡, Jaime F. Fisac†♦, Anca D. Dragan♦, J. Karl Hedrick§, S. Shankar Sastry♦, Thomas L. Griffiths‡

§ Department of Mechanical Engineering
‡ Department of Psychology
♦ Department of Electrical Engineering and Computer Sciences
University of California, Berkeley, Berkeley, California 94720

ABSTRACT

The study of human-robot interaction is fundamental to the design and use of robotics in real-world applications. Robots will need to predict and adapt to the actions of human collaborators in order to achieve good performance and improve safety and end-user adoption. This paper evaluates a human-robot collaboration scheme that combines the task allocation and motion levels of reasoning: the robotic agent uses Bayesian inference to predict the next goal of its human partner from his or her ongoing motion, and re-plans its own actions in real time. This anticipative adaptation is desirable in many practical scenarios, where humans are unable or unwilling to take on the cognitive overhead required to explicitly communicate their intent to the robot. A behavioral experiment indicates that the combination of goal inference and dynamic task planning significantly improves both objective and perceived performance of the human-robot team. Participants were highly sensitive to the differences between robot behaviors, preferring to work with a robot that adapted to their actions over one that did not.

Keywords

Human-Agent Interaction, Teamwork, Intention Inference, Collaborative Task Allocation

1. INTRODUCTION

Once confined to controlled environments devoid of human individuals, robots today are being pushed into our homes and workplaces by unfolding advances in autonomy. From sharing workcells [20], to assisting older adults [2, 19, 30], to helping in surveillance [13] and search and rescue [11, 27], robots need to operate in human environments, both with and around people.

† These authors contributed equally to this paper.
* This work is supported by ONR under the Embedded Humans MURI (N00014-13-1-0341).

Appears in: Proceedings of the 15th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2016), J. Thangarajah, K. Tuyls, C. Jonker, S. Marsella (eds.), May 9–13, 2016, Singapore.
Copyright © 2016, International Foundation for Autonomous Agents and Multiagent Systems (www.ifaamas.org). All rights reserved.

Figure 1: (a) Human avatar and the three types of robot avatar (Fixed, Reactive, and Predictive) used in the experiment. Circles with a single dot represent tasks that can be completed by any agent (either human or robot), and circles with two dots require simultaneous collaboration of both agents. (b) Illustration of a search-and-rescue collaboration scenario. (c) Illustration of a collaborative table-setting scenario.


While autonomous systems are becoming less reliant on human oversight, it is often still desirable for robots to collaborate with human partners in a peer-to-peer structure [5, 10, 26]. In many scenarios, robots and humans can fill complementary roles in completing a shared task, possibly by completing different sub-goals in a coordinated fashion (Fig. 1a). For example, in a search-and-rescue mission (Fig. 1b), human first responders might load injured persons onto an autonomous vehicle, which would then transport the victims to a medical unit. In a number of manipulation scenarios, humans and robots can take advantage of their different aptitudes and capabilities in perception, planning and actuation to boost performance and achieve better results.

Collaboration hinges on coordination and the ability to infer each other's intentions and adapt [28, 29]. There is an abundant history of failure cases in human-automation systems due to coordination errors and miscommunication between human and autonomous agents [22, 24, 25]. Such failures often result from an inability of autonomous systems to properly anticipate the needs and actions of peers; ensuring that autonomous systems have such an ability is critical to achieving good team performance [31]. Consider the collaborative manipulation scenario in Fig. 1c, in which both the human and the robot work together to set a table. It is important for the robot to anticipate the object that the human will take so that it can avoid hindering or even injuring the human.

Across robotics [7, 32], plan recognition [4], and cognitive science [1, 23], Bayesian goal inference has been applied to the problem of inferring human goals from their ongoing actions. Recently, we proposed a task allocation method which utilizes Bayesian goal inference at the motion level to enable the robot to adapt to the human at the task planning level [21]. However, the strategy detailed in this work has not been evaluated in collaboration scenarios involving real humans. Building on our initial work, we make the following contributions in this paper:

Evaluation: we evaluate the combination of two methods traditionally studied in isolation, namely applying goal inference at the motion level to compute a collaborative plan at the task level. This work differs from other user studies evaluating the use of Bayesian goal inference in human-robot interaction, which have focused on assisting humans to achieve an immediate goal [8, 9]. We developed and ran a behavioral experiment (Fig. 1a) in which participants collaborated with virtual robots that (i) performed goal inference and re-planned as the human was moving (the predictive robot); (ii) only re-planned after the person reached their goal (the reactive robot); or (iii) never re-planned unless the current plan became infeasible (the fixed robot). We also had two between-subjects conditions in which we varied the type of inference used by the robot. In the "oracle" condition, the predictive robot had direct access to the participant's intent, emulating perfect goal inference, while in the "Bayesian" condition, the predictive robot used probabilistic inference on noisy human trajectories (as would happen in a realistic setting). Building on literature that evaluates human-robot collaboration [6, 18], we evaluated performance both objectively and subjectively. Objective metrics included time to complete the task and the proportion of tasks completed by the robot; subjective metrics were based on elicited preferences for the robots, and a survey regarding perceptions of the collaboration. As expected, errors in prediction do increase task completion time and lower subjective ratings of the robot. Despite these errors, both conditions produce the same results: we find that a robot that anticipates human goals to adapt its overall task plan is significantly superior to one that does not, in terms of both objective performance and subjective perception.

Implication: Despite the nontrivial assumption that the human will plan optimally after the current goal, the behavioral study reveals that using this model for the robotic agent's adaptation greatly improves objective team performance (reducing task completion time and increasing the proportion of robot-completed goals). This may be due to the combination of swift adaptation to the human (which attenuates the impact of the optimality assumption) with an additional closed-loop synergy arising as the adapted robot plan becomes more likely to match the human's own expectation. Importantly, adaptation based on this model also significantly improves subjective perceptions: users patently notice, and appreciate, the agent's adaptive behavior. The implications of these results on human-robot interaction design are explored in the discussion.

2. TASK ALLOCATION DURING HUMAN-ROBOT COLLABORATION

To test the importance of inferring human intent and adapting in collaborative tasks, we use a task allocation scenario with both individual and shared tasks. Following [21], we consider a setting in which a human agent H and a robot agent R are in a compact planar domain $D \subset \mathbb{R}^2$ with $N$ one-agent tasks located at positions $p_i \in D$, $i = 1, \ldots, N$, and $M$ joint tasks with locations $q_j \in D$, $j = 1, \ldots, M$. One-agent tasks are executed instantly when visited by any agent, whereas joint tasks are only completed when visited by both agents simultaneously. The agent that arrives first at a joint task cannot leave until the other one comes to help complete the task. The human starts at position $h_0 \in D$ and the robot starts at $r_0 \in D$. Both agents are able to move in any direction at the same constant velocity $V$. The discrete-time dynamics of the human-robot team at each time step $k \ge 0$ are $h_{k+1} = h_k + u$ and $r_{k+1} = r_k + w$, respectively, where $\|u\| = \|w\| = V$, with $\|\cdot\|$ being the Euclidean norm. Both agents have the common objective of completing all tasks in the minimum amount of time. Each agent can observe its own location, the location of the other agent, and the location and type of all remaining tasks. However, agents cannot explicitly communicate with each other. This can be seen as a multi-agent version of the Traveling Salesman Problem [16].

2.1 Robot Collaboration Scheme

The scheme for integrating goal inference into task allocation presented in [21] achieves robustness to suboptimal human actions by adjusting its plan whenever the human's actions deviate from the robot's current internal allocation. At the initial time $k = 0$, the robot computes the optimal task allocation and capture sequence given the initial layout of tasks and agent positions (Section 2.3). The robot observes the human's actions as both human and robot move towards their respective goals. To infer whether the human is deviating from the robot's internal plan, the robot computes a maximum a posteriori (MAP) estimate (Section 2.2). If a deviation is detected, the robot computes a new allocation of tasks based on the current state and the inferred human intent.

When computing the optimal allocation, it is assumed that the human will in fact execute the resulting joint plan. In the ideal scenario where both agents are perfectly rational, this would indeed be the case (without the need for explicit coordination). Of course, this rationality assumption is far from trivial in the case of human individuals; we nonetheless make use of it as an approximate model, admitting incorrect predictions and providing a mechanism to adapt in these cases. It is worth noting that this model maximally exploits collaboration in those cases in which the prediction is correct: that is, if the human does follow the computed plan, then the team achieves the best possible performance at the task. As will be seen, adaptation based on this model leads to significant improvements in team performance.

2.2 Bayesian Goal Inference

Let $g$ denote the location of any of the $N + M$ tasks. Once the human starts moving towards a new task, the robot infers the human's intent as the task with the highest posterior probability, $P(g \mid h_{0:k+1})$, given the history of observations thus far, $h_{0:k+1}$. The posterior probability is updated iteratively using Bayes' rule:

$$P(g \mid h_{0:k+1}) \propto P(h_{k+1} \mid g, h_{0:k}) \, P(g \mid h_{0:k}),$$

where the prior $P(g \mid h_{0:k})$ corresponds to the robot's belief that the human is heading towards task $g$ based on previously observed positions $h_{0:k}$. The likelihood function $P(h_{k+1} \mid g, h_{0:k})$ corresponds to the transition probability for the position of the human given a particular goal. Under the Markov assumption, the likelihood function can be simplified as:

$$P(h_{k+1} \mid g, h_{0:k}) = P(h_{k+1} \mid g, h_k).$$

We compute the term $P(h_{k+1} \mid g, h_k)$ assuming that the human is noisily optimizing some reward function. Following [1], we model this term as a Boltzmann ('soft-max') policy:

$$P(h_{k+1} \mid g, h_k) \propto \exp\big(\beta V_g(h_{k+1})\big),$$

where $V_g(h)$ is the value function for each task location $g$. Following [21], we model $V_g(h)$ as a function of the distance between the human $h$ and the task, $d_g(h) = \|g - h\|$; i.e.

$$V_g(h) = \gamma^{d_g(h)} U - c\,\frac{\gamma - \gamma^{d_g(h)}}{1 - \gamma}.$$

This corresponds to the value function of a $\gamma$-discounted optimal control problem with terminal reward $U$ and uniform running cost $c$. The value function causes the Boltzmann policy to assign a higher probability to actions that take the human closer to the goal location $g$.

The parameter $\beta$ is a rationality index, capturing how diligent we expect the human to be in optimizing their reward. When $\beta = 0$, $P(h_{k+1} \mid g, h_k)$ gives a uniform distribution, with all actions equally likely independent of their reward (modeling an indifferent human agent). Conversely, as $\beta \to \infty$, the human agent is assumed to almost always pick the optimal action with respect to their current goal.
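To make the update concrete, the following Python sketch performs one step of this inference numerically. The task positions, parameter values, and the discretized set of candidate steps are illustrative assumptions rather than values used in the experiment; the value function and Boltzmann likelihood follow the equations above.

```python
import numpy as np

# Illustrative parameters (assumed for this sketch, not taken from the paper).
beta, gamma, U, c, V = 1.0, 0.9, 10.0, 1.0, 1.0

def value(h, g):
    """V_g(h) = gamma^d * U - c * (gamma - gamma^d) / (1 - gamma), with d = ||g - h||."""
    d = np.linalg.norm(g - h)
    return gamma**d * U - c * (gamma - gamma**d) / (1 - gamma)

def likelihood(h_next, h, g):
    """Boltzmann policy P(h_{k+1} | g, h_k) over a discrete set of candidate steps."""
    angles = np.linspace(0, 2 * np.pi, 16, endpoint=False)
    steps = h + V * np.stack([np.cos(angles), np.sin(angles)], axis=1)
    weights = np.exp(beta * np.array([value(s, g) for s in steps]))
    # Probability mass of the candidate step closest to the observed next position.
    idx = np.argmin(np.linalg.norm(steps - h_next, axis=1))
    return weights[idx] / weights.sum()

def update_posterior(prior, h, h_next, goals):
    """One step of Bayes' rule: P(g | h_{0:k+1}) proportional to P(h_{k+1} | g, h_k) P(g | h_{0:k})."""
    post = np.array([likelihood(h_next, h, g) for g in goals]) * prior
    return post / post.sum()

goals = [np.array([5.0, 0.0]), np.array([0.0, 5.0])]    # two candidate task locations
prior = np.ones(len(goals)) / len(goals)                # uniform initial belief
h, h_next = np.array([0.0, 0.0]), np.array([0.9, 0.4])  # observed human motion
print(update_posterior(prior, h, h_next, goals))        # belief shifts toward the first task
```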

2.3 Task Allocation

We compute the optimal allocation and execution order of tasks that minimizes completion time using a mixed-integer linear program (MILP). Let $A$ denote the set of human and robot agents and $G$ represent the set of remaining tasks. Then, the task allocation problem can be formulated as:

$$\min_{x_{\mathrm{alloc}},\, t_{\mathrm{alloc}}} \; \max_{v \in A,\, g \in G} \; t_{vg} \qquad \text{subject to} \quad x_{\mathrm{alloc}} \in X_{\mathrm{feasible}},\; t_{\mathrm{alloc}} \in T_{\mathrm{feasible}},$$

where $t_{vg}$ denotes the time when the $v$th agent completes task $g$; $x_{\mathrm{alloc}}$ encodes the allocation of tasks to each agent and $t_{\mathrm{alloc}}$ represents the times at which agents arrive at and leave tasks; and $X_{\mathrm{feasible}}$ and $T_{\mathrm{feasible}}$ denote the corresponding sets of feasible values that satisfy the requirements on completing one-agent and joint tasks. To generate an allocation that respects the intent of the human agent, the inferred goal is assigned as the first task for the human, which is enforced in $X_{\mathrm{feasible}}$. Other constraints of $X_{\mathrm{feasible}}$ include: 1) each task is assigned to the human or the robot; 2) an agent can only visit and complete a task once; 3) an agent can leave a task position at most once and can only do so after visiting it (continuity condition); and 4) no subtour exists for any agent. The timing constraints of $T_{\mathrm{feasible}}$ require that a joint task is completed only once both the human and the robot have visited it, while a one-agent task is executed immediately when an agent visits it. Readers interested in details of the MILP formulation of this particular traveling salesman problem can refer to [21]. In this study, the MILP is solved with the commercial solver Gurobi [15].
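The full MILP is given in [21]. Purely for intuition about the min-max objective, the following Python sketch brute-forces a tiny instance with one-agent tasks only: it enumerates task assignments and visit orders and keeps the allocation with the smallest latest completion time. It omits joint tasks and the constraint fixing the human's inferred first task, so it is a toy stand-in rather than the solver used in this study.

```python
import itertools
import numpy as np

V = 1.0  # common speed of both agents

def completion_times(start, ordered_tasks):
    """Times at which one agent, moving at speed V, completes its tasks in order."""
    times, pos, t = [], np.asarray(start, float), 0.0
    for task in ordered_tasks:
        t += np.linalg.norm(np.asarray(task, float) - pos) / V
        times.append(t)
        pos = np.asarray(task, float)
    return times

def best_allocation(human0, robot0, tasks):
    """Brute force over assignments and orders, minimizing the latest completion time."""
    best = (np.inf, None)
    for mask in itertools.product([0, 1], repeat=len(tasks)):      # 0 -> human, 1 -> robot
        human_tasks = [t for t, m in zip(tasks, mask) if m == 0]
        robot_tasks = [t for t, m in zip(tasks, mask) if m == 1]
        for h_order in itertools.permutations(human_tasks):
            for r_order in itertools.permutations(robot_tasks):
                t_all = completion_times(human0, h_order) + completion_times(robot0, r_order)
                makespan = max(t_all) if t_all else 0.0
                if makespan < best[0]:
                    best = (makespan, (h_order, r_order))
    return best

tasks = [(1, 0), (4, 0), (0, 3)]
print(best_allocation((0, 0), (5, 0), tasks))
```

For realistic problem sizes this enumeration is intractable, which is why the study formulates the problem as a MILP and solves it with Gurobi.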

2.4 Evaluation

The method outlined above was evaluated in [21] through simulations in which human behavior was modeled as a mixture of optimality and randomness. Specifically, humans were modeled assuming that they would take the optimal action with probability $\alpha$, or take an action uniformly at random with probability $1 - \alpha$. However, as true human behavior is much more sophisticated and nuanced than this, it is unclear how well these results would generalize to real humans. Moreover, the simulation results tell us little about what people's perceived experiences might be. Even if Bayesian goal inference does decrease the time needed to complete a set of tasks with real people, it is important that those people subjectively perceive the robot as being helpful and the collaboration as being fluent, and that they feel confident that they can rely on the robot's help and comfortable collaborating with it.

To more rigorously test the robot's task allocation strategy, we ran a behavioral experiment in which we asked human participants to collaborate with three different robots, as shown in Fig. 1a, one at a time and in a counterbalanced order: a robot that did not re-plan its strategy regardless of the human's choices (the fixed robot), a robot that reacted to these choices only once the human reached the goal in question (the reactive robot), and a robot that anticipated the human's next goal at each step and adapted its behavior early (the predictive robot). We conducted the experiment under two different conditions: one with an oracle-predictive robot that can make perfect inferences of the human's intent, and a second with a Bayesian-predictive robot, in which the Bayesian intent inference is at times erroneous (as it will be in real human-robot interactions).

3. EXPERIMENTAL DESIGN

3.1 Task

We designed a web-based human-robot collaboration experiment in which the human participant was in control of an avatar on a screen. In this experiment, the human and virtual robot had to complete a set of tasks as quickly as possible. While some of the tasks could be completed by either the human or the robot (one-agent tasks), others required both agents to complete the task simultaneously (joint tasks). The participant chose to complete tasks by clicking on them, causing the human avatar to move to that task while the robot moved towards a task of its choice. After the human completed the chosen task, they chose another task to complete, and so on until all tasks were completed.

3.2 Independent Variables

To evaluate the collaboration scheme described in Section 2, we manipulated two variables: adaptivity and inference type.

3.2.1 Adaptivity

We manipulated the amount of information that the robot receives during the task, which directly affects how adaptive the robot is to human actions. The following robots vary from least adaptive to most adaptive:

Fixed: This robot receives no information about the human's goal. It runs an initial optimal task allocation for the human-robot team and subsequently executes the robot part of the plan irrespective of the human's actions. It only re-plans its task allocation if an inconsistency or deadlock is reached. For example, if the human finishes a task that the robot had internally allocated to itself, the fixed robot removes the task from its list of pending tasks and chooses the next pending task as the goal. Additionally, if both agents reach different joint tasks, the robot moves to the human's location and re-plans its strategy to account for this deviation.

Reactive: This robot receives information about the human's goal only when the human reaches their goal. At this point, the robot reruns task allocation for the team assuming that the human will behave optimally for the remainder of the trial.

Predictive: This robot receives information about the human's goal before they reach the goal. After receiving this information, the robot recomputes the optimal plan, again assuming optimal human behavior thereafter.

3.2.2 Inference Type

We also manipulated the type of inference used by the predictive robot:

Oracle-predictive: This robot receives perfect knowledge of the human's next goal as soon as the human chooses it. Although there are real-world cases in which a human might inform a robot about its goal immediately after deciding, it is rare for a robot to know exactly the human's goal without having the human explicitly communicate it.

Bayesian-predictive: This robot continuously receives information related to the human's goal by observing the motion of the human. Based on this motion, the robot attempts to predict the human's goal using the Bayesian inference scheme described in Section 2.2. In most cases of human-robot collaboration, the robot will only have access to the human's motions.

3.3 Experimental Conditions

We manipulated the independent variables of adaptivity and inference type in a 3 × 2 mixed design in which inference type was varied between-subjects, while adaptivity was varied within-subjects. All participants collaborated with the fixed and reactive robots (corresponding to the low adaptivity and medium adaptivity conditions, respectively). For the high adaptivity conditions, half the participants were assigned to an oracle condition, in which they collaborated with the oracle-predictive robot; and half were assigned to a Bayesian condition, in which they collaborated with the Bayesian-predictive robot. This design is given in Table 1 for reference.

Table 1: Experimental Conditions

Inference Type        Adaptivity           Robot
(between-subjects)    (within-subjects)
------------------------------------------------------------
-                     Low Adaptivity       Fixed
-                     Med. Adaptivity      Reactive
Oracle                High Adaptivity      Oracle-predictive
------------------------------------------------------------
-                     Low Adaptivity       Fixed
-                     Med. Adaptivity      Reactive
Bayesian              High Adaptivity      Bayesian-predictive


The motivation for this design was to prevent participants from having to keep track of too many robots (as would be the case in a 4 × 1 design), while still creating the experience of comparing the predictive robot with non-predictive baselines.

3.4 Procedure

3.4.1 Stimuli

There were 15 unique task layouts, each consisting of a collection of five or six tasks, as well as initial positions of the human and robot avatars. As shown in Fig. 1a, one-agent tasks were depicted by an orange circle with a single dot inside it, and joint tasks were depicted by a cyan circle with two dots inside it. The human avatar was a gender-neutral cartoon of a person on a scooter, and the robot avatars were images of the same robot in different poses and colors (either red, blue, or yellow), which were counterbalanced across participants.

3.4.2 Trial Design

First, we presented participants with four example trials to help them understand the task, which they completed with the help of a gray robot. Each experimental layout was presented to participants three times over the course of the experiment (once with each of the fixed, reactive, and predictive robots). The trials were grouped into five blocks of nine trials (45 trials total). Within each block, participants completed three trials with one of the robots, then three with a different robot, and then three with the final robot. The order of the robots and the task layouts was randomized according to a Latin-square design, such that the order of the robots varied and no task layout was repeated within a block.

3.4.3 Layout Generation

We randomly generated task layouts according to the following procedure. First, we assumed a model of human goal selection that sets the probability of choosing the next goal task as inversely proportional to the distance between the position of the human agent and the tasks.¹ Under this model, we calculated the expected task completion time of 104 randomly generated layouts with the reactive and Bayesian-predictive robots. We then computed the rank ordering of these completion times relative to all possible completion times for each layout, and computed the ratio of the rank order of the Bayesian-predictive robot to the rank order of the reactive robot. This ratio gave us a measure of how differently we would expect the two robots to behave: layouts with a ratio greater than 1.5 (9 layouts) or less than 0.6 (6 layouts) were selected, giving a total of 15 layouts for each robot. Subjects completed 5 blocks of 9 layouts, totaling 45 layouts.

¹ Specifically, we used a Boltzmann policy to model human task selections. We fit the tuning parameter from participants' choices in an earlier pilot experiment using maximum-likelihood estimation, resulting in a best-fitting parameter value of $\beta = 1.05$.



3.4.4 Avatar Animation

When moving to complete a task, the robot avatar always moved in a straight line towards the task. For the human avatar, however, the trajectory followed was not a straight path, but a Bézier curve that initially deviated to one side and then gradually corrected its course towards the correct task location. The reason for using this trajectory was to allow the Bayesian-predictive robot to make errors: if the human always moved in a straight line, then the Bayesian-predictive robot would almost always infer the correct goal right away (and thus would give nearly identical results as the oracle-predictive robot). Furthermore, this choice of curved trajectory is a more realistic depiction of how actual humans move, which is not usually in a straight line [3].
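As an illustration of how such a trajectory can be generated, the sketch below builds a quadratic Bézier curve whose middle control point is offset to one side of the straight-line path, so the avatar initially veers away and then converges on the goal. The function and its parameters are hypothetical; this is not the animation code used in the experiment.

```python
import numpy as np

def curved_trajectory(start, goal, side_offset=0.3, n_points=50):
    """Quadratic Bezier from start to goal whose control point is pushed sideways,
    so the path initially deviates and then corrects toward the goal."""
    p0, p2 = np.asarray(start, float), np.asarray(goal, float)
    direction = p2 - p0
    # Unit normal to the straight-line path.
    normal = np.array([-direction[1], direction[0]])
    normal /= np.linalg.norm(normal)
    p1 = p0 + 0.5 * direction + side_offset * np.linalg.norm(direction) * normal
    t = np.linspace(0.0, 1.0, n_points)[:, None]
    return (1 - t) ** 2 * p0 + 2 * (1 - t) * t * p1 + t ** 2 * p2

path = curved_trajectory((0, 0), (10, 0))
print(path[:3], path[-1])   # starts near (0, 0), ends exactly at (10, 0)
```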

3.4.5 Attention Checks

After reading the instructions, participants were given an attention check in the form of two questions asking them the color of the one-agent tasks and joint tasks. At the end of the experiment, we also asked them whether they had been paying attention to the difference in helpfulness between the three robots.

3.4.6 Controlling for Confounds

We controlled for confounding factors by counterbalancing the colors of the robots (Section 3.4.1); by using a different color robot for the example trials (Section 3.4.2); by randomizing the trial order (Section 3.4.2); by generating stimuli that give advantages to different robots (Section 3.4.3); by animating the human avatar in a realistic way to differentiate between the Bayesian-predictive and oracle-predictive robots (Section 3.4.4); and by including attention checks (Section 3.4.5).

3.5 Dependent Measures

3.5.1 Objective measures

During each trial, a timer on the right side of the screen kept track of the time it took to complete the tasks. This timer was paused whenever the participant was prompted to choose the next task and only counted up when either the human avatar or the robot avatar was moving. The final time on the timer was used as an objective measure of how long it took participants to complete each trial. We also tracked how many one-agent tasks were completed by the robot, the human, or both simultaneously.

3.5.2 Subjective measures

After every nine trials, we asked participants to choose the robot that they most preferred working with based on their experience so far. At the end of the experiment, we also asked participants to fill out a survey regarding their subjective experience working with each of the robot partners (Table 2). These survey questions were based on Hoffman's metrics for fluency in human-robot collaborations [17], and were answered on a 7-point Likert scale from 1 ("strongly disagree") to 7 ("strongly agree"). We also included two forced-choice questions, which had choices corresponding to the three robots.

Table 2: Subjective Measures

Fluency
1. The human-robot team with the [color] robot worked fluently together.
2. The [color] robot contributed to the fluency of the team interaction.

Contribution
1. (Reversed) I had to carry the weight to make the human-robot team with the [color] robot better.
2. The [color] robot contributed equally to the team performance.
3. The performance of the [color] robot was an important contribution to the success of the team.

Trust
1. I trusted the [color] robot to do the right thing at the right time.
2. The [color] robot is trustworthy.

Capability
1. I am confident in the ability of the [color] robot to help me.
2. The [color] robot is intelligent.

Perceptiveness (originally 'Goals')
1. The [color] robot perceived accurately what my goals were.
2. (Reversed) The [color] robot did not understand what I was trying to accomplish.

Forced-Choice Questions
1. If you were to redo this experiment with only one of the robots, which would you most prefer to work with?
2. If you were to redo this experiment with only one of the robots, which would you least prefer to work with?


3.6 Hypotheses

We hypothesized that adaptivity and inference type would affect both objective measures (Section 3.5.1) and subjective measures (Section 3.5.2).

H1 - Task Completion Time: Robot adaptivity will negatively affect task completion times, with the least adaptive robots resulting in the slowest completion times, and the most adaptive robots resulting in the fastest completion times. The Bayesian-predictive robot will result in slower completion times than the oracle-predictive robot.

H2 - Robot Task Completion: Robot adaptivity will positively affect the number of tasks that the robot completes, with more adaptive robots completing more tasks than the less adaptive robots. The Bayesian-predictive robot will complete fewer tasks than the oracle-predictive robot.

H3 - Robot Preference: Robot adaptivity will positively affect participants' preferences, with more adaptive robots being preferable to less adaptive robots. The Bayesian-predictive robot will be less preferable than the oracle-predictive robot. Additionally, preferences will become stronger as participants gain more experience with each robot.

H4 - Perceptions of the Collaboration: Robot adaptivity will positively affect participants' perceptions of the collaboration as indicated by the subjective survey measures, with more adaptive robots being rated higher than less adaptive robots. Inference type will also affect perceptions, with the Bayesian-predictive robot scoring lower than the oracle-predictive robot.

3.7 Participants

We recruited a total of 234 participants from Amazon's Mechanical Turk using the psiTurk experimental framework [14]. We excluded 32 participants from analysis for failing the attention checks, as well as 6 due to an experimental error, leaving a total of N = 196 participants whose data we used. All participants were treated in accordance with local IRB standards and were paid $1.20 for an average of 14.7 minutes of work, plus an average bonus of $0.51. Bonuses could range from $0.00 to $1.35 depending on performance: for each trial, participants could get a $0.03 bonus if they completed all tasks in the shortest possible amount of time; a $0.02 bonus if they finished within the top 5% fastest times; or a $0.01 bonus if they finished within the top 10% fastest times.² They received no bonus if they were slower than the top 10% fastest times.

² When comparing completion times to the range of possible times, we always compared the participants' completion times with the range of all times achievable with the robot they were currently collaborating with.

4. RESULTS

4.1 Manipulation Checks

The goal of changing the inference type was ultimately to influence the number of errors made by the robot. To ensure that the Bayesian-predictive robot did make more errors than the oracle-predictive robot, we counted the proportion of timesteps on each trial during which the Bayesian-predictive robot had an incorrect belief about the human's goal. We then constructed a mixed-effects model for these error rates with only the intercept as a fixed effect, and both participants and stimuli as random effects. The intercept was 0.27 ± 0.021 SE (t(14.043) = 12.83, p < 0.001), indicating an average error rate of about 27%. We also looked at the stimuli individually by reconstructing the same model, except with stimuli as fixed effects. Using this model, we found that error rates ranged from 15.0% ± 1.18% SE to 41.0% ± 1.18% SE.

4.2 H1 - Task Completion Time

First, we looked at the objective measure of task completion time, which is also shown in Fig. 2. To analyze task completion time, we first found the average completion time for each participant across stimuli, then constructed a linear mixed-effects model for these times using the inference type and adaptivity as fixed effects and the participants as random effects. We found main effects of both inference type (F(2, 194) = 10.30, p < 0.01) and adaptivity (F(2, 388) = 709.27, p < 0.001), as well as an interaction between the two (F(2, 388) = 10.07, p < 0.001).
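The paper does not state which statistical software was used; purely as a hedged sketch of this type of analysis, the snippet below fits a linear mixed-effects model with adaptivity, inference type, and their interaction as fixed effects and per-participant random intercepts, on synthetic stand-in data rather than the real measurements.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data: per-participant mean completion times (seconds).
rng = np.random.default_rng(0)
rows = []
for pid in range(60):
    inference = "oracle" if pid < 30 else "bayesian"
    for adaptivity, base in [("low", 7.5), ("medium", 7.0), ("high", 6.3)]:
        rows.append(dict(participant=pid, inference=inference, adaptivity=adaptivity,
                         time=base + rng.normal(0, 0.3)))
df = pd.DataFrame(rows)

# Fixed effects: adaptivity, inference type, and their interaction;
# random effects: per-participant intercepts.
model = smf.mixedlm("time ~ C(adaptivity) * C(inference)", df, groups=df["participant"])
print(model.fit().summary())
```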

We performed post-hoc comparisons using the multivariate t method for p-value adjustment. In support of H1, we found that the reactive robot led to completion times that were 0.52 s ± 0.029 SE faster than the fixed robot (t(388) = 17.777, p < 0.001). Also in support of H1, the oracle-predictive robot led to completion times that were 0.69 s ± 0.041 SE faster than the reactive robot (t(388) = -16.785, p < 0.001), and the Bayesian-predictive robot led to completion times that were 0.47 s ± 0.041 SE faster than the reactive robot (t(388) = -11.339, p < 0.001). There was no evidence for a difference between the fixed robots between participants (t(329.53) = -1.528, p = 0.61), nor between the reactive robots (t(329.53) = -1.756, p = 0.46). There was, however, a difference between the oracle-predictive and Bayesian-predictive robots (t(329.53) = -5.03, p < 0.001), with the oracle-predictive robot resulting in completion times that were faster by 0.34 s ± 0.067 SE. The combination of these results fully supports H1: when adaptivity increases or the number of errors in inference decreases, task completion time decreases.

Figure 2: Task completion time, relative to optimal: comparison of mean completion times (in seconds) for the different robots in the Oracle and Bayesian conditions. The y-axis begins at the average optimal completion time, and thus represents a lower bound. Error bars are bootstrapped 95% confidence intervals of the mean using 1,000 samples.


4.3 H2 - Robot Task Completion

Next, we looked at the additional objective measure of how many one-agent tasks were completed by the robot.³ For each participant, we counted the total number of one-agent tasks completed by the robot throughout the entire experiment. These numbers are shown in Fig. 3. To analyze these results, we constructed a linear mixed-effects model for the number of tasks with inference type and adaptivity as fixed effects and the participants as random effects. There was no evidence for a main effect of inference type on the number of tasks completed by the robot (F(2, 194) = 0.905, p = 0.34), but there was an effect of adaptivity on the number of tasks completed by the robot (F(2, 388) = 171.15, p < 0.001), as well as an interaction between adaptivity and inference type (F(2, 388) = 9.237, p < 0.001).

To investigate these effects further, we performed post-hoc comparisons with multivariate t adjustments. Consistent with H2, we found that the reactive robot completed 1.2 ± 0.24 SE more tasks than the fixed robot (t(388) = 4.969, p < 0.001); the oracle-predictive robot completed 2.3 ± 0.34 SE more tasks than the reactive robot (t(388) = 6.722, p < 0.001); and the Bayesian-predictive robot completed 4.0 ± 0.35 SE more tasks than the reactive robot (t(388) = 11.566, p < 0.001). Surprisingly, however, in comparing between subjects, we found the opposite of H2: the Bayesian-predictive robot completed 1.7 ± 0.58 SE more tasks than the oracle-predictive robot (t(316.19) = 2.903, p < 0.05).

³ Sometimes the robot and human could complete the same one-agent task simultaneously. We excluded these cases from the number of one-agent tasks completed by the robot, because for these tasks the robot was not actually providing any additional help. We found that the robot and human competed for the same task 1.5%, 8.6%, 14.6%, and 18.1% of the time in the oracle-predictive, Bayesian-predictive, reactive, and fixed conditions, respectively.


Figure 3: Number of one-agent tasks completed by the robot for each robot type in the Oracle and Bayesian conditions. The dotted lines represent the robot and the human completing an equal proportion of tasks. Error bars are bootstrapped 95% confidence intervals of the mean using 1,000 samples.


These results confirm one part of H2 (that adaptivity increases the number of tasks completed by the robot) but disconfirm the other part (that Bayesian inference decreases the number of tasks completed by the robot). However, we note that the effects are quite small between the different robots: the differences reported are on the order of 1–4 tasks for the entire experiment, which had a total of 39 one-agent tasks. Each trial only had two or three one-agent tasks, and thus these results may not generalize to scenarios that include many more tasks. Furthermore, although we found that the Bayesian inference type increased the number of tasks completed by the robot, we speculate that this may be due to the Bayesian-predictive robot inferring the incorrect human goal and therefore taking some other one-agent task. However, further investigation is required to determine the actual source of this difference.

4.4 H3 - Robot Preference

We next analyzed the subjective measure of participants' preferences for which robot they would rather work with.

4.4.1 Preferences over time

Fig. 4 shows the proportion of participants choosing each robot as a function of trial. We constructed a logistic mixed-effects model for binary preferences (where 1 meant the robot was chosen, and 0 meant it was not) with inference type, adaptivity, and trial as fixed effects and participants as random effects. Using Wald's tests, we found a significant main effect of adaptivity (χ²(2) = 16.7184, p < 0.001) and trial (χ²(1) = 5.1104, p < 0.05), supporting H3. We also found an interaction between adaptivity and trial (χ²(1) = 9.2493, p < 0.01). While the coefficient for trial was negative (β = -0.022 ± 0.0095 SE, z = -2.261, p < 0.05), the coefficient for the interaction of trial and the predictive robot was positive (β = 0.036 ± 0.0120, z = 3.017, p < 0.01). This also supports H3: if preferences for the predictive robot increase over time, by definition the preferences for the other robots must decrease, thus resulting in an overall negative coefficient.

Although we did not find a significant main effect of inference type (χ²(1) = 1.457, p = 0.23), we did find an interaction between inference type and adaptivity (χ²(2) = 7.7208, p < 0.05). Post-hoc comparisons with the multivariate t adjustment indicated that there was a significant overall improvement in preference between the fixed and reactive robots (z = -5.23, p < 0.001), as well as an improvement in preference between the reactive and oracle-predictive robots (z = 11.86, p < 0.001) and between the reactive and Bayesian-predictive robots (z = 6.95, p < 0.001). While there was no significant difference between participants for the fixed robot (z = 0.40, p = 0.998), there was a significant difference for the reactive robot (z = -2.93, p < 0.05) and a marginal difference between the predictive robots (z = 2.52, p = 0.098), with the oracle-predictive robot being slightly more preferable than the Bayesian-predictive robot.

Figure 4: Proportion of participants preferring each robot (fixed, reactive, or predictive) as a function of trial, shown separately for the Oracle and Bayesian conditions. Error bars are bootstrapped 95% confidence intervals of the mean using 1,000 samples.


4.4.2 Final rankings

For each participant, we assigned each robot a score based on their final rankings. The best robot received a score of 1; the worst robot received a score of 2; and the remaining robot received a score of 1.5. We constructed a logistic mixed-effects model for these scores, with inference type and adaptivity as fixed effects and participants as random effects; we then used Wald's tests to check for effects. In partial support of H3, there was a significant effect of adaptivity on the rankings (χ²(2) = 52.352, p < 0.001) but neither a significant effect of inference type (χ²(1) = 1.149, p = 0.28) nor an interaction between adaptivity and inference type (χ²(2) = 3.582, p = 0.17). Post-hoc comparisons between the robot types were adjusted with Tukey's HSD and indicated a preference for the reactive robot over the fixed robot (z = 5.092, p < 0.001) and a preference for the predictive robots over the reactive robot (z = 5.634, p < 0.001).

4.5 H4 - Perceptions of the Collaboration

Finally, we looked at the subjective measures of participants' perceptions of the collaboration. For each robot and measure (capability, contribution, fluency, perceptiveness, and trust), we averaged each participant's responses to the individual questions, resulting in a single score per participant, per measure, per robot. The results are shown in Fig. 5. We constructed a linear mixed-effects model for the survey responses with adaptivity, inference type, and measure type as fixed effects, and with participants as random effects. We found significant main effects of both adaptivity (F(2, 2730) = 543.18, p < 0.001) and measure type (F(4, 2730) = 5.28, p < 0.001). While there was no significant main effect of inference type (F(2, 195) = 0.08, p = 0.78), there was an interaction between adaptivity and inference type (F(2, 2730) = 17.17, p < 0.001). We did not find interactions between inference type and measure type (F(4, 2730) = 0.14, p = 0.97), between adaptivity and measure type (F(8, 2730) = 0.38, p = 0.93), or between adaptivity, inference type, and measure type (F(8, 2730) = 0.14, p = 0.997).


Figure 5: Findings for the subjective measures (capability, contribution, fluency, perceptiveness, and trust), shown as average Likert scores for each robot in the Oracle and Bayesian conditions. Error bars are bootstrapped 95% confidence intervals of the mean.


We investigated these effects further through a post-hoc analysis with multivariate t corrections. In support of H4, we found that participants rated the reactive robot higher than the fixed robot by 0.72 ± 0.058 SE points on the Likert scale (t(2730) = 12.415, p < 0.001). They also rated the oracle-predictive robot higher than the reactive robot by 1.44 ± 0.082 SE points (t(2730) = 17.568, p < 0.001), the Bayesian-predictive robot higher than the reactive robot by 0.91 ± 0.083 SE points (t(2730) = 11.064, p < 0.001), and the oracle-predictive robot higher than the Bayesian-predictive robot by 0.36 ± 0.112 points (t(470.31) = 3.267, p < 0.05).

5. DISCUSSION

In this paper we set out to answer three questions. First, does an increase in robot adaptivity lead to an objective improvement in real human-robot collaboration scenarios? Second, even if performance improves objectively, do people actually notice a difference? Third, if the robot does not have perfect knowledge of the human's goals and must infer them, does adaptivity still result in the same improvements?

Implications. Our results suggest that outfitting robots with the ability to adaptively re-plan their actions from a motion-based inference of human intent can lead to a systematic improvement in both the objective and perceived performance in human-robot collaboration. We found that this type of adaptivity increases performance objectively by lowering the amount of time it takes the team to complete the tasks and by improving the balance of work between the two agents. Subjectively, we found that participants preferred motion-based predictive adaptation to purely reactive adaptation at the task level (which, as expected, was in turn preferred to no adaptation), and that they rated the more adaptive robot higher on a number of measures including capability, contribution, fluency, perceptiveness, and trust. Unsurprisingly, introducing prediction errors through manipulation of the inference type generally decreased both objective performance and subjective ratings. What is surprising, however, is how small this decrease in performance was, particularly given that the Bayesian-predictive robot had an error rate ranging from a low of 15% to a high of 40% for some stimuli. This suggests that an attempt to predict and adapt to human goals, even under limited accuracy, can have a significant effect on the team's objective performance and the human's perception of the robot.

Another non-obvious observation is that the adaptation mechanism is successful in spite of the strong assumption that the human agent will behave optimally from the current state. In practice, the impact of this assumption is limited by the fact that the robot quickly adapts whenever the human does not follow this optimal plan; this is comparable to how model-predictive control can use a nominal system model and correct for model mismatch at each re-planning step. Additional effects may also explain the observed improvement. In particular, our model of human task planning explicitly accounts for the joint actions of the two agents (as opposed to only their own); because human inferences of other agents' intentions have been shown to be consistent with models of approximately rational planning [1], it is therefore likely that the robot's adapted actions closely match users' actual expectations. In addition, the fact that the robot adapts by switching to the new joint optimal plan means that the human is always "given a new chance" to behave optimally; in other words, the robot's adaptation is always placing the team in the situation with the best achievable outcome. Developing and testing these hypotheses further will be the subject of future work.

Limitations. There are several limitations to this work which we plan to address in future research. First, in our experiment, participants could only select the goal task, and the trajectory of the human avatar was predefined. It would be more realistic if participants were allowed to directly control the motion of the human avatar. To better understand the role of realistic control, we will conduct a future study in which participants can continuously move their avatar using the arrow keys. Second, the robot's working assumption that the human will follow the optimal joint plan is unrealistic. Better results could conceivably be obtained if this assumption were to be relaxed and replaced by a more refined model of human inference and decision-making, enabling a more effective adaptation of the robot's plan. Lastly, while adaptivity is fundamental to successful collaboration in many types of tasks, it may not lead to improvements in all contexts. For example, in manufacturing scenarios where there is scheduling and allocation ahead of time, adaptation on either side is not required: in these cases, humans actually prefer that robots take the lead and give them instructions [12], rather than adapting to their actions.

Conclusions. Overall, our results demonstrate the benefits of combining motion-level inference with task-level plan adaptation in the context of human-robot collaboration. Although our results are based on virtual simulations, we speculate that the improvement found here will be even more important in real-world situations, during which humans may be under considerable stress and cognitive load. In such cases, a robot partner not behaving as expected can lead to frustration or confusion, impeding the human's judgment and performance. A robot partner that adapts to the human can make all the difference, both in the successful completion of the task and in the perceived quality of the interaction.


REFERENCES

[1] C. L. Baker, R. Saxe, and J. B. Tenenbaum. Action understanding as inverse planning. Cognition, 113(3):329–349, 2009.
[2] R. Bemelmans, G. J. Gelderblom, P. Jonker, and L. De Witte. Socially assistive robots in elderly care: a systematic review into effects and effectiveness. Journal of the American Medical Directors Association, 13(2):114–120, 2012.
[3] D. Brogan and N. Johnson. Realistic human walking paths. IEEE International Workshop on Program Comprehension, 2003.
[4] E. Charniak and R. P. Goldman. A Bayesian model of plan recognition. Artificial Intelligence, 64(1):53–79, 1993.
[5] M. B. Dias, B. Kannan, B. Browning, E. G. Jones, B. Argall, M. F. Dias, M. Zinck, M. M. Veloso, and A. J. Stentz. Sliding autonomy for peer-to-peer human-robot teams. Intelligent Autonomous Systems (IAS), 2008.
[6] A. D. Dragan, S. Bauman, J. Forlizzi, and S. S. Srinivasa. Effects of robot motion on human-robot collaboration. ACM/IEEE International Conference on Human-Robot Interaction (HRI), 2015.
[7] A. D. Dragan and S. S. Srinivasa. Formalizing assistive teleoperation. Robotics: Science and Systems (RSS), 2012.
[8] A. D. Dragan and S. S. Srinivasa. A policy-blending formalism for shared control. The International Journal of Robotics Research, 32(7):790–805, 2013.
[9] A. Fern, S. Natarajan, K. Judah, and P. Tadepalli. A decision-theoretic model of assistance. Journal of Artificial Intelligence Research, 50(1):71–104, 2014.
[10] T. Fong, I. Nourbakhsh, C. Kunz, L. Fluckiger, J. Schreiner, R. Ambrose, R. Burridge, R. Simmons, L. M. Hiatt, A. Schultz, et al. The peer-to-peer human-robot interaction project. AIAA Space, 2005.
[11] T. Furukawa, F. Bourgault, B. Lavis, and H. F. Durrant-Whyte. Recursive Bayesian search-and-tracking using coordinated UAVs for lost targets. IEEE International Conference on Robotics and Automation (ICRA), 2006.
[12] M. C. Gombolay, R. A. Gutierrez, G. F. Sturla, and J. A. Shah. Decision-making authority, team efficiency and human worker satisfaction in mixed human-robot teams. Robotics: Science and Systems (RSS), 2014.
[13] B. Grocholsky, J. Keller, V. Kumar, and G. Pappas. Cooperative air and ground surveillance. IEEE Robotics & Automation Magazine, 13(3):16–25, 2006.
[14] T. M. Gureckis, J. Martin, J. McDonnell, R. S. Alexander, D. B. Markant, A. Coenen, J. B. Hamrick, and P. Chan. psiTurk: An open-source framework for conducting replicable behavioral experiments online. Behavioral Research Methods, 2015.
[15] Gurobi Optimization. Gurobi reference manual, 2015. Available at http://www.gurobi.com.
[16] M. Held and R. M. Karp. The traveling-salesman problem and minimum spanning trees. Operations Research, 18(6):1138–1162, 1970.
[17] G. Hoffman. Evaluating fluency in human-robot collaboration. ACM/IEEE International Conference on Human-Robot Interaction (HRI), Workshop on Human Robot Collaboration, 2013.
[18] G. Hoffman and C. Breazeal. Effects of anticipatory action on human-robot teamwork efficiency, fluency, and perception of team. ACM/IEEE International Conference on Human-Robot Interaction (HRI), 2007.
[19] C. D. Kidd, W. Taggart, and S. Turkle. A sociable robot to encourage social interaction among the elderly. IEEE International Conference on Robotics and Automation (ICRA), 2006.
[20] C. Lenz, S. Nair, M. Rickert, A. Knoll, W. Rosel, J. Gast, A. Bannat, and F. Wallhoff. Joint-action for humans and industrial robots for assembly tasks. Robot and Human Interactive Communication (RO-MAN), 2008.
[21] C. Liu, S.-y. Liu, E. L. Carano, and J. K. Hedrick. A framework for autonomous vehicles with goal inference and task allocation capabilities to support task allocation with human agents. Dynamic Systems and Control Conference (DSCC), 2014.
[22] D. T. McRuer. Pilot-induced oscillations and human dynamic behavior, volume 4683. NASA, 1995.
[23] G. Pezzulo, F. Donnarumma, and H. Dindo. Human sensorimotor communication: a theory of signaling in online social interactions. PLoS ONE, 8(2), 2013.
[24] M. Saffarian, J. C. F. de Winter, and R. Happee. Automated driving: human-factors issues and design solutions. Proceedings of the Human Factors and Ergonomics Society Annual Meeting, 56(1):2296–2300, 2012.
[25] N. B. Sarter and D. D. Woods. Team play with a powerful and independent agent: operational experiences and automation surprises on the Airbus A-320. Human Factors, 39(4):553–569, 1997.
[26] J. Shah and C. Breazeal. An empirical analysis of team coordination behaviors and action planning with application to human–robot teaming. Human Factors, 52(2):234–245, 2010.
[27] J. Tisdale, Z. Kim, and J. K. Hedrick. Autonomous UAV path planning and estimation. IEEE Robotics & Automation Magazine, 16(2):35–42, 2009.
[28] M. Tomasello, M. Carpenter, J. Call, T. Behne, and H. Moll. Understanding and sharing intentions: the origins of cultural cognition. Behavioral and Brain Sciences, 28(5):675–691, 2005.
[29] C. Vesper, S. Butterfill, G. Knoblich, and N. Sebanz. A minimal architecture for joint action. Neural Networks, 23(8):998–1003, 2010.
[30] K. Wada and T. Shibata. Living with seal robots – its sociopsychological and physiological influences on the elderly at a care house. IEEE Transactions on Robotics, 23(5):972–980, 2007.
[31] J. Yen, X. Fan, S. Sun, T. Hanratty, and J. Dumer. Agents with shared mental models for enhancing team decision makings. Decision Support Systems, 41(3):634–653, 2006.
[32] B. D. Ziebart, N. Ratliff, G. Gallagher, C. Mertz, K. Peterson, J. A. Bagnell, M. Hebert, A. K. Dey, and S. Srinivasa. Planning-based prediction for pedestrians. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2009.
