Accepted Publication to the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), October 1-5, 2018, Madrid, Spain. DOI: 10.1109/IROS.2018.8594076
© 2018 IEEE. This version is published under German Copyright Law.

Online inference of human belief for cooperative robots

Moritz C. Buehler1,2 and Thomas H. Weisswange2

Abstract— For human-robot cooperation, inferring a human's cognitive state is very important for an efficient and natural interaction. Similar to human-human cooperation, understanding what the partner plans and knowing whether he is situation aware is necessary to prevent collisions, offer support at the right time, correct mistakes before they happen, or choose the best actions for oneself as early as possible.

We propose a model-based belief filter to extract relevant aspects of a human's mental state online during cooperation. It performs inference based on human actions and its own task knowledge, modeling cognitive processes like perception and action selection. In contrast to most prior work, we explicitly estimate the human belief instead of inferring only a single mode or intention. Since this is a double inference process, we focus on representing the human estimates of environmental state and task as well as corresponding uncertainties.

We designed a human-robot cooperation experiment that allowed for a variety of cognitive states of both agents and collected data to test and evaluate the proposed belief filter. The results are promising, as our system can be used to provide reasonable predictions of the human action and insights into his situation awareness. At the same time it infers interpretable information about the underlying cognitive states – a belief about the human's belief about the environment.

I. INTRODUCTION

Robots and other (partially) autonomous technical systems are more and more present in our daily and working life; examples include vacuum cleaning robots, manufacturing robots, intelligent application software or advanced driver assistance systems. Particularly classical manufacturing robots often have very narrow use-cases, because they are limited to work in dedicated areas to protect people from harm (separation of space). This often demotes them to mere tools, that is, humans delegate defined functions to those technical systems (functional separation), e.g. a vacuum cleaning robot is clearly supposed to clean a room. Those approaches become more and more inappropriate due to the high complexity and interdependence of modern technical systems. For a next generation, it is important to investigate interaction between humans and autonomous technical systems to create interfaces that are comprehensible and efficient, enabling a natural cooperation. Application scenarios for human-robot cooperation include cooperative manufacturing, shared autonomy teleoperation, path planning in crowded environments or semi-autonomous traffic. For the cooperation of humans and machines, complex aspects known from human-human interaction become important, including situation awareness, predictability, trust, intention, desires or affect. Using these aspects requires some basic understanding of human cognition and its representation of the environment.

1 with Control Methods & Robotics Lab, TU Darmstadt, Germany
2 with Honda Research Institute Europe GmbH

Human situation awareness (SA) in particular covers many important parts of the human cognitive state. SA has been a topic of interest, especially in aviation research [1], for many years. It proved to be an efficient concept to evaluate and enhance the performance of a human operator in his interaction with a machine. Situation awareness goes beyond the perception of relevant pieces of information, like flight variables (e.g. altitude) or environmental states (e.g. the location of other aircraft). The human has to relate different information fragments (low altitude is necessary while landing, but dangerous during flight) and anticipate the consequences. Along these lines, we propose that a robot also needs to be situation aware to be able to solve more complex tasks. For a collaborative setting, SA will also include awareness of the interaction partner and his internal processes, for example whether a human partner is situation aware himself. With this knowledge a robot can warn or direct the human's attention, provide additional information, or adapt its own strategy to prevent failures or improve overall task performance. For this reason, we want to model the relevant cognitive processes of a human partner involved in solving a collaborative task. But since we cannot directly access them, we will infer the hidden information indirectly through reasoning over the human's actions and active sensing strategies. Indeed, humans are remarkably good at inferring the mental states of their human partners [2].

One interesting requirement for SA is the understanding of the task goals and valuations. In machine learning research, one concept for inferring these is Inverse Reinforcement Learning (IRL) [3]. Reinforcement learning tackles the problem of an agent selecting actions to optimize accumulated future reward in a dynamic environment. IRL, in contrast, infers the reward function from an observed optimal action sequence. It was previously proposed to specify the goal for a robot in collaborative settings, implicitly taking into account some of the cognitive state variables of a human partner [4]. In [5], IRL was used in an automotive context to predict a human's most likely future action and to incorporate it into the planning process of a robotic agent. IRL focuses on inferring the reward function based on human actions; however, to appropriately handle situation awareness, additional aspects have to be considered.

Among human actions, gaze is an important information source to estimate which aspects of the environment the human might be aware of. In [6], gaze as active observation is used to obtain the object of human attention awareness for a collaborative setting.



Similarly, in an automotive context [7], [8], the estimation of a human's focus of attention is extracted from gaze information to estimate his situation awareness. [9] investigates gaze as communication that is used to coordinate actions by looking at the other agent. In the remainder of this paper, we will call active observations like this gathering actions, because they can be used for our inferences in a way similar to task-progressing actions. This can, for example, also be seen in [10], where the authors use IRL on gaze behavior to infer the internal valuation of awareness of certain information for a given task.

The combination of both task actions and gathering actions should provide a more complete view of the human cognitive state. As task structure and perception are not free of noise and might only be partially observable, the cognitive state of the human will include a belief, i.e. a probabilistic representation of his environment. Previous belief inference concepts are proposed in [11] and [12], interpreting an agent's actions and information gain. [11] use belief recursion to an arbitrary depth (i.e. he believes that I believe...) to include the cognitive state of an interaction partner in the decision making. They apply it to a game-like multi-agent scenario with prespecified policies for the (artificial) agents and were able to solve cooperative problems. In contrast, [12] observe a human moving in a grid world while approaching a desired food truck. They want to explain the observed behavior retrospectively by inferring the human's desires and beliefs based on his actions and a simple perception model. The human belief is sampled uniformly and inferred after task completion. The results are compared to simpler (heuristic) human models and to human observer assessments.

In this paper, we develop a general concept for inferring belief and situation awareness of a human cooperation partner from information gathering and task actions. We use an application scenario, inspired by human-robot cooperative manufacturing, with a shared task and goal and the potential for supportive warning actions and predictive adaptation of the robot strategy. This direct interaction target requires a formulation for online estimation of a human's belief, in contrast to [12]. Our approach introduces a parametrized approximation of the human belief that concentrates on significant aspects of the cognitive state. Through this, inferring the complete human belief over the environment reduces to the inference of the parameters, which form the basis for estimating his situation awareness.

The remaining paper is structured as follows: First, we describe our general concept for filtering the human belief based on his task and gathering actions. Therefore, we introduce a parametrization of the human belief to support online inference. We then describe our cooperative human-robot experiment and specify the inference processes for this scenario. The collected data is finally used to test and evaluate our inference process.

II. MODEL-BASED ONLINE BELIEF FILTER

We present our approach for inferring the human belief. Our goal is the enhancement of human-robot cooperation, where the robot should adapt its behavior based on the human's situation awareness. This is done through inference of the human belief regarding relevant parts of the common task during execution. To generate a complete view, we use the human's actions as well as his information gathering and respect their consequences for his state of mind.

We formalize our problem in the Partially Observable Markov Decision Process (POMDP) framework [13]. The environmental state s is changed to the next state s′ through actions a by the agents, according to the dynamics, called the transition function, T(s′|s, a). A reward signal provides possibly delayed feedback on the quality of the behavior of the agents. It is assumed that they select their actions trying to maximize the overall reward. The system state s is not directly accessible by the agents, but has to be inferred from the perceived observations o(a, s). Based on this information, an agent can construct a belief over the state, b = p(s), which means integrating action and observation histories into a probability distribution over the current state.
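To make the belief construction concrete, the following is a minimal sketch (not part of the original paper) of a discrete Bayesian belief update in a POMDP; the transition array T, observation array O and the two-state example are purely illustrative.

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """One Bayesian belief update b -> b' after taking action a and observing o.

    b: belief over states, shape (S,)
    T: transition probabilities, T[a, s, s_next] = p(s_next | s, a)
    O: observation probabilities, O[a, s_next, o] = p(o | s_next, a)
    """
    predicted = T[a].T @ b               # sum_s p(s'|s,a) b(s)
    corrected = O[a, :, o] * predicted   # weight by the observation likelihood
    return corrected / corrected.sum()   # renormalize to a distribution

# Purely illustrative two-state, one-action, two-observation example
T = np.array([[[0.9, 0.1], [0.2, 0.8]]])
O = np.array([[[0.8, 0.2], [0.3, 0.7]]])
b = np.array([0.5, 0.5])
print(belief_update(b, a=0, o=1, T=T, O=O))
```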

We assume that the human creates a belief bH based on which he selects his actions. The belief of the human is only partially observable to the robot; it cannot look into the human brain. Instead, it only knows the human's actions and information gathering activities, which provide information about the human's observations. Inferring the human belief results in a probability distribution over a probability distribution, which quickly gets too complex for interesting real-world scenarios. Therefore, we approximate the human belief through a sparse parametrized distribution and infer its parameters. With an appropriate parameter selection, we will show that we can still recover a representation containing the task-relevant aspects.

A. Representation

The belief bH = p(s, T) describes the human's internal state as a distribution over the state and the transition function (the transition belief could be subsumed in the state belief by defining the base MDP in a different way). Constructing a probability distribution p(bH) is infeasible for real-world scenarios.

We parametrize an approximate belief distribution bH ≈ q(s, T|θ) with parameters θ that focuses on regions where the belief is high. The most probable belief µ = argmax(bH) is represented together with its probability α = bH(µ) as parameters θ = (µ, α). In a continuous space, one could alternatively approximate the human belief through a normal distribution with mean and precision as parameters. Representing only one state in the belief can be limiting, especially for multimodal cases. This can easily be overcome by separately modeling the k most probable states.
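As an illustration of this parametrization, the sketch below (our own, not from the paper) fits the k most probable states and their probabilities from a full belief vector; the names SparseBelief and fit_sparse_belief are hypothetical.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SparseBelief:
    """Approximate belief theta = (mu, alpha): the k most probable states and their probabilities."""
    modes: np.ndarray   # indices of the k most probable states (mu)
    probs: np.ndarray   # their believed probabilities (alpha)

def fit_sparse_belief(b, k=1):
    """Project a full belief vector b onto the sparse parametrization."""
    modes = np.argsort(b)[::-1][:k]          # k most probable states
    return SparseBelief(modes=modes, probs=b[modes])

# Example: a belief concentrated on state 2
b = np.array([0.05, 0.10, 0.80, 0.05])
print(fit_sparse_belief(b, k=2))             # modes [2 1], probs [0.8 0.1]
```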

In the following, we consider a general parametrization θ for the human belief and develop our filter equations to infer the distribution p(θ) over the parameters.

B. Filter structure

We model the human as a rational agent acting in a POMDP. The cognitive processes of the human that we respect are shown in Fig. 1a. Based on his belief, the human will decide on the next action, which can be information gathering (e.g. a gaze shift), leaving the state unchanged, or an action that leads to a state transition of the environment. Information gathering will result in an observation, providing information the human uses to update his belief. For task actions, the human will respect the expected transition of the environmental state, updating his state belief according to his transition belief.

In Fig. 1b the temporal and causal relations of the filter variables are shown in the form of a Dynamic Bayesian Network. It includes the relations from Fig. 1a to update its representation, according to actions and information gathering, in the same way we suppose the human does. Since the human action decision is supposed to be based on his belief, we can use the actual decision as noisy corrective information to reduce the deviation between the belief estimate and the true belief.

Fig. 1: Model structure. (a) Structure of the modeled mental processes: based on his belief bH, the human makes a decision for an action aH; gathering actions yield an observation o, while task actions change the environment state s according to the transition model T(bH, aH). (b) Temporal filter structure over state sk, observation ok, human belief parameters θk and action ak.

C. Expected human observations

Human observations are not directly accessible, but we can attribute them based on human information gathering. For example, if someone looks at an object (directs his gaze towards it), one assumes that he observed that object and integrated its presence into his internal environment representation. Therefore, based on an (information gathering) action by the human, aH, we conclude on his observation, p(o|aH) (perception model). An observation will lead to an update of the human belief, p(s, T|o) ∼ p(o|s, T) p(s, T). We therefore have to update the parameters θ:

p(θ|aH) ∼ Σ_{s,T,o} p(o|s, T) q(s, T|θ) p(θ) p(o|aH),        (1)

where the observation probabilities p(o|s, T) depend on the concrete, application-specific perception model.
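A minimal discrete sketch of the update in Eq. (1), assuming p(θ) is represented by a finite set of weighted parameter hypotheses and that the perception model p(o|s) is known; all array shapes and names are illustrative, and the transition function is folded into the state for brevity.

```python
import numpy as np

def observation_update(theta_weights, q_s_given_theta, p_o_given_s, p_o_given_aH):
    """Reweight the hypotheses over the human belief parameters, following Eq. (1).

    theta_weights:   p(theta), shape (n_theta,)
    q_s_given_theta: q(s | theta), shape (n_theta, n_states)
    p_o_given_s:     perception model p(o | s), shape (n_states, n_obs)
    p_o_given_aH:    expected observation given the gathering action, shape (n_obs,)
    """
    # likelihood of each theta: sum_{s,o} p(o|s) q(s|theta) p(o|a_H)
    likelihood = q_s_given_theta @ p_o_given_s @ p_o_given_aH
    posterior = theta_weights * likelihood
    return posterior / posterior.sum()
```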

D. Actions

For human actions, we have to account for two further effects: feedback based on the human's action selection, and the state transition. The first is based on the assumption that the human decides approximately rationally, and we calculate the action probabilities according to a softmax model as

p(aH|θ) ∼ exp(τ Q(a, s|θ)),        (2)

where Q(a, s|θ) is the action value function depending on the belief parameters θ, and τ is a temperature hyperparameter characterizing the human's degree of rationality.

The corrective update can be computed as

p(θ|aH) ∼ p(aH|θ) p(θ).        (3)

The action value function Q could be obtained from a POMDP solver, e.g. [14]. Using an online method like this, the distribution over θ might be sampled to reduce the required processing load.
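The following sketch (illustrative, with hypothetical names) implements the softmax action model of Eq. (2) and the corrective update of Eq. (3) over a sampled set of θ hypotheses, assuming the action values Q(a|θ) have already been computed for each hypothesis.

```python
import numpy as np

def action_likelihood(Q_values, tau=3.0):
    """Softmax action model of Eq. (2): p(a_H | theta) ~ exp(tau * Q(a, s | theta))."""
    z = tau * np.asarray(Q_values, dtype=float)
    z -= z.max()                          # numerical stability
    p = np.exp(z)
    return p / p.sum()

def corrective_update(theta_weights, Q_per_theta, observed_action, tau=3.0):
    """Corrective update of Eq. (3): p(theta | a_H) ~ p(a_H | theta) p(theta).

    Q_per_theta: action values for each sampled theta hypothesis, shape (n_theta, n_actions)
    """
    lik = np.array([action_likelihood(Q, tau)[observed_action] for Q in Q_per_theta])
    posterior = np.asarray(theta_weights) * lik
    return posterior / posterior.sum()
```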

For task-progressing actions, the human expects a transition of the system state s to a new state s′. He will update his belief, and the new distribution of the parameters after the state transition, p(θ′), becomes

p(θ′|aH) = Σ_{s,T,θ,s′} p(θ′|s′, θ) p(s′|s, T, aH) q(s, T|θ) p(θ|aH),        (4)

where p(s′|s, T, aH) is the transition T(s′, s, aH), and p(θ′|s′, θ) ∼ p(s′|θ′) p(θ′|θ) is the approximation of bH(s′) through our parametrization.
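For the mode/probability parametrization θ = (µ, α), one possible instance of Eq. (4) is sketched below: the approximate state distribution is reconstructed, pushed through the believed transition for the chosen action, and re-fitted to the parametrization. The uniform spreading of the residual probability mass is an assumption made only for this sketch.

```python
import numpy as np

def transition_update(mode, alpha, T_belief, a, n_states):
    """Propagate the sparse state belief (mu_s, alpha_s) through the believed
    transition for action a, then re-fit the parametrization (cf. Eq. (4)).

    T_belief[s, a, s_next]: the human's believed transition probabilities.
    """
    # Reconstruct q(s | theta): mass alpha on the mode, the rest spread
    # uniformly over the remaining states (an assumption of this sketch).
    q = np.full(n_states, (1.0 - alpha) / max(n_states - 1, 1))
    q[mode] = alpha
    # Push the distribution through the believed transition for action a.
    q_next = q @ T_belief[:, a, :]
    # Re-fit the (mu, alpha) parametrization.
    new_mode = int(np.argmax(q_next))
    return new_mode, float(q_next[new_mode])
```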

E. Belief evaluation and situation awareness estimation

The inferred human belief can be used to enhance cooperation in various ways. In Eq. (2) we already compute a probabilistic action prediction. But we want not only to predict but also to understand the human behavior, to improve the cooperation on another level, e.g. by providing him with necessary information. Therefore we use the belief to evaluate the situation awareness of the human. We call a human situation aware if the optimal action cost based on his belief is not significantly worse than the best action cost given perfect knowledge. If this is the case, the human belief contains all relevant information for optimal action selection. To evaluate this, the robot itself must be aware of the situation, since we need this knowledge to estimate the human belief.
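A possible reading of this criterion as code (our interpretation; the tolerance threshold and the dictionary-based interface are assumptions):

```python
def is_situation_aware(Q_true, Q_belief, tolerance=0.1):
    """Check the situation awareness criterion described above.

    Q_true:   action values given perfect knowledge (dict action -> value)
    Q_belief: action values under the estimated human belief (dict action -> value)
    The human is called aware if the action that is optimal under his belief
    is not significantly worse, evaluated with perfect knowledge, than the
    truly optimal action.
    """
    best_true = max(Q_true.values())
    a_belief = max(Q_belief, key=Q_belief.get)    # action the human would select
    return best_true - Q_true[a_belief] <= tolerance
```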

The approach for filtering the human belief was introduced on a general level. We will now move to a concrete situation to apply and test it. In the following section, we present our human-robot cooperation experiment and the application of the human belief inference.

III. EXPERIMENTS

We designed a cooperative experiment with the objective of generating useful data to evaluate our approach. This has several requirements: The final target is an application to a real-world task, e.g. a cooperative manufacturing task, which we want to represent in an abstracted way. We need interaction in the form of interdependence between robot and human. Further, we require a planning process by the participants to include temporal aspects and raise the complexity, which we need to achieve variability in the human performance. It should also be possible to evaluate the performance of the agents. Inspired by cooperative manufacturing, we selected a sequential task, where a human and a robot have to press buttons in a specific order. Each button press can represent a processing step of the manufacturing task. The abstraction of using buttons has the advantage that we avoid domain-specific difficulties for the robot, like complex controllers for grasping tools. Further, the workspace provides the opportunity to realize other application scenarios.

A. Setup

Fig. 2: Experimental setup for the human-robot cooperation experiment; the gathering buttons on the human board are shown in white.

Our setup consists of a cooperative workspace for human and robot, as shown in Fig. 2. The robot is a one-armed industrial UR5 robot, designed for operating with humans in a common workspace. Both agents, human and robot, have a board with 9 buttons in a 3×3 grid on the table in front of them. The buttons can be set to different colors, which are used to distinguish different actions. A screen on the wall is used to display task-relevant information, e.g. the task sequence or state. Human pose and gaze can be measured, but are not used so far. Instead, gaze as active observation is abstracted by discrete gathering actions on a subset of the buttons, as explained below. Currently, we do not use direct physical interaction; the agents stay in separated working areas. Interdependence is achieved through the task and the need to respect the other's actions in planning.

B. Task design

The task consists of pressing colored buttons in a given order. It is automatically generated following specified rules to meet our requirements. An example structure is shown in Fig. 3. Distinct actions are represented by button colors. The shape serves to distinguish the actor: squares for robot actions and circles for human actions. The nodes between the actions are used as system states, and pressing the button marked on an outgoing edge leads to a transition to the linked node. The task starts at the leftmost node, ends at the goal node on the right, and is shared between both agents. Different paths are possible to achieve the goal, since there are states with multiple outgoing edges, leading to different branches of the graph. Both actors can be in situations where they select a branch by taking an action. Starting at the leftmost node of Fig. 3, the robot can decide between the upper path (pressing yellow) and the lower path (pressing red). The human has to track the robot's decision to determine subsequent actions. If the robot presses the yellow button, the human has to press the purple one subsequently. We introduced the branching to force planning over purely reactive behavior and to achieve the required interaction and complexity. Taking an action in a state where it is not a specified transition is a mistake and leads to a negative reward, while the state remains unchanged. To achieve the desired task complexity, we generate tasks with a size of 8 states.

The explicit task is generated with randomly varying branching and merging points and button attributions. The mapping of colors to the buttons is changed between tasks to avoid habituation.

Fig. 3: Cooperative task plan example

In addition to the colored buttons, we take human information gathering into account. To focus on the actual inference process, we introduce discrete gathering actions as a temporary simplification. Therefore we use three buttons, illuminated in white (Fig. 2). By holding one of them, the human obtains a certain piece of information on the screen. "Task gathering" displays the whole task as in Fig. 3. "State gathering" shows the actual state of the environment as a node in the graph, including the last and next action(s). "Robot gathering" visualizes the robot's button approach to distinguish between different robot actions. Because of this explicit robot gathering action, we want to avoid direct visual information, and the robot is therefore simulated during the main part of the experiment.

We introduce a gathering cost through a time delay for displaying the information. The human should always track the environmental state; therefore we use the highest delay of one second for state gathering. Gathering the task is delayed by half a second; the task is already shown at the start of each run but is difficult to remember. Since the robot moves and acts, the human needs to dynamically adapt to what it does, so we assign no further delay for robot gathering.

In our current experiments, the robot randomly presses one of the next buttons from the task. The intention behind this is that the human has to actively track the robot's actual action. When the robot chooses a path, the human has to be aware of the actual decision, which he can achieve through robot gathering. This robot selection strategy is communicated to the participant before an experiment. The actual as well as the simulated button press produces a sound, from which the human can infer the end of the robot action execution and the related state transition. After the sound, he can continue with the next action. The robot needs about 2 to 6 seconds to press a button, depending on the button location. This is normally sufficient for the human to gather information about the upcoming parts. Varying the robot's speed would be a way to adapt the human stress level.

The reward function contains two parts: the number of false button presses and the execution time. The human has to balance performing fast, with the risk of mistakes, against performing safely, at the cost of time spent on information gathering. He is told about both aspects without the explicit weighting to avoid further confusion. At the end of the task, we provide feedback in the form of a score. From this, it should be possible to get at least an intuition about the weights; however, participants may have their own individual weighting, which we represent as a static hyperparameter w.

At the beginning of each task run, called an episode, the corresponding graph is displayed on the screen for three seconds. This time is designed to be insufficient for the human to memorize every aspect, so the human has to use the task gathering button during execution. The experimental procedure of several tasks starts with a habituation phase, where the human has the chance to learn the overall principle, how to progress, and to get used to the gathering buttons. After this habituation phase of 5 episodes, we recorded the execution of 20 tasks.

We performed experiments with 9 participants. In 21 cases, that is 5% of the task actions, a false button was pressed by the human. The execution time, as the second measure, varied considerably (mean 2.2 s, STD 1.8 s) between different episodes and participants. From the frequent usage of the gathering actions, we conclude that memory is a significant limitation in our task. For future work it will be interesting to deal with other (e.g. physical) limitations of the human.

On average, the participants took 2.4 task actions and 3 gatherings per episode. We observed differing strategies by the participants regarding task gathering versus state gathering. Indeed, there is information overlap, because gathering the state also provides information about the immediate next action(s). However, the strategy of relying mainly on state gathering is not globally optimal, because it allows neither choosing the best path nor preparing a sequence of consecutive human actions. Instead, it seems to be a local optimum in the human learning process. This strategy was observed mainly for participants 2, 3 and 4 and partly for 5.

C. Filter specification

We now apply the general belief inference concept from Section II. As environmental state s we use the current node number of the task graph. The transition function T is specified by the task.

1) Representation: We are interested in three aspects of the human belief bH, namely the state, the transition function and the belief over the next robot action. We parametrize and update these independently (but with the same observations).

For the state belief we use one parameter describing the most probable state µs = argmax_s(bH(s)) and its probability αs = bH(µs). As prior we use the start state, p(µs = 0) = 1.

For the transition dynamics we represent the structure as an adjacency matrix TH, as part of the true transition function T, together with several probability parameters αT. We further divide the human task belief into two parts: the overall structure, i.e. the connectivity, and the actual buttons connecting the nodes. Since we designed the task such that the human is unable to memorize it completely, we model two sequential steps by the human. The first is the selection of one single path through the graph, whereupon he tries to capture the corresponding buttons. This structure, consisting of one possible path, is represented as a connectivity matrix CT together with a probability αT,C. Further, the human needs to know the required actions to proceed on the selected path. To respect the human memory limitation, we introduce a hyperparameter nmem for the memory size, characterizing the number of buttons the human can remember. Based on short-term memory research, we select four as a typical number [15]. Under this assumption, we have nmem parameters µT,i for the most probable buttons along the path and the corresponding probabilities αT,i (i = 1 . . . nmem). The connectivity matrix CT and the colors µT,i are combined in the adjacency matrix TH. Together with the probabilities αT, it forms the task parametrization, for which we use a uniform prior.

In situations where the robot has multiple action options, the human needs to know which action the robot performs. Therefore, we represent the belief over the current robot action, bH(aR). Because we informed the human about the random action selection by the robot, we assume that he has a uniform prior over robot actions. When the human gathers information on the robot's movement, his belief updates to the true action, bH(aR = aR,true) = 1. Since the robot action belief is clear and only important for the current node, we do not use a probabilistic representation for it. The used belief representation combines all three aspects; the parameters are θ = (µs, αs, TH, αT, µR).
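For illustration, the parametrization described in this subsection could be collected in a small container like the following sketch; all field names and the None/empty defaults are our own choices, not the authors' implementation.

```python
from dataclasses import dataclass, field
import numpy as np

N_MEM = 4   # memory-size hyperparameter n_mem, following [15]

@dataclass
class HumanBeliefParams:
    """Container for theta = (mu_s, alpha_s, T_H, alpha_T, mu_R)."""
    mu_s: int = 0                                  # most probable state (prior: start node)
    alpha_s: float = 1.0                           # its believed probability
    C_T: np.ndarray = None                         # connectivity of the selected path
    alpha_T_C: float = 0.0                         # probability of that path structure
    mu_T: list = field(default_factory=list)       # up to N_MEM remembered button colors
    alpha_T: list = field(default_factory=list)    # corresponding probabilities
    mu_R: int = None                               # believed current robot action (None = unknown)
```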

2) Observations: Each gathering action is assumed to lead to an observation and a corresponding belief update by the human. Therefore, we have to specify the observation probabilities p(o|s, T) from Eq. (1). With the discrete information buttons in our experiment, the observations are clearly related to these and assumed to provide reliable information.

Accordingly, task gathering sets the human task probabilities αT to one and the estimates to the true values. Gathering the state will display the actual state together with the immediate next action(s); thus the state µs and the next button color µT,1 are updated to the true values and the corresponding certainties αs and αT,1 become one. Lastly, information about the robot's movement leads to certain human knowledge regarding the robot action.
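Continuing the sketch above, the three gathering actions could update the parameters as follows (illustrative only; argument names and the handling of the remembered button list are assumptions):

```python
def apply_gathering(theta, action, true_state=None, true_path=None,
                    true_buttons=None, true_robot_action=None):
    """Update the belief parameters after one of the three gathering actions."""
    if action == "task":
        theta.C_T, theta.alpha_T_C = true_path, 1.0
        theta.mu_T = list(true_buttons)[:N_MEM]           # remembered button colors
        theta.alpha_T = [1.0] * len(theta.mu_T)
    elif action == "state":
        theta.mu_s, theta.alpha_s = true_state, 1.0
        if true_buttons:                                  # next button is shown with the state
            theta.mu_T[:1] = [true_buttons[0]]
            theta.alpha_T[:1] = [1.0]
    elif action == "robot":
        theta.mu_R = true_robot_action                    # robot action becomes known
    return theta
```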

For another observation type, the sound triggered by a robot action, we use Eq. (1), where we have to specify the observation probability p(o_aR|s, T). It results from the probability for any robot action in the current state, p(o_aR|s, T) = Σ_{aR} p(aR|s, T). A robot action leads to a (possibly uncertain) state transition, changing the human belief over the state. The parametrization updates to p(θ′) = Σ_{aR} p(θ′|aR) p(aR|µR), where p(θ′|aR) results from Eq. (4), replacing the human action with the robot action.

3) Human action: For the update of p(θ) according to human actions, we need to compute the action value function Q(a, θ). For our relatively simple task, we can specify an action value function by hand and do not need to use a POMDP solver. The expected value of the task action µT,1 depends on the probabilities for state and task, Q(µT,1) = αT,1 αs, describing the risk of failure. Gathering actions increase the corresponding probabilities α and thereby the subsequent success probability. However, if the task progress depends on the human action, gathering will delay the task progress and therefore result in a gathering cost c. The resulting action values are Q(state) = 1 − c_state, Q(task) = αs (1 + γ αT,2:) − c_task, and Q(robot) = αT,1 αs + c_state, with the corresponding gathering costs applied in states µs where the human can progress.

Therefore we use −1 as reward for a false action and the gathering costs from the time delays, multiplied by the false/time ratio, w = 3/s. Choosing a longer path increases the time needed and results in an additional cost for those actions. Gathering the task reduces the human's uncertainty not only for the next action but also for future actions. The benefit of future uncertainty reduction may be discounted by some factor γ, modeled as a hyperparameter. We set it to γ = 0.5 and use a rationality parameter τ = 3.
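A hedged sketch of these hand-specified action values and the resulting softmax prediction is given below. The exact cost magnitudes, the meaning of αT,2: (here taken as the mean of the remaining button probabilities), and the sign of the c_state term in Q(robot) are reproduced as written in the text but left as free parameters, since the paper does not fully specify them; the function names are illustrative.

```python
import numpy as np

def action_values(theta, c_state, c_task, gamma=0.5):
    """Hand-specified action values Q(a | theta), following the formulas in the text."""
    alpha_T1 = theta.alpha_T[0] if theta.alpha_T else 0.0
    # alpha_T,2: interpreted here as the mean of the remaining button probabilities (assumption)
    alpha_T_rest = float(np.mean(theta.alpha_T[1:])) if len(theta.alpha_T) > 1 else 0.0
    return {
        "press": alpha_T1 * theta.alpha_s,                                # Q(mu_T,1)
        "state": 1.0 - c_state,                                           # Q(state)
        "task":  theta.alpha_s * (1.0 + gamma * alpha_T_rest) - c_task,   # Q(task)
        "robot": alpha_T1 * theta.alpha_s + c_state,                      # Q(robot), as written in the text
    }

def predict_action(theta, c_state, c_task, tau=3.0):
    """Softmax action prediction, Eq. (2)."""
    names, Q = zip(*action_values(theta, c_state, c_task).items())
    p = np.exp(tau * (np.array(Q) - max(Q)))
    return dict(zip(names, p / p.sum()))
```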

D. Results

The principle of belief inference is demonstrated in Fig. 4 for a part of a recorded run. The relevant part of the task structure is shown in a). The human presses the purple button in b), transitioning to the next task state. In c), the result of the inference process after this action is visualized. The inferred state belief p(µs) is represented through the width of the pentagon on the node. The probability for the human state estimate µs is concentrated on the first state. One part of the task belief is shown by the empty symbols, representing buttons probably unknown to the human. In this state, the robot can press dark blue or rose, leading to different branches. Since the robot chooses the action randomly, the human should gather the robot's movement to track the state transition. Our filter predicts this action as the most likely one by the human. In this recording, however, the human does not follow our prediction, and the robot presses dark blue in d). In the updated belief estimate e), multiple states are possible for the human belief µs. Additionally, a low human state certainty αs (not illustrated) is estimated. Because of this, the filter predicts state gathering by the human, which he actually performs in the next step f). Consequently, the new state belief of the human is concentrated on the true state and the next action options should be known by the human g). At the latest after the next human action (purple or light blue), we expect him to gather the task, since we estimate the human to be uncertain about the subsequent colors (represented as white circles).

When introducing interaction, e.g. providing specific missing information to the human, an online calculation is necessary. Each of our belief update steps is computed sufficiently fast; it takes less than 5 ms on a standard desktop PC with 3.2 GHz single-core computation.

For a quantitative evaluation, we compute the hit rate of the action prediction. According to Eq. (2), we calculate action probabilities and compare the most probable with the actual human action. Since there are cases with multiple appropriate actions (e.g. branching over comparable paths), we use as a second measure the likelihood, i.e. the probability our filter attributes to the actual human action. The hit rate of the prediction is shown in Fig. 5a. The averaged rate is 56%, but varies between the participants. We compute a statistical baseline over all participants that always predicts the most frequent action (task gathering, 26%). For comparison, we let an expert (the first author) predict the human actions using the same information (action history and task knowledge) as our system (59%). The low success rate for participants 2 and 3 might result from a suboptimal strategy that they use (which violates the rationality assumption). Focusing only on predicting human task actions, the success rate is significantly higher at 82%. It is most difficult to predict the gathering actions for task and state correctly. Looking at the expert's performance, our task seems to have some general problems regarding predictability.

The average likelihoods assigned to the performed human actions are shown in Fig. 5b; the mean inference result is 51%. Again, we calculate a statistical baseline (relative action frequencies). The expert was restricted to spreading his prediction uniformly over (multiple) actions. We observe several cases where two actions or three gathering actions of the human are reasonable and the action prediction is distributed.
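The two evaluation measures can be computed straightforwardly; the sketch below (with hypothetical action names) shows one way to do it.

```python
import numpy as np

def evaluate_predictions(predicted_dists, actual_actions):
    """Hit rate and mean likelihood of the action prediction, as reported in Fig. 5.

    predicted_dists: list of dicts {action: probability} produced by the filter
    actual_actions:  list of the actions the human actually took
    """
    hits, liks = [], []
    for dist, actual in zip(predicted_dists, actual_actions):
        hits.append(max(dist, key=dist.get) == actual)   # most probable action == actual?
        liks.append(dist.get(actual, 0.0))               # probability assigned to the actual action
    return float(np.mean(hits)), float(np.mean(liks))

# Hypothetical usage
dists = [{"task": 0.5, "state": 0.3, "robot": 0.2}, {"press": 0.7, "task": 0.3}]
print(evaluate_predictions(dists, ["task", "press"]))    # (1.0, 0.6)
```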

Finally, we analyze the influence of the hyperparameters. Once selected, they are held constant for the evaluation of all participants, to allow for better comparability. The memory parameter is varied between 2 and 8 with little effect on the hit rate, shifting predictions towards task-progressing actions. In fact, our memory model is simple; for example, the existing dependence on the task complexity (number of branches) is not respected. The rationality parameter has almost no influence on the hit rate; the action distribution becomes sharper, which leads to more peaked probability predictions. Most influence results from the weighting parameter of the human reward function, the error-to-time ratio (in seconds). An increased time cost leads to better prediction results. Over ratios from 0.1 to 100, the prediction rate rises from 52% to 59%.

The belief inference produces plausible qualitative behavior and the prediction results are promising. Further analysis is necessary to ascertain the performance of our approach and the specific parametrization.

IV. CONCLUSIONS

We presented an approach for inferring the human belief to estimate his situation awareness, relying on human information gathering and human actions. We proposed to approximate the human belief distribution with a parametrized distribution, allowing the inference of these parameters without the need for double inference.


Fig. 4: Example case: a) describes the task; b), d) and f) show three consecutive button presses; c), e) and g) represent the estimated human belief after the corresponding action.

Fig. 5: Evaluation of action prediction per participant. (a) Hit rate, comparing inference, expert, only task actions, and baseline. (b) Mean predictive likelihood.

Inspired by the needs and structure of a cooperative manufacturing task, we designed a human-robot experiment, in which we applied the belief inference method. We looked at different cases of the experiment and could verify the qualitative consistency with human observers. The action prediction based on the inferred belief produced promising results.

We will analyze the capabilities and limits of the proposed inference method, regarding the parametrization as well as other application scenarios. One extension would be the use of Inverse Reinforcement Learning to infer the actual human reward function instead of modeling it statically.

Future research will further be directed towards enhancing the cooperation of human and robot with the knowledge of internal human beliefs. To this end, we want to evaluate the human's situation awareness and adapt the robot behavior to warn or inform the human and to optimize the joint behavior. We think that we need the knowledge of human cognitive states to achieve the goal of making human-robot cooperation more efficient and intuitive.

REFERENCES

[1] M. R. Endsley, "Toward a Theory of Situation Awareness in Dynamic Systems," Human Factors: The Journal of the Human Factors and Ergonomics Society, vol. 37, no. 1, pp. 32–64, 1995.

[2] B. Liddle and D. Nettle, "Higher-order theory of mind and social competence in school-age children," Journal of Cultural and Evolutionary Psychology, vol. 4, no. 3, pp. 231–244, 2006.

[3] A. Y. Ng and S. J. Russell, "Algorithms for Inverse Reinforcement Learning," in Proceedings of the Seventeenth International Conference on Machine Learning (ICML), 2000.

[4] D. Hadfield-Menell, A. Dragan, P. Abbeel, and S. Russell, "Cooperative inverse reinforcement learning," in Advances in Neural Information Processing Systems, 2016, pp. 3916–3924.

[5] D. Sadigh, S. Sastry, S. A. Seshia, and A. D. Dragan, "Planning for Autonomous Cars that Leverage Effects on Human Actions," in Robotics: Science and Systems XII, 2016.

[6] L. Paletta, A. Dini, C. Murko, S. Yahyanejad, M. Schwarz, G. Lodron, S. Ladstatter, G. Paar, and R. Velik, "Towards Real-time Probabilistic Evaluation of Situation Awareness from Human Gaze in Human-Robot Interaction," in Proceedings of the Companion of the 2017 ACM/IEEE International Conference on Human-Robot Interaction (HRI), 2017.

[7] T. Baer, D. Linke, D. Nienhuser, and J. M. Zollner, "Seen and missed traffic objects: A traffic object-specific awareness estimation," in IEEE Intelligent Vehicles Symposium Workshops (IV Workshops), 2013, pp. 31–36.

[8] J. Schwehr and V. Willert, "Driver's gaze prediction in dynamic automotive scenes," in IEEE 20th International Conference on Intelligent Transportation Systems (ITSC), 2017, pp. 1–8.

[9] O. Palinko, F. Rea, G. Sandini, and A. Sciutti, "Robot reading human gaze: Why eye tracking is better than head tracking for human-robot collaboration," in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2016, pp. 5048–5054.

[10] C. A. Rothkopf and D. H. Ballard, "Modular inverse reinforcement learning for visuomotor behavior," Biological Cybernetics, vol. 107, no. 4, pp. 477–490, 2013.

[11] L. Zettlemoyer, B. Milch, and L. Kaelbling, "Multi-agent filtering with infinitely nested beliefs," in Advances in Neural Information Processing Systems, 2009, pp. 1–8.

[12] C. L. Baker, J. Jara-Ettinger, R. Saxe, and J. B. Tenenbaum, "Rational quantitative attribution of beliefs, desires and percepts in human mentalizing," Nature Human Behaviour, vol. 1, no. 4, p. 0064, 2017.

[13] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra, "Planning and acting in partially observable stochastic domains," Artificial Intelligence, vol. 101, no. 1-2, pp. 99–134, 1998.

[14] A. Somani, N. Ye, D. Hsu, and W. S. Lee, "DESPOT: Online POMDP Planning with Regularization," in Advances in Neural Information Processing Systems, 2013, pp. 1772–1780.

[15] N. Cowan, "The magical number 4 in short-term memory: a reconsideration of mental storage capacity," The Behavioral and Brain Sciences, vol. 24, no. 1, pp. 87–114, 2001.
