
Social Interaction: Multimodal Conversation with Social Agents

Katashi Nagao and Akikazu Takeuchi
Sony Computer Science Laboratory Inc.
3-14-13 Higashi-gotanda, Shinagawa-ku, Tokyo 141, Japan
E-mail: {nagao,takeuchi}@csl.sony.co.jp

From: AAAI-94 Proceedings. Copyright © 1994, AAAI (www.aaai.org). All rights reserved.

Abstract

We present a new approach to human-computer interaction called social interaction. Its main characteristics are summarized by the following three points. First, interactions are realized as multimodal (verbal and nonverbal) conversation using spoken language, facial expressions, and so on. Second, the conversants are a group of humans and social agents that are autonomous and social. Autonomy is an important property that allows agents to decide how to act in an ever-changing environment. Socialness is also an important property, one that allows agents to behave both cooperatively and collaboratively. Generally, conversation is joint work and is ill-structured; its participants are required to be social as well as autonomous. Third, conversants often encounter communication mismatches (misunderstanding others' intentions and beliefs) and fail to achieve their joint goals. The social agents, therefore, are always concerned with detecting communication mismatches. We realize a social agent that hears human-to-human conversation and reports what is causing a misunderstanding. It can also interact with humans by voice, with facial displays and head (and eye) movement.

Introduction

Many artificial intelligence researchers have been seeking to create intelligent autonomous creatures that act like partners rather than tools. Such creatures will take on the responsibility of performing social services through interaction with humans. Rather than invoking a series of commands that cause the system to carry out well-defined and predictable operations, we will depend on a computer that assists us in achieving tasks, and delegate to it the responsibility for working out the details.

Autonomy is to have or make one's own laws. An autonomous system has the ability to control itself and to make its own decisions. Autonomy is essential to survival in a dynamically changing world such as the one we live in. It is the subject of research in many areas, including robotics, artificial life, and artificial ecosystems.

However, is autonomy itself sufficient for social services? Although autonomy is vital for survival in the real world, it is concerned only with the "self"; it is selfish by nature. On its own, it does not seem to work well in human society, which includes socially constructed artifacts such as laws, customs, and culture. Social services provided by computer systems have to accommodate these artifacts.

Socialness is a higher-level concept, defined above the concept of an individual: it is the style of interaction between the individuals in a group. Socialness can be applied to the interaction between humans and computers, and possibly to that between multiple computers. In this paper, we study the socialness of conversational interaction between humans and computers. Conversation is, no doubt, a social activity, especially when more than two participants are involved. However, conversation research to date has been biased toward problem solving; question-answering systems are typical examples. All conversation research based on this view has the following features.

Dialogical: Only two participants, a human (asker) and a computer (answerer), are assumed. Turn-taking is trivial (alternate turns).

Transformational: Computers are regarded as a function that receives an inquiry and produces its answer.

Passive: Computers will not voluntarily speak.

The dialogical and transformational views are well suited to applications such as natural language interfaces to databases, consulting systems, and guidance systems.

However, our daily conversation is not always functional. One example that is not functional is the co-constructive conversation studied by Chovil (Chovil 1991).

Co-constructive conversation is conversation in which a group of individuals, say, talk about the food they ate in a restaurant a month ago. There are no special roles (like the chair) for the participants to play; they all have the same role.


All participants try to recall the food by relating their memories of it, adding comments, and correcting each other's impressions. Turn-taking is controlled by eye contact, facial expression, body gestures, voice tones, and so on. The conversation includes many subconversations, some of them existing in parallel and dividing the group into subgroups. The conversation terminates only when all the participants are satisfied with the conclusion.

Co-constructive conversation closely approximates our day-to-day conversation. Conversation is a social action. Suchman said that communication is not a symbolic process that happens to go on in real-world settings, but a real-world activity in which we make use of language to delineate the collective relevance of our shared environment (Suchman 1987). Our research goal is to realize a computer that can participate in social conversation such as the co-constructive conversation described above. To this end, we propose the notion of "social interaction" as a new conversation paradigm between humans and computers. In contrast to the problem-solving view, social interaction has the following features.

N-participant conversation: Conversation involves more than two agents, which may be humans or computers. A computer has to recognize every participant and his/her/its character. There are no fixed roles. Turn-taking is flexible and highly dependent on the conversational situation.

Social: Every participant is more or less social and follows social rules such as "avoid misunderstandings," "do not speak while other people are speaking," "silence is no good," "contribute whenever possible," etc.

Situated actions: Conversational actions are controlled not only by intelligence and social rules, but also by situations perceived multimodally. Here, we assume that a participant's actions, such as body gestures, eye contact, facial expressions, and coughing, are all included in a situation.

Active: A computer actively joins the conversation, that is, grabs every chance to speak.

We call an autonomous system that can engage in social interaction with humans a social agent. In the following, we study an architecture for a social agent and its behavioral model. This paper is organized as follows. In Section "An Architecture for Social Agents," we present an architecture for a social agent. In Section "Conversation as Situated Action," situated conversational action based on multimodal cognition is presented. In Section "Conversation as Cooperative Action," we present a model for understanding ill-structured conversation and detecting communication mismatches.

An Architecture for Social Agents

Model

Several agent architectures featuring interaction with a society have been proposed (Cohen & Levesque 1990; Bates, Loyall, & Reilly 1992). Social agents are fully exposed to a real human society, and have to perceive verbal and nonverbal messages and take actions based on them.

Traditional conversation programs process voice input sequentially, from low-level recognition to high-level semantic analysis. This works well in the domain of transformational question-answering applications. However, conversation such as co-construction requires faster responses to other participants' utterances. These reactions are not necessarily deliberate ones executed at a semantic level. Moreover, some reactions may be triggered by nonverbal actions such as eye contact and body gestures. In fact, conversation is supported by multiple coordinated activities at various cognitive levels. This makes communication highly flexible and robust.

Brooks proposed the horizontal decomposition of a mobile robot control system based on task-achieving behaviors, instead of decomposition based on functional modules (Brooks 1986). His architecture is powerful enough to survive in the real world, as proven by a series of robots he designed. The same argument holds when we design a social agent, since social agents have to be involved in conversations that are real-world activities going on in real-world settings. Figure 1 illustrates the horizontal decomposition of a social agent based on task-achieving behaviors. It is important to note that the layers act on sensory data in parallel. There are downward control flows and upward data flows.

There has been much debate between those groups that support situated actions and those that support physical symbol systems (Vera & Simon 1993). In our architecture, these views are placed at opposite ends: the lower layers govern reactions to multiple streams of sensory input data, while the upper layers administer deliberate actions. The next section explains the lower layers; the section after that explains the higher layers.

[Figure 1: Horizontal decomposition of a social agent. Task-achieving layers, from bottom to top: detect, wander, explore, recognize, monitor changes, identify topics, plan changes to the world, and reason about utterances of conversants. Perception (eyes and ears, carrying face and voice input) feeds all layers in parallel, and each layer can drive action.]
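To make the control scheme concrete, the following is a minimal sketch in Python of such a horizontal decomposition; it is not the authors' implementation, and all class and function names are illustrative. Each layer sees the same percept "in parallel," lower layers supply fast default reactions, and a higher layer that proposes an action overrides them, mirroring the upward data flow and downward control of Figure 1.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Percept:
    """A multimodal sensory event (illustrative encoding)."""
    channel: str   # e.g., "vision" or "audio"
    content: str   # e.g., "face_appeared" or "speech:hello"

@dataclass
class Layer:
    """One task-achieving behavior; returns an action or None."""
    name: str
    behavior: Callable[[Percept], Optional[str]]

def run_layers(layers: list[Layer], percept: Percept) -> str:
    """All layers act on the percept; the highest layer that
    proposes an action wins (downward control)."""
    proposals = [(layer.name, layer.behavior(percept)) for layer in layers]
    for name, action in reversed(proposals):   # highest layer first
        if action is not None:
            return f"{name}: {action}"
    return "idle"

# Hypothetical behaviors for the two lowest layers of Figure 1.
detect = Layer("detect", lambda p: f"orient_toward({p.channel})")
explore = Layer("explore",
                lambda p: "attend" if "face" in p.content else None)

print(run_layers([detect, explore], Percept("vision", "face_appeared")))
# -> "explore: attend"  (explore overrides the raw detect reaction)
```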


Current Implementation

In the current implementation, a social agent has a face, a voice, eyes, and ears. These are realized by two subsystems: a facial animation subsystem that generates a three-dimensional face capable of various facial displays, and a spoken language subsystem that recognizes and interprets speech and generates voice output. Currently, the animation subsystem runs on an SGI 320VGX and the spoken language subsystem on a Sony NEWS workstation. The two subsystems communicate with each other via an Ethernet network.

The face is modeled three-dimensionally. The current face is composed of approximately 500 polygons, and is rendered using a texture taken from a photograph or a video frame. A facial display is realized by local deformation of the polygons representing the face. We use the numerical equations simulating muscle actions defined by Waters (Waters 1987). Currently, 16 muscles and 10 parameters controlling mouth opening, jaw rotation, eye movement, eyelid opening, and head orientation are incorporated. These 16 muscles were chosen by Waters to correspond to action units in the Facial Action Coding System (FACS) (Ekman & Friesen 1978). The facial modeling and animation system is based on the work of Takeuchi and Franks (Takeuchi & Franks 1992).
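As a rough illustration of muscle-driven deformation (a crude linear-falloff stand-in, not Waters' actual muscle equations), each "muscle" can be treated as a displacement field whose influence on a vertex decays with distance from the muscle's attachment point:

```python
import numpy as np

def apply_muscle(vertices: np.ndarray, attach: np.ndarray,
                 pull: np.ndarray, activation: float,
                 radius: float = 1.0) -> np.ndarray:
    """Displace face vertices along 'pull' in proportion to muscle
    activation, with influence decaying away from the attachment
    point. A simplified stand-in for a muscle-based face model."""
    d = np.linalg.norm(vertices - attach, axis=1)   # distance to muscle
    weight = np.clip(1.0 - d / radius, 0.0, 1.0)    # linear falloff
    return vertices + activation * weight[:, None] * pull

# Toy example: three vertices of a face mesh, one "smile" muscle.
mesh = np.array([[0.0, 0.0, 0.0], [0.5, 0.1, 0.0], [2.0, 0.0, 0.0]])
smiled = apply_muscle(mesh, attach=np.array([0.4, 0.0, 0.0]),
                      pull=np.array([0.1, 0.1, 0.0]), activation=0.8)
print(smiled)   # nearby vertices move; distant ones stay put
```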

Speaker-independent continuous speech input is accepted without special hardware. To obtain a high level of accuracy, context-dependent phonetic hidden Markov models are used to construct phoneme-level hypotheses (Itou, Hayamizu, & Tanaka 1992). The speech recognizer outputs N-best word-level hypotheses. The semantic analyzer deals with ambiguities in syntactic structures and generates a semantic representation of the utterance. We applied a preferential constraint satisfaction technique for disambiguation and semantic analysis (Nagao 1992). The plan recognition module determines the speaker's intention by constructing his belief model and dynamically adjusting and expanding the model as the conversation progresses (Nagao 1993). The response generation module generates a response by using domain knowledge and text templates (typical utterance patterns).

The spoken language subsystem recognizes a number of typical conversational situations that are important in communication. We associate these situations with specific communicative facial displays. The correspondence between conversational situations and facial displays is based on the work of Takeuchi and Nagao (Takeuchi & Nagao 1993). For example, in situations where speech input is not recognized or where it is syntactically invalid, the "Not confident" facial display is shown. If the speaker's request is outside the system's knowledge, then the system displays a facial shrug and replies "I cannot manage it without knowing it."

Gaze control is also implemented in the facial animation subsystem, using a video camera fixed on top of the computer display. By comparing incoming images with an image of the vacant room and segmenting the differentiated regions, moving objects are extracted in real time. Assuming that the only moving objects in the room are humans, we can find the 2D positions of the human participants in the image. Using the camera position and direction, each position is translated into a 3D orientation, which is applied to eyeball rotation and face rotation when drawing the 3D face.
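A minimal NumPy sketch of this kind of background-subtraction tracking follows. The function names, the thresholds, and the pixel-to-angle conversion are our illustrative assumptions, not the system's actual parameters.

```python
import numpy as np

def find_person_x(frame: np.ndarray, background: np.ndarray,
                  threshold: float = 30.0) -> float | None:
    """Return the horizontal image coordinate (column) of the moving
    region, found by differencing against the vacant-room image."""
    diff = np.abs(frame.astype(float) - background.astype(float))
    mask = diff > threshold                 # pixels that changed
    if mask.sum() < 50:                     # too few pixels: no person
        return None
    cols = np.nonzero(mask)[1]              # column indices of changes
    return float(cols.mean())               # centroid column

def gaze_angle(x: float, width: int, fov_deg: float = 60.0) -> float:
    """Map an image column to a horizontal gaze angle, assuming a
    pinhole camera with the given horizontal field of view."""
    return (x / width - 0.5) * fov_deg

bg = np.zeros((120, 160))                          # toy vacant-room image
frame = bg.copy(); frame[40:80, 100:120] = 255.0   # a "person" appears
x = find_person_x(frame, bg)
if x is not None:
    print(f"rotate eyes by {gaze_angle(x, 160):+.1f} degrees")
```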

Figure 2 shows a snapshot of conversation between humans and a social agent.

Figure 2: Conversation with a social agent

Conversation as Situated Action

In conversation, people show various kinds of complicated behavior. Some are expressed by the face, others by body motions. Situated action views this complexity as a result of interaction with an environment including the conversants, not as a result of internal complexity. This complexity can be regarded as the fruitfulness of communication.

This fruitfulness is related to the richness of the sensory data, and is realized by the lower layers in Figure 1. However, when implementing it, there is a severe trade-off between processing speed and the content of the data processing: the more information, the slower the processing. Since slow reactions are essentially useless in a real-world setting, we have to force ourselves to reduce time-consuming processing in these layers.

In the current implementation, the following processes are considered for the image and auditory data in those layers: image scene analysis, face position detection, face identification, facial display recognition, auditory scene understanding, voice position detection, voice identification, and speech recognition. Several of these items have already been implemented.

The lower four layers of the decomposition in Figure 1, detect, wander, explore, and recognize, are the major players in situated conversational action. Detect is to detect an input in any of the sensory channels; an agent may perform quick reactions such as looking in the direction from which the input appeared. Generally, such quick reactions are short-lived. Wander is a tendency toward distraction; it may draw the agent's attention away from a detected input. Explore is the opposite of wander: it looks for something attractive, and may suppress wander for a while. Recognize is to recognize sensory data that seem worth paying attention to; it tries to extract their meaning, although full understanding is left to the higher layers.

From the interaction between these layers, various conversational actions emerge. For instance, the composite action "detect then wander" appears as a sign of ignorance or lack of interest; "explore then wander" appears as a sign of refusal; "explore then recognize" appears as a sign of attention and interest. Sophisticated social actions like turn-taking depend heavily on mutual perception of these kinds of signals, so they can be constructed naturally on top of these situated conversational actions.
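The mapping from composite layer actions to conversational signals can be written down directly. The sketch below is a toy classifier over the three examples named in the text; the encoding is ours, not the system's.

```python
# Hypothetical interpretation of composite layer actions as
# conversational signals, following the examples in the text.
SIGNALS = {
    ("detect", "wander"): "ignorance or lack of interest",
    ("explore", "wander"): "refusal",
    ("explore", "recognize"): "attention and interest",
}

def interpret(history: list[str]) -> str:
    """Classify the last two layer activations as a social signal."""
    return SIGNALS.get(tuple(history[-2:]), "no clear signal")

print(interpret(["detect", "explore", "recognize"]))
# -> "attention and interest"
```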

Conversation as Cooperative Action

Language use is an action that influences the human mind; speech act theory formalized this idea (Searle 1969). Actions are assumed to be performed after planning their effects on the world, so language use is also assumed to be performed by planning its effect on the mind. The difference between planning for physical action and for language use is that planning for language use includes the hearers' beliefs and plans (i.e., plan recognition). Conventional plan recognition models deal only with one hearer's beliefs and plans. Research into two-participant conversation (i.e., dialogue) concentrates on task-oriented, well-structured dialogues that make use of a consistent plan library and a well-ordered turn-taking constraint (Carberry 1990). Group conversation is generally ill-structured: interruptions by other speakers and communication mismatches between conversants occur frequently.

We extended the conventional plan recognition models to deal with communication mismatches by maintaining multiagent conversational states. A multiagent conversational state consists of the agents' beliefs, utterances, illocutionary act types, other communicative signals, turn-taking sequences, and activated plans. Illocutionary act types such as INFORM and REQUEST are an abstraction of the speaker's intentions in terms of the actions intended by the speaker. Other communicative signals include facial displays, head and eye movement, and gestures, as described in the previous section. Activated plans are represented as a network that connects the preconditions and effects of plans in an agent's belief space.

Figure 3 illustrates the concept of the multiagent conversational state.

[Figure 3: Conceptual model of the multiagent conversational state, depicting the separate belief spaces of agents A, B, and C.]
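A data-structure sketch of such a state in Python might look as follows. The field names paraphrase the components listed above; they are not the authors' actual representation.

```python
from dataclasses import dataclass, field

@dataclass
class Utterance:
    speaker: str
    text: str
    act_type: str      # illocutionary act type, e.g., "INFORM", "REQUEST"
    signals: list[str] = field(default_factory=list)   # e.g., ["perplexed_look"]

@dataclass
class BeliefSpace:
    """One agent's beliefs, including beliefs about others' plans."""
    facts: set[str] = field(default_factory=set)       # e.g., "knowref(a,e)"
    activated_plans: dict[str, list[str]] = field(default_factory=dict)

@dataclass
class ConversationalState:
    """Multiagent conversational state: per-agent belief spaces plus
    the shared record of turns taken so far."""
    beliefs: dict[str, BeliefSpace]
    turns: list[Utterance] = field(default_factory=list)

state = ConversationalState(beliefs={a: BeliefSpace() for a in "abc"})
state.turns.append(Utterance("a", "Do you know what happened today?",
                             "QUESTIONREF"))
```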

We will explain our mechanism using the following discourse fragment. A and B are humans, while C is an agent that overhears their conversation.

A: "Do you know what happened today?"
B: "I don't know exactly, but ..."
C (to B): "I think that he wants to tell you."

C's plan recognition process upon hearing A's utterance "Do you know what happened today?" is traced in Figure 4.

[Figure 4: Inferred plan recognition on the utterance "Do you know what happened today?". From utter(a, "Do you know what happened today?", t0), the introduction of intention yields knowif(a, knowref(b,e,t1), t1); the positive and negative knowledge branches lead to the plans informref(b,a,e,t2) and informref(a,b,e,t2), whose effects are knowref(a,e,t3) and knowref(b,e,t3), respectively.]

In this figure, the inference proceeds from the top to the bottom. The directions of the arrows (except the thick ones) indicate logical implication; thus, the downward arrows correspond to deduction, while the upward ones correspond to abduction. knowif(X,P,T) means that agent X knows that proposition P holds at time T. knowref(X,P,T) means that agent X knows the contents of proposition P at time T. informref(X,Y,P,T) means that X informs another agent Y of the contents of P at time T. a, b, and e denote agents A and B and the event that 'happened today', respectively.

These plan schemas are assumed to be common to all the conversants, because they are domain-independent and common sense. Domain-dependent plans may differ between conversants, since they may have had different experiences of domain-dependent behavior.

Figure 4 shows that, from A's utterance "Do you know what happened today?", the conversants can infer that A's intention is knowref(a,e,t3) when A believes knowref(b,e,t2), or that A's intention is knowref(b,e,t3) when A believes knowref(a,e,t2). After A's utterance, B believes that A does not know e, so B infers that A's intention is knowref(a,e,t3). However, C infers from another information source that A knows e.
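The branching inference can be paraphrased in a few lines of Python; this is a toy rendering of the Figure 4 rule under our own predicate encoding (time arguments dropped).

```python
def infer_intention(speaker_beliefs: set[str]) -> str | None:
    """Toy version of the Figure 4 inference for A's question about
    event e: the inferred intention depends on which 'knowref' fact
    the speaker is believed to hold."""
    if "knowref(b,e)" in speaker_beliefs:
        return "knowref(a,e)"   # A wants to come to know e from B
    if "knowref(a,e)" in speaker_beliefs:
        return "knowref(b,e)"   # A wants B to come to know e
    return None                 # ambiguous: keep both hypotheses

# B's model of A and C's model of A lead to different readings.
print(infer_intention({"knowref(b,e)"}))   # B's reading -> knowref(a,e)
print(infer_intention({"knowref(a,e)"}))   # C's reading -> knowref(b,e)
```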

After B's utterance, agent C says "I think that he wants to tell you," because there is a mismatch between the beliefs of A and B (in C's belief space), and C infers that it could be an obstacle to the progress of the conversation.

To detect communication mismatches in group con- versation, social agents consider assumptions based on other agents’ utterance planning. A communication mismatch is judged to have occurred when an agent recognizes the following situations.

1.

2.

3.

Illocutionary act mismatch An example would be the situation where an agent utters an utterance of illocutionary act type QUES- TIONREF (i.e., the agent wants to know about something) and, after that, another agent utters an utterance of type INFORMIF (i.e., yes/no answer). e.g., An agent asked “Do you know what happened today ?” with the intention of knowing about the event ‘happened today’ and the other agent’s answer was “Yes, I do.” Belief inconsistency An agent misunderstands another agent’s beliefs. e.g., An agent asked “Do you know what happened today?” with the intention of knowing about the event ‘happened today’ but the agent misunderstood that the other agent already knew about the event. Plan inconsistency An agent misunderstands another agent’s intended plans. e.g., An agent asked “Do you know what hap- pened today?” with the intention of knowing about the event ‘happened today’ and the other agent mis- understood that the agent had a plan to inform about the event. In general, it is not necessary to exactly determine

the conversants’ intentions, because it is sufficient to know whether there is a communication mismatch. De- cisions on unnecessary occasions should be delayed un- til required (van Beek & Cohen 1991).
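A schematic detector for the first kind of mismatch (illocutionary act mismatch) under the toy encoding of the earlier sketches; the expected-answer table is our illustrative assumption.

```python
# Pairs of (question type, expected answer type); answering a
# QUESTIONREF with a yes/no INFORMIF is flagged as a mismatch.
EXPECTED_ANSWER = {"QUESTIONREF": "INFORMREF", "QUESTIONIF": "INFORMIF"}

def act_mismatch(question_act: str, answer_act: str) -> bool:
    """Detect an illocutionary act mismatch between adjacent turns."""
    expected = EXPECTED_ANSWER.get(question_act)
    return expected is not None and answer_act != expected

# A asks what happened (QUESTIONREF); B answers "Yes, I do." (INFORMIF).
print(act_mismatch("QUESTIONREF", "INFORMIF"))   # -> True: mismatch
```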

When social agents detect a communication mismatch, they inform the other conversants by saying a few phrases from which the conversants can easily determine their misunderstanding. These utterances are called minimal utterances. Minimal utterances of social agents arise from the constraints imposed by their resource-boundedness and socialness. They limit the processing for generation required of resource-bounded agents. They also help avoid further progress of the conversation before the misunderstanding is resolved. In conversation, timely action is crucial, since delays have meaning in themselves. We consider the minimal utterance a situated action. As mentioned before, situated actions in conversation involve multimodality; so, in this case, an agent's response includes facial actions and prosodic actions in voice tones.

Example Conversation

Now, let's look at an example of a conversation between humans and a social agent. The humans (A and B) are talking about cooking, and the social agent (C) overhears their talk. This example is based on Kautz's cooking plan library (Kautz 1990).

A: "I made marinara sauce. What brand of wine do you like?"
B: "Marinara ... OK. Italian 'Soave' is good."
A (with a perplexed look): "...."
C (to A): "I think that he is thinking of a pasta dish."
A (to B): "Oh. I am making chicken marinara."

Figure 5 shows part of the cooking plan library used to understand the example conversation. In this figure, the upward-pointing thick arrows correspond to is-a (a-kind-of) relationships, while the downward-pointing thin arrows indicate has-a (part-of) relationships. We assume that C uses this plan library from the initial state.

[Figure 5: Cooking plan library (Kautz 1990). The hierarchy relates dish types (pasta dishes and meat dishes) to plans such as Make fettucini pesto, Make fettucini marinara, Make spaghetti marinara, and Make chicken marinara, with component steps including Make marinara sauce, Choose white wine, and Choose red wine.]
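A small rendering of such a plan library as data, with the is-a and has-a links needed to reproduce B's inference. The dish names follow Kautz's cooking domain as shown in Figure 5; the exact structure here is our reconstruction, not the system's plan library.

```python
# is-a links: concrete plan -> more abstract plan.
IS_A = {
    "make_fettucini_marinara": "make_pasta_dish",
    "make_spaghetti_marinara": "make_pasta_dish",
    "make_chicken_marinara": "make_meat_dish",
}
# has-a links: plan -> component steps.
HAS_A = {
    "make_fettucini_marinara": ["make_marinara_sauce", "choose_white_wine"],
    "make_spaghetti_marinara": ["make_marinara_sauce", "choose_white_wine"],
    "make_chicken_marinara": ["make_marinara_sauce", "choose_red_wine"],
}

def plans_with_step(step: str) -> list[str]:
    """All concrete plans that contain the observed step."""
    return [plan for plan, steps in HAS_A.items() if step in steps]

# Observing "make_marinara_sauce" leaves all three readings open;
# B jumps to the pasta reading, which causes the mismatch below.
for plan in plans_with_step("make_marinara_sauce"):
    print(plan, "is-a", IS_A[plan])
```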

After C hears A's first utterance, C infers A's activated plans at that conversational state, as shown in Figure 6. (For the sake of simplicity, the figure omits the speech act predicates such as knowif, knowref, and informref described in the previous section.)

[Figure 6: A's activated plan, a fragment of the plan library rooted in an end event and covering Make fettucini marinara, Make spaghetti marinara, and Make chicken marinara.]

After hearing B's utterance, C infers B's activated plans (B's recognized plans about A) at that conversational state, as shown in Figure 7.

[Figure 7: B's recognized plan about A, covering Choose white wine and the pasta plans Make fettucini marinara and Make spaghetti marinara.]

This inference involves knowledge about the relationships between wines and dishes (i.e., white wines are well suited to pasta dishes). When C sees A's perplexed facial display, C makes the assumption that A's intention is not to make a pasta dish (but rather a meat dish), and C detects that there is a communication mismatch (i.e., a plan inconsistency) between A and B. (In the current implementation, the agent cannot recognize or understand human facial displays; so in this case, C detects the communication mismatch from a break in the conversation.)

Then, C is motivated to inform A (or B) that a misunderstanding has occurred. In this case, C infers that it is better to inform A than B, since A is now perplexed and may need help. As a result, C tells A about (C's inferred version of) B's recognized plan about A.

This example shows that social agents can detect communication mismatches by maintaining multiagent conversational states, and can voluntarily take part in the conversation to smooth out misunderstandings when they detect obstacles to communication. A detailed mechanism for social agents' cooperative (minimal) utterance generation will be presented in a separate publication (Nagao 1994).

Another example of cooperative conversation is a social agent volunteering additional information related to the topic of conversation. For example, a guest talks with a desk clerk about French restaurants, and the clerk recommends the best one he knows; the overhearing agent then contacts the agent of that restaurant and tells the guest that it is fully booked today.

Concluding Remarks and Further Work

We presented an approach to social interaction: multimodal conversation with social agents. Socialness, like autonomy, is an essential property of intelligent agents. Social agents consider other agents' (including humans') beliefs and intentions, and behave cooperatively. One example of cooperation is the removal of obstacles to communication caused by misunderstandings between agents. Our model can deal with N-conversant plan recognition and detect communication mismatches between conversants. These mismatches consist of illocutionary act mismatches, belief inconsistencies (misunderstandings of the beliefs held by other conversants), and plan inconsistencies (misunderstandings of the intended domain plans of others). An ideal multimodal interaction is modeled on human face-to-face conversation, in which speech, facial displays, head and eye movement, etc. are utilized. Our system integrates these modalities for interacting with social agents.

In the future, we plan to simulate human-to-human communication in more complex social environments. We need to design several social relationships between agents and implement social stereotypes (e.g., social standing, reputation, etc.) and personal properties (e.g., disposition, values, etc.). These social and personal properties can change dynamically according to conversational contexts. From these studies, we can propose some design principles for a society of agents.

Of course, future work needs to be done on the design and implementation of coordination of multiple communication modalities. We think that such coordination is an emergent phenomenon arising from tight interactions with environments (including humans and other agents) by means of situated actions and (more deliberate) cooperative actions. Precise control of multiple coordinated activities, therefore, is not directly implementable; only constraints or relations among perception, conversational states, and action will be implementable. As a first step, we are developing a constraint-based computational architecture and applying it to tightly coupled spoken language comprehension (Nagao, Hasida, & Miyata 1993).

Co-constructive conversation that is less constrained by domains or tasks is another of our future targets. We are also interested in developing interactive characters and stories as an application for interactive entertainment. Bates and his colleagues called such interactive systems "believable agents" (Bates et al. 1994). We are trying to build a conversational, anthropomorphic computer character that will entertain us with pleasant stories.

Acknowledgments

The authors would like to thank Mario Tokoro and colleagues at Sony CSL for their encouragement and discussion, the anonymous reviewers for their valuable comments on a draft of this paper, and Toru Ohira for his helpful advice on the wording. We also extend our thanks to Satoru Hayamizu, Katunobu Itou, Taketo Naito, and Steve Franks for their contributions to the implementation of the prototype system. Special thanks go to Keith Waters for granting permission to access his original animation system.

References

Bates, J.; Hayes-Roth, B.; Nilsson, N.; and Laurel, B., eds. 1994. AAAI 1994 Spring Symposium on Believable Agents. American Association for Artificial Intelligence.
Bates, J.; Loyall, A. R.; and Reilly, W. S. 1992. An architecture for action, emotion, and social behavior. In Proceedings of the Fourth European Workshop on Modeling Autonomous Agents in a Multi-Agent World (MAAMAW'92). Institute of Psychology of the Italian National Research Council.
Brooks, R. A. 1986. A robust layered control system for a mobile robot. IEEE Journal of Robotics and Automation 2(1):14-23.
Carberry, S. 1990. Plan Recognition in Natural Language Dialogue. The MIT Press.
Chovil, N. 1991. Discourse-oriented facial displays in conversation. Research on Language and Social Interaction 25:163-194.
Cohen, P. R., and Levesque, H. J. 1990. Rational interaction as the basis for communication. In Cohen, P. R.; Morgan, J.; and Pollack, M. E., eds., Intentions in Communication. The MIT Press. 221-255.
Ekman, P., and Friesen, W. V. 1978. Facial Action Coding System. Palo Alto, California: Consulting Psychologists Press.
Itou, K.; Hayamizu, S.; and Tanaka, H. 1992. Continuous speech recognition by context-dependent phonetic HMM and an efficient algorithm for finding N-best sentence hypotheses. In Proceedings of ICASSP-92, 1.21-1.24. IEEE.
Kautz, H. 1990. A circumscriptive theory of plan recognition. In Cohen, P. R.; Morgan, J.; and Pollack, M. E., eds., Intentions in Communication. The MIT Press. 105-133.
Nagao, K.; Hasida, K.; and Miyata, T. 1993. Understanding spoken natural language with omni-directional information flow. In Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence (IJCAI-93), 1268-1274. Morgan Kaufmann Publishers, Inc.
Nagao, K. 1992. A preferential constraint satisfaction technique for natural language analysis. In Proceedings of the Tenth European Conference on Artificial Intelligence (ECAI-92), 523-527. John Wiley & Sons.
Nagao, K. 1993. Abduction and dynamic preference in plan-based dialogue understanding. In Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence (IJCAI-93), 1186-1192. Morgan Kaufmann Publishers, Inc.
Nagao, K. 1994. Minimal utterances of resource-bounded social agents. Technical Report (forthcoming), Sony Computer Science Laboratory Inc., Tokyo, Japan.
Searle, J. R. 1969. Speech Acts: An Essay in the Philosophy of Language. Cambridge University Press.
Suchman, L. 1987. Plans and Situated Actions. Cambridge University Press.
Takeuchi, A., and Franks, S. 1992. A rapid face construction lab. Technical Report SCSL-TR-92-010, Sony Computer Science Laboratory Inc., Tokyo, Japan.
Takeuchi, A., and Nagao, K. 1993. Communicative facial displays as a new conversational modality. In Proceedings of ACM/IFIP INTERCHI'93: Conference on Human Factors in Computing Systems, 187-193. ACM Press.
van Beek, P., and Cohen, R. 1991. Resolving plan ambiguity for cooperative response generation. In Proceedings of the Twelfth International Joint Conference on Artificial Intelligence (IJCAI-91), 938-944. Morgan Kaufmann Publishers, Inc.
Vera, A. H., and Simon, H. A. 1993. Situated action: A symbolic interpretation. Cognitive Science 17(1):7-48.
Waters, K. 1987. A muscle model for animating three-dimensional facial expression. Computer Graphics 21(4):17-24.
