
How to Talk to a Hologram

Anton Leuski, Institute for Creative Technologies, USC, Marina del Rey, CA 90292 ([email protected])

Jarrell Pair, Institute for Creative Technologies, USC, Marina del Rey, CA 90292 ([email protected])

David Traum, Institute for Creative Technologies, USC, Marina del Rey, CA 90292 ([email protected])

Peter J. McNerney, Institute for Creative Technologies, USC, Marina del Rey, CA 90292 ([email protected])

Panayiotis Georgiou, University of Southern California, Los Angeles, CA 90089 ([email protected])

Ronakkumar Patel, Institute for Creative Technologies, USC, Marina del Rey, CA 90292 ([email protected])

ABSTRACT
There is a growing need for life-like virtual human simulations that can conduct a natural spoken dialog with a human student on a predefined subject. We present an overview of a spoken-dialog system that supports a person interacting with a full-size hologram-like virtual human character in an exhibition kiosk setting. We also give a brief summary of the natural language classification component of the system and describe the experiments we conducted with the system.

Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Selection Process; H.4.m [Information Systems Applications]: Miscellaneous; H.5.2 [Information Interfaces and Presentation]: User Interfaces—Natural language, Voice I/O

General Terms
Design, Experimentation, Human Factors

Keywords
Spoken dialog, text classification

1. INTRODUCTION
In the recent Hollywood movie "I, Robot," set in 2035, the main character, played by Will Smith, is investigating the death of an old friend. The detective finds a small device that projects a holographic image of the deceased, delivers a recorded message, and responds to the detective's questions by playing back prerecorded answers. The device's responses are limited to the events preceding the accident. We are building virtual characters with similar capabilities in our lab.

These virtual characters are focused on providing educational and training aids for different specializations. For example, such a character can be designed to deliver a spoken message to a student and answer questions regarding the subject of the message.

Copyright is held by the author/owner. IUI'06, January 29–February 1, 2006, Sydney, Australia. ACM 1-59593-287-9/06/0001.

In this paper we describe one of these characters, "Sgt. Blackwell," which serves as the interface for an exhibition kiosk. It supports a natural language dialog that allows the user to explore the system design and get an overview of the exhibition. There are two main contributions of this project. The first is the system integration aspect: the kiosk combines high-quality graphics, elements of physical staging, a state-of-the-art automatic speech recognition (ASR) system, and a text classification approach to provide the user with a complete experience of interacting with a virtual human-like character. The second is the statistical natural language classifier at the core of the system, which analyzes the user's speech and selects the appropriate responses.

We describe the system setup and the technology used to analyze the user's speech and select the appropriate character response. We also describe a set of experiments we use to analyze the system's performance.

2. SGT. BLACKWELL
Figure 1 shows a photograph of the system setup. We see a soldier in full combat gear standing in the doorway of a building. Everything visible here, including the building walls, the stones in front, the table inside the room, the map, the radio, and the rest of the military gear, is a physical prop. The soldier is a life-size pseudo-hologram: a high-resolution real-time graphics projection onto specialized transparent optical film that covers the doorway.

Sgt. Blackwell's graphics are a high-quality computer rendering of a 3D model consisting of more than 60,000 polygons. The life-presence effect is further enhanced by a range of idle-movement animations: the character periodically steps from one foot to the other, flexes his hands, and moves his head.

A user talks to Sgt. Blackwell using a head-mounted close-capture USB microphone. The user's speech is converted into text using an automatic speech recognition system. We used the Sonic speech recognition engine from the University of Colorado [6] with our own acoustic and language models [7].

Figure 1: A photograph of the Sgt. Blackwell system setup.

The character can deliver 83 spoken lines ranging from one word to monologues a couple of paragraphs long. The spoken lines were recorded by a voice actor and automatically transcribed to generate the character's lip movement animations. The delivery of the lines is accompanied by a variety of life-like gestures and body movements. These animations were recorded at a motion-capture studio and edited by our graphics group.

There are three kinds of lines Sgt. Blackwell can deliver: content, off-topic, and prompts. The 57 content-focused lines cover the identity of the character, its origin, its language and animation technology, its design goals, our university, the exhibition setup, and some miscellaneous topics, such as "what time is it?" and "where can I get my coffee?"

When Sgt. Blackwell detects a question that cannot be answered with one of the content-focused lines, it selects one of 13 off-topic responses (e.g., "I am not authorized to comment on that") indicating that the user has ventured out of the allowed conversation domain.

If the user persists in asking questions for which the character has no informative response, the system tries to nudge the user back into the conversation domain by suggesting a question for the user to ask: "You should ask me instead about my technology." There are 7 different prompts in the system.

One topic can be covered by multiple answers, so asking the same question again often results in a different response, introducing variety into the conversation. The user can specifically request an alternative answer by asking something along the lines of "do you have anything to add?" or "anything else?" This is the first of two types of command-like expressions Sgt. Blackwell understands. The second type is a direct request to repeat the previous response, e.g., "come again?" or "what was that?"

If the user persists in asking the same question over and over, the character might be forced to repeat its answer. It indicates this by preceding the answer with one of four "pre-repeat" lines signaling that the incoming response has been heard recently, e.g., "Let me say this again..."
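This response logic can be summarized in code form. The following Python sketch is an illustrative reconstruction under stated assumptions — the classifier interface, cue phrases, and the off-topic threshold are all hypothetical — and is not the deployed implementation:

```python
import random

# Hypothetical cue phrases for the two command-like expression types.
REPEAT_CUES = ("come again", "what was that")
ALTERNATIVE_CUES = ("anything else", "anything to add")

class ResponsePolicy:
    """Sketch of the response logic described above. The classifier is
    any object whose rank(question) returns content answers, best first;
    all names and thresholds here are illustrative assumptions."""

    def __init__(self, classifier, off_topics, prompts, pre_repeats):
        self.classifier = classifier
        self.off_topics = off_topics    # the 13 "can't comment" lines
        self.prompts = prompts          # the 7 "ask me about X" nudges
        self.pre_repeats = pre_repeats  # the 4 "let me say this again" lines
        self.delivered = set()          # content lines already spoken
        self.ranked, self.pos = [], 0   # ranked answers for the last question
        self.off_topic_streak = 0
        self.last_response = ""

    def respond(self, question):
        q = question.lower()
        if any(cue in q for cue in REPEAT_CUES):
            return self.last_response              # repeat verbatim
        if any(cue in q for cue in ALTERNATIVE_CUES) and self.ranked:
            self.pos = (self.pos + 1) % len(self.ranked)
            return self._deliver(self.ranked[self.pos])

        self.ranked, self.pos = self.classifier.rank(question), 0
        if not self.ranked:                        # out of domain
            self.off_topic_streak += 1
            pool = (self.prompts if self.off_topic_streak > 1
                    else self.off_topics)          # nudge after repeats
            return self._say(random.choice(pool))
        self.off_topic_streak = 0
        return self._deliver(self.ranked[0])

    def _deliver(self, answer):
        if answer in self.delivered:               # flag a repeated answer
            answer = random.choice(self.pre_repeats) + " " + answer
        else:
            self.delivered.add(answer)
        return self._say(answer)

    def _say(self, text):
        self.last_response = text
        return text
```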

3. TEXT CLASSIFICATION
A crucial part of the Sgt. Blackwell system is the language understanding module. It analyzes the text strings coming from the speech recognition system and selects the appropriate system responses. The main problem in language understanding is uncertainty. There are two sources of uncertainty in a spoken dialog system: the first is natural language ambiguity, which makes it difficult to compactly characterize the mapping from the text surface form to the meaning; the second is the error-prone text output from the ASR module. One possible approach to building a language understanding system is to design a set of rules that select a response for a given input text string [10]. Because of the uncertainty, this approach can quickly become intractable for anything but the most trivial tasks. An alternative is to create an automatic system that uses a set of training question-answer pairs to learn the appropriate question-answer matching algorithm [1].

On the surface, our answer selection problem is similar to the question answering scenario studied in the context of the Q&A track at TREC [9]. Our setup differs in that we have a fixed number of responses to choose from, while a typical Q&A system has to find the response in a large collection of textual material. Additionally, such a system assumes that the questions are fact-based and that the response must be relevant to the question. The main focus in our setting is on the answer's appropriateness as opposed to its relevance. For example, an evasive or a misleading answer could be appropriate but not relevant.

Our question-answer matching task is similar to the text classification tasks that have been studied in Information Retrieval for several decades [5]. Indeed, we have a fixed set of answers (classes), we have a training set of questions that correspond to individual answers, and we are building a system that selects the appropriate class, or answer, for a previously unseen question. The distinct properties of our study are the small size of the texts we need to classify (the questions are a few words long and the answers almost never exceed 2-3 sentences) and the large number of classes: Sgt. Blackwell has 60 responses to choose from (57 content-focused responses, 2 command-related responses, and one off-topic class). We should also note that the answers have a more informative structure than traditional class labels: the answer text has a particular meaning, and simply replacing each answer with a class label would discard the information contained in the text representation.

For this project we have created three text classification approaches. The first is a state-of-the-art text classification system based on Support Vector Machines (SVM) [8, 4]. We represent each question as a feature vector of tf·idf weighted word unigrams, bigrams, and trigrams; the system is trained to classify question vectors into classes labeled with individual answers. The other two systems are based on the statistical language modeling techniques recently used in mono-lingual (LM) and cross-lingual (CLM) information retrieval [2, 3]. The former uses the Relevance Model technique to compute the language model of the training questions assigned to each answer and compares it to the language model of the test question. The latter uses the training data to "translate" the test question — it calculates the language model of the most likely answer for the test question — and then compares this model to the language models of the individual answers.
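As an illustration of the SVM baseline, here is a minimal sketch using scikit-learn; a standard multiclass linear SVM stands in for the structured SVM of [8], and the training data is a toy placeholder:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy training data: each question is labeled with the id of the answer
# it should trigger. The real system mapped 1261 questions onto 60 classes.
questions = ["what is your name", "who are you", "where are you from"]
answer_ids = [0, 0, 1]

# tf-idf weighted word unigrams, bigrams, and trigrams, as described above.
clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 3)),
    LinearSVC(),  # one-vs-rest linear SVM over the answer classes
)
clf.fit(questions, answer_ids)

print(clf.predict(["tell me your name"]))  # expected: answer id 0
```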

We have compared all three approaches and observed a statistically significant performance advantage of the language modeling techniques over the SVM approach. Both language modeling approaches create better text representations than the feature vectors used in the SVM system. In addition, the CLM system takes advantage of the information in the answer text.
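To make the mono-lingual LM idea concrete, the sketch below builds a smoothed unigram model from the training questions of each answer and ranks answers by the likelihood the model assigns to the test question. This is a generic instantiation of the idea, not the exact Relevance Model formulation of [2, 3]; the smoothing weight and data are illustrative:

```python
import math
from collections import Counter

def unigram_model(texts):
    """Unigram word counts over a collection of strings."""
    counts = Counter()
    for t in texts:
        counts.update(t.lower().split())
    return counts

def log_likelihood(question, counts, background, mu=0.5):
    """Score a question under an answer's question model, linearly
    interpolated with the background collection model (mu is illustrative)."""
    total = sum(counts.values())
    bg_total = sum(background.values())
    score = 0.0
    for w in question.lower().split():
        p = mu * counts[w] / max(total, 1) + (1 - mu) * background[w] / bg_total
        score += math.log(p) if p > 0 else -1e9  # floor for unseen words
    return score

# training_data: answer id -> training questions annotated with that answer
training_data = {0: ["what is your name", "who are you"],
                 1: ["where are you from", "where were you born"]}
background = unigram_model(q for qs in training_data.values() for q in qs)
models = {a: unigram_model(qs) for a, qs in training_data.items()}

test_q = "tell me your name"
best = max(models, key=lambda a: log_likelihood(test_q, models[a], background))
print(best)  # expected: 0
```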

4. EXPERIMENTS
To train our system we developed a data set of matching question-answer pairs. First, a scriptwriter defined a set of questions Sgt. Blackwell should be able to answer and prepared answers for those questions. Then we expanded the set of questions by (a) explicitly paraphrasing the questions and (b) collecting questions from users by simulating the final system in a Wizard of Oz (WOZ) study. The audio recorded during the WOZ studies was transcribed and the questions were added to the data set. In this way we collected 1261 questions in total. We then annotated each question with the appropriate answers.

We conducted two sets of experiments. First, we ran an off-line evaluation of the classification algorithms, comparing the language model classification techniques with the SVM method. These experiments followed a 10-fold cross-validation scheme. We compared the systems using classification accuracy, i.e., the proportion of times the system returned the appropriate answer. The SVM, LM, and CLM systems showed 53.1%, 57.8%, and 62.0% accuracy respectively, a 16.67% relative improvement for the CLM approach over the SVM system. The differences are statistically significant by a t-test at the 5% level (p < 0.05).
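The evaluation protocol can be sketched as follows, assuming scikit-learn and scipy: per-fold accuracies from 10-fold cross-validation are compared with a paired t-test. The corpus here is a toy placeholder and MultinomialNB merely stands in for the LM classifier; whether the paper's t-test was paired over folds is our assumption:

```python
from scipy.stats import ttest_rel
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-in corpus: 2 answer classes x 10 paraphrases each, so that
# 10-fold cross-validation is well defined (the real study used 1261
# questions over 60 answer classes).
X = ([f"question about your name number {i}" for i in range(10)]
     + [f"question about your origin number {i}" for i in range(10)])
y = [0] * 10 + [1] * 10

svm = make_pipeline(TfidfVectorizer(ngram_range=(1, 3)), LinearSVC())
lm = make_pipeline(TfidfVectorizer(), MultinomialNB())  # crude LM stand-in

svm_acc = cross_val_score(svm, X, y, cv=10)  # accuracy per fold
lm_acc = cross_val_score(lm, X, y, cv=10)

print(svm_acc.mean(), lm_acc.mean())
# Paired t-test over the 10 folds; on this trivially separable toy set
# the scores may tie, which degenerates the test -- real data would not.
print(ttest_rel(svm_acc, lm_acc))
```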

We incorporated the best performing classifier (CLM) into the Sgt. Blackwell system, and for the second set of experiments we conducted a user study in which we asked each participant to talk to Sgt. Blackwell. Each person had to ask at least 20 questions. We specifically defined 10 of those questions, selected from the training data set, to test the effect of the speech recognition error rate on the classification; for the rest of the study the users were free to ask any questions. 18 people participated in the study and we collected 378 question-answer pairs. We recorded and transcribed the sessions. A preliminary analysis shows an average 77.8% classifier accuracy on the ASR output, 83.3% accuracy on the hand-transcribed data, and a 36.1% average word error rate (WER) for the ASR. Plotting the answer selection accuracy as a function of the ASR WER shows that the classification accuracy on the ASR output starts to degrade significantly, relative to the accuracy on the hand-transcribed data, only after approximately the 70% WER mark. This indicates that the classifier is very robust to speech recognition errors.
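For reference, WER is the word-level edit distance between the ASR hypothesis and the reference transcript, normalized by the reference length. A minimal implementation of this standard definition (the function name is ours):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / N,
    computed with standard dynamic-programming edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("what is your name", "what is her name"))  # 0.25: one substitution
```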

5. CONCLUSIONS AND FUTURE WORK
In this paper we presented an overview of a spoken-dialog system that supports a person interacting with a full-size hologram-like virtual human character in an exhibition kiosk setting. We also gave a brief summary of the natural language classification component of the system and described the evaluation experiments we conducted with the system.

A preliminary failure analysis indicates a few directions for improving the system's quality. First, we could continue our efforts to collect more training data and extend the question sets.

Second, we could have the system generate a confidence score for its classification decisions. Answers with a low confidence score could then be replaced with an answer that prompts the user to rephrase the question; the system would use the original and the rephrased versions together to repeat the answer selection process.

Finally, we observed that a notable percentage of misclassifications results from the user asking a question that has a strong contextual dependency on the previous answer or question. We are presently looking into incorporating this context information into the answer selection process.

6. REFERENCES
[1] J. Chu-Carroll and B. Carpenter. Vector-based natural language call routing. Computational Linguistics, 25(3):361-388, 1999.
[2] V. Lavrenko, M. Choquette, and W. B. Croft. Cross-lingual relevance models. In Proceedings of the 25th Annual International ACM SIGIR Conference, pages 175-182, Tampere, Finland, 2002.
[3] A. Leuski. Using statistical language models for a dialog response selection. Forthcoming.
[4] A. Leuski. Email is a stage: discovering people roles from email archives. In Proceedings of the 27th Annual International ACM SIGIR Conference, pages 502-503, Sheffield, United Kingdom, 2004. ACM Press.
[5] D. D. Lewis, R. E. Schapire, J. P. Callan, and R. Papka. Training algorithms for linear text classifiers. In Proceedings of the 19th Annual International ACM SIGIR Conference, pages 298-306, Zurich, Switzerland, 1996.
[6] B. Pellom. Sonic: The University of Colorado continuous speech recognizer. Technical Report TR-CSLR-2001-01, University of Colorado, Boulder, CO, 2001.
[7] A. Sethy, P. Georgiou, and S. Narayanan. Building topic specific language models from webdata using competitive models. In Proceedings of EUROSPEECH, Lisbon, Portugal, 2005.
[8] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support vector machine learning for interdependent and structured output spaces. In Proceedings of the 21st International Conference on Machine Learning, Banff, Alberta, Canada, 2004.
[9] E. M. Voorhees. Overview of the TREC 2003 question answering track. In Proceedings of the 12th Text REtrieval Conference, pages 54-69, 2003.
[10] J. Weizenbaum. ELIZA–a computer program for the study of natural language communication between man and machine. Communications of the ACM, 9(1):36-45, 1966.
