
Boise State University ScholarWorks
Computer Science Faculty Publications and Presentations
Department of Computer Science

1-1-2017

A Graphical Digital Personal Assistant That Grounds and Learns Autonomously
Casey Kennington, Boise State University
Aprajita Shukla, Boise State University

This is an author-produced, peer-reviewed version of this article. The final, definitive version of this document can be found online at HAI '17: Proceedings of the 5th International Conference on Human Agent Interaction, published by ACM - Association for Computing Machinery. Copyright restrictions may apply. doi: 10.1145/3125739.3132592

A Graphical Digital Personal Assistant that Grounds and Learns Autonomously

Casey Kennington
Boise State University
1910 University Dr.
[email protected]

Aprajita Shukla
Boise State University
1910 University Dr.
[email protected]

Author Keywords
Grounding; Interactive Dialogue; Personal Assistant

ACM Classification Keywords
H.5.2 User Interfaces: Information Systems

ABSTRACT
We present a speech-driven digital personal assistant that is robust despite little or no training data and autonomously improves as it interacts with users. The system is able to establish and build common ground between itself and users by signaling understanding and by learning, via interaction, a mapping between the words that users actually speak and the system actions. We evaluated our system with real users and found an overall positive response. We further show through objective measures that autonomous learning improves performance in a simple itinerary-filling task.

INTRODUCTION
In spoken interaction, participants signal understanding (e.g., by uttering backchannels), which shapes the interaction by letting conversation participants know that their interlocutor understands what is being said. However, signaling understanding is a challenge for speech-driven agents [6]: some systems display the recognized transcript or utter okay after a request has completed, but there is no guarantee that the request was actually understood, and a misunderstanding could lead to the wrong system action. Moreover, though many systems are based on robust, data-driven statistical models, they are generally static in that the models have a predefined ontology and do not continue to improve as they interact with users. Taken together, these shortcomings are due to a lack of conversational grounding, which is defined as building mutual understanding between dialogue participants [5]. Our goal is to improve system grounding by signaling backchannels to users in an intuitive way and by autonomously improving the mapping between what users say and system actions.

Our personal assistant (pa) system, which we explain further in Section 3, works incrementally (i.e., it updates its internal state word by word) as it comprehends, and gives visual cues of understanding through the gui; crucially, if the system displays incorrect cues, a user can correct the misunderstanding immediately instead of at the end of the request. Because incremental processing lends itself to a system that can signal backchannels, an incremental system can also be more autonomous: requests that have been repaired and confirmed locally can be used as examples of how the system can improve understanding through conversational grounding.

In Section 4 we explain how we evaluate our system with real users under two different settings: a baseline system and a system that learns autonomously. Our user evaluations show that our system is perceived as intuitive and useful, and we show through objective measures that it can autonomously improve through the interactive process. In the following section, we explain how we build off of previous work.

BACKGROUND AND RELATED WORK
Though grounding between systems and users is a challenge [11], we build directly off of recent work that was perceived by users as natural and allowed users to accomplish many tasks in a short amount of time [9], and off of [13, 4], which addressed misalignments of understanding in robot-human interaction scenarios. Also directly related is [8], which used a robot that could signal incremental understanding by performing actions (e.g., moving towards a referred object). Backchannels play a role in the grounding process; [14], for example, used prosodic and contextual features in order to produce a backchannel without overlapping with users' speech. We use a gui to display backchannels (i.e., we need not worry about overlap with user speech).

SYSTEM DESCRIPTION
Incremental Processing
Our system is built as a network of interconnected modules as proposed by the incremental unit (iu) framework [15], a theoretical framework for incremental dialogue processing where bits of information are encoded as the payload of ius; each module performs some kind of operation on those ius and produces ius of its own (e.g., a speech recognizer takes audio input and produces transcribed words as incremental units). The iu framework allows us to design and build a personal assistant that can perform actions without delay, which is crucial in building systems that can ground with the user by signaling ongoing understanding, an important prerequisite to autonomous learning (explained further below).
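To make the module abstraction concrete, the following is a minimal, hypothetical sketch of an iu-style module network in Python. The class names (IncrementalUnit, IUModule, EchoNLU) and the wiring are illustrative assumptions and do not reproduce any particular iu toolkit's API.

```python
# Minimal sketch of an IU-style module network (illustrative only).
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class IncrementalUnit:
    payload: object          # e.g., a transcribed word
    committed: bool = False  # whether the producer may still revoke it

class IUModule:
    """A processing module: consumes input IUs, produces output IUs."""
    def __init__(self):
        self.listeners: List[Callable[[IncrementalUnit], None]] = []

    def subscribe(self, listener: Callable[[IncrementalUnit], None]) -> None:
        self.listeners.append(listener)

    def emit(self, iu: IncrementalUnit) -> None:
        for listener in self.listeners:
            listener(iu)

    def on_input(self, iu: IncrementalUnit) -> None:
        raise NotImplementedError

class EchoNLU(IUModule):
    """Toy 'NLU': wraps each incoming word IU in an output IU of its own."""
    def on_input(self, iu: IncrementalUnit) -> None:
        self.emit(IncrementalUnit(payload=("heard", iu.payload)))

# Wiring: an ASR-like producer feeds the NLU, which feeds a GUI-like consumer.
asr, nlu = IUModule(), EchoNLU()
asr.subscribe(nlu.on_input)
nlu.subscribe(lambda iu: print("GUI update:", iu.payload))
for word in ["medium", "priced", "japanese", "food"]:
    asr.emit(IncrementalUnit(payload=word))
```

In this style, each downstream module can react to every word as it arrives, which is what allows the gui described below to update before the utterance is complete.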

It has been shown that human users perceive incremental systems as more natural than traditional, turn-based systems [1, 17, 16, 3, 9], as offering a more human-like experience [7], and as more satisfying to interact with than non-incremental systems [2]. Moreover, psycholinguistic research has also shown that humans comprehend utterances as they unfold [19, 18].

System Overview
Our system builds directly off of [9], which introduced a system composed of four main components: automatic speech recognition (asr), which incrementally transcribes spoken utterances; natural language understanding (nlu), explained below; a dialogue manager (dm) using OpenDial [12], which determines when the system should select (i.e., fill a slot in a semantic frame), wait for more information, request confirmation, or confirm a confirmation request; and a gui, also explained below. Figure 1 conceptually shows these components and how information (i.e., ius) flows between them. As our work focuses on improvements made to the nlu and gui to improve conversational grounding, we explain these two components in greater detail.

Figure 1. System overview: the asr, nlu, dm, and gui components.

Natural Language Understanding
For nlu, we applied the simple incremental update model (sium) [10], which can produce a distribution over possible slots in a semantic frame. This nlu works incrementally: it updates the distribution over the slot values as it receives words from the asr. The model is formalized below:

$$P(I \mid U) = \frac{1}{P(U)}\, P(I) \sum_{r \in R} P(U \mid R = r)\, P(R = r \mid I) \qquad (1)$$

where P(I|U) is the probability of the intent I (i.e., a semantic frame slot) given the utterance U. R is a mediating variable of properties, which maps between aspects of U and I. For example, italian is an intent I, which has properties pasta, mediterranean, vegetarian, etc. For training, the system learns a mapping between words in U and the properties in R. For application, for each word in U, a probability distribution is produced over R, which is summed over for each I. In our experiments, most properties are pre-defined (which is common), but sometimes properties need to be discovered, e.g., for street names, which are unique to an area or city. Our system can discover such properties and make use of them without prior training data using a Levenshtein distance calculation between the property and word strings (similar to [9]). As the system interacts with the user, it learns mappings between words and properties autonomously.
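As an illustration of Equation (1) in use, the following is a simplified, hypothetical sketch of a word-by-word update of the distribution over intents. The property inventory, the word/property weights, and the string-similarity fallback (standing in for the Levenshtein calculation) are assumptions made for the example, not the actual sium implementation.

```python
# Illustrative sketch of a SIUM-style incremental update in the spirit of Eq. (1).
from difflib import SequenceMatcher  # stand-in for a Levenshtein-style measure

# Intents (slot values) and their mediating properties R.
PROPERTIES = {
    "italian":  ["pasta", "mediterranean", "vegetarian"],
    "japanese": ["sushi", "ramen", "noodles"],
}

# Learned P(U|R=r) weights for (word, property) pairs (toy values).
WORD_GIVEN_PROP = {
    ("noodles", "ramen"): 0.6,
    ("noodles", "pasta"): 0.3,
    ("sushi", "sushi"): 0.9,
}

def p_word_given_prop(word: str, prop: str) -> float:
    """Learned weight if available, else a string-similarity fallback,
    similar in spirit to the Levenshtein-based property discovery."""
    if (word, prop) in WORD_GIVEN_PROP:
        return WORD_GIVEN_PROP[(word, prop)]
    return 0.1 * SequenceMatcher(None, word, prop).ratio()

def update(belief: dict, word: str) -> dict:
    """One incremental step: fold a new word into the distribution over intents."""
    new_belief = {}
    for intent, prior in belief.items():
        props = PROPERTIES[intent]
        # Sum over the mediating properties r, with uniform P(R=r|I) here.
        evidence = sum(p_word_given_prop(word, r) / len(props) for r in props)
        new_belief[intent] = prior * evidence
    total = sum(new_belief.values()) or 1.0
    return {i: v / total for i, v in new_belief.items()}

belief = {"italian": 0.5, "japanese": 0.5}  # uniform prior P(I)
for w in ["some", "noodles"]:
    belief = update(belief, w)
    print(w, belief)
```

After the word "noodles" arrives, the belief shifts towards japanese because that word is more strongly associated with the ramen property, mirroring how the real nlu narrows down the intended slot value as the utterance unfolds.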

Grounded Conversation with an Informative GUI
Our gui has a map (using the Google Search and Maps APIs), a list of suggestions, and an itinerary created by the user, derived from the suggestions. Figure 2 portrays this: the top half of the gui is a map annotated with the locations of suggested items (in this example, restaurants). If a user selects any item in the Suggestions list (e.g., by tapping or clicking on it), it is added to the Itinerary list for later reference.

Grounding through the GUI: Figure 2 shows the state of the gui for an example utterance I'm hungry for some medium-priced Japanese food. The gui updated (i.e., by expanding branches showing the options and filling a branch with one of those options, as in price:medium), thereby signaling continual understanding to the user (i.e., a backchannel beyond just showing the asr transcription). Nodes colored in red denote where the user should focus her attention. The system is able to signal a clarification request by displaying japanese? as a branch of cuisine. This informs the user not only that there was misunderstanding, but exactly what part of the utterance was misunderstood (in this case, it technically wasn't misunderstanding; rather, the system verified the intent of the user). To continue, a simple yes would fill cuisine with japanese, thereby completing the expected information needed for that particular intent type (i.e., restaurant). At that point, the system is as informed as it will be, so the user can select from the list of suggestions, ranked by relevance to the request utterance. In the event that a clarification request is answered with no (or some other negative response), the question mark goes away and the node is filled again in blue; i.e., japanese. In addition to the tree, the map incrementally updates as the user's utterance unfolds by displaying a list of suggestions ranked by relevance; the locations of those suggested items are further annotated in the map with a relevant icon. As the request unfolds, the number of points of interest shrinks, resulting in a zooming-in effect on the map. Taken together, these visual cues provide several signals of ongoing system understanding to the user.

At any point the user can restart by saying a reserved keyword (e.g., reset), and at any point the user can "backtrack" by saying no, which unfills each slot one by one. For example, in Figure 2, if the user had uttered I'm hungry for some medium-priced Mexican food and the system filled price:medium and cuisine:japanese, the user could say no, which would result in an expanded cuisine slot. This allows users to repair potential misunderstandings locally before the system performs a (potentially wrong) task or produces an incorrect response.
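The slot-filling, backtracking, and reset behavior can be summarized with a small state object. The sketch below is hypothetical (the Frame class and its methods are not taken from the system's code); it only illustrates the fill, unfill-on-no, and reset logic described above.

```python
# Hypothetical sketch of the slot frame behind the GUI tree.
class Frame:
    def __init__(self, slots):
        self.slots = {s: None for s in slots}  # e.g., price, cuisine
        self.history = []                      # fill order, for backtracking

    def fill(self, slot, value, confirmed=False):
        self.slots[slot] = (value, confirmed)  # unconfirmed shows as "value?" in the tree
        self.history.append(slot)

    def unfill_last(self):
        """User said 'no': re-expand the most recently filled slot."""
        if self.history:
            self.slots[self.history.pop()] = None

    def reset(self):
        """Reserved keyword (e.g., 'reset'): clear the whole frame."""
        self.slots = {s: None for s in self.slots}
        self.history.clear()

frame = Frame(["price", "cuisine"])
frame.fill("price", "medium", confirmed=True)
frame.fill("cuisine", "japanese")  # shown as a clarification: japanese?
frame.unfill_last()                # user answered "no"
print(frame.slots)                 # {'price': ('medium', True), 'cuisine': None}
```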

Figure 2. Our system gui shows the right-branching tree, a corresponding map, suggestions, and a list of items that the user opted to add to the itinerary.

Autonomous Learning: Our system further improves upon previous work by leveraging the gui to learn as it interacts. We accomplish this by collecting the words of a completed utterance and the corresponding filled slots, then informing the nlu that the utterance led to the filled slots, effectively providing an additional positive training example for the nlu. The nlu can then improve its probabilistic mapping between words and slot values, i.e., by updating the sub-model P(U|R) through retraining the classifier with the new information. This is a useful addition because the system designer could not possibly know beforehand all the possible utterances and corresponding intents for all users; this effectively allows the system to begin from scratch with little or no training data. It also allows the system to adapt (i.e., establish common ground) to user preferences, as certain words denote certain items (e.g., noodles could mean Japanese ramen for one user, or Italian pasta for another). Our system provides autonomous learning by updating the nlu with the filled slot values and the utterance when the user selects an item in the Suggestions list to add it to the Itinerary. This allows the system to learn without interrupting the user's productivity with an explicit feedback request.
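A minimal sketch of this learning step is given below, assuming a simple count-based re-estimation of the word/slot-value mapping; the AutonomousNLU class and its methods are illustrative, not the system's actual retraining code.

```python
# Hedged sketch of the autonomous-learning step: when the user adds a
# suggestion to the Itinerary, treat (utterance words, filled slots) as a
# new positive example and refresh the word/slot-value mapping.
from collections import defaultdict

class AutonomousNLU:
    def __init__(self):
        # co-occurrence counts used to re-estimate something like P(U|R)
        self.counts = defaultdict(lambda: defaultdict(int))

    def add_example(self, utterance_words, filled_slots):
        """Called when an item is added to the Itinerary (implicit confirmation)."""
        for word in utterance_words:
            for slot_value in filled_slots.values():
                self.counts[slot_value][word] += 1

    def p_word_given_value(self, word, slot_value):
        total = sum(self.counts[slot_value].values())
        return self.counts[slot_value][word] / total if total else 0.0

nlu = AutonomousNLU()
# User said "I'm hungry for some noodles", slots were confirmed, item selected:
nlu.add_example(["hungry", "noodles"], {"cuisine": "japanese"})
print(nlu.p_word_given_value("noodles", "japanese"))  # 0.5 after one example
```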

EXPERIMENT
This section explains a user evaluation performed on our pa. Our pa can find information in the following domains: art galleries, bars, bus stations, museums, parking lots, and restaurants. These affordances are clearly visible to users when they first see the pa gui. Users interacted with one of two versions of our pa: baseline or autonomous learning. Both versions were the same in that they discovered possible intents (in this experiment, only bus station) and applied the same gui (i.e., the annotated map and right-branching tree), which displayed selectable options that are added to an itinerary when selected. The autonomous learning version additionally improved as explained above. To allow for greater participation diversity, we made our system available through a web interface and posted the link on various social media sites.

Task & Procedure
Participants were asked to use the Chrome browser on a non-mobile device (e.g., a laptop) with a working microphone. Participants took part in the study completely online by directing their browsers to an informed disclosure about the nature of the study; then instructions were given, which are simplified as follows: you have been living in a city for a few months and a (fictitious) friend named Samantha wishes to visit you for a weekend. Use our pa to plan out an itinerary for your friend's visit. We chose Boise, Idaho (U.S.A.) as the location for participants to explore (in future work, we will allow participants to set their language and location).

After reading and agreeing to the instructions, the participants were directed to our pa system, with which they interacted in their web browser via speech (we used Google asr, which works incrementally). At any point, they could add candidate items suggested by our pa into the Itinerary list. This constituted phase one. After three minutes, a message popped up showing their itinerary and a request that they re-create the same itinerary for another friend. The purpose of this was to see if the system had learned anything about their individual preferences or their way of expressing their intent as they recreated their original itinerary.

After they acknowledged the pop-up by clicking OK, their itinerary was cleared and they were again able to interact with the system, thereby beginning phase two. They were given another three minutes to complete phase two, for a total of six minutes of interaction with our pa. Afterwards, they were taken to a questionnaire about their experience with our pa, followed by a form they could fill out to enter a gift card drawing, and finally a debriefing. This task favors a system that can suggest possibilities optimized for breadth, i.e., filling a diverse itinerary. Even though we wish to show how our system can ground autonomously, we opted for this task because it represents a realistic scenario beyond previous work. In total, 15 participants took part in our study and filled out the questionnaire: 8 for the baseline setting and 7 for the autonomous learning setting.


Metrics
We report subjective and objective scores. We report the following objective measures, derived from system logs:

• average length of utterances

• number of items added to the itinerary in the first phase

• fscore between the itineraries in the two phases (where the itinerary from phase one is the expected itinerary for phase two; a sketch of this computation follows the list)

• number of times the user had to reset the gui

• number of times the user had to backtrack

• number of times the system applied improvements
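As referenced in the list above, the itinerary fscore treats the phase-one itinerary as the reference and the phase-two itinerary as the hypothesis. The sketch below assumes exact matching on item names, which is an illustrative simplification rather than the exact scoring script used in the study.

```python
# Hedged sketch of the itinerary f-score between the two phases.
def itinerary_fscore(phase_one, phase_two):
    reference, hypothesis = set(phase_one), set(phase_two)
    if not reference or not hypothesis:
        return 0.0
    overlap = len(reference & hypothesis)
    if overlap == 0:
        return 0.0
    precision = overlap / len(hypothesis)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)

print(itinerary_fscore(["Restaurant A", "Museum B", "Park C"],
                       ["Restaurant A", "Park C", "Bar D"]))  # ~0.67
```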

The subjective scores come from the questionnaires, where participants responded using a 5-point Likert scale (the italicized portion is a shortened version of the question that we use in the results table below):

• like the map - I liked how the screen showed the map and the assistant at the same time.

• tree representation - The assistant "tree" could have been better represented in some other way.

• worked as expected - I almost always got the results I wanted.

• intuitive - The pa was easy and intuitive to use.

• noticed it improved - I had the impression that it was improving.

• speak/pause - I didn't know when to speak or pause.

• system predict better - It could better predict what I wanted.

• appeared to understand - It appeared to understand what I said.

• natural interaction - I felt that the interaction was more natural than the other personal assistants I have used.

• fix misunderstandings - I liked that I could fix misunderstandings quickly.

We hypothesize that the autonomous version of the system will produce better results than the baseline system, which makes no attempt at learning or improvement. For subjective measures, we hypothesize that the overall experience of both versions will be positive (in fact, as a sanity check, for some questions and measures we expect the scores for both versions to be very similar), but overall the impression that the system improved should be higher for the autonomous learning setting.

Results
Objective
Table 1 shows the objective results averaged over all participants. The results show that, in general, the two systems produce similar results, as expected. The important difference is in the fscore, which shows how well the itineraries of the two phases match: the itinerary fscore between the first and second phases for the autonomous system is much higher than it is for the baseline system. We conjecture that this is due to the system learning during the first phase what kinds of items the user added to the itinerary (as illustrated by the average number of improvements made by the autonomous version). During phase two, when the users were required to make the same itinerary, the autonomous system had a stronger mapping between utterances and previously selected items, thereby predicting to a small degree what their preferences were (which, as explained above, is a form of grounding).

item                     baseline       autonomous
avg utt len*             1.63 (0.69)    2.86 (1.51)
avg # itinerary items    2.5  (3.6)     1.75 (1.78)
avg itinerary fscore     0.04 (0.07)    0.5  (0.5)
avg # reset*             11.6 (26.2)    6.12 (10.2)
avg # no*                2.6  (6.9)     2.18 (3.23)
avg # improvements       0    (0)       6.25 (3.9)

Table 1. Objective results: avg. (std). Asterisks denote items where lower scores are better.

Subjective
Table 2 shows the subjective scores from the questionnaire averaged over all participants, with standard deviations in parentheses (an asterisk denotes a question where lower scores are better). Overall, the subjective scores do not show a strong preference for either system (a t-test revealed no statistical significance using a Bonferroni correction of 12), though both systems are rated positively. The users liked that the map directly showed points of interest, and they liked the ability to reset at any time. Though they did not have the impression that the autonomous version was improving while they interacted, they did notice that the autonomous version predicted what they wanted better than the baseline system.

question                   baseline     autonomous
I like the map             4.5 (0.7)    4.3 (0.9)
tree representation*       3.1 (1.3)    3.0 (1.6)
worked as expected         3.0 (1.3)    3.0 (1.3)
intuitive                  3.1 (0.9)    3.3 (1.2)
I noticed it improved      3.0 (1.0)    2.9 (1.2)
speak/pause*               3.4 (1.4)    3.3 (1.6)
predict better*            3.1 (1.2)    2.5 (1.0)
appeared to understand     3.0 (1.2)    3.3 (1.4)
natural interaction        2.5 (1.2)    2.9 (1.0)
fix misunderstandings      3.4 (1.0)    3.0 (1.4)

Table 2. Subjective results from questionnaires: avg. (std). Asterisks denote questions where lower scores are better.

CONCLUSIONS & FUTURE WORK
The results are positive overall: the system is useful and allows users to fill an itinerary using speech. Users were able to recreate their itineraries with the autonomous system much more accurately than with the baseline system. Minimal grounding indeed took place through the gui, by means of the tree and map, both of which updated incrementally as the users' utterances unfolded, by properties (i.e., ontology) discovered by the system, and by improving the mapping between utterances and properties. For future work, we will leverage our system to autonomously improve the dialogue manager.

REFERENCES
1. Gregory Aist, James Allen, Ellen Campana, Lucian Galescu, Carlos Gallo, Scott Stoness, Mary Swift, and Michael Tanenhaus. Software architectures for incremental understanding of human speech. In Proceedings of CSLP, pages 1922–1925, 2006.

2. Gregory Aist, James Allen, Ellen Campana, Carlos Gomez Gallo, Scott Stoness, and Mary Swift. Incremental understanding in human-computer dialogue and experimental evidence for advantages over nonincremental methods. In Pragmatics, volume 1, pages 149–154, Trento, Italy, 2007.

3. Layla El Asri, Romain Laroche, Olivier Pietquin, and Hatim Khouzaimi. NASTIA: Negotiating Appointment Setting Interface. In Proceedings of LREC, pages 266–271, 2014.

4. Joyce Y. Chai, Lanbo She, Rui Fang, Spencer Ottarson, Cody Littley, Changsong Liu, and Kenneth Hanson. Collaborative effort towards common ground in situated human-robot dialogue. In Proceedings of the 2014 ACM/IEEE International Conference on Human-Robot Interaction, pages 33–40, Bielefeld, Germany, 2014.

5. Herbert H. Clark and Edward F. Schaefer. Contributing to discourse. Cognitive Science, 13(2):259–294, 1989.

6. Nina Dethlefs, Helen Hastie, Heriberto Cuayáhuitl, Yanchao Yu, Verena Rieser, and Oliver Lemon. Information density and overlap in spoken dialogue. Computer Speech and Language, 37:82–97, 2016.

7. Jens Edlund, Joakim Gustafson, Mattias Heldner, and Anna Hjalmarsson. Towards human-like spoken dialogue systems. Speech Communication, 50(8-9):630–645, 2008.

8. Julian Hough and David Schlangen. A Model of Continuous Intention Grounding for HRI. In Proceedings of The Role of Intentions in Human-Robot Interaction Workshop, 2017.

9. Casey Kennington and David Schlangen. Supporting Spoken Assistant Systems with a Graphical User Interface that Signals Incremental Understanding and Prediction State. In Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 242–251, Los Angeles, September 2016. Association for Computational Linguistics.

10. Casey Kennington and David Schlangen. A Simple Generative Model of Incremental Reference Resolution in Situated Dialogue. Computer Speech & Language, 2017.

11. Geert-Jan M. Kruijff. There is no common ground in human-robot interaction. In Proceedings of SemDial, 2012.

12. Pierre Lison. A hybrid approach to dialogue management based on probabilistic rules. Computer Speech and Language, 34(1):232–255, 2015.

13. Changsong Liu, Rui Fang, and Joyce Yue Chai. Towards Mediating Shared Perceptual Basis in Situated Dialogue. In Proceedings of the 13th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 140–149, Seoul, South Korea, July 2012. Association for Computational Linguistics.

14. Raveesh Meena, Gabriel Skantze, and Joakim Gustafson. Data-driven models for timing feedback responses in a Map Task dialogue system. Computer Speech and Language, 28:903–922, 2014.

15. David Schlangen and Gabriel Skantze. A General, Abstract Model of Incremental Dialogue Processing. Dialogue & Discourse, 2:83–111, 2011.

16. Gabriel Skantze and Anna Hjalmarsson. Towards Incremental Speech Generation in Dialogue Systems. In Proceedings of SIGDIAL, pages 1–8, Tokyo, Japan, September 2010.

17. Gabriel Skantze and David Schlangen. Incremental dialogue processing in a micro-domain. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics (EACL '09), pages 745–753, April 2009.

18. Michael J. Spivey, Michael K. Tanenhaus, Kathleen M. Eberhard, and Julie C. Sedivy. Eye movements and spoken language comprehension: Effects of visual context on syntactic ambiguity resolution. Cognitive Psychology, 45(4):447–481, 2002.

19. Michael Tanenhaus, Michael Spivey-Knowlton, Kathleen Eberhard, and Julie Sedivy. Integration of visual and linguistic information in spoken language comprehension. Science, 268(5217):1632–1634, 1995.
