Proceedings of the SIGDIAL 2014 Conference, pages 84–88, Philadelphia, U.S.A., 18-20 June 2014. ©2014 Association for Computational Linguistics

InproTKS: A Toolkit for Incremental Situated Processing

Casey Kennington
CITEC, Dialogue Systems Group, Bielefeld University
ckennington@cit-ec.uni-bielefeld.de

Spyros Kousidis
Dialogue Systems Group, Bielefeld University
spyros.kousidis@uni-bielefeld.de

David Schlangen
Dialogue Systems Group, Bielefeld University
david.schlangen@uni-bielefeld.de

Abstract

In order to process incremental situated dialogue, it is necessary to accept information from various sensors, each tracking, in real time, different aspects of the physical situation. We present extensions of the incremental processing toolkit INPROTK which make it possible to plug in such multimodal sensors and to achieve situated, real-time dialogue. We also describe a new module which enables the use in INPROTK of the Google Web Speech API, which offers speech recognition with a very large vocabulary and a wide choice of languages. We illustrate the use of these extensions with a description of two systems handling different situated settings.

1 Introduction

Realising incremental processing of speech in- and output – a prerequisite to interpretation and possibly production of speech concurrently with the other dialogue participant – requires some fundamental changes in the way that components of dialogue systems operate and communicate with each other (Schlangen and Skantze, 2011; Schlangen and Skantze, 2009). Processing situated communication, that is, communication that requires reference to the physical setting in which it occurs, makes it necessary to accept (and fuse) information from various different sensors, each tracking different aspects of the physical situation, making the system multimodal (Atrey et al., 2010; Dumas et al., 2009; Waibel et al., 1996).

Incremental situated processing brings together these requirements. In this paper, we present a collection of extensions to the incremental processing toolkit INPROTK (Baumann and Schlangen, 2012) that make it capable of processing situated communication in an incremental fashion: we have developed a general architecture for plugging in multimodal sensors, which we denote INPROTKS, including instantiations for motion capture (e.g., via Microsoft Kinect and Leap Motion) and eye tracking (Seeingmachines FaceLAB). We also describe a new module we built that makes it possible to perform (large-vocabulary, open-domain) speech recognition via the Google Web Speech API. We describe these components individually and give as use cases a driving simulation setup as well as real-time gaze and gesture recognition.

In the next section, we will give some background on incremental processing, then describe the new methods of plugging in multimodal sensors, specifically using XML-RPC, the Robotics Service Bus, and the InstantReality framework. We then explain how we incorporated the Google Web Speech API into InproTK, offer some use cases for these new modules, and conclude.

2 Background: The IU model, INPROTK

As described in (Baumann and Schlangen, 2012), INPROTK realizes the IU-model of incremental processing (Schlangen and Skantze, 2011; Schlangen and Skantze, 2009), where incremental systems consist of a network of processing modules. A typical module takes input from its left buffer, performs some kind of processing on that data, and places the processed result onto its right buffer. The data are packaged as the payload of incremental units (IUs) which are passed between modules.

The IUs themselves are also interconnected via so-called same level links (SLL) and grounded-in links (GRIN), the former allowing the linking of IUs as a growing sequence, the latter allowing that sequence to convey what IUs directly affect it (see Figure 1 for an example). A complication particular to incremental processing is that modules can “change their mind” about what the best hypothesis is, in light of later information; thus IUs can be added, revoked, or committed to a network of IUs.
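To make these links concrete, the following minimal sketch (with hypothetical names throughout, not INPROTK's actual classes) shows word IUs chained by SLLs, a tag IU grounded in a word via a GRIN, and the revoke-and-replace step from Figure 1:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; InproTK's real IU classes differ in detail.
class IU {
    final String payload;
    IU sameLevelLink;                               // SLL: predecessor in the sequence
    final List<IU> groundedIn = new ArrayList<>();  // GRIN: IUs this one is built on
    boolean revoked = false;

    IU(String payload, IU sameLevelLink) {
        this.payload = payload;
        this.sameLevelLink = sameLevelLink;
    }
}

public class IUNetworkSketch {
    public static void main(String[] args) {
        // Word IUs arrive as a growing sequence, linked by SLLs.
        IU take = new IU("take", null);
        IU four = new IU("four", take);

        // A POS-tag IU is grounded in the word it was computed from.
        IU tag = new IU("CD", null);
        tag.groundedIn.add(four);

        // Later evidence: the ASR changes its mind, revoking "four"
        // and replacing it with "forty" (cf. Figure 1).
        four.revoked = true;
        IU forty = new IU("forty", take);
        System.out.println("replaced " + four.payload + " with " + forty.payload);
    }
}
```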


Figure 1: Example of an IU network; part-of-speech tags are grounded into words; tags and words have same level links with the IU to their left; four is revoked and replaced with forty.


INPROTK determines how a module network is “connected” via an XML-formatted configuration file, which states module instantiations, including the connections between left buffers and right buffers of the various modules. Also part of the toolkit is a selection of “incremental processing-ready” modules, which makes it possible to realise responsive speech-based systems.
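As a schematic illustration only (the element and attribute names here are hypothetical, not INPROTKS's actual schema), such a configuration might look like:

```xml
<!-- Hypothetical sketch of a module-network configuration;
     the actual InproTKS schema differs. -->
<modulenetwork>
  <module id="asr" class="inpro.sketch.SphinxASRModule"/>
  <module id="nlu" class="inpro.sketch.NLUModule"/>
  <!-- connect asr's right buffer to nlu's left buffer -->
  <connection from="asr" to="nlu"/>
</modulenetwork>
```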

3 InproTK and new I/O: InproTKS

The new additions introduced here are realised as INPROTKS modules. The new modules that input information into an INPROTKS module network are called listeners, in that they “listen” to their respective message-passing systems, and modules that output information from the network are called informers. Listeners are specific to their method of receiving information, explained in each section below. Data received by listeners are packaged into an IU and put onto the module's right buffer. Listener module left buffers are not used in the standard way; left buffers receive data from their respective message-passing protocols. An informer takes all IUs from its left buffer and sends their payload via that module's specific output method, which serves as a kind of right buffer. Figure 2 gives an example of how such listeners and informers can be used. At the moment, only strings can be read by listeners and sent by informers; future extensions could allow for more complicated data types.
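The following self-contained sketch illustrates this listener pattern; StringIU, RightBuffer, and SensorListenerModule are illustrative stand-ins, not INPROTKS class names. Inverting the same shape, reading IUs from the left buffer and sending their payloads out through a transport, yields an informer.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal stand-ins for IUs and a right buffer (hypothetical names).
class StringIU {
    final String payload;
    final StringIU predecessor; // same level link

    StringIU(String payload, StringIU predecessor) {
        this.payload = payload;
        this.predecessor = predecessor;
    }
}

interface RightBuffer {
    void add(StringIU iu);
}

// A listener module: its "left buffer" is fed by a message-passing
// callback rather than by another module's right buffer.
class SensorListenerModule {
    private final RightBuffer rightBuffer;
    private StringIU last = null;

    SensorListenerModule(RightBuffer rightBuffer) {
        this.rightBuffer = rightBuffer;
    }

    // Called by the transport (XML-RPC, RSB, InstantIO) when data arrives.
    public synchronized void onMessage(String data) {
        StringIU iu = new StringIU(data, last); // link to predecessor via SLL
        last = iu;
        rightBuffer.add(iu);                    // pass on to connected modules
    }
}

public class ListenerDemo {
    public static void main(String[] args) {
        List<StringIU> sink = new ArrayList<>();
        SensorListenerModule m = new SensorListenerModule(sink::add);
        m.onMessage("gesture:pointing");
        m.onMessage("gaze:fixation");
        System.out.println(sink.size() + " IUs on right buffer");
    }
}
```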

Listener modules add new IUs to the network; correspondingly, further modules that can make use of these information types have to be designed in instantiated systems. The IUs created by the listeners are linked to each other via SLLs.

As with audio inputs in previous versions of INPROTK, these IUs are considered base data and are not explicitly linked via GRINs to the sensor data. The modules defined so far also simply add IUs and do not revoke.

We will now explain the three new methods of getting data into and out of INPROTKS.

3.1 XML-RPC

XML-RPC is a remote procedure call protocol which uses XML to encode its calls and HTTP as a transport mechanism. This requires a server/client relationship, where the listener is implemented as the server on a specified port.1 Remote sensors (e.g., an eye tracker) are realised as clients and can send data (encoded as a string) to the server using a specific procedure call. The informer is also realised as an XML-RPC client, which sends data to a defined server. XML-RPC was introduced in 1998 and is widely implemented in many programming languages.
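For example, a sensor client could be written against the Apache XML-RPC library roughly as follows; the host, port, and the method name listener.receiveData are assumptions that would have to match the listener's actual configuration:

```java
import java.net.URL;

import org.apache.xmlrpc.client.XmlRpcClient;
import org.apache.xmlrpc.client.XmlRpcClientConfigImpl;

public class SensorXmlRpcClient {
    public static void main(String[] args) throws Exception {
        // Point the client at the InproTKS listener (hypothetical host/port).
        XmlRpcClientConfigImpl config = new XmlRpcClientConfigImpl();
        config.setServerURL(new URL("http://localhost:9050/"));
        XmlRpcClient client = new XmlRpcClient();
        client.setConfig(config);

        // Send one string-encoded sensor reading via a remote procedure call;
        // the method name is an assumption for illustration.
        client.execute("listener.receiveData", new Object[] { "gaze:fixation" });
    }
}
```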

[Figure 2 diagram: external components (Mic, Motion Sensor, Gesture Classifier, Logger) around an InproTKS module network (ASR, Listener, NLU, DM, NLG, Speaker, Informer).]

Figure 2: Example architecture using the new modules: motion is captured and processed externally and class labels are sent to a listener, which adds them to the IU network. Arrows denote connections from right buffers to left buffers. Information from the DM is sent via an informer to an external logger. External gray modules denote input, white modules denote output.

3.2 Robotics Service Bus

The Robotics Service Bus (RSB) is a middleware environment originally designed for message-passing in robotics systems (Wienke and Wrede, 2011).2 As opposed to XML-RPC, which requires point-to-point connections, RSB serves as a bus across specified transport mechanisms.

1 The specification can be found at http://xmlrpc.scripting.com/spec.html

2 https://code.cor-lab.de/projects/rsb


Simply put, a network of communication nodes can either inform by sending events (with a payload) or listen, i.e., receive events. Informers send information on a specific scope, which establishes visibility for listeners: e.g., a listener on scope /one/ will receive all events that fall under the /one/ scope, whereas a listener on a scope with additional components, e.g., /one/two/, will not receive events from sibling scopes such as /one/three/; a listener on /one/ receives events on all three of these scopes. A listener module is realised in INPROTKS by setting the desired scope in the configuration file, giving INPROTKS seamless interconnectivity with communication on RSB.

There is no theoretical limit to the number of informers or listeners; events from a single informer can be received by multiple listeners. Events are typed, and new types can be added to the available set. RSB is under active development and is becoming more widely used; the Java, Python, and C++ programming languages are currently supported. In our experience, RSB makes it particularly convenient to set up distributed sensor-processing networks.
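Following RSB's published Java examples, a listener on a given scope can be sketched as follows (the scope /one/two mirrors the example above; treat the details as a sketch rather than a definitive recipe):

```java
import rsb.AbstractEventHandler;
import rsb.Event;
import rsb.Factory;
import rsb.Listener;
import rsb.Scope;

public class ScopeListenerSketch {
    public static void main(String[] args) throws Throwable {
        // Listen on /one/two: receives events on /one/two and below,
        // but not on sibling scopes such as /one/three.
        Factory factory = Factory.getInstance();
        Listener listener = factory.createListener(new Scope("/one/two"));
        listener.activate();
        try {
            listener.addHandler(new AbstractEventHandler() {
                @Override
                public void handleEvent(Event event) {
                    System.out.println("payload: " + event.getData());
                }
            }, true);
            Thread.sleep(10000); // keep the process alive to receive events
        } finally {
            listener.deactivate();
        }
    }
}
```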

3.3 InstantReality

In (Kousidis et al., 2013), the InstantReality framework, a virtual reality environment, was used for monitoring and recording data in a real-time multimodal interaction.3 Each information source (sensor) runs on its own dedicated workstation and transmits the sensor data across a network using the InstantIO interface. The data can be received by different components, such as InstantPlayer (a 3D visualization engine, invaluable for monitoring data integrity when recording experimental sessions) or a logger that saves all data to disk. Network communication is achieved via multicast, which makes it possible to have any number of listeners for a server and vice versa.

The InstantIO API is currently available in C++ and Java. It comes with a non-extensible set of types (primitives, 2D and 3D vectors, rotations, images, sounds), which is however adequate for most tracking applications. InstantIO listeners and informers are easily configured in the INPROTKS configuration file.

3 http://www.instantreality.org/

3.4 Venice: Bridging the Interfaces

To make these different components/interfaces compatible with each other, we have developed a collection of bridging tools named Venice. Venice serves two distinct functions. First, Venice.HUB pushes data to/from any of the following interfaces: disk (logger/replayer), InstantIO, and RSB. This allows seamless setup of networks for logging, playback, real-time processing, or combinations thereof (e.g., for simulations), minimizing the need for adaptations to handle different situations. Second, Venice.IPC provides interprocess communication and mainly serves as a quick and efficient way to create network components for new types of sensors, regardless of platform or language. Venice.IPC acts as a server to which TCP clients (a common interface for sensors) can connect. It is highly configurable, readily accepting various sensor data outputs, and sends data in real time to the InstantIO network.
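Because Venice.IPC speaks plain TCP, a new sensor can be connected with only the standard library; in the sketch below the host, port, and line format are assumptions that would have to match the Venice.IPC configuration:

```java
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class VeniceIpcClientSketch {
    public static void main(String[] args) throws Exception {
        // Connect to a running Venice.IPC server (hypothetical host/port).
        try (Socket socket = new Socket("localhost", 4040);
             PrintWriter out = new PrintWriter(
                     new OutputStreamWriter(socket.getOutputStream(),
                             StandardCharsets.UTF_8), true)) {
            // Stream one line per sensor reading; the format is illustrative.
            out.println("head_rotation 0.12 -0.03 0.98");
            out.println("head_rotation 0.13 -0.02 0.97");
        }
    }
}
```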

Both Venice components operate on all three major platforms (Linux, Windows, Mac OS X), allowing great flexibility in the software and sensors that can be plugged into the architecture, regardless of the vendor's native API programming language or supported platform. We discuss some use cases in Section 5.

4 Google Web Speech

One barrier to dialogue system development is handling ASR. Open-source toolkits are available, each supporting a handful of languages, with each language having a varying vocabulary size. A step towards overcoming this barrier is “outsourcing” the problem by making use of the Google Web Speech API.4 This interface supports many languages, in most cases with a large, open-domain vocabulary. We have been able to access the API directly using INPROTKS, similar to (Henderson, 2014).5

INPROTKS already supports an incremental variant of Sphinx4; a system designer can now choose from these two alternatives.

At the moment, only the Google Chrome browser implements the Web Speech API. When the INPROTKS Web Speech module is invoked, it creates a service which can be reached from the Chrome browser via a URL (and hence microphone client, dialogue processor, and speech recogniser can run on different machines). Navigating to that URL shows a simple web page where one can control the microphone. Figure 3 shows how the components fit together.

4 The Web Speech API Specification: https://dvcs.w3.org/hg/speech-api/raw-file/tip/speechapi.html

5 Indeed, we used Matthew Henderson's webdialog project as a basis: https://bitbucket.org/matthen/webdialog



While this setup improves recognition compared to the Sphinx4-based recognition previously available in INPROTK, there are some areas of concern. First, there is a delay caused by the remote processing (on Google's servers), requiring alignment with data from other sensors. Second, the returned transcription results are only ‘semi-incremental’; sometimes chunks of words are treated as single increments. Third, n-best lists can only be obtained when the API detects the end of the utterance (incrementally, only the top hypothesis is returned). Fourth, the results have a crude timestamp which signifies the end of the audio segment. We use this timestamp in our construction of word IUs, which in informal tests we have found to be acceptable for our needs; we defer more systematic testing to future work.
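To illustrate the last point: since only the end of each audio segment is timestamped, one simple strategy (a hypothetical sketch, not necessarily what INPROTKS does internally) is to split a multi-word increment into word IUs that all inherit the segment-final timestamp:

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkToWordIUs {
    // Hypothetical word IU carrying the (crude) segment-end timestamp.
    record WordIU(String word, long segmentEndMillis) {}

    // Split a semi-incremental chunk ("take the forty") into word IUs;
    // each word inherits the single end-of-segment timestamp.
    static List<WordIU> split(String chunk, long segmentEndMillis) {
        List<WordIU> ius = new ArrayList<>();
        for (String word : chunk.trim().split("\\s+")) {
            ius.add(new WordIU(word, segmentEndMillis));
        }
        return ius;
    }

    public static void main(String[] args) {
        System.out.println(split("take the forty", 1234L));
    }
}
```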

Figure 3: Data flow of the Google Web Speech API: the Chrome browser controls the microphone, sends audio to the API and receives incremental hypotheses, which are directly sent to InproTKS.

5 INPROTKS in Use

We exemplify the utility of INPROTKS in two experiments recently performed in our lab.

In-car situated communication We have tested a “pause and resume” strategy for adaptive information presentation in a driving simulation scenario (see Figure 4), using INPROTKS and OpenDS (Math et al., 2013). Our dialogue manager – implemented using OpenDial (Lison, 2012) – receives trigger events from OpenDS in order to update its state, while it verbalises calendar events and presents them via speech. This is achieved by means of InstantIO servers we integrated into OpenDS and respective listeners in INPROTKS. In turn, InstantIO informers send data that is logged by Venice.HUB. The results of this study are published in (Kousidis et al., 2014). Having the modules described here available made it surprisingly straightforward to implement the interaction with the driving simulator (treated as a kind of sensor).

Figure 4: Participant performing a driving test while listening to iNLG speech delivered by InProTKS.


Real-time gaze fixation and pointing gesture detection Using the tools described here, we have recently tested a real-time situated communication environment that uses speech, gaze, and gesture simultaneously. Data from a Microsoft Kinect and a Seeingmachines FaceLAB eye tracker are logged in real time to the InstantIO network. A Venice.HUB component receives this data and sends it over RSB to external components that perform detection of gaze fixations and pointing gestures, as described in (Kousidis et al., 2013). The resulting class labels are sent in turn over RSB to INPROTKS listeners; a language understanding module aggregates these modalities with the ASR output. Again, this was made possible by the framework described here.

6 Conclusion

We have developed methods of providing multimodal information to the incremental dialogue middleware INPROTK. We have tested these methods in real-time interaction and have found them to work well, simplifying the process of connecting the external sensors necessary for multimodal, situated dialogue. We have further extended its options for ASR by connecting the Google Web Speech API. We have also discussed Venice, a tool for bridging RSB and InstantIO interfaces, which can log real-time data in a time-aligned manner and replay that data. We also offered some use cases for our extensions.

INPROTKS is freely available and accessible.6

6 https://bitbucket.org/inpro/inprotk


Acknowledgements Thank you to the anonymous reviewers for their useful comments and to Oliver Eickmeyer for helping with InstantReality.

References

Pradeep K. Atrey, M. Anwar Hossain, Abdulmotaleb El Saddik, and Mohan S. Kankanhalli. 2010. Multimodal fusion for multimedia analysis: a survey, volume 16, April.

Timo Baumann and David Schlangen. 2012. The InproTK 2012 Release. In NAACL.

Bruno Dumas, Denis Lalanne, and Sharon Oviatt. 2009. Multimodal Interfaces: A Survey of Principles, Models and Frameworks. In Human Machine Interaction, pages 1–25.

Matthew Henderson. 2014. The webdialog Framework for Spoken Dialog in the Browser. Technical report, Cambridge Engineering Department.

Spyros Kousidis, Casey Kennington, and David Schlangen. 2013. Investigating speaker gaze and pointing behaviour in human-computer interaction with the mint.tools collection. In SIGdial 2013.

Spyros Kousidis, Casey Kennington, Timo Baumann, Hendrik Buschmeier, Stefan Kopp, and David Schlangen. 2014. Situationally Aware In-Car Information Presentation Using Incremental Speech Generation: Safer, and More Effective. In Workshop on Dialog in Motion, EACL 2014.

Pierre Lison. 2012. Probabilistic Dialogue Models with Prior Domain Knowledge. In Proceedings of the 13th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 179–188, Seoul, South Korea, July. Association for Computational Linguistics.

Rafael Math, Angela Mahr, Mohammad M. Moniri, and Christian Müller. 2013. OpenDS: A new open-source driving simulator for research. GMM-Fachbericht-AmE 2013.

David Schlangen and Gabriel Skantze. 2009. A General, Abstract Model of Incremental Dialogue Processing. In Proceedings of the 10th EACL, pages 710–718, Athens, Greece, April. Association for Computational Linguistics.

David Schlangen and Gabriel Skantze. 2011. A General, Abstract Model of Incremental Dialogue Processing. Dialogue & Discourse, 2(1):83–111.

Alex Waibel, Minh Tue Vo, Paul Duchnowski, and Stefan Manke. 1996. Multimodal interfaces. Artificial Intelligence Review, 10(3-4):299–319.

Johannes Wienke and Sebastian Wrede. 2011. A middleware for collaborative research in experimental robotics. In System Integration (SII), 2011 IEEE/SICE International Symposium on, pages 1183–1190.
