Proceedings of the SIGDIAL 2017 Conference, pages 23–26, Saarbrücken, Germany, 15-17 August 2017. © 2017 Association for Computational Linguistics

A Multimodal Dialogue System for Medical Decision Support in Virtual Reality

Alexander Prange, Margarita Chikobava, Peter Poller, Michael Barz, Daniel Sonntag
German Research Center for Artificial Intelligence (DFKI)

66123 Saarbrücken, Germany
{Firstname}.{Lastname}@dfki.de

Abstract

We present a multimodal dialogue system that allows doctors to interact with a medical decision support system in virtual reality (VR). We integrate an interactive visualization of patient records and radiology image data, as well as therapy predictions. Therapy predictions are computed in real-time using a deep learning model.

1 Introduction

Modern hospitals and clinics rely on digital patient data. Simply storing and retrieving patient records is not enough; in order for computer systems to provide interactive decision support, one must represent the semantics in a machine-readable form using medical ontologies (Sonntag et al., 2009b). In this paper, we present a novel real-time decision support dialogue for the medical domain, where the physician can visualize and interact with patient data in a virtual reality environment by using natural speech and hand gestures.

Our multimodal dialogue system is an extension of previous work (Luxenburger et al., 2016), where we used an Oculus Rift with an integrated eye-tracker in a medical remote collaboration setting. First, the radiologist fills out a findings form using a mobile tablet with a stylus. The data is then transcribed in real-time using automated handwriting recognition, parsed, and represented based on medical ontologies. Then, the doctor, or any other health professional, enters virtual reality and interacts with patient records using the multimodal dialogue system. Through the temporal synchronization of visual and auditory events in VR, we support multisensory integration (Morein-Zamir et al., 2003). This way we profit from superadditivity (Oviatt, 2013) to further enhance multisensory perception.

2 Architecture

Modern hospitals and clinics are highly digitalized; in order to integrate our system seamlessly into everyday processes, we designed a highly flexible architecture, which can be connected to existing hospital systems (e.g., PACS, a picture archiving and communication system) and connects novel interaction devices such as VR glasses and head-mounted displays (HMDs). As depicted in Figure 1, all devices in this scenario are connected, either directly or through adapters, to the Proxy Server using XML-RPC, a remote procedure call protocol which uses XML to encode information that is sent via HTTP between clients and server. The Proxy Server manages and relays the cross-platform communication between the different devices. The mobile device, for instance, retrieves patient data and medical images through the Proxy Server from the hospital's PACS and RIS (radiology information system). The doctor then fills out the report, and the results are sent back. Some components, like the PACS and RIS, are not connected directly through XML-RPC to the rest of the system, but through the Patient Data Provider, which provides an abstraction layer to the other devices. This way we can ensure a flexible integration of different proprietary software solutions that are already being used in hospitals.
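For illustration only, the following minimal Python sketch shows how such a Proxy Server could relay XML-RPC calls between front-end devices and a backend adapter; the port numbers and method names (get_patient_record, store_report) are assumptions, as the paper does not specify the RPC interface.

# Sketch of an XML-RPC proxy relaying requests between front-end devices
# (tablet, VR client) and a backend adapter such as the Patient Data
# Provider. Endpoints and method names are illustrative assumptions.
from xmlrpc.server import SimpleXMLRPCServer
import xmlrpc.client

# Backend adapter wrapping the hospital's PACS/RIS (assumed URL).
patient_data_provider = xmlrpc.client.ServerProxy("http://localhost:9001")

def get_patient_record(patient_id):
    """Relay a record request from any client to the Patient Data Provider."""
    return patient_data_provider.get_patient_record(patient_id)

def store_report(patient_id, report):
    """Relay a finished findings report back towards the hospital systems."""
    return patient_data_provider.store_report(patient_id, report)

if __name__ == "__main__":
    server = SimpleXMLRPCServer(("0.0.0.0", 9000), allow_none=True)
    server.register_function(get_patient_record)
    server.register_function(store_report)
    server.serve_forever()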

2.1 Mobile Device

Even though modern hospitals and clinics are highly digitalized, there are many everyday processes that are still performed using pen and paper. Our approach in this scenario is based on the work of Sonntag et al. (2014), who use digital pens to improve reporting practices in the radiology domain. Instead of using a digital pen on normal paper, we create a fully digital version of the radiology findings form (in this case mammography), to be used on a mobile device with an integrated stylus.

Figure 1: Architecture diagram

The radiologist writes the report directly onto the tablet using the stylus, and through real-time handwriting, gesture, and sketch recognition, the entire content is transcribed, exported, and written into the hospital's database. Our approach has several advantages over the traditional form-filling process: (1) the contents are instantly transcribed and parsed into concepts of medical ontologies, (2) real-time feedback about the handwriting recognition process allows for a direct validation of input data, and (3) medical images are taken directly from the hospital's PACS, are then displayed on the screen, and can be annotated by the radiologist. We use the Samsung Note series as mobile devices, because they feature a special Wacom digitiser technology for the stylus input; we built our software on top of the MyScript (http://myscript.com/) handwriting recognition engine.
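To make the parsing step concrete, a toy Python sketch of mapping recognized form fields onto ontology concepts follows; the field names and concept identifiers are placeholders invented for illustration, not the actual form layout or ontology used in the system.

# Hypothetical sketch: mapping recognized fields of a mammography findings
# form onto ontology concepts. Field names and concept IDs are placeholders.
FIELD_TO_CONCEPT = {
    "birads_score":   "onto:BIRADSScore",    # placeholder concept names
    "lesion_side":    "onto:LesionSide",
    "breast_density": "onto:BreastDensity",
}

def parse_findings(recognized_fields: dict) -> list:
    """Turn {field_name: transcribed text} into ontology-grounded facts."""
    facts = []
    for field, text in recognized_fields.items():
        concept = FIELD_TO_CONCEPT.get(field)
        if concept is not None:
            facts.append({"concept": concept, "value": text.strip()})
    return facts

# Example output of the handwriting recognizer for one form.
print(parse_findings({"birads_score": "4", "lesion_side": "left"}))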

2.2 Virtual Reality

We created a Unity3D (http://unity3d.com/) application that resembles a real-world doctor's office. The user can move freely inside the room using positional tracking and may also look around using head tracking. To enable immersive and remote interaction with medical multimedia data, we use a projection on the wall, where the patient files, the previously annotated digital form, and the therapy predictions are shown (see Figure 2). Navigation inside the documents, like zooming or scrolling through pages, can be achieved either through natural speech interaction or by using the Oculus Touch controllers, which we render as hands inside VR.
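Purely for illustration, the following Python sketch shows how navigation intents arriving from the dialogue system could be dispatched to view actions on the virtual display; the intent names and handler functions are assumptions, and the actual client is implemented in Unity3D.

# Illustrative dispatcher for navigation intents on the virtual display.
# Intent and handler names are assumptions (the real client is Unity3D).
def zoom(factor=1.5):
    print(f"zooming display by factor {factor}")

def turn_page(offset=1):
    print(f"turning {offset} page(s)")

NAVIGATION_HANDLERS = {
    "zoom_in":   lambda: zoom(1.5),
    "zoom_out":  lambda: zoom(0.75),
    "next_page": lambda: turn_page(1),
    "prev_page": lambda: turn_page(-1),
}

def handle_navigation(intent: str):
    handler = NAVIGATION_HANDLERS.get(intent)
    if handler is not None:
        handler()

handle_navigation("next_page")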


Figure 2: Screenshot of therapy prediction results inside virtual reality

2.3 Decision Support

Our medical dialogue system supports the decision of which therapy is most suitable for a given patient. We integrated a prediction model for clinical decision support based on deep learning (Esteban et al., 2016) as a backend service running on a dedicated GPU server. They presented a recurrent neural network (RNN) that includes dynamic sequences of examinations, which was modified to take dynamic patient data as additional input. This model was trained on a set of structured data from 475 patients, containing a total of 19,438 diagnoses, 15,352 procedures, 59,202 laboratory results and 13,190 medications. All personal data, such as names, dates of birth, and patient IDs, were anonymized accordingly, and all date and time references were shifted. For our dialogue system, fast response times are of particular interest. We use TensorFlow (Abadi et al., 2016) to enable GPU-accelerated predictions on a scalable platform. Our service runs on a dedicated high-performance computer and is accessible to the dialogue system through our Proxy Server.
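A minimal sketch of how such a prediction backend could be exposed to the dialogue system as an XML-RPC service follows; the model call is a stub standing in for the RNN of Esteban et al. (2016), and the service port and method name are assumptions rather than the project's actual interface.

# Sketch of a therapy-prediction backend exposed via XML-RPC. The model is
# a stub; in the real system the trained TensorFlow RNN would run here.
from xmlrpc.server import SimpleXMLRPCServer

def load_model():
    # Placeholder for loading the trained model onto the GPU.
    def predict(static_features, event_sequence):
        # Fixed illustrative scores, not real model output.
        return {"chemotherapy": 0.62, "radiotherapy": 0.23, "surgery": 0.15}
    return predict

MODEL = load_model()

def predict_therapy(patient):
    """Return therapy scores for a patient record (dict of features and events)."""
    return MODEL(patient.get("static", {}), patient.get("events", []))

if __name__ == "__main__":
    server = SimpleXMLRPCServer(("0.0.0.0", 9002), allow_none=True)
    server.register_function(predict_therapy)
    server.serve_forever()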

3 Dialogue

In order to facilitate coordinated interactions on the patient data within the virtual reality environment, we developed a multimodal dialogue interface that allows us to operate and interact by speech and gestures. The multimodal dialogue system supports three different types of interactions: (1) interactions with the patient data shown on the virtual display (e.g., “Open the patient file for Gerda Meier.”, “Show the next page.”); (2) interactive question answering (QA) about the contents of a patient record (e.g., “When was the last examination?”); and (3) control of the therapy prediction component (e.g., “Which therapy is recommended?”).

Within the dialogue, the following speech interactions and phenomena are realized (see the sketch after this list):

• Navigation inside patient records (e.g., open/close file, scroll, zoom, turn page)
• Anaphoric reference resolution (e.g., “What is her current medication?”)
• Elliptic speech input (e.g., “... and the age?”)
• Multimodal (deictic) dialogue interactions (e.g., “Zoom in here” + [user points at a region on the display])
• Cross-modal reference resolution (e.g., “What is the second best therapy recommendation?”)
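As a rough illustration of how such utterances could be mapped to dialogue intents before fusion, consider the Python sketch below; the regular-expression patterns and intent names are simplified assumptions and not the grammar actually loaded into the speech recognizer.

# Illustrative mapping of utterance patterns to dialogue intents.
# Patterns and intent names are assumptions, much simpler than the
# grammar used by the actual system.
import re

INTENT_PATTERNS = [
    (re.compile(r"(open|show) the patient file for (?P<name>[^.?!]+)", re.I), "open_patient_file"),
    (re.compile(r"zoom in( here)?", re.I),                                    "zoom_in"),
    (re.compile(r"when was (the|her|his) last examination", re.I),            "ask_last_examination"),
    (re.compile(r"which therapy is recommended", re.I),                       "ask_therapy_recommendation"),
]

def classify(utterance: str):
    """Return (intent, slots) for the first matching pattern, else None."""
    for pattern, intent in INTENT_PATTERNS:
        match = pattern.search(utterance)
        if match:
            return intent, match.groupdict()
    return None

print(classify("Show the patient file for Gerda Meier."))
# -> ('open_patient_file', {'name': 'Gerda Meier'})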

3.1 Dialogue Implementation

The dialogue follows the rapid dialogue engineering principles of Sonntag et al. (2009a) and is implemented with SiAM-dp (Neßelrath, 2015), an open development platform for multimodal dialogue systems. All knowledge representations and dialogue structures follow a declarative specification with ontology structures. First, the already existing patient data model of the patient database was mapped onto the corresponding domain ontology for SiAM-dp's knowledge manager, which is initialized with the specific patient instances at the beginning of each dialogue session. The speech recognition grammar is loaded into Nuance's speech recognizer (http://www.nuance.com).

The dialogue model is based on finite-state machines; the mapping of user intentions to matching multimodal system reactions is defined declaratively. The determination of the user intention in SiAM-dp follows a fusion process: the outputs of SiAM-dp's modality-specific user input analysis components (speech recognition, gesture analysis) are fused in conjunction with reference resolution within the discourse manager. The realization of multimodal output (speech output, virtual display content modifications, therapy prediction invocation) is coordinated by SiAM-dp's presentation planning component. The software itself runs on the same machine that has the Oculus Rift and the Touch controllers attached. Technically speaking, SiAM-dp is operated with standard speech recognition and synthesis (Nuance, SVOX), connected to the Oculus Rift's microphone and speakers as audio input and output devices.
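A compressed Python sketch of what such a finite-state mapping from fused user intentions to multimodal system reactions could look like is given below; the states, intents, and reaction names are illustrative assumptions and not SiAM-dp's actual model.

# Sketch of a finite-state dialogue model: fused user intents trigger
# state transitions and multimodal system reactions. All names are
# illustrative assumptions (not SiAM-dp itself).
TRANSITIONS = {
    # (state, intent) -> (next_state, system_reaction)
    ("idle",         "open_patient_file"):          ("patient_open", "show_patient_and_confirm"),
    ("patient_open", "ask_last_examination"):       ("patient_open", "answer_from_patient_record"),
    ("patient_open", "ask_therapy_recommendation"): ("patient_open", "invoke_therapy_prediction"),
    ("patient_open", "zoom_in"):                    ("patient_open", "zoom_virtual_display"),
}

def step(state: str, intent: str):
    """Advance the dialogue; unknown input keeps the state and asks back."""
    return TRANSITIONS.get((state, intent), (state, "ask_for_clarification"))

state = "idle"
state, reaction = step(state, "open_patient_file")
print(state, reaction)  # patient_open show_patient_and_confirm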

An example dialogue is as follows:

U.1 “Show the patient file for Gerda Meier.”
S.1 “Here is the patient file for Gerda Meier.” [patient data is displayed on the display inside the VR room]
U.2 “What was the last examination?”
S.2 “Mrs. Meier recently received a mammography.”
U.3 “When was it?”
S.3 “The mammography was made on the 10th of March.”
U.4 “Now show me the patient file for Paula Fischer.”
S.4 “Here is the patient file for Paula Fischer.” [new patient data is displayed]
U.5 “Zoom in here.” [user points at a region on the display using the Oculus Touch controller]
S.5 [virtual display is zoomed accordingly]
U.6 “Which therapy is recommended?”
S.6 “For Paula Fischer, chemotherapy is recommended.” [bar chart with therapy prediction is displayed]

In (U.1) the user requests a patient file to be presented on the display inside the VR room. The corresponding system output (S.1) is multimodal: speech output is synchronized with the presentation of the patient file. The user then requests information about the patient data currently shown on the display (U.2), e.g., anamneses and previous therapies. This user input contains an ellipsis: the name of the patient is not mentioned. SiAM-dp's discourse manager resolves it from the dialogue context that was filled in (U.1). Further questions about specifics may be asked (U.3). From the context, the system infers that “it” refers to the mammography just mentioned (rule-based anaphora resolution).

The next utterance (U.4) shows that users may shift the topic at any point, for instance by requesting other patient data. (U.5) is an example of a multimodal input consisting of a speech input and a corresponding pointing gesture. Processing this user input is only possible if both modalities occur within a certain time frame and are correctly fused.
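A minimal sketch of such time-window-based fusion is given below; the 1.5-second threshold and the event structure are illustrative assumptions, not values taken from the paper.

# Sketch of fusing a deictic speech intent ("Zoom in here") with a pointing
# gesture: both events must fall within an assumed time window.
FUSION_WINDOW_S = 1.5  # illustrative threshold, not from the paper

def fuse(speech_event, gesture_event):
    """Return a fused intent if speech and pointing are close enough in time."""
    if abs(speech_event["t"] - gesture_event["t"]) > FUSION_WINDOW_S:
        return None  # modalities too far apart, no fusion
    return {
        "intent": speech_event["intent"],           # e.g. "zoom_in"
        "target": gesture_event["display_region"],  # region the user points at
    }

speech = {"t": 12.3, "intent": "zoom_in"}
pointing = {"t": 12.9, "display_region": (420, 310)}
print(fuse(speech, pointing))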

The main dialogue move is (U.6), as it triggers the real-time therapy prediction process on the GPU Server. The system's response in (S.6) is again multimodal, as the requested therapy is presented on the virtual display, together with synthesized speech output.

Anaphora resolution is also handled in our system. Since the patient file presented on the display is always synchronized with the current discourse model, and the dialogue context is modelled within SiAM-dp as a discourse memory (Sonntag, 2010), the system can resolve utterances like “When was her last examination?”
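The following toy Python sketch illustrates resolving such pronouns against a discourse memory of recently mentioned entities; the memory structure and the gender/kind rules are simplified assumptions, not SiAM-dp's implementation.

# Toy sketch of rule-based anaphora resolution over a discourse memory:
# pronouns are resolved to the most recently mentioned compatible entity.
discourse_memory = []  # most recent entity last

def mention(entity):
    discourse_memory.append(entity)

def resolve(pronoun: str):
    """Resolve 'her'/'his'/'it' to the latest compatible entity, if any."""
    wanted = {"her": "female", "his": "male", "it": "object"}.get(pronoun)
    for entity in reversed(discourse_memory):
        if entity["kind"] == wanted:
            return entity
    return None

mention({"kind": "female", "name": "Gerda Meier"})
mention({"kind": "object", "name": "mammography from 10 March"})
print(resolve("her"))  # -> Gerda Meier
print(resolve("it"))   # -> the mammography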

4 Conclusions and Future Work

In this paper, we presented our multimodal dialogue system implementation in virtual reality. It provides a first example of an automated decision support system that computes therapy predictions in real-time using deep learning techniques. Our multimodal dialogue system, in combination with interactive data visualization in virtual reality, is meant to provide an intuitive dialogue component for helping the doctor in his or her therapy decision. Preliminary evaluations in the clinical data intelligence project (Sonntag et al., 2016) are encouraging, and we believe that such multimodal-multisensor interfaces in VR can already be designed and implemented to effectively advance human performance in medical decision support.

Currently we are investigating how displaying complex 3D medical images (e.g., DICOM) in VR can improve the diagnostic process. We are also looking into possibilities to include additional input modalities, such as gaze information from eye-tracking, to further improve the multimodal interaction.

As an extension to the dialogue, we plan to include ambiguity resolution by asking clarification questions. If a patient name is ambiguous, the system could ask for clarification (U: “Open the patient file of Mrs. Meier.” S: “Gerda Mayer or Anna Maier?”). In addition, users should be able to change or add patient data through natural speech.

Acknowledgments

This research is part of the project “clinical data intelligence” (KDI), which is funded by the Federal Ministry for Economic Affairs and Energy (BMWi).

References

Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). USENIX Association, GA, pages 265–283.

C. Esteban, O. Staeck, S. Baier, Y. Yang, and V. Tresp. 2016. Predicting clinical events by combining static and dynamic information using recurrent neural networks. In 2016 IEEE International Conference on Healthcare Informatics (ICHI). pages 93–101.

Andreas Luxenburger, Alexander Prange, Mohammad Mehdi Moniri, and Daniel Sonntag. 2016. MedicalVR: Towards medical remote collaboration using virtual reality. In Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct. ACM, New York, NY, USA, UbiComp '16, pages 321–324.

S. Morein-Zamir, S. Soto-Faraco, and A. Kingstone. 2003. Auditory capture of vision: examining temporal ventriloquism. Brain Res Cogn Brain Res 17(1):154–163.

Robert Neßelrath. 2015. SiAM-dp: An open development platform for massively multimodal dialogue systems in cyber-physical environments. Ph.D. thesis, Universität des Saarlandes, Postfach 151141, 66041 Saarbrücken.

S. Oviatt. 2013. The Design of Future Educational Interfaces. Taylor & Francis.

Daniel Sonntag. 2010. Ontologies and Adaptivity in Dialogue for Question Answering, volume 4 of Studies on the Semantic Web. IOS Press.

Daniel Sonntag, Gerhard Sonnenberg, Robert Nesselrath, and Gerd Herzog. 2009a. Supporting a rapid dialogue engineering process. In Proceedings of the First International Workshop On Spoken Dialogue Systems Technology (IWSDS 2009), December 9-11, Kloster Irsee, Germany.

Daniel Sonntag, Volker Tresp, Sonja Zillner, Alexander Cavallaro, Matthias Hammon, Andre Reis, Peter A. Fasching, Martin Sedlmayr, Thomas Ganslandt, Hans-Ulrich Prokosch, Klemens Budde, Danilo Schmidt, Carl Hinrichs, Thomas Wittenberg, Philipp Daumke, and Patricia G. Oppelt. 2016. The clinical data intelligence project. Informatik-Spektrum 39(4):290–300.

Daniel Sonntag, Markus Weber, Alexander Cavallaro, and Matthias Hammon. 2014. Integrating digital pens in breast imaging for instant knowledge acquisition. AI Magazine 35(1):26–37.

Daniel Sonntag, Pinar Wennerberg, Paul Buitelaar, and Sonja Zillner. 2009b. Pillars of ontology treatment in the medical domain. J. Cases on Inf. Techn. 11(4):47–73.
