35
Proposal of a Hierarchical Proposal of a Hierarchical Architecture for Multimodal Architecture for Multimodal Interactive S ystems Masahiro Araki* 1 Tsuneo Nitta* 2 Kouichi Katsurada* 2 Takuya Nishimoto* 3 Tetsuo Amakasu* 4 Shinnichi Kawamoto* 5 * 1 Kyoto Institute of Technology * 1 Kyoto Institute of Technology * 2 Toyohashi University of Technology * 3 The University of Tokyo * 4 NTT Cyber Space Labs. * 5 ATR 2007/11/16 1 W3C MMI ws

Architecture for Multimodal“sushi dai . Sushi dai 3800 good taste okame 3650 good service 2007/11/16 W3C MMI ws 27 iwasa 3500 shelfish. 4th :Innertaskcontrol • Image apiece of

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Architecture for Multimodal“sushi dai . Sushi dai 3800 good taste okame 3650 good service 2007/11/16 W3C MMI ws 27 iwasa 3500 shelfish. 4th :Innertaskcontrol • Image apiece of

Proposal of a HierarchicalProposal of a Hierarchical Architecture for MultimodalArchitecture for Multimodal 

Interactive SystemsyMasahiro Araki*1 Tsuneo Nitta*2 Kouichi Katsurada*2

Takuya Nishimoto*3 Tetsuo Amakasu*4

Shinnichi Kawamoto*5

*1Kyoto Institute of Technology*1Kyoto Institute of Technology   

*2Toyohashi University of Technology

*3The University of Tokyo  *4NTT Cyber Space Labs. *5ATR2007/11/16 1W3C MMI ws

Page 2: Architecture for Multimodal“sushi dai . Sushi dai 3800 good taste okame 3650 good service 2007/11/16 W3C MMI ws 27 iwasa 3500 shelfish. 4th :Innertaskcontrol • Image apiece of

OutlineOutline• Background– Introduction of speech IF committee under ITSCJ– Introduction to Galatea toolkit

• Problems of W3C MMI Architecture– Modality Component is too largey p g– Fragile Modality fusion and fission functionality– How to deal with user model?

• Our Proposal– Hierarchical MMI architectureHierarchical MMI architecture– “Convention over Configuration” in various layers

2007/11/16 W3C MMI ws 2

Page 3: Architecture for Multimodal“sushi dai . Sushi dai 3800 good taste okame 3650 good service 2007/11/16 W3C MMI ws 27 iwasa 3500 shelfish. 4th :Innertaskcontrol • Image apiece of

Background(1)Background(1)• What is ITSCJ?– Information Technology Standards Commission of Japanp• under IPSJ (Information Processing Society of Japan)

• Speech Interface Committee under ITSCJ• Speech Interface Committee under ITSCJ –Mission• Publish TS (Trial Standard) document concerning multimodal dialogue systems

2007/11/16 W3C MMI ws 3

Page 4: Architecture for Multimodal“sushi dai . Sushi dai 3800 good taste okame 3650 good service 2007/11/16 W3C MMI ws 27 iwasa 3500 shelfish. 4th :Innertaskcontrol • Image apiece of

Background(2)Background(2)• Theme of the committee

h f– Architecture of MMI system

– Requirements of each component

• Future directionsGuideline for implementing practical MMI system– Guideline for implementing practical MMI system

– specify markup language

2007/11/16 W3C MMI ws 4

Page 5: Architecture for Multimodal“sushi dai . Sushi dai 3800 good taste okame 3650 good service 2007/11/16 W3C MMI ws 27 iwasa 3500 shelfish. 4th :Innertaskcontrol • Image apiece of

Our Aim1. Propose an MMI architecture which can be 

used for advanced MMI researchused for advanced MMI researchW3C: From the practical point of view (mobile, accessibility)

2. Examine the validity of the architecture through y gsystem implementation

Galatea Toolkit

3. Develop a framework and release it as a open source

towards de facto standard

2007/11/16 W3C MMI ws 5

Page 6: Architecture for Multimodal“sushi dai . Sushi dai 3800 good taste okame 3650 good service 2007/11/16 W3C MMI ws 27 iwasa 3500 shelfish. 4th :Innertaskcontrol • Image apiece of

Galatea Toolkit(1)Galatea Toolkit(1)

• Platform for• Platform for developing MMI systemssystems

• Speechrecognition

• SpeechSynthesis

• Face Image gSynthesis

2007/11/16 W3C MMI ws 6

Page 7: Architecture for Multimodal“sushi dai . Sushi dai 3800 good taste okame 3650 good service 2007/11/16 W3C MMI ws 27 iwasa 3500 shelfish. 4th :Innertaskcontrol • Image apiece of

Galatea Toolkit(2)Galatea Toolkit(2)

ASRJulian

DialogueManager

Galatea DMTTSGalatea talk

FaceFSM

2007/11/16 W3C MMI ws 7

Page 8: Architecture for Multimodal“sushi dai . Sushi dai 3800 good taste okame 3650 good service 2007/11/16 W3C MMI ws 27 iwasa 3500 shelfish. 4th :Innertaskcontrol • Image apiece of

Galatea Toolkit(3)Galatea Toolkit(3)

Di lPhoenix Dialogue Manager

Macro Control Layer(AM‐MCL) Agent( )

Direct Control Layer (AM‐DCL)

AgentManager

ASR TTS FaceASRJulian

TTSGalatea talk

FaceFSM

2007/11/16 W3C MMI ws 8

Page 9: Architecture for Multimodal“sushi dai . Sushi dai 3800 good taste okame 3650 good service 2007/11/16 W3C MMI ws 27 iwasa 3500 shelfish. 4th :Innertaskcontrol • Image apiece of

Problems of W3C MMI(1)Problems of W3C MMI(1)• The “size” of Modality Component does not 

it f lif lik t t lsuit for life‐like agent controlDeliveryContext

Interaction Data

Runtime Framework

ContextComponent

manager Component

Modality Component API Modality Component API

Speech Modality Face Image Modality

ASR TTS FSM

2007/11/16 W3C MMI ws 9

Page 10: Architecture for Multimodal“sushi dai . Sushi dai 3800 good taste okame 3650 good service 2007/11/16 W3C MMI ws 27 iwasa 3500 shelfish. 4th :Innertaskcontrol • Image apiece of

Problems of W3C MMI(1)Problems of W3C MMI(1)• Lip synchronization with speech output

lDeliveryContext

Component

Interactionmanager

DataComponent

Runtime Framework

o [65] h[60] [ ]

set lip 1

3

h d l d l

set Text=“ohayou”

a[65] ... moving sequence

12

Speech Modality Face Image Modality

ASR TTS FSM

4

ASR TTS

start

2007/11/16 W3C MMI ws 10

Page 11: Architecture for Multimodal“sushi dai . Sushi dai 3800 good taste okame 3650 good service 2007/11/16 W3C MMI ws 27 iwasa 3500 shelfish. 4th :Innertaskcontrol • Image apiece of

Problems of W3C MMI(1)Problems of W3C MMI(1)• Back channeling mechanism

lDeliveryContext

Component

Interactionmanager

DataComponent

Runtime Framework

12

h d l d l

nodshort pause set Text=“hai”start

2

Speech Modality Face Image Modality

ASR TTS FSMASR TTS

2007/11/16 W3C MMI ws 11

Page 12: Architecture for Multimodal“sushi dai . Sushi dai 3800 good taste okame 3650 good service 2007/11/16 W3C MMI ws 27 iwasa 3500 shelfish. 4th :Innertaskcontrol • Image apiece of

Problems of W3C MMI(2)Problems of W3C MMI(2)• Fragile Modality fusion and fission functionality

DeliveryContext

C t

Interactionmanager

DataComponent

Runtime Framework

Componentg p

How to define“from here to there” point (120,139)

point (200,300)

How to define multimodal grammar?

Speech Modality Tactile Modality

h

g

Is simple 

ASRtouchsensor

unification enough?

2007/11/16 W3C MMI ws 12

Page 13: Architecture for Multimodal“sushi dai . Sushi dai 3800 good taste okame 3650 good service 2007/11/16 W3C MMI ws 27 iwasa 3500 shelfish. 4th :Innertaskcontrol • Image apiece of

Problems of W3C MMI(2)Problems of W3C MMI(2)• Fragile Modality fusion and fission functionality

DeliveryContext

C t

Interactionmanager

DataComponent

Runtime Framework

Componentg p

“this is route map” SVG Contentsplanning is 

Speech Modality Graphic Modality

p gsuitable for 

adapting various 

TTSSVG

Viewerdevices.

2007/11/16 W3C MMI ws 13

Page 14: Architecture for Multimodal“sushi dai . Sushi dai 3800 good taste okame 3650 good service 2007/11/16 W3C MMI ws 27 iwasa 3500 shelfish. 4th :Innertaskcontrol • Image apiece of

Problems of W3C MMI(3)Problems of W3C MMI(3)• How to deal with user model?

DeliveryContext

Interactionmanager

DataComponent

Runtime Framework

Componentmanager Component

Where is the user model information 

t d?Speech Modality Face Image Modality

stored?

ASR TTS FSMfails many times

2007/11/16 W3C MMI ws 14

times

Page 15: Architecture for Multimodal“sushi dai . Sushi dai 3800 good taste okame 3650 good service 2007/11/16 W3C MMI ws 27 iwasa 3500 shelfish. 4th :Innertaskcontrol • Image apiece of

SolutionSolution• Back to multimodal framework– more smaller modality component

• Separate state transition description• Separate state transition description– task flow– interaction flow– modality fusion/fission

hierar hi al ar hite t rehierarchical architecture

2007/11/16 W3C MMI ws 15

Page 16: Architecture for Multimodal“sushi dai . Sushi dai 3800 good taste okame 3650 good service 2007/11/16 W3C MMI ws 27 iwasa 3500 shelfish. 4th :Innertaskcontrol • Image apiece of

Investigation procedureg pPhase 1

use case analysis

requirement for overall systems

Working draft for MMI architecture

2007/11/16 W3C MMI ws 16

Page 17: Architecture for Multimodal“sushi dai . Sushi dai 3800 good taste okame 3650 good service 2007/11/16 W3C MMI ws 27 iwasa 3500 shelfish. 4th :Innertaskcontrol • Image apiece of

Use case analysisyName input modality output modality

li di l ha

on‐line shopping

mouse, speechdisplay, speech animated agent

b voice search mouse speech display speechb voice search mouse, speech display, speech

c site search mouse, speech, key display, speech

i t ti ithd

interaction withrobot

speech, image, sensor speech, display

negotiatione

negotiationwith interactive agent

speech speech, face imageagent

f kiosk terminal touch, speech speech, display

2007/11/16 W3C MMI ws 17

Page 18: Architecture for Multimodal“sushi dai . Sushi dai 3800 good taste okame 3650 good service 2007/11/16 W3C MMI ws 27 iwasa 3500 shelfish. 4th :Innertaskcontrol • Image apiece of

Example of use caseInteraction with robot

Nishijin Kasuri is

Interaction with robot

Nishijin Kasuri is a traditional texture in Kyoto.

What isKasuri?

2007/11/16 18W3C MMI ws

Page 19: Architecture for Multimodal“sushi dai . Sushi dai 3800 good taste okame 3650 good service 2007/11/16 W3C MMI ws 27 iwasa 3500 shelfish. 4th :Innertaskcontrol • Image apiece of

Requirementsq1. general2 input modality in common2. input modality3. output modality4 hit t i t ti d h i ti

in common with W3C

4. architecture, integration and synchronization point

5 runtimes and deployments5. runtimes and deployments6. dialogue management7 h dli f f d fi ld extension7. handling of forms and fields8. connection with outside application

extension

9. user model and environment information10.from the viewpoint of developer

2007/11/16 W3C MMI ws 19

Page 20: Architecture for Multimodal“sushi dai . Sushi dai 3800 good taste okame 3650 good service 2007/11/16 W3C MMI ws 27 iwasa 3500 shelfish. 4th :Innertaskcontrol • Image apiece of

data model application logiclayer 6:application

user model /device model

controllayer 5:task control

set/get event/  control set/get

task control

layer 4

event / result command

controllayer 4interactioncontrol

integrated result / event event commandcommand

control / understanding controllayer 3:modality integration

interpreted result/event

command

control/interpret

control/interpret

control controllayer 2:modalitycomponent

interpreted result/ event

event commandcommand

ASR / t hTTS /  graphical

p pcomponent

layer 1:

results・event event commandcommand

ASR pen / touch/

audio outputg poutput

yI/O device

Page 21: Architecture for Multimodal“sushi dai . Sushi dai 3800 good taste okame 3650 good service 2007/11/16 W3C MMI ws 27 iwasa 3500 shelfish. 4th :Innertaskcontrol • Image apiece of

Investigation procedureInvestigation procedurePhase 2

Detailed analysis of use case

Requirements for each layer

Publish trial standard

release reference implementation

2007/11/16 W3C MMI ws 21

Page 22: Architecture for Multimodal“sushi dai . Sushi dai 3800 good taste okame 3650 good service 2007/11/16 W3C MMI ws 27 iwasa 3500 shelfish. 4th :Innertaskcontrol • Image apiece of

Detailed use case analysisy

2007/11/16 W3C MMI ws 22

Page 23: Architecture for Multimodal“sushi dai . Sushi dai 3800 good taste okame 3650 good service 2007/11/16 W3C MMI ws 27 iwasa 3500 shelfish. 4th :Innertaskcontrol • Image apiece of

Requirements of each layerq y• Clarify Input/Output with adjacent layers

• Define events

• Clarify inner layer processing• Clarify inner layer processing

• Investigate markup language

2007/11/16 W3C MMI ws 23

Page 24: Architecture for Multimodal“sushi dai . Sushi dai 3800 good taste okame 3650 good service 2007/11/16 W3C MMI ws 27 iwasa 3500 shelfish. 4th :Innertaskcontrol • Image apiece of

1st layer:Input/Output module1 layer:Input/Output module• Function

/– Uni‐modal recognition/synthesis module

• Input module– Input:(from outside) signal

(from 2nd layer) information used for recognition – Output:(to 2nd ) recognition result– Example:ASR, touch input, face detection, ...

• Output module– Input:(from 2nd ) output contents – Output:(to outside) signal– Example:TTS, Face image synthesizer, Web browser, ...

2007/11/16 24W3C MMI ws

Page 25: Architecture for Multimodal“sushi dai . Sushi dai 3800 good taste okame 3650 good service 2007/11/16 W3C MMI ws 27 iwasa 3500 shelfish. 4th :Innertaskcontrol • Image apiece of

2nd : Modality componenty p• Function

lapper that absorbs the difference of 1st la er– lapper that absorbs the difference of 1st layerex)Speech Recognition componentgrammar:SRGS   semantic analysis : SISRresult: EMMA

– provide multimodal synchronization

ex) TTS with lip synchronization

LS‐TTS2nd:Modalitycomponent

TTS1st:Input/Output FSM

2007/11/16 25W3C MMI ws

p pmodule

Page 26: Architecture for Multimodal“sushi dai . Sushi dai 3800 good taste okame 3650 good service 2007/11/16 W3C MMI ws 27 iwasa 3500 shelfish. 4th :Innertaskcontrol • Image apiece of

3rd :Modality Fusion• Integration of input information – Interpretation of sequential / simultaneous input– Interpretation of sequential / simultaneous input

– Output the integrated result as EMMA format

Modality Fusion3rd:Modality fusion

EMMA

Modality Fusion<emma:interpretation id="int1“

emma:medium="acoustic“<emma:sequence id="seq1">

<emma:interpretation id="int2“emma:medium="tactile“

emma:mode="speech"><action> move </action><object> this </object>

d ti ti h /d ti ti

emma:medium tactileemma:mode="ink"><x>0.253</x> <y>0.124</y>

</emma:interpretation><emma:interpretation id="int3“

Speech 2nd: Touch IMC

<destination> here </destination></emma:interpretation>

<emma:interpretation id int3emma:medium="tactile“emma:mode="ink"><x>0.866</x> <y>0.724</y>

</emma:interpretation>

2007/11/16 26W3C MMI ws

IMCModalitycomponent

Touch IMC </emma:interpretation></emma:sequence>

Page 27: Architecture for Multimodal“sushi dai . Sushi dai 3800 good taste okame 3650 good service 2007/11/16 W3C MMI ws 27 iwasa 3500 shelfish. 4th :Innertaskcontrol • Image apiece of

3rd :Modality Fission

• Rendering output information– Synchronization of sequential/simultaneous output

– Coordination of output modality based on theCoordination of output modality based on the access device

Modality Fission

Modality Fission

I recommend“ hi d i”

Name Price Feature

Speech OMC

Graphical OMC

“sushi dai”. Sushidai

3800 good taste

okame 3650 good service

2007/11/16 27W3C MMI ws

iwasa 3500 shelfish

Page 28: Architecture for Multimodal“sushi dai . Sushi dai 3800 good taste okame 3650 good service 2007/11/16 W3C MMI ws 27 iwasa 3500 shelfish. 4th :Innertaskcontrol • Image apiece of

4th :Inner task control• Image

a piece of dialog e at client side

4 :Inner task control

– a piece of dialogue at client side

S: Please input member IDU: 2024

S:Please select food.U: Meat

2007/11/16 28W3C MMI ws

S: Is it OK?U: Yes.

Page 29: Architecture for Multimodal“sushi dai . Sushi dai 3800 good taste okame 3650 good service 2007/11/16 W3C MMI ws 27 iwasa 3500 shelfish. 4th :Innertaskcontrol • Image apiece of

4th :Inner task control• Required functions

4 :Inner task control

– Error handlingex) check  departure time < arrival time) p

– Default subdialogueex) confirmation retryex) confirmation, retry, ...

– Form filling algorithm) F I t t ti Al ithex) Form  Interpretation Algorithm

– Slot update informationex)  process of negative response to confirmation request (“NO, from Kyoto.”)

2007/11/16 29W3C MMI ws

Page 30: Architecture for Multimodal“sushi dai . Sushi dai 3800 good taste okame 3650 good service 2007/11/16 W3C MMI ws 27 iwasa 3500 shelfish. 4th :Innertaskcontrol • Image apiece of

4th :Inner task control

control5th control

Initialize eventstart dialogue(uri or code)

dataend event(status)

• FIA

• Input analysis (with error check)

g ( ) ( )

p y ( )

• Update data module

• Update user model

4th

device informationend event(status)

Initialize eventoutput contents

Initialize eventStart Input(with interruption) 

device informationEMMA

Modality Fusion Modality Fission3rd

EMMA

2007/11/16 30W3C MMI ws

Page 31: Architecture for Multimodal“sushi dai . Sushi dai 3800 good taste okame 3650 good service 2007/11/16 W3C MMI ws 27 iwasa 3500 shelfish. 4th :Innertaskcontrol • Image apiece of

5th :Task control5 :Task control• Image– describe overall task flowserver side controller– server side controller

• Possible markup languae– SCXML

– Controller definition in MVC modelController definition in MVC model• entry points and their processing

– Script language on Rails application framework– Script language on Rails application framework• contains application logic (6th layer)• easy to prototype and customize• easy to prototype and customize

2007/11/16 31W3C MMI ws

Page 32: Architecture for Multimodal“sushi dai . Sushi dai 3800 good taste okame 3650 good service 2007/11/16 W3C MMI ws 27 iwasa 3500 shelfish. 4th :Innertaskcontrol • Image apiece of

5th :Task control5 :Task control

data module application logic

user model/device model6th

• state transition

set/get call set/get

• state transition

• conditional branch

• event handling

• subdialogue management

5th

• subdialogue management

Initialize eventstart dialogue (uri or code)

dataend event(status)

control4th

start dialogue (uri or code) ( )

2007/11/16 32W3C MMI ws

Page 33: Architecture for Multimodal“sushi dai . Sushi dai 3800 good taste okame 3650 good service 2007/11/16 W3C MMI ws 27 iwasa 3500 shelfish. 4th :Innertaskcontrol • Image apiece of

6th :Application

• Image

pp

– Processing module outside of dialogue system• accessed from various layersaccessed from various layers

• modules– application logic

ex)DB access, Web API access• Persist, update, delete, search of data

– user model / device model• persist user’s information through sessions

• manage device information defined in ontologyg gy

2007/11/16 33W3C MMI ws

Page 34: Architecture for Multimodal“sushi dai . Sushi dai 3800 good taste okame 3650 good service 2007/11/16 W3C MMI ws 27 iwasa 3500 shelfish. 4th :Innertaskcontrol • Image apiece of

Too many markup language?y p g g• Does each level require different markup l ?language?– No.– simple functionality of 5th and 4th layer can provide data model approach (ex) Ruby on Rails)pp ( ) y )

– default function of 3rd layer can be realized simple principle (ex) unification in modality fusion)principle (ex) unification in modality fusion)

– 2nd layer functions are task/domain independent

“Convention over Configuration” 

2007/11/16 34W3C MMI ws

Page 35: Architecture for Multimodal“sushi dai . Sushi dai 3800 good taste okame 3650 good service 2007/11/16 W3C MMI ws 27 iwasa 3500 shelfish. 4th :Innertaskcontrol • Image apiece of

Summaryy• Problems of W3C MMI Architecture

d l–Modality Component

–Modality fusion and fission functionality

– User model

• Our Proposal• Our Proposal– Hierarchical MMI architecture

– “Convention over Configuration” in various layers

2007/11/16 35W3C MMI ws