
IEEE DL 2009, Athens, October 17th, 2009

Renato De Mori
McGill University

University of Avignon

Spoken Language Recognition and Understanding

LUNA IST contract no 33549


Summary

• Short discussion on automatic language recognition problems
• Automatic speech recognition in a working system

• Interpretation process
– Full / shallow parsing

– Generative vs. discriminative models

– Semantic composition

– Confidence and learning

– Examples of working systems


Dialog system architecture

[Diagram] input → RECOGNITION → UNDERSTANDING and DIALOG → ANSWER GENERATION → answer, driven by acoustic, language, semantic, dialog and synthesis models.


The channel model

[Diagram] INFORMATION SOURCE → W → INFORMATION CHANNEL → A

The objective of recognition is to reconstruct W from the observation of A. Both the source and the channel contain knowledge sources (KSs). The source generates a variety of sequences W with a given probability distribution; the channel, for a given W, generates a variety of A with a given probability distribution.

Consider the following coding scheme:


Decoding as search

Search for the sequence of words W such that

W* = argmax_W P(W | A) = argmax_W P(A | W) P(W) / P(A)

where P(A | W) is computed by the acoustic model AM and P(W) by the language model LM; W is a sequence of hypothesized words and A is the acoustic evidence.
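As an illustration only, this decoding rule can be sketched over a small finite candidate list; the probabilities below are invented toy values, not the output of a real acoustic or language model.

```python
import math

def decode(candidates, acoustic_logp, lm_logp):
    """Return the candidate W maximizing log P(A|W) + log P(W).

    P(A) is constant over candidates, so it can be dropped from the argmax.
    """
    return max(candidates, key=lambda w: acoustic_logp[w] + lm_logp[w])

# Hypothetical scores: the acoustics barely prefer "show", the LM strongly does.
candidates = ("show flights", "shoe flights")
acoustic_logp = {"show flights": math.log(0.6), "shoe flights": math.log(0.5)}
lm_logp = {"show flights": math.log(0.3), "shoe flights": math.log(0.01)}

best = decode(candidates, acoustic_logp, lm_logp)
```

In a real recognizer the argmax runs over a search graph rather than an explicit list, but the scoring combination is the same.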


word hypothesis generation

[Diagram] A signal corpus is used for training acoustic models; a text corpus is used for training language and lexical models. These are compiled into an INTEGRATED NETWORK that, together with the sequence of acoustic descriptors, drives word hypothesis generation.


[Diagram] A fully connected word network: each word Wi carries a start probability P(Wi), and every word pair carries a bigram transition probability P(Wj | Wi).


Left-to-right models with skips and empty transitions

[Diagram] States Si and Sj, with self-loop probability Pii and forward transition probability Pij, where Pii + Pij = 1. bij(a) is the probability of observing a during the transition i → j (and bii(a) during the self-loop).
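A minimal sketch of the forward computation for such a model, assuming (purely for illustration) a two-state chain in which both states share the same loop probability; the emission functions and numbers in the usage below are hypothetical.

```python
def forward(obs, p_stay, p_move, b_stay, b_move):
    """Forward probabilities through a two-state left-to-right model.

    b_stay(a) plays the role of b_ii(a) and b_move(a) of b_ij(a);
    p_stay + p_move = 1 as on the slide.  Simplifying assumption:
    both states use the same self-loop probability.
    """
    alpha = [1.0, 0.0]  # start in the first state
    for a in obs:
        alpha = [
            alpha[0] * p_stay * b_stay(a),       # i -> i self-loop
            alpha[0] * p_move * b_move(a)        # i -> j forward move
            + alpha[1] * p_stay * b_stay(a),     # j -> j self-loop
        ]
    return alpha

# Hypothetical constant emissions just to trace the recursion.
alpha = forward(["a"], 0.4, 0.6, lambda a: 0.5, lambda a: 0.5)
```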


Mixture densities

[Diagram] Transition emissions gii(a) and gij(a) attached to the self-loop (Pii) and to the forward transition (Pij).

G_ij(a) = Σ_g w_ijg N(a; μ_ijg, Σ_ijg)

If the components of a are statistically independent, the multivariate density factors into a product of one-dimensional Gaussians.


Major problems

Impressive performance but not robust enough.

Some reasons:

acoustic variability

noise

disfluency

pronunciation variability

language variety

reduced feature size

modeling limits

non-admissible search


Demo

LUNAVIZ

OK 340.1

HS 267 3 267 4

DISF 298 1

COM 238 6

COM 238 4


Possible applications

Constrained environments (broadcast transcriptions)

Partial automation (telephone services)

Redundant messages (spoken surveys)

Specific content extraction (understanding and dialog)


Complexity, Functionality, Technology

[Figure] Interaction complexity (human-machine) growing from 1990 to 2009:
– Command and control (e.g., simple call routing, VRCP, voice dialing): simple ASR, isolated words, connected digits
– Prompt-constrained natural language (e.g., travel reservations, finance, directory assistance): larger vocabulary, hand-built grammars
– Free-form natural language dialogue (e.g., customer care, help desks, e-commerce): very large vocabulary, NL, DM, TTS


COMPUTER INTERPRETATION PROCESS


Interpretation problem decomposition

Speech → signs → meaning (1-best, n-best, lattices)

Acoustic features → words → constituents → structures (features for interpretation)

The problem-reduction representation is context-sensitive.

Interpretation is a composite decision process. Many decompositions are possible, involving a variety of methods and KSs, which suggests a modular approach to process design.

Robustness is obtained by evaluation and possible integration of different KSs and methods used for the same sub-task.


Understanding process overview

Speech to conceptual structures and MRL:

[Diagram] speech → signs → words and concept tags → concept structures → MRL description. A Short-Term Memory holds intermediate hypotheses; a Long-Term Memory contains the AM, the LM and the interpretation KSs; dialogue and learning processes interact with both.


Interpretation as a translation process

Interpretation of written text can be seen as a process that uses procedures for translating a sequence of words in natural language into a set of semantic hypotheses (just constituents or structures) described by a semantic language.

W:[S[VP [V give, PR me] NP [ART a, N restaurant] PP[PREP near, NP [N Montparnasse, N station]]]]

[Action REQUEST ([Thing RESTAURANT], [Path NEAR ([Place IN ([Thing MONTPARNASSE])])])]

Interesting discussion in (Jackendoff, 1990) Each major syntactic constituent of a sentence maps into a conceptual constituent, but the inverse is not true.


Parsing with ATIS stochastic semantic grammars

[Figure] Parse tree for "Please show me the flights to Boston on Monday": non-terminal nodes (show indicator, flight indicator, destination indicator, city name, date indicator, day) dominate the terminal words, yielding the concepts show, flight, destination Boston and date Monday.


Semantic building actions in full parsing

S → NP[agent] VP[action] NP[theme]; NP → det N
"the customer | accepts | the contract"

Use tree kernel methods for learning argument matching (Moschitti, Raymond, Riccardi, ASRU 2007)

Not robust enough if WER is high


Predicate/argument structures and parsers

Recently, classifiers were proposed for detecting concepts and roles.

Such detection process was integrated with a stochastic parser (e.g. Charniak 2001).

A solution using this parser and tree-kernel based classifiers for predicate argument detection in SLU is proposed in (Moschitti et al. ASRU 2007).

Other relevant contributions on stochastic semantic parsing can be found in (Goddeau and Zue, 1992; Goodman, 1996; Chelba and Jelinek, 2000; Roark, 2002; Collins, 2003).

Lattice-based parsers are reviewed in (Hall, 2005)


Shallow parsing


Probabilistic interpretation in the CHRONUS system

[Diagram] Concept states (Org., Dest., Date, else) form a Markov chain over word segments.

The joint probability P(C, W) is computed using Markov models as

P(C, W) = P(W | C) P(C)

(Pieraccini et al., 1991; Pieraccini, Levin and Vidal, 1993).
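The decomposition P(C, W) = P(W | C) P(C) treats concepts as hidden states emitting words, so the best concept sequence can be found by Viterbi decoding. A toy sketch follows; all probabilities are hypothetical, not parameters of the actual CHRONUS models.

```python
def viterbi_tag(words, concepts, p_init, p_trans, p_emit):
    """Most likely concept sequence C for words W under P(W | C) P(C)."""
    V = [{c: p_init[c] * p_emit[c].get(words[0], 1e-9) for c in concepts}]
    back = []
    for w in words[1:]:
        scores, ptrs = {}, {}
        for c in concepts:
            prev = max(concepts, key=lambda p: V[-1][p] * p_trans[(p, c)])
            scores[c] = V[-1][prev] * p_trans[(prev, c)] * p_emit[c].get(w, 1e-9)
            ptrs[c] = prev
        V.append(scores)
        back.append(ptrs)
    last = max(V[-1], key=V[-1].get)  # best final state, then follow backpointers
    path = [last]
    for ptrs in reversed(back):
        path.append(ptrs[path[-1]])
    return list(reversed(path))

# Hypothetical toy parameters: "Boston" is a strong destination cue.
concepts = ("else", "Dest.")
p_init = {"else": 0.7, "Dest.": 0.3}
p_trans = {(a, b): 0.5 for a in concepts for b in concepts}
p_emit = {"else": {"to": 0.3, "Boston": 0.01},
          "Dest.": {"to": 0.1, "Boston": 0.5}}

tags = viterbi_tag(["to", "Boston"], concepts, p_init, p_trans, p_emit)
```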


Semantic Classification trees

[Figure] A semantic classification tree: the question "City?" is asked first; on yes, "from City?" leads to the label Origin, otherwise "to City?" leads to the label Dest.

(Kuhn and De Mori, 1995)


Conceptual Language Models FSTs

[Diagram] Conceptual language models CLM0, CLM1, …, CLMj, …, CLMJ combined in parallel as finite-state transducers.

This architecture is also used for separating in-domain from out-of-domain message segments (Damnati, 2007) and for spoken opinion analysis (Camelin et al., 2006). In this way, the whole ASR knowledge models a relation between signal features and meaning.


Hypothesis generation from lattices

An initial ASR activity generates a word graph (WG) of scored word hypotheses with a generic LM.

The CLM network is then composed with WG, resulting in the assignment of semantic tags to paths in WG:

CLM = ⋃_{c=0}^{C} CLM_c, where CLM_0 = GENLM is the generic LM

SEMG = WG ∘ CLM


Word Graph

[Figure] Word graph WG for "autour de vingt euros", with competing arcs "du Trocadéro" / "un".


Composition

[Figure] Composition of WG with the CLM transducer: arcs pair words with semantic outputs (e.g. autour/NEAR, autour/RANGE, vingt euros/<PRICE>, Trocadéro/<PLACE>), producing begin markers such as <b(PRICE)> and <b(PLACE)> and interpretations such as [Place IN ([Thing LOC (type: square, value: Trocadéro)])].


Projection

In order to obtain the conceptual hypotheses that are more likely to be expressed by the analyzed utterance, SEMG is projected on its outputs (the second elements of the pairs associated with transitions), leading to a weighted finite state machine (FSM) with only indicators of beginning and end words of semantic tags. The resulting FSM is then made deterministic and minimized, leading to an FSM SWG given by

SWG=OUTPROJ(SEMG)

Operations on SFSTs are performed using the AT&T FSM library (Mohri et al.).

The objective is to have the correct interpretation as the most likely among the hypotheses coherent with the dialogue predictions.
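Output projection itself is a simple operation: each transducer arc keeps only its output label, turning the transducer into an acceptor over semantic tags. The sketch below models arcs as plain tuples with hypothetical weights loosely based on the earlier example; real systems use an FST toolkit (e.g. the AT&T library) that also provides determinization and minimization.

```python
def out_project(arcs):
    """OUTPROJ: keep only the output label of each arc,
    turning the transducer SEMG into the acceptor SWG.
    An arc is (src, dst, input, output, weight)."""
    return [(src, dst, out, w) for (src, dst, _inp, out, w) in arcs]

def tags_of_path(arcs):
    """Read the non-epsilon tag sequence off a path of projected arcs."""
    return [label for (_s, _d, label, _w) in arcs if label != "eps"]

# Hypothetical SEMG path modeled on the "autour de vingt euros" example.
semg = [(0, 1, "autour", "<b(PRICE)>", 0.9),
        (1, 2, "de", "eps", 1.0),
        (2, 3, "vingt", "eps", 1.0),
        (3, 4, "euros", "<e(PRICE)>", 0.8)]

swg = out_project(semg)
tags = tags_of_path(swg)
```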


Demo

LUNAVIZ

OK 340.1

Best path avoids insertions


CRF

p(y | x) = (1 / Z(x)) exp( Σ_i Σ_k λ_k f_k(y_{i-1}, y_i, x, i) )

Z(x) = Σ_y exp( Σ_i Σ_k λ_k f_k(y_{i-1}, y_i, x, i) )

Example feature:

f_k(y_{i-1}, y_i, x, i) = 1 if y_i = ARRIVE.CITY and x_{i-1} … x_i contains "arrive" or "to", and 0 otherwise.

Possibility of having features from long-term dependencies.

Results for LUNA from Riccardi, Raymond, Ney and Hahn.
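The linear-chain CRF definition above can be sketched directly, with Z(x) computed by brute-force enumeration (feasible only at toy sizes). The feature below is a hypothetical stand-in for the slide's ARRIVE.CITY example, and the weight is invented.

```python
import itertools
import math

def crf_prob(y, x, labels, feats, weights):
    """p(y | x) for a linear-chain CRF; Z(x) by brute-force enumeration
    over all label sequences (exponential in len(x) -- toy sizes only)."""
    def score(seq):
        return sum(w * f(seq[i - 1] if i > 0 else None, seq[i], x, i)
                   for i in range(len(seq))
                   for f, w in zip(feats, weights))
    Z = sum(math.exp(score(seq))
            for seq in itertools.product(labels, repeat=len(x)))
    return math.exp(score(tuple(y))) / Z

def f_city(prev, cur, x, i):
    # Hypothetical feature: fires when the current word is capitalized
    # and tagged CITY (cf. the ARRIVE.CITY feature on the slide).
    return 1.0 if cur == "CITY" and x[i][0].isupper() else 0.0

x = ("to", "Boston")
labels = ("O", "CITY")
p_city = crf_prob(("O", "CITY"), x, labels, [f_city], [2.0])
p_none = crf_prob(("O", "O"), x, labels, [f_city], [2.0])
```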


Results

FRENCH MEDIA corpus: 3000 test dialog turns, telephone speech, 2K vocabulary, WER 27.4%.

model           attribute name        attribute name & value
                text      speech      text      speech
CRF             11.5      24.3        14.6      28.8
FST             14.1      27.5        16.6      31.3
DBN             15.6      29.0        18.3      33.0
LLPos           14.7      27.5        17.7      32.1
SVM             16.5      29.2        19.2      33.1
SMT             15.2      28.7        23.4      34.9
weighted ROVER  10.7      23.8        13.0      27.9


Computational semantics

Epistemology, the science of knowledge, considers a datum as the basic unit.

Semantics deals with the organization of meanings and the relations between sensory signs or symbols and what they denote or mean.

Computer epistemology deals with observable facts and their representation in a computer.

Natural language interpretation by computers performs a conceptualization of the world using computational processes for composing a meaning representation structure from available signs and their features.


Frames as computational structures (intension)

{address
  loc TOWN (facet) … attached procedures
  area DEPARTMENT OR PROVINCE OR STATE … attached procedures
  country NATION … attached procedures
  street NUMBER AND NAME … attached procedures
  zip ORDINAL NUMBER … attached procedures }

A frame scheme with defining properties represents types of conceptual structures (intension) as well as instances of them (extension). Relations with signs can be established by attached procedures (S. Young et al., 1989).


Frame instances (extension)

A convenient way for asserting properties, and reasoning about semantic knowledge is to represent it as a set of logic formulas.

A frame instance (extension) can be obtained from predicates that are related and composed into a computational structure.

Frame schemata can be derived from knowledge obtained by applying semantic theories.

Interesting theories can be found, for example, in (Jackendoff, 1990, 2002) or in (Brachman, 1978, reviewed by Woods, 1985).

Φ(x) = instance_of(x, address) ∧ loc(x, Avignon) ∧ area(x, Vaucluse) ∧ country(x, France) ∧ street(x, 1 avenue Pascal) ∧ zip(x, 84000)


Frame instance

{a0001
  instance_of address
  loc Avignon
  area Vaucluse
  country France
  street 1, avenue Pascal
  zip 84000}

Schemata contain collections of properties and values expressing relations. A property or a role is represented by a slot filled by a value.


Composition process 1

Step 1 – concept tags -> frame instance fragments

Concept tag : hotel-parking,

Fragment : HOTEL.[hotel_facilities FACILITY.[facility_type parking]]


Composition process 2

Step 2 – composition by fusion

HOTEL.[luxury four_stars]

HOTEL.[h_facility FACILITY.[hotel_service tennis]]

Fusion rule

Result: HOTEL.[luxury four_stars, h_facility FACILITY.[hotel_service tennis]]

[Formula] The fusion rule merges two fragments F_{q,d} and F_{q,e} that share the same head: their slot lists are fused, and the support of the result is computed from the supports of the fused slots and head values.
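The fusion step can be sketched as a recursive slot merge between two fragments with the same head frame. The tuple encoding of frames below is an illustrative convention of this sketch, not the LUNA representation; support scores are omitted.

```python
def fuse(frag_a, frag_b):
    """Fuse two frame fragments that share the same head frame by
    merging their slot/value dictionaries, recursing into nested frames.

    A fragment is encoded as (head, {slot: value}); a value may itself
    be another fragment tuple.
    """
    head_a, slots_a = frag_a
    head_b, slots_b = frag_b
    if head_a != head_b:
        raise ValueError("fusion requires a shared head frame")
    merged = dict(slots_a)
    for slot, val in slots_b.items():
        if slot in merged and isinstance(merged[slot], tuple) and isinstance(val, tuple):
            merged[slot] = fuse(merged[slot], val)  # nested frames fuse too
        else:
            merged[slot] = val
    return (head_a, merged)

# The slide's example: two HOTEL fragments fused into one.
frag_a = ("HOTEL", {"luxury": "four_stars"})
frag_b = ("HOTEL", {"h_facility": ("FACILITY", {"hotel_service": "tennis"})})
fused = fuse(frag_a, frag_b)
```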


Composition process 3

Step 3 – Composition by attachment

[Formula] An attachment rule links a new fragment B_{q,z} to a head fragment that contains a compatible slot, with the support of the result computed from the supports of the linked fragments.

new fragment ADDRESS.[adr_city Lyon]

Composition result:

HOTEL. [luxury four_stars,

h_facility FACILITY.[hotel_service tennis],

at_loc ADDRESS.[adr_city Lyon]]


Semantic composition results

French corpus   F-measure, frame heads     F-measure, frame composition
                text      speech           text      speech
rules           98.3      77.0             85.3      66.9
CRF + rules     84.0      77.8             74.2      68.2
All DBN         76.11     70.0             56.2      54.0


Semantic composition CRF + rules

Metrics Precision Recall F1

1. Frame Recognition 78.37 77.28 77.82

2. Argument Boundary Detection 87.21 76.99 81.78

3. Argument Labeling 76.78 72.05 74.34

4. Argument Recognition 70.55 66.20 68.31

5. Frame Realization 59.87 59.03 59.45

6. Frame Composition 73.56 63.73 68.29

7. Turn Analysis 46.77 45.98 46.37


Demo

LUNAVIZ

OK 340 1


CONFIDENCE AND LEARNING


Confidence

Evaluate the confidence of components and compositions with P(Γ | Y, conf), where Γ is an interpretation hypothesis and conf represents the confidence indicators or a function of them.

Notice that it is difficult to compare competing interpretation hypotheses based on the probability P(Γ | Y), where Y is a time sequence of acoustic features, because different semantic constituents may have been hypothesized on different time segments of the stream Y.


Probabilistic frame based systems

In probabilistic frame-based systems (Koller, 1998), a frame slot S of a frame F is associated with a facet Q with value Y: Q(F, S, Y).

A probability model is part of a facet as it represents a restriction on the values Y.

It is possible to have a probability model for a slot value which depends on a slot chain.

It is also possible to inherit probability models from classes to subclasses, to use probability models in multiple instances and to have probability distributions representing structural uncertainty about a set of entities.


Dependency graph with cycles

[Diagram] Acoustic evidence Y_k supports word hypotheses W_k, which support concepts C_k and filled slots (i, j, k); these support links may form cycles.

If the dependency graph has cycles, possible worlds can be considered. The computation of probabilities of possible worlds is discussed in (Nilsson, 1986). A general method for computing probabilities of possible worlds based on Markov logic networks (MLN) is proposed in (Richardson, 2006).


Probabilistic models of relational data

Relational Markov Networks (RMN) are a generalization of CRFs that allow for collective classification of a set of related entities by integrating information from features of individual entities as well as relations between them (Taskar et al., 2002).

Methods for probabilistic logic learning are reviewed in (De Raedt, 2003).


Define confidence-related situations

Consensus among classifiers and SFSTs is used to produce confidence indicators in a sequential interpretation strategy (Raymond et al., 2005, 2007). The classifiers used are SCT, SVM and AdaBoost. Committee-based active learning uses multiple classifiers to select samples (Seung et al., 1992).

[Diagram] FSM, SCT, SVM and AdaBoost outputs feed a fusion strategy.


Confidence measures

Two basic steps:

1) generate as many features as possible based on the speech recognition and/or natural language understanding process and

2) estimate correctness probabilities from these features, using a combination model.


Features for confidence

Many features are based on empirical considerations:

semantic weights assigned to words,

uncovered word percentage,

gap number,

slot number,

word, word-pair and word-triplet occurrence counts.


Adaptive Learning in Practice

(Riccardi, Tur and Hakkani-Tür, 2005)


Application with existing technology

[Diagram] The OMS (FT VoiceXML server) sends ASR output over HTTP to an application server; the ASR uses a domain-specific LM. A classification process applies a multi-classifier that outputs Probability(label) for each label and classifier, filtered by a threshold α. An alarm-detection stage applies an AdaBoost classifier that outputs an alarm probability, filtered by a threshold β. Results are returned as HTML through a web server.


Using information theory for system tuning

Usable results have been obtained in French with the opinion analysis system

Selection of the operating point (IEEE TSAP)


Bias estimation and correction

Proportion estimation results before and after correction (INTERSPEECH 2009).

[Figure] Observed proportion vs. true proportion.


Architecture of the FT automatic survey platform

[Diagram] IVR Survey Framework: a survey-definition interface produces survey-definition data for the survey platform, which comprises the OMS (FT VXML server) with ASR/TTS, an application server and Disserto (the FT dialogue manager). Calls come in and results go out; logs feed traffic supervision; a provisioning interface uses customer profiles, sampling models and record models; results are inspected through a results view.


Synopsis of the two applications related to opinion analysis

[Diagram] Two applications share an opinion-detection front end (segmentation followed by interpretation). For monitoring, a validation step produces daily statistics (0-100%) of positive and negative opinions on courtesy, efficiency and rapidity. For alarm detection, an emergency-estimation step produces a sorted list of detected alarms.


The LUNA MEDIA SYSTEM

[Diagram] The LUNA Talk server (ASR, TTS) communicates with a client hosting the SLU component, the dialogue manager (DM) and a Short-Term Memory (STM).


Dialogue systems

[Diagram] Left: given the user utterance u_t, acoustic evidence Y_t, system action a_t and previous state S_{t-1}, the SLU system and a next-state estimator produce an estimated dialogue state S̃_t for the dialogue manager (with memory): the action depends on the dialogue state. Right: a belief estimator produces a belief b̃ instead: the action depends on the belief.


Conclusions

A modular SLU architecture can exploit the benefits of combined use of CRFs, classifiers and stochastic FSMs, which are approximations of more complex grammars.

Grammars should perhaps be used in conjunction with processes having inference capabilities.

Recent results and applications of probabilistic logic appear interesting, but its effective use for SLU still has to be demonstrated.

Annotating corpora for these tasks is time-consuming, suggesting a combination of knowledge acquired by machine-learning procedures and human knowledge.


Conclusions

Robustness, incremental learning, portability are important and open issues.

SLU is not only used in human-machine dialogs. Other applications include opinion analysis, indexing, summarization and retrieval.

Some applications can tolerate a certain amount of ASR imprecision.


THANK YOU


Dialogue system requirements

Negotiation ability

user satisfaction , new requests

Context interpretation

anaphora, ellipses, disfluencies, synchronization

Interaction flexibility

focus switch

Cooperative reactions

completion, corrective, conditional, intentional answers

Adequacy of response style


Communicative acts

Cooperation inspires communication, which should be based on rationality.

Communication is performed by agents that are in a mental state.

An act depends on mental states and has the effect of modifying them


Reasoning and decision

[Diagram] Beliefs, goals, dialog history and concepts feed an inference process; a decision step then selects actions.


Types of dialogues

– collaborative dialogues: contrary beliefs, shared goals
– conflictive dialogues: contrary beliefs, contrary goals
– cooperative dialogues (Grice, 1975): shared beliefs, shared goals


models for decision making

– precondition-action rules
– finite-state diagrams; stochastic finite-state automata
– formal grammars; stochastic CFGs
– logic inference: predicate logics, Markov logic networks, modal logics
– overall reward maximization: (partially observable) Markov decision processes


State dependent or belief dependent decision

[Diagram] Left: given the user utterance u_t, acoustic evidence Y_t, system action a_t and previous state S_{t-1}, the SLU system and a next-state estimator produce an estimated dialogue state S̃_t for the dialogue manager (with memory): the action depends on the dialogue state. Right: a belief estimator produces a belief b̃ instead: the action depends on the belief.


Sub-optimal sequential solution

The optimal action would be

A*_u = argmax_{A_u} P(A_u | Y, B) = argmax_{A_u} P(Y | A_u, B) P(A_u | B)

with

P(Y | A_u, B) = Σ_W P(Y, W | A_u, B) ≈ max_W P(Y | W, A_u) P(W | A_u, B)

A sub-optimal sequential solution first searches for

W* = argmax_W P(Y | W) P(W | B)

and then selects

A*_u = argmax_{A_u} P(A_u | W*, B)


Finite-state diagrams

Structural approaches, based on finite-state models, assume that dialog coherence resides in its structure based on descriptions of observed sequencing of utterances, not on explanations of what is observed.

Although effective for some practical systems, the approach is rigid and may severely limit user satisfaction.

While conviviality is a major requirement for system design, an efficient implementation is an essential condition for achieving user satisfaction.

There are specific tasks in dialog that can be achieved with simple and effective models based on finite-state diagrams.


Finite-state diagrams

[Diagram] Each state machine SM i applies an interpretation INT i = ASU(LM i, UM i); conditions CONDITION 1-21 … CONDITION 1-2n select the successor machine among SM 21 … SM 2n.


Finite-state diagrams

Next state depends only on actual interpretation of the input. Results of inference are pre-computed and represented by

(state, input) → next_state
(actual_state) → action
rules.

Finite-state models can be used in more general frameworks in which states represent tasks to be accomplished.

Each task can then be accomplished by models based on different paradigms (Heeman et al., 1998).

This simplifies system design imposing a certain degree of system-initiative without preventing mixed-initiative.

The grounding process ensures that what has been said is mutually understood; it is explicitly achieved by asking the user to verify recognition results before leaving a dialog state.
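The pre-computed rule tables can be sketched as plain lookup dictionaries; the states, inputs and prompts below are a hypothetical miniature, including a grounding confirmation before leaving the state.

```python
# Hypothetical rule tables for a tiny system-initiative dialog:
# (state, interpretation) -> next_state, and state -> action.
NEXT = {("ask_city", "city_given"): "confirm_city",
        ("confirm_city", "yes"): "done",
        ("confirm_city", "no"): "ask_city"}

ACTION = {"ask_city": "prompt: which city?",
          "confirm_city": "prompt: did you say {city}?",  # grounding step
          "done": "route call"}

def step(state, interpretation):
    """Apply the transition rules; stay in place when no rule matches
    (e.g. a rejected or out-of-domain input)."""
    nxt = NEXT.get((state, interpretation), state)
    return nxt, ACTION[nxt]
```

A full system would attach ASU models per state, but the control flow stays this simple table lookup.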


Example : The FT 3000 system

• Service
  – obtain information about FT services: purchase almost 30 different services
  – access account: check consumption, pay bills, call forwarding, voice messaging
• Deployed since October 2005
• Corpus collected daily
• Dealing with different kinds of speech
  – speech / non-speech
  – out-of-domain speech / in-domain speech
  – comments from the users: speech with valid / invalid content
• Experienced vs. new users


FT 3000 Voice Agency service

Thousands of dialogues are automatically processed every day and the system logs are stored.

Speakers are not professional ones and are not biased by the acquisition protocols.

The complexity of the services widely deployed is necessarily limited in order to guarantee robustness with a high automation rate.

The semantic model of such deployed system is task-oriented.


Dialogue types

Transit dialogues: routing users to specific vocal services to activate already-purchased services.

These dialogues represent 80% of the calls to the FT3000 service.

Other dialogues: those in which users ask about a service to which they may wish to subscribe.

For these dialogues, the average utterance length is higher, as well as the average number of dialogue turns.


From concept lattice to multiple active states

[Diagram] Sequence of utterances → word strings → basic concept tag strings → sequence of interpretations → sequence of dialog states.

Objective: obtain the best sequence of states S over a sequence of spoken utterances represented by Y, computed recursively by the SLU component.


Strategy 1: sequential

• Best sequence of words

• Best sequence of concepts

• Best interpretation

Probability of a dialog state S at turn k: computed from the 1-best word string, 1-best concept string and 1-best interpretation.


Strategy 2: integrated search

• Best sequence of concepts+words+interpretation

For estimating the probability of a dialog state and an interpretation: joint search for the best sequence of words and concepts

word lattice → concept lattice → interpretation lattice


Strategy 3: integrated search + dialog context

• Bigram Language Model in the dialog states:

word lattice → concept lattice → interpretation lattice → dialog state lattice


Implementation with a Finite State Machine approach

• Language Models + Local grammars + rules

– AT&T FSM + GRM Libraries

– word level: word lattice, 3-gram language model
– concept level: HMM tagger
– interpretation level: interpretation rules
– dialog-state level: dialog manager rules, 2-gram state model


Corpus used for the experiments

• Corpus (manually labelled data)
  – Training: 44K utterances for the word and concept LMs; 7.4K dialogues for the dialog-state LM
  – Test: 816 dialogues / 1950 utterances
• Two types of dialogues
  – "Transit": request for a specific task and rerouting towards the corresponding service
  – "Other": request for information or a new service purchase; longer dialogs, more comments, new users


Evaluation criteria

The results are evaluated according to 3 criteria:

• the Word Error Rate (WER), the Concept Error Rate (CER) and the Interpretation Error Rate (IER).

• the CER is related to the correct translation of an utterance into a string of basic concept tags.

• the IER is related to the global interpretation of an utterance in the context of the dialogue service considered. Therefore, this last measure is the most significant one, as it is directly linked to the performance of the dialogue system.


Results with the OOD Language Model

• OOD LM is very useful on the other dialogues, which contain most of the comments from the users

• No negative impact on the transit dialogues

• IER = Interpretation Error Rate (at the turn level)
• correct = all the attribute/value components in a frame must be correct
• Strategy 1

Example: {Gest(Resilier,Ambi(AtoutsPlus,HeureLocale,ForfaitLocal))}


Results according to the SLU strategy

• Strat1=sequential• Strat2=integrated• Strat3=integrated + dialog context

• WER = Word Error Rate• CER = Concept Error Rate• IER = Interpretation Error Rate

• Some gains obtained by using an integrated search (strat2)
  – higher correct values, but too many insertions (false acceptances)
• No improvement observed by using dialog context
  – very short dialogues (1.5 turns for Transit, 3.7 for Other)


Oracle measures

• Oracles in– Word lattice (WER)– Concept lattice (CER)– Interpretation lattice (IER)

– utterances detected as OOD were discarded

• IER obtained on these Oracle hypotheses– IER on the Oracle 1-best word string = 9.8%– IER on the Oracle 1-best concept string = 7.5%– IER on the Oracle 1-best interpretation = 4.4%

Joint search and optimization on the word/concept/interpretation levels


SLU decision process

• Decision process based on multiple hypotheses output• Example

– Agreement at the different levels for detecting "expected" dialogs
– Can be used to detect problematic dialogues


Plan-based approaches

communicating is acting (Austin, 1962)

An utterance is also a realization of observable communicative actions: speech acts or dialog acts such as

inform, request, confirm, commit.

Communicative actions can be planned and recognized on the basis of mental states.

Dialog analysis is considered in the framework of explaining actions based on mental states, relying on general models of actions and mental attitudes.


Activities

• identify the user goals or intentions (e.g. task goals, communicative goals such as adding information or corrections); a goal reasoner infers goals or intentions from interpretations or inferred dialog acts and should respect an interaction model

• identify and update beliefs

Cooperation, rationality and other model properties are represented by formulae which are used by an inference engine

Result : Dialogue state

• make decisions Result : actions

Optimal (minimum risk, maximum reward) decision strategies


Rational interaction to recognize goals and beliefs

Rational interaction is based on the assumption that an intelligent system has a communication ability, grounded on a general competence, that characterizes rational behavior (Sadek)

A proposition is a belief of an agent if the agent considers that the proposition is true.

Belief is the mental attitude whereby an agent has a model of the world.

Reasoning involves belief reconstruction and revision

Uncertainty is a way of representing an approximate perception of the world and expresses the fact that the agent believes that a proposition is not true, but it is more likely than its negation.


Rational interaction

Formulae of the type B(i,p), U(i,p) and C(i,p) can respectively be read as: "i believes that p is true", "i is uncertain about the truth of p", and "i desires that p is currently true".

The logical model for operator B accounts for interesting properties of a rational agent, such as consistency of beliefs, and introspection capacity, formally characterized by the validity of logical schema such as:

B(i, φ) ⇒ B(i, B(i, φ))
¬B(i, φ) ⇒ B(i, ¬B(i, φ))
B(i, φ) ∧ B(i, φ ⇒ ψ) ⇒ B(i, ψ)


Rational interaction

• an agent cannot be uncertain of its mental attitudes
• an agent chooses the logical consequences of its choices
• an agent must choose the courses of events it believes it is already in
• an agent cannot bring about a situation if it believes that the situation already holds
• agents are in perfect agreement with themselves about their own mental attitudes
• agents may also decide actions based on reasoning


Probability distribution of user goals

Recent systems accommodate ASR and interpretation errors by maintaining beliefs about what the user has said, rather than accepting the output from the recogniser.

Solutions:

• models that update a compressed representation of belief using the most likely last user utterance (Rudnicky, CMU)

• models that use the full set of possible user utterances to update a distribution over user goals (Young, Williams, Thompson, Cambridge Univ.)


dialog policy

The dialog policy chooses a system action based on the current dialog state:

A_t = π(S_t)

State transitions are characterized by the following probability distribution:

P(S_t+1 | S_t, A_t)
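The policy A_t = π(S_t) and the transition distribution P(S_t+1 | S_t, A_t) can be sketched on a toy dialog problem as follows (state and action names here are invented for illustration, not taken from the lecture):

```python
import random

# Toy dialog MDP: a deterministic policy maps each state to an action,
# and the next state is sampled from P(S_t+1 | S_t, A_t).
TRANSITIONS = {
    ("ask_date", "confirm"):  {"date_known": 0.8, "ask_date": 0.2},
    ("ask_date", "reprompt"): {"date_known": 0.5, "ask_date": 0.5},
}

def policy(state):
    """A deterministic policy A_t = pi(S_t)."""
    return "confirm"

def step(state):
    """Apply the policy, then sample S_t+1 from P(S_t+1 | S_t, A_t)."""
    action = policy(state)
    dist = TRANSITIONS[(state, action)]
    next_state = random.choices(list(dist), weights=list(dist.values()))[0]
    return action, next_state
```

Each row of `TRANSITIONS` must sum to one, since it is a conditional distribution over next states.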


dialog policy

The combination of action and state at time t determines the expected immediate reward:

r_t+1 = r(S_t, A_t)

The goal is to find a policy which maximizes the total reward:

R = Σ_{t=1..T} r_t+1


dialog policy

A small negative reward is given at each dialog turn; a large positive reward is given on completing successfully, and a corresponding negative reward on completing unsuccessfully. All solution methods of reinforcement learning depend on finding the value function for any state:

V(S) = E(R | S_t = S)

Given a state, there may be different paths to completion, depending in part on the unpredictability of the user; each of these paths corresponds to a reward, and the expected value of these rewards can be computed.
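One way to estimate the value function V(S) = E(R | S_t = S) is Monte Carlo simulation: roll out many dialogs from S under the policy and average the total rewards. A minimal sketch (all names and reward values are illustrative):

```python
import random

def simulate_episode(state, policy, transitions, rewards, goals, max_turns=50):
    """Return the total reward R of one simulated dialog starting in `state`."""
    total = 0.0
    for _ in range(max_turns):
        if state in goals:          # dialog completed
            break
        action = policy(state)
        total += rewards[(state, action)]
        dist = transitions[(state, action)]
        state = random.choices(list(dist), weights=list(dist.values()))[0]
    return total

def value_estimate(state, policy, transitions, rewards, goals, n=1000):
    """Monte Carlo estimate of V(S) = E(R | S_t = S)."""
    return sum(simulate_episode(state, policy, transitions, rewards, goals)
               for _ in range(n)) / n
```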


dialog policy

A more useful function is:

Q(S, A) = E(R | S_t = S, A_t = A)

from which the optimal policy at state S can be determined as follows:

π'(S) = argmax_A Q(S, A)

Q satisfies the recursion:

Q(S, A) = r(S, A) + Σ_{S'} P(S' | S, A) max_{A'} Q(S', A')
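The optimal Q function can be computed by iterating the recursion Q(S, A) = r(S, A) + Σ_{S'} P(S' | S, A) max_{A'} Q(S', A') to a fixed point, then reading off the greedy policy. A minimal sketch on a toy two-state dialog problem (the state, action names and rewards are invented for illustration):

```python
def q_iteration(states, actions, P, r, sweeps=100):
    """Iterate Q(S,A) = r(S,A) + sum_{S'} P(S'|S,A) * max_{A'} Q(S',A')."""
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(sweeps):
        V = {s: max(Q[(s, a)] for a in actions) for s in states}  # max_{A'} Q(S',A')
        Q = {(s, a): r[(s, a)] + sum(p * V[s2] for s2, p in P[(s, a)].items())
             for s in states for a in actions}
    return Q

def greedy_policy(Q, states, actions):
    """pi'(S) = argmax_A Q(S,A)."""
    return {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}
```

With an absorbing terminal state of zero reward, the iteration converges geometrically even without discounting, matching the episodic setting of the slides.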


Partially Observable MDP

In most practical spoken dialog systems, the state space and the set of possible user dialog acts are very large. As a result, belief monitoring is computationally very expensive and direct policy optimisation is intractable. The user's state cannot be observed directly and must be inferred.

Thus a dialog system must base its strategy on incomplete data and has to use a Partially Observable MDP (POMDP).

Dialog states are summarized in system beliefs (a manageable finite set)

System beliefs are obtained by using an observation vector O which encapsulates the evidence needed to update and maintain them.


POMDP

Modelling a dialog system as a POMDP involves maintaining a probability distribution over all possible machine states.

When performing policy optimisation, however, the subsequent machine action is likely to focus on just the most likely states. This suggests maintaining two coupled state spaces:

the full space, called the master state space, and

a much simpler space, called the summary state space. The summary state space consists of the top N user goal states from the master space.
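The reduction from master space to summary space can be sketched as keeping the N most probable user-goal states of the belief and lumping the remaining probability mass together (a minimal illustration of the idea, not the actual HIS partitioning mechanism):

```python
import heapq

def summary_space(belief, n=2):
    """Keep the top-N most probable states; lump the rest into leftover mass."""
    top = heapq.nlargest(n, belief.items(), key=lambda kv: kv[1])
    rest = 1.0 - sum(p for _, p in top)
    return dict(top), rest
```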


Hidden Information State (HIS)

The HIS model groups together user goals into partitions which are indistinguishable given the past user utterances.

In order to reduce complexity, a Bayesian Network is used to represent the state of the POMDP, where the network is factored according to the relations of the system's relational database (Thompson and Young, ICASSP 2008).

State S : (user_goal g, user act u, history summary h)

States are hidden and have to be inferred


Belief

belief b(S(t)) = k · P(Y(t) | S(t)) · Σ_{S(t-1)} P(S(t) | S(t-1), a(t-1)) · b(S(t-1))

where S(t) = (g(t), u(t), h(t))

System actions depend on the distribution of beliefs among states.

Belief is computed for each state at each turn t by assuming only certain dependences among slots.

Probabilities are computed by representing sequences of states with Dynamic Bayesian Networks (DBNs).
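One turn of the belief update b(S(t)) = k · P(Y(t) | S(t)) · Σ_{S(t-1)} P(S(t) | S(t-1), a(t-1)) · b(S(t-1)) can be sketched as follows (a toy flat-state version for illustration, not the factored DBN of the LUNA system):

```python
def update_belief(b_prev, action, transition, obs_likelihood):
    """Predict with the transition model, weight by the observation, normalize."""
    unnorm = {}
    for s in obs_likelihood:                       # candidate states S(t)
        prior = sum(transition[(s_prev, action)].get(s, 0.0) * p
                    for s_prev, p in b_prev.items())
        unnorm[s] = obs_likelihood[s] * prior      # P(Y(t) | S(t)) * predicted prior
    k = 1.0 / sum(unnorm.values())                 # normalizing constant k
    return {s: k * v for s, v in unnorm.items()}
```

With a uniform prior and an identity transition model, the updated belief simply follows the observation likelihood, which is a useful sanity check.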


DBN state

[Diagram: state S(t) is composed of a goal(t) node with goal-attribute1 and goal-attribute2, a user(t) node with user-attribute1 and user-attribute2, and a history(t) node with h-attribute1 and h-attribute2]


DBN dependencies

[Diagram: the same network with dependency arcs added among the goal(t), user(t) and history(t) nodes and their attributes]


DBN system action

[Diagram: the previous network with the system action A added as a node influencing state S(t)]


DBN observation

[Diagram: the previous network with the observation Y(t) added as a node generated by state S(t)]


DBN

[Diagram: the full two-slice DBN: states S(t-1) and S(t), each composed of goal, user and history nodes with their attributes, with system actions A and observations Y(t-1) and Y(t) linking the slices]


Policy

Grid-based approximations to a summary state containing the probability of each slot require too many grid points and are ill-suited to the task. Instead, some form of tiling or more general function approximation must be used.

An algorithm that performs well within the proposed framework is episodic Natural Actor Critic.

For each action a, a summary function φa(b) is defined. It maps belief distributions b to a summary vector of features that the system designer believes are important in choosing between actions.
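A minimal illustration of such a summary function φa(b); the particular features below (top-hypothesis probability, margin over the runner-up, an action-type indicator) are invented for the example and are not those of the cited work:

```python
def phi(belief, action):
    """Map a belief distribution and an action to a small feature vector."""
    probs = sorted(belief.values(), reverse=True)
    top = probs[0]
    second = probs[1] if len(probs) > 1 else 0.0
    return [top, top - second, 1.0 if action == "confirm" else 0.0]
```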


Multimodal dialog system

[Diagram: a Dialog Engine with a sub-system manager and an event handler and synchronizer, connected to I/O systems #1 and #2 and to service systems #1 and #2]


THANK YOU