
IEEE DL 2009, Athens, October 17th, 2009

Renato De Mori
McGill University

University of Avignon

Spoken Language Recognition and Understanding

LUNA IST contract no 33549


Summary

• Short discussion on automatic language recognition problems
• Automatic speech recognition in a working system

• Interpretation process
– Full / shallow parsing

– Generative vs. discriminative models

– Semantic composition

– Confidence and learning

– Examples of working systems


Dialog system architecture

[Diagram] input → RECOGNITION → UNDERSTANDING and DIALOG → ANSWER GENERATION → answer, driven by acoustic, language, semantic, dialog and synthesis models.


The channel model

[Diagram] INFORMATION SOURCE → W → INFORMATION CHANNEL → A

The objective of recognition is to reconstruct W from the observation of A. Both the source and the channel contain knowledge sources (KSs). The source generates a variety of sequences W with a given probability distribution; the channel, for a given W, generates a variety of A with a given probability distribution.

Consider the following coding scheme:


Decoding as search

Search for the sequence of words W such that

W* = argmax_W P(W | A) = argmax_W P(A | W) P(W) / P(A)

where P(A | W) is computed by the acoustic model AM and P(W) by the language model LM; W is a sequence of hypothesized words and A is the acoustic evidence.
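As an illustration only, this decoding rule can be sketched over a small finite candidate list; the probabilities below are invented toy values, not the output of a real acoustic or language model.

```python
import math

def decode(candidates, acoustic_logp, lm_logp):
    """Return the candidate W maximizing log P(A|W) + log P(W).

    P(A) is constant over candidates, so it can be dropped from the argmax.
    """
    return max(candidates, key=lambda w: acoustic_logp[w] + lm_logp[w])

# Hypothetical scores: the acoustics barely prefer "show", the LM strongly does.
candidates = ("show flights", "shoe flights")
acoustic_logp = {"show flights": math.log(0.6), "shoe flights": math.log(0.5)}
lm_logp = {"show flights": math.log(0.3), "shoe flights": math.log(0.01)}

best = decode(candidates, acoustic_logp, lm_logp)
```

In a real recognizer the argmax runs over a search graph rather than an explicit list, but the scoring combination is the same.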


word hypothesis generation

[Diagram] A signal corpus is used for training acoustic models; a text corpus is used for training language and lexical models. These are compiled into an INTEGRATED NETWORK that, together with the sequence of acoustic descriptors, drives word hypothesis generation.


[Diagram] A fully connected word network: each word Wi carries a start probability P(Wi), and every word pair carries a bigram transition probability P(Wj | Wi).


Left-to-right models with skips and empty transitions

[Diagram] States Si and Sj, with self-loop probability Pii and forward transition probability Pij, where Pii + Pij = 1. bij(a) is the probability of observing a during the transition i → j (and bii(a) during the self-loop).
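A minimal sketch of the forward computation for such a model, assuming (purely for illustration) a two-state chain in which both states share the same loop probability; the emission functions and numbers in the usage below are hypothetical.

```python
def forward(obs, p_stay, p_move, b_stay, b_move):
    """Forward probabilities through a two-state left-to-right model.

    b_stay(a) plays the role of b_ii(a) and b_move(a) of b_ij(a);
    p_stay + p_move = 1 as on the slide.  Simplifying assumption:
    both states use the same self-loop probability.
    """
    alpha = [1.0, 0.0]  # start in the first state
    for a in obs:
        alpha = [
            alpha[0] * p_stay * b_stay(a),       # i -> i self-loop
            alpha[0] * p_move * b_move(a)        # i -> j forward move
            + alpha[1] * p_stay * b_stay(a),     # j -> j self-loop
        ]
    return alpha

# Hypothetical constant emissions just to trace the recursion.
alpha = forward(["a"], 0.4, 0.6, lambda a: 0.5, lambda a: 0.5)
```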


Mixture densities

[Diagram] Transition emissions gii(a) and gij(a) attached to the self-loop (Pii) and to the forward transition (Pij).

G_ij(a) = Σ_g w_ijg N(a; μ_ijg, Σ_ijg)

If the components of a are statistically independent, the multivariate density factors into a product of one-dimensional Gaussians.


Major problems

Impressive performance but not robust enough.

Some reasons:

acoustic variability

noise

disfluency

pronunciation variability

language variety

reduced feature size

modeling limits

non-admissible search


Demo

LUNAVIZ

OK 340.1

HS 267 3 267 4

DISF 298 1

COM 238 6

COM 238 4


Possible applications

Constrained environments (broadcast transcriptions)

Partial automation (telephone services)

Redundant messages (spoken surveys)

Specific content extraction (understanding and dialog)


Complexity, Functionality, Technology

[Figure] Interaction complexity (human-machine) growing from 1990 to 2009:
– Command and control (e.g., simple call routing, VRCP, voice dialing): simple ASR, isolated words, connected digits
– Prompt-constrained natural language (e.g., travel reservations, finance, directory assistance): larger vocabulary, hand-built grammars
– Free-form natural language dialogue (e.g., customer care, help desks, e-commerce): very large vocabulary, NL, DM, TTS


COMPUTER INTERPRETATION PROCESS


Interpretation problem decomposition

Speech → signs → meaning (1-best, n-best, lattices)

Acoustic features → words → constituents → structures (features for interpretation)

The problem-reduction representation is context-sensitive.

Interpretation is a composite decision process. Many decompositions are possible, involving a variety of methods and KSs, which suggests a modular approach to process design.

Robustness is obtained by evaluation and possible integration of different KSs and methods used for the same sub-task.


Understanding process overview

Speech to conceptual structures and MRL:

[Diagram] speech → signs → words and concept tags → concept structures → MRL description. A Short-Term Memory holds intermediate hypotheses; a Long-Term Memory contains the AM, the LM and the interpretation KSs; dialogue and learning processes interact with both.


Interpretation as a translation process

Interpretation of written text can be seen as a process that uses procedures for translating a sequence of words in natural language into a set of semantic hypotheses (just constituents or structures) described by a semantic language.

W:[S[VP [V give, PR me] NP [ART a, N restaurant] PP[PREP near, NP [N Montparnasse, N station]]]]

[Action REQUEST ([Thing RESTAURANT], [Path NEAR ([Place IN ([Thing MONTPARNASSE])])])]

Interesting discussion in (Jackendoff, 1990) Each major syntactic constituent of a sentence maps into a conceptual constituent, but the inverse is not true.


Parsing with ATIS stochastic semantic grammars

[Figure] Parse tree for "Please show me the flights to Boston on Monday": non-terminal nodes (show indicator, flight indicator, destination indicator, city name, date indicator, day) dominate the terminal words, yielding the concepts show, flight, destination Boston and date Monday.


Semantic building actions in full parsing

S → NP[agent] VP[action] NP[theme]; NP → det N
"the customer | accepts | the contract"

Use tree kernel methods for learning argument matching (Moschitti, Raymond, Riccardi, ASRU 2007)

Not robust enough if WER is high


Predicate/argument structures and parsers

Recently, classifiers were proposed for detecting concepts and roles.

Such detection process was integrated with a stochastic parser (e.g. Charniak 2001).

A solution using this parser and tree-kernel based classifiers for predicate argument detection in SLU is proposed in (Moschitti et al. ASRU 2007).

Other relevant contributions on stochastic semantic parsing can be found in (Goddeau and Zue, 1992; Goodman, 1996; Chelba and Jelinek, 2000; Roark, 2002; Collins, 2003).

Lattice-based parsers are reviewed in (Hall, 2005)


Shallow parsing


Probabilistic interpretation in the CHRONUS system

[Diagram] Concept states (Org., Dest., Date, else) form a Markov chain over word segments.

The joint probability P(C, W) is computed using Markov models as

P(C, W) = P(W | C) P(C)

(Pieraccini et al., 1991; Pieraccini, Levin and Vidal, 1993).
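The decomposition P(C, W) = P(W | C) P(C) treats concepts as hidden states emitting words, so the best concept sequence can be found by Viterbi decoding. A toy sketch follows; all probabilities are hypothetical, not parameters of the actual CHRONUS models.

```python
def viterbi_tag(words, concepts, p_init, p_trans, p_emit):
    """Most likely concept sequence C for words W under P(W | C) P(C)."""
    V = [{c: p_init[c] * p_emit[c].get(words[0], 1e-9) for c in concepts}]
    back = []
    for w in words[1:]:
        scores, ptrs = {}, {}
        for c in concepts:
            prev = max(concepts, key=lambda p: V[-1][p] * p_trans[(p, c)])
            scores[c] = V[-1][prev] * p_trans[(prev, c)] * p_emit[c].get(w, 1e-9)
            ptrs[c] = prev
        V.append(scores)
        back.append(ptrs)
    last = max(V[-1], key=V[-1].get)  # best final state, then follow backpointers
    path = [last]
    for ptrs in reversed(back):
        path.append(ptrs[path[-1]])
    return list(reversed(path))

# Hypothetical toy parameters: "Boston" is a strong destination cue.
concepts = ("else", "Dest.")
p_init = {"else": 0.7, "Dest.": 0.3}
p_trans = {(a, b): 0.5 for a in concepts for b in concepts}
p_emit = {"else": {"to": 0.3, "Boston": 0.01},
          "Dest.": {"to": 0.1, "Boston": 0.5}}

tags = viterbi_tag(["to", "Boston"], concepts, p_init, p_trans, p_emit)
```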


Semantic Classification trees

[Figure] A semantic classification tree: the question "City?" is asked first; on yes, "from City?" leads to the label Origin, otherwise "to City?" leads to the label Dest.

(Kuhn and De Mori, 1995)


Conceptual Language Models FSTs

[Diagram] Conceptual language models CLM0, CLM1, …, CLMj, …, CLMJ combined in parallel as finite-state transducers.

This architecture is also used for separating in-domain from out-of-domain message segments (Damnati, 2007) and for spoken opinion analysis (Camelin et al., 2006). In this way, the whole ASR knowledge models a relation between signal features and meaning.


Hypothesis generation from lattices

An initial ASR activity generates a word graph (WG) of scored word hypotheses with a generic LM.

The CLM network is then composed with WG, resulting in the assignment of semantic tags to paths in WG:

CLM = ⋃_{c=0}^{C} CLM_c, where CLM_0 = GENLM is the generic LM

SEMG = WG ∘ CLM


Word Graph

[Figure] Word graph WG for "autour de vingt euros", with competing arcs "du Trocadéro" / "un".


Composition

[Figure] Composition of WG with the CLM transducer: arcs pair words with semantic outputs (e.g. autour/NEAR, autour/RANGE, vingt euros/<PRICE>, Trocadéro/<PLACE>), producing begin markers such as <b(PRICE)> and <b(PLACE)> and interpretations such as [Place IN ([Thing LOC (type: square, value: Trocadéro)])].


Projection

In order to obtain the conceptual hypotheses that are more likely to be expressed by the analyzed utterance, SEMG is projected on its outputs (the second elements of the pairs associated with transitions), leading to a weighted finite state machine (FSM) with only indicators of beginning and end words of semantic tags. The resulting FSM is then made deterministic and minimized, leading to an FSM SWG given by

SWG=OUTPROJ(SEMG)

Operations on SFSTs are performed using the AT&T FSM library (Mohri et al.).

The objective is to have the correct interpretation as the most likely among the hypotheses coherent with the dialogue predictions.
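Output projection itself is a simple operation: each transducer arc keeps only its output label, turning the transducer into an acceptor over semantic tags. The sketch below models arcs as plain tuples with hypothetical weights loosely based on the earlier example; real systems use an FST toolkit (e.g. the AT&T library) that also provides determinization and minimization.

```python
def out_project(arcs):
    """OUTPROJ: keep only the output label of each arc,
    turning the transducer SEMG into the acceptor SWG.
    An arc is (src, dst, input, output, weight)."""
    return [(src, dst, out, w) for (src, dst, _inp, out, w) in arcs]

def tags_of_path(arcs):
    """Read the non-epsilon tag sequence off a path of projected arcs."""
    return [label for (_s, _d, label, _w) in arcs if label != "eps"]

# Hypothetical SEMG path modeled on the "autour de vingt euros" example.
semg = [(0, 1, "autour", "<b(PRICE)>", 0.9),
        (1, 2, "de", "eps", 1.0),
        (2, 3, "vingt", "eps", 1.0),
        (3, 4, "euros", "<e(PRICE)>", 0.8)]

swg = out_project(semg)
tags = tags_of_path(swg)
```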


Demo

LUNAVIZ

OK 340.1

Best path avoids insertions


CRF

p(y | x) = (1 / Z(x)) exp( Σ_i Σ_k λ_k f_k(y_{i-1}, y_i, x, i) )

Z(x) = Σ_y exp( Σ_i Σ_k λ_k f_k(y_{i-1}, y_i, x, i) )

Example feature:

f_k(y_{i-1}, y_i, x, i) = 1 if y_i = ARRIVE.CITY and x_{i-1} … x_i contains "arrive" or "to", and 0 otherwise.

Possibility of having features from long-term dependencies.

Results for LUNA from Riccardi, Raymond, Ney and Hahn.
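The linear-chain CRF definition above can be sketched directly, with Z(x) computed by brute-force enumeration (feasible only at toy sizes). The feature below is a hypothetical stand-in for the slide's ARRIVE.CITY example, and the weight is invented.

```python
import itertools
import math

def crf_prob(y, x, labels, feats, weights):
    """p(y | x) for a linear-chain CRF; Z(x) by brute-force enumeration
    over all label sequences (exponential in len(x) -- toy sizes only)."""
    def score(seq):
        return sum(w * f(seq[i - 1] if i > 0 else None, seq[i], x, i)
                   for i in range(len(seq))
                   for f, w in zip(feats, weights))
    Z = sum(math.exp(score(seq))
            for seq in itertools.product(labels, repeat=len(x)))
    return math.exp(score(tuple(y))) / Z

def f_city(prev, cur, x, i):
    # Hypothetical feature: fires when the current word is capitalized
    # and tagged CITY (cf. the ARRIVE.CITY feature on the slide).
    return 1.0 if cur == "CITY" and x[i][0].isupper() else 0.0

x = ("to", "Boston")
labels = ("O", "CITY")
p_city = crf_prob(("O", "CITY"), x, labels, [f_city], [2.0])
p_none = crf_prob(("O", "O"), x, labels, [f_city], [2.0])
```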


Results

FRENCH MEDIA corpus: 3000 test dialog turns, telephone speech, 2K vocabulary, WER 27.4%.

model           attribute name        attribute name & value
                text      speech      text      speech
CRF             11.5      24.3        14.6      28.8
FST             14.1      27.5        16.6      31.3
DBN             15.6      29.0        18.3      33.0
LLPos           14.7      27.5        17.7      32.1
SVM             16.5      29.2        19.2      33.1
SMT             15.2      28.7        23.4      34.9
weighted ROVER  10.7      23.8        13.0      27.9


Computational semantics

Epistemology, the science of knowledge, considers a datum as the basic unit.

Semantics deals with the organization of meanings and the relations between sensory signs or symbols and what they denote or mean.

Computer epistemology deals with observable facts and their representation in a computer.

Natural language interpretation by computers performs a conceptualization of the world using computational processes for composing a meaning representation structure from available signs and their features.


Frames as computational structures (intension)

{address
  loc TOWN (facet) … attached procedures
  area DEPARTMENT OR PROVINCE OR STATE … attached procedures
  country NATION … attached procedures
  street NUMBER AND NAME … attached procedures
  zip ORDINAL NUMBER … attached procedures }

A frame scheme with defining properties represents types of conceptual structures (intension) as well as instances of them (extension). Relations with signs can be established by attached procedures (S. Young et al., 1989).


Frame instances (extension)

A convenient way for asserting properties, and reasoning about semantic knowledge is to represent it as a set of logic formulas.

A frame instance (extension) can be obtained from predicates that are related and composed into a computational structure.

Frame schemata can be derived from knowledge obtained by applying semantic theories.

Interesting theories can be found, for example, in (Jackendoff, 1990, 2002) or in (Brachman, 1978, reviewed by Woods, 1985).

Φ(x) = instance_of(x, address) ∧ loc(x, Avignon) ∧ area(x, Vaucluse) ∧ country(x, France) ∧ street(x, 1 avenue Pascal) ∧ zip(x, 84000)


Frame instance

{a0001
  instance_of address
  loc Avignon
  area Vaucluse
  country France
  street 1, avenue Pascal
  zip 84000}

Schemata contain collections of properties and values expressing relations. A property or a role is represented by a slot filled by a value.


Composition process 1

Step 1 – concept tags -> frame instance fragments

Concept tag : hotel-parking,

Fragment : HOTEL.[hotel_facilities FACILITY.[facility_type parking]]


Composition process 2

Step 2 – composition by fusion

HOTEL.[luxury four_stars]

HOTEL.[h_facility FACILITY.[hotel_service tennis]]

Fusion rule

Result: HOTEL.[luxury four_stars, h_facility FACILITY.[hotel_service tennis]]

[Formula] The fusion rule merges two fragments F_{q,d} and F_{q,e} that share the same head: their slot lists are fused, and the support of the result is computed from the supports of the fused slots and head values.
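The fusion step can be sketched as a recursive slot merge between two fragments with the same head frame. The tuple encoding of frames below is an illustrative convention of this sketch, not the LUNA representation; support scores are omitted.

```python
def fuse(frag_a, frag_b):
    """Fuse two frame fragments that share the same head frame by
    merging their slot/value dictionaries, recursing into nested frames.

    A fragment is encoded as (head, {slot: value}); a value may itself
    be another fragment tuple.
    """
    head_a, slots_a = frag_a
    head_b, slots_b = frag_b
    if head_a != head_b:
        raise ValueError("fusion requires a shared head frame")
    merged = dict(slots_a)
    for slot, val in slots_b.items():
        if slot in merged and isinstance(merged[slot], tuple) and isinstance(val, tuple):
            merged[slot] = fuse(merged[slot], val)  # nested frames fuse too
        else:
            merged[slot] = val
    return (head_a, merged)

# The slide's example: two HOTEL fragments fused into one.
frag_a = ("HOTEL", {"luxury": "four_stars"})
frag_b = ("HOTEL", {"h_facility": ("FACILITY", {"hotel_service": "tennis"})})
fused = fuse(frag_a, frag_b)
```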


Composition process 3

Step 3 – Composition by attachment

[Formula] An attachment rule links a new fragment B_{q,z} to a head fragment that contains a compatible slot, with the support of the result computed from the supports of the linked fragments.

new fragment ADDRESS.[adr_city Lyon]

Composition result:

HOTEL. [luxury four_stars,

h_facility FACILITY.[hotel_service tennis],

at_loc ADDRESS.[adr_city Lyon]]


Semantic composition results

French corpus   F-measure, frame heads     F-measure, frame composition
                text      speech           text      speech
rules           98.3      77.0             85.3      66.9
CRF + rules     84.0      77.8             74.2      68.2
All DBN         76.11     70.0             56.2      54.0


Semantic composition CRF + rules

Metrics Precision Recall F1

1. Frame Recognition 78.37 77.28 77.82

2. Argument Boundary Detection 87.21 76.99 81.78

3. Argument Labeling 76.78 72.05 74.34

4. Argument Recognition 70.55 66.20 68.31

5. Frame Realization 59.87 59.03 59.45

6. Frame Composition 73.56 63.73 68.29

7. Turn Analysis 46.77 45.98 46.37


Demo

LUNAVIZ

OK 340 1


CONFIDENCE AND LEARNING


Confidence

Evaluate the confidence of components and compositions with P(Γ | Y, conf), where Γ is an interpretation hypothesis and conf represents the confidence indicators or a function of them.

Notice that it is difficult to compare competing interpretation hypotheses based on the probability P(Γ | Y), where Y is a time sequence of acoustic features, because different semantic constituents may have been hypothesized on different time segments of the stream Y.


Probabilistic frame based systems

In probabilistic frame-based systems (Koller, 1998), a frame slot S of a frame F is associated with a facet Q with value Y: Q(F, S, Y).

A probability model is part of a facet as it represents a restriction on the values Y.

It is possible to have a probability model for a slot value which depends on a slot chain.

It is also possible to inherit probability models from classes to subclasses, to use probability models in multiple instances and to have probability distributions representing structural uncertainty about a set of entities.


Dependency graph with cycles

[Diagram] Acoustic evidence Y_k supports word hypotheses W_k, which support concepts C_k and filled slots (i, j, k); these support links may form cycles.

If the dependency graph has cycles, possible worlds can be considered. The computation of probabilities of possible worlds is discussed in (Nilsson, 1986). A general method for computing probabilities of possible worlds based on Markov logic networks (MLN) is proposed in (Richardson, 2006).


Probabilistic models of relational data

Relational Markov Networks (RMN) are a generalization of CRFs that allow for collective classification of a set of related entities by integrating information from features of individual entities as well as relations between them (Taskar et al., 2002).

Methods for probabilistic logic learning are reviewed in (De Raedt, 2003).


Define confidence-related situations

Consensus among classifiers and SFSTs is used to produce confidence indicators in a sequential interpretation strategy (Raymond et al., 2005, 2007). The classifiers used are SCT, SVM and AdaBoost. Committee-based active learning uses multiple classifiers to select samples (Seung et al., 1992).

[Diagram] FSM, SCT, SVM and AdaBoost outputs feed a fusion strategy.


Confidence measures

Two basic steps:

1) generate as many features as possible based on the speech recognition and/or natural language understanding process and

2) estimate correctness probabilities from these features, using a combination model.


Features for confidence

Many features are based on empirical considerations:

semantic weights assigned to words,

uncovered word percentage,

gap number,

slot number,

word, word-pair and word-triplet occurrence counts.


Adaptive Learning in Practice

(Riccardi, Tur and Hakkani-Tür, 2005)


Application with existing technology

[Diagram] The OMS (FT VoiceXML server) sends ASR output over HTTP to an application server; the ASR uses a domain-specific LM. A classification process applies a multi-classifier that outputs Probability(label) for each label and classifier, filtered by a threshold α. An alarm-detection stage applies an AdaBoost classifier that outputs an alarm probability, filtered by a threshold β. Results are returned as HTML through a web server.


Using information theory for system tuning

Usable results have been obtained in French with the opinion analysis system

Selection of the operating point (IEEE TSAP)


Bias estimation and correction

Proportion estimation results before and after correction (INTERSPEECH 2009).

[Figure] Observed proportion vs. true proportion.


Architecture of the FT automatic survey platform

[Diagram] IVR Survey Framework: a survey-definition interface produces survey-definition data for the survey platform, which comprises the OMS (FT VXML server) with ASR/TTS, an application server and Disserto (the FT dialogue manager). Calls come in and results go out; logs feed traffic supervision; a provisioning interface uses customer profiles, sampling models and record models; results are inspected through a results view.


Synopsis of the two applications related to opinion analysis

[Diagram] Two applications share an opinion-detection front end (segmentation followed by interpretation). For monitoring, a validation step produces daily statistics (0-100%) of positive and negative opinions on courtesy, efficiency and rapidity. For alarm detection, an emergency-estimation step produces a sorted list of detected alarms.


The LUNA MEDIA SYSTEM

[Diagram] The LUNA Talk server (ASR, TTS) communicates with a client hosting the SLU component, the dialogue manager (DM) and a Short-Term Memory (STM).


Dialogue systems

[Diagram] Left: given the user utterance u_t, acoustic evidence Y_t, system action a_t and previous state S_{t-1}, the SLU system and a next-state estimator produce an estimated dialogue state S̃_t for the dialogue manager (with memory): the action depends on the dialogue state. Right: a belief estimator produces a belief b̃ instead: the action depends on the belief.


Conclusions

A modular SLU architecture can exploit the benefits of combined use of CRFs, classifiers and stochastic FSMs, which are approximations of more complex grammars.

Grammars should perhaps be used in conjunction with processes having inference capabilities.

Recent results and applications of probabilistic logic appear interesting, but its effective use for SLU still has to be demonstrated.

Annotating corpora for these tasks is time-consuming, suggesting a combination of knowledge acquired by machine-learning procedures and human knowledge.


Conclusions

Robustness, incremental learning, portability are important and open issues.

SLU is not only used in human-machine dialogs. Other applications include opinion analysis, indexing, summarization and retrieval.

Some applications can tolerate a certain amount of ASR imprecision.


THANK YOU


Dialogue system requirements

Negotiation ability

user satisfaction , new requests

Context interpretation

anaphora, ellipses, disfluencies, synchronization

Interaction flexibility

focus switch

Cooperative reactions

completion, corrective, conditional, intentional answers

Adequacy of response style


Communicative acts

Cooperation inspires communication, which should be based on rationality.

Communication is performed by agents that are in a mental state.

An act depends on mental states and has the effect of modifying them


Reasoning and decision

[Diagram] Beliefs, goals, dialog history and concepts feed an inference process; a decision step then selects actions.


Types of dialogues

– collaborative dialogues: contrary beliefs, shared goals
– conflictive dialogues: contrary beliefs, contrary goals
– cooperative dialogues (Grice, 1975): shared beliefs, shared goals


models for decision making

– precondition-action rules
– finite-state diagrams; stochastic finite-state automata
– formal grammars; stochastic CFGs
– logic inference: predicate logics, Markov logic networks, modal logics
– overall reward maximization: (partially observable) Markov decision processes


State dependent or belief dependent decision

[Diagram] Left: given the user utterance u_t, acoustic evidence Y_t, system action a_t and previous state S_{t-1}, the SLU system and a next-state estimator produce an estimated dialogue state S̃_t for the dialogue manager (with memory): the action depends on the dialogue state. Right: a belief estimator produces a belief b̃ instead: the action depends on the belief.


Sub-optimal sequential solution

The optimal action would be

A*_u = argmax_{A_u} P(A_u | Y, B) = argmax_{A_u} P(Y | A_u, B) P(A_u | B)

with

P(Y | A_u, B) = Σ_W P(Y, W | A_u, B) ≈ max_W P(Y | W, A_u) P(W | A_u, B)

A sub-optimal sequential solution first searches for

W* = argmax_W P(Y | W) P(W | B)

and then selects

A*_u = argmax_{A_u} P(A_u | W*, B)


Finite-state diagrams

Structural approaches, based on finite-state models, assume that dialog coherence resides in its structure based on descriptions of observed sequencing of utterances, not on explanations of what is observed.

Although effective for some practical systems, the approach is rigid and may severely limit user satisfaction.

While conviviality is a major requirement for system design, an efficient implementation is an essential condition for achieving user satisfaction.

There are specific tasks in dialog that can be achieved with simple and effective models based on finite-state diagrams.


Finite-state diagrams

[Diagram] Each state machine SM i applies an interpretation INT i = ASU(LM i, UM i); conditions CONDITION 1-21 … CONDITION 1-2n select the successor machine among SM 21 … SM 2n.


Finite-state diagrams

Next state depends only on actual interpretation of the input. Results of inference are pre-computed and represented by

(state, input) → next_state
(actual_state) → action
rules.

Finite-state models can be used in more general frameworks in which states represent tasks to be accomplished.

Each task can then be accomplished by models based on different paradigms (Heeman et al., 1998).

This simplifies system design imposing a certain degree of system-initiative without preventing mixed-initiative.

The grounding process ensures that what has been said is mutually understood; it is explicitly achieved by asking the user to verify recognition results before leaving a dialog state.
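The pre-computed rule tables can be sketched as plain lookup dictionaries; the states, inputs and prompts below are a hypothetical miniature, including a grounding confirmation before leaving the state.

```python
# Hypothetical rule tables for a tiny system-initiative dialog:
# (state, interpretation) -> next_state, and state -> action.
NEXT = {("ask_city", "city_given"): "confirm_city",
        ("confirm_city", "yes"): "done",
        ("confirm_city", "no"): "ask_city"}

ACTION = {"ask_city": "prompt: which city?",
          "confirm_city": "prompt: did you say {city}?",  # grounding step
          "done": "route call"}

def step(state, interpretation):
    """Apply the transition rules; stay in place when no rule matches
    (e.g. a rejected or out-of-domain input)."""
    nxt = NEXT.get((state, interpretation), state)
    return nxt, ACTION[nxt]
```

A full system would attach ASU models per state, but the control flow stays this simple table lookup.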


Example : The FT 3000 system

• Service
  – obtain information about FT services: purchase almost 30 different services
  – access account: check consumption, pay bills, call forwarding, voice messaging
• Deployed since October 2005
• Corpus collected daily
• Dealing with different kinds of speech
  – speech / non-speech
  – out-of-domain speech / in-domain speech
  – comments from the users: speech with valid / invalid content
• Experienced vs. new users


FT 3000 Voice Agency service

Thousands of dialogues are automatically processed every day and the system logs are stored.

Speakers are not professional ones and are not biased by the acquisition protocols.

The complexity of the services widely deployed is necessarily limited in order to guarantee robustness with a high automation rate.

The semantic model of such deployed system is task-oriented.


Dialogue types

Transit dialogues: routing users to specific vocal services to activate already-purchased services.

These dialogues represent 80% of the calls to the FT3000 service.

Other dialogues: those in which users ask about a service to which they may wish to subscribe.

For these dialogues, the average utterance length is higher, as well as the average number of dialogue turns.


From concept lattice to multiple active states

[Diagram] Sequence of utterances → word strings → basic concept tag strings → sequence of interpretations → sequence of dialog states.

Objective: obtain the best sequence of states S over a sequence of spoken utterances represented by Y, computed recursively by the SLU component.


Strategy 1: sequential

• Best sequence of words

• Best sequence of concepts

• Best interpretation

Probability of a dialog state S at turn k: computed from the 1-best word string, 1-best concept string and 1-best interpretation.


Strategy 2: integrated search

• Best sequence of concepts+words+interpretation

For estimating the probability of a dialog state and an interpretation: joint search for the best sequence of words and concepts

word lattice → concept lattice → interpretation lattice


Strategy 3: integrated search + dialog context

• Bigram Language Model in the dialog states:

word lattice → concept lattice → interpretation lattice → dialog state lattice


Implementation with a Finite State Machine approach

• Language Models + Local grammars + rules

– AT&T FSM + GRM Libraries

– word level: word lattice, 3-gram language model
– concept level: HMM tagger
– interpretation level: interpretation rules
– dialog-state level: dialog manager rules, 2-gram state model


Corpus used for the experiments

• Corpus (manually labelled data)
  – Training: 44K utterances for the word and concept LMs; 7.4K dialogues for the dialog-state LM
  – Test: 816 dialogues / 1950 utterances
• Two types of dialogues
  – "Transit": request for a specific task and rerouting towards the corresponding service
  – "Other": request for information or a new service purchase; longer dialogs, more comments, new users


Evaluation criteria

The results are evaluated according to 3 criteria:

• the Word Error Rate (WER), the Concept Error Rate (CER) and the Interpretation Error Rate (IER).

• the CER is related to the correct translation of an utterance into a string of basic concept tags.

• the IER is related to the global interpretation of an utterance in the context of the dialogue service considered. Therefore, this last measure is the most significant one, as it is directly linked to the performance of the dialogue system.


Results with the OOD Language Model

• OOD LM is very useful on the other dialogues, which contain most of the comments from the users

• No negative impact on the transit dialogues

• IER = Interpretation Error Rate (at the turn level)
• correct = all the attribute/value components in a frame must be correct
• Strategy 1

Example: {Gest(Resilier,Ambi(AtoutsPlus,HeureLocale,ForfaitLocal))}


Results according to the SLU strategy

• Strat1=sequential• Strat2=integrated• Strat3=integrated + dialog context

• WER = Word Error Rate• CER = Concept Error Rate• IER = Interpretation Error Rate

• Some gains obtained by using an integrated search (strat2)
  – higher correct values, but too many insertions (false acceptances)
• No improvement observed by using dialog context
  – very short dialogues (1.5 turns for Transit, 3.7 for Other)


Oracle measures

• Oracles in– Word lattice (WER)– Concept lattice (CER)– Interpretation lattice (IER)

– utterances detected as OOD were discarded

• IER obtained on these Oracle hypotheses– IER on the Oracle 1-best word string = 9.8%– IER on the Oracle 1-best concept string = 7.5%– IER on the Oracle 1-best interpretation = 4.4%

Joint search and optimization on the word/concept/interpretation levels


SLU decision process

• Decision process based on multiple hypotheses output• Example

– Agreement at the different levels for detecting "expected" dialogs
– Can be used to detect problematic dialogues


Plan-based approaches

communicating is acting (Austin, 1962)

An utterance is also a realization of observable communicative actions: speech acts or dialog acts such as

inform, request, confirm, commit.

Communicative actions can be planned and recognized on the basis of mental states.

Dialog analysis is considered in the framework of explaining actions based on mental states, relying on general models of actions and mental attitudes.


Activities

• identify the user goals or intentions (e.g. task goals, communicative goals such as adding information or corrections); a goal reasoner infers goals or intentions from interpretations or inferred dialog acts and should respect an interaction model

• identify and update beliefs

Cooperation, rationality and other model properties are represented by formulae which are used by an inference engine

Result : Dialogue state

• make decisions Result : actions

Optimal (minimum risk, maximum reward) decision strategies


Rational interaction to recognize goals and beliefs

Rational interaction is based on the assumption that an intelligent system has a communication ability, grounded on a general competence, that characterizes rational behavior (Sadek)

A proposition is a belief of an agent if the agent considers that the proposition is true.

Belief is the mental attitude whereby an agent has a model of the world.

Reasoning involves belief reconstruction and revision

Uncertainty is a way of representing an approximate perception of the world and expresses the fact that the agent believes that a proposition is not true, but it is more likely than its negation.


Rational interaction

Formulae of the type B(i,p), U(i,p) and C(i,p) can respectively be read as: "i believes that p is true", "i is uncertain about the truth of p", and "i desires that p is currently true".

The logical model for operator B accounts for interesting properties of a rational agent, such as consistency of beliefs, and introspection capacity, formally characterized by the validity of logical schema such as:

B(i, φ) ⇒ B(i, B(i, φ))
¬B(i, φ) ⇒ B(i, ¬B(i, φ))
B(i, φ) ∧ B(i, φ ⇒ ψ) ⇒ B(i, ψ)


Rational interaction

• an agent cannot be uncertain of its mental attitudes
• an agent chooses the logical consequences of its choices
• an agent must choose the courses of events it believes it is already in
• an agent cannot bring about a situation if it believes that the situation already holds
• agents are in perfect agreement with themselves about their own mental attitudes
• agents may also decide actions based on reasoning


Probability distribution of user goals

Recent systems accommodate ASR and interpretation errors by maintaining beliefs about what the user has said, rather than accepting the output from the recogniser.

Solutions:

• models that update a compressed representation of belief using the most likely last user utterance (Rudnicky, CMU)

• models that use the full set of possible user utterances to update a distribution over user goals (Young, Williams, Thompson, Cambridge Univ.)


dialog policy

The dialog policy chooses a system action based on the current dialog state:

A_t = π(S_t)

State transitions are characterized by the following probability distribution:

P(S_t+1 | S_t, A_t)
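The policy A_t = π(S_t) and the transition distribution P(S_t+1 | S_t, A_t) can be sketched on a toy dialog problem as follows (state and action names here are invented for illustration, not taken from the lecture):

```python
import random

# Toy dialog MDP: a deterministic policy maps each state to an action,
# and the next state is sampled from P(S_t+1 | S_t, A_t).
TRANSITIONS = {
    ("ask_date", "confirm"):  {"date_known": 0.8, "ask_date": 0.2},
    ("ask_date", "reprompt"): {"date_known": 0.5, "ask_date": 0.5},
}

def policy(state):
    """A deterministic policy A_t = pi(S_t)."""
    return "confirm"

def step(state):
    """Apply the policy, then sample S_t+1 from P(S_t+1 | S_t, A_t)."""
    action = policy(state)
    dist = TRANSITIONS[(state, action)]
    next_state = random.choices(list(dist), weights=list(dist.values()))[0]
    return action, next_state
```

Each row of `TRANSITIONS` must sum to one, since it is a conditional distribution over next states.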


dialog policy

The combination of action and state at time t determines the expected immediate reward:

r_t+1 = r(S_t, A_t)

The goal is to find a policy which maximizes the total reward:

R = Σ_{t=1..T} r_t+1


dialog policy

A small negative reward is given at each dialog turn; a large positive reward is given on completing successfully, and a corresponding negative reward on completing unsuccessfully. All solution methods of reinforcement learning depend on finding the value function for any state:

V(S) = E(R | S_t = S)

Given a state, there may be different paths to completion, depending in part on the unpredictability of the user; each of these paths corresponds to a reward, and the expected value of these rewards can be computed.
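One way to estimate the value function V(S) = E(R | S_t = S) is Monte Carlo simulation: roll out many dialogs from S under the policy and average the total rewards. A minimal sketch (all names and reward values are illustrative):

```python
import random

def simulate_episode(state, policy, transitions, rewards, goals, max_turns=50):
    """Return the total reward R of one simulated dialog starting in `state`."""
    total = 0.0
    for _ in range(max_turns):
        if state in goals:          # dialog completed
            break
        action = policy(state)
        total += rewards[(state, action)]
        dist = transitions[(state, action)]
        state = random.choices(list(dist), weights=list(dist.values()))[0]
    return total

def value_estimate(state, policy, transitions, rewards, goals, n=1000):
    """Monte Carlo estimate of V(S) = E(R | S_t = S)."""
    return sum(simulate_episode(state, policy, transitions, rewards, goals)
               for _ in range(n)) / n
```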


dialog policy

A more useful function is:

Q(S, A) = E(R | S_t = S, A_t = A)

from which the optimal policy at state S can be determined as follows:

π'(S) = argmax_A Q(S, A)

Q satisfies the recursion:

Q(S, A) = r(S, A) + Σ_{S'} P(S' | S, A) max_{A'} Q(S', A')
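The optimal Q function can be computed by iterating the recursion Q(S, A) = r(S, A) + Σ_{S'} P(S' | S, A) max_{A'} Q(S', A') to a fixed point, then reading off the greedy policy. A minimal sketch on a toy two-state dialog problem (the state, action names and rewards are invented for illustration):

```python
def q_iteration(states, actions, P, r, sweeps=100):
    """Iterate Q(S,A) = r(S,A) + sum_{S'} P(S'|S,A) * max_{A'} Q(S',A')."""
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(sweeps):
        V = {s: max(Q[(s, a)] for a in actions) for s in states}  # max_{A'} Q(S',A')
        Q = {(s, a): r[(s, a)] + sum(p * V[s2] for s2, p in P[(s, a)].items())
             for s in states for a in actions}
    return Q

def greedy_policy(Q, states, actions):
    """pi'(S) = argmax_A Q(S,A)."""
    return {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}
```

With an absorbing terminal state of zero reward, the iteration converges geometrically even without discounting, matching the episodic setting of the slides.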


Partially Observable MDP

In most practical spoken dialog systems, the state space and the set of possible user dialog acts are very large. As a result, belief monitoring is computationally very expensive and direct policy optimisation is intractable. The user's state cannot be observed directly and must be inferred.

Thus a dialog system must base its strategy on incomplete data and has to use a Partially Observable MDP (POMDP).

Dialog states are summarized in system beliefs (a manageable finite set)

System beliefs are obtained by using an observation vector O which encapsulates the evidence needed to update and maintain them.


POMDP

Modelling a dialog system as a POMDP involves maintaining a probability distribution over all possible machine states.

When performing policy optimisation, however, the subsequent machine action is likely to focus on just the most likely states. This suggests maintaining two coupled state spaces:

the full space, called the master state space, and

a much simpler space, called the summary state space. The summary state space consists of the top N user goal states from the master space.
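The reduction from master space to summary space can be sketched as keeping the N most probable user-goal states of the belief and lumping the remaining probability mass together (a minimal illustration of the idea, not the actual HIS partitioning mechanism):

```python
import heapq

def summary_space(belief, n=2):
    """Keep the top-N most probable states; lump the rest into leftover mass."""
    top = heapq.nlargest(n, belief.items(), key=lambda kv: kv[1])
    rest = 1.0 - sum(p for _, p in top)
    return dict(top), rest
```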


Hidden Information State (HIS)

The HIS model groups together user goals into partitions which are indistinguishable given the past user utterances.

In order to reduce complexity, a Bayesian Network is used to represent the state of the POMDP, where the network is factored according to the relations of the system's relational database (Thompson and Young, ICASSP 2008).

State S : (user_goal g, user act u, history summary h)

States are hidden and have to be inferred


Belief

belief b(S(t)) = k · P(Y(t) | S(t)) · Σ_{S(t-1)} P(S(t) | S(t-1), a(t-1)) · b(S(t-1))

where S(t) = (g(t), u(t), h(t))

System actions depend on the distribution of beliefs among states.

Belief is computed for each state at each turn t by assuming only certain dependences among slots.

Probabilities are computed by representing sequences of states with Dynamic Bayesian Networks (DBNs).
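One turn of the belief update b(S(t)) = k · P(Y(t) | S(t)) · Σ_{S(t-1)} P(S(t) | S(t-1), a(t-1)) · b(S(t-1)) can be sketched as follows (a toy flat-state version for illustration, not the factored DBN of the LUNA system):

```python
def update_belief(b_prev, action, transition, obs_likelihood):
    """Predict with the transition model, weight by the observation, normalize."""
    unnorm = {}
    for s in obs_likelihood:                       # candidate states S(t)
        prior = sum(transition[(s_prev, action)].get(s, 0.0) * p
                    for s_prev, p in b_prev.items())
        unnorm[s] = obs_likelihood[s] * prior      # P(Y(t) | S(t)) * predicted prior
    k = 1.0 / sum(unnorm.values())                 # normalizing constant k
    return {s: k * v for s, v in unnorm.items()}
```

With a uniform prior and an identity transition model, the updated belief simply follows the observation likelihood, which is a useful sanity check.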


DBN state

[Diagram: state S(t) is composed of a goal(t) node with goal-attribute1 and goal-attribute2, a user(t) node with user-attribute1 and user-attribute2, and a history(t) node with h-attribute1 and h-attribute2]


DBN dependencies

[Diagram: the same network with dependency arcs added among the goal(t), user(t) and history(t) nodes and their attributes]


DBN system action

[Diagram: the previous network with the system action A added as a node influencing state S(t)]


DBN observation

[Diagram: the previous network with the observation Y(t) added as a node generated by state S(t)]


DBN

[Diagram: the full two-slice DBN: states S(t-1) and S(t), each composed of goal, user and history nodes with their attributes, with system actions A and observations Y(t-1) and Y(t) linking the slices]


Policy

Grid-based approximations to a summary state containing the probability of each slot require too many grid points and are ill-suited to the task. Instead, some form of tiling or more general function approximation must be used.

An algorithm that performs well within the proposed framework is episodic Natural Actor Critic.

For each action a, a summary function φa(b) is defined. It maps belief distributions b to a summary vector of features that the system designer believes are important in choosing between actions.
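A minimal illustration of such a summary function φa(b); the particular features below (top-hypothesis probability, margin over the runner-up, an action-type indicator) are invented for the example and are not those of the cited work:

```python
def phi(belief, action):
    """Map a belief distribution and an action to a small feature vector."""
    probs = sorted(belief.values(), reverse=True)
    top = probs[0]
    second = probs[1] if len(probs) > 1 else 0.0
    return [top, top - second, 1.0 if action == "confirm" else 0.0]
```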


Multimodal dialog system

[Diagram: a Dialog Engine with a sub-system manager and an event handler and synchronizer, connected to I/O systems #1 and #2 and to service systems #1 and #2]


THANK YOU