RESEARCH REPORTS

JUPITER Data Collection and Analysis
Joseph Polifroni, James Glass and Sally Lee

We have been actively collecting data within

the JUPITER domain since the beginning of

1997. As was done in previous domains, we

first developed a prototype JUPITER system

and used it to collect spontaneous speech

using a Wizard paradigm, with a human

typist in the loop and subjects brought into

the lab and given scenarios to solve. At the

same time, we solicited read speech using

both our Web-based data collection facility

and a phone number that subjects could call

to read from pre-distributed lists.

Once these data had been collected, we

were able to train a recognizer and move on

to system-based data collection. We cur-

rently have a toll-free number that is

available 24 hours a day for subjects to call to
find out weather information.¹ The

utterances collected from this facility are

also used as training data. The toll-free

number has been a particularly powerful

method for collecting data from a variety of

subjects in a short period of time. We feel

that these calls accurately reflect the way

users want to interact with such systems.

¹ Dissemination of the toll-free number has been done exclusively by word of mouth and the SLS Group's homepage.

Data Collection and Transcription
In recent months, we have been receiving

approximately 17 calls per day. In order to

ensure that these data are ready to use as

quickly as possible for both speech recogni-

tion and natural language development, we

have been processing incoming data on a

daily basis. Every morning a script automati-

cally sends email containing a list of the

previous day's calls. These calls are then

manually transcribed, usually over the

course of the following day. The transcribed

calls are then bundled into sets containing

approximately 500 utterances and are added

to the training corpus as they become

available.

Over the past year, we have continued

to refine our transcription tool which was

originally developed for orthographic

transcription of read speech. We used a Tcl/

Tk interface to provide an editable window

where the transcriber listens to utterances,
corrects existing transcriptions, and adds
specialized markings for noise, partial words,
etc. The initial transcription for each

utterance was the orthography hypothesized

by the recognizer during the call. The

transcriber could also identify the talker as

male, female, or child using the transcrip-

tion tool.

The transcription tool uses a lexicon to

check the quality of orthographic transcrip-

tions. This feature is useful for finding

typographical errors before they are saved. If

a word does not appear in the lexicon, the

transcriber is warned and given the option

of either adding the word to the lexicon or

changing it in the orthography file. The use

of the lexicon also allowed us to monitor

the growth of new words in the domain.
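The lexicon check can be pictured with a small sketch like the following (a hypothetical Python rendering of the behavior described above; the real tool was written in Tcl/Tk, and the file name and prompts here are illustrative):

def load_lexicon(path="lexicon.txt"):
    # One known word per line; the path is a hypothetical placeholder.
    with open(path) as f:
        return set(line.strip().lower() for line in f if line.strip())

def unknown_words(orthography, lexicon):
    # Return the words in a transcription that do not appear in the lexicon.
    return [w for w in orthography.split() if w.lower() not in lexicon]

def review_utterance(orthography, lexicon):
    # Warn on each unknown word; the transcriber either adds it to the
    # lexicon (a genuinely new domain word) or corrects the orthography.
    tokens = orthography.split()
    for w in unknown_words(orthography, lexicon):
        choice = input(f"'{w}' is not in the lexicon: [a]dd or [c]orrect? ")
        if choice.lower().startswith("a"):
            lexicon.add(w.lower())
        else:
            fixed = input(f"Replace '{w}' with: ")
            tokens = [fixed if t == w else t for t in tokens]
    return " ".join(tokens)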

Corpus Analysis
A breakdown of the current utterances in

our JUPITER corpus is shown in Table 1. The

live data were collected between May 1997

and July 1998. The read speech data

collection contained a list of 50 utterances

which users were asked to read. The wizard

data collection effort involved subjects

attempting to solve several different

scenarios and thus produced a considerable

number of utterances per user. Note that

the wizard data was the most time consum-
ing to collect. In contrast, the live data

collected via the toll-free number tends to

have many fewer utterances per call, as users

are typically interested in weather informa-

tion for one particular location. In addition,

the queries by real users tend to be some-

what shorter, although this statistic is

probably skewed by utterances which do not

contain an actual query due to a false

triggering of the endpoint detector.

A breakdown of the live data shows that

just over 80% of users are male speakers,

with females comprising approximately 18%

of the data, and children the remainder.

A portion of the data was from non-native

speakers, although the system performed

adequately on speakers whose dialect or

accent did not differ too much from general

American English. Callers with strong

accents, however, have been thus far

excluded from training or test sets for both

the speech recognition and natural language

components. These data constituted

approximately 7% of the calls and 14% of

the data, and will be useful for future study.

A very small fraction (0.1%) of the utter-

ances included talkers speaking in a foreign

language (e.g., Spanish, French, German,

and Chinese).

The signal quality of the data varied

substantially depending on the handset, line

conditions, and background noise. Al-

though clearly an underestimate, speaker-
phones were used in approximately
5% of the calls, as judged by the presence of
multiple talkers in an utterance. Only a
small fraction of the data (0.5%) was
estimated to be from cellular phones.

Over 11% of the data contained

significant noises. About half of this noise

was due to cross-talk from other speakers,

while the other half was due to non-speech

noises. The most common identifiable non-

speech noise was due to the user hanging up

the phone at the end of a recording (e.g.,

after saying good bye). Other distinguish-

able sources of noise included (in descend-

ing order of occurrence) TV, music, phone
rings, touch tones, etc. These data were also

excluded from training and testing of the

speech recognition component, although

cleaned up orthographies were used for

natural language training.

There were a number of spontaneous

speech effects present in the recorded data.

Over 6% of the data included filled pauses

(e.g., uh, um, etc.) which were explicitly

modeled as words in the recognizer, since

they had consistent pronunciations, and

occurred in predictable places in utterances.

Filled pauses were removed from the

sentence hypotheses sent to the natural

language component. Utterances containing

partial words constitute another 6% of the

data, although approximately two thirds of

these were due to clipping at the beginning

or end of an utterance. The remaining

artifacts were contained in less than 2% of

the data and included phenomena such as
(in descending order of occurrence) laughter,
throat clearing, mumbling, shouting,
coughing, breathing, sighing, sneezing, etc.
The latter data and all clipped data were
also excluded from training or testing of the
speech recognition component.

Table 1.

Calls    Utts     Type     Av Words/Utt   Utt/Call
13168    74697    live     5.6            5.7
25       956      wizard   7.5            38.2
71       3519     read     9.0            49.6
3772     25175    total    6.1            6.7

The data from foreign speakers,

containing cross-talk or other noises, and

containing clipped speech and other

spontaneous phenomena (excluding filled

pauses), combined to account for approxi-

mately one quarter of all recorded data. To

date, these data have not been used for

training or testing the speech recognizer,

despite the fact that the system often

performs quite well on these data during an

actual call. The reason we have avoided

these data is the concern that

the recognizer will produce poor alignments

during training which will ultimately

contaminate the acoustic-phonetic models.

We hope to test this hypothesis and

examine these data more closely in the near

future.


Natural Language Processing in the JUPITER Domain
Stephanie Seneff and Joseph Polifroni

Parsing Linguistic Content
Within the JUPITER system, we have begun to

explore the possibility of using natural

language systems to parse textual informa-

tion recovered from various Web sites, in

order to facilitate selection of the appropri-

ate message to speak in response to specific

user requests. Weather forecasts are

obtained from multiple sites on the World

Wide Web, whose information content

complements one another. If the system fully

understands the information it is presenting

to the user, it will be able to select subsets

that are relevant to the question, avoiding

boring the user with long monologues on

irrelevant distractions. It will also be able to

recognize when two sources are providing

overlapping information, and can select one

or the other based on quality consider-

ations. Finally, by representing all linguistic

data in a language-transparent semantic

frame format, the notion of translating the

weather reports into other languages

becomes far more feasible. An example

interaction with JUPITER displaying some of

its dialog capabilities and understanding of

specific weather information is shown in

Figure 2.

Figure 2. Example spoken dialogue between JUPITER and a user.

J: Hi! Welcome to JUPITER, the MIT Lab for Computer Science weather information system. You may hang up or ask for help at any time. How can I help you?

U: CAN YOU TELL ME WHAT'S THE WEATHER LIKE IN BOSTON TODAY

J: In Boston this afternoon. Partly sunny early. Then becoming cloudy with 30 percent chance of showers toward evening. Tonight: rain. Possibly heavy at times. Near 100 percent chance of rain. Can I help you with something else?

U: YES WHAT ABOUT SAN FRANCISCO

J: In San Francisco today: patchy fog near the ocean morning. Otherwise mostly sunny. Tonight: areas of coastal fog and low clouds. Otherwise fair. Is there something else I can do for you?

U: YES HOW COLD IS IT IN DALLAS TEXAS TODAY

J: In Dallas today: high upper 40s. Tonight: lows near 32. What else would you like to know?

U: I'D LIKE TO KNOW WHETHER IT'S SNOWING IN DETROIT

J: I'm sorry, I didn't understand you. Please rephrase your query.

U: WHAT IS THE WEATHER IN DETROIT

J: In Detroit this afternoon: occasional light rain and drizzle. Tonight: rain showers likely. Becoming snow showers by late tonight. 70 percent chance of precipitation.


While most of our sources contain

scripted phrases that are predictable in

format, the National Weather Service

(NWS) provides manually generated

forecasts that are quite rich in linguistic

variation. Some example sentences are

shown in Table 2. Although these data are

challenging from the standpoint of parse

coverage, they give a rich description of the

weather, including very descriptive accounts

of conditions over time and locale, predic-

tions of amounts of precipitation, advisories

for hurricanes, floods, etc. Thus, we believe

it to be worth the effort to process them.

Table 2. Some examples of sentences from the National Weather Service web site.

"Increasing clouds and becoming breezy with showers likely and a chance of thunderstorms by the afternoon"

"Near record low temperatures with a low from upper 20s north portions to near 40 south sections near Del Rio"

"The National Weather Service has continued the excessive heat warning for highly urbanized areas for Friday"

"Some storms may be severe with damaging winds and isolated tornadoes"

After the first three months of develop-

ing a grammar for the NWS data, the

system had achieved a high coverage (over

99%) on incoming weather reports. Any

sentences that fail to parse can be rephrased

by the system developer to retain an

equivalent meaning. The grammar can later

be expanded to encompass a broader base.

This is an ongoing process. By requiring a

full parse and hand-editing sentences that

fail to parse, we can guarantee a high

reliability in the understanding and

regeneration. We feel it is important to

assure, as much as possible, that the system

rarely provides incorrect or garbled informa-

tion, which would be much more likely with

any robust parsing strategies. The current

grammar contains over 1100 unique words,

with about 800 categories, about half of

which are pre-terminal.

JUPITER is updated several times a day, by

polling the Web for any changes in predic-

tions. An automatic procedure parses the

data into semantic frames [1], and a second

process sorts them into categories based on

the meaning. In general, each sentence is

indexed under generic terms such as

“weather,” “temperature,” and “advisory,”

and under more specific terms such as

“rain,” “fog,” “snow,” and “hurricane,” as

well as by date (“today,” “tomorrow,” etc.),

and by location. To retrieve the answer to a

particular user request, the system first

retrieves the indices of the relevant sen-

tences in the weather report via an SQL

query filtering on the particulars of the user

request, and then orders them sequentially.

Finally, each of the semantic frames is

paraphrased in turn, to compose a verbal

response. Delays are minimal, since the

system has preprocessed all current informa-

tion into semantic frames in a local cache.
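A minimal sketch of this indexing and retrieval step, using an in-memory SQLite table (the table layout, column names, and index terms are illustrative assumptions, not the actual JUPITER schema):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE forecast (sent_id INTEGER, location TEXT, date TEXT, topic TEXT)")

# Each parsed sentence is indexed under a location, a date, and one or more topics.
conn.executemany("INSERT INTO forecast VALUES (?, ?, ?, ?)", [
    (1, "boston", "today", "weather"),
    (1, "boston", "today", "rain"),
    (2, "boston", "tonight", "temperature"),
])

def retrieve(location, date, topic):
    # Filter on the particulars of the user request, then order sequentially.
    cur = conn.execute(
        "SELECT DISTINCT sent_id FROM forecast "
        "WHERE location=? AND date=? AND topic=? ORDER BY sent_id",
        (location, date, topic))
    return [row[0] for row in cur]

# e.g. retrieve("boston", "today", "rain") -> [1]; each matching semantic
# frame would then be paraphrased in turn to compose the verbal response.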

We have begun an effort to port the

system to three other languages besides

English: German, Mandarin Chinese, and

Spanish. For delivery of the information in

a foreign language, generation tables are

being prepared that mirror the tables for

English, but specify alternative phrase

ordering rules and vocabulary entries. For

each of these languages, a native speaker

who is fluent in English is preparing the

corresponding GENESIS generation rules [2].

German is a particularly difficult

language due to its extensive use of inflec-

tional endings. In particular, it was neces-

sary to augment GENESIS with a more

sophisticated ability to deal with case, which

can be assigned in the vocabulary file by

prepositions and verbs. In addition, we

needed to be able to specify the correct

inflectional endings for nouns and adjec-

tives as a function of case, gender, and

number.
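A generic illustration of why case must be threaded through generation (the lookup table below is standard German grammar, not the GENESIS vocabulary-file format): even the definite article depends jointly on case and gender before any noun or adjective endings can be chosen.

# Definite article as a function of case and gender (singular), with a
# separate plural column; prepositions and verbs assign the case.
DEF_ARTICLE = {
    ("nominative", "masc"): "der", ("nominative", "fem"): "die",
    ("nominative", "neut"): "das", ("nominative", "plural"): "die",
    ("accusative", "masc"): "den", ("accusative", "fem"): "die",
    ("accusative", "neut"): "das", ("accusative", "plural"): "die",
    ("dative", "masc"): "dem", ("dative", "fem"): "der",
    ("dative", "neut"): "dem", ("dative", "plural"): "den",
}

def article(case, gender):
    return DEF_ARTICLE[(case, gender)]

# "am Morgen" contracts "an dem Morgen": the preposition assigns dative case.
print(article("dative", "masc"))   # -> "dem"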

We “closed the loop” on a complete

Spanish system this year, using acoustic

models from the GALAXIA system, developed

in 1996. Although much work remains in

all aspects of the system, we were pleased to

have a system that would accept Spanish

input and speak a Spanish reply (using the

Spanish version of TruVoice, a software

speech synthesizer from Centigram Corpo-

ration). Parsing of Spanish input was

achieved using TINA, which needed no

modification to system code to produce

semantic frames from Spanish input.

Although Spanish has more complicated

number and gender agreement than

English, GENESIS had no trouble producing

grammatically correct utterances in Spanish.

The additional work that we had done to

accommodate German, with an even more
complex agreement system incorporating
case, meant that the mechanisms in GENESIS

for making such agreement had been

thoroughly exercised.

Figure 3 gives an example of a semantic

frame for the sentence, “2 to 4 inches

snowfall accumulation by morning,” along

with the corresponding paraphrases in three

languages. Note that the preposition “by”

has been interpreted in the semantic frame

as denoting a time expression, allowing the

appropriate translation of this diversely

realized preposition.

Figure 3. Example sentence and semantic frame along with paraphrases in English, German and Spanish.

clause: weather_event
  topic: accumulation
    name: snowfall
    pred: amount
      topic: value, name: 2
      pred: to_value
        topic: value, name: 4, units: inches
    pred: by_time
      pred: time_interval
        topic: time_of_day, name: morning

Input:   2 to 4 inches snowfall accumulation by morning
English: snowfall 2 to 4 inches by morning
German:  Schneefall 2 bis 4 Inch bis am Morgen
Spanish: nevada 2 a 4 pulgadas antes de la mañana


N-best Selection Based on Natural Language Understanding
The recognizer for JUPITER produces an N-

best list of candidate utterances, which are

passed to the natural language system for

final selection. We are exploring new

methods for selection, which include an

efficient strategy that parses a word graph

regenerated from the N-best list, and

alternative robust-parsing strategies,

including an aggressive one that, upon total

parse failure, simply detects dominant

content words in the N-best list and

operates analogously to a word-spotting

algorithm. We are investigating ways to

combine the linguistic and the recognizer

scores, and have also incorporated a

mechanism to influence selection based on

the response to the preceding query. For

instance, when the system reads off a list of

cities in a particular region, it gives priority

to any candidate hypotheses that contain

one of those cities.
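A minimal sketch of this context-dependent selection idea (the additive bonus and the scoring scheme are illustrative assumptions, not the actual JUPITER mechanism):

def rerank_nbest(nbest, focus_cities, bonus=1.0):
    # nbest: list of (hypothesis_text, combined_score) pairs.
    # Hypotheses mentioning a city offered in the previous response get priority.
    def score(item):
        text, base = item
        return base + (bonus if any(c in text.lower() for c in focus_cities) else 0.0)
    return sorted(nbest, key=score, reverse=True)

# e.g. after the system has just listed cities in a region:
nbest = [("what about wister", -12.3), ("what about worcester", -12.8)]
print(rerank_nbest(nbest, {"worcester", "boston"}))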

Dialogue Modelling
We have discovered several interesting issues

with regard to dialogue modelling to

accommodate users’ requests, and we are

becoming increasingly aware of the benefits

in letting real users influence the design of

the interaction. One of the critical aspects

of any conversational interface is the need

to inform the user of the scope of the

system’s knowledge. JUPITER only knows

approximately 500 cities, and users need to

be directed to select relevant available data

when their explicit request yields no

information. Even for the cities it knows,

JUPITER does not necessarily have the same

knowledge for all cities.

JUPITER has a separate geography table

organized hierarchically, enabling users to

ask questions such as “What cities do you

know about in the Caribbean?” This table is

also used to provide a means of summariz-

ing a result that is too lengthy to present

fully. For example, if the user asks where it

will be snowing in the United States, there

may be a long list of cities expecting snow.

The system then climbs a geographical

hierarchy until the list is reduced to a

readable size. For example, JUPITER might list

the states where it is snowing, or it might be

required to reduce the set even further to

broad regions such as “New England.”
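The summarization step might be sketched as follows; the parent table and the readability threshold are hypothetical stand-ins for JUPITER's actual geography table:

def summarize(cities, parent_of, max_items=5):
    # Climb the hierarchy (city -> state -> region) until the list is readable.
    places = set(cities)
    while len(places) > max_items:
        parents = {parent_of.get(p) for p in places} - {None}
        if not parents or parents == places:
            break   # nothing further to generalize to
        places = parents
    return sorted(places)

parent_of = {"boston": "massachusetts", "hartford": "connecticut",
             "massachusetts": "new england", "connecticut": "new england"}
print(summarize(["boston", "hartford"], parent_of, max_items=1))   # ['new england']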

We have implemented a BIANCA-style

control strategy for JUPITER (cf. page 29),

which has made it relatively easy to visualize

the control flow, and to increase the

complexity of the dialogue capabilities. We

are in the process of regularizing the top-

level shell program for JUPITER and PEGASUS

towards a goal of providing a generic

Domain Server shell that can accelerate the

development of any future knowledge

domains.

Evaluation
JUPITER is both an on-line system that

receives an average of 500 calls per month

and a research system that is used as a

testbed in areas such as displayless interac-

tion, virtual browsing, information on

demand, and translingual content manage-

ment. As we have used JUPITER to further

our research agenda, we have become

concerned that we have reliable evaluation

metrics for all JUPITER system components, to

ensure that new releases are adequately

evaluated before becoming part of the on-

line JUPITER system. Each separate compo-


nent of JUPITER is in a state of active develop-

ment; furthermore, the components

interact with each other (e.g., items in focus

are communicated from the JUPITER back-

end to the natural language component for

use in filtering incoming queries). Before

incorporating any changed version of an

individual component into the system, it is

necessary that we measure the effects of the

changes, both in the particular component

being updated and in the entire system. In

addition, to verify that we are making

progress with each update, we need a

consistent metric that can be applied across

time and across system components, in

addition to metrics for each individual

component. Over the past year, we have

refined a methodology for comparing

semantic frames, a meaning representation

produced by the natural language compo-

nent and used by many other system

components.

In order to compare semantic frames,

we must first establish a baseline set of

semantic frames that can be considered

“correct.” To do this, the evaluation

program is run in a mode that creates a

reference file of utterances, with their

associated semantic frame representations

and their system-generated paraphrases.

This reference file can be created with or

without incorporating discourse context,

enabling the evaluation to be conducted

under either condition.

Once a reference has been created,

system modifications can be tested against

this reference by running the evaluation

program in comparison mode. Since we

have adopted a hierarchical structure for the

semantic frame, any comparison must

recurse through both the reference and new

semantic frames and compare the individual

sub-frames that make up each. In addition,

not all of the differences between semantic

frames are critical for proper understanding

(e.g., the difference between an indefinite

and a definite article does not affect queries

to the JUPITER back-end), so the comparison

must know what is critical for understand-

ing and what can be ignored. Because

changes in the rules that govern parsing or

discourse inheritance can have unforeseen

consequences, this program can be used to

flag any changes that were either intentional

or inadvertently introduced as a conse-

quence of system modifications. When a

new version of the system is ready for

release, a new reference set is created for the

same set of test utterances.
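A rough sketch of the recursive comparison, with frames rendered as nested dictionaries; the ignore list (here just the article distinction) stands in for the program's knowledge of which differences are not critical for understanding:

def frames_match(ref, hyp, ignore=("article",)):
    # Recurse through both frames, skipping keys that do not affect understanding.
    if isinstance(ref, dict) and isinstance(hyp, dict):
        keys = (set(ref) | set(hyp)) - set(ignore)
        return all(frames_match(ref.get(k), hyp.get(k), ignore) for k in keys)
    return ref == hyp

ref = {"clause": "verify", "topic": {"name": "weather", "article": "the"}, "city": "boston"}
hyp = {"clause": "verify", "topic": {"name": "weather", "article": "a"}, "city": "boston"}
print(frames_match(ref, hyp))   # True: the frames differ only in a non-critical key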

To detect any newly introduced genera-

tion errors, we maintain a system generated

paraphrase for each frame in the reference

set. The outputs of the newest version of

the generation component can then be

compared against the stored templates, and

a human evaluator is invoked to judge

whether all observed changes are intended.

We have also used the semantic frame

measure to evaluate our N-best selection

procedure, which incorporates both

rejection and robust parsing. We have

explored several different strategies for N-

best selection, and found it productive to

compare the effectiveness of different

strategies by matching semantic frames,

both between two different hypothesized

solutions and between each solution and

the reference transcription. For example, we

have used automatic procedures based on

matching semantic frames as knowledge

sources for refining the parameter selection

process for rejection.

To determine if this automatic meaning

evaluation can be used to test something


that was previously done by human

evaluators, we conducted an experiment in

which we compared the automatic semantic

frame comparison algorithm for under-

standing against performance results

obtained from a subjective evaluation of log

files recorded from the original dialogue

with the user. These two methods appeared

to give very similar ratings, suggesting that

our automatic techniques are valid.

References

[1] S. Seneff, "TINA: A Natural Language System for Spoken Language Applications," Computational Linguistics, vol. 18, no. 1, pp. 61-86, 1992.

[2] J. Glass, J. Polifroni and S. Seneff, "Multilingual Language Generation Across Multiple Domains," Proc. International Conference on Spoken Language Processing, pp. 983-986, Yokohama, Japan, September 1994.


Spontaneous Speech Recognition in the JUPITER Domain
James Glass

The main focus of speech recognition

activity over the past eighteen months has

been on the JUPITER telephone-based

weather information domain. Beginning in

February and March 1997, we created an

initial corpus of approximately 3,600 read

utterances, augmented with over 1,000

utterances collected in a wizard environ-

ment. These data were used to create an

initial version of a recognizer which could

be used to automatically collect data from

users from a toll-free number. This system

was deployed in late April 1997, and has

provided a continuous source of data since

that time. These data have been used to

periodically improve the acoustic and

language models used by the recognizer.

Over the course of telephone data collec-

tion for this year, we have been able to

reduce the word error rate by a factor of

three. The following sections describe the

various components of the recognizer in

more detail.

Vocabulary
The vocabulary used by the JUPITER system

has evolved over the course of the year as

periodic analyses were made of the growing

corpus. By the summer of 1997 the vocabu-

lary had stabilized to 1345 words including

over 500 cities and 150 countries. Table 3

shows a breakdown of the various types of

words in the current vocabulary. Note that

over half of the vocabulary contains

geography related words.

The design of the geography vocabulary

was based on the cities for which we were

able to provide weather information, as well

as commonly asked cities. Other words were

incorporated based on frequency of usage

and whether or not the word could be used

in a query which the natural language

component could understand. The 1345

words had an out-of-vocabulary (OOV) rate

of 3.6% on a 1445 utterance test set. Table 4

shows some of the most frequently occur-

ring OOV words as of the end of 1997,

along with some example usages of the

word. The current vocabulary will be

expanded to incorporate many of these

words in the coming year.

Since the recognizer made use of a

bigram grammar in the forward Viterbi

pass, several multi-word units were incorpo-

rated into the vocabulary to allow for

greater long-distance constraint. An added

benefit of the multi-word units was to allow

for specific pronunciation modelling, such

as the sequence “going to” or “give me”

being produced as “gonna” or “gimme”

respectively. Common contractions such as

“what’s” were represented as multi-word

units (e.g., “what_is”) to reduce language

model complexity, and since these words

were often a source of transcription error

anyway. Additional multi-word candidates

were identified using a mutual information

criterion which looked for word sequences

which were likely to occur together. Table 5

shows examples of multi-word units found

in the JUPITER vocabulary.
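The mutual information criterion can be sketched roughly as follows (a pointwise-mutual-information ranking over adjacent word pairs; the counts, threshold, and exact formula are illustrative rather than the precise criterion used):

import math
from collections import Counter

def multiword_candidates(sentences, min_count=5, top=10):
    words, pairs = Counter(), Counter()
    for sent in sentences:
        toks = sent.split()
        words.update(toks)
        pairs.update(zip(toks, toks[1:]))
    n = sum(words.values())
    def pmi(pair):
        w1, w2 = pair
        return math.log(pairs[pair] * n / (words[w1] * words[w2]))
    frequent = [p for p in pairs if pairs[p] >= min_count]
    return sorted(frequent, key=pmi, reverse=True)[:top]

# Pairs such as ("going", "to") or ("what", "is") surface as candidates for
# units like going_to and what_is.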

Table 3. Categorical breakdown of JUPITER vocabulary.

Type        Size   Examples
geography   770    Boston, Alberta, France, Africa
basic       408    I, what, January, tomorrow
weather     167    temperature, snow, sunny


Language Modelling
A class bigram language model was used in

the forward Viterbi search, while a class

trigram model was used in the backwards

A* search to produce the 10-best outputs for

the natural language component. A set of

nearly 200 classes were used to improve the

robustness of the bigram. The majority of

the classes involved grouping cities by state

or country (foreign), in order to encourage

agreement between city and state. In cases

where a city occurred in multiple states or

countries, separate entries were added to the

lexicon (e.g., Springfield Illinois vs. Spring-

field Massachusetts). Artificial sentences

were created in order to provide complete

coverage of all of the cities in the vocabu-

lary. Other classes were created semi-

automatically using a relative entropy metric

to find words which shared similar condi-

tional probability profiles. Table 6 shows

examples of word classes.
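The semi-automatic clustering step might look roughly like this: build the conditional (right-context) probability profile of each word and measure a symmetrized, smoothed relative entropy between profiles; the smoothing constant and the decision of which pairs to merge are illustrative assumptions.

import math
from collections import Counter, defaultdict

def right_profiles(sentences):
    # P(next word | w), estimated from adjacent-word counts.
    counts = defaultdict(Counter)
    for sent in sentences:
        toks = sent.split()
        for w, nxt in zip(toks, toks[1:]):
            counts[w][nxt] += 1
    return {w: {v: c / sum(ctr.values()) for v, c in ctr.items()}
            for w, ctr in counts.items()}

def divergence(p, q, eps=1e-6):
    # Symmetrized relative entropy between two smoothed profiles.
    vocab = set(p) | set(q)
    def kl(a, b):
        return sum(a.get(v, eps) * math.log(a.get(v, eps) / b.get(v, eps)) for v in vocab)
    return kl(p, q) + kl(q, p)

# Words with small divergence (e.g. "raining" vs. "snowing") become
# candidates for membership in the same class.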

Since filled pauses (e.g., uh, um)

occurred both frequently and predictably

(e.g., start of sentence), they were incorpo-

rated explicitly into the vocabulary, and

modelled by the bigram and trigram. These

words were stripped out of word hypotheses

before being passed to the natural language

component. When tested on a 1445

utterance test set the word-class bigram and

trigram had perplexities of 16.6 and 15.4,

respectively. These are slightly lower than

the respective word bigram and trigram

perplexities of 17.2 and 16.4. Note that the

class bigram also improved the speed of the

recognizer as it had 25% fewer connections

to consider during the search.

Acoustic Modelling
Words in the lexicon were modelled with a

set of 69 labels, consisting of the standard

TIMIT labels augmented with a few larger

sonorant sequences such as vowel+/r/

which have been previously found to

improve the performance of our segment-

based recognizer. A small set of pronuncia-

tion rules were applied to the lexicon to

create a pronunciation graph. The majority

of the rules applied to cross-word effects

such as multiple possible realizations of stop
consonant sequences, and geminate
consonants.

Table 4. Examples of out-of-vocabulary words.

Word       Count   Usage Examples
up         29      hang up, come up, screwed up
pasadena   28
who        23      who are you, who created you
sure       19      sure, I am not sure
hey        17      hey this is neat
why        17      why did you tell me Sunday
point      16      dew point
el nino    14      do you know anything about el nino

Table 5. Examples of multi-word units.

can_you      when_is      never_mind
do_you       what_about   clear_up
excuse_me    what_are     heat_wave
give_me      what_will    pollen_count
going_to     how_about    warm_up
you_are      I_would      wind_chill

Acoustic models consisted of both

segment-based models and boundary-based

diphone models. The 69 context-indepen-

dent segment-models used a set of 40

features consisting of duration, MFCC

averages computed over segment thirds, and

derivatives computed at segment end-points.

The 742 diphone models used a set of 50

features consisting of 8 MFCC averages

centered around a hypothesized boundary.

A set of pooled principal components were

used to whiten the feature space for both

sets of models.
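A rough sketch of the segment feature computation (the dimensions, the exact endpoint-derivative definition, and the use of NumPy are illustrative assumptions; the actual models used fixed 40- and 50-dimensional feature sets):

import numpy as np

def segment_features(mfcc, start, end):
    # mfcc: (n_frames, n_coeffs) array of frame-level coefficients.
    # Assumes the hypothesized segment spans at least three frames.
    seg = mfcc[start:end]
    duration = np.array([end - start])
    third_means = np.concatenate([t.mean(axis=0) for t in np.array_split(seg, 3)])
    d_start = mfcc[min(start + 1, len(mfcc) - 1)] - mfcc[max(start - 1, 0)]
    d_end = mfcc[min(end, len(mfcc) - 1)] - mfcc[max(end - 2, 0)]
    return np.concatenate([duration, third_means, d_start, d_end])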

Experiments
Over the course of this reporting period

several different recognizers have been

trained and evaluated. Figure 4 shows the

results of a mid-December performance

evaluation, which plots the performance of

all recognizers on the same set of data

recorded in the fall of 1997. All testing has

been performed on the subset of collected

data considered to be within domain, and

excludes utterances with out-of-vocabulary

words, clipped speech, cross-talk, and other

kinds of noise. The within domain utter-

ances typically correspond to 70 to 75% of

the recorded data. Note that in practice, the

recognizer is often able to correctly process

these excluded utterances.

Table 6. Example word classes used in the JUPITER domain.

raining, snowing
cold, hot, warm
humid, windy
extended, general
advisories, warnings
conditions, forecast, report
humidity, temperature

As shown in Figure 4, the performance

has consistently improved over time as the

recognizers have been able to incorporate

more sophisticated language and acoustic

models due to increased amounts of

training data. At the end of April the

recognizer was trained primarily on read

speech, and wizard-based spontaneous

speech. This laboratory-trained system had

word-error rates of 7% on read-speech and

10% on spontaneous speech collected from

group members. However, the error rates

initially tripled on incoming data from real

users. Over the course of the year both word

and sentence error rates have been reduced

by a factor of three, and are now in the 10%

and 26% range, respectively. Note that the

spontaneous error rates are considerably

lower for experienced users (3% word error

and 10% sentence error rates, respectively).

Figure 4. Word and sentence error rates throughout 1997 (word and sentence error, in percent, plotted from April through November).

Future Work
Although the performance of the system has

improved considerably over the course of

the year, there remains a considerable

number of interesting research topics to be

pursued in the future. The system to date

has used a pooled speaker model for all

acoustic modelling. It should be possible to

achieve gains through speaker normalization

and short-term speaker adaptation, and also

through better adaptation to the channel


conditions of individual phone calls.

Adaptation may also be useful to help

improve performance on non-native

speakers. Speakers with strong foreign

accents have thus far been excluded from

training and testing, and therefore often

experience increased error rates.

The data collection efforts have

produced a gold-mine of spontaneous

speech effects which are often a source of

both recognition and understanding errors.

For example, partial words typically cause

problems for the speech recognizer. Another

source of recognition errors is out-of-

vocabulary words, which are often cities not

covered in the vocabulary. These phenom-

ena can potentially be modelled better by

both acoustic and language models. The use

of dynamic vocabulary and language models

may also help alleviate some of the un-

known city problems. Finally, we plan to

explore a more fully integrated recognizer,

to try and impose additional constraint

from the language understanding compo-

nent.


Confidence Scoring for Speech Understanding
Christine Pao, Philipp Schmid, and James Glass

Most speech recognition systems determine

the most likely sequence of words for a

given utterance. However, it is not always

desirable to make use of these word

hypotheses, especially when they are

incorrect! In a conversational system for

example, it is useful to also generate a

confidence score to indicate whether a word

sequence is actually correct. In this work, we

report on our initial efforts to produce

confidence scores for utterance rejection in

the JUPITER telephone-based weather

information domain.

An examination of data for the JUPITER

domain shows that 25-30% of the data

contain out-of-vocabulary words, clipped

speech, partial words, non-speech (e.g.,

silence, laughter), or other background

noise. Since it is not clear whether these

data can be correctly recognized or under-

stood by the system, they have been

excluded from all system evaluations,

although they clearly must be handled by

the deployed system. Our approach has

been to sort the JUPITER data into two

classes: data that the system can understand

and those which it cannot. We first devel-

oped methods which could reliably annotate

the data automatically, then developed

features and classifiers which could distin-

guish between the two classes.

Automatic Annotation
We began the annotation process by first

classifying subsets of the data by hand. We

then used this manually annotated data to

develop and evaluate an automatic annota-

tion process. The automatic annotation

process we developed compares the seman-

tic frame generated from the correct

transcription with the frames corresponding

to the N-best word hypotheses produced by

the recognizer. The utterance is marked as

correctly understood if the semantic

representation of the first hypothesis which

parses matches the semantic frame gener-

ated from the correct transcription. If there

is no match or the transcription cannot be

parsed (has no semantic representation), the

utterance is marked as not understood.
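A minimal sketch of this annotation rule, assuming frames are compared with an equality or frame-matching test like the one described in the natural language report above (the None convention for parse failure is an illustrative assumption):

def annotate(transcription_frame, nbest_frames):
    # transcription_frame: frame from the correct transcription, or None if it fails to parse.
    # nbest_frames: frames for the N-best hypotheses, None where a hypothesis fails to parse.
    if transcription_frame is None:
        return "not understood"
    first_parse = next((f for f in nbest_frames if f is not None), None)
    if first_parse is not None and first_parse == transcription_frame:
        return "understood"
    return "not understood"

print(annotate({"city": "boston", "topic": "rain"},
               [None, {"city": "boston", "topic": "rain"}]))   # -> "understood"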

The automatic annotation agreed with

the manual annotation in 1858 out of 2051

cases (90.6% agreement). When the 193

disagreements were examined, 154 were

resolved in favor of the automatic annota-

tion and 39 in favor of the manual annota-

tion. Assuming the cases where manual and

automatic annotation are in agreement were

correctly marked, the automatic annotation

is 98.1% accurate. According to the

automatic annotation, the training, develop-
ment, and test sets used for later experiments

had correct understanding rates of 63.6%,

56.6%, and 64.9% respectively. Note that

these results are considerably lower than the

near 80% understanding rates we achieve

on the within-vocabulary data subset, and

reflect the presence of the various phenom-

ena we mentioned earlier.

Feature Selection
The next step was to develop a set of

features which would be relevant to

recognition/understanding confidence.

There were three classes of features which

we investigated. The first included scores

generated by the recognizer such as the

acoustic and language model scores. The

second was based on the N-best list, and

included the number of sentence hypoth-

eses, word frequency statistics, and semantic

distances between individual sentence

hypotheses. The third class was based on

how many of the N-best hypotheses were


completely or partially understood by the

system. A total of 47 features were consid-

ered for classification.

Our first application for confidence

scoring was a classification task: separating

utterances likely to be understood (AC-

CEPT) by the system from utterances not

likely to be understood (REJECT). Fisher

linear discriminant analysis (LDA) was used

to select the best feature set for this task.

The feature sets were created iteratively. On

each iteration, N feature sets from the

previous iteration were each augmented

with one additional feature from M unused

features. The N*M new feature sets were

scored using LDA classification on a

development set, and the top N feature sets

were retained for the next iteration. The

LDA threshold for each classifier was set to

maintain a false rejection rate of 2% on the

development set. The procedure terminated

when no additional improvement was

found.
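The search can be pictured as a beam search over feature subsets; in the sketch below the score callable stands in for training the LDA classifier, fixing its threshold for 2% false rejection on the development set, and returning the correct-rejection rate, while the beam width and stopping rule are illustrative assumptions.

def greedy_feature_search(all_features, score, beam=5):
    # Each iteration extends every retained set by one unused feature,
    # scores the new sets on development data, and keeps the top `beam`.
    best_sets, best_score = [frozenset()], float("-inf")
    while True:
        candidates = {s | {f} for s in best_sets for f in all_features if f not in s}
        if not candidates:
            break
        ranked = sorted(candidates, key=score, reverse=True)[:beam]
        if score(ranked[0]) <= best_score:
            break   # no additional improvement found
        best_sets, best_score = ranked, score(ranked[0])
    return max(best_sets, key=score)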

The best feature set achieved 60%

correct rejection on the development and

test sets using a set of 14 features automati-
cally selected using the Fisher criterion. The
false rejection rate on the test set increased
slightly to 3%. Figure 5 shows the probabil-
ity distributions for the accept and reject
classes of the development set.

Figure 5. Distribution of the Fisher linear discriminant for the development set (counts of accept and reject tokens versus discriminant value, with the rejection threshold marked).

Decision Tree Classification
One problem with the Fisher classifier is

that the different features are combined

into a single measure, making it more

difficult to understand the importance of

and interaction between the individual

features. We decided it would be informa-

tive to see how individual features could be

used to partition the feature space. The

basic idea was to split off sets of tokens,

where the split set had a high percentage of

accept or reject tokens (i.e., high purity).

Regression trees were created by

searching for the feature vector which could

split a node (i.e., data subset) to meet purity

and size requirements. After a node had

been split, all remaining nodes which did

not meet the purity requirement were


merged together for subsequent splitting. If

there was no split which met both the

purity and minimum size requirements, the

purity constraints were relaxed and the

search was repeated. A development set was

used for cross-validation purposes.
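A sketch of the node-splitting search (tokens are shown as dictionaries of feature values plus an accept/reject label; the purity and size thresholds are illustrative assumptions):

def best_split(tokens, features, min_size=20, min_purity=0.9):
    def purity(subset):
        if not subset:
            return 0.0
        accepts = sum(1 for t in subset if t["label"] == "accept")
        return max(accepts, len(subset) - accepts) / len(subset)
    best = None
    for f in features:
        for thr in sorted({t[f] for t in tokens}):
            low = [t for t in tokens if t[f] <= thr]
            high = [t for t in tokens if t[f] > thr]
            for side in (low, high):
                if len(side) >= min_size and purity(side) >= min_purity:
                    quality = purity(side) * len(side)
                    if best is None or quality > best[0]:
                        best = (quality, f, thr)
    return best   # None signals that the purity constraints must be relaxed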

While the resulting regression tree

structure could be used directly as a


classifier for utterance rejection, there was a

greater degradation in performance when

moving from development to test sets, when

compared to our initial Fisher classifier

experiments. However, we did find the

regression trees useful for identifying

features which were useful for confidence

scoring.


PEGASUS: Flight Departure/Arrival/Gate Information System
Stephanie Seneff, Joseph Polifroni and Philipp Schmid

Given the success of our JUPITER telephone-only
system for weather reports, we have

decided to develop a similar system for

information about the dynamic status of

flights. This is particularly timely because

several sites have recently been introduced

on the Web that maintain such informa-

tion. Basically, each major U.S. carrier

(United, American, Delta) offers its own

site, and there are also generic sites such as

thetrip.com and Microsoft's Expedia. We

believe that the flight domain is signifi-

cantly more challenging than weather in

terms of inheritance needs and dialogue

management.

We currently have an end-to-end system

in place, although its performance is not yet

adequate for general utilization. The

recognizer vocabulary currently contains

about 600 words, and the class bigram has

been trained on a small set of sentences that

were created manually. The system is able

to answer questions about the expected

departure or arrival time of specific flights,

and can also answer on-time requests and

provide gate information for some of the

carriers. We have built into it a dialogue

management system based on the BIANCA

dialogue manager that supports mixed-

initiative dialogue control, with PEGASUS

asking the user directly for information

necessary to access the various Web sites. As

is usually the case for our systems, the user

is not required to answer the system's

questions. An example dialogue (text input)

in PEGASUS is given in Table 7.

The PEGASUS system has some difficulty

in accurately recognizing flight numbers.

We have implemented a mechanism to filter

the N-best list based on any flight numbers

that were explicitly listed in a prior re-

sponse. This is implemented using the

same mechanism that was used to filter on

lists of cities provided by JUPITER, and is

another example of dialogue context

influencing recognition output. We believe

this mechanism will be important for the

overall success of the system.

Context tracking is a fairly challenging

aspect of the PEGASUS domain. Users often

assume that attributes from the past should

carry over to the current query, but it is

often ambiguous as to how much of the

context to retain. For example, if the user

asks, "Is there a United flight?", should any

prior restrictions on departure time be

remembered? When the user provides an

airline and flight number in a new query,

source and destination should be retained

only if they are consistent, but are necessary

when the flight exists as multiple legs. Users

can also refer pronominally to linguistically

embedded entities, as in the sequence:

"What is the arrival gate of United flight

nine?", followed by "Is it on time?" Alterna-

tively, obligatory roles can be elided, as in

"What is the arrival gate?" where an implied

flight-in-focus is not mentioned at all.

Aspects of inheritance have appeared in

PEGASUS that have led us to expand the

capabilities of the generic discourse

component.

We believe PEGASUS will soon be

sufficiently robust to enter a phase of

wizard-based data collection. In parallel, we

will collect read speech using our Web-based

telephone data collection facility. Once

there is a corpus of a few thousand sen-

tences, we can train the recognizer on these

data, and hopefully the system will then be

adequate for preliminary release to the

public, via a toll-free number.


Table 7. Sample dialogue (typed input) between a user and the new PEGASUS system.

USER: Are there any flights from Chicago to Dallas this morning

PEGASUS: Can you provide an approximate departure time or arrival time?

USER: Leaving around nine

PEGASUS: I have found 5 flights from Chicago to Dallas that leave around 9 o'clock a m. American Airlines flight 2209, American Airlines flight 2311, United Airlines flight 1781, American Airlines flight 2213, and United Airlines flight 359. Please select one.

USER: Will the third one be on time

PEGASUS: United Airlines flight 1781 from Chicago is estimated to arrive in Dallas Fort Worth at 11:09 am local time, 8 minutes ahead of schedule. It's currently 150 miles north east of Dallas Fort Worth, flying at 24000 feet with a speed of 502 miles per hour. Is there something else I can do for you?

USER: How about the first one

PEGASUS: American Airlines flight 2209 from Chicago landed in Dallas Fort Worth at 10:12 am local time. Is there something else?

USER: What gate did it arrive at

PEGASUS: American Airlines flight 2209 is arriving in Dallas Fort Worth at gate 11. What else?

USER: Nothing more thank you

PEGASUS: Thanks for using Pegasus! Have a nice day!


Using Aggregation to Improve the Performance of Mixture Gaussian Acoustic Models
T.J. Hazen and Andrew Halberstadt

This work [1] investigates the use of

aggregation as a means of improving the

performance and robustness of mixture

Gaussian models. This technique produces

models that are more accurate and more

robust to different test sets than traditional

cross-validation using a development set. In

[1], a theoretical justification for this

technique is presented, along with experi-

mental results in phonetic classification,

phonetic recognition, and word recognition

tasks on the TIMIT and Resource Manage-

ment databases. In speech classification and

recognition tasks, error rate reductions of up
to 12% were observed using this

technique. A method for pruning the

aggregated models is also presented. This

pruning method reduces computational

requirements while retaining some of the

benefits of aggregation.

Aggregation of probabilistic classifiers is

performed by averaging the outputs of a set

of independently trained classifiers. The

proof in [1] demonstrates that an aggregate

classifier is guaranteed to exhibit an error

metric which is equal to or better than the

average error metric of the individual

classifiers used during aggregation. This

technique is not restricted to mixture

Gaussian classifiers, but rather is valid for

any type of probabilistic classifier. The

proof is also completely independent of the

test data being presented to the classifier.

Thus, the method is robust because it

improves performance regardless of the test

set being used.
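The aggregation rule itself is just an average of class posteriors; the sketch below uses toy callables in place of independently trained mixture Gaussian classifiers:

import numpy as np

def aggregate(models, x):
    # Average the posterior vectors of the individual classifiers, then decide.
    posteriors = np.mean([m(x) for m in models], axis=0)
    return int(np.argmax(posteriors)), posteriors

models = [lambda x: np.array([0.6, 0.4]),
          lambda x: np.array([0.3, 0.7]),
          lambda x: np.array([0.2, 0.8])]
print(aggregate(models, None))   # class 1, with an averaged posterior of about 0.63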

To demonstrate the effectiveness of

aggregation, Table 8 presents results

obtained on three different tasks, phonetic

classification, phonetic recognition, and

word recognition. All of the experiments

utilize the SUMMIT [2] speech recognition

system. This system utilizes mixture

Gaussian acoustic models to score segment-

based phonetic units. For each experiment,

24 individual sets of acoustic models were

independently trained. These 24 individual

trials were then utilized to create sets of

independent models for each aggregation

trial. Thus, the models from 24 individual

training trials are used to create 12 2-fold
aggregation trials, 8 3-fold aggregation
trials, etc., until only one trial of 24-fold

aggregation is performed. In each case,

aggregation was found to produce signifi-

cant improvements over standard training

techniques, with observed error rate

reductions of up to 12%. Aggrega-
tion performs particularly well when the
standard training techniques produce
models that are overfit to the training data.

Table 8. Average accuracy of M trials of N-fold model aggregation for the tasks of context-independent phonetic classification on TIMIT (CI-PC1 and CI-PC2), context-dependent phonetic recognition on TIMIT (CD-PR), and context-dependent word recognition on Resource Management (CD-WR).

                     Average Error Rate (%)
                     M=24    M=6     M=1
Task      Test Set   N=1     N=4     N=24    % Error Reduction
CI-PC1    dev        21.2    20.0    19.6     7.1
CI-PC1    core       22.1    20.7    20.2     8.3
CI-PC2    core       23.2    21.3    20.4    12.0
CD-PR     dev        28.1    27.3    27.0     3.6
CD-PR     core       29.3    28.4    28.1     4.0
CD-WR     test        4.5     4.2     4.0    12.0

There are a large number of possible

generalizations and extensions of this work.

In a sense, model aggregation is equivalent

to linear interpolation of models, where the

interpolation is performed using equal

weights. Thus, any situation where models

are interpolated can be viewed as a form of

aggregation. This encompasses common

techniques such as the interpolation of

speaker-independent and gender-specific

models, or the interpolation of context-

independent and context-dependent

models. In recent experiments using SUMMIT,

aggregation was fruitfully applied to models

trained using different segmentations of the

speech signal. Aggregation is attractive

because of its simplicity, and because

empirical evidence indicates its effectiveness

in reducing the error rate in typical speech

classification and recognition tasks.

References

[1] T. J. Hazen and A. K. Halberstadt, "Using Aggregation to Improve the Performance of Mixture Gaussian Acoustic Models," Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, Seattle, WA, May 1998.

[2] J. Glass, J. Chang, and M. McCandless, "A Probabilistic Framework for Feature-based Speech Recognition," Proc. International Conference on Spoken Language Processing, pp. 2277-2280, Philadelphia, PA, September 1996.


BIANCA: A Dialogue Management Engine for PEGASUS
Philipp Schmid, Stephanie Seneff and Joseph Polifroni

The purpose of a dialogue manager is to

provide for each dialogue turn in a human-

computer interaction the system's half of the

conversation, including verbal, tabular, and

pictorial responses. In addition to answering

the user's request, the dialogue manager can

also initiate clarification sub-dialogues to

resolve ambiguities stemming either from

recognition errors (e.g., indicated by low

confidence scores for individual words),

from insufficient user-provided information

to answer the request, or from semantic

considerations (London, England or

London, Kentucky). The dialogue manager

needs to implement policies to deal with a

variety of dialogue situations: how to

present the answer to the user, especially in

the case of display-less system interaction

where only a limited amount of information

can be provided to the user per turn, or

what to do if the requested information is

not available. Additionally, it should

provide dialogue-context dependent

assistance upon request or detection of

difficulties in the interaction with the user.

In previous experience, the develop-

ment of dialogue management components

for conversational systems has been slow

and cumbersome, partly because of the

complex nature of the dialogue activities

and difficulties in visualizing the control

flow of the processing steps which are often

embedded in nested if-then-else statements

and/or nested function calls. Each change

to the control flow required a

recompilation, and visualization was done

typically using source-code debuggers. In

addition, a separate executable was needed

for each domain (e.g., JUPITER, PEGASUS and

CityGuide).

We have recently begun the develop-

ment of a new dialogue management engine

called BIANCA. The goal was to be able to

describe the dialogue plan in terms of

elementary functions acting on state

variables. Hence BIANCA is driven by external


tables containing the domain-dependent

specifications expressed in a high-level

specification language. As a side-effect of

this new engine, we are able to use a visual

debugger that clearly illustrates the control

flow (see Figure 6).

Figure 6. Graphical User Interface for BIANCA, showing the debugger with rules, functions, and state variables (src, dest, airline, flight#, sql, nfound, atime, agate), each marked as "don't care," "exists/has value," or "not set."

There are two important high-level

concepts in BIANCA: variables and functions.

In the case of the flight arrival and depar-

ture information domain (PEGASUS), the set

of variables (key/value pairs) captures three

essential types of state: (a) the user's

intentions (e.g., "What is the arrival gate?",

"Will the flight be on time?"), (b) the values

of variables essential to the domain (flight

number or airline name), and (c) variables

describing the state of the interaction with

external data sources (e.g., the number of

items retrieved from a database). The values

of these variables implicitly define the state

of the dialogue. Initial values are derived

from the input frame-in-context (using

GENESIS). The functions perform important

elementary tasks such as database retrieval

or clarification requests. In the process of

execution they change the state of one or

more of the variables. The order in which

these functions are executed is determined

by a set of rules (pre-conditions) that are

expressed as boolean expressions using the

keys and values of the variables (see table 9

for an example of variables and rules for a

small PEGASUS domain.). Each function can

be triggered by one or more rules and one

or more post-condition can hold true (e.g.,

either zero, one, less than three, or at least

three flight were retrieved from the data-

base). If in addition to the mandatory pre-

conditions we can also specify all the post-

conditions and their respective probabili-

ties, we can statically determine the optimal

dialogue strategy (e.g., which question to ask

the user in order to narrow down the

number of flights under consideration).
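The control loop can be sketched as follows; the rule syntax is simplified to Python callables, the database lookup is stubbed out, and only a slice of the Table 9 example is reproduced, so this is an illustrative rendering rather than BIANCA's actual specification language.

def run_dialogue(state, rules, functions):
    # Fire the first rule whose pre-condition holds, letting its function
    # update the state variables, until a reply has been produced.
    while "reply" not in state:
        for condition, func_name in rules:
            if condition(state):
                functions[func_name](state)
                break
        else:
            state["reply"] = "I'm sorry, I can't handle that request."
    return state["reply"]

rules = [
    (lambda s: "airline" in s and "flight_number" in s and "query" not in s, "compose_query"),
    (lambda s: "query" in s and "num_found" not in s, "evaluate_query"),
    (lambda s: s.get("num_found") == 1 and s.get("arrival_gate") == "which", "speak_gate"),
]
functions = {
    "compose_query": lambda s: s.update(
        query=f"select arrival_gate where airline={s['airline']} and flight_number={s['flight_number']}"),
    "evaluate_query": lambda s: s.update(num_found=1, gate="78"),   # stubbed database call
    "speak_gate": lambda s: s.update(
        reply=f"United flight {s['flight_number']} is expected to arrive at gate {s['gate']}"),
}
print(run_dialogue({"airline": "UAL", "flight_number": "9", "arrival_gate": "which"},
                   rules, functions))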

In parallel to the above effort, we are

reorganizing the generic domain-server code

to support a shared data structure that

captures all the important information

necessary to complete each dialogue turn.

This structure is intended to provide a

universal framework for dialogue manage-

ment, at least in the context of database-

access activities. The dialogue management

operates in essentially an object-oriented

formalism, in that all functions operate only

on the contents of the dialogue manage-

ment structure. Any high-level functionality

that should be essential for all domains is

provided in a server-shell, which should

expedite the process of porting to a new

domain. We envision ultimately a single

domain-server with separate libraries of

functions specializing in specific domains

such as JUPITER and PEGASUS.


Table 9. PEGASUS example.

Variables:
  airline, flight_number, from_airport, to_airport, query, num_found, arrival_gate

Rules:
  airline & flight_number --> compose_query --> query
  airline & !flight_number --> need_flight_number --> reply
  query & !num_found --> evaluate_query --> num_found
  num_found > 1 --> speak_multiple --> reply
  num_found = 0 --> speak_no_info --> reply
  num_found = 1 & arrival_gate = "which" --> speak_gate --> reply

User: "Which gate will United flight nine arrive at?"

Initial configuration:
  airline: "UAL"  flight_number: "9"  arrival_gate: "which"

execute compose_query:
  airline: "UAL"  flight_number: "9"  arrival_gate: "which"
  query: "select arrival_gate from * where airline=UAL and flight_number=9"

execute evaluate_query:
  airline: "UAL"  flight_number: "9"  arrival_gate: "which"
  query: "select arrival_gate from * where airline=UAL and flight_number=9"
  num_found: 1

execute speak_gate:
  airline: "UAL"  flight_number: "9"  arrival_gate: "which"
  query: "select arrival_gate from * where airline=UAL and flight_number=9"
  num_found: 1
  reply: "United flight nine is expected to arrive in San Francisco at gate 78"


ANGIE-Based Pronunciation Server
Aarati Parmar and Stephanie Seneff

Research in extending the ANGIE system has

been progressing along four fronts: 1) using

ANGIE for sophisticated prosodic modelling,

2) using ANGIE in a variety of ways in speech

recognition tasks, 3) acquiring a large

lexicon labelled according to ANGIE’s

conventions through semi-automatic means,

and 4) developing a pronunciation server

based on the ANGIE formalism. The first

three efforts are being conducted as student

theses [1,2,3] (c.f. pages 40, 52 and 62). The

fourth project was advanced mainly through

the efforts of Aarati Parmar in a summer

staff position, with assistance from Helen

Meng and Stephanie Seneff.

The goal of the summer project was to

develop an ANGIE server to provide pronun-

ciations and/or syllabifications for English

words. The pronunciations could be

provided either in a convention appropriate

for SUMMIT or the standard ANGIE phonemic

convention, and as a side-effect ANGIE’s

“morph” analysis [1] could be returned as

well. The training data for the pronuncia-

tion server were acquired from available

resources, using JUPITER's vocabulary as a

model for SUMMIT conventions, and using

the on-line COMLEX (more than 63,000

words excluding proper nouns and abbrevia-

tions) lexicon for providing letter-to-sound

rules. The conventions for COMLEX are

different from those of either SUMMIT or

ANGIE, so a set of rules defining a mapping

from COMLEX to ANGIE aided the acquisi-

tion process. We also attempt to utilize our

predefined “morph” lexicon to enhance

accuracy.

Table 10 shows some examples of

different transcription conventions for

words containing the morph unit “-tion” in

SUMMIT, ANGIE, and COMLEX. Notice that “-

tion” is realized as “sh ax n” in SUMMIT, as

“sh! en” in ANGIE, and as “S.In” in

COMLEX.

The operational flow is a series of

backoff steps as follows:

i) Given a word spelling, first look to see

if it already exists in our dictionary. If so,

return the pronunciation.

ii) Otherwise, look up the word-morph

inventory (we currently have 48K words

decomposed into morphs) to get a morph

decomposition. If lookup fails, parse with

letter terminals into a hypothesized morph

sequence.

iii) Having acquired the morph se-

quence, look up the morph pronunciations

file (each morph is labelled with a phonetic

transcription in this file, using SUMMIT

conventions). Then concatenate the morph

pronunciations together to get a full word

pronunciation.

iv) Apply rewrite rules to handle

phonological variations at morph bound-

aries, e.g., deletable stops in SUMMIT or

pluralization rules.

The pronunciation provided by ANGIE

for each morph in the morph lexicon has

been transformed from ANGIE format to

appropriate canonical SUMMIT phonemes

using a set of automatically generated ANGIE-

phoneme-to-SUMMIT-phoneme mapping

rules. The basis for such generation, as

mentioned above, is the 600-word vocabu-

lary in the JUPITER domain. By utilizing the

intermediate morph representation, we

guarantee consistency in the pronunciations

for alternate words that share the same

morph label.


This project has not yet been com-

pleted, as we expect the server to be

especially useful for generating pronuncia-

tions for novel proper nouns such as place

names. We believe it should be possible to

train it on a large set of place names

obtained semiautomatically from the Web.

Table 10. Examples of lexical entries from SUMMIT, COMLEX, and ANGIE showing differing conventions. Note: "!" stands for "onset position," and "+" is a stress marker.

SUMMIT:  connections     k ax n eh kd sh ax n z
         restriction     r ax s t r ih kd sh ax n

COMLEX:  additions       .xd'IS.Inz
         interrogations  .Int+Er.xg'eS.In

ANGIE:   corruption      k! er r! ah+ p sh! en
         reflection      r! iy f! l eh+ k sh! en

References

[1] A. Parmar. A Semi-Automatic System for the Syllabification and Stress Assignment of Large Lexicons. M.Eng. thesis, MIT Department of Electrical Engineering and Computer Science, May 1997.

[2] R. Lau. Subword Lexical Modelling for Speech Recognition. Ph.D. thesis, MIT Department of Electrical Engineering and Computer Science, March 1998.

[3] G. Chung. Hierarchical Duration Modelling for a Speech Recognition System. MIT Department of Electrical Engineering and Computer Science, May 1997.
