THESIS RESEARCH
A Model for Segment-Based Speech Recognition
Jane Chang
Currently, most approaches to speech
recognition are frame-based in that they
represent the speech signal using a temporal
sequence of frame-based features, such as
Mel-cepstral vectors. Frame-based ap-
proaches take advantage of efficient search
algorithms that largely contribute to their
success. However, they cannot easily
incorporate segment-based modeling
strategies that can further improve recogni-
tion performance. For example, duration is
a segment-based feature that is useful but
difficult to model in a frame-based ap-
proach.
In contrast, segment-based approaches
represent the speech signal using a graph of
segment-based features, such as average Mel-
cepstral vectors over hypothesized phone
segments. Segment-based approaches enable
the use of segment-based modeling strate-
gies. However, they introduce multiple
difficulties in recognition that have limited
their success.
In this work, we have developed a
framework for speech recognition that
overcomes many of the difficulties of a
segment-based approach. We have published
experiments in phone recognition on the
core test set of the TIMIT corpus over 39
classes [1]. We have also run preliminary
experiments in word recognition on the
December ’94 test set of the ATIS corpus.
In our segment-based approach, we
hypothesize segments prior to recognition.
Previously, our segmentation algorithm was
based on local acoustic change. However,
segmentation depends on contextual factors
that are difficult to capture in a simple
measure. We have developed a probabilistic
segmentation algorithm called “segmenta-
tion by recognition” that hypothesizes
segments in the process of recognition.
Segmentation by recognition applies all of
the constraints used in recognition towards
segmentation. As a result, it hypothesizes
more accurate segments. In addition, it
adapts to all types of variability, focuses
modeling on confusable segments, hypoth-
esizes all types of units, and uses scores that
can be re-used in recognition. We have
implemented this segmentation algorithm
using a backwards A* search and a diphone
context-dependent frame-based phone
recognizer. In published TIMIT experi-
ments, we have reported an 11.3% reduc-
tion in phone recognition error rate from
38.7% with our previous acoustic segmenta-
tion to 34.3% with segmentation by
recognition [1].
In segment-based recognition, the
speech signal is represented using a graph of
features. Probabilistically, it is necessary to
account for all of the features in the graph.
However, each path through the graph
directly accounts for only a subset of all
features. Previously, we modeled the features
that are not in a path using a single “anti-
phone” model [2]. However, the features
that are not in a path depend on contextual
factors that are difficult to capture in one
model. We have developed a search
algorithm called “near-miss modeling” that
uses multiple models for all features in a
graph. Near-miss modeling associates each
feature with a near-miss subset of features
such that any path through a graph is
associated with all features. As a result, it
probabilistically accounts for and efficiently
enforces constraints across all features. In
addition, it focuses modeling on discrimi-
nating between a feature and its near-misses.
We have implemented near-miss modeling
using a Viterbi search and a set of near-miss
phone models that correspond to our
context-independent phone models. In
published experiments, we have reported a
9.3% reduction in phone recognition error
rate from 34.3% with anti-phone modeling
to 31.1% with near-miss modeling [1]. In
addition, in preliminary ATIS experiments,
we have shown a 21.4% reduction in word
recognition error rate from 12.6% with anti-
phone modeling to 9.9% with near-miss
modeling.
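To make the bookkeeping concrete, the following Python fragment sketches how a path score might combine on-path segment scores with near-miss scores. All names are hypothetical; this is an illustration of the idea, not the actual implementation.

```python
# Illustrative sketch of near-miss path scoring; hypothetical names,
# not the actual SUMMIT implementation.

def score_path(path, near_miss_sets, seg_score, near_miss_score):
    """Score one path through a segment graph.

    path            -- list of (segment, unit) pairs on the hypothesis
    near_miss_sets  -- dict mapping each segment to its near-miss set,
                       chosen so that any path accounts for all features
    seg_score       -- log-score of an on-path segment under the unit's model
    near_miss_score -- log-score of an off-path segment under the
                       near-miss model for the hypothesized unit
    """
    total = 0.0
    for segment, unit in path:
        total += seg_score(segment, unit)          # on-path feature
        for miss in near_miss_sets[segment]:       # associated near-misses
            total += near_miss_score(miss, unit)
    return total
```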
In word recognition, deletion and
insertion errors in segmentation can cause
multiple recognition errors. Previously, we
have been using phone units. However,
phone realizations depend on contextual
factors and are difficult to segment. We
have developed larger units called “multi-
phone units” that span multiple phones.
Multi-phone units cover phone sequences
that demonstrate systematic acoustic and
lexical variations. As a result, they recover
from systematic segmentation errors. In
addition, they focus modeling on systematic
context-dependencies. We select multi-
phones using a Viterbi search to minimize
match, deletion and insertion criteria. In
preliminary experiments, we have shown a
4.2% reduction in word recognition error
rate from 9.9% with phone units to 9.1%
with multi-phone units.
Figure 7 shows an example of our
framework. The input speech is displayed as
a spectrogram. Segmentation by recognition
hypothesizes the graph of segments under
the spectrogram. Near-miss modeling
associates near-misses such that the black
segment is associated with the three gray
segments. The total score for a unit is the
sum of segment and near-miss scores. The
seven best scoring units for the black
segment are listed on the right. The best
scoring unit is the multi-phone unit that
spans the phone sequence of /r/ followed
by /l/. The recognized phone and word
outputs are displayed under the segment
graph.
With segmentation by recognition,
near-miss modeling and multi-phone units,
our framework overcomes many of the
difficulties in segment-based recognition
and enables the exploration of a wide range
of segment-based modeling strategies.
Although our work does not focus on
developing such strategies, we have already
Figure 7. Example of framework, showing spectrogram, segment graph, phone and word recognition, and scores for the highlighted segment.
shown improvements in recognition
performance. For example, a segment-based
approach can use both frame- and segment-
based features. In published experiments,
we have reported a 4.0% reduction in
phone recognition error rate from 27.7%
with just frame-based features to 26.6%
with both types of features. In addition, a
segment-based approach facilitates the use
of duration to model segment probability.
In preliminary experiments, we have shown
a 5.3% reduction in word recognition error
rate from 9.1% with no model to 8.5% with
a duration model.
References
[1] J. Chang and J. Glass, “Segmentation and Modeling in Segment-based Recognition,” Proc. European Conference on Speech Communication and Technology, pp. 1199-1202, Rhodes, Greece, September 1997.
[2] J. Glass, J. Chang and M. McCandless, “A Probabilistic Framework for Feature-based Speech Recognition,” Proc. International Conference on Spoken Language Processing, pp. 2277-2280, Philadelphia, PA, October 1996.
[3] J. Chang. Near-Miss Modeling: A Segment-based Approach to Speech Recognition. Ph.D. thesis, MIT Department of Electrical Engineering and Computer Science, June 1998.
Hierarchical Duration Modelling for a Speech Recognition System
Grace Chung
Durational patterns of phonetic segments
and pauses convey information about the
linguistic content of an utterance. Most
speech recognition systems grossly
underutilize the knowledge provided by
durational cues due to the vast array of
factors that influence speech timing and the
complexity with which they interact. In this
thesis, we introduce a duration model based
on the ANGIE framework. ANGIE is a para-
digm which captures morpho-phonemic and
phonological phenomena under a unified
hierarchical structure. Sublexical parse trees
provided by ANGIE are well-suited for
constructing complex statistical models to
account for durational patterns that are
functions of effects at various linguistic
levels. By constructing models for all the
sublexical nodes of a parse tree, we implic-
itly model duration phenomena at these
linguistic levels simultaneously, and
subsequently account for a vast array of
contextual variables affecting duration from
the phone level up to the word level.
Experiments in our work have been
conducted in the ATIS domain which
consists of continuous, spontaneous
utterances concerning enquiries for travel
information.
In this duration model, a strategy has
been formulated in which node durations
in upper layers are successively normalized
by their respective realizations in the layers
below; that is, given a nonterminal node,
individual probability distributions,
corresponding with each different realiza-
tion in the layer immediately below, are all
scaled to have the same mean. This reduces
the variance at each node, and enables the
sharing of statistical distributions. Upon
normalization, a set of relative duration
models is constructed by measuring the
percentage duration of nodes occupied with
respect to their parent nodes.
Under this normalization scheme, the
normalized duration of each word node is
independent of the inherent durations of its
descendents and hence is an indicator of
speaking rate. A speaking rate parameter
can be defined as a ratio of the normalized
word duration over the global average
normalized word duration. This speaking
rate parameter is then used to construct
absolute duration models that are normal-
ized by the rate of speech. This is done by
scaling absolute phoneme durations by the
above parameter. By combining hierarchical
normalization and speaking rate normaliza-
tion, the average standard deviation for
phoneme duration was reduced from 50ms
to 33ms.
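As a rough sketch of the two normalization steps described above (my own illustration with hypothetical names and data layouts, not the ANGIE code):

```python
import numpy as np

def normalize_by_realization(durations_by_realization):
    """Scale each realization's duration distribution to a common
    mean, reducing the variance at a node and enabling the sharing
    of statistical distributions across realizations."""
    means = {r: np.mean(d) for r, d in durations_by_realization.items()}
    target = np.mean(list(means.values()))
    return {r: [x * target / means[r] for x in d]
            for r, d in durations_by_realization.items()}

def speaking_rate(norm_word_duration, global_avg_norm_word_duration):
    """Speaking-rate parameter: ratio of a word's normalized duration
    to the global average normalized word duration."""
    return norm_word_duration / global_avg_norm_word_duration

def rate_normalized(phoneme_duration, rate):
    """Scale an absolute phoneme duration by the speaking-rate
    parameter (the direction of the scaling is my assumption)."""
    return phoneme_duration / rate
```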
Using the hierarchical structure, we
have conducted a series of experiments
investigating speech timing phenomena. We
are specifically interested in (1) examining
secondary effects of speaking rate, (2)
characterizing the effects of prepausal
lengthening, and (3) detecting other word
boundary effects associated with duration
such as gemination. For example, we have
found, with statistical significance, that a
suffix within a word is affected far more by
speaking rate than is a prefix. We have also studied closely the types of words which tend to be realized particularly slowly in our training corpus, and found that these are predominantly function words and single-syllable words.
Prepausal lengthening is the phenom-
enon where words preceding pauses tend to
be somewhat lengthened. Our goal is to
examine the characteristics associated with
prepausal effects and in the future further
incorporate these into our model. In our
studies, we consider the relationship
between this phenomenon and the rate of
speech. We found that lengthening occurs when pauses are greater than 100ms
in duration. It is also observed that
prepausal lengthening affects the various
sublexical units non-uniformly. For ex-
ample, the stressed syllable nucleus tends to be lengthened more than the onset position.

The final duration model has been implemented in the ANGIE phonetic
recognizer. In addition to contextual effects
captured by the model at various sublexical
levels, the scoring mechanism also accounts
explicitly for two inter-word level phenom-
ena, namely, prepausal lengthening and
gemination. Our experiments have been
conducted under increasing levels of
linguistic constraint with correspondingly
different baseline performances. The
improved performance is obtained by
providing successively greater amounts of
implicit lexical knowledge during recogni-
tion by way of an intermediate morph or
syllable lexicon.
When maximal linguistic constraint is
imposed, the incorporation of the relative
and speaking-rate normalized absolute
phoneme duration scores reduced the
phonetic error rate from 29.7% to 27.4%, a
relative reduction of 7.7%. These gains are
over and above any gains realized from
standard phone duration models present in
the baseline system, and encourage us to
further apply our model in future recogni-
tion tasks.
As a first step towards demonstrating
the benefit of duration modelling for full
word recognition, we have conducted a
preliminary study using duration as a post-
processor in a word-spotting task. We have
simplified the task of spotting city names in
the ATIS domain by choosing a pair of highly
confusable keywords, “New York” and
“Newark.” All tokens initially spotted as
“New York” are passed to a post-processor,
which reconsiders those words and makes a
final decision, with the duration component
incorporated. For this task, the duration post-
processor reduced the number of confusions
from 60 to 19 tokens out of a total of 323
tokens, a 68% reduction of error. We believe
that the dramatic performance improvement
demonstrates the power of durational knowl-
edge in specific instances where acoustic-
phonetic features are less effective.
In another experiment, the duration
model was fully integrated into an ANGIE-based
wordspotting system. As in our phonetic
recognition experiments, results were obtained
by adding varying degrees of linguistic
constraint. When maximum constraint is
imposed, the duration model improved
performance from 89.3 to 91.6 (FOM), a
relative improvement of 21.5%. The duration
model has been shown to be most effective when the
maximum amount of lexical knowledge is
provided, wherein the model is able to best
take advantage of the various durational
relationships among the components of the
sublexical parse structure. We also believe that
the more complex parse structures available in
the keywords for this task contribute to the
performance of our duration model.
This research has demonstrated success in
employing a complex statistical duration model
in order to improve speech recognition
performance. In particular, we see that
duration is more valuable during word
recognition. We would like to incorporate our duration modeling into a continuous speech recognition system, where significant gains should also be possible.
Reference
[1] G. Chung. Hierarchical Duration Modelling for a Speech Recognition System. S.M. thesis, MIT Department of Electrical Engineering and Computer Science, Cambridge, MA, May 1997.
Discourse Segmentation of Spoken Dialogue: An Empirical Approach
Giovanni Flammia
Empirical research in discourse and
dialogue is instrumental in quantifying
which conventions of human-to-human
language may be applicable for human-to-
machine language [1,2]. This thesis is an
empirical exploration of one aspect of
human-to-human dialogue that can be
applicable to human-to-machine language.
Some linguistic and computational models
assume that human-to-human dialogue can
be modeled as a sequence of segments [3].
Detecting segment boundaries has potential
practical benefits in building spoken
language applications (e.g., designing
effective system dialogue strategies for each
discourse segment and dynamically chang-
ing the system lexicon at segment bound-
aries).
Unfortunately, drawing conclusions
from studying human-to-human conversa-
tion is difficult because spontaneous
dialogue can be quite variable, containing
frequent interruptions, incomplete sen-
tences and unstructured segments. Some of
these variabilities may not contribute
directly to effective communication of
information. The goal of this thesis is to
determine empirically the extent to which
discourse segment boundaries can be
extracted from annotated transcriptions of
spontaneous, natural dialogues in specific
application domains. We seek answers to
three questions. First, is it possible to obtain
consistent annotations from many subjects?
Second, what are the regular vs. irregular
discourse patterns found by the analysis of
the annotated corpus? Third, is it possible
to build discourse segment models auto-
matically from an annotated corpus?
The contributions of this thesis are
twofold. Firstly, we developed and evaluated
the performance of a novel annotation tool
and associated discourse segmentation
instructions. The tool and the instructions
have proven to be instrumental in obtaining
reliable annotations from many subjects.
Our findings indicate that it is possible to
obtain reliable and efficient discourse
segmentation when the task instructions are
specific and the annotators have few degrees
of freedom, i.e., when the annotation task is
limited to choosing among few independent
alternatives. The reliability results are very
competitive with other published work [4].
Secondly, the analysis of the annotated
corpus provides substantial quantitative
evidence about the differences between
human-to-human conversation and current
human-to-machine telephone applications.
Since dialogue annotation can be
extremely time consuming, it is essential
that we develop the necessary tools to
maximize efficiency and consistency. To this
end, we have developed a visual annotation
tool called Nb which has been used for
discourse segmentation in our group and
other institutions.
With the help of Nb, we determined
how reliably human annotators can tag
segments in the dialogue transcriptions of
our corpus. We conducted two experiments
in which the transcriptions have each been
annotated by several people.
To carry out our research, we are
making use of a corpus of orthographically
transcribed and annotated telephone
conversations. The text data are faithful
transcriptions of actual telephone conversa-
tions between customers and telephone
operators collected by BellSouth
Intelliventures and American Airlines in
1994. The first pilot study consisted of 18
dialogues from all the domains of our
corpus, each one annotated by 6 different
coders [5]. The goal of this experiment was
rather exploratory in nature, without particular constraints on where to place discourse segment boundaries. We measured
reliability by recall, precision and the kappa
coefficient. When comparing two different
segmentations of the same text, we alterna-
tively select one as the reference and the
other one as the test. Reliability is best
measured by the kappa coefficient, a
statistical measure which is gaining popular-
ity in computational linguistics because it
measures how much better than chance the observed agreement is [6]. A coefficient of
0.7 or better indicates reliable results. Table
11 summarizes our findings. We found that
without detailed instructions, annotators
agree at the 0.45 reliability level in placing
segment boundaries. In our data, we found
that the kappa coefficient is always less than
the average of precision and recall.
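For concreteness, the kappa coefficient for two coders can be computed as in the following sketch (a standard two-coder formulation, not the thesis implementation):

```python
from collections import Counter

def kappa(labels_a, labels_b):
    """Cohen's kappa for two coders over the same units:
    (P_observed - P_chance) / (1 - P_chance)."""
    n = len(labels_a)
    p_obs = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_chance = sum(freq_a[k] * freq_b[k] for k in freq_a) / n ** 2
    return (p_obs - p_chance) / (1 - p_chance)
```

Applied to two coders' boundary/no-boundary labels over the same units, this yields kappa values of the kind reported in Table 11.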
The analysis of the disagreements of the
first experiment led to a second, more
focused experiment. This other experiment
consisted of 22 dialogues from only one
application, the movies listing domain. Each
dialogue was annotated by 7-9 coders [7].
The instructions defined a segment to be a
section of the dialogue in which the agent
delivers a new piece of information that is
relevant to the task. In addition, the
annotators had to choose among five
different segment purpose labels when
tagging a discourse segment. In that case,
we found that the kappa reliability measure
Table 11. Summary percentage statistics of the two annotation experiments. Average precision and recall are measured across all possible combinations of pairs of coders. The groupwise kappa coefficient is computed from the classification matrix of all the coders. Statistics are computed using as unit of analysis the sentence or the dialogue turn. Typically, a dialogue turn is composed of one to three short sentences.
Experiment            First   Second
Dialogues             18      22
Coders per dialogue   6       7-9

Units: sentences
Precision             57.7    85.0
Recall                61.5    83.9
Kappa                 45.1    82.2

Units: turns
Precision             67.0    85.1
Recall                71.0    84.7
Kappa                 53.7    82.4
Figure 8. Observed frequency of customer's acknowledgments as a function of the preceding agent's dialogue turn duration, measured in words.
in placing segment boundaries is 0.824, and
the accuracy in assigning segment purpose
labels is 80.1%.
To evaluate the feasibility of segmenting
dialogues automatically, we implemented a
simple discourse segment boundary
classifier based on learning classification
rules from lexical features [8]. On average,
the automatic algorithm agrees with the
manually annotated boundaries with 69.4%
recall and 74.5% precision.
Analysis of the movies listing conversations indicates that the customer follows the information reported by the agent with an explicit acknowledgement 84% of the time. We found that the agent delivers
information using shorter rather than
longer sentences. Figure 8 is a cumulative
frequency plot of the length of the agent's
dialogue turn before a customer's
acknowledgement. Most of the time, the
agent does not speak more than 15 words
before the customer responds with an
acknowledgment. After the
acknowledgement, 40% of the time the
information is explicitly confirmed by both
parties with at least two additional dialogue
turns.
Analysis of the annotated segments indicates that the customer is mainly responsible for switching to new topics, and that on average the agent's response is not immediate but instead is preceded by a few clarification turns. Table 12 lists the fraction
of agent vs. customer initiated segments by
topics and the average turn of the agent's
response from the beginning of the seg-
ment.
References
[1] N.O. Bernsen, L. Dybkjaer, and H. Dybkjaer, “Cooperativity in Human-machine and Human-human Spoken Dialogue,” Discourse Processes, Vol. 21, No. 2, pp. 213-236, 1996.
[2] N. Yankelovich, “Using Natural Dialogs as the Basis for Speech Interface Design,” chapter in Automated Spoken Dialog Systems, edited by Susann Luperfoy, MIT Press, 1998.
[3] B. Grosz and C. Sidner, “Attentions, Intentions and the Structure of Discourse,” Computational Linguistics, Vol. 12, No. 3, pp. 175-204, 1986.
[4] M. Walker and J. Moore, editors, “Empirical Studies in Discourse,” Computational Linguistics special issue, Vol. 20, No. 2, 1997.
[5] G. Flammia and V. Zue, “Empirical Evaluation of Human Performance and Agreement in Parsing Discourse Constituents in Spoken Dialogue,” Proc. European Conference on Speech Communication and Technology, pp. 1965-1968, Madrid, Spain, September 1995.
[6] J. Carletta, “Assessing Agreement on Classification Tasks: The Kappa Statistics,” Computational Linguistics, Vol. 22, No. 2, pp. 249-254, 1996.
[7] G. Flammia and V. Zue, “Learning the Structure of Mixed-initiative Dialogues using a Corpus of Annotated Conversations,” Proc. European Conference on Speech Communication and Technology, pp. 1871-1874, Rhodes, Greece, September 1997.
[8] W.W. Cohen, “Fast Effective Rule Induction,” Machine Learning: Proceedings of the 12th International Conference, 1995.
[9] G. Flammia. Corpus-based Discourse Segmentation of Spoken Dialogue. Ph.D. thesis, MIT Department of Electrical Engineering and Computer Science, June 1998.
Topic                 Customer Init.   Agent Init.   Turn of Response
List movies           67.8%            32.2%         4.5
Phone number          87.1%            12.9%         3.7
Show times            76.3%            23.7%         2.9
Where is it playing   79.4%            20.7%         4.0
Table 12. Distribution of segment initiatives by topics and average turn position of response.
Heterogeneous Acoustic Measurements and Multiple Classifiers for Speech Recognition
Andrew Halberstadt
Most automatic speech recognition systems
use a small set of homogeneous acoustic
measurements and a single classifier to
make acoustic-phonetic distinctions. We are
exploring the use of a large set of heteroge-
neous measurements and multiple classifiers
in order to improve phonetic classification.
There are several areas for innovative work
involved in implementing this approach.
First, a variety of acoustic measurements
need to be developed, or selected, from
those proposed in the literature. In the past,
different acoustic measurements have
generally been compared in a winner-takes-
all paradigm in which the goal is to select
the single best measurement set. In contrast
to this approach, we are interested in
making use of complementary information
in different measurement sets. In addition,
measurements have usually been evaluated
for their performance over the entire phone
set. In contrast, in this work, we explore the
notion that high-performance acoustic
measurements may be different across
different phone classes. Thus, heteroge-
neous measurements may be used both
within and across phone classes. Second,
methods for utilizing high-dimensional
acoustic measurement spaces need to be
proposed and developed. This problem will
be addressed through schemes for combin-
ing the results of multiple classifiers.
In the process of developing heteroge-
neous acoustic measurements, we focused
initially on stop consonants because of
evidence that their short-time burst
characteristics and rapidly changing
acoustics were poorly represented by
conventional homogeneous measurements
[1]. A perceptual experiment using only stop
consonants was performed in order to
facilitate comparative analysis of the types of
errors made by humans and machines. The
experiment was designed so that humans
could not make profitable use of
phonotactic or lexical knowledge. Figure 9
provides a summary of the results of these
experiments. The error rates are generally
high because the data set was deliberately
chosen to include some of the most
difficult-to-identify stops in our develop-
ment set. The machine systems are labelled
A, B, C, MMV, D, where A, B, and C are
three different context-independent systems,
MMV (Machine Majority Vote) is a system
that takes the 3-way majority vote answer
from A, B, and C, and D is a context-
dependent system. The perceptual results
from listeners are labelled PA (Perceptual
Average) and PMV (Perceptual Majority
Vote). The place of articulation identifica-
tion by machine is 2.2-11.2 times worse than
humans, whereas the voicing identification
is only 1.1-2.4 times worse. Our conclusion
is that the place of articulation identifica-
tion of automatic systems is an area that
requires significant improvement in order to
approach human-like levels of performance.
The second challenge is to develop
overall system architectures which can make
profitable use of a large number of acoustic
measurements. The fundamental challenge
of high-dimensional input spaces arises
because the quantity of training data
[Figure 9 appears here as three bar charts; percent error by task and system:]

Task                     PA     PMV    A      B      C      MMV    D
Stop identification      28.9   23.1   70.0   56.9   55.3   52.7   38.4
Place identification     6.3    2.2    24.7   22.9   22.2   18.6   14.1
Voicing identification   24.7   21.2   51.8   38.2   37.6   38.2   27.1

Figure 9. Human perception (PA, PMV) versus machine classification (A, B, C, MMV, D) in the tasks of stop identification, place of articulation identification, and voicing identification.
needed to adequately train a classifier grows
exponentially with the input dimensionality.
In one approach to the problem, multiple
classifiers trained from different measure-
ments can be arranged hierarchically. In this
scheme, the hierarchical structure empha-
sizes taking the task of phonetic classifica-
tion and breaking it down into subproblems
such as vowel classification and nasal
classification. Figure 10 illustrates the fact
that most classifier errors remain within the
same manner class, thus supporting the
“subproblem” approach of hierarchical
classification. Roughly speaking, if the first
stage of the hierarchy has high confidence
that a particular token is a nasal, then a
classifier tuned especially for nasals may
perform further processing. In [1], this
approach was developed and used to obtain
79.0% context-independent classification
on the TIMIT core test set. Alternatively,
multiple classifiers may be formed into
“committees”. Each committee member has
some influence on the final selection. In its
simplest form, the final choice could be
determined by popular vote of the classifier
committee. The performance of the MMV
(Machine Majority Vote) system in Figure 9,
which is the result of voting among systems
A, B, and C, and the PMV (Perceptual
Majority Vote) results are examples of
improved performance through the use of
voting. The ideas of classification according
to a hierarchy or by a committee are not
mutually exclusive, but rather can be
combined. Thus, one member of a commit-
tee could be a hierarchical classifier, or there
could be a hierarchy of committees.
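Both combination schemes can be sketched compactly; the fragment below is illustrative only, with hypothetical names standing in for the actual systems:

```python
# Illustrative sketch of committee voting and hierarchical
# classification; hypothetical names, not the actual systems.

from collections import Counter

def majority_vote(predictions):
    """Committee decision by popular vote over member predictions,
    as in the MMV (Machine Majority Vote) system."""
    return Counter(predictions).most_common(1)[0][0]

def hierarchical_classify(token, manner_classifier, specialists):
    """Two-stage hierarchy: decide the manner class first, then defer
    to a classifier tuned for that class (e.g. vowels or nasals)."""
    manner = manner_classifier(token)   # e.g. "nasal"
    return specialists[manner](token)   # nasal-specific phone decision
```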
In the future, we hope to narrow the
gap observed in perceptual experiments
between human and machine performance
in the task of place of articulation identifica-
tion. We plan to continue investigating
heterogeneous measurement sets and
developing a variety of ways of combining
those measurements into classification and
recognition systems.
Reference
[1] A.K. Halberstadt and J.R. Glass, “Heterogeneous Measurements for Phonetic Classification,” Proc. European Conference on Speech Communication and Technology, pp. 401-404, Rhodes, Greece, September 1997.
[Figure 10 appears here: a Reference-versus-Hypothesis bubble plot over the TIMIT phone labels, grouped into vowels/semivowels, nasals/flaps, strong fricatives, weak fricatives, and stops.]

Figure 10. Bubble plot of confusions in phonetic classification on TIMIT development set. Radii are linearly proportional to the error. The largest bubble is 5.2% of the total error.
The Use of Speaker Correlation Information for Automatic Speech Recognition
T. J. Hazen
Typical speech recognition systems perform
much better in speaker dependent (SD)
mode than they do in speaker independent
(SI) mode. This is a result of flaws in the
probabilistic framework and modeling
techniques used by today’s speech
recognizers. In particular, current SI
recognizers typically assume that all acoustic
observations can be considered indepen-
dent of each other. This assumption ignores
within-speaker correlation information which
exists between speech events produced by
the same speaker. Knowledge of the speaker
constraints imposed on the acoustic
realization of an utterance can be extremely
useful for improving the accuracy of a
recognition system.
To describe the problem mathematically, begin by letting $P$ represent a sequence of phonetic units. If $P$ contains $N$ different phones then let it be expressed as:

$$P = \{p_1, p_2, \ldots, p_N\} \qquad (1)$$

Here each $p_n$ represents the identity of one phone in the sequence. Next, let $X$ be a sequence of feature vectors which represent the acoustic information of an utterance. If $X$ contains one feature vector for each phone in $P$ then $X$ can be expressed as:

$$X = \{\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_N\} \qquad (2)$$

Given the above definitions, the probabilistic expression for the acoustic model is given as $p(X \mid P)$.
In order to develop effective and
efficient methods for estimating the acoustic
model likelihood, typical recognition
systems use a variety of simplifying assump-
tions. To begin, the general expression can
be expanded as follows:
$$p(X \mid P) = p(\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_N \mid P) = \prod_{n=1}^{N} p(\vec{x}_n \mid \vec{x}_1, \ldots, \vec{x}_{n-1}, P) \qquad (3)$$
At this point, speech recognition systems
almost universally assume that the acoustic
feature vectors are independent. With this
assumption the acoustic model is expressed
as follows:
$$p(X \mid P) = \prod_{n=1}^{N} p(\vec{x}_n \mid P) \qquad (4)$$
Because this is a standard assumption in most recognition systems, the term $p(\vec{x}_n \mid P)$ will be referred to as the standard acoustic model.
In Equation (3), the likelihood of a particular feature vector is deemed dependent on the observation of all of the feature vectors which have preceded it. In Equation (4), each feature vector $\vec{x}_n$ is treated as an independently drawn observation which is not dependent on any other observations, thus implying that no statistical correlation exists between the observations. What these two equations do not show is the net effect of making the independence assumption. Consider applying Bayes' rule to the conditional probability term in Equation (3). In this case the term can be rewritten as:

$$p(\vec{x}_n \mid \vec{x}_1, \ldots, \vec{x}_{n-1}, P) = p(\vec{x}_n \mid P)\, \frac{p(\vec{x}_1, \ldots, \vec{x}_{n-1} \mid \vec{x}_n, P)}{p(\vec{x}_1, \ldots, \vec{x}_{n-1} \mid P)} \qquad (5)$$
After applying Bayes' rule, the conditional probability expression contained in (3) is rewritten as a product of the standard acoustic model $p(\vec{x}_n \mid P)$ and a probability
ratio, which will be referred to as the consistency
ratio. The consistency ratio is a multiplica-
tive factor which is ignored when the
feature vectors are considered independent.
It represents the contribution of the
correlations which exist between the feature
vectors.
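Multiplying Equation (5) over all $n$ makes the net effect explicit; the restatement below follows directly from Equations (3) and (5):

$$p(X \mid P) = \left[ \prod_{n=1}^{N} p(\vec{x}_n \mid P) \right] \left[ \prod_{n=1}^{N} \frac{p(\vec{x}_1, \ldots, \vec{x}_{n-1} \mid \vec{x}_n, P)}{p(\vec{x}_1, \ldots, \vec{x}_{n-1} \mid P)} \right]$$

The first bracket is the standard acoustic model of Equation (4); the second is the product of consistency ratios that the independence assumption silently sets to one.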
The purpose of this dissertation is to
examine the assumptions and modeling
techniques that are utilized by SI recogni-
tion systems and to propose novel modeling
techniques to account for the speaker
constraints which are typically ignored. To
this end, this thesis has examined two
primary approaches: speaker adaptation and
consistency modeling. The goal of speaker
adaptation is to alter the standard acoustic
models represented by the expression $p(\vec{x}_n \mid P)$ so as to match the current test speaker as
closely as possible. The goal of consistency
modeling is to estimate the contribution of
the consistency ratio, which is typically
ignored when the independence of observa-
tions assumption is made.
Speaker clustering provides one of the
most effective techniques used by speaker
adaptation algorithms. This thesis examines
several different approaches to speaker
clustering. These techniques are reference
speaker weighting, hierarchical speaker
clustering, and speaker cluster weighting.
These methods examine various
approaches for utilizing and
combining acoustic model parameters
trained from different speakers or speaker
clusters. For example, the hierarchical
speaker clustering used in this thesis
examines the use of gender dependent
models as well as gender and speaking rate
dependent models.
Consistency modeling is a novel
recognition technique for accounting for
the correlation information which is
generally ignored when each acoustic
observation is considered independent. The
key idea of consistency modeling is that the
contribution of the consistency ratio must
be estimated. Using several simplifying
assumptions, the estimation of the consis-
tency ratio can be reduced to the problem
of estimating the mutual information
between pairs of acoustic observations.
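Under a joint-Gaussian assumption the mutual information between two observations has a closed form; the sketch below (my illustration for the scalar case, not the thesis implementation) shows one such estimate:

```python
import numpy as np

def gaussian_mutual_information(x, y):
    """Estimate I(X;Y) in nats for two scalar features under a
    joint-Gaussian assumption: I = -0.5 * ln(1 - rho^2), where rho
    is the sample correlation coefficient."""
    rho = np.corrcoef(x, y)[0, 1]
    return -0.5 * np.log(1.0 - rho ** 2)
```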
The various techniques have
been evaluated on the DARPA Resource
Management recognition task [1] using the
SUMMIT speech recognition system [2]. The
algorithms were tested on the task of
instantaneous adaptation. In other words,
the methods attempt to adapt to the same
utterance which the system is trying to
recognize. The results are tabulated in Table
13 with respect to the baseline SI system.
The results include experiments where
speaker adaptation or clustering techniques
are used in conjunction with consistency
modeling in order to combine their
strengths. The results indicate that signifi-
cant performance improvements are
possible when speaker correlation informa-
tion is accounted for within the framework
of a speech recognition system.
Table 13. Summary of recognition results using various instantaneous adaptation techniques including reference speaker weighting (RSW), gender dependent modeling (GD), gender and speaking rate dependent modeling (GRD), speaker cluster weighting (SCW), and consistency modeling (CM).
Adaptation Method   Word Error Rate   Error Rate Reduction
SI                  8.6%              -
SI+RSW              8.0%              6.5%
SI+CM               7.9%              8.2%
SI+RSW+CM           7.7%              10.0%
GD                  7.7%              10.5%
GD+CM               7.1%              17.6%
GRD                 7.2%              16.4%
GRD+CM              6.8%              20.3%
SCW                 6.9%              18.9%
SCW+CM              6.8%              21.1%
References
[1] W. Fisher, “The DARPA Task Domain Speech Recognition Database,” Proc. DARPA Speech Recognition Workshop, pp. 105-109, San Diego, CA, March 1987.
[2] J. Glass, J. Chang, and M. McCandless, “A Probabilistic Framework for Feature-based Speech Recognition,” Proc. International Conference on Spoken Language Processing, pp. 2277-2280, Philadelphia, PA, 1996.
[3] T. Hazen, “A Comparison of Novel Techniques for Instantaneous Speaker Adaptation,” Proc. European Conference on Speech Communication and Technology, pp. 1883-1886, Rhodes, Greece, 1997.
[4] T. Hazen. The Use of Speaker Correlation Information for Automatic Speech Recognition. Ph.D. thesis, MIT Department of Electrical Engineering and Computer Science, January 1998.
The Mole: A Robust Framework for Accessing Information from the World Wide Web
Hyung-Jin Kim
Although many people have labeled the
World Wide Web as the largest database
ever created, very few applications have been
able to use the web as a database. This is
because the web is dynamic: web pages
change constantly, sometimes on a daily
basis. I propose a system called the “Mole”
that aims to solve this problem by providing
a semantic interface into the web. The
semantic interface uses the semantic
content on web pages to map very high-level
concepts, such as “weather reports for
Boston” to low-level requests for data (such
as getting the text in the third ‘A’ tag in a
web page). Therefore, even though web
pages change, the Mole will still be able to
find information on them.
The Mole will robustly access a web
page by taking advantage of the topology of
its underlying HTML. When web pages get
updated, the information that is presented
usually retains the same structure. For
example, when the CNN Weather Data site
changed in November of 1997, its facade
changed, but it still continued to present
the same information. CNN still presented
data about the current conditions of a city
and it still gave a four-day forecast. Further-
more, although the HTML structure of this
new page was drastically different, the
weather information was still grouped in
the same way (i.e. high and low tempera-
tures were still presented next to each
other).
The Mole uses semantic templates to
access information from web pages. In the
weather example, to gather all of the 4-day
forecasts of a city, the template in Figure 11
is used. The Mole takes this template and
matches it to the data on the web page. This
template essentially drills down through
high-level concepts presented on the web
page. First, it finds a “day” word, e.g.,
“Monday” on the web page and then it tries
to find the words “low” and “high” that are
associated with that word. Finally, it finds
the integers that are most closely located to
the words “low” and “high”. Since this
semantic template is abstracting away from
the HTML structure, this template would
have found the same temperature informa-
tion before and after the change (see Figure
12). Notice that this template follows what a
human does to gather the same informa-
tion: first, he searches for a specific day and
then he searches for the temperatures
besides the words “high” and “low”.
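As a concrete illustration of this drilling-down process, the sketch below matches the template against tokenized page text. All names are hypothetical, and the “near” relationship is simplified to “first integer following the keyword”; the actual Mole operates on HTML with its taxonomy and relationship descriptors.

```python
# Illustrative sketch of the Figure 11 template applied to tokenized
# page text; hypothetical names, not the actual Mole code.

DAYS = {"Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"}

def next_integer(tokens, index):
    """Simplification of the 'near' relationship: take the first
    integer token following the keyword."""
    for tok in tokens[index + 1:]:
        if tok.isdigit():
            return int(tok)
    return None

def match_forecasts(tokens):
    """Find a day word, then the 'low' and 'high' words associated
    with it, then the integers located near each."""
    forecasts = []
    for i, tok in enumerate(tokens):
        if tok in DAYS:
            low = high = None
            # Look within a small window after the day word (heuristic).
            for j in range(i + 1, min(i + 12, len(tokens))):
                word = tokens[j].strip(":").lower()
                if word == "low":
                    low = next_integer(tokens, j)
                elif word == "high":
                    high = next_integer(tokens, j)
            forecasts.append((tok, low, high))
    return forecasts
```

For example, match_forecasts("Monday High: 95 Low: 55".split()) returns [("Monday", 55, 95)].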
In order to make use of semantic
templates, the Mole will require the
following facilities: a taxonomy of data
descriptors and a library of relationship
descriptions. A taxonomy of data descrip-
tors is used to describe all possible data or
recognizable features on a web page. In our
weather template, we used the names
“integer” and “day” to describe the data we
are looking for. In order for the Mole to
access many different types of web pages, a
large library of data types needs to be
created. One can imagine extending this
taxonomy to incorporate concepts of
“state”, “country”, and “car_name”. This
taxonomy can be hierarchical in that a
semantic idea can be built on top of other
semantic ideas, making them highly scalable
and re-usable. A library of relationship
Figure 11. Semantic template for CNN weather: a Day node encapsulates Word("low") and Word("high") nodes, each of which is near an Integer node.
descriptors describes all the ways in which
features of a web page can relate to each
other. Descriptors such as “near” and
“on_top_of” are simple examples of
relationship descriptors. More complicated
descriptors include “encapsulate” which not
only define how one datum is positioned
relative to another, but also how the fonts of
each datum are related to each other (words
with large, bolded fonts encapsulate smaller
fonted words following them).
The Mole is potentially a very robust
and simple interface for applications to
access the web. By “lifting” semantic
concepts found on a web page away from
the HTML structure, the Mole will be able
to gather information from web pages even
when these pages change. In many ways,
semantic templates attempt to mimic what a
Figure 12. Mapping of the semantic template to two versions of the weather page (note: not necessarily the CNN weather page).
human does to find information. By using
concepts instead of HTML tags to find
information, the Mole is using web pages as
they were meant to be used: by the human
eye.
[Figure 12 appears here: the Day/Word/Integer template matched against two page layouts, e.g. "Monday High: 95 Low: 55" versus "Monday High: Low: 95 65".]
Sublexical Modelling for Word-Spotting and Speech Recognition using ANGIE
Raymond Lau
In this work, we introduce and explore a
novel framework, ANGIE, for modelling
subword lexical phenomena in speech
recognition. Our framework provides a
flexible and powerful mechanism for
capturing morphology, syllabification,
phonology and other subword effects in a
hierarchical manner which maximizes the
sharing of subword structures. We hope
that such a system can provide a single
unified probabilistic framework for model-
ling phonological variation and morphol-
ogy. Many current systems handle phono-
logical variations either by having a pronun-
ciation graph (such as in MIT's SUMMIT
system) or by implicitly absorbing the
variations into the acoustic modelling. The
former has the disadvantage of not sharing
common subword structure, hence splitting
training data. The latter masks the process
of handling phonological variations and
makes the process difficult to control and to
improve upon. For example, in the ATIS
domain, the words "fly," "flying," "flight,"
and "flights" all share the common initial
phoneme sequence f l ay, so presumably,
phonological variations affecting this
sequence can be better learned if examples
from all four words were pooled together.
Our system does just that. The sharing of
subword structure will hopefully facilitate
the search process and also make it easier to
deal with new, out-of-vocabulary, words. By
pursuing merged common subword theories
during search, we can mitigate the combina-
torial explosion of the search tree, making
large vocabulary recognition more manage-
able. Because we expect new words to share
much common subword structure with
words in our vocabulary, we can easily add
new words dynamically, allowing them to
adopt existing subword structures. In
principle, we can even detect the occurrence
of out-of-vocabulary words by recognizing as
much of the subword structure as we can in
a bottom up manner.
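The pooling idea can be made concrete with a small sketch (hypothetical data layout, not the actual ANGIE implementation): every word contributes its phoneme-prefix observations to shared pools, so "fly," "flying," "flight," and "flights" all train the same f l ay models.

```python
# Sketch of pooling training observations across words with shared
# subword structure; hypothetical data layout, not the ANGIE code.

from collections import defaultdict

LEXICON = {
    "fly":     ["f", "l", "ay"],
    "flying":  ["f", "l", "ay", "ih", "ng"],
    "flight":  ["f", "l", "ay", "t"],
    "flights": ["f", "l", "ay", "t", "s"],
}

def pool_prefix_examples(observations):
    """Group (word, realization) training pairs under every phoneme
    prefix of the word, so words sharing structure share data.
    E.g. pool_prefix_examples(data)[("f", "l", "ay")] collects
    examples from all four words above."""
    pools = defaultdict(list)
    for word, realization in observations:
        phonemes = LEXICON[word]
        for k in range(1, len(phonemes) + 1):
            pools[tuple(phonemes[:k])].append((word, realization))
    return pools
```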
We are using the ANGIE framework to
model constraints at the subword level.
Within our framework, subword structure is
modeled via a context-free grammar and a
probability model. The grammar generates a
layered structure very similar to that
proposed by Meng [1]. An example of an
ANGIE parse tree is shown in Figure 13. Our
work attempts to validate the feasibility of
using the framework for speech recognition
by demonstrating its effectiveness in three
recognition tasks, phonetic recognition,
word-spotting and continuous speech
recognition. We also explore the combina-
tion of ANGIE with a natural language
understanding system, TINA, that is also
based on a context-free grammar and hence
can be more easily integrated into our ANGIE-
based system as compared to a more
traditional recognition framework. Finally,
we conclude with two pilot studies, one
attempting to leverage off the ANGIE
subword structural information for prosodic
modelling and the other exploring the
addition of new words to the recognition
vocabulary in real time.
Our first demonstration of recognition
with ANGIE was a system for forced phone-
mic/phonetic/acoustic alignment and
phonetic recognition as described in greater
detail in [2]. In this system, we perform a
bottom up best first search over possible
phone strings, incorporating the acoustic
score of each phone along with the score of
the best ANGIE parse for the path up to that
phone. Phonetic recognition results
obtained have been promising, with the
ANGIE-based system achieving a 36.1% error
[Figure 13 appears here: an ANGIE parse tree with layers labeled Morphology, Syllabification, Phonemics, and Phonetics.]

Figure 13. A sample parse tree for the phrase "I'm interested."
rate as compared to a phone bigram
baseline system with a 39.8% error rate on
ATIS data. The improvement was due
roughly equally to improved phonological
modelling and the more powerful longer
distance constraints made possible with
ANGIE's upper layers.
Our second demonstration was the
implementation of an ANGIE-based system
for word spotting. Our test case was word
spotting the city names in the ATIS corpus.
We have successfully implemented the
wordspotter with competitive performance.
We have also conducted several experiments
varying the nature of the subword con-
straints on the filler model within the
wordspotter. The constraint sets experi-
mented with ranged from simple phone
bigram, to syllables, to full word recogni-
tion. The results showed that, as expected,
the inclusion of more constraints on the
filler led to improved word-spotting perfor-
mance. On our test set, the system had a
FOM of 89.3 with full word recognition,
87.7 with syllables, and 85.3 with phone
bigrams. Surprisingly, speed tended to
improve with FOM performance. We
believe the explanation is that more
constraints lead to a less bushy search. More
details of our work in word-spotting can be
found in [3].
For our final feasibility test, we have
implemented a continuous speech recogni-
tion system based on the ANGIE framework.
Our recognizer, employing a word bigram,
achieves the same level of performance as
the our SUMMIT baseline system with a word
bigram (18.8% word error rate vs. 18.9%).
In both cases, context-independent acoustic
models were used. Because ANGIE is based
on a context-free grammar framework, we
have experimented with integrating our TINA
natural language understanding system (also based on a context-free framework) with
ANGIE, resulting in a single, coupled search
strategy. The main challenge with the
integrated system was in curtailing the
computational requirements of supporting
robust parsing. We settled on a greedy
strategy described in greater detail in [4].
With the combined system, the word error
rate declines to 14.8%. We have also
attempted TINA resorting of SUMMIT N-best
lists in an effort to separate the benefits of
an integrated search strategy from those of
bringing in the powerful TINA language
model. That experiment yielded only
marginal improvement over the word
bigram, suggesting that the tightly coupled
search can lead to a gain not attainable
when the recognition and NL understand-
ing processes are separated and interfaced
through an N-best list, generated without
the use of information from TINA.
Finally, we conducted two pilot studies
exploring problems for which we believe the
ANGIE-based framework will exhibit advan-
tages. The first pilot study examines the
ability to add new words to the recognition
vocabulary in real time, that is, without
requiring extensive retraining of the lexical
models. We believe that, because of ANGIE's
hierarchical structure, new words added to
the vocabulary can share lexical subword
structures with existing words in the
vocabulary. For this study, we simulated the
appearance of new words by artificially
removing the city names that only appear in
ATIS-3, that is, city names which did not
appear in ATIS-2. These city names were then
considered the new words in our system.
For the baseline comparison, we added the
words to a similarly reduced SUMMIT
recognizer and assigned zero to their lexical
arc weights in the pronunciation graph. In
the ANGIE case, we allowed ANGIE to general-
ize probabilities learned from other words
with similar word substructures. In both
cases, the word level bigram model used a
class bigram, with uniform probabilities
distributed over all city names, including
the simulated new words. Both the baseline
and ANGIE systems achieved the same word
error rate, 19.2%. This represents a slight
decrease from a system trained with full
knowledge of the simulated new words.
Apparently, the lack of lexical training did
not adversely impact recognition perfor-
mance much with our set of simulated new
words. It is unclear whether ANGIE would
show an improvement over the baseline for
a different choice of new words. We do
note, however, that for the artificially
reduced system, without the simulated new
words in the vocabulary, the ANGIE-based
system achieves a 31.2% error rate as
compared to a 34.2% error rate for the
baseline SUMMIT system, suggesting that
ANGIE is more robust in the presence of
unknown words.
For our other pilot study, we attempted
to leverage the word substructure informa-
tion provided by ANGIE for prosodic
modelling. Our experiment, conducted in
conjunction with our colleague Grace
Chung, was to implement a hierarchical
duration model based on the ANGIE parse
tree and to incorporate the duration score
into our recognition search process. We
evaluated the duration model in the context
of our ANGIE-based word-spotting system. Its
inclusion increased the FOM from 89.3 to
91.6, leading us to conclude that the ANGIE
subword structure information can indeed
be used for improved prosodic modelling,
minimally in terms of duration.
We believe that our work demonstrates
the feasibility of using ANGIE as a competitive
lexical modelling framework for various
speech recognition systems. Our experience
with word-spotting shows that ANGIE
provides a platform where it is easy to alter
subword constraints. Our success at NL
integration for improved recognition
suggests that a context-free framework has
several advantages. Finally, our pilot study
in prosodic modelling suggests that ANGIE's
subword structuring information can be
leveraged to provide improved performance.
References
[1] H.M. Meng. Phonological Parsing for Bi-directional Letter-to-Sound/Sound-to-Letter Generation. Ph.D. thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, June 1995.
[2] S. Seneff, R. Lau, and H. Meng, “ANGIE: A New Framework for Speech Analysis Based on Morpho-Phonological Modelling,” Proc. ICSLP '96, Philadelphia, PA, pp. 225-228, October 1996. (Available online at http://www.raylau.com/icslp96_angie.pdf)
[3] R. Lau and S. Seneff, “Providing Sublexical Constraints for Word Spotting within the ANGIE Framework,” Proc. Eurospeech '97, Rhodes, Greece, pp. 263-266, September 1997. (Available online at http://www.raylau.com/angie/eurospeech97/main.pdf)
[4] R. Lau. Subword Lexical Modelling for Speech Recognition. Ph.D. thesis, MIT Department of Electrical Engineering and Computer Science, Cambridge, MA, May 1998. (Available online at http://www.raylau.com/thesis/thesis.pdf)
Probabilistic Segmentation for Segment-Based Speech Recognition
Steven Lee
The objective of this research is to develop a
high-quality real-time probabilistic segmenta-
tion algorithm for use with SUMMIT, a
segment-based speech recognition system
[1]. Until recently, SUMMIT used a segmenta-
tion algorithm based on acoustic change.
This algorithm was adequate, but produced
segment graphs that were denser than
necessary because a low acoustic change
threshold was needed to ensure segment
boundaries not marked by sharp acoustic
change were also included. Recently, Chang
developed an approach to segmentation that
uses a Viterbi and a backwards A* search to
produce a phonetic graph in the same
manner as word graph production [2, 3].
This algorithm achieved an 11.4% decrease in
phonetic recognition error rate while
hypothesizing half the number of segments
of the acoustic segmentation algorithm.
While the results of this approach are
promising, it has two drawbacks that keep it
from widespread use in practical speech
recognition systems. The first is that the
algorithm cannot run in real-time because it
requires a complete forward Viterbi search
followed by a backward A* search. The
second is that the algorithm requires
enormous computational power since the
search is performed at the frame level. This
research seeks to develop a search algorithm
that produces a segment network in a
pipelined, left-to-right mode. It also aims to
reduce the computational requirements.
The approach being adopted in this
research is to introduce a simplified search
framework and to shrink the search space.
The new search framework, a frame-based
Viterbi search that does not utilize a
segment graph, is attractive for probabilistic
segmentation because of its simplicity and
its relatively low computational require-
ments. Although work on using this search
to produce a segment graph is ongoing,
preliminary results using this search on
phonetic recognition resulted in a competi-
tive error rate of 30.3% [4]. Since recogni-
tion performance should be somewhat
correlated to the quality of the segment
graph produced, this is a promising result.
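The attraction of this framework is how little machinery it needs. A minimal frame-level Viterbi sketch follows (illustrative only; hypothetical score arrays, no pruning, and unit indices standing in for the real lexical units). Segment boundaries fall where the best label changes between frames:

```python
import numpy as np

def viterbi_labels(log_obs, log_trans):
    """Frame-based Viterbi sketch without a segment graph.

    log_obs   -- (T, U) array: log p(frame_t | unit_u)
    log_trans -- (U, U) array: log transition score from unit i to j
    Returns the best unit index per frame; boundaries fall where
    the label changes between consecutive frames.
    """
    T, U = log_obs.shape
    delta = np.empty((T, U))            # best path score ending in unit u
    back = np.zeros((T, U), dtype=int)  # backpointers
    delta[0] = log_obs[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans   # (U, U): i -> j
        back[t] = np.argmax(scores, axis=0)
        delta[t] = scores[back[t], np.arange(U)] + log_obs[t]
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):        # trace back from the final frame
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```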
The size of the search space in probabi-
listic segmentation is bounded by time in
one dimension and by the number of
phonetic units in another dimension. Both
dimensions can be shrunk to provide
computational savings. This research will
investigate shrinking the time dimension by
using landmarks instead of frames. It will
also investigate the use of broad classes to
shrink the search space along the lexical
dimension.
The domains being used for this work
are TIMIT and JUPITER [5], a telephone-
based weather information domain.
References
[1] J. Glass, J. Chang, and M. McCandless, “A Probabilistic Framework for Feature-based Speech Recognition,” Proc. International Conference on Spoken Language Processing, pp. 2277-2280, Philadelphia, PA, October 1996.
[2] J. Chang and J. Glass, “Segmentation and Modeling in Segment-based Recognition,” Proc. European Conference on Speech Communication and Technology, pp. 1199-1202, Rhodes, Greece, September 1997.
[3] I. Hetherington, M. Phillips, J. Glass, and V. Zue, “A* Word Network Search for Continuous Speech Recognition,” Proc. European Conference on Speech Communication and Technology, pp. 1533-1536, Berlin, Germany, September 1993.
[4] S. Lee. Probabilistic Segmentation for Segment-based Speech Recognition. M.Eng. thesis, MIT Department of Electrical Engineering and Computer Science, June 1998.
[5] V. Zue et al., “From Interface to Content: Translingual Access and Delivery of On-line Information,” Proc. European Conference on Speech Communication and Technology, pp. 2227-2230, Rhodes, Greece, September 1997.
A Model for Interactive Computation: Applications to Speech Research
Michael McCandless
Although interactive tools are extremely
valuable for progress in speech research, the
programming techniques required to
implement them are often difficult to
master and apply. There are numerous
interface toolkits which facilitate implemen-
tation of the user-interface, but these tools
still require the programmer to build the
tool’s back end by hand. The goal of this
research is to create a programming
environment which simplifies the process of
building interactive tools by automating the
computational details of providing
interactivity.
Interactive tools engage their users in a
dialogue, effectively allowing the user to ask
questions and receive answers. Questions
are typically asked by interacting with the
tool’s interface via direct manipulation. I
propose a set of metrics which may be used
to measure the extent of a tool’s
interactivity: rapid response (does the tool
answer the user’s question as quickly as
possible); high coverage (is the user able to
ask a wide range of questions); adaptability
(does the tool adapt to varying computation
environments); scalability (can the tool
manage both large and small inputs);
pipelining (does the tool provide the answer
in pieces over time for computations that
take a long time); backgrounding (is the
user able to ask other questions while an
answer is being computed). I refer to a tool
which can meet these stringent require-
ments as a “finely interactive tool”. These
dimensions provide metrics for measuring
and comparing the interactivity of different
tools.
Based on these requirements for
interactivity, I have designed a declarative
computation model for specifying and
implementing interactive computation. In
order to evaluate the effectiveness of the
model, I have incorporated it into a speech
toolkit called MUSE [1,2]. MUSE contains
numerous components allowing a program-
mer to quickly construct finely interactive
tools. MUSE is implemented in the Python
programming language with extensions in
C. A Python interface to the Tk widget set is
used for interface design and layout.
The programmer specifies computation
in MUSE differently from existing impera-
tive programming languages. Like existing
languages, the programmer builds a MUSE
program by applying functions to strongly-
typed values. However, in MUSE, the
programmer does not have detailed control
over when the computations actually take
place, nor over when and where intermedi-
ate results are stored; instead, the program-
mer declares the functional relationships
among a collection of MUSE values. The
MUSE system records these relationships,
constructs a run-time acyclic dependency
graph, and then chooses when to compute
which values.
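As a rough sketch of this style of programming (the names below are hypothetical, not MUSE's actual API), the programmer declares relationships and leaves all scheduling to the system:

```python
class Value:
    """A node in the dependency graph: a source or a derived value."""
    def __init__(self, fn=None, inputs=()):
        self.fn = fn                    # None for source values
        self.inputs = list(inputs)
        self.dependents = []            # maintained by the system
        for inp in self.inputs:
            inp.dependents.append(self)

def apply_fn(fn, *inputs):
    """Declare that a new value is fn of the given inputs."""
    return Value(fn, inputs)

# Stand-in computations; the programmer never says *when* these run.
def compute_stft(wave): return [[abs(x)] for x in wave]
def render_image(stft): return len(stft)

waveform = Value()                       # a source value
stft = apply_fn(compute_stft, waveform)  # declared, not yet computed
image = apply_fn(render_image, stft)
```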
MUSE’s data-types also differ from
existing programming languages. The
specification for each data-type, for example
a waveform, image or graph, includes
provisions for incremental change: every
data-type is allowed to change in certain
ways. For example, images may change by
replacing the set of pixels within a specified
rectangular area. When a value changes at
run-time, MUSE will consult the depen-
dency graph, and will then take the
necessary steps to bring all dependents of
that value up to date with the new change.
These unique properties of MUSE free the
programmer from dealing with many of the
complex computational aspects of providing
interactivity.
Because the programmer relinquishes
control over the details of how values are
computed, the MUSE run-time system must
make such choices. While there are many
ways to implement this, the technique used
by MUSE is based on purely lazy evaluation
plus caching. When values are changed, a
synchronous depth-first search is performed,
notifying all impacted values of the change.
Values are computed entirely on-demand,
and are then cached away according to the
program. For example, if the user is looking
at a spectrogram, only the portion of the
image they are actually looking at will be
computed, which requires a certain range of
the STFT, which in turn requires only a
certain range of the input waveform.
This implementation choice affects all
of the built-in functions; the implementa-
tion of these functions, in both Python and
C, must “force” the evaluation of any inputs
that they need, but only in response to their
output being forced. Further, any incremen-
tal change on an input to the function must
be propagated as an incremental change on
the function’s output.
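Extending the sketch above, lazy evaluation plus caching, with synchronous depth-first change notification, might be rendered minimally as follows (again hypothetical, not MUSE's implementation):

```python
class LazyValue:
    """A dependency-graph node computed on demand and cached."""
    def __init__(self, fn=None, inputs=()):
        self.fn, self.inputs = fn, list(inputs)
        self.dependents, self.cache, self.dirty = [], None, True
        for inp in self.inputs:
            inp.dependents.append(self)

    def set(self, data):
        """Change a source value; synchronously notify dependents."""
        self.cache, self.dirty = data, False
        for dep in self.dependents:
            dep.invalidate()

    def invalidate(self):
        """Depth-first notification that an input has changed."""
        if not self.dirty:
            self.dirty = True
            for dep in self.dependents:
                dep.invalidate()

    def force(self):
        """Compute entirely on demand, forcing inputs only as needed."""
        if self.dirty:
            self.cache = self.fn(*(inp.force() for inp in self.inputs))
            self.dirty = False
        return self.cache

wave = LazyValue()
stft = LazyValue(lambda w: [x * x for x in w], inputs=[wave])  # stand-in STFT
wave.set([1, 2, 3])
print(stft.force())        # [1, 4, 9] -- computed now, then cached
wave.set([2, 3, 4])        # marks stft dirty; nothing recomputed until forced
```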
In order to effectively test the
interactivity of MUSE, I have added many
necessary speech functions and data-types.
The functions include waveform
preemphasis, (short-time) Fourier trans-
forms, linear-predictive analysis, Cepstral
analysis, energy, word-spotting lexical access, and mixture diagonal Gaussian training. The
datatypes include waveforms, spectra,
graphs, tracks, time marks and cursors, and
images. Each data-type has an associated
cache which the programmer may use to
easily control the extent of storage of
intermediate results, as well as a visual
function, which translates the data-type into
an appropriate image.
I have constructed four example tools which illustrate the unique capabilities of the MUSE toolkit.

Figure 14. A screen shot of an interactive lexical access tool. The user is able to edit the phonetic transcription, and with each change the word transcription is updated in real time to reflect the allowed word alignments according to the TIMIT pronunciation lexicon. The tool demonstrates the unique nature of MUSE's incremental computation model.

The first tool is a basic
speech analysis tool showing a waveform,
spectrogram and transcription, which allows
the user to modify the alignment of
individual frames of the STFT by directly
editing time marks, and then see the impact
on the spectrogram image. The second tool
displays three overlaid spectral slices (FFT,
LPC, and CEPSTRUM), and allows the
user to change all aspects of the computa-
tion. The third tool illustrates the process of
training a diagonal Gaussian mixture model
on one-dimensional data, allowing the user
to vary many of the parameters affecting the
training process. The final tool is a lexical
access tool, allowing the user to phonetically
transcribe an utterance and then see the
corresponding potential word matches.
Figure 14 shows a screen-shot of this tool.
The properties of MUSE’s incremental
computation model are reflected in the high
degree of interactivity each of these tools
offers the user; MUSE’s run-time model is
able to effectively carry out the require-
ments of interactivity.
References
[1] M. McCandless and J. Glass, "MUSE: A Scripting Language for the Development of Interactive Speech Analysis and Recognition Tools," Proc. European Conference on Speech Communication and Technology, Rhodes, Greece, September 1997.
[2] M. McCandless, A Model for Interactive Computation: Applications to Speech Research, Ph.D. thesis, MIT Department of Electrical Engineering and Computer Science, June 1998.
Subword Approaches to Spoken Document Retrieval
Kenney Ng
As the amount of accessible data continues
to grow, the need for automatic methods to
process, organize, and analyze this data and
present it in human usable form has
become increasingly important. Of particu-
lar interest is the problem of efficiently
finding “interesting” pieces of information
from the growing collections and streams of
data. Much research has been done on the
problem of selecting “relevant” items from
large collections of text documents given a
query or request from a user. Only recently
has there been work addressing the retrieval
of information from other media such as
images, video, audio, and speech. Given the
growing amounts of spoken language data,
such as recorded speech messages and radio
and television broadcasts, the development
of automatic methods to index, organize,
and retrieve spoken documents will become
more important.
In our work, we are investigating the
feasibility of using subword unit indexing
terms for spoken document retrieval as an
alternative to words generated by either
keyword spotting or word recognition. The
investigation is motivated by the observa-
tion that word-based retrieval approaches
face the problem of either having to know
the keywords to search for a priori, or
requiring a very large recognition vocabu-
lary in order to cover the contents of
growing and diverse message collections.
The use of subword units in the recognizer
constrains the size of the vocabulary needed
to cover the language; and the use of
subword unit indexing terms allows for the
detection of new user-specified query terms
during retrieval.
We explore a range of subword unit
indexing terms of varying complexity
derived from phonetic transcriptions. The
basic underlying unit is the phone; more
and less complex units are derived by
varying the level of detail and the sequence
length of these units. Labels of the units
range from specific phones to broad
phonetic classes obtained via hierarchical
clustering. Automatically derived fixed- and
variable-length sequences ranging from one
to six units long are examined. Also,
sequences with and without overlap are
explored. In generating the subword units,
each message/query is treated as one long
phone sequence with no word or sentence
boundary information.
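A minimal sketch of how such indexing terms might be generated from a phone sequence (the term format and phone labels are illustrative):

```python
def subword_terms(phones, n=3, overlap=True):
    """Fixed-length phone-sequence indexing terms. With overlap, a
    sliding window is used; without, consecutive disjoint chunks."""
    step = 1 if overlap else n
    return ["_".join(phones[i:i + n])
            for i in range(0, len(phones) - n + 1, step)]

phones = "b aa s t ax n w eh dh er".split()       # toy phone stream
print(subword_terms(phones, n=3))                 # overlapping terms
print(subword_terms(phones, n=3, overlap=False))  # non-overlapping terms
```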
The speech data used in this work
consists of recorded FM radio broadcasts of
the NPR “Morning Edition” news show.
The training set for the speech recognizer
consists of 2.5 hours of clean speech from 5
shows while the development set consists of
one hour of data from one show. The
spoken document collection is made up of
12 hours of speech from 16 shows parti-
tioned into 384 separate news stories. In
addition, a set of 50 natural language text
queries and associated relevance judgments
on the message collection are created to
support the retrieval experiments.
Phonetic recognition of the data is
performed with the MIT SUMMIT speech
recognizer. It is a probabilistic segment-
based approach that uses context-indepen-
dent segment and context-dependent
boundary acoustic models. A two-pass
search strategy is used during recognition. A
forward Viterbi search is performed using a
statistical bigram language model followed
by a backwards A* search using a higher
order statistical n-gram language model.
Information retrieval is done using a
standard vector space approach. In this
model, the documents and queries are
represented as vectors where each compo-
nent is an indexing term. The terms are
weighted based on the term’s occurrence
statistics both within the document and
across the collection. A normalized inner
product similarity measure between
document and query vectors is used to score
and rank the documents during retrieval.
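A compact sketch of this retrieval model follows; standard tf-idf weighting stands in for whatever particular weighting scheme was used, and the terms are toy examples:

```python
import math
from collections import Counter

def weight(terms, df, n_docs):
    """Weight each term by its within-document count and its rarity
    across the collection (a standard tf-idf form)."""
    tf = Counter(terms)
    return {t: tf[t] * math.log(n_docs / df[t]) for t in tf if t in df}

def score(query_vec, doc_vec):
    """Normalized inner product between query and document vectors."""
    dot = sum(w * doc_vec.get(t, 0.0) for t, w in query_vec.items())
    nq = math.sqrt(sum(w * w for w in query_vec.values()))
    nd = math.sqrt(sum(w * w for w in doc_vec.values()))
    return dot / (nq * nd) if nq and nd else 0.0

docs = {"d1": ["b_aa_s", "aa_s_t", "s_t_ax"], "d2": ["w_eh_dh", "eh_dh_er"]}
df = Counter(t for terms in docs.values() for t in set(terms))
vecs = {d: weight(t, df, len(docs)) for d, t in docs.items()}
q = weight(["b_aa_s", "aa_s_t"], df, len(docs))
print(sorted(vecs, key=lambda d: score(q, vecs[d]), reverse=True))  # ['d1', 'd2']
```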
We perform a series of experiments to
measure the ability of the different subword
units to perform effective spoken document
retrieval. A baseline text retrieval run is
performed using word-level text transcrip-
tions of the spoken documents and queries.
This is equivalent to using a perfect word
recognizer to transcribe the speech messages
followed by a full-text retrieval system.
An upper bound on the performance of
the different subword unit indexing terms is
obtained by running retrieval experiments
using phonetic expansions of the words in
the messages and queries obtained via a
pronunciation dictionary. We find that
many of the subword unit indexing terms
are able to capture enough information to
perform effective retrieval. With the
appropriate subword units it is possible to
achieve performance comparable to that of
text-based word units if the underlying
phonetic units are recognized correctly.
We next examine the retrieval perfor-
mance of the subword unit indexing terms
derived from errorful phonetic transcrip-
tions created by running the phonetic
recognizer on the entire spoken document
collection. From this experiment, we find
that although performance is worse for all
units when there are phonetic recognition
errors, some subword units can still give
reasonable performance even before the use
of any error compensation techniques such
as approximate term matching.
We then attempt to improve retrieval
performance by exploring “robust” indexing
and retrieval approaches which take into
account and try to compensate for the
speech recognition errors introduced into
the spoken document collection. We look at
two approaches. One involves modifying the
query representation to include additional
approximate match terms; the main idea is
to include terms that are likely to be
confused with the original query terms. The
other approach is to modify the speech
document representation by expanding
them to include high scoring recognition
alternatives; the goal is to increase the
chance of including the correct hypothesis.
We find that both approaches are able to
help improve retrieval performance.
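The query-expansion idea can be sketched as follows; the confusion table here is invented for illustration, whereas in practice it would be derived from recognizer confusion statistics:

```python
# Hypothetical confusion table: term -> [(confusable term, likelihood)].
CONFUSABLE = {"s_t_ax": [("s_t_en", 0.3)], "b_aa_s": [("p_aa_s", 0.2)]}

def expand_query(query_vec, min_conf=0.1):
    """Add approximate-match terms, down-weighted by how likely the
    recognizer is to confuse them with the original query terms."""
    out = dict(query_vec)
    for term, w in query_vec.items():
        for alt, p in CONFUSABLE.get(term, []):
            if p >= min_conf:
                out[alt] = max(out.get(alt, 0.0), w * p)
    return out

print(expand_query({"s_t_ax": 1.0, "b_aa_s": 0.8}))
```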
Our results indicate that subword-based
approaches to spoken document retrieval
are feasible and merit further research. In
terms of current and future work, we are
expanding the corpus to include more
speech for both recognizer training and the
speech message collection; exploring ways to
improve the performance of the phonetic
recognizer; and investigating more sophisti-
cated robust indexing and retrieval methods
in an effort to improve retrieval perfor-
mance when there are recognition errors.
References
[1] K. Ng and V. Zue, "Subword Unit Representations for Spoken Document Retrieval," Proc. European Conference on Speech Communication and Technology, pp. 1607-1610, Rhodes, Greece, September 1997.
[2] K. Ng and V. Zue, "An Investigation of Subword Unit Representations for Spoken Document Retrieval," Proc. ACM SIGIR Conference, p. 139, Philadelphia, PA, July 1997.
A Semi-Automatic System for the Syllabification and Stress Assignment of Large Lexicons
Aarati Parmar
Sub-word modelling, which includes
morphology, syllabification, stress, and
phonemes, has been shown to improve
performance in certain speech applications
[1]. This observation has motivated us to
attempt to formally define a convention for
a set of syllable-sized units, intended to
capture these sub-word level realizations in
words in the English language, through a
two-tiered approach. The assumption is that
words can be represented as sequences of
units we call “morphs,” which capture
explicitly both the pronunciation and the
orthography. Each morph unit has a
carefully constructed label and a lexical
entry that provides its canonic phonemic
realization. Each word is entered into a
word lexicon decomposed into its appropri-
ate morph sequence. Thus, for example, the
word “contentiously” would be represented
as "con- ten+ -tious =ly", with the markers "-", "+", and "=" coding for morphological
categories such as prefix, stressed root, and
derivational/inflectional suffix. It is our
hope that all words of English can be
represented in terms of a reasonably small,
closed set of these morph units.
This thesis introduces a new semi-
automatic procedure for acquiring a
representation of a large corpus of words in
terms of morphs. Morph transcription, as
we have defined it, is a considerably more
difficult task than phonetic or phonemic
transcriptions, simply because constraints
have to be satisfied on more than one level.
Morphs with similar spellings but different
pronunciations must be distinguished
through selected capital letters, as in the
examples “com+” (/k!/ /aa+/ /m/) in
“combat” and “cOm+” (/k!/ /ah+/ /m/) in
“comfort.” The letters of the morph
spellings for a given word must, if
lowercased and concatenated, realize a
correct spelling of the word. Syllabification
must be correctly marked, and the phone-
mic transcription obtained by replacing the
morph units with their phonemic realiza-
tions must be accurate.
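The spelling constraint, at least, is mechanical to check. A minimal sketch, abbreviated from the marker conventions described above:

```python
def spells_word(morphs, word):
    """Check that the morph letters, with markers stripped and
    capitals lowercased, concatenate to the word's spelling."""
    letters = "".join(m.strip("-+=").lower() for m in morphs)
    return letters == word.lower()

print(spells_word(["con-", "ten+", "-tious", "=ly"], "contentiously"))  # True
print(spells_word(["cOm+", "-fort"], "comfort"))                        # True
```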
We would like to know if our represen-
tation is extensible, and if it is possible to
automatically or semi-automatically extract
these sub-lexical units from large corpora of
words with associated phonetic transcrip-
tions. Thus we have devised a procedure
that can hopefully propose morph decom-
positions accurately and efficiently. We have
evaluated the procedure on two corpora,
and have also assessed how appropriate the
morph concept is as a basic unit for
capturing sub-lexical constraints.
We used the ANGIE formalism to
generate and test our morphs. ANGIE is a
system that can parse either spellings or
phonetics into a probabilistic hierarchical
framework. We decided to develop our
procedure based on a medium-sized corpus
known as TIMIT. We began with a grammar
that had been developed and trained on a
corpus we call “ABH” (a combination
including the ATIS vocabulary, a subset of
the 10,000 most frequent words of the
Brown corpus, and the Harvard List
lexicon). We then applied the knowledge we
had gained from ABH, both with and
without that derived from the TIMIT
experiment, to the much larger COMLEX
lexicon (omitting proper nouns and
abbreviations). In this way we tested how
well a set of morphs derived from a seed
lexicon can be applied to a much larger set
of some 30,000 words. If morphs are a good
representation, then good coverage should
be attainable.
Our procedure was to first parse, in
recognition mode, the letters of all the new
words in a corpus to be absorbed, using a
letter-terminal grammar trained on the seed
ABH corpus. This yielded a set of hypoth-
esized phoneme sequences and/or morph
sequences for each word, which could then
be verified or rejected by parsing in phone
mode, using the phonetic transcription
provided by the corpus as established phone
terminals, along with a phone-to-phoneme
grammar that defines the mappings from
the conventions of the corpus to ANGIE’s
conventions. By enforcing morph con-
straints as well, we obtained further
constraint than if we just used the phone-
mic knowledge.
We have some encouraging signs that
our set of morphs is large enough to
encompass most or all English words,
particularly if we allow novel stressed roots
to be “invented” by decomposing them into
a confirmed onset and rhyme. In our
experiments, even without “invented”
stressed roots, we determined that coverage
of TIMIT was about 89%, and for
COMLEX it was about 94%. The parse
coverage of our procedure is quite good,
considering the large size of the COMLEX
corpus. The accuracy of the morphological
decompositions is reasonable as well.
According to an informal evaluation,
morphological decompositions of words in
TIMIT that pass through both letter and
phone parsing steps have a 78% probability
of matching exactly the expert transcription.
Of course this metric does not take into
account alternate decompositions which
may also be correct, or more consistent with
one another than the human-generated
ones.
We performed an analysis and compari-
son of the experiments performed on
TIMIT and COMLEX. The topics covered
include degree of constraint, hand-written
versus automatic rules, and consistency of
morphological decompositions. Constraint
can be measured by the average number of
alternate morphological decompositions per
word. The average number of morphs
generated from the letter parsing step is
about three, for both TIMIT and
COMLEX. After parsing with phones, the
figure drops to 1.1 for TIMIT and to 1.7 for
COMLEX. Automatically derived rules (for
the mapping from ANGIE’s phoneme
conventions to the phonetic conventions of
the corpus) provide a quick alternative to
hand-written rules, with greater coverage,
but at a price of some performance loss.
Morphological decompositions produced by
our procedure also appear to be self-
consistent.
We have also developed a new analysis
tool to simplify the task of labelling words
for morph transcriptions. This tool aids the
transcriber by providing easy access to many
different sources of knowledge, via a
sophisticated graphical interface. It can be
used to efficiently repair errors obtained in
the automatic parsing procedure.
A significant outcome of this thesis is a
much larger inventory of the possible
morphs of English, and a much larger
lexicon of words decomposed into these
morph units. These resources should serve
us well in future experiments in letter-to-
sound/sound-to-letter generation, for the
automatic acquisition of pronunciations for
new words. They should also be useful for
the automatic acquisition of vocabularies
for speech recognition tasks using ANGIE,
and for other experiments, e.g., in prosodic
analysis, where syllable decomposition may
be important.
Reference
[1] R. Lau and S. Seneff, "Providing Sublexical Constraints for Word Spotting within the ANGIE Framework," Proc. European Conference on Speech Communication and Technology, pp. 263-266, Rhodes, Greece, September 1997.
A Segment-Based Speaker Verification System Using SUMMIT
Sridevi Sarma
This thesis describes the development of a
segment-based speaker verification system
and explores two computationally efficient
techniques. Our investigation is motivated
by past observations that speaker-specific
cues may manifest themselves differently
depending on the manner of articulation of
the phonemes. By treating the speech signal
as a concatenation of phone-sized units, one
may be able to capitalize on measurements
for such units more readily. A potential side
benefit of such an approach is that one may
be able to achieve good performance with
unit (i.e., phonetic inventory) and feature
sizes that are smaller than what would
normally be required for a frame-based
system, thus deriving the benefit of reduced
computation.
To carry out our investigation, we
started with the segment-based speech
recognition system developed in our group
called SUMMIT [1,2], and modified it to suit
our needs. The speech signal was first
transformed into a hierarchical segment
network using frame-based measurements.
Next, acoustic models for each speaker were
developed for a small set of six phoneme
broad classes. The models represented
feature statistics with diagonal Gaussians,
which characterized the principal compo-
nents of the feature set. The feature vector
included averages of MFCCs 1-14, plus
three prosodic measurements: energy,
fundamental frequency (F0), and duration.
To facilitate a comparison with previ-
ously reported work [3,4,5], our speaker
verification experiments were carried out
using 2 sets of 100 speakers from the TIMIT
corpus. Each speaker-specific model was
developed from the eight SI and SX
sentences. Verification was performed using
the two SA sentences common to all
speakers. To classify a speaker, a Viterbi
forced alignment was determined for each
test utterance, and the forced alignment
score of the purported speaker was com-
pared with those obtained with the models
of the speaker’s competitors. These scores
were then rank ordered and the user was
accepted if his/her model’s score was within
the top N of 100 scores, where N is a
parameter we varied in our experiments. To
test for false acceptance, we used every other
speaker in the system as impostors.
Ideally, the purported speaker’s score
should be compared to scores of every other
system user. However, computation
becomes expensive as more users are added
to the system. To reduce the computation,
we adopted a procedure in which the score
for the purported speaker is compared only
to scores of a cohort set consisting of a small
set of acoustically similar speakers. These
scores were then rank ordered as before, and the user was accepted if his/her model's score was within the top N. To
test for false acceptance, we used only the
members of a speaker’s cohort set as
impostors.
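The acceptance rule itself is simple; a minimal sketch, assuming higher alignment scores are better:

```python
def accept(claimed_score, cohort_scores, top_n):
    """Rank the purported speaker's score against the cohort scores;
    accept if it falls within the top N (higher score assumed better)."""
    ranked = sorted([claimed_score] + cohort_scores, reverse=True)
    return ranked.index(claimed_score) < top_n

print(accept(-41.2, [-40.0, -42.5, -43.1], top_n=2))  # True: ranks second
```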
In addition to using cohort normaliza-
tion to reduce computation, we determined
the size and content of the feature
vector through a greedy algorithm opti-
mized on overall speaker verification
performance. Fewer features allow for fewer
parameters to be estimated during training,
and fewer scores to be computed during
testing.
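The greedy search over features can be sketched as follows, with `evaluate` standing in for a full train-and-verify cycle that returns, e.g., the equal error rate on a development set:

```python
def greedy_select(candidates, evaluate, max_features):
    """Grow the feature set greedily: at each step add the feature
    whose inclusion gives the lowest error; stop when nothing helps."""
    selected, remaining, best_err = [], list(candidates), float("inf")
    while remaining and len(selected) < max_features:
        errs = {f: evaluate(selected + [f]) for f in remaining}
        f_best = min(errs, key=errs.get)
        if errs[f_best] >= best_err:    # no candidate improves performance
            break
        selected.append(f_best)
        remaining.remove(f_best)
        best_err = errs[f_best]
    return selected, best_err
```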
We were able to achieve a performance
of 0% equal error rate (EER) on clean data
and 8.36% EER on noisy telephone data,
with a simple system design. Thus we show
that a segment-based approach to speaker
verification is viable, competitive and
efficient. Cohort normalization and
conducting a feature search to reduce
dimensions minimally affect performance
and are useful when computation is
prohibitive.
References
[1] V. Zue, J. Glass, M. Phillips, and S. Seneff, "Acoustic Segmentation and Phonetic Classification in the SUMMIT Speech Recognition System," Proc. International Conference on Acoustics, Speech, and Signal Processing, pp. 389-392, Glasgow, Scotland, May 1989.
[2] V. Zue, J. Glass, M. Phillips, and S. Seneff, "The SUMMIT Speech Recognition System: Phonological Modeling and Lexical Access," Proc. International Conference on Acoustics, Speech, and Signal Processing, pp. 49-52, Albuquerque, NM, April 1990.
[3] L. Lamel and J.L. Gauvain, "A Phone-based Approach to Non-linguistic Speech Feature Identification," Computer Speech and Language, pp. 87-103, 1995.
[4] Y. Bennani, "Speaker Identification Through Modular Connectionist Architecture: Evaluation on the TIMIT Database," Proc. International Conference on Spoken Language Processing, pp. 607-610, Banff, Alberta, 1992.
[5] D. Reynolds, "Speaker Identification and Verification Using Gaussian Mixture Speaker Models," Speech Communication, Vol. 17, No. 1, pp. 91-108, August 1995.
Context-Dependent Modelling in a Segment-Based Speech Recognition System
Benjamin Serridge
Modern speech recognition systems typically
classify speech into sub-word units that
loosely correspond to phonemes. These
phonetic units are, at least in theory,
independent of task and vocabulary, and
because they constitute a small set, each one
can be well-trained with a reasonable
amount of data. In practice, however, the
acoustic realization of a phoneme varies
greatly depending on its context, and speech
recognition systems can benefit by choosing
units that more explicitly model such
contextual effects.
The goal of this research was to explore
various strategies for incorporating contex-
tual information into a segment-based
speech recognition system, while maintain-
ing computational costs at a level acceptable
for implementation in a real-time system.
The latter was achieved by using context-
independent models in the search, while
context-dependent models are reserved for
re-scoring the hypotheses proposed by the
context-independent system.
Within this framework, several types of
context-dependent sub-word units were
evaluated, including word-dependent,
biphone, and triphone phonetic units. In
each case, deleted interpolation was used to
compensate for the lack of training data for
the models. Other types of context-depen-
dent modeling, such as context-dependent
boundary modelling and “offset” modelling,
were also used successfully in the re-scoring
pass.
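The interpolation step can be sketched as follows; the count-based weighting heuristic shown is a common choice, not necessarily the scheme used in this work:

```python
def interpolate(p_cd, p_ci, lam):
    """Deleted interpolation: blend a sparse context-dependent
    probability with its robust context-independent backoff."""
    return lam * p_cd + (1.0 - lam) * p_ci

def lam_from_count(n, k=50.0):
    """Trust the context-dependent model more as its training count
    grows (a common heuristic; lambda is properly estimated on
    held-out, i.e. 'deleted', data)."""
    return n / (n + k)

p = interpolate(p_cd=0.012, p_ci=0.004, lam=lam_from_count(200))
print(round(p, 4))  # 0.0104
```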
The evaluation of the system was
performed using the Resource Management
task. Context-dependent segment models
were able to reduce the error rate of the
context-independent system by more than
twenty percent, and context-dependent
boundary models were able to reduce the
word error rate by more than a third. A
straightforward combination of context-
dependent segment models and boundary
models leads to further reductions in error
rate.
So that it can be incorporated easily
into existing and future systems, the code
for re-sorting N-best lists has been imple-
mented as an object in SAPPHIRE [2], a
framework for specifying the configuration
of a speech recognition system using a
scripting language. It is currently being
tested on JUPITER [3], a real-time telephone-based weather information system under
development at SLS.
References
[1] B. Serridge, Context-dependent Modeling in a Segment-based Speech Recognition System, M.Eng. thesis, MIT Department of Electrical Engineering and Computer Science, Cambridge, MA, August 1997.
[2] L. Hetherington and M. McCandless, "SAPPHIRE: An Extensible Speech Analysis and Recognition Tool based on Tcl/Tk," Proc. International Conference on Spoken Language Processing, pp. 1942-1945, Philadelphia, PA, October 1996.
[3] V. Zue, et al., "From Interface to Content: Translingual Access and Delivery of On-line Information," Proc. European Conference on Speech Communication and Technology, pp. 2227-2230, Rhodes, Greece, September 1997.
Toward the Automatic Transcription of General Audio Data
Michelle S. Spina
Recently, ASR research has broadened its
scope to include the transcription of general
audio data (GAD), from sources such as
radio or television broadcasts. This shift in
research focus is largely brought on by the
growing need to shift content-based
information retrieval from text to speech.
However, GAD pose new challenges to
present-day ASR technology because they
often contain extemporaneously-generated,
and therefore disfluent speech, with words
drawn from a very large vocabulary, and
they are usually recorded from varying
acoustic environments. Also, the voices of
multiple speakers often interleave and
overlap with one another or with music and
other sounds. Since the performance of
ASR systems can vary a great deal depend-
ing on speaker, microphone, recording
conditions and transmission channel, we
have argued that the transcription of GAD
would benefit from a preprocessing step
that first segmented the signal into acousti-
cally homogeneous chunks [3]. Such
preprocessing would enable the transcrip-
tion system to utilize the appropriate
acoustic models during recognition. The
goal of the research presented here was to
investigate some of the strategies for training
a phonetic recognition system for GAD.
We have chosen to focus on the
Morning Edition (ME) news program
broadcast by National Public Radio (NPR).
NPR-ME consists of news reports from
national and local studio anchors as well as
reporters from the field, special interest
editorials and musical segments. The
analysis presented here is based on a
collection of six hours of recording from
November 1996 to January 1997. The six
one-hour shows were automatically split into
manageable sized waveform files at silence
breaks. In addition, if any of the resulting
waveform files contained multiple sound
environments (e.g., a segment of music
followed by a segment of speech) they were
further split at these boundaries. Therefore,
each file was homogeneous with respect to
sound environment. Orthographies and
phonetic alignments were generated for
each of the files using orthographic
transcriptions of the data and a forced
Viterbi search.
Seven categories were used to character-
ize the files. These categories were described
in our previous work [3], and are briefly
reviewed here: 1) clean speech: wideband
(8kHz) speech from anchors and reporters,
recorded in the studio, 2) music speech:
speech with music in the background, 3)
noisy speech: speech with background
noise, 4) field speech: telephone bandwidth
(4kHz) speech from field reporters, 5)
music, 6) silence, and 7) garbage, which
accounted for anything that did not fall into
one of the other six categories. In [3], we
described some preliminary analyses and
experiments that we had conducted
concerning the transcription of this data.
For the NPR-ME corpus, we were able to
achieve better than 80% classification
accuracy for these seven sound classes on
unseen data, using relatively straightforward
acoustic measurements and pattern classifi-
cation techniques. A speech/non-speech
classifier achieved an accuracy of nearly
94%. The level of performance of such a
classifier is clearly related to the ways in
which it will serve as an intelligent front-end
to a speech recognition system. The
experiments done for this work attempt to
determine if such a preprocessor is neces-
sary, and if so, what level of performance is
required for the sound segmentation.
68 SUMMARY OF RESEARCH
For the development of the phonetic
recognition system, 4.25 hours of the NPR-
ME data were used for system training, and
the remaining hour was used for system test.
Acoustic models were built using the TIMIT
61 label set. Results, expressed as phonetic
recognition error rates, are collapsed down
to the 39 labels typically used by others to
report recognition results. The SUMMIT
segment-based speech recognizer developed
by our group was used for these experi-
ments. The feature vector for each segment
consisted of MFCC and energy averages
over segment thirds as well as two deriva-
tives computed at segment boundaries.
Segment duration was also included.
Mixtures of up to 50 diagonal Gaussians
were used to model the phone distributions
on the training data. For simplicity, only
context-independent models were used. The
language model used in all experiments was
a phone bigram based on over four hours of
training data. This particular configuration
of SUMMIT achieved an error rate of 37.1%
when trained and tested on TIMIT.
We conducted experiments to deter-
mine the trade-offs between using a large
amount of data recorded under a variety of
speaking environments (a multi-style
training approach) and a smaller amount of
high quality data if a single recognizer
system was to be used to recognize all four
different types of speech material present in
NPR-ME. We found that a multi-style
approach yielded an overall error rate of
39.2%, with the lowest error rates arising
from clean speech (33.2%) and the highest
error rates arising from field speech
(50.4%). Training the system with only the
clean, wideband speech material found in
the training set yielded comparable results,
with an overall error rate of 38.8%.
However, the multi-style approach utilized
nearly 1.7 times the amount of data for
training the acoustic models. To perform a
fair comparison between these two ap-
proaches, we trained a multi-style system
with an amount of training data equivalent
to that of the clean speech system. We
found this training approach degraded our
results to an overall error rate of 41.1%, an
increase of nearly 3%. This result indicates
that it is advantageous to use only clean,
wideband speech material for acoustic
model training when data and computation
availability becomes an issue.
In addition to the single recognizer
system explained above, we also explored
the use of a multiple recognizer system for
the phonetic recognition of NPR-ME, one
for each type of speech material. The
environment-specific approach involves
training a separate set of models for each
speaking environment, and using the
appropriate models for testing. We used the
sound classification system described in [3]
as the preprocessor to classify each test
utterance as one of the four speech environ-
ments. The environment-specific model
chosen by the automatic classifier for each
utterance was then used to perform the
phonetic recognition. This resulted in an
overall error rate of 38.3%, which is slightly
better than the best single recognizer result.
In all of the experiments conducted, we
found that the field speech environment has
consistently shown the highest phonetic
recognition error rates. In an attempt to
improve the recognition performance of the
field speech, we bandlimited the training
data by restricting our analysis to the
frequency range of 133Hz to 4kHz. Using
this approach, we were able to lower the recognition error rate on the field speech data to 46.9% by bandlimiting the
clean speech training data. Using the
bandlimited clean speech models in
the multiple recognizer system for utterances
classified as field speech, the overall error
rate becomes 37.9%, which is 2.3% better
than the best single recognizer result.
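A sketch of the bandlimiting step follows. In this work the analysis range was restricted during feature extraction; band-pass filtering the waveform, as below, is one way to approximate that:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def bandlimit(x, fs, lo=133.0, hi=4000.0, order=6):
    """Band-pass wideband audio to the telephone band so that
    clean-speech models better match field-speech conditions."""
    b, a = butter(order, [lo, hi], btype="bandpass", fs=fs)
    return filtfilt(b, a, x)

fs = 16000
x = np.random.randn(fs)   # one second of stand-in wideband audio
y = bandlimit(x, fs)      # 133 Hz - 4 kHz, mimicking field speech
```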
In future work in this area, we intend to
concentrate on improving the phonetic
recognition results from the clean speech
environment, and to investigate how the
recognition of GAD compares to other
automatic speech recognition tasks.
References
[1] J.L. Gauvain, L. Lamel, and M. Adda-Decker, "Acoustic Modelling in the LIMSI Nov96 Hub4 System," Proc. DARPA Speech Recognition Workshop, February 1997.
[2] R. Schwartz, H. Jin, F. Kubala, and S. Matsoukas, "Modeling those F-conditions - Or not," Proc. DARPA Speech Recognition Workshop, February 1997.
[3] M.S. Spina and V.W. Zue, "Automatic Transcription of General Audio Data: Preliminary Analysis," Proc. International Conference on Spoken Language Processing, pp. 594-597, Philadelphia, PA, October 1996.
[4] M.S. Spina and V.W. Zue, "Automatic Transcription of General Audio Data: Effect of Environment Segmentation on Phonetic Recognition," Proc. European Conference on Speech Communication and Technology, pp. 1547-1550, Rhodes, Greece, September 1997.
Porting the GALAXY System to Mandarin Chinese
Chao Wang
The GALAXY system is a human-computer
conversational system providing a spoken
language interface for accessing on-line
information. It was initially implemented
for English in travel-related domains,
including air travel, local city navigation,
and weather. One of the design goals of the
GALAXY architecture was to accommodate
multiple languages in a common frame-
work. This thesis concerns the development
of YINHE, a Mandarin Chinese version of the
GALAXY system [1,2]. Acoustic models,
language models, vocabularies, and linguis-
tic rules for Mandarin speech recognition,
language understanding, and language
generation have been developed; large
amounts of domain specific Mandarin
speech data have been collected from native
speakers for system training; and issues that
are specific for Chinese have been addressed
to make the system core more language
independent. Figure 15 shows the system
operating in Chinese. The user communi-
cates with the system in spoken Mandarin,
and the system displays responses in
Chinese ideographs, along with maps, etc.
In the following, data collection, develop-
ment of speech recognition, understanding
and generation components, and system
evaluation will be described in more detail.
Both read and spontaneous speech have
been collected from native speakers of
Mandarin Chinese. Spontaneous speech
data were collected using a simulated
environment based on the existing English
GALAXY system. The data were used for
training both acoustic and language models
for recognition, and deriving and training a
grammar for language understanding. In
addition, a significant amount of read
speech data was collected through our Web
data collection facility. It is easier to collect
read data in large amounts, and they are
very valuable for acoustic training due to
the phone-line diversity of randomly
distributed callers. We use pinyin, enhanced
with tones, for Chinese representation in
our transcription to simplify the input task.
Figure 15. An example of a dialogue exchange between YINHE and a user.
Homophones that are the same in both
base-syllables and tones are indistinguish-
able in the pinyin representation. We
determined however that this ambiguity
could be resolved by the language under-
standing component. We also decided not
to tokenize the utterances into word
sequences in the transcription, because it is
not always obvious even for native speakers
what constitutes a word, and the selection
of words would likely change during the
development process. The sentences were
later segmented into word sequences using a
semi-automatic tokenization procedure
based on a predefined vocabulary, for
training the language models. Time-aligned
phonetic transcriptions were derived using a
forced alignment procedure during the
iterative training process. A summary of the
corpus is shown in Table 14.
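The actual tokenization procedure was semi-automatic; its automatic core can be approximated by a greedy longest-match over the predefined vocabulary, as in the minimal sketch below (the vocabulary entries are toy examples):

```python
def tokenize(syllables, vocab, max_len=4):
    """Greedy longest-match segmentation of a pinyin syllable sequence
    into words from a predefined vocabulary; unmatched syllables fall
    through as single-syllable words."""
    words, i = [], 0
    while i < len(syllables):
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            cand = " ".join(syllables[i:i + n])
            if n == 1 or cand in vocab:
                words.append(cand)
                i += n
                break
    return words

vocab = {"bo1 shi4 dun4", "bo2 wu4 guan3", "duo1 shao3"}  # toy vocabulary
print(tokenize("bo1 shi4 dun4 you3 duo1 shao3 bo2 wu4 guan3".split(), vocab))
# ['bo1 shi4 dun4', 'you3', 'duo1 shao3', 'bo2 wu4 guan3']
```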
The speech recognition is performed by
the SUMMIT segment-based speech recogni-
tion system. It would be advantageous to
incorporate some kind of tone recognition
into the framework. However, SUMMIT, as
currently configured, does not have any
capability for explicitly dealing with
fundamental frequency, and it would also
be difficult to incorporate scores provided at
the syllable level. Thus, we have omitted
tone recognition at this stage. We realize
that this leads to a greater number of
potential homophones, but most of these
can be disambiguated at the parsing stage.
We went through several iterations to
decide the actual vocabulary, mainly in
making decisions about where to insert
word boundaries in the syllable string. City
names are prominent in GALAXY. Since they
are world-wide, users are uncertain as to
whether to refer to them in English or
Chinese. Hence we had to allow multiple
entries for many of them, essentially an
English and a Chinese equivalent. A similar
problem exists for the place names in the
City Guide domain. We felt it would be
difficult to cover all the odd pronunciations
of restaurant names, etc. Therefore, we
eliminated most of them from the vocabu-
lary, thus encouraging the user to refer to
them by index or clicking. The current
vocabulary has about 1000 words, which is
much smaller than that of the English-based
system. About one quarter of the vocabulary
items are English; each Chinese word has
on average 2.1 characters.
After experimenting with various sets of
phonetic units, we finally settled on the
simple choice of representing each syllable
initial and each final as an individual
phonetic unit. We feel that our segment-
based framework is particularly effective at
capturing the dynamic nature of these
multi-phoneme units, and the only problem
was that we did not have very obvious
English analogs for some of these units
(such as “UAN” and “IANG”) on which to
seed. We were able to solve this problem by
seeding any unusual finals on schwa,
because of its inherent variability, along
with an artificial reward during early
iterations. This improves their chance of
consuming the entire span of the syllable
final during forced alignment, rather than
giving part of it up to an undesirable
insertion model or a neighboring syllable
initial.
We also had some difficulty with the
rich set of strong fricatives and affricates in
Mandarin. Mandarin makes a distinction
between /s/, /sh/, and a retroflexed /shr/.
Table 14. Summary of the corpus.

Set            Train             Dev          Test
No. of utts.   6457              500          274
Type of utt.   spont. and read   spontaneous
No. of spks.   93                6
Wds per utt.   8.3               8.5          8.0
Similar distinctions are possible for the
voiced and affricate counterparts. These
phonemes are further complicated by the
widespread regional differences among
speakers. For example, in southern dialects,
there is a tendency to lose the distinction of
/s/ and /shr/. We initially tried handling
these dialectal variations by phonological
rules, but in the absence of hand-labeled
data it became difficult to guarantee a
correct realization in our training utter-
ances. In the end we decided to let the
models handle the variability through the
Gaussian mixtures. The English proper
nouns are usually outside of the phonologi-
cal and phonotactic structure of Mandarin.
As a consequence, users often speak these
names with a heavy accent, and it becomes
problematic whether to build separate
English phonetic models or to force these
outliers into the nearest-neighbor Mandarin
equivalent. For the most part we were able
to share models, with the system being
augmented with only a few phonemes
particular to English, such as /v/ and /eh/.
Thus, in some sense, we lexicalized the
foreign accent for English, entering “New
York” in the lexicon pronounced as “Niu
Yok” and “South Boston” as “Saus
Basteng.”
For natural language understanding we
used the TINA system. Our approach to rule
development was to determine the appropri-
ate rules for each new Chinese sentence by
first parsing an English equivalent, and
choosing, as much as possible, category
names that paralleled the English equiva-
lent. This minimized the effort involved in
mapping the resulting parse tree to a
semantic frame. While the temporal
ordering of constituents is quite different
for Chinese than for English, the basic
hierarchy of the phrase structure is usually
very similar to that of English.
We were a little uncertain about what to
do with the tokenization problem —
whether to include the partial tokenization
that takes place at the time of recognition,
or to disregard it and reparse the syllable
sequence. We finally decided to discard the
recognizer tokenization, and rely instead on
the grammar rules of TINA to retokenize,
with the belief that the final result would be
more reliable. Since the grammar is heavily
constrained by semantic categories through-
out the parse tree, it is usually able to
reconstruct the correct tokenization of the
sentence. We made a few exceptions to this
rule, in cases where confusions with a
common word start syllable could cause
significant ambiguity. For instance, the first
syllable of the word “jiu3 dian4” (“wine
store”) is a homophone of the word “jiu3”
meaning “nine”. Since numbers are
prevalent in the grammar, we decided it was
safer to commit the whole word “wine
store” up front, to expedite the parsing
process. This effectively provides a one-
syllable look-ahead to the parser.
TINA has a trace mechanism to handle
gaps that are prevalent for wh-questions in
English (e.g., “[What street] is MIT on
[trace]?”). In Chinese, wh- words are not
moved to the front of the sentence, and
therefore these sentences are easier to
accommodate than their English equiva-
lents. Chinese does however frequently
utilize an analogous forward-movement
strategy to topicalize certain constituents in
a sentence; an example is given in Figure 16.

Figure 16. An example of long-distance movement in Chinese for topicalization, for the sentence, "Boston has how many museums?"

Such sentences were well-matched to
TINA’s trace mechanism, which produces a
desirable frame containing “in Boston” as a
predicate modifying “museums”, but
paraphrasing properly, with "Boston" in the
topicalized initial position, due to the trace
marker.
Language generation for Chinese was
performed by the GENESIS system. We found
that the process of generating correct
paraphrases and responses in Chinese was
quite straightforward, and, for the most
part, we were able to utilize our GENESIS
framework without any changes. One aspect
of Chinese that is quite different from
English is the use of particles to accompany
quantified nouns. These particles are
analogous to “a flock of sheep” in English,
except that they are far more pervasive in
the language. Thus “four banks” becomes
“four <particle> banks.” Furthermore, the
exact realization of the particle depends on
the class of the noun, and there is a fairly
large number of possibilities. For any
language-internal paraphrases (Chinese → semantic frame → Chinese), the particle can
be parsed into the frame and reparaphrased
intact. However, for actual translation, the
situation is problematic because complex
context effects determine which particle to
use under what circumstances. Similarly,
Chinese does not make obvious distinctions
between singular and plural, which can be
problematic when translating to English.
Since YINHE is self-consistent with respect to
language, these issues have been avoided,
but we would like to be able to produce
trans-lingual paraphrases that are also well-
formed.
Table 15 shows the speech recognition
performance in terms of word error rate
and sentence error rate on the develop-
ment and test data. Table 16 shows the
speech understanding performance on the
test data.

Table 15. Summary of the recognition performance.

Set    No. of Utts.   WER     SER
Dev    500            9.1%    37.4%
Test   274            10.8%   39.1%

Table 16. Speech understanding performance in percentages on the 274 spontaneous utterances of the test set.

           -------- Parsed --------
           Perfect  Acceptable  Wrong   Failed
1-best     62.4     6.9         2.6     28.1
10-best    70.0     8.4         5.5     16.4
ortho.     80.3     2.9         0.7     16.0

The 10-best entry gives the results
obtained based on the parse selected
automatically from a 10-best list. About 20% of the queries, even if recognized perfectly, would still not be understood correctly. Most of the sentences that fail to
parse are outside of the domain of GALAXY
or suffer from disfluencies which are
beyond the limited robust parsing capabili-
ties of YINHE.
Overall, we consider the exercise of
porting GALAXY to Mandarin to be a success.
The end-to-end Mandarin system appears to
be comparable in performance to its English
counterpart. We feel that the success of this
effort demonstrates the feasibility of our
design aimed at accommodating multiple
languages in a common framework.
Several aspects of the system still remain
to be improved. While the system converses
with the user, it often displays information
it has obtained, for example from the Web,
in English. It would be more natural if the
information itself, and not just the remark
about the information, could be provided to
the user in their preferred language. We
have not obtained an adequate Mandarin
Chinese speech synthesizer yet, so the
system only displays the verbal answer to the
user in text form. We plan to add a tone-
recognition capability to our recognizer.
This may require us to restructure the
framework to accommodate explicit
knowledge of syllable boundaries.
References
[1] C. Wang, Porting the GALAXY System to Mandarin Chinese, M.S. thesis, MIT Department of Electrical Engineering and Computer Science, Cambridge, MA, May 1997.
[2] C. Wang, J. Glass, H. Meng, J. Polifroni, S. Seneff, and V. Zue, "YINHE: A Mandarin Chinese Version of the GALAXY System," Proc. European Conference on Speech Communication and Technology, pp. 351-354, Rhodes, Greece, September 1997.
Natural-Sounding Speech Synthesis Using Variable-Length Units
Jon Yi
Our work in the previous year showed that
by careful design of system responses to
ensure consistent intonation contours,
natural-sounding speech synthesis can be
achievable with word- and phrase-level
concatenation. In order to extend the
flexibility of this framework, we focused on
generating novel words from a corpus of
sub-word units. The design of the corpus
was motivated by perceptual experiments
that investigated where speech could be
spliced with minimal audible distortion and
what contextual constraints were necessary
to maintain in order to produce natural-
sounding speech. From this sub-word
corpus, a Viterbi search selects a sequence of
units based on how well they match the
input specification and concatenation
constraints. This concatenative speech
synthesis system, ENVOICE, has been used
in WHEELS and PEGASUS to convert meaning
representations into speech waveforms.
The synthesis process used in the
WHEELS system involved the concatenation
of word- and phrase-level units with no
signal processing. These units were carefully
prepared by recording them in the precise
prosodic environment in which they would
be used. However, recording every word in
every realizable prosodic environment represents a trade-off between large-scale recording and high naturalness. Essentially,
this type of generation approach has two
shortcomings. First, while the carrier
phrases attempt to capture prosodic
constraints, they do not explicitly capture
co-articulatory constraints, which may be
more important at sub-word levels. Second,
some application domain vocabularies are
continuously expanding (e.g., new car
models may be introduced each year), or
have a large number of words (e.g., the
23,000 United States city names in a Yellow
Pages domain). In order to discover
strategies to combat these two factors, we
decided to investigate the synthesis of
arbitrary proper names using sub-word units
from a designed sub-word corpus.
We performed perceptual experiments
to learn what units are appropriate for
concatenative synthesis and how well these
units sound as an ensemble when concat-
enated together to form new words. These
two constraints are the unit and transition
criteria. Because source changes (e.g.,
voiced-unvoiced transitions) typically result
in significant spectral changes, we hypoth-
esized that a splice might not be as percep-
tible at this point, in comparison to other
places. Should the speech signal be broken
between two voiced regions, it would be
important to ensure formant continuity at
the splice boundary. This hypothesis
motivated a series of consonant-vowel-
consonant (CVC) studies that dealt with the
substitution of vowels at boundaries of
source change.
One study tested potential transition
points by fixing the place of articulation of
the surrounding consonants. For example,
the /AO/ from the city name, “Boston”
(/B AO S T IH N/), was replaced by the /AO/ from "bossed" (/B AO S T/).
Perceptually, the splicing was not noticeable.
We found this effect to hold when the
consonants are stops, fricatives, or affricates. A variation on the above study
showed that the voicing dimension of the
surrounding consonants can be ignored
while still producing a natural-sounding
splice. This knowledge contributed towards
the formation of the unit criterion studies, which showed the place of articulation and nasal consonants to be the main
contextual constraints for vowels. While it
was possible to perform natural-sounding
splicing at boundaries between vowels and
consonants, we found it preferable to keep
vowel and semivowel sequences together as a
unit.
The various principles learned from the
perceptual studies were used to enumerate a
set of synthesis units for concatenative
synthesis of non-foreign English words. We
made use of a 90,000-word lexicon from the
Linguistic Data Consortium called the
COMLEX English Pronunciation Dictionary,
commonly referred to as PRONLEX. We
limited our analysis of contiguous multi-
phoneme sequences of vowels and
semivowels to the non-foreign subset of
PRONLEX containing approximately 68,000
words. We identified 2,358 unique vowel
and semivowel sequences; consonants were
assumed to be adequately covered. These
sequences were covered by an automatic algorithm that selects a compact set of prompts to record, given a set of units to cover and a set of words to choose from.
When this prompt selection algorithm was
applied, a total of 1,604 words was selected.
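The prompt selection problem is a set cover, for which a greedy strategy is the natural sketch (the word/unit data below are toy examples, not drawn from PRONLEX):

```python
def select_prompts(words, units_of, targets):
    """Greedy set cover: repeatedly pick the word covering the most
    not-yet-covered units until all target units are covered (or no
    word adds coverage)."""
    uncovered, prompts = set(targets), []
    while uncovered:
        best = max(words, key=lambda w: len(units_of[w] & uncovered))
        gain = units_of[best] & uncovered
        if not gain:
            break
        prompts.append(best)
        uncovered -= gain
    return prompts

units_of = {"boston": {"aa_s", "ih_n"}, "bossed": {"aa_s"},
            "lesson": {"eh_s", "ih_n"}}
print(select_prompts(list(units_of), units_of,
                     targets={"aa_s", "ih_n", "eh_s"}))
# ['boston', 'lesson'] -- all three target units with two prompts
```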
The unit selection algorithm is a Viterbi
search that provides an automatic means to
select an optimal sequence of sub-word
units from a speech database given an input
pronunciation. Because the use of longer-
length units tends to improve synthesis
quality, it is important to maximize the size
and the contiguity of speech segments to
encourage the selection of multi-phone
sequences. The search metric is composed
of a unit cost function and a transition cost
function. The unit cost function measures
co-articulatory distance by considering
triphone classes which have consistent
manner of production. The transition cost
function measures co-articulatory continuity
between two phones proposed for concat-
enation. A transition cost is incurred if the two phones were not spoken in succession, discouraging concatenations at places exhibiting a
significant amount of co-articulation, or
formant motion. Also, we decouple
transitions occurring within or across
syllables into intra-syllable and inter-syllable
transitions, respectively.
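A skeletal version of such a search follows, with toy cost functions; the actual cost definitions involve triphone classes and syllable structure, as described above:

```python
def select_units(targets, candidates, unit_cost, trans_cost):
    """Viterbi search: candidates[i] lists database units for target
    phone i; minimize summed unit costs plus transition costs."""
    best = {u: (unit_cost(targets[0], u), [u]) for u in candidates[0]}
    for i in range(1, len(targets)):
        nxt = {}
        for u in candidates[i]:
            # best predecessor for candidate u
            p = min(best, key=lambda q: best[q][0] + trans_cost(q, u))
            c = best[p][0] + trans_cost(p, u) + unit_cost(targets[i], u)
            nxt[u] = (c, best[p][1] + [u])
        best = nxt
    return min(best.values(), key=lambda v: v[0])

# Toy setup: a unit is (phone, position in database); contiguity is
# free, encouraging the selection of long multi-phone sequences.
def unit_cost(t, u):  return 0.0 if u[0] == t else 1.0
def trans_cost(p, u): return 0.0 if u[1] == p[1] + 1 else 0.5

cost, path = select_units(
    ["b", "aa", "s"],
    [[("b", 10)], [("aa", 11), ("aa", 40)], [("s", 12), ("s", 41)]],
    unit_cost, trans_cost)
print(cost, path)  # 0.0 [('b', 10), ('aa', 11), ('s', 12)] -- contiguous
```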
We have deployed the total variable-
length concatenative synthesis framework in
GALAXY, where ENVOICE servers return
speech waveforms to clients presenting
meaning representations as input. In
PEGASUS, both phrase-level and sub-word
unit concatenation are utilized, where
system responses are generated by the former and city names by the latter. Overall,
users thought the system sounded natural
and found sentences to be much preferable
over those generated by DECTalk.
This research work has three types of
contributions: a framework for Meaning-to-
Speech (MTS) concatenative synthesis,
principles about sub-word unit design for
concatenative synthesis, and sub-word
corpus design. This MTS framework is
suitable for use in a conversational system
because it was designed from the ground up
for understanding domains as opposed to
general-purpose Text-to-Speech synthesizers.
There remains much future work in many
areas including unit design, prosody,
evaluation methods, and development
strategies.
Reference
[1] J. Yi, Natural-Sounding Speech Synthesis Using Variable-Length Units, M.Eng. thesis, MIT Department of Electrical Engineering and Computer Science, May 1998.