THESIS RESEARCH
A Model for Segment-Based Speech Recognition
Jane Chang
Currently, most approaches to speech
recognition are frame-based in that they
represent the speech signal using a temporal
sequence of frame-based features, such as
Mel-cepstral vectors. Frame-based ap-
proaches take advantage of efficient search
algorithms that largely contribute to their
success. However, they cannot easily
incorporate segment-based modeling
strategies that can further improve recogni-
tion performance. For example, duration is
a segment-based feature that is useful but
difficult to model in a frame-based ap-
proach.
In contrast, segment-based approaches
represent the speech signal using a graph of
segment-based features, such as average Mel-
cepstral vectors over hypothesized phone
segments. Segment-based approaches enable
the use of segment-based modeling strate-
gies. However, they introduce multiple
difficulties in recognition that have limited
their success.
In this work, we have developed a
framework for speech recognition that
overcomes many of the difficulties of a
segment-based approach. We have published
experiments in phone recognition on the
core test set of the TIMIT corpus over 39
classes [1]. We have also run preliminary
experiments in word recognition on the
December ’94 test set of the ATIS corpus.
In our segment-based approach, we
hypothesize segments prior to recognition.
Previously, our segmentation algorithm was
based on local acoustic change. However,
segmentation depends on contextual factors
that are difficult to capture in a simple
measure. We have developed a probabilistic
segmentation algorithm called “segmenta-
tion by recognition” that hypothesizes
segments in the process of recognition.
Segmentation by recognition applies all of
the constraints used in recognition towards
segmentation. As a result, it hypothesizes
more accurate segments. In addition, it
adapts to all types of variability, focuses
modeling on confusable segments, hypoth-
esizes all types of units, and uses scores that
can be re-used in recognition. We have
implemented this segmentation algorithm
using a backwards A* search and a diphone
context-dependent frame-based phone
recognizer. In published TIMIT experi-
ments, we have reported an 11.3% reduc-
tion in phone recognition error rate from
38.7% with our previous acoustic segmenta-
tion to 34.3% with segmentation by
recognition [1].
In segment-based recognition, the
speech signal is represented using a graph of
features. Probabilistically, it is necessary to
account for all of the features in the graph.
However, each path through the graph
directly accounts for only a subset of all
features. Previously, we modeled the features
that are not in a path using a single “anti-
phone” model [2]. However, the features
that are not in a path depend on contextual
factors that are difficult to capture in one
model. We have developed a search
algorithm called “near-miss modeling” that
uses multiple models for all features in a
graph. Near-miss modeling associates each
feature with a near-miss subset of features
such that any path through a graph is
associated with all features. As a result, it
probabilistically accounts for and efficiently
enforces constraints across all features. In
addition, it focuses modeling on discrimi-
nating between a feature and its near-misses.
We have implemented near-miss modeling
using a Viterbi search and a set of near-miss
phone models that correspond to our
context-independent phone models. In
published experiments, we have reported a
9.3% reduction in phone recognition error
rate from 34.3% with anti-phone modeling
to 31.1% with near-miss modeling [1]. In
addition, in preliminary ATIS experiments,
we have shown a 21.4% reduction in word
recognition error rate from 12.6% with anti-
phone modeling to 9.9% with near-miss
modeling.
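To make the bookkeeping concrete, the following Python fragment sketches how a path score might combine on-path segment scores with near-miss scores. All names are hypothetical; this is an illustration of the idea, not the actual implementation.

```python
# Illustrative sketch of near-miss path scoring; hypothetical names,
# not the actual SUMMIT implementation.

def score_path(path, near_miss_sets, seg_score, near_miss_score):
    """Score one path through a segment graph.

    path            -- list of (segment, unit) pairs on the hypothesis
    near_miss_sets  -- dict mapping each segment to its near-miss set,
                       chosen so that any path accounts for all features
    seg_score       -- log-score of an on-path segment under the unit's model
    near_miss_score -- log-score of an off-path segment under the
                       near-miss model for the hypothesized unit
    """
    total = 0.0
    for segment, unit in path:
        total += seg_score(segment, unit)          # on-path feature
        for miss in near_miss_sets[segment]:       # associated near-misses
            total += near_miss_score(miss, unit)
    return total
```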
In word recognition, deletion and
insertion errors in segmentation can cause
multiple recognition errors. Previously, we
have been using phone units. However,
phone realizations depend on contextual
factors and are difficult to segment. We
have developed larger units called “multi-
phone units” that span multiple phones.
Multi-phone units cover phone sequences
that demonstrate systematic acoustic and
lexical variations. As a result, they recover
from systematic segmentation errors. In
addition, they focus modeling on systematic
context-dependencies. We select multi-
phones using a Viterbi search to minimize
match, deletion and insertion criteria. In
preliminary experiments, we have shown a
4.2% reduction in word recognition error
rate from 9.9% with phone units to 9.1%
with multi-phone units.
Figure 7 shows an example of our
framework. The input speech is displayed as
a spectrogram. Segmentation by recognition
hypothesizes the graph of segments under
the spectrogram. Near-miss modeling
associates near-misses such that the black
segment is associated with the three gray
segments. The total score for a unit is the
sum of segment and near-miss scores. The
seven best scoring units for the black
segment are listed on the right. The best
scoring unit is the multi-phone unit that
spans the phone sequence of /r/ followed
by /l/. The recognized phone and word
outputs are displayed under the segment
graph.
With segmentation by recognition,
near-miss modeling and multi-phone units,
our framework overcomes many of the
difficulties in segment-based recognition
and enables the exploration of a wide range
of segment-based modeling strategies.
Although our work does not focus on
developing such strategies, we have already
Figure 7. Example of framework, showing spectrogram, segment graph, phone and word recognition, and scores for the highlighted segment.
shown improvements in recognition
performance. For example, a segment-based
approach can use both frame- and segment-
based features. In published experiments,
we have reported a 4.0% reduction in
phone recognition error rate from 27.7%
with just frame-based features to 26.6%
with both types of features. In addition, a
segment-based approach facilitates the use
of duration to model segment probability.
In preliminary experiments, we have shown
a 5.3% reduction in word recognition error
rate from 9.1% with no model to 8.5% with
a duration model.
References
[1] J. Chang and J. Glass, “Segmentation and Modeling in Segment-based Recognition,” Proc. European Conference on Speech Communication and Technology, pp. 1199-1202, Rhodes, Greece, September 1997.
[2] J. Glass, J. Chang and M. McCandless, “A Probabilistic Framework for Feature-based Speech Recognition,” Proc. International Conference on Spoken Language Processing, pp. 2277-2280, Philadelphia, PA, October 1996.
[3] J. Chang. Near-Miss Modeling: A Segment-based Approach to Speech Recognition. Ph.D. thesis, MIT Department of Electrical Engineering and Computer Science, June 1998.
Hierarchical Duration Modelling for a Speech Recognition System
Grace Chung
Durational patterns of phonetic segments
and pauses convey information about the
linguistic content of an utterance. Most
speech recognition systems grossly
underutilize the knowledge provided by
durational cues due to the vast array of
factors that influence speech timing and the
complexity with which they interact. In this
thesis, we introduce a duration model based
on the ANGIE framework. ANGIE is a para-
digm which captures morpho-phonemic and
phonological phenomena under a unified
hierarchical structure. Sublexical parse trees
provided by ANGIE are well-suited for
constructing complex statistical models to
account for durational patterns that are
functions of effects at various linguistic
levels. By constructing models for all the
sublexical nodes of a parse tree, we implic-
itly model duration phenomena at these
linguistic levels simultaneously, and
subsequently account for a vast array of
contextual variables affecting duration from
the phone level up to the word level.
Experiments in our work have been
conducted in the ATIS domain which
consists of continuous, spontaneous
utterances concerning enquiries for travel
information.
In this duration model, a strategy has
been formulated in which node durations
in upper layers are successively normalized
by their respective realizations in the layers
below; that is, given a nonterminal node,
individual probability distributions,
corresponding with each different realiza-
tion in the layer immediately below, are all
scaled to have the same mean. This reduces
the variance at each node, and enables the
sharing of statistical distributions. Upon
normalization, a set of relative duration
models is constructed by measuring the
percentage duration of nodes occupied with
respect to their parent nodes.
Under this normalization scheme, the
normalized duration of each word node is
independent of the inherent durations of its
descendents and hence is an indicator of
speaking rate. A speaking rate parameter
can be defined as a ratio of the normalized
word duration over the global average
normalized word duration. This speaking
rate parameter is then used to construct
absolute duration models that are normal-
ized by the rate of speech. This is done by
scaling absolute phoneme durations by the
above parameter. By combining hierarchical
normalization and speaking rate normaliza-
tion, the average standard deviation for
phoneme duration was reduced from 50ms
to 33ms.
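As a rough sketch of the two normalization steps described above (my own illustration with hypothetical names and data layouts, not the ANGIE code):

```python
import numpy as np

def normalize_by_realization(durations_by_realization):
    """Scale each realization's duration distribution to a common
    mean, reducing the variance at a node and enabling the sharing
    of statistical distributions across realizations."""
    means = {r: np.mean(d) for r, d in durations_by_realization.items()}
    target = np.mean(list(means.values()))
    return {r: [x * target / means[r] for x in d]
            for r, d in durations_by_realization.items()}

def speaking_rate(norm_word_duration, global_avg_norm_word_duration):
    """Speaking-rate parameter: ratio of a word's normalized duration
    to the global average normalized word duration."""
    return norm_word_duration / global_avg_norm_word_duration

def rate_normalized(phoneme_duration, rate):
    """Scale an absolute phoneme duration by the speaking-rate
    parameter (the direction of the scaling is my assumption)."""
    return phoneme_duration / rate
```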
Using the hierarchical structure, we
have conducted a series of experiments
investigating speech timing phenomena. We
are specifically interested in (1) examining
secondary effects of speaking rate, (2)
characterizing the effects of prepausal
lengthening, and (3) detecting other word
boundary effects associated with duration
such as gemination. For example, we have
found, with statistical significance, that a
suffix within a word is affected far more by
speaking rate than is a prefix. We have also studied closely the types of words which tend to be realized particularly slowly in our training corpus, and found that these are predominantly function words and single-syllable words.
Prepausal lengthening is the phenom-
enon where words preceding pauses tend to
be somewhat lengthened. Our goal is to
examine the characteristics associated with
prepausal effects and in the future further
incorporate these into our model. In our
studies, we consider the relationship
between this phenomenon and the rate of
speech. We found that lengthening occurs when pauses are greater than 100ms
in duration. It is also observed that
prepausal lengthening affects the various
sublexical units non-uniformly. For ex-
ample, the stressed syllable nucleus tends to be lengthened more than the onset position.

The final duration model has been implemented in the ANGIE phonetic
recognizer. In addition to contextual effects
captured by the model at various sublexical
levels, the scoring mechanism also accounts
explicitly for two inter-word level phenom-
ena, namely, prepausal lengthening and
gemination. Our experiments have been
conducted under increasing levels of
linguistic constraint with correspondingly
different baseline performances. The
improved performance is obtained by
providing successively greater amounts of
implicit lexical knowledge during recogni-
tion by way of an intermediate morph or
syllable lexicon.
When maximal linguistic constraint is
imposed, the incorporation of the relative
and speaking-rate normalized absolute
phoneme duration scores reduced the
phonetic error rate from 29.7% to 27.4%, a
relative reduction of 7.7%. These gains are
over and above any gains realized from
standard phone duration models present in
the baseline system, and encourage us to
further apply our model in future recogni-
tion tasks.
As a first step towards demonstrating
the benefit of duration modelling for full
word recognition, we have conducted a
preliminary study using duration as a post-
processor in a word-spotting task. We have
simplified the task of spotting city names in
the ATIS domain by choosing a pair of highly
confusable keywords, “New York” and
“Newark.” All tokens initially spotted as
“New York” are passed to a post-processor,
which reconsiders those words and makes a
final decision, with the duration component
incorporated. For this task, the duration post-
processor reduced the number of confusions
from 60 to 19 tokens out of a total of 323
tokens, a 68% reduction of error. We believe
that the dramatic performance improvement
demonstrates the power of durational knowl-
edge in specific instances where acoustic-
phonetic features are less effective.
In another experiment, the duration
model was fully integrated into an ANGIE-based
wordspotting system. As in our phonetic
recognition experiments, results were obtained
by adding varying degrees of linguistic
constraint. When maximum constraint is
imposed, the duration model improved
performance from 89.3 to 91.6 (FOM), a
relative improvement of 21.5%. The duration
model has been shown to be most effective when the
maximum amount of lexical knowledge is
provided, wherein the model is able to best
take advantage of the various durational
relationships among the components of the
sublexical parse structure. We also believe that
the more complex parse structures available in
the keywords for this task contribute to the
performance of our duration model.
This research has demonstrated success in
employing a complex statistical duration model
in order to improve speech recognition
performance. In particular, we see that
duration is more valuable during word
recognition. We would like to incorporate our duration modeling into a continuous speech recognition system, where significant gains should also be possible.
Reference
[1] G. Chung. Hierarchical Duration Modelling for a Speech Recognition System. S.M. thesis, MIT Department of Electrical Engineering and Computer Science, Cambridge, MA, May 1997.
Discourse Segmentation of Spoken Dialogue: An Empirical Approach
Giovanni Flammia
Empirical research in discourse and
dialogue is instrumental in quantifying
which conventions of human-to-human
language may be applicable for human-to-
machine language [1,2]. This thesis is an
empirical exploration of one aspect of
human-to-human dialogue that can be
applicable to human-to-machine language.
Some linguistic and computational models
assume that human-to-human dialogue can
be modeled as a sequence of segments [3].
Detecting segment boundaries has potential
practical benefits in building spoken
language applications (e.g., designing
effective system dialogue strategies for each
discourse segment and dynamically chang-
ing the system lexicon at segment bound-
aries).
Unfortunately, drawing conclusions
from studying human-to-human conversa-
tion is difficult because spontaneous
dialogue can be quite variable, containing
frequent interruptions, incomplete sen-
tences and unstructured segments. Some of
these variabilities may not contribute
directly to effective communication of
information. The goal of this thesis is to
determine empirically the extent to which
discourse segment boundaries can be
extracted from annotated transcriptions of
spontaneous, natural dialogues in specific
application domains. We seek answers to
three questions. First, is it possible to obtain
consistent annotations from many subjects?
Second, what are the regular vs. irregular
discourse patterns found by the analysis of
the annotated corpus? Third, is it possible
to build discourse segment models auto-
matically from an annotated corpus?
The contributions of this thesis are
twofold. Firstly, we developed and evaluated
the performance of a novel annotation tool
and associated discourse segmentation
instructions. The tool and the instructions
have proven to be instrumental in obtaining
reliable annotations from many subjects.
Our findings indicate that it is possible to
obtain reliable and efficient discourse
segmentation when the task instructions are
specific and the annotators have few degrees
of freedom, i.e., when the annotation task is
limited to choosing among few independent
alternatives. The reliability results are very
competitive with other published work [4].
Secondly, the analysis of the annotated
corpus provides substantial quantitative
evidence about the differences between
human-to-human conversation and current
human-to-machine telephone applications.
Since dialogue annotation can be
extremely time consuming, it is essential
that we develop the necessary tools to
maximize efficiency and consistency. To this
end, we have developed a visual annotation
tool called Nb which has been used for
discourse segmentation in our group and
other institutions.
With the help of Nb, we determined
how reliably human annotators can tag
segments in the dialogue transcriptions of
our corpus. We conducted two experiments
in which the transcriptions have each been
annotated by several people.
To carry out our research, we are
making use of a corpus of orthographically
transcribed and annotated telephone
conversations. The text data are faithful
transcriptions of actual telephone conversa-
tions between customers and telephone
operators collected by BellSouth
Intelliventures and American Airlines in
1994. The first pilot study consisted of 18
dialogues from all the domains of our
corpus, each one annotated by 6 different
coders [5]. The goal of this experiment was
rather exploratory in nature, without particular constraints on where to place discourse segment boundaries. We measured
reliability by recall, precision and the kappa
coefficient. When comparing two different
segmentations of the same text, we alterna-
tively select one as the reference and the
other one as the test. Reliability is best
measured by the kappa coefficient, a
statistical measure which is gaining popular-
ity in computational linguistics because it
measures how much better than chance the observed agreement is [6]. A coefficient of
0.7 or better indicates reliable results. Table
11 summarizes our findings. We found that
without detailed instructions, annotators
agree at the 0.45 reliability level in placing
segment boundaries. In our data, we found
that the kappa coefficient is always less than
the average of precision and recall.
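For concreteness, the kappa coefficient for two coders can be computed as in the following sketch (a standard two-coder formulation, not the thesis implementation):

```python
from collections import Counter

def kappa(labels_a, labels_b):
    """Cohen's kappa for two coders over the same units:
    (P_observed - P_chance) / (1 - P_chance)."""
    n = len(labels_a)
    p_obs = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_chance = sum(freq_a[k] * freq_b[k] for k in freq_a) / n ** 2
    return (p_obs - p_chance) / (1 - p_chance)
```

Applied to two coders' boundary/no-boundary labels over the same units, this yields kappa values of the kind reported in Table 11.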
The analysis of the disagreements of the
first experiment led to a second, more
focused experiment. This other experiment
consisted of 22 dialogues from only one
application, the movies listing domain. Each
dialogue was annotated by 7-9 coders [7].
The instructions defined a segment to be a
section of the dialogue in which the agent
delivers a new piece of information that is
relevant to the task. In addition, the
annotators had to choose among five
different segment purpose labels when
tagging a discourse segment. In that case,
we found that the kappa reliability measure
Table 11. Summary percentage statistics of the two annotation experiments. Average precision and recall are measured across all possible combinations of pairs of coders. The groupwise kappa coefficient is computed from the classification matrix of all the coders. Statistics are computed using as unit of analysis the sentence or the dialogue turn. Typically, a dialogue turn is composed of one to three short sentences.
Experiment            First   Second
Dialogues             18      22
Coders per dialogue   6       7-9

Units: sentences
Precision             57.7    85.0
Recall                61.5    83.9
Kappa                 45.1    82.2

Units: turns
Precision             67.0    85.1
Recall                71.0    84.7
Kappa                 53.7    82.4
Figure 8. Observed frequency of customer's acknowledgments as a function of the preceding agent's dialogue turn duration, measured in words.
in placing segment boundaries is 0.824, and
the accuracy in assigning segment purpose
labels is 80.1%.
To evaluate the feasibility of segmenting
dialogues automatically, we implemented a
simple discourse segment boundary
classifier based on learning classification
rules from lexical features [8]. On average,
the automatic algorithm agrees with the
manually annotated boundaries with 69.4%
recall and 74.5% precision.
Analysis of the movies listing conversations indicates that the customer follows the information reported by the agent with an explicit acknowledgement 84% of the time. We found that the agent delivers
information using shorter rather than
longer sentences. Figure 8 is a cumulative
frequency plot of the length of the agent's
dialogue turn before a customer's
acknowledgement. Most of the time, the
agent does not speak more than 15 words
before the customer responds with an
acknowledgment. After the
acknowledgement, 40% of the time the
information is explicitly confirmed by both
parties with at least two additional dialogue
turns.
Analysis of the annotated segments indicates that the customer is mainly responsible for switching to new topics, and that on average the agent's response is not immediate but instead is preceded by a few clarification turns. Table 12 lists the fraction
of agent vs. customer initiated segments by
topics and the average turn of the agent's
response from the beginning of the seg-
ment.
References
[1] N.O. Bernsen, L. Dybkjaer, and H. Dybkjaer, “Cooperativity in Human-machine and Human-human Spoken Dialogue,” Discourse Processes, Vol. 21, No. 2, pp. 213-236, 1996.
[2] N. Yankelovich, “Using Natural Dialogs as the Basis for Speech Interface Design,” chapter in Automated Spoken Dialog Systems, edited by Susann Luperfoy, MIT Press, 1998.
[3] B. Grosz and C. Sidner, “Attentions, Intentions and the Structure of Discourse,” Computational Linguistics, Vol. 12, No. 3, pp. 175-204, 1986.
[4] M. Walker and J. Moore, editors, “Empirical Studies in Discourse,” Computational Linguistics special issue, Vol. 20, No. 2, 1997.
[5] G. Flammia and V. Zue, “Empirical Evaluation of Human Performance and Agreement in Parsing Discourse Constituents in Spoken Dialogue,” Proc. European Conference on Speech Communication and Technology, pp. 1965-1968, Madrid, Spain, September 1995.
[6] J. Carletta, “Assessing Agreement on Classification Tasks: The Kappa Statistics,” Computational Linguistics, Vol. 22, No. 2, pp. 249-254, 1996.
[7] G. Flammia and V. Zue, “Learning the Structure of Mixed-initiative Dialogues using a Corpus of Annotated Conversations,” Proc. European Conference on Speech Communication and Technology, pp. 1871-1874, Rhodes, Greece, September 1997.
[8] W.W. Cohen, “Fast Effective Rule Induction,” Machine Learning: Proceedings of the 12th International Conference, 1995.
[9] G. Flammia. Corpus-based Discourse Segmentation of Spoken Dialogue. Ph.D. thesis, MIT Department of Electrical Engineering and Computer Science, June 1998.
Topic                 Customer Init.   Agent Init.   Turn of Response
List movies           67.8%            32.2%         4.5
Phone number          87.1%            12.9%         3.7
Show times            76.3%            23.7%         2.9
Where is it playing   79.4%            20.7%         4.0
Table 12. Distribution of segment initiatives by topics and average turn position of response.
Heterogeneous Acoustic Measurements and Multiple Classifiers for Speech Recognition
Andrew Halberstadt
Most automatic speech recognition systems
use a small set of homogeneous acoustic
measurements and a single classifier to
make acoustic-phonetic distinctions. We are
exploring the use of a large set of heteroge-
neous measurements and multiple classifiers
in order to improve phonetic classification.
There are several areas for innovative work
involved in implementing this approach.
First, a variety of acoustic measurements
need to be developed, or selected, from
those proposed in the literature. In the past,
different acoustic measurements have
generally been compared in a winner-takes-
all paradigm in which the goal is to select
the single best measurement set. In contrast
to this approach, we are interested in
making use of complementary information
in different measurement sets. In addition,
measurements have usually been evaluated
for their performance over the entire phone
set. In contrast, in this work, we explore the
notion that high-performance acoustic
measurements may be different across
different phone classes. Thus, heteroge-
neous measurements may be used both
within and across phone classes. Second,
methods for utilizing high-dimensional
acoustic measurement spaces need to be
proposed and developed. This problem will
be addressed through schemes for combin-
ing the results of multiple classifiers.
In the process of developing heteroge-
neous acoustic measurements, we focused
initially on stop consonants because of
evidence that their short-time burst
characteristics and rapidly changing
acoustics were poorly represented by
conventional homogeneous measurements
[1]. A perceptual experiment using only stop
consonants was performed in order to
facilitate comparative analysis of the types of
errors made by humans and machines. The
experiment was designed so that humans
could not make profitable use of
phonotactic or lexical knowledge. Figure 9
provides a summary of the results of these
experiments. The error rates are generally
high because the data set was deliberately
chosen to include some of the most
difficult-to-identify stops in our develop-
ment set. The machine systems are labelled
A, B, C, MMV, D, where A, B, and C are
three different context-independent systems,
MMV (Machine Majority Vote) is a system
that takes the 3-way majority vote answer
from A, B, and C, and D is a context-
dependent system. The perceptual results
from listeners are labelled PA (Perceptual
Average) and PMV (Perceptual Majority
Vote). The place of articulation identifica-
tion by machine is 2.2-11.2 times worse than
humans, whereas the voicing identification
is only 1.1-2.4 times worse. Our conclusion
is that the place of articulation identifica-
tion of automatic systems is an area that
requires significant improvement in order to
approach human-like levels of performance.
The second challenge is to develop
overall system architectures which can make
profitable use of a large number of acoustic
measurements. The fundamental challenge
of high-dimensional input spaces arises
because the quantity of training data
[Figure 9 appears here as three bar charts; percent error by task and system:]

Task                     PA     PMV    A      B      C      MMV    D
Stop identification      28.9   23.1   70.0   56.9   55.3   52.7   38.4
Place identification     6.3    2.2    24.7   22.9   22.2   18.6   14.1
Voicing identification   24.7   21.2   51.8   38.2   37.6   38.2   27.1

Figure 9. Human perception (PA, PMV) versus machine classification (A, B, C, MMV, D) in the tasks of stop identification, place of articulation identification, and voicing identification.
needed to adequately train a classifier grows
exponentially with the input dimensionality.
In one approach to the problem, multiple
classifiers trained from different measure-
ments can be arranged hierarchically. In this
scheme, the hierarchical structure empha-
sizes taking the task of phonetic classifica-
tion and breaking it down into subproblems
such as vowel classification and nasal
classification. Figure 10 illustrates the fact
that most classifier errors remain within the
same manner class, thus supporting the
“subproblem” approach of hierarchical
classification. Roughly speaking, if the first
stage of the hierarchy has high confidence
that a particular token is a nasal, then a
classifier tuned especially for nasals may
perform further processing. In [1], this
approach was developed and used to obtain
79.0% context-independent classification
on the TIMIT core test set. Alternatively,
multiple classifiers may be formed into
“committees”. Each committee member has
some influence on the final selection. In its
simplest form, the final choice could be
determined by popular vote of the classifier
committee. The performance of the MMV
(Machine Majority Vote) system in Figure 9,
which is the result of voting among systems
A, B, and C, and the PMV (Perceptual
Majority Vote) results are examples of
improved performance through the use of
voting. The ideas of classification according
to a hierarchy or by a committee are not
mutually exclusive, but rather can be
combined. Thus, one member of a commit-
tee could be a hierarchical classifier, or there
could be a hierarchy of committees.
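Both combination schemes can be sketched compactly; the fragment below is illustrative only, with hypothetical names standing in for the actual systems:

```python
# Illustrative sketch of committee voting and hierarchical
# classification; hypothetical names, not the actual systems.

from collections import Counter

def majority_vote(predictions):
    """Committee decision by popular vote over member predictions,
    as in the MMV (Machine Majority Vote) system."""
    return Counter(predictions).most_common(1)[0][0]

def hierarchical_classify(token, manner_classifier, specialists):
    """Two-stage hierarchy: decide the manner class first, then defer
    to a classifier tuned for that class (e.g. vowels or nasals)."""
    manner = manner_classifier(token)   # e.g. "nasal"
    return specialists[manner](token)   # nasal-specific phone decision
```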
In the future, we hope to narrow the
gap observed in perceptual experiments
between human and machine performance
in the task of place of articulation identifica-
tion. We plan to continue investigating
heterogeneous measurement sets and
developing a variety of ways of combining
those measurements into classification and
recognition systems.
Reference
[1] A.K. Halberstadt and J.R. Glass, “Heterogeneous Measurements for Phonetic Classification,” Proc. European Conference on Speech Communication and Technology, pp. 401-404, Rhodes, Greece, September 1997.
[Figure 10 appears here: a Reference-versus-Hypothesis bubble plot over the TIMIT phone labels, grouped into vowels/semivowels, nasals/flaps, strong fricatives, weak fricatives, and stops.]

Figure 10. Bubble plot of confusions in phonetic classification on TIMIT development set. Radii are linearly proportional to the error. The largest bubble is 5.2% of the total error.
The Use of Speaker Correlation Information for Automatic Speech Recognition
T. J. Hazen
Typical speech recognition systems perform
much better in speaker dependent (SD)
mode than they do in speaker independent
(SI) mode. This is a result of flaws in the
probabilistic framework and modeling
techniques used by today’s speech
recognizers. In particular, current SI
recognizers typically assume that all acoustic
observations can be considered indepen-
dent of each other. This assumption ignores
within-speaker correlation information which
exists between speech events produced by
the same speaker. Knowledge of the speaker
constraints imposed on the acoustic
realization of an utterance can be extremely
useful for improving the accuracy of a
recognition system.
To describe the problem mathematically, begin by letting $P$ represent a sequence of phonetic units. If $P$ contains $N$ different phones then let it be expressed as:

$$P = \{p_1, p_2, \ldots, p_N\} \qquad (1)$$

Here each $p_n$ represents the identity of one phone in the sequence. Next, let $X$ be a sequence of feature vectors which represent the acoustic information of an utterance. If $X$ contains one feature vector for each phone in $P$ then $X$ can be expressed as:

$$X = \{\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_N\} \qquad (2)$$

Given the above definitions, the probabilistic expression for the acoustic model is given as $p(X \mid P)$.
In order to develop effective and
efficient methods for estimating the acoustic
model likelihood, typical recognition
systems use a variety of simplifying assump-
tions. To begin, the general expression can
be expanded as follows:
$$p(X \mid P) = p(\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_N \mid P) = \prod_{n=1}^{N} p(\vec{x}_n \mid \vec{x}_1, \ldots, \vec{x}_{n-1}, P) \qquad (3)$$
At this point, speech recognition systems
almost universally assume that the acoustic
feature vectors are independent. With this
assumption the acoustic model is expressed
as follows:
$$p(X \mid P) = \prod_{n=1}^{N} p(\vec{x}_n \mid P) \qquad (4)$$
Because this is a standard assumption in most recognition systems, the term $p(\vec{x}_n \mid P)$ will be referred to as the standard acoustic model.
In Equation (3), the likelihood of a particular feature vector is deemed dependent on the observation of all of the feature vectors which have preceded it. In Equation (4), each feature vector $\vec{x}_n$ is treated as an independently drawn observation which is not dependent on any other observations, thus implying that no statistical correlation exists between the observations. What these two equations do not show is the net effect of making the independence assumption. Consider applying Bayes' rule to the conditional probability term in Equation (3). In this case the term can be rewritten as:

$$p(\vec{x}_n \mid \vec{x}_1, \ldots, \vec{x}_{n-1}, P) = p(\vec{x}_n \mid P)\, \frac{p(\vec{x}_1, \ldots, \vec{x}_{n-1} \mid \vec{x}_n, P)}{p(\vec{x}_1, \ldots, \vec{x}_{n-1} \mid P)} \qquad (5)$$
After applying Bayes' rule, the conditional probability expression contained in (3) is rewritten as a product of the standard acoustic model $p(\vec{x}_n \mid P)$ and a probability
ratio, which will be referred to as the consistency
ratio. The consistency ratio is a multiplica-
tive factor which is ignored when the
feature vectors are considered independent.
It represents the contribution of the
correlations which exist between the feature
vectors.
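Multiplying Equation (5) over all $n$ makes the net effect explicit; the restatement below follows directly from Equations (3) and (5):

$$p(X \mid P) = \left[ \prod_{n=1}^{N} p(\vec{x}_n \mid P) \right] \left[ \prod_{n=1}^{N} \frac{p(\vec{x}_1, \ldots, \vec{x}_{n-1} \mid \vec{x}_n, P)}{p(\vec{x}_1, \ldots, \vec{x}_{n-1} \mid P)} \right]$$

The first bracket is the standard acoustic model of Equation (4); the second is the product of consistency ratios that the independence assumption silently sets to one.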
The purpose of this dissertation is to
examine the assumptions and modeling
techniques that are utilized by SI recogni-
tion systems and to propose novel modeling
techniques to account for the speaker
constraints which are typically ignored. To
this end, this thesis has examined two
primary approaches: speaker adaptation and
consistency modeling. The goal of speaker
adaptation is to alter the standard acoustic
models represented by the expression $p(\vec{x}_n \mid P)$ so as to match the current test speaker as
closely as possible. The goal of consistency
modeling is to estimate the contribution of
the consistency ratio, which is typically
ignored when the independence of observa-
tions assumption is made.
Speaker clustering provides one of the
most effective techniques used by speaker
adaptation algorithms. This thesis examines
several different approaches to speaker
clustering. These techniques are reference
speaker weighting, hierarchical speaker
clustering, and speaker cluster weighting.
These methods examine various
approaches for utilizing and
combining acoustic model parameters
trained from different speakers or speaker
clusters. For example, the hierarchical
speaker clustering used in this thesis
examines the use of gender dependent
models as well as gender and speaking rate
dependent models.
Consistency modeling is a novel
recognition technique for accounting for
the correlation information which is
generally ignored when each acoustic
observation is considered independent. The
key idea of consistency modeling is that the
contribution of the consistency ratio must
be estimated. Using several simplifying
assumptions, the estimation of the consis-
tency ratio can be reduced to the problem
of estimating the mutual information
between pairs of acoustic observations.
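Under a joint-Gaussian assumption the mutual information between two observations has a closed form; the sketch below (my illustration for the scalar case, not the thesis implementation) shows one such estimate:

```python
import numpy as np

def gaussian_mutual_information(x, y):
    """Estimate I(X;Y) in nats for two scalar features under a
    joint-Gaussian assumption: I = -0.5 * ln(1 - rho^2), where rho
    is the sample correlation coefficient."""
    rho = np.corrcoef(x, y)[0, 1]
    return -0.5 * np.log(1.0 - rho ** 2)
```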
The various techniques have
been evaluated on the DARPA Resource
Management recognition task [1] using the
SUMMIT speech recognition system [2]. The
algorithms were tested on the task of
instantaneous adaptation. In other words,
the methods attempt to adapt to the same
utterance which the system is trying to
recognize. The results are tabulated in Table
13 with respect to the baseline SI system.
The results include experiments where
speaker adaptation or clustering techniques
are used in conjunction with consistency
modeling in order to combine their
strengths. The results indicate that signifi-
cant performance improvements are
possible when speaker correlation informa-
tion is accounted for within the framework
of a speech recognition system.
Table 13. Summary of recognition results using various instantaneous adaptation techniques including reference speaker weighting (RSW), gender dependent modeling (GD), gender and speaking rate dependent modeling (GRD), speaker cluster weighting (SCW), and consistency modeling (CM).
Adaptation Method   Word Error Rate   Error Rate Reduction
SI                  8.6%              -
SI+RSW              8.0%              6.5%
SI+CM               7.9%              8.2%
SI+RSW+CM           7.7%              10.0%
GD                  7.7%              10.5%
GD+CM               7.1%              17.6%
GRD                 7.2%              16.4%
GRD+CM              6.8%              20.3%
SCW                 6.9%              18.9%
SCW+CM              6.8%              21.1%
References
[1] W. Fisher, “The DARPA Task Domain Speech Recognition Database,” Proc. DARPA Speech Recognition Workshop, pp. 105-109, San Diego, CA, March 1987.
[2] J. Glass, J. Chang, and M. McCandless, “A Probabilistic Framework for Feature-based Speech Recognition,” Proc. International Conference on Spoken Language Processing, pp. 2277-2280, Philadelphia, PA, 1996.
[3] T. Hazen, “A Comparison of Novel Techniques for Instantaneous Speaker Adaptation,” Proc. European Conference on Speech Communication and Technology, pp. 1883-1886, Rhodes, Greece, 1997.
[4] T. Hazen. The Use of Speaker Correlation Information for Automatic Speech Recognition. Ph.D. thesis, MIT Department of Electrical Engineering and Computer Science, January 1998.
The Mole: A Robust Framework for Accessing Information from the World Wide Web
Hyung-Jin Kim
Although many people have labeled the
World Wide Web as the largest database
ever created, very few applications have been
able to use the web as a database. This is
because the web is dynamic: web pages
change constantly, sometimes on a daily
basis. I propose a system called the “Mole”
that aims to solve this problem by providing
a semantic interface into the web. The
semantic interface uses the semantic
content on web pages to map very high-level
concepts, such as “weather reports for
Boston” to low-level requests for data (such
as getting the text in the third ‘A’ tag in a
web page). Therefore, even though web
pages change, the Mole will still be able to
find information on them.
The Mole will robustly access a web
page by taking advantage of the topology of
its underlying HTML. When web pages get
updated, the information that is presented
usually retains the same structure. For
example, when the CNN Weather Data site
changed in November of 1997, its facade
changed, but it still continued to present
the same information. CNN still presented
data about the current conditions of a city
and it still gave a four-day forecast. Further-
more, although the HTML structure of this
new page was drastically different, the
weather information was still grouped in
the same way (i.e. high and low tempera-
tures were still presented next to each
other).
The Mole uses semantic templates to
access information from web pages. In the
weather example, to gather all of the 4-day
forecasts of a city, the template in Figure 11
is used. The Mole takes this template and
matches it to the data on the web page. This
template essentially drills down through
high-level concepts presented on the web
page. First, it finds a “day” word, e.g.,
“Monday” on the web page and then it tries
to find the words “low” and “high” that are
associated with that word. Finally, it finds
the integers that are most closely located to
the words “low” and “high”. Since this
semantic template is abstracting away from
the HTML structure, this template would
have found the same temperature informa-
tion before and after the change (see Figure
12). Notice that this template follows what a
human does to gather the same informa-
tion: first, he searches for a specific day and
then he searches for the temperatures
besides the words “high” and “low”.
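As a concrete illustration of this drilling-down process, the sketch below matches the template against tokenized page text. All names are hypothetical, and the “near” relationship is simplified to “first integer following the keyword”; the actual Mole operates on HTML with its taxonomy and relationship descriptors.

```python
# Illustrative sketch of the Figure 11 template applied to tokenized
# page text; hypothetical names, not the actual Mole code.

DAYS = {"Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"}

def next_integer(tokens, index):
    """Simplification of the 'near' relationship: take the first
    integer token following the keyword."""
    for tok in tokens[index + 1:]:
        if tok.isdigit():
            return int(tok)
    return None

def match_forecasts(tokens):
    """Find a day word, then the 'low' and 'high' words associated
    with it, then the integers located near each."""
    forecasts = []
    for i, tok in enumerate(tokens):
        if tok in DAYS:
            low = high = None
            # Look within a small window after the day word (heuristic).
            for j in range(i + 1, min(i + 12, len(tokens))):
                word = tokens[j].strip(":").lower()
                if word == "low":
                    low = next_integer(tokens, j)
                elif word == "high":
                    high = next_integer(tokens, j)
            forecasts.append((tok, low, high))
    return forecasts
```

For example, match_forecasts("Monday High: 95 Low: 55".split()) returns [("Monday", 55, 95)].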
In order to make use of semantic
templates, the Mole will require the
following facilities: a taxonomy of data
descriptors and a library of relationship
descriptions. A taxonomy of data descrip-
tors is used to describe all possible data or
recognizable features on a web page. In our
weather template, we used the names
“integer” and “day” to describe the data we
are looking for. In order for the Mole to
access many different types of web pages, a
large library of data types needs to be
created. One can imagine extending this
taxonomy to incorporate concepts of
“state”, “country”, and “car_name”. This
taxonomy can be hierarchical in that a
semantic idea can be built on top of other
semantic ideas, making them highly scalable
and re-usable. A library of relationship
Figure 11. Semantic template for CNN weather: a Day node encapsulates Word("low") and Word("high") nodes, each of which is near an Integer node.
descriptors describes all the ways in which
features of a web page can relate to each
other. Descriptors such as “near” and
“on_top_of” are simple examples of
relationship descriptors. More complicated
descriptors include “encapsulate” which not
only define how one datum is positioned
relative to another, but also how the fonts of
each datum are related to each other (words
with large, bolded fonts encapsulate smaller
fonted words following them).
The Mole is potentially a very robust
and simple interface for applications to
access the web. By “lifting” semantic
concepts found on a web page away from
the HTML structure, the Mole will be able
to gather information from web pages even
when these pages change. In many ways,
semantic templates attempt to mimic what a
Figure 12. Mapping of the semantic template to two versions of the weather page (note: not necessarily the CNN weather page).
human does to find information. By using
concepts instead of HTML tags to find
information, the Mole is using web pages as
they were meant to be used: by the human
eye.
[Figure 12 appears here: the Day/Word/Integer template matched against two page layouts, e.g. "Monday High: 95 Low: 55" versus "Monday High: Low: 95 65".]
Sublexical Modelling for Word-Spotting and Speech Recognition using ANGIE
Raymond Lau
In this work, we introduce and explore a
novel framework, ANGIE, for modelling
subword lexical phenomena in speech
recognition. Our framework provides a
flexible and powerful mechanism for
capturing morphology, syllabification,
phonology and other subword effects in a
hierarchical manner which maximizes the
sharing of subword structures. We hope
that such a system can provide a single
unified probabilistic framework for model-
ling phonological variation and morphol-
ogy. Many current systems handle phono-
logical variations either by having a pronun-
ciation graph (such as in MIT's SUMMIT
system) or by implicitly absorbing the
variations into the acoustic modelling. The
former has the disadvantage of not sharing
common subword structure, hence splitting
training data. The latter masks the process
of handling phonological variations and
makes the process difficult to control and to
improve upon. For example, in the ATIS
domain, the words "fly," "flying," "flight,"
and "flights" all share the common initial
phoneme sequence f l ay, so presumably,
phonological variations affecting this
sequence can be better learned if examples
from all four words were pooled together.
Our system does just that. The sharing of
subword structure will hopefully facilitate
the search process and also make it easier to
deal with new, out-of-vocabulary, words. By
pursuing merged common subword theories
during search, we can mitigate the combina-
torial explosion of the search tree, making
large vocabulary recognition more manage-
able. Because we expect new words to share
much common subword structure with
words in our vocabulary, we can easily add
new words dynamically, allowing them to
adopt existing subword structures. In
principle, we can even detect the occurrence
of out-of-vocabulary words by recognizing as
much of the subword structure as we can in
a bottom up manner.
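The pooling idea can be made concrete with a small sketch (hypothetical data layout, not the actual ANGIE implementation): every word contributes its phoneme-prefix observations to shared pools, so "fly," "flying," "flight," and "flights" all train the same f l ay models.

```python
# Sketch of pooling training observations across words with shared
# subword structure; hypothetical data layout, not the ANGIE code.

from collections import defaultdict

LEXICON = {
    "fly":     ["f", "l", "ay"],
    "flying":  ["f", "l", "ay", "ih", "ng"],
    "flight":  ["f", "l", "ay", "t"],
    "flights": ["f", "l", "ay", "t", "s"],
}

def pool_prefix_examples(observations):
    """Group (word, realization) training pairs under every phoneme
    prefix of the word, so words sharing structure share data.
    E.g. pool_prefix_examples(data)[("f", "l", "ay")] collects
    examples from all four words above."""
    pools = defaultdict(list)
    for word, realization in observations:
        phonemes = LEXICON[word]
        for k in range(1, len(phonemes) + 1):
            pools[tuple(phonemes[:k])].append((word, realization))
    return pools
```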
We are using the ANGIE framework to
model constraints at the subword level.
Within our framework, subword structure is
modeled via a context-free grammar and a
probability model. The grammar generates a
layered structure very similar to that
proposed by Meng [1]. An example of an
ANGIE parse tree is shown in Figure 13. Our
work attempts to validate the feasibility of
using the framework for speech recognition
by demonstrating its effectiveness in three
recognition tasks, phonetic recognition,
word-spotting and continuous speech
recognition. We also explore the combina-
tion of ANGIE with a natural language
understanding system, TINA, that is also
based on a context-free grammar and hence
can be more easily integrated into our ANGIE-
based system as compared to a more
traditional recognition framework. Finally,
we conclude with two pilot studies, one
attempting to leverage off the ANGIE
subword structural information for prosodic
modelling and the other exploring the
addition of new words to the recognition
vocabulary in real time.
Our first demonstration of recognition
with ANGIE was a system for forced phone-
mic/phonetic/acoustic alignment and
phonetic recognition as described in greater
detail in [2]. In this system, we perform a
bottom up best first search over possible
phone strings, incorporating the acoustic
score of each phone along with the score of
the best ANGIE parse for the path up to that
phone. Phonetic recognition results
obtained have been promising, with the
ANGIE-based system achieving a 36.1% error
[Figure 13 appears here: an ANGIE parse tree with layers labeled Morphology, Syllabification, Phonemics, and Phonetics.]

Figure 13. A sample parse tree for the phrase "I'm interested."
rate as compared to a phone bigram
baseline system with a 39.8% error rate on
ATIS data. The improvement was due
roughly equally to improved phonological
modelling and the more powerful longer
distance constraints made possible with
ANGIE's upper layers.
Our second demonstration was the
implementation of an ANGIE-based system
for word spotting. Our test case was word
spotting the city names in the ATIS corpus.
We have successfully implemented the
wordspotter with competitive performance.
We have also conducted several experiments
varying the nature of the subword con-
straints on the filler model within the
wordspotter. The constraint sets experi-
mented with ranged from simple phone
bigram, to syllables, to full word recogni-
tion. The results showed that, as expected,
the inclusion of more constraints on the
filler led to improved word-spotting perfor-
mance. On our test set, the system had a
FOM of 89.3 with full word recognition,
87.7 with syllables, and 85.3 with phone
bigrams. Surprisingly, speed tended to
improve with FOM performance. We
believe the explanation is that more
constraints lead to a less bushy search. More
details of our work in word-spotting can be
found in [3].
For our final feasibility test, we have
implemented a continuous speech recogni-
tion system based on the ANGIE framework.
Our recognizer, employing a word bigram,
achieves the same level of performance as
the our SUMMIT baseline system with a word
bigram (18.8% word error rate vs. 18.9%).
In both cases, context-independent acoustic
models were used. Because ANGIE is based
on a context-free grammar framework, we
have experimented with integrating our TINA
natural language understanding system (also based on a context-free framework) with
ANGIE, resulting in a single, coupled search
strategy. The main challenge with the
integrated system was in curtailing the
computational requirements of supporting
robust parsing. We settled on a greedy
strategy described in greater detail in [4].
With the combined system, the word error
rate declines to 14.8%. We have also
attempted TINA resorting of SUMMIT N-best
lists in an effort to separate the benefits of
an integrated search strategy from those of
bringing in the powerful TINA language
model. That experiment yielded only
marginal improvement over the word
bigram, suggesting that the tightly coupled
search can lead to a gain not attainable
when the recognition and NL understand-
ing processes are separated and interfaced
through an N-best list, generated without
the use of information from TINA.
Finally, we conducted two pilot studies
exploring problems for which we believe the
ANGIE-based framework will exhibit advan-
tages. The first pilot study examines the
ability to add new words to the recognition
vocabulary in real time, that is, without
requiring extensive retraining of the lexical
models. We believe that, because of ANGIE's
hierarchical structure, new words added to
the vocabulary can share lexical subword
structures with existing words in the
vocabulary. For this study, we simulated the
appearance of new words by artificially
removing the city names that only appear in
ATIS-3, that is, city names which did not
appear in ATIS-2. These city names were then
considered the new words in our system.
For the baseline comparison, we added the
words to a similarly reduced SUMMIT
recognizer and assigned zero to their lexical
arc weights in the pronunciation graph. In
the ANGIE case, we allowed ANGIE to general-
ize probabilities learned from other words
with similar word substructures. In both
cases, the word level bigram model used a
class bigram, with uniform probabilities
distributed over all city names, including
the simulated new words. Both the baseline
and ANGIE systems achieved the same word
error rate, 19.2%. This represents a slight
decrease from a system trained with full
knowledge of the simulated new words.
Apparently, the lack of lexical training did
not adversely impact recognition perfor-
mance much with our set of simulated new
words. It is unclear whether ANGIE would
show an improvement over the baseline for
a different choice of new words. We do
note, however, that for the artificially
reduced system, without the simulated new
words in the vocabulary, the ANGIE-based
system achieves a 31.2% error rate as
compared to a 34.2% error rate for the
baseline SUMMIT system, suggesting that
ANGIE is more robust in the presence of
unknown words.
For our other pilot study, we attempted
to leverage the word substructure informa-
tion provided by ANGIE for prosodic
modelling. Our experiment, conducted in
conjunction with our colleague Grace
Chung, was to implement a hierarchical
duration model based on the ANGIE parse
tree and to incorporate the duration score
into our recognition search process. We
evaluated the duration model in the context
of our ANGIE-based word-spotting system. Its
inclusion increased the FOM from 89.3 to
91.6, leading us to conclude that the ANGIE
subword structure information can indeed
be used for improved prosodic modelling,
minimally in terms of duration.
We believe that our work demonstrates
the feasibility of using ANGIE as a competitive
lexical modelling framework for various
speech recognition systems. Our experience
with word-spotting shows that ANGIE
provides a platform where it is easy to alter
subword constraints. Our success at NL
integration for improved recognition
suggests that a context-free framework has
several advantages. Finally, our pilot study
in prosodic modelling suggests that ANGIE's
subword structuring information can be
leveraged to provide improved performance.
References
[1] H.M. Meng. Phonological Parsing for Bi-directional Letter-to-Sound/Sound-to-Letter Generation. Ph.D. thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, June 1995.
[2] S. Seneff, R. Lau, and H. Meng, “ANGIE: A New Framework for Speech Analysis Based on Morpho-Phonological Modelling,” Proc. ICSLP '96, Philadelphia, PA, pp. 225-228, October 1996. (Available online at http://www.raylau.com/icslp96_angie.pdf)
[3] R. Lau and S. Seneff, “Providing Sublexical Constraints for Word Spotting within the ANGIE Framework,” Proc. Eurospeech '97, Rhodes, Greece, pp. 263-266, September 1997. (Available online at http://www.raylau.com/angie/eurospeech97/main.pdf)
[4] R. Lau. Subword Lexical Modelling for Speech Recognition. Ph.D. thesis, MIT Department of Electrical Engineering and Computer Science, Cambridge, MA, May 1998. (Available online at http://www.raylau.com/thesis/thesis.pdf)
Probabilistic Segmentation for Segment-Based Speech Recognition
Steven Lee
The objective of this research is to develop a
high-quality real-time probabilistic segmenta-
tion algorithm for use with SUMMIT, a
segment-based speech recognition system
[1]. Until recently, SUMMIT used a segmenta-
tion algorithm based on acoustic change.
This algorithm was adequate, but produced
segment graphs that were denser than
necessary because a low acoustic change
threshold was needed to ensure segment
boundaries not marked by sharp acoustic
change were also included. Recently, Chang
developed an approach to segmentation that
uses a Viterbi and a backwards A* search to
produce a phonetic graph in the same
manner as word graph production [2, 3].
This algorithm achieved an 11.4% decrease in
phonetic recognition error rate while
hypothesizing half the number of segments
of the acoustic segmentation algorithm.
While the results of this approach are
promising, it has two drawbacks that keep it
from widespread use in practical speech
recognition systems. The first is that the
algorithm cannot run in real-time because it
requires a complete forward Viterbi search
followed by a backward A* search. The
second is that the algorithm requires
enormous computational power since the
search is performed at the frame level. This
research seeks to develop a search algorithm
that produces a segment network in a
pipelined, left-to-right mode. It also aims to
reduce the computational requirements.
The approach being adopted in this
research is to introduce a simplified search
framework and to shrink the search space.
The new search framework, a frame-based
Viterbi search that does not utilize a
segment graph, is attractive for probabilistic
segmentation because of its simplicity and
its relatively low computational require-
ments. Although work on using this search
to produce a segment graph is ongoing,
preliminary results using this search on
phonetic recognition resulted in a competi-
tive error rate of 30.3% [4]. Since recogni-
tion performance should be somewhat
correlated to the quality of the segment
graph produced, this is a promising result.
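The attraction of this framework is how little machinery it needs. A minimal frame-level Viterbi sketch follows (illustrative only; hypothetical score arrays, no pruning, and unit indices standing in for the real lexical units). Segment boundaries fall where the best label changes between frames:

```python
import numpy as np

def viterbi_labels(log_obs, log_trans):
    """Frame-based Viterbi sketch without a segment graph.

    log_obs   -- (T, U) array: log p(frame_t | unit_u)
    log_trans -- (U, U) array: log transition score from unit i to j
    Returns the best unit index per frame; boundaries fall where
    the label changes between consecutive frames.
    """
    T, U = log_obs.shape
    delta = np.empty((T, U))            # best path score ending in unit u
    back = np.zeros((T, U), dtype=int)  # backpointers
    delta[0] = log_obs[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans   # (U, U): i -> j
        back[t] = np.argmax(scores, axis=0)
        delta[t] = scores[back[t], np.arange(U)] + log_obs[t]
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):        # trace back from the final frame
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```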
The size of the search space in probabi-
listic segmentation is bounded by time in
one dimension and by the number of
phonetic units in another dimension. Both
dimensions can be shrunk to provide
computational savings. This research will
investigate shrinking the time dimension by
using landmarks instead of frames. It will
also investigate the use of broad classes to
shrink the search space along the lexical
dimension.
The domains being used for this work
are TIMIT and JUPITER [5], a telephone-
based weather information domain.
References
[1] J. Glass, J. Chang, and M. McCandless, “A Probabilistic Framework for Feature-based Speech Recognition,” Proc. International Conference on Spoken Language Processing, pp. 2277-2280, Philadelphia, PA, October 1996.
[2] J. Chang and J. Glass, “Segmentation and Modeling in Segment-based Recognition,” Proc. European Conference on Speech Communication and Technology, pp. 1199-1202, Rhodes, Greece, September 1997.
[3] I. Hetherington, M. Phillips, J. Glass, and V. Zue, “A* Word Network Search for Continuous Speech Recognition,” Proc. European Conference on Speech Communication and Technology, pp. 1533-1536, Berlin, Germany, September 1993.
[4] S. Lee. Probabilistic Segmentation for Segment-based Speech Recognition. M.Eng. thesis, MIT Department of Electrical Engineering and Computer Science, June 1998.
[5] V. Zue et al., “From Interface to Content: Translingual Access and Delivery of On-line Information,” Proc. European Conference on Speech Communication and Technology, pp. 2227-2230, Rhodes, Greece, September 1997.
A Model for Interactive Computation: Applications to Speech Research
Michael McCandless
Although interactive tools are extremely
valuable for progress in speech research, the
programming techniques required to
implement them are often difficult to
master and apply. There are numerous
interface toolkits which facilitate implemen-
tation of the user-interface, but these tools
still require the programmer to build the
tool’s back end by hand. The goal of this
research is to create a programming
environment which simplifies the process of
building interactive tools by automating the
computational details of providing
interactivity.
Interactive tools engage their users in a
dialogue, effectively allowing the user to ask
questions and receive answers. Questions
are typically asked by interacting with the
tool’s interface via direct manipulation. I
propose a set of metrics which may be used
to measure the extent of a tool’s
interactivity: rapid response (does the tool
answer the user’s question as quickly as
possible); high coverage (is the user able to
ask a wide range of questions); adaptability
(does the tool adapt to varying computation
environments); scalability (can the tool
manage both large and small inputs);
pipelining (does the tool provide the answer
in pieces over time for computations that
take a long time); backgrounding (is the
user able to ask other questions while an
answer is being computed). I refer to a tool
which can meet these stringent require-
ments as a “finely interactive tool”. These
dimensions provide metrics for measuring
and comparing the interactivity of different
tools.
Based on these requirements for
interactivity, I have designed a declarative
computation model for specifying and
implementing interactive computation. In
order to evaluate the effectiveness of the
model, I have incorporated it into a speech
toolkit called MUSE [1,2]. MUSE contains
numerous components allowing a program-
mer to quickly construct finely interactive
tools. MUSE is implemented in the Python
programming language with extensions in
C. A Python interface to the Tk widget set is
used for interface design and layout.
The programmer specifies computation
in MUSE differently from existing impera-
tive programming languages. Like existing
languages, the programmer builds a MUSE
program by applying functions to strongly-
typed values. However, in MUSE, the
programmer does not have detailed control
over when the computations actually take
place, nor over when and where intermedi-
ate results are stored; instead, the program-
mer declares the functional relationships
among a collection of MUSE values. The
MUSE system records these relationships,
constructs a run-time acyclic dependency
graph, and then chooses when to compute
which values.
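As a rough sketch of this style of programming (the names below are hypothetical, not MUSE's actual API), the programmer declares relationships and leaves all scheduling to the system:

```python
class Value:
    """A node in the dependency graph: a source or a derived value."""
    def __init__(self, fn=None, inputs=()):
        self.fn = fn                    # None for source values
        self.inputs = list(inputs)
        self.dependents = []            # maintained by the system
        for inp in self.inputs:
            inp.dependents.append(self)

def apply_fn(fn, *inputs):
    """Declare that a new value is fn of the given inputs."""
    return Value(fn, inputs)

# Stand-in computations; the programmer never says *when* these run.
def compute_stft(wave): return [[abs(x)] for x in wave]
def render_image(stft): return len(stft)

waveform = Value()                       # a source value
stft = apply_fn(compute_stft, waveform)  # declared, not yet computed
image = apply_fn(render_image, stft)
```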
MUSE’s data-types also differ from
existing programming languages. The
specification for each data-type, for example
a waveform, image or graph, includes
provisions for incremental change: every
data-type is allowed to change in certain
ways. For example, images may change by
replacing the set of pixels within a specified
rectangular area. When a value changes at
run-time, MUSE will consult the depen-
dency graph, and will then take the
necessary steps to bring all dependents of
that value up to date with the new change.
These unique properties of MUSE free the
programmer from dealing with many of the
complex computational aspects of providing
interactivity.
Because the programmer relinquishes
control over the details of how values are
computed, the MUSE run-time system must
make such choices. While there are many
ways to implement this, the technique used
by MUSE is based on purely lazy evaluation
plus caching. When values are changed, a
synchronous depth-first search is performed,
notifying all impacted values of the change.
Values are computed entirely on-demand,
and are then cached away according to the
program. For example, if the user is looking
at a spectrogram, only the portion of the
image they are actually looking at will be
computed, which requires a certain range of
the STFT, which in turn requires only a
certain range of the input waveform.
This implementation choice affects all
of the built-in functions; the implementa-
tion of these functions, in both Python and
C, must “force” the evaluation of any inputs
that they need, but only in response to their
output being forced. Further, any incremen-
tal change on an input to the function must
be propagated as an incremental change on
the function’s output.
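Extending the sketch above, lazy evaluation plus caching, with synchronous depth-first change notification, might be rendered minimally as follows (again hypothetical, not MUSE's implementation):

```python
class LazyValue:
    """A dependency-graph node computed on demand and cached."""
    def __init__(self, fn=None, inputs=()):
        self.fn, self.inputs = fn, list(inputs)
        self.dependents, self.cache, self.dirty = [], None, True
        for inp in self.inputs:
            inp.dependents.append(self)

    def set(self, data):
        """Change a source value; synchronously notify dependents."""
        self.cache, self.dirty = data, False
        for dep in self.dependents:
            dep.invalidate()

    def invalidate(self):
        """Depth-first notification that an input has changed."""
        if not self.dirty:
            self.dirty = True
            for dep in self.dependents:
                dep.invalidate()

    def force(self):
        """Compute entirely on demand, forcing inputs only as needed."""
        if self.dirty:
            self.cache = self.fn(*(inp.force() for inp in self.inputs))
            self.dirty = False
        return self.cache

wave = LazyValue()
stft = LazyValue(lambda w: [x * x for x in w], inputs=[wave])  # stand-in STFT
wave.set([1, 2, 3])
print(stft.force())        # [1, 4, 9] -- computed now, then cached
wave.set([2, 3, 4])        # marks stft dirty; nothing recomputed until forced
```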
In order to effectively test the
interactivity of MUSE, I have added many
necessary speech functions and data-types.
The functions include waveform
preemphasis, (short-time) Fourier trans-
forms, linear-predictive analysis, Cepstral
analysis, energy, word-spotting lexical access, and mixture diagonal Gaussian training. The
datatypes include waveforms, spectra,
graphs, tracks, time marks and cursors, and
images. Each data-type has an associated
cache which the programmer may use to
easily control the extent of storage of
intermediate results, as well as a visual
function, which translates the data-type into
an appropriate image.
I have constructed four example tools which illustrate the unique capabilities of the MUSE toolkit.

Figure 14. A screen shot of an interactive lexical access tool. The user is able to edit the phonetic transcription, and with each change the word transcription is updated in real time to reflect the allowed word alignments according to the TIMIT pronunciation lexicon. The tool demonstrates the unique nature of MUSE's incremental computation model.

The first tool is a basic
speech analysis tool showing a waveform,
spectrogram and transcription, which allows
the user to modify the alignment of
individual frames of the STFT by directly
editing time marks, and then see the impact
on the spectrogram image. The second tool
displays three overlaid spectral slices (FFT,
LPC, and CEPSTRUM), and allows the
user to change all aspects of the computa-
tion. The third tool illustrates the process of
training a diagonal Gaussian mixture model
on one-dimensional data, allowing the user
to vary many of the parameters affecting the
training process. The final tool is a lexical
access tool, allowing the user to phonetically
transcribe an utterance and then see the
corresponding potential word matches.
Figure 14 shows a screen-shot of this tool.
The properties of MUSE’s incremental
computation model are reflected in the high
degree of interactivity each of these tools
offers the user; MUSE’s run-time model is
able to effectively carry out the require-
ments of interactivity.
References
[1] M. McCandless and J. Glass, "MUSE: A Scripting Language for the Development of Interactive Speech Analysis and Recognition Tools," Proc. European Conference on Speech Communication and Technology, Rhodes, Greece, September 1997.
[2] M. McCandless, A Model for Interactive Computation: Applications to Speech Research, Ph.D. thesis, MIT Department of Electrical Engineering and Computer Science, June 1998.
Subword Approaches to Spoken Document Retrieval
Kenney Ng
As the amount of accessible data continues
to grow, the need for automatic methods to
process, organize, and analyze this data and
present it in human usable form has
become increasingly important. Of particu-
lar interest is the problem of efficiently
finding “interesting” pieces of information
from the growing collections and streams of
data. Much research has been done on the
problem of selecting “relevant” items from
large collections of text documents given a
query or request from a user. Only recently
has there been work addressing the retrieval
of information from other media such as
images, video, audio, and speech. Given the
growing amounts of spoken language data,
such as recorded speech messages and radio
and television broadcasts, the development
of automatic methods to index, organize,
and retrieve spoken documents will become
more important.
In our work, we are investigating the
feasibility of using subword unit indexing
terms for spoken document retrieval as an
alternative to words generated by either
keyword spotting or word recognition. The
investigation is motivated by the observa-
tion that word-based retrieval approaches
face the problem of either having to know
the keywords to search for a priori, or
requiring a very large recognition vocabu-
lary in order to cover the contents of
growing and diverse message collections.
The use of subword units in the recognizer
constrains the size of the vocabulary needed
to cover the language; and the use of
subword unit indexing terms allows for the
detection of new user-specified query terms
during retrieval.
We explore a range of subword unit
indexing terms of varying complexity
derived from phonetic transcriptions. The
basic underlying unit is the phone; more
and less complex units are derived by
varying the level of detail and the sequence
length of these units. Labels of the units
range from specific phones to broad
phonetic classes obtained via hierarchical
clustering. Automatically derived fixed- and
variable-length sequences ranging from one
to six units long are examined. Also,
sequences with and without overlap are
explored. In generating the subword units,
each message/query is treated as one long
phone sequence with no word or sentence
boundary information.
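A minimal sketch of how such indexing terms might be generated from a phone sequence (the term format and phone labels are illustrative):

```python
def subword_terms(phones, n=3, overlap=True):
    """Fixed-length phone-sequence indexing terms. With overlap, a
    sliding window is used; without, consecutive disjoint chunks."""
    step = 1 if overlap else n
    return ["_".join(phones[i:i + n])
            for i in range(0, len(phones) - n + 1, step)]

phones = "b aa s t ax n w eh dh er".split()       # toy phone stream
print(subword_terms(phones, n=3))                 # overlapping terms
print(subword_terms(phones, n=3, overlap=False))  # non-overlapping terms
```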
The speech data used in this work
consists of recorded FM radio broadcasts of
the NPR “Morning Edition” news show.
The training set for the speech recognizer
consists of 2.5 hours of clean speech from 5
shows while the development set consists of
one hour of data from one show. The
spoken document collection is made up of
12 hours of speech from 16 shows parti-
tioned into 384 separate news stories. In
addition, a set of 50 natural language text
queries and associated relevance judgments
on the message collection are created to
support the retrieval experiments.
Phonetic recognition of the data is
performed with the MIT SUMMIT speech
recognizer. It is a probabilistic segment-
based approach that uses context-indepen-
dent segment and context-dependent
boundary acoustic models. A two-pass
search strategy is used during recognition. A
forward Viterbi search is performed using a
statistical bigram language model followed
by a backwards A* search using a higher
order statistical n-gram language model.
Information retrieval is done using a
standard vector space approach. In this
model, the documents and queries are
represented as vectors where each compo-
nent is an indexing term. The terms are
weighted based on the term’s occurrence
statistics both within the document and
across the collection. A normalized inner
product similarity measure between
document and query vectors is used to score
and rank the documents during retrieval.
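A compact sketch of this retrieval model follows; standard tf-idf weighting stands in for whatever particular weighting scheme was used, and the terms are toy examples:

```python
import math
from collections import Counter

def weight(terms, df, n_docs):
    """Weight each term by its within-document count and its rarity
    across the collection (a standard tf-idf form)."""
    tf = Counter(terms)
    return {t: tf[t] * math.log(n_docs / df[t]) for t in tf if t in df}

def score(query_vec, doc_vec):
    """Normalized inner product between query and document vectors."""
    dot = sum(w * doc_vec.get(t, 0.0) for t, w in query_vec.items())
    nq = math.sqrt(sum(w * w for w in query_vec.values()))
    nd = math.sqrt(sum(w * w for w in doc_vec.values()))
    return dot / (nq * nd) if nq and nd else 0.0

docs = {"d1": ["b_aa_s", "aa_s_t", "s_t_ax"], "d2": ["w_eh_dh", "eh_dh_er"]}
df = Counter(t for terms in docs.values() for t in set(terms))
vecs = {d: weight(t, df, len(docs)) for d, t in docs.items()}
q = weight(["b_aa_s", "aa_s_t"], df, len(docs))
print(sorted(vecs, key=lambda d: score(q, vecs[d]), reverse=True))  # ['d1', 'd2']
```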
We perform a series of experiments to
measure the ability of the different subword
units to perform effective spoken document
retrieval. A baseline text retrieval run is
performed using word-level text transcrip-
tions of the spoken documents and queries.
This is equivalent to using a perfect word
recognizer to transcribe the speech messages
followed by a full-text retrieval system.
An upper bound on the performance of
the different subword unit indexing terms is
obtained by running retrieval experiments
using phonetic expansions of the words in
the messages and queries obtained via a
pronunciation dictionary. We find that
many of the subword unit indexing terms
are able to capture enough information to
perform effective retrieval. With the
appropriate subword units it is possible to
achieve performance comparable to that of
text-based word units if the underlying
phonetic units are recognized correctly.
We next examine the retrieval perfor-
mance of the subword unit indexing terms
derived from errorful phonetic transcrip-
tions created by running the phonetic
recognizer on the entire spoken document
collection. From this experiment, we find
that although performance is worse for all
units when there are phonetic recognition
errors, some subword units can still give
reasonable performance even before the use
of any error compensation techniques such
as approximate term matching.
We then attempt to improve retrieval
performance by exploring “robust” indexing
and retrieval approaches which take into
account and try to compensate for the
speech recognition errors introduced into
the spoken document collection. We look at
two approaches. One involves modifying the
query representation to include additional
approximate match terms; the main idea is
to include terms that are likely to be
confused with the original query terms. The
other approach is to modify the speech
document representation by expanding
them to include high scoring recognition
alternatives; the goal is to increase the
chance of including the correct hypothesis.
We find that both approaches are able to
help improve retrieval performance.
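The query-expansion idea can be sketched as follows; the confusion table here is invented for illustration, whereas in practice it would be derived from recognizer confusion statistics:

```python
# Hypothetical confusion table: term -> [(confusable term, likelihood)].
CONFUSABLE = {"s_t_ax": [("s_t_en", 0.3)], "b_aa_s": [("p_aa_s", 0.2)]}

def expand_query(query_vec, min_conf=0.1):
    """Add approximate-match terms, down-weighted by how likely the
    recognizer is to confuse them with the original query terms."""
    out = dict(query_vec)
    for term, w in query_vec.items():
        for alt, p in CONFUSABLE.get(term, []):
            if p >= min_conf:
                out[alt] = max(out.get(alt, 0.0), w * p)
    return out

print(expand_query({"s_t_ax": 1.0, "b_aa_s": 0.8}))
```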
Our results indicate that subword-based
approaches to spoken document retrieval
are feasible and merit further research. In
terms of current and future work, we are
expanding the corpus to include more
speech for both recognizer training and the
speech message collection; exploring ways to
improve the performance of the phonetic
recognizer; and investigating more sophisti-
cated robust indexing and retrieval methods
in an effort to improve retrieval perfor-
mance when there are recognition errors.
References
[1] K. Ng and V. Zue, "Subword Unit Representations for Spoken Document Retrieval," Proc. European Conference on Speech Communication and Technology, pp. 1607-1610, Rhodes, Greece, September 1997.
[2] K. Ng and V. Zue, "An Investigation of Subword Unit Representations for Spoken Document Retrieval," Proc. ACM SIGIR Conference, p. 139, Philadelphia, PA, July 1997.
A Semi-Automatic System for the Syllabification and Stress Assignment of Large Lexicons
Aarati Parmar
Sub-word modelling, which includes
morphology, syllabification, stress, and
phonemes, has been shown to improve
performance in certain speech applications
[1]. This observation has motivated us to
attempt to formally define a convention for
a set of syllable-sized units, intended to
capture these sub-word level realizations in
words in the English language, through a
two-tiered approach. The assumption is that
words can be represented as sequences of
units we call “morphs,” which capture
explicitly both the pronunciation and the
orthography. Each morph unit has a
carefully constructed label and a lexical
entry that provides its canonic phonemic
realization. Each word is entered into a
word lexicon decomposed into its appropri-
ate morph sequence. Thus, for example, the
word “contentiously” would be represented
as "con- ten+ -tious =ly", with the markers "-", "+", and "=" coding for morphological
categories such as prefix, stressed root, and
derivational/inflectional suffix. It is our
hope that all words of English can be
represented in terms of a reasonably small,
closed set of these morph units.
This thesis introduces a new semi-
automatic procedure for acquiring a
representation of a large corpus of words in
terms of morphs. Morph transcription, as
we have defined it, is a considerably more
difficult task than phonetic or phonemic
transcriptions, simply because constraints
have to be satisfied on more than one level.
Morphs with similar spellings but different
pronunciations must be distinguished
through selected capital letters, as in the
examples “com+” (/k!/ /aa+/ /m/) in
“combat” and “cOm+” (/k!/ /ah+/ /m/) in
“comfort.” The letters of the morph
spellings for a given word must, if
lowercased and concatenated, realize a
correct spelling of the word. Syllabification
must be correctly marked, and the phone-
mic transcription obtained by replacing the
morph units with their phonemic realiza-
tions must be accurate.
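The spelling constraint, at least, is mechanical to check. A minimal sketch, abbreviated from the marker conventions described above:

```python
def spells_word(morphs, word):
    """Check that the morph letters, with markers stripped and
    capitals lowercased, concatenate to the word's spelling."""
    letters = "".join(m.strip("-+=").lower() for m in morphs)
    return letters == word.lower()

print(spells_word(["con-", "ten+", "-tious", "=ly"], "contentiously"))  # True
print(spells_word(["cOm+", "-fort"], "comfort"))                        # True
```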
We would like to know if our represen-
tation is extensible, and if it is possible to
automatically or semi-automatically extract
these sub-lexical units from large corpora of
words with associated phonetic transcrip-
tions. Thus we have devised a procedure
that can hopefully propose morph decom-
positions accurately and efficiently. We have
evaluated the procedure on two corpora,
and have also assessed how appropriate the
morph concept is as a basic unit for
capturing sub-lexical constraints.
We used the ANGIE formalism to
generate and test our morphs. ANGIE is a
system that can parse either spellings or
phonetics into a probabilistic hierarchical
framework. We decided to develop our
procedure based on a medium-sized corpus
known as TIMIT. We began with a grammar
that had been developed and trained on a
corpus we call “ABH” (a combination
including the ATIS vocabulary, a subset of
the 10,000 most frequent words of the
Brown corpus, and the Harvard List
lexicon). We then applied the knowledge we
had gained from ABH, both with and
without that derived from the TIMIT
experiment, to the much larger COMLEX
lexicon (omitting proper nouns and
abbreviations). In this way we tested how
well a set of morphs derived from a seed
lexicon can be applied to a much larger set
of some 30,000 words. If morphs are a good
representation, then good coverage should
be attainable.
Our procedure was to first parse, in
recognition mode, the letters of all the new
words in a corpus to be absorbed, using a
letter-terminal grammar trained on the seed
ABH corpus. This yielded a set of hypoth-
esized phoneme sequences and/or morph
sequences for each word, which could then
be verified or rejected by parsing in phone
mode, using the phonetic transcription
provided by the corpus as established phone
terminals, along with a phone-to-phoneme
grammar that defines the mappings from
the conventions of the corpus to ANGIE’s
conventions. By enforcing morph con-
straints as well, we obtained further
constraint than if we just used the phone-
mic knowledge.
We have some encouraging signs that
our set of morphs is large enough to
encompass most or all English words,
particularly if we allow novel stressed roots
to be “invented” by decomposing them into
a confirmed onset and rhyme. In our
experiments, even without “invented”
stressed roots, we determined that coverage
of TIMIT was about 89%, and for
COMLEX it was about 94%. The parse
coverage of our procedure is quite good,
considering the large size of the COMLEX
corpus. The accuracy of the morphological
decompositions is reasonable as well.
According to an informal evaluation,
morphological decompositions of words in
TIMIT that pass through both letter and
phone parsing steps have a 78% probability
of matching exactly the expert transcription.
Of course this metric does not take into
account alternate decompositions which
may also be correct, or more consistent with
one another than the human-generated
ones.
We performed an analysis and compari-
son of the experiments performed on
TIMIT and COMLEX. The topics covered
include degree of constraint, hand-written
versus automatic rules, and consistency of
morphological decompositions. Constraint
can be measured by the average number of
alternate morphological decompositions per
word. The average number of morphs
generated from the letter parsing step is
about three, for both TIMIT and
COMLEX. After parsing with phones, the
figure drops to 1.1 for TIMIT and to 1.7 for
COMLEX. Automatically derived rules (for
the mapping from ANGIE’s phoneme
conventions to the phonetic conventions of
the corpus) provide a quick alternative to
hand-written rules, with greater coverage,
but at a price of some performance loss.
Morphological decompositions produced by
our procedure also appear to be self-
consistent.
We have also developed a new analysis
tool to simplify the task of labelling words
for morph transcriptions. This tool aids the
transcriber by providing easy access to many
different sources of knowledge, via a
sophisticated graphical interface. It can be
used to efficiently repair errors obtained in
the automatic parsing procedure.
A significant outcome of this thesis is a
much larger inventory of the possible
morphs of English, and a much larger
lexicon of words decomposed into these
morph units. These resources should serve
us well in future experiments in letter-to-
sound/sound-to-letter generation, for the
automatic acquisition of pronunciations for
new words. They should also be useful for
the automatic acquisition of vocabularies
for speech recognition tasks using ANGIE,
and for other experiments, e.g., in prosodic
analysis, where syllable decomposition may
be important.
Reference
[1] R. Lau and S. Seneff, "Providing Sublexical Constraints for Word Spotting within the ANGIE Framework," Proc. European Conference on Speech Communication and Technology, pp. 263-266, Rhodes, Greece, September 1997.
A Segment-Based Speaker Verification System Using SUMMIT
Sridevi Sarma
This thesis describes the development of a
segment-based speaker verification system
and explores two computationally efficient
techniques. Our investigation is motivated
by past observations that speaker-specific
cues may manifest themselves differently
depending on the manner of articulation of
the phonemes. By treating the speech signal
as a concatenation of phone-sized units, one
may be able to capitalize on measurements
for such units more readily. A potential side
benefit of such an approach is that one may
be able to achieve good performance with
unit (i.e., phonetic inventory) and feature
sizes that are smaller than what would
normally be required for a frame-based
system, thus deriving the benefit of reduced
computation.
To carry out our investigation, we
started with the segment-based speech
recognition system developed in our group
called SUMMIT [1,2], and modified it to suit
our needs. The speech signal was first
transformed into a hierarchical segment
network using frame-based measurements.
Next, acoustic models for each speaker were
developed for a small set of six phoneme
broad classes. The models represented
feature statistics with diagonal Gaussians,
which characterized the principal compo-
nents of the feature set. The feature vector
included averages of MFCCs 1-14, plus
three prosodic measurements: energy,
fundamental frequency (F0), and duration.
To facilitate a comparison with previ-
ously reported work [3,4,5], our speaker
verification experiments were carried out
using 2 sets of 100 speakers from the TIMIT
corpus. Each speaker-specific model was
developed from the eight SI and SX
sentences. Verification was performed using
the two SA sentences common to all
speakers. To classify a speaker, a Viterbi
forced alignment was determined for each
test utterance, and the forced alignment
score of the purported speaker was com-
pared with those obtained with the models
of the speaker’s competitors. These scores
were then rank ordered and the user was
accepted if his/her model’s score was within
the top N of 100 scores, where N is a
parameter we varied in our experiments. To
test for false acceptance, we used every other
speaker in the system as impostors.
Ideally, the purported speaker’s score
should be compared to scores of every other
system user. However, computation
becomes expensive as more users are added
to the system. To reduce the computation,
we adopted a procedure in which the score
for the purported speaker is compared only
to scores of a cohort set consisting of a small
set of acoustically similar speakers. These
scores were then rank ordered as before, and the user was accepted if his/her model's score was within the top N. To
test for false acceptance, we used only the
members of a speaker’s cohort set as
impostors.
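The acceptance rule itself is simple; a minimal sketch, assuming higher alignment scores are better:

```python
def accept(claimed_score, cohort_scores, top_n):
    """Rank the purported speaker's score against the cohort scores;
    accept if it falls within the top N (higher score assumed better)."""
    ranked = sorted([claimed_score] + cohort_scores, reverse=True)
    return ranked.index(claimed_score) < top_n

print(accept(-41.2, [-40.0, -42.5, -43.1], top_n=2))  # True: ranks second
```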
In addition to using cohort normaliza-
tion to reduce computation, we determined
the size and content of the feature
vector through a greedy algorithm opti-
mized on overall speaker verification
performance. Fewer features allow for fewer
parameters to be estimated during training,
and fewer scores to be computed during
testing.
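The greedy search over features can be sketched as follows, with `evaluate` standing in for a full train-and-verify cycle that returns, e.g., the equal error rate on a development set:

```python
def greedy_select(candidates, evaluate, max_features):
    """Grow the feature set greedily: at each step add the feature
    whose inclusion gives the lowest error; stop when nothing helps."""
    selected, remaining, best_err = [], list(candidates), float("inf")
    while remaining and len(selected) < max_features:
        errs = {f: evaluate(selected + [f]) for f in remaining}
        f_best = min(errs, key=errs.get)
        if errs[f_best] >= best_err:    # no candidate improves performance
            break
        selected.append(f_best)
        remaining.remove(f_best)
        best_err = errs[f_best]
    return selected, best_err
```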
We were able to achieve a performance
of 0% equal error rate (EER) on clean data
and 8.36% EER on noisy telephone data,
with a simple system design. Thus we show
that a segment-based approach to speaker
verification is viable, competitive and
efficient. Cohort normalization and
conducting a feature search to reduce
dimensions minimally affect performance
and are useful when computation is
prohibitive.
References
[1] V. Zue, J. Glass, M. Phillips, and S. Seneff, "Acoustic Segmentation and Phonetic Classification in the SUMMIT Speech Recognition System," Proc. International Conference on Acoustics, Speech, and Signal Processing, pp. 389-392, Glasgow, Scotland, May 1989.
[2] V. Zue, J. Glass, M. Phillips, and S. Seneff, "The SUMMIT Speech Recognition System: Phonological Modeling and Lexical Access," Proc. International Conference on Acoustics, Speech, and Signal Processing, pp. 49-52, Albuquerque, NM, April 1990.
[3] L. Lamel and J.L. Gauvain, "A Phone-based Approach to Non-linguistic Speech Feature Identification," Computer Speech and Language, pp. 87-103, 1995.
[4] Y. Bennani, "Speaker Identification Through Modular Connectionist Architecture: Evaluation on the TIMIT Database," Proc. International Conference on Spoken Language Processing, pp. 607-610, Banff, Alberta, 1992.
[5] D. Reynolds, "Speaker Identification and Verification Using Gaussian Mixture Speaker Models," Speech Communication, Vol. 17, No. 1, pp. 91-108, August 1995.
Context-Dependent Modelling in a Segment-Based Speech Recognition System
Benjamin Serridge
Modern speech recognition systems typically
classify speech into sub-word units that
loosely correspond to phonemes. These
phonetic units are, at least in theory,
independent of task and vocabulary, and
because they constitute a small set, each one
can be well-trained with a reasonable
amount of data. In practice, however, the
acoustic realization of a phoneme varies
greatly depending on its context, and speech
recognition systems can benefit by choosing
units that more explicitly model such
contextual effects.
The goal of this research was to explore
various strategies for incorporating contex-
tual information into a segment-based
speech recognition system, while maintain-
ing computational costs at a level acceptable
for implementation in a real-time system.
The latter was achieved by using context-
independent models in the search, while
context-dependent models are reserved for
re-scoring the hypotheses proposed by the
context-independent system.
Within this framework, several types of
context-dependent sub-word units were
evaluated, including word-dependent,
biphone, and triphone phonetic units. In
each case, deleted interpolation was used to
compensate for the lack of training data for
the models. Other types of context-depen-
dent modeling, such as context-dependent
boundary modelling and “offset” modelling,
were also used successfully in the re-scoring
pass.
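The interpolation step can be sketched as follows; the count-based weighting heuristic shown is a common choice, not necessarily the scheme used in this work:

```python
def interpolate(p_cd, p_ci, lam):
    """Deleted interpolation: blend a sparse context-dependent
    probability with its robust context-independent backoff."""
    return lam * p_cd + (1.0 - lam) * p_ci

def lam_from_count(n, k=50.0):
    """Trust the context-dependent model more as its training count
    grows (a common heuristic; lambda is properly estimated on
    held-out, i.e. 'deleted', data)."""
    return n / (n + k)

p = interpolate(p_cd=0.012, p_ci=0.004, lam=lam_from_count(200))
print(round(p, 4))  # 0.0104
```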
The evaluation of the system was
performed using the Resource Management
task. Context-dependent segment models
were able to reduce the error rate of the
context-independent system by more than
twenty percent, and context-dependent
boundary models were able to reduce the
word error rate by more than a third. A
straightforward combination of context-
dependent segment models and boundary
models leads to further reductions in error
rate.
So that it can be incorporated easily
into existing and future systems, the code
for re-sorting N-best lists has been imple-
mented as an object in SAPPHIRE [2], a
framework for specifying the configuration
of a speech recognition system using a
scripting language. It is currently being
tested on JUPITER [3], a real-time telephone-based weather information system under
development at SLS.
References
[1] B. Serridge, Context-dependent Modeling in a Segment-based Speech Recognition System, M.Eng. thesis, MIT Department of Electrical Engineering and Computer Science, Cambridge, MA, August 1997.
[2] L. Hetherington and M. McCandless, "SAPPHIRE: An Extensible Speech Analysis and Recognition Tool based on Tcl/Tk," Proc. International Conference on Spoken Language Processing, pp. 1942-1945, Philadelphia, PA, October 1996.
[3] V. Zue, et al., "From Interface to Content: Translingual Access and Delivery of On-line Information," Proc. European Conference on Speech Communication and Technology, pp. 2227-2230, Rhodes, Greece, September 1997.
Toward the Automatic Transcription of General Audio Data
Michelle S. Spina
Recently, ASR research has broadened its
scope to include the transcription of general
audio data (GAD), from sources such as
radio or television broadcasts. This shift in
research focus is largely brought on by the
growing need to shift content-based
information retrieval from text to speech.
However, GAD pose new challenges to
present-day ASR technology because they
often contain extemporaneously-generated,
and therefore disfluent speech, with words
drawn from a very large vocabulary, and
they are usually recorded from varying
acoustic environments. Also, the voices of
multiple speakers often interleave and
overlap with one another or with music and
other sounds. Since the performance of
ASR systems can vary a great deal depend-
ing on speaker, microphone, recording
conditions and transmission channel, we
have argued that the transcription of GAD
would benefit from a preprocessing step
that first segmented the signal into acousti-
cally homogeneous chunks [3]. Such
preprocessing would enable the transcrip-
tion system to utilize the appropriate
acoustic models during recognition. The
goal of the research presented here was to
investigate some of the strategies for training
a phonetic recognition system for GAD.
We have chosen to focus on the
Morning Edition (ME) news program
broadcast by National Public Radio (NPR).
NPR-ME consists of news reports from
national and local studio anchors as well as
reporters from the field, special interest
editorials and musical segments. The
analysis presented here is based on a
collection of six hours of recording from
November 1996 to January 1997. The six
one-hour shows were automatically split into
manageable sized waveform files at silence
breaks. In addition, if any of the resulting
waveform files contained multiple sound
environments (e.g., a segment of music
followed by a segment of speech) they were
further split at these boundaries. Therefore,
each file was homogeneous with respect to
sound environment. Orthographies and
phonetic alignments were generated for
each of the files using orthographic
transcriptions of the data and a forced
Viterbi search.
Seven categories were used to character-
ize the files. These categories were described
in our previous work [3], and are briefly
reviewed here: 1) clean speech: wideband
(8kHz) speech from anchors and reporters,
recorded in the studio, 2) music speech:
speech with music in the background, 3)
noisy speech: speech with background
noise, 4) field speech: telephone bandwidth
(4kHz) speech from field reporters, 5)
music, 6) silence, and 7) garbage, which
accounted for anything that did not fall into
one of the other six categories. In [3], we
described some preliminary analyses and
experiments that we had conducted
concerning the transcription of this data.
For the NPR-ME corpus, we were able to
achieve better than 80% classification
accuracy for these seven sound classes on
unseen data, using relatively straightforward
acoustic measurements and pattern classifi-
cation techniques. A speech/non-speech
classifier achieved an accuracy of nearly
94%. The level of performance of such a
classifier is clearly related to the ways in
which it will serve as an intelligent front-end
to a speech recognition system. The
experiments done for this work attempt to
determine if such a preprocessor is neces-
sary, and if so, what level of performance is
required for the sound segmentation.
68 SUMMARY OF RESEARCH
For the development of the phonetic
recognition system, 4.25 hours of the NPR-
ME data were used for system training, and
the remaining hour was used for system test.
Acoustic models were built using the TIMIT
61 label set. Results, expressed as phonetic
recognition error rates, are collapsed down
to the 39 labels typically used by others to
report recognition results. The SUMMIT
segment-based speech recognizer developed
by our group was used for these experi-
ments. The feature vector for each segment
consisted of MFCC and energy averages
over segment thirds as well as two deriva-
tives computed at segment boundaries.
Segment duration was also included.
Mixtures of up to 50 diagonal Gaussians
were used to model the phone distributions
on the training data. For simplicity, only
context-independent models were used. The
language model used in all experiments was
a phone bigram based on over four hours of
training data. This particular configuration
of SUMMIT achieved an error rate of 37.1%
when trained and tested on TIMIT.
We conducted experiments to deter-
mine the trade-offs between using a large
amount of data recorded under a variety of
speaking environments (a multi-style
training approach) and a smaller amount of
high quality data if a single recognizer
system was to be used to recognize all four
different types of speech material present in
NPR-ME. We found that a multi-style
approach yielded an overall error rate of
39.2%, with the lowest error rates arising
from clean speech (33.2%) and the highest
error rates arising from field speech
(50.4%). Training the system with only the
clean, wideband speech material found in
the training set yielded comparable results,
with an overall error rate of 38.8%.
However, the multi-style approach utilized
nearly 1.7 times the amount of data for
training the acoustic models. To perform a
fair comparison between these two ap-
proaches, we trained a multi-style system
with an amount of training data equivalent
to that of the clean speech system. We
found this training approach degraded our
results to an overall error rate of 41.1%, an
increase of nearly 3%. This result indicates
that it is advantageous to use only clean,
wideband speech material for acoustic
model training when data and computation
availability becomes an issue.
In addition to the single recognizer
system explained above, we also explored
the use of a multiple recognizer system for
the phonetic recognition of NPR-ME, one
for each type of speech material. The
environment-specific approach involves
training a separate set of models for each
speaking environment, and using the
appropriate models for testing. We used the
sound classification system described in [3]
as the preprocessor to classify each test
utterance as one of the four speech environ-
ments. The environment-specific model
chosen by the automatic classifier for each
utterance was then used to perform the
phonetic recognition. This resulted in an
overall error rate of 38.3%, which is slightly
better than the best single recognizer result.
In all of the experiments conducted, we
found that the field speech environment has
consistently shown the highest phonetic
recognition error rates. In an attempt to
improve the recognition performance of the
field speech, we bandlimited the training
data by restricting our analysis to the
frequency range of 133Hz to 4kHz. Using
this approach, we were able to lower the recognition error rate on the field speech data to 46.9% by bandlimiting the
clean speech training data. Using the
bandlimited clean speech models in
the multiple recognizer system for utterances
classified as field speech, the overall error
rate becomes 37.9%, which is 2.3% better
than the best single recognizer result.
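A sketch of the bandlimiting step follows. In this work the analysis range was restricted during feature extraction; band-pass filtering the waveform, as below, is one way to approximate that:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def bandlimit(x, fs, lo=133.0, hi=4000.0, order=6):
    """Band-pass wideband audio to the telephone band so that
    clean-speech models better match field-speech conditions."""
    b, a = butter(order, [lo, hi], btype="bandpass", fs=fs)
    return filtfilt(b, a, x)

fs = 16000
x = np.random.randn(fs)   # one second of stand-in wideband audio
y = bandlimit(x, fs)      # 133 Hz - 4 kHz, mimicking field speech
```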
In future work in this area, we intend to
concentrate on improving the phonetic
recognition results from the clean speech
environment, and to investigate how the
recognition of GAD compares to other
automatic speech recognition tasks.
References
[1] J.L. Gauvain, L. Lamel, and M. Adda-Decker, "Acoustic Modelling in the LIMSI Nov96 Hub4 System," Proc. DARPA Speech Recognition Workshop, February 1997.
[2] R. Schwartz, H. Jin, F. Kubala, and S. Matsoukas, "Modeling those F-conditions - Or not," Proc. DARPA Speech Recognition Workshop, February 1997.
[3] M.S. Spina and V.W. Zue, "Automatic Transcription of General Audio Data: Preliminary Analysis," Proc. International Conference on Spoken Language Processing, pp. 594-597, Philadelphia, PA, October 1996.
[4] M.S. Spina and V.W. Zue, "Automatic Transcription of General Audio Data: Effect of Environment Segmentation on Phonetic Recognition," Proc. European Conference on Speech Communication and Technology, pp. 1547-1550, Rhodes, Greece, September 1997.
Porting the GALAXY System to Mandarin Chinese
Chao Wang
The GALAXY system is a human-computer
conversational system providing a spoken
language interface for accessing on-line
information. It was initially implemented
for English in travel-related domains,
including air travel, local city navigation,
and weather. One of the design goals of the
GALAXY architecture was to accommodate
multiple languages in a common frame-
work. This thesis concerns the development
of YINHE, a Mandarin Chinese version of the
GALAXY system [1,2]. Acoustic models,
language models, vocabularies, and linguis-
tic rules for Mandarin speech recognition,
language understanding, and language
generation have been developed; large
amounts of domain specific Mandarin
speech data have been collected from native
speakers for system training; and issues that
are specific for Chinese have been addressed
to make the system core more language
independent. Figure 15 shows the system
operating in Chinese. The user communi-
cates with the system in spoken Mandarin,
and the system displays responses in
Chinese ideographs, along with maps, etc.
In the following, data collection, develop-
ment of speech recognition, understanding
and generation components, and system
evaluation will be described in more detail.
Both read and spontaneous speech have
been collected from native speakers of
Mandarin Chinese. Spontaneous speech
data were collected using a simulated
environment based on the existing English
GALAXY system. The data were used for
training both acoustic and language models
for recognition, and deriving and training a
grammar for language understanding. In
addition, a significant amount of read
speech data was collected through our Web
data collection facility. It is easier to collect
read data in large amounts, and they are
very valuable for acoustic training due to
the phone-line diversity of randomly
distributed callers. We use pinyin, enhanced
with tones, for Chinese representation in
our transcription to simplify the input task.
Figure 15. An example of a dialogue exchange between YINHE and a user.
Homophones that are the same in both
base-syllables and tones are indistinguish-
able in the pinyin representation. We
determined however that this ambiguity
could be resolved by the language under-
standing component. We also decided not
to tokenize the utterances into word
sequences in the transcription, because it is
not always obvious even for native speakers
what constitutes a word, and the selection
of words would likely change during the
development process. The sentences were
later segmented into word sequences using a
semi-automatic tokenization procedure
based on a predefined vocabulary, for
training the language models. Time-aligned
phonetic transcriptions were derived using a
forced alignment procedure during the
iterative training process. A summary of the
corpus is shown in Table 14.
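The actual tokenization procedure was semi-automatic; its automatic core can be approximated by a greedy longest-match over the predefined vocabulary, as in the minimal sketch below (the vocabulary entries are toy examples):

```python
def tokenize(syllables, vocab, max_len=4):
    """Greedy longest-match segmentation of a pinyin syllable sequence
    into words from a predefined vocabulary; unmatched syllables fall
    through as single-syllable words."""
    words, i = [], 0
    while i < len(syllables):
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            cand = " ".join(syllables[i:i + n])
            if n == 1 or cand in vocab:
                words.append(cand)
                i += n
                break
    return words

vocab = {"bo1 shi4 dun4", "bo2 wu4 guan3", "duo1 shao3"}  # toy vocabulary
print(tokenize("bo1 shi4 dun4 you3 duo1 shao3 bo2 wu4 guan3".split(), vocab))
# ['bo1 shi4 dun4', 'you3', 'duo1 shao3', 'bo2 wu4 guan3']
```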
The speech recognition is performed by
the SUMMIT segment-based speech recogni-
tion system. It would be advantageous to
incorporate some kind of tone recognition
into the framework. However, SUMMIT, as
currently configured, does not have any
capability for explicitly dealing with
fundamental frequency, and it would also
be difficult to incorporate scores provided at
the syllable level. Thus, we have omitted
tone recognition at this stage. We realize
that this leads to a greater number of
potential homophones, but most of these
can be disambiguated at the parsing stage.
We went through several iterations to
decide the actual vocabulary, mainly in
making decisions about where to insert
word boundaries in the syllable string. City
names are prominent in GALAXY. Since they
are world-wide, users are uncertain as to
whether to refer to them in English or
Chinese. Hence we had to allow multiple
entries for many of them, essentially an
English and a Chinese equivalent. A similar
problem exists for the place names in the
City Guide domain. We felt it would be
difficult to cover all the odd pronunciations
of restaurant names, etc. Therefore, we
eliminated most of them from the vocabu-
lary, thus encouraging the user to refer to
them by index or clicking. The current
vocabulary has about 1000 words, which is
much smaller than that of the English-based
system. About one quarter of the vocabulary
items are English; each Chinese word has
on average 2.1 characters.
After experimenting with various sets of
phonetic units, we finally settled on the
simple choice of representing each syllable
initial and each final as an individual
phonetic unit. We feel that our segment-
based framework is particularly effective at
capturing the dynamic nature of these
multi-phoneme units, and the only problem
was that we did not have very obvious
English analogs for some of these units
(such as “UAN” and “IANG”) on which to
seed. We were able to solve this problem by
seeding any unusual finals on schwa,
because of its inherent variability, along
with an artificial reward during early
iterations. This improves their chance of
consuming the entire span of the syllable
final during forced alignment, rather than
giving part of it up to an undesirable
insertion model or a neighboring syllable
initial.
We also had some difficulty with the
rich set of strong fricatives and affricates in
Mandarin. Mandarin makes a distinction
between /s/, /sh/, and a retroflexed /shr/.
Table 14. Summary of the corpus.

Set            Train             Dev          Test
No. of utts.   6457              500          274
Type of utt.   spont. and read   spontaneous
No. of spks.   93                6
Wds per utt.   8.3               8.5          8.0
Similar distinctions are possible for the
voiced and affricate counterparts. These
phonemes are further complicated by the
widespread regional differences among
speakers. For example, in southern dialects,
there is a tendency to lose the distinction of
/s/ and /shr/. We initially tried handling
these dialectal variations by phonological
rules, but in the absence of hand-labeled
data it became difficult to guarantee a
correct realization in our training utter-
ances. In the end we decided to let the
models handle the variability through the
Gaussian mixtures. The English proper
nouns are usually outside of the phonologi-
cal and phonotactic structure of Mandarin.
As a consequence, users often speak these
names with a heavy accent, and it becomes
problematic whether to build separate
English phonetic models or to force these
outliers into the nearest-neighbor Mandarin
equivalent. For the most part we were able
to share models, with the system being
augmented with only a few phonemes
particular to English, such as /v/ and /eh/.
Thus, in some sense, we lexicalized the
foreign accent for English, entering “New
York” in the lexicon pronounced as “Niu
Yok” and “South Boston” as “Saus
Basteng.”
For natural language understanding we
used the TINA system. Our approach to rule
development was to determine the appropri-
ate rules for each new Chinese sentence by
first parsing an English equivalent, and
choosing, as much as possible, category
names that paralleled the English equiva-
lent. This minimized the effort involved in
mapping the resulting parse tree to a
semantic frame. While the temporal
ordering of constituents is quite different
for Chinese than for English, the basic
hierarchy of the phrase structure is usually
very similar to that of English.
We were a little uncertain about what to
do with the tokenization problem —
whether to include the partial tokenization
that takes place at the time of recognition,
or to disregard it and reparse the syllable
sequence. We finally decided to discard the
recognizer tokenization, and rely instead on
the grammar rules of TINA to retokenize,
with the belief that the final result would be
more reliable. Since the grammar is heavily
constrained by semantic categories through-
out the parse tree, it is usually able to
reconstruct the correct tokenization of the
sentence. We made a few exceptions to this
rule, in cases where confusions with a
common word start syllable could cause
significant ambiguity. For instance, the first
syllable of the word “jiu3 dian4” (“wine
store”) is a homophone of the word “jiu3”
meaning “nine”. Since numbers are
prevalent in the grammar, we decided it was
safer to commit the whole word “wine
store” up front, to expedite the parsing
process. This effectively provides a one-
syllable look-ahead to the parser.
TINA has a trace mechanism to handle
gaps that are prevalent for wh-questions in
English (e.g., “[What street] is MIT on
[trace]?”). In Chinese, wh- words are not
moved to the front of the sentence, and
therefore these sentences are easier to
accommodate than their English equiva-
lents. Chinese does however frequently
utilize an analogous forward-movement
strategy to topicalize certain constituents in
a sentence; an example is given in Figure 16.

Figure 16. An example of long-distance movement in Chinese for topicalization, for the sentence, "Boston has how many museums?"

Such sentences were well-matched to
TINA’s trace mechanism, which produces a
desirable frame containing “in Boston” as a
predicate modifying “museums”, but
paraphrasing properly, with "Boston" in the
topicalized initial position, due to the trace
marker.
Language generation for Chinese was
performed by the GENESIS system. We found
that the process of generating correct
paraphrases and responses in Chinese was
quite straightforward, and, for the most
part, we were able to utilize our GENESIS
framework without any changes. One aspect
of Chinese that is quite different from
English is the use of particles to accompany
quantified nouns. These particles are
analogous to “a flock of sheep” in English,
except that they are far more pervasive in
the language. Thus “four banks” becomes
“four <particle> banks.” Furthermore, the
exact realization of the particle depends on
the class of the noun, and there is a fairly
large number of possibilities. For any
language-internal paraphrases (Chinese → semantic frame → Chinese), the particle can
be parsed into the frame and reparaphrased
intact. However, for actual translation, the
situation is problematic because complex
context effects determine which particle to
use under what circumstances. Similarly,
Chinese does not make obvious distinctions
between singular and plural, which can be
problematic when translating to English.
Since YINHE is self-consistent with respect to
language, these issues have been avoided,
but we would like to be able to produce
trans-lingual paraphrases that are also well-
formed.
Table 15 shows the speech recognition
performance in terms of word error rate
and sentence error rate on the develop-
ment and test data. Table 16 shows the
speech understanding performance on the
test data.

Table 15. Summary of the recognition performance.

Set    No. of Utts.   WER     SER
Dev    500            9.1%    37.4%
Test   274            10.8%   39.1%

Table 16. Speech understanding performance in percentages on the 274 spontaneous utterances of the test set.

           -------- Parsed --------
           Perfect  Acceptable  Wrong   Failed
1-best     62.4     6.9         2.6     28.1
10-best    70.0     8.4         5.5     16.4
ortho.     80.3     2.9         0.7     16.0

The 10-best entry gives the results
obtained based on the parse selected
automatically from a 10-best list. About 20% of the queries, even if recognized perfectly, would still not be understood correctly. Most of the sentences that fail to
parse are outside of the domain of GALAXY
or suffer from disfluencies which are
beyond the limited robust parsing capabili-
ties of YINHE.
Overall, we consider the exercise of
porting GALAXY to Mandarin to be a success.
The end-to-end Mandarin system appears to
be comparable in performance to its English
counterpart. We feel that the success of this
effort demonstrates the feasibility of our
design aimed at accommodating multiple
languages in a common framework.
Several aspects of the system still remain
to be improved. While the system converses
with the user, it often displays information
it has obtained, for example from the Web,
in English. It would be more natural if the
information itself, and not just the remark
about the information, could be provided to
the user in their preferred language. We
have not obtained an adequate Mandarin
Chinese speech synthesizer yet, so the
system only displays the verbal answer to the
user in text form. We plan to add a tone-
recognition capability to our recognizer.
This may require us to restructure the
framework to accommodate explicit
knowledge of syllable boundaries.
References
[1] C. Wang, Porting the GALAXY System to Mandarin Chinese, M.S. thesis, MIT Department of Electrical Engineering and Computer Science, Cambridge, MA, May 1997.
[2] C. Wang, J. Glass, H. Meng, J. Polifroni, S. Seneff, and V. Zue, "YINHE: A Mandarin Chinese Version of the GALAXY System," Proc. European Conference on Speech Communication and Technology, pp. 351-354, Rhodes, Greece, September 1997.
Natural-Sounding Speech Synthesis Using Variable-Length Units
Jon Yi
Our work in the previous year showed that
by careful design of system responses to
ensure consistent intonation contours,
natural-sounding speech synthesis can be
achievable with word- and phrase-level
concatenation. In order to extend the
flexibility of this framework, we focused on
generating novel words from a corpus of
sub-word units. The design of the corpus
was motivated by perceptual experiments
that investigated where speech could be
spliced with minimal audible distortion and
what contextual constraints were necessary
to maintain in order to produce natural-
sounding speech. From this sub-word
corpus, a Viterbi search selects a sequence of
units based on how well they match the
input specification and concatenation
constraints. This concatenative speech
synthesis system, ENVOICE, has been used
in WHEELS and PEGASUS to convert meaning
representations into speech waveforms.
The synthesis process used in the
WHEELS system involved the concatenation
of word- and phrase-level units with no
signal processing. These units were carefully
prepared by recording them in the precise
prosodic environment in which they would
be used. However, recording every word in
every realizable prosodic environment represents a trade-off between large-scale recording and high naturalness. Essentially,
this type of generation approach has two
shortcomings. First, while the carrier
phrases attempt to capture prosodic
constraints, they do not explicitly capture
co-articulatory constraints, which may be
more important at sub-word levels. Second,
some application domain vocabularies are
continuously expanding (e.g., new car
models may be introduced each year), or
have a large number of words (e.g., the
23,000 United States city names in a Yellow
Pages domain). In order to discover
strategies to combat these two factors, we
decided to investigate the synthesis of
arbitrary proper names using sub-word units
from a designed sub-word corpus.
We performed perceptual experiments
to learn what units are appropriate for
concatenative synthesis and how well these
units sound as an ensemble when concat-
enated together to form new words. These
two constraints are the unit and transition
criteria. Because source changes (e.g.,
voiced-unvoiced transitions) typically result
in significant spectral changes, we hypoth-
esized that a splice might not be as percep-
tible at this point, in comparison to other
places. Should the speech signal be broken
between two voiced regions, it would be
important to ensure formant continuity at
the splice boundary. This hypothesis
motivated a series of consonant-vowel-
consonant (CVC) studies that dealt with the
substitution of vowels at boundaries of
source change.
One study tested potential transition
points by fixing the place of articulation of
the surrounding consonants. For example,
the /AO/ from the city name, “Boston”
(/B AO S T IH N/), was replaced by the /AO/ from "bossed" (/B AO S T/).
Perceptually, the splicing was not noticeable.
We found this effect to hold when the
consonants are stops, fricatives, or affricates. A variation on the above study
showed that the voicing dimension of the
surrounding consonants can be ignored
while still producing a natural-sounding
splice. This knowledge contributed towards
the formation of the unit criterion studies, which showed the place of articulation and nasal consonants to be the main
contextual constraints for vowels. While it
was possible to perform natural-sounding
splicing at boundaries between vowels and
consonants, we found it preferable to keep
vowel and semivowel sequences together as a
unit.
The various principles learned from the
perceptual studies were used to enumerate a
set of synthesis units for concatenative
synthesis of non-foreign English words. We
made use of a 90,000-word lexicon from the
Linguistic Data Consortium called the
COMLEX English Pronunciation Dictionary,
commonly referred to as PRONLEX. We
limited our analysis of contiguous multi-
phoneme sequences of vowels and
semivowels to the non-foreign subset of
PRONLEX containing approximately 68,000
words. We identified 2,358 unique vowel
and semivowel sequences; consonants were
assumed to be adequately covered. These
sequences were covered by an automatic algorithm that selects a compact set of prompts to record, given a set of units to cover and a set of words to choose from.
When this prompt selection algorithm was
applied, a total of 1,604 words was selected.
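The prompt selection problem is a set cover, for which a greedy strategy is the natural sketch (the word/unit data below are toy examples, not drawn from PRONLEX):

```python
def select_prompts(words, units_of, targets):
    """Greedy set cover: repeatedly pick the word covering the most
    not-yet-covered units until all target units are covered (or no
    word adds coverage)."""
    uncovered, prompts = set(targets), []
    while uncovered:
        best = max(words, key=lambda w: len(units_of[w] & uncovered))
        gain = units_of[best] & uncovered
        if not gain:
            break
        prompts.append(best)
        uncovered -= gain
    return prompts

units_of = {"boston": {"aa_s", "ih_n"}, "bossed": {"aa_s"},
            "lesson": {"eh_s", "ih_n"}}
print(select_prompts(list(units_of), units_of,
                     targets={"aa_s", "ih_n", "eh_s"}))
# ['boston', 'lesson'] -- all three target units with two prompts
```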
The unit selection algorithm is a Viterbi
search that provides an automatic means to
select an optimal sequence of sub-word
units from a speech database given an input
pronunciation. Because the use of longer-
length units tends to improve synthesis
quality, it is important to maximize the size
and the contiguity of speech segments to
encourage the selection of multi-phone
sequences. The search metric is composed
of a unit cost function and a transition cost
function. The unit cost function measures
co-articulatory distance by considering
triphone classes which have consistent
manner of production. The transition cost
function measures co-articulatory continuity
between two phones proposed for concat-
enation. A transition cost is incurred if the two phones were not spoken in succession, discouraging concatenations at places exhibiting a
significant amount of co-articulation, or
formant motion. Also, we decouple
transitions occurring within or across
syllables into intra-syllable and inter-syllable
transitions, respectively.
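A skeletal version of such a search follows, with toy cost functions; the actual cost definitions involve triphone classes and syllable structure, as described above:

```python
def select_units(targets, candidates, unit_cost, trans_cost):
    """Viterbi search: candidates[i] lists database units for target
    phone i; minimize summed unit costs plus transition costs."""
    best = {u: (unit_cost(targets[0], u), [u]) for u in candidates[0]}
    for i in range(1, len(targets)):
        nxt = {}
        for u in candidates[i]:
            # best predecessor for candidate u
            p = min(best, key=lambda q: best[q][0] + trans_cost(q, u))
            c = best[p][0] + trans_cost(p, u) + unit_cost(targets[i], u)
            nxt[u] = (c, best[p][1] + [u])
        best = nxt
    return min(best.values(), key=lambda v: v[0])

# Toy setup: a unit is (phone, position in database); contiguity is
# free, encouraging the selection of long multi-phone sequences.
def unit_cost(t, u):  return 0.0 if u[0] == t else 1.0
def trans_cost(p, u): return 0.0 if u[1] == p[1] + 1 else 0.5

cost, path = select_units(
    ["b", "aa", "s"],
    [[("b", 10)], [("aa", 11), ("aa", 40)], [("s", 12), ("s", 41)]],
    unit_cost, trans_cost)
print(cost, path)  # 0.0 [('b', 10), ('aa', 11), ('s', 12)] -- contiguous
```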
We have deployed the total variable-
length concatenative synthesis framework in
GALAXY, where ENVOICE servers return
speech waveforms to clients presenting
meaning representations as input. In
PEGASUS, both phrase-level and sub-word
unit concatenation are utilized, where
system responses are generated by the former and city names by the latter. Overall,
users thought the system sounded natural
and found sentences to be much preferable
over those generated by DECTalk.
This research work has three types of
contributions: a framework for Meaning-to-
Speech (MTS) concatenative synthesis,
principles about sub-word unit design for
concatenative synthesis, and sub-word
corpus design. This MTS framework is
suitable for use in a conversational system
because it was designed from the ground up
for understanding domains as opposed to
general-purpose Text-to-Speech synthesizers.
There remains much future work in many
areas including unit design, prosody,
evaluation methods, and development
strategies.
Reference
[1] J. Yi, Natural-Sounding Speech Synthesis Using Variable-Length Units, M.Eng. thesis, MIT Department of Electrical Engineering and Computer Science, May 1998.