
Thesis Proposal: Scalable, Anytime Speech Recognition for Mobile and Multicore Applications

David Huggins-Daines

November 17, 2008


Abstract

Large-vocabulary continuous speech recognition involves a heuristic search within an intractably large space of possible sentences, guided by imperfect models of speech and language. It is subject to both modeling errors, where the model attributes a greater likelihood to an incorrect hypothesis, and search errors, where the heuristic used to guide the search has failed in some way, resulting in the correct hypothesis being passed over. Multiple-pass search strategies alleviate these problems by progressively reducing the search space, while deferring decisions about the single best hypothesis until more complex models can be applied.

This strategy of progressive refinement lends itself to two real-world use cases discussed in this proposal. The first is mobile speech understanding systems which require a quick initial response, followed by continuous learning and refinement. The second is parallel speech recognition systems based on a pipeline of partial recognizers distributed in a producer-consumer relationship. What unites these seemingly unrelated problems is that both of them are bandwidth-constrained, and both require the ability to scale down the complexity of individual recognizers while preserving the accuracy of the system as a whole.

Since this scaling necessarily involves constraining the set of possible hypotheses, it inevitably leads to reduced accuracy due to search errors and imperfect models. If this reduced set of hypotheses is used to constrain future passes of recognition, as is typical in multi-pass systems, search errors are propagated through the system. We show that the underlying information in the speech signal is preserved in the face of error, and that it is possible to avoid propagating errors by modeling the sources of error and reconstructing the missing parts of the search space. Finally, we propose to explicitly model the uncertainty which results from scaling down the recognizer so as to make this reconstruction more efficient and effective.


Contents

1 Introduction
  1.1 Thesis Statement
  1.2 Proposal Outline

2 Automatic Speech Recognition
  2.1 Isolated and Continuous Speech Recognition
  2.2 Noisy Channel Paradigm
  2.3 Speech Recognition as Data Compression
  2.4 Statistical Modeling for ASR
    2.4.1 Acoustic Modeling
    2.4.2 Language Modeling
  2.5 Search Algorithms for LVCSR
    2.5.1 Viterbi Beam Search
    2.5.2 Two-Level Viterbi Search
    2.5.3 Lexicon Tree Search
  2.6 Multiple Pass Search

3 Anytime and Parallel Recognition
  3.1 Anytime Recognition
  3.2 Load Balancing in Parallel Recognition

4 Preliminary Experiments
  4.1 Acoustic Reconstruction
  4.2 Hypothesis Expansion

5 Proposed Work
  5.1 Multi-Resolution Decoding
  5.2 Vocabulary Optimization

6 Contributions and Timeline


Chapter 1

Introduction

As Moore's Law continues to give us more and smaller transistors, computers as we know them are undergoing a radical transformation. We are currently witnessing the nascent ubiquity of intelligent mobile devices; for instance, the cellular phones of today are more capable in terms of raw processor speed than the personal computers of a mere 10 years ago. Many people are now rarely outside an arm's reach of two or more fully programmable, multimedia-capable and Internet-connected devices, be they smartphones, laptops, or conventional PCs and workstations. Wireless networks, too, have become ubiquitous, as they provide a convenient way both to access Internet content and to move data between mobile devices.

In some respects, though, the growth curve of computational capacity has begun to flatten. The single-threaded performance of modern CPUs has mostly stopped increasing, as clock speeds have levelled off and gains in instruction-level parallelism are becoming increasingly hard to come by. In addition, as more and more interesting applications rely on Internet access, network latency rather than processor speed has become a deciding factor in application performance. Moreover, both processor clock speed and network latency are strongly constrained by the physics of semiconductors and fiber-optic cables, respectively. This means that on the local node, performance gains will primarily come from increased thread-level parallelism, while over the network, performance gains will be achieved by distributing the processing and storage of data away from centralized servers and closer to the user.

In addition, the consequences of increasing transistor density are often quite different in mobile devices. Whereas desktop computers may continue to acquire more memory and faster, more highly parallel CPUs and GPUs, mobile devices are more tightly constrained by considerations of power consumption, size, and price. Therefore, improved technology in the mobile space tends to focus primarily on longer battery life, smaller and lighter devices, and specific applications such as music and video playback, and only secondarily on general-purpose performance.

We view speech as a natural and effective way to interact with mobile devices.


The utility of speech recognition on mobile devices arises first and foremost from the difficulty of entering text manually on mobile form factors, where a full-sized keyboard is not available. However, we view speech recognition on mobile devices as having applications beyond simple text entry. The ability of mobile devices to store and manage personal information and various forms of media in large quantities makes it necessary to have efficient ways to create, edit, navigate, query, and access this information.

Given the aforementioned trends, then, how can we bring ubiquitous speech recognition and understanding capabilities to mobile devices? One approach is simply to rely on the network, offloading most of the computation to remote servers. This is a perfectly viable option for many applications, such as voice-driven search, where a network connection is required for the core application. The second is to build special-purpose hardware for speech processing, as in [28]. Unfortunately, most speech applications require a large amount of flexibility, since vocabularies and grammars are typically generated dynamically based on the needs of the application. As well, the underlying statistical models must be adapted to the user and task for optimal performance. Therefore, a pure hardware solution may not be feasible for many interactive speech-driven systems.

A third option is to extract the maximum utility from general-purpose computing hardware, and it is this approach with which this proposal is concerned. There are two ways to go about this, which are in fact complementary to each other. The first, and most conventional, is to minimize the amount of computation required in order to achieve a certain level of accuracy in automatic speech recognition. This typically involves tuning or rearchitecting a system to be as fast as possible within the bounds of a particular error rate. The other way to effectively implement speech recognition in resource-constrained tasks is to accept the presence of errors and pursue a strategy of minimizing the effect of error on the functioning of the application. This can involve making the application more robust to the presence of error, or it can involve making it possible to recover from or correct errors at some future time. It is this latter strategy which is the subject of this proposal. We refer to it as anytime recognition by analogy to the anytime algorithms originally used for planning tasks [9].

1.1 Thesis Statement

Automatic speech recognition for mobile and interactive applications necessarily involves a compromise between efficiency and accuracy. The problem of degraded accuracy can be successfully mitigated by modeling the uncertainty in the output of the speech recognizer, which makes it possible to defer decisions about the recognition result to higher-level reasoning components of the system, or to subsequent passes of recognition using more detailed models. However, in addition to the speed-accuracy tradeoff mentioned above, efficient recognition also constrains the ability to effectively model uncertainty, by reducing the space of possible hypotheses considered by the recognizer.


We show, first, that errors in speech recognition output do not necessarily reflect a failure to capture underlying acoustic and phonetic information. We then propose specific strategies for jointly achieving efficient recognition and modeling the errors introduced by doing so. In this way, we can use the resulting error models to reconstruct the missing parts of the search space. This technique of intelligent scaling and error recovery has applications to mobile, parallel, and distributed computation for speech recognition.

1.2 Proposal Outline

In Chapter 2, we review the problem of automatic speech recognition, with reference to fundamental concepts, techniques, and algorithms. Chapter 3 introduces in detail the problem which we propose to solve, and previous work on which our solution builds. In Chapter 4 we present experimental results supporting the thesis statement. We also present preliminary investigations into hypothesis expansion and error recovery techniques which form the core of our proposed work. In Chapter 5 we propose specific techniques to achieve the goal of this thesis, with reference to related work from previous researchers. Finally, in Chapter 6, we review the expected contributions of this thesis along with a timeline for its completion.


Chapter 2

Automatic Speech Recognition

Automatic speech recognition, hereafter referred to as ASR, is the identification of linguistic content (words, phrases, or concepts) in audio data without human intervention. In the most general case, this can be considered an AI-complete problem, or one whose solution implies a complete solution to all problems of artificial intelligence. That is, to recognize speech in all situations and conditions requires a full knowledge of the ontological context of the utterance in question. This is because speech does not exist in isolation but as part of a pattern of human communication which refers to things in the real world.

For this reason, ASR is nearly always treated in the context of a particular application. It is also common to talk of the task or domain in which an ASR system operates. While a broad range of applications have been constructed or proposed based on this technology, ASR systems can generally be grouped into four main tasks, which are presented here in roughly increasing order of difficulty:

• Voice control - Use of voice commands, typically from a single user anddirected at the system, to control a computer or some application runningon it.

  • Dictation - Conversion of speech, typically from a single user and directed at the system, to text, with control components to allow for editing and error correction.

  • Dialog - Speech-based interaction between a computer and a human, whose purpose is to achieve some external goal.

  • Transcription - Conversion of speech, typically from multiple users and not directed at the system, to text. This is frequently used to provide input to other natural language processing tasks, such as information retrieval and machine translation.


The technique of anytime recognition is particularly suitable to applications which involve a higher-level understanding or reasoning component operating under a time constraint. For example, in a dialog system, the grammar which defines the space of acceptable sentences is often dependent on the state of the dialog. In many dialog systems, this state is fully determined by the system; that is, the expected input from the user is constrained to the set of acceptable responses to a specific question or other direction from the system. However, in mixed-initiative systems [27], the system needs to determine the state of the dialog in response to user input, which is relatively unconstrained. In such a situation we can envision a quick first pass of recognition which allows the dialog manager to determine the appropriate state-specific grammar, which can then be used to obtain a more reliable result.

Another promising use case lies part way between the dictation and transcription tasks. If exact dictation is not possible in real time, such as on a mobile device, it may still be useful to obtain rough transcriptions which can be used to categorize, index, and segment the recorded audio for search and browsing. For example, in [45], time-compressed audio was presented in conjunction with ASR transcriptions in a browsing interface for passages of recorded speech. The use of transcriptions was shown to have a positive effect on comprehension, even with non-negligible error rates. We propose that rough transcripts can also be used to classify short utterances or voice memos according to topic, or to produce summaries, while still allowing detailed transcription to be performed off-line.

2.1 Isolated and Continuous Speech Recognition

ASR is generally thought of as being a problem of pattern classification. Namely, we are attempting to classify an observation (or utterance) O as belonging to a class S, where S is a symbolic or linguistic representation of the observation. In the simplest case, there exists a finite vocabulary V of isolated words or phrases, and the recognition task consists simply of finding the word S ∈ V which matches the utterance. This type of recognition is frequently implemented without the use of probabilistic models, for example using dynamic time warping [21].

In a more general case, which is the one with which this proposal is concerned, the classification S is a sequence of words, and the observation consists of connected, natural speech. While the words in S are still drawn from a finite vocabulary, the number of possible classes or sequences is now extremely large. We refer to the set of possible word sequences as the search space and denote it with the symbol S.

This task, large-vocabulary continuous speech recognition, known hereafter as LVCSR, is considerably more difficult, for several reasons:

• There are no clear boundaries between words in the input.

• The pronunciation of words can vary considerably due to context.


• Since the search space is extremely large, it is no longer possible to simply test each possible classification in turn and pick the best.

Since we cannot exhaustively evaluate the entire search space, all LVCSR systems rely on heuristic search algorithms. The effect of the various heuristics is that a much smaller effective search space, denoted S′, is actually considered by the recognizer. The speed and accuracy of recognition depend heavily on the effects of the heuristic used to determine S′. Since S′ is that part of S which is, in fact, exhaustively evaluated by the recognizer, the computational complexity of the recognizer is a non-decreasing function of the size of S′. However, as the size of S′ decreases, the probability that the true classification S lies outside S′ increases, and therefore the error rate of the recognizer tends to increase. The goal of this thesis is to develop heuristics for decreasing the size of S′ in such a way that the missing elements can be reconstructed as needed by subsequent phases of search. This can also be viewed as modeling and correcting the errors incurred by these heuristics.

2.2 Noisy Channel Paradigm

Virtually all successful implementations of LVCSR are based on the information-theoretic concept of the noisy channel paradigm. Conceptually, we treat the speaker as an information source, which emits a sentence S according to a probability distribution P(S). We assume that S has been sent through a noisy channel, which has "corrupted" the original message, producing a speech utterance O, according to a probability distribution P(O|S). Speech recognition, therefore, is viewed as a process of decoding, or recovering the original message.

To decode, we search for the hypothesis S which minimizes the probability of error.¹ This leads us to the maximum a posteriori (MAP) hypothesis, which is, as its name indicates, the hypothesis with the maximum posterior probability P(S|O).

S_{MAP} = \arg\max_S P(S|O)    (2.1)

By Bayes' Rule, this can be expressed in terms of the probability distributions over the source and the channel:

S_{MAP} = \arg\max_S \frac{P(O|S) P(S)}{P(O)}    (2.2)
        = \arg\max_S P(O|S) P(S)    (2.3)
        = \arg\max_S \log P(O|S) + \log P(S)    (2.4)

¹In fact, we are implicitly searching for the S which minimizes the expectation of a loss function, which in this case is simply the zero-one loss function δ(S, Ŝ).


Being able to factor the search problem in this way considerably simplifies the task of speech recognition. This is partly due to the fact that the speech signal is continuous while the text is discrete, making it difficult to model the distribution P(S|O). In Section 2.4, we will briefly review the standard techniques for modeling P(O|S) and P(S).
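To make Equation 2.4 concrete, the following sketch (our illustration, not part of the proposal; the two scoring functions are hypothetical stand-ins for an acoustic model and a language model) picks the candidate sentence that maximizes the sum of acoustic and language model log-probabilities. In a real LVCSR system the candidate set is far too large to enumerate, which is precisely what motivates the search algorithms of Section 2.5.

```python
import math

def map_decode(candidates, acoustic_logprob, lm_logprob):
    """Return the hypothesis maximizing log P(O|S) + log P(S), as in Equation 2.4.

    `candidates` is an iterable of word sequences; `acoustic_logprob` and
    `lm_logprob` are placeholder scoring functions (hypothetical, for
    illustration only).
    """
    best, best_score = None, -math.inf
    for sentence in candidates:
        score = acoustic_logprob(sentence) + lm_logprob(sentence)
        if score > best_score:
            best, best_score = sentence, score
    return best, best_score
```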

2.3 Speech Recognition as Data Compression

An alternative view of the speech recognition problem is that, instead of a decoding task, it constitutes a lossy data compression or source coding task. Here, we view speech as simply a highly redundant encoding of the original message. The goal of speech recognition in this view is to find a compact representation of the speech which jointly minimizes the entropy rate of the output and the distortion between the original message and the output, as measured by some distortion function.

If we choose a sequence of words as the representation, and the negative acoustic log-likelihood of the hypothesis as the distortion function, then this is actually very similar to the standard MAP decoding technique described in Equation 2.4:

S = \arg\min_S H(S) + D(O|S)    (2.5)
  = \arg\min_S -E[\log P(S)] - \log P(O|S)    (2.6)
  = \arg\max_S \log P(O|S) + E[\log P(S)]    (2.7)

However, one reason to take this alternate view of speech recognition is that it allows us to consider the output of a speech recognizer as something other than simply a sequence of words. Namely, we view the output of a speech recognizer as a hypothesis space, which consists of a single hypothesis in the limit. The hypothesis space is a representation of the uncertainty remaining in the output of the speech recognizer. In multiple-pass decoding algorithms, described further in Section 2.6, the hypothesis space from one pass of recognition becomes the effective search space S′ for a subsequent pass. This is a very useful heuristic for search, and allows the use of more exact decoding algorithms and models than would otherwise be possible. The problem with this approach is that it can result in the propagation of errors if the first-pass heuristic is too restrictive, or if the models used for first-pass recognition result in the hypothesis space being impoverished in some way. In Chapter 4 we present some preliminary results which show that this problem, though serious, is not insurmountable, and in Chapter 5, we propose specific strategies for overcoming it.


2.4 Statistical Modeling for ASR

The basic equation for decoding in automatic speech recognition, as shown in Equation 2.8, results in the hypothesis S_{MAP} with the highest a posteriori probability, or alternatively, the lowest probability of error.

S_{MAP} = \arg\max_S P(O|S) P(S)    (2.8)

However, this hypothesis is optimal under the assumption that we know the probability distributions P(O|S) and P(S), which are, respectively, the likelihood of acoustic realizations given a word sequence, and the prior probability of a word sequence. In practice, these distributions are not known, and thus we must rely on statistical estimation to build models of them from observed linguistic data. These models are known, respectively, as the acoustic model and the language model.

In evaluating the performance of a speech recognition system, we can distinguish between two types of error [6]. So-called modeling errors are errors which result from divergence between the acoustic and language models and the true distributions P(O|S) and P(S). On the other hand, search errors are errors which result from algorithmic limitations (or software bugs) in the decoder which prevent it from being able to discover the optimal hypothesis S_{MAP}. The error recovery techniques we propose relate primarily to modeling errors. This is because our strategy for anytime decoding involves the use, in earlier phases of recognition, of models known to be incorrect. For reasons described in Chapter 4, we focus particularly on the case of out-of-vocabulary errors, where the language model is unaware of a word in the input speech, and therefore it is absent from the hypothesis space.

2.4.1 Acoustic Modeling

Virtually all contemporary ASR systems use Hidden Markov Models (hereafter referred to as HMMs) as a statistical model for the speech generation process. HMMs are a mathematical formalism for pattern recognition, first described in [4], which have proven to be extremely useful in modeling speech, both for recognition [39] and for synthesis [30]. As with all generative models of speech production (e.g. the source-filter model), an HMM is a highly simplified and abstract version of a very complicated process. However, in practice, it captures enough of the structure of speech to be useful. Estimation and inference on HMMs are also computationally efficient.

HMMs were originally, and perhaps more precisely, referred to as probabilistic functions of Markov chains. We say that a sequence of discrete random variables X_1, X_2, X_3, ..., X_N forms a first-order Markov chain if the distribution of X_k for any k is conditionally independent of all other variables in the sequence given the preceding element X_{k-1}. The values which the random variables X_k take on are known as the states of the Markov chain. A first-order Markov chain can be described compactly as a vector of initial probabilities P(x_i) and a matrix of transition probabilities P(x_i|x_j).


[Figure 2.1: First-order Markov Chain with Three States. States A, B, and C, with self-loop probabilities P(A|A), P(B|B), P(C|C) and forward transitions P(B|A), P(C|B).]

Higher-order Markov chains are also possible, in which case the conditional independence assumption is made given the previous M symbols. In this case, the matrix of transition probabilities contains N^{M+1} parameters.

Markov chains are useful in modeling the distribution of discrete sequences where successive symbols are not statistically independent. For non-trivial examples such as human language, it is generally the case that each symbol depends on more than simply the identity of the previous symbol, but in practice, the benefits of this simplifying assumption of independence in terms of computational and statistical efficiency outweigh the possibility of modeling errors.
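As a minimal illustration of these definitions (toy probabilities of our own choosing, roughly matching the topology of Figure 2.1), the sketch below scores a state sequence under a first-order Markov chain by combining an initial probability with one transition probability per step:

```python
import math

# Toy first-order Markov chain over states A, B, C (hypothetical numbers).
INITIAL = {"A": 1.0}
TRANSITION = {
    ("A", "A"): 0.6, ("A", "B"): 0.4,
    ("B", "B"): 0.7, ("B", "C"): 0.3,
    ("C", "C"): 1.0,
}

def chain_logprob(states):
    """Log-probability of a state sequence under the first-order chain."""
    probs = [INITIAL.get(states[0], 0.0)]
    probs += [TRANSITION.get(pair, 0.0) for pair in zip(states, states[1:])]
    if any(p == 0.0 for p in probs):
        return float("-inf")
    return sum(math.log(p) for p in probs)

print(chain_logprob(["A", "A", "B", "C", "C"]))
```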

The use of Markov chains for modeling the speech signal is motivated by the fact that speech is quasi-stationary in nature. That is, the acoustic properties of the signal are stable over short time periods. We can therefore view these stable periods as states in a Markov process. However, there are two major problems in attempting to model speech using a simple Markov chain. The first and most obvious is that the speech signal is continuous rather than discrete. The second is that the acoustic realization of these states is itself highly variable, even within the same utterance by the same speaker.

The first problem can be solved using vector quantization, which maps the continuous units of acoustic space to discrete symbols. However, in order to minimize quantization error, it is necessary to use many more VQ codewords than there are states in the underlying Markov model. In addition, since vector quantization is a form of unsupervised learning, the individual codewords are generally not identifiable; that is, they do not correspond directly to any prior classification of the acoustic space, such as the set of model states.

Therefore, we formally treat the observation of a state as a random variable whose distribution is conditional on the identity of the state. In order to simplify the resulting model, we make a further independence assumption, known as the output independence assumption, which states that the distribution of an observation is conditionally independent of all other observations in the sequence given the identity of the underlying state. A graphical representation of a three-state HMM is shown in Figure 2.2.


[Figure 2.2: First-order Hidden Markov Model with Three States. The same chain of states A, B, and C as in Figure 2.1, with each state additionally emitting observations according to an output distribution P(O|A), P(O|B), P(O|C).]

Although the independence assumptions inherent in HMMs can be viewed as flaws in the model, their main effect is that HMM-based acoustic models tend to exaggerate the dynamic range of the conditional probability P(O|S). This is particularly acute when continuous density functions are used to model the observations of states, as is commonly done for automatic speech recognition. Because of this, the results of evaluating an acoustic model are generally thought of as scores rather than probabilities as such, and a variety of empirically tuned heuristics are used to regularize them.

There are in fact several ways in which the acoustic model contributes to modeling error. The first and most obvious is a mismatch between the acoustic conditions in which the model was trained and those in which it is used for recognition. The second arises from inadequate amounts of training data, such that the parameters of the output density functions for the states of the model are not robustly estimated. To mitigate this problem, essentially all speech recognition systems employ some form of parameter tying, where equivalence classes are constructed among the various acoustic units in the system, and these classes share some or all HMM parameters.

As we show in Section 4.1, it is frequently the case that the acoustic model is "correct" at some level even when the results of speech recognition are quite incorrect. This indicates that the source of modeling error may not be the divergence between the model and the input data, but rather the uncertainty in the mapping between acoustic units and higher-level symbolic units, such as phonemes and words. This can be explained through the inherent acoustic confusability of words, as well as through coarticulation and other context-dependent effects which neutralize the distinctions between phonemes. As well, in the case of speaker-independent acoustic models, it is necessary to model both intra-speaker and between-speaker variability in pronunciation, which adds an additional source of confusability.


These sources of error can be addressed through techniques such as speaker-adaptive training [1], where between-speaker variability is modeled externally using parameter or feature space adaptation, and discriminative training [2], where the parameters of the acoustic model are explicitly optimized to minimize the expected error rate of the recognizer. We propose a complementary approach to this problem, where the uncertainty in the mapping between symbolic and acoustic representations is modeled explicitly, and this model is used to identify and correct errors in the early stages of recognition.

2.4.2 Language Modeling

The language model is the component of a speech recognizer which determines how likely a word or sentence is to have been spoken in the first place. The most obvious reason why this is necessary is that many words or sequences of words sound very similar, and it is not possible to decide among them without some prior knowledge of which ones are admissible or likely for a given language or domain. Mathematically, the language model is a statistical model of the probability distribution P(S) over word sequences. As with HMMs in acoustic modeling, virtually all modern ASR systems use history-based or N-Gram models for language modeling.

In describing language models, we define a sentence S as a sequence of N words, denoted w_i, and we use the notation w_i^j to represent the sequence of words from index i to j:

S = (w_1, w_2, \ldots, w_N) = w_1^N

The probability of the sentence is taken to be the joint probability of the sequence as a whole, which is intractable to model directly due to the extremely large number of possible sequences. However, the definition of conditional probability allows us to factor the joint probability of a sequence into a product of conditional probabilities:

P(A, B) = P(A|B) P(B)    (2.9)
P(A, B|C) = P(A|B, C) P(B|C)    (2.10)
P(A, B, C) = P(A, B|C) P(C) = P(A|B, C) P(B|C) P(C)    (2.11)

Using this chain rule, the probability of a sentence can be factored into the product of the conditional probability of each word given the sequence of all preceding words, which we refer to as the history, denoted as h_i:

P(S) = P(w_1, w_2, \ldots, w_N)
     = P(w_N, w_{N-1}, \ldots, w_1)
     = P(w_N | w_1^{N-1}) P(w_{N-1} | w_1^{N-2}) \cdots P(w_1)
     = P(w_N | h_N) P(w_{N-1} | h_{N-1}) \cdots P(w_1)    (2.12)


To model these conditional probabilities efficiently, we use the observation that the mutual information between a word and any predecessor word tends to decay with distance. Therefore, we can consider all histories ending with the same m words to be part of an equivalence class. This is equivalent to making an assumption of conditional independence of the kind used in other Markov chains (and, as we saw earlier, in Hidden Markov Models). Since the word and history taken together form a sequence of N words, we refer to a history-based model as an N-Gram model, though in fact it is simply an (N−1)th-order Markov chain over word sequences.
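The sketch below (our illustration; the floor probability is a crude stand-in for the smoothing techniques discussed next, and the toy probabilities are invented) applies this equivalence-class assumption directly: each word is scored given only its previous n−1 words.

```python
import math

def ngram_logprob(sentence, lm, n=3):
    """Log P(S) under an N-gram model: condition each word on at most the
    previous n-1 words, per the history equivalence classes described above.

    `lm` maps (history_tuple, word) -> probability; unseen events fall back
    to a small floor probability (a crude stand-in for real smoothing).
    """
    logp = 0.0
    for i, word in enumerate(sentence):
        history = tuple(sentence[max(0, i - n + 1):i])
        logp += math.log(lm.get((history, word), 1e-10))
    return logp

# Toy usage with invented probabilities:
lm = {((), "the"): 0.1, (("the",), "cat"): 0.02, (("the", "cat"), "sat"): 0.3}
print(ngram_logprob(["the", "cat", "sat"], lm))
```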

Because the number of words in a natural language is still extremely large, we encounter a number of unique problems in estimating the conditional probabilities P(w|h) which make up an N-Gram model, and these are potential sources of modeling error. The most serious problem is that for any corpus of text, the majority of the words in the vocabulary may only occur a few times, and the overwhelming majority of N-Grams, that is, word sequences of length N, will never be observed. If maximum likelihood estimation is used, the variance of the resulting probability estimates will be quite large, and the model will also assign zero probability to many plausible word sequences. A wide variety of smoothing techniques have been proposed to mitigate this problem. The most well-known and widely used ones are Katz smoothing [23] and Modified Kneser-Ney smoothing [7].

As the size of the basic vocabulary grows, N-Gram models encounter two issues with consequences for speech recognition. The first is that the number of possible N-Grams grows polynomially in the size of the vocabulary. As the number of N-Grams increases, so does the number of parameters to be trained, as well as the storage and memory space needed to store them. The second is that a larger vocabulary tends to increase the perplexity of the model, which has an adverse effect on the speed and accuracy of the recognizer, since it increases the size of the search space. For these reasons it may be preferable to limit the size of the vocabulary in earlier stages of recognition. This approach is explored in greater detail in Chapter 4.

2.5 Search Algorithms for LVCSR

For a small-vocabulary isolated word recognition system, where each word is modeled using a single HMM, a word hypothesis can be found by simply evaluating the input data against all word models and picking the model with the highest probability. However, this approach becomes very slow as the size of the vocabulary increases, and is completely intractable for connected-word recognition. For this reason, heuristics are necessary to make automatic speech recognition achievable in practical, real-time systems, and therefore, search errors are inevitable in the absence of perfect heuristics. Although, as previously mentioned, the focus of this thesis proposal is on modeling errors, the mechanics of large-vocabulary search, and in particular the technique of multi-pass search, are essential background for the proposed work.


2.5.1 Viterbi Beam Search

A common solution to the problem of search complexity is to evaluate all models time-synchronously, retaining only the highest scoring paths at each time point. Unlikely hypotheses are discarded early on, such that for most of the observation data, only a small subset of the model and state space needs to be explicitly evaluated. Since the Viterbi algorithm is concerned only with finding the probability of the most likely state sequence, the resulting score remains exact as long as the most likely state sequence was not eliminated from the search.

This algorithm is known as beam search [13], and is implemented by keeping a list of "active" states for each timepoint (for first-order HMMs, in fact, only the present and previous timepoints need to be stored). After the probability of the best path entering each state has been calculated, a "beam", or ratio, is multiplied by the highest probability to obtain a threshold value. All states whose best path probability falls below the threshold are removed from the active list for the current frame, and no transitions out of them are considered when evaluating the next frame.
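A minimal sketch of the pruning step just described is shown below (our illustration; real decoders work with log scores and per-frame active lists, but the thresholding logic is the same):

```python
def prune_active_states(path_scores, beam=1e-6):
    """One frame of beam pruning: keep only states whose best path
    probability is within a ratio `beam` of the frame's best score.

    `path_scores` maps state -> best path probability ending in that state
    at the current frame. States that fall below the threshold are dropped
    from the active list and contribute no transitions to the next frame.
    """
    if not path_scores:
        return {}
    threshold = max(path_scores.values()) * beam
    return {s: p for s, p in path_scores.items() if p >= threshold}

# Example: with a beam of 1e-6, the third state is pruned.
active = prune_active_states({"s1": 0.9, "s2": 0.4, "s3": 1e-9}, beam=1e-6)
```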

2.5.2 Two-Level Viterbi Search

In large-vocabulary continuous speech recognition, as mentioned in Section 2.1, more advanced algorithms are required, due to the fact that the space of possible sentences is extremely large. Since it is clearly impossible to model every possible sentence using a separate HMM, we must use a two-level search strategy, in which sentence hypotheses are constructed dynamically based on the results of word-level recognition. This entails a modified version of the Viterbi algorithm, where, instead of finding the best state sequence, we attempt to find the best word sequence. Therefore, transitions between states inside a word and transitions between words are treated separately.

This simple two-level strategy is known as flat lexicon search. For each word in the vocabulary, an HMM is constructed, possibly by concatenating a sequence of subword models. Transitions are then created between the final state of each word HMM and the initial state of all other word HMMs. When calculating the path score for the initial state of a word, the language model score is calculated based on the previous word, and the identity of the best previous word is recorded along with the current time point in a backpointer table. For word-internal transitions, instead of recording the previous state in the best path ending in a given state, a pointer to this backpointer entry is "propagated" through the states of the word along with the corresponding path score. For each state of each word, the algorithm stores a pair consisting of the best path score ending in that state and the backpointer entry corresponding to that path score. This is also known as the token-passing algorithm, where the "token" is a structure which here contains a score and backpointer entry [48].
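The data structures implied by this description might look roughly as follows (a sketch with hypothetical field names, not any particular decoder's actual implementation): a backpointer entry records a word exit, and the token carried through a word's states pairs a path score with the backpointer entry it is conditioned on.

```python
from dataclasses import dataclass
from typing import Optional, List

@dataclass
class Backpointer:
    """One entry in the backpointer table: a word ending at a given frame."""
    word: str
    end_frame: int
    prev: Optional["Backpointer"]   # best predecessor word exit

@dataclass
class Token:
    """Propagated through the states of a word during flat lexicon search."""
    score: float                    # best path score ending in this state
    backpointer: Backpointer        # word history this score depends on

def backtrace(bp: Optional[Backpointer]) -> List[str]:
    """Recover the hypothesized word sequence from the final backpointer."""
    words = []
    while bp is not None:
        words.append(bp.word)
        bp = bp.prev
    return words[::-1]
```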

When a higher-order language model is used, it may be the case that the best state sequence entering a given word does not correspond to the best word sequence.


Assuming a trigram model (i.e. a second-order Markov model), this is the case when the language model score given the previous two words of the best path is sufficiently lower than that for some other two-word history. Consider the case in Figure 2.3. Here, the best path entering word C involves exiting word A with path score x. Likewise, the best path entering word D involves exiting word C with path score z, which incorporates word A and path score x, since these were the optimal candidates upon entering C, and all lower-scoring paths were discarded.

Figure 2.3: Search Error with Trigram Models

However, in this example, it is actually the case that, had we instead entered word C from B with path score y, the trigram language model score P(D|C, B), when combined with the alternative path score that would have resulted from entering C from B with y, would be greater than the score for the best word sequence (A, C, D) found by the search algorithm. Unfortunately, since the path exiting word B with path score y was discarded on entry to C, there is no way for the search algorithm to discover this optimal word sequence. In order to fully incorporate a trigram language model, it is necessary to propagate all the alternative two-word histories and associated path scores, as shown in Figure 2.4.

Figure 2.4: Full Trigram Search

In practice, since this involves an order of magnitude more storage space in the backpointer table, an approximate trigram search is frequently employed, where, as in Figure 2.3, only the current best word sequence is considered when calculating language model scores.


2.5.3 Lexicon Tree Search

With large vocabularies, the flat lexicon search algorithm can be quite slow, because the number of states in the decoding network becomes extremely large. In addition to beam pruning, which temporarily removes words from consideration when the best path score falls below a given threshold, it is also possible to compress the decoding network so that fewer states need to be evaluated at any given timepoint. A commonly used method of compressing the decoding network is to use a tree-structured lexicon. This is based on the fact that many words share common phonetic prefixes. Therefore, in a recognizer based on subword units, it is possible to group all words beginning with the same phone into a prefix tree, where the leaf nodes correspond to unique words, as shown in Figure 2.5.

Figure 2.5: Lexicon Tree
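The prefix-tree construction itself is straightforward; the sketch below (our illustration, with hypothetical pronunciations) builds such a tree from a pronunciation dictionary, and makes it clear that the identity of a word is only available once its final phone, i.e. a leaf node, has been reached:

```python
def build_lexicon_tree(lexicon):
    """Build a phonetic prefix tree from a pronunciation dictionary.

    `lexicon` maps word -> list of phones. Each node is a dict keyed by
    phone; the word identity is stored under the key None at the node
    reached by its final phone.
    """
    root = {}
    for word, phones in lexicon.items():
        node = root
        for phone in phones:
            node = node.setdefault(phone, {})
        node[None] = word
    return root

# Hypothetical pronunciations: START and STARTED share the prefix S T AA R T.
tree = build_lexicon_tree({
    "START":   ["S", "T", "AA", "R", "T"],
    "STARTED": ["S", "T", "AA", "R", "T", "IH", "D"],
    "STOP":    ["S", "T", "AA", "P"],
})
```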

In a lexicon tree decoder, transitions between phones must also be treated separately from normal HMM transitions, with propagation of tokens through the lexicon tree. One specific problem with lexicon tree decoding arises from the fact that the identity of a hypothesized word is not known until a leaf node in the lexicon tree is reached. If a single, static lexicon tree is used, it is not possible to apply language model scores until the final phone of a word has been entered.

Figure 2.6: Lexicon Tree Search Error

This results in a similar problem to the one encountered with trigram scoring in flat lexicon search, except that it also occurs for bigram language models. As shown in Figure 2.6, only the predecessor word with the highest path score is propagated to the point where the language model score is applied.


However, it is possible, and in fact quite likely, that another predecessor word would have been preferred had the language model score been available at word entry.

This problem is typically solved by making a copy of the lexicon tree for each unique trigram history, as in the word-conditioned lexicon tree search algorithm [22]. Alternately, in a token-passing decoder, multiple active hypotheses (in the form of tokens) can be associated with any given state in the search graph. An approximate solution is employed in the Sphinx-III decoder (since Sphinx 3.3), which uses a fixed number of lexicon trees and "rotates" them on every N word transitions. This reduces the probability of search errors of the sort shown in Figure 2.6 without the overhead of dynamically creating lexicon tree copies. The Sphinx-II decoder and its descendants instead use a multi-pass strategy, described in more detail in the next section.

2.6 Multiple Pass Search

The output of a speech recognition system typically consists not only of a single hypothesized word sequence but also of a word lattice, which is an encoding of a very large number of alternative sentence hypotheses [31]. This word lattice is an approximate, finite representation of the space of sentences S which has been searched by the decoder. It is also frequently the case that speech recognition systems use a multi-pass search strategy [43], which implicitly involves a multi-stage reduction in the search space. For example, in the PocketSphinx system [15] used as the baseline in our research, a three-pass search strategy inherited from the earlier Sphinx-II system [14, 41] is used. A block diagram of this architecture is shown in Figure 2.7.

Figure 2.7: PocketSphinx Decoding Architecture

In the PocketSphinx decoder, an approximate first-pass search, using a static lexicon tree, is first used to generate a short-list of words at each frame. As mentioned in Section 2.5.3, this search strategy suffers from widespread search errors. Therefore, the second pass uses a flat lexicon search, but uses the short-list generated in the first pass to restrict the set of words to be searched in each frame to a manageable number.


However, as described in Section 2.5.2, this search algorithm uses an approximate trigram scoring technique, where only the best two-word history is considered when applying language model scores at word transitions. To compensate for this, the third pass then performs an A* search over the resulting word lattice, allowing all trigram histories to be considered.

This organization of multiple passes is designed such that each successive pass is more exhaustive than the previous one, but also searches a more restrictive space of hypotheses. In this way, the known deficiencies of the earlier passes, such as the search error problem in the case of a static lexicon tree, and the approximate trigram problem in the case of a simple flat lexicon search, are corrected by a subsequent pass of search.

Another approach to multiple pass search, exemplified by the AT&T LVCSR-2000 system [29], is to use a cascade of progressively "stronger" models for each stage of recognition. This approach allows very complex models, such as 6-gram language models, to be used without resulting in a combinatorial explosion of search states. It is this strategy of moving from weak to strong models which is the basis of the research proposed here. The problem with this approach, which we propose to solve, is that the computational load is very unevenly balanced between the different passes. In particular, the first pass in the AT&T system is approximately 60 times slower than subsequent passes, and produces extremely large output lattices. This occurs both because the first pass uses a weak language model, and because it is specifically tuned to minimize the lattice word error rate. That is, the search space has been expanded so as to increase the probability that the correct hypothesis is contained within it, even if its probability according to the weak model is low.

The requirement of very dense first-pass word lattices is driven in part by the need to accurately model the uncertainty in recognition. In this case, uncertainty is modeled internally, by expanding the bounds of the space of hypotheses considered by the decoder. In contrast, we propose to model uncertainty externally, by capturing the relation between the reduced and expanded hypothesis space. We focus on one method of "weakening" the model, namely, reducing the size of the vocabulary, which poses problems for the standard model of multi-pass search. Drawing on prior work [42, 12, 34] which has explored the problem of modeling lexical uncertainty, we show preliminary results which suggest that this approach can be effective in allowing accurate multi-pass recognition with a time-constrained first pass.


Chapter 3

Anytime and Parallel Recognition

We propose to solve two problems which, on the surface, appear to be unrelated. These are the problem of anytime recognition and the problem of load balancing for parallel implementations of multi-pass recognition. At the root of both problems is a basic observation about the performance of speech recognition, which is that the speed of a recognizer is directly related to the size of the search space it must consider.

3.1 Anytime Recognition

The idea of anytime recognition is that useful results should be made available quickly, with the option to wait for more detailed or correct results. This is essentially an "inversion" of the standard approach to multiple-pass speech recognition, where a time-consuming first pass constrains the search space, allowing more complex models to be used efficiently in subsequent passes. For interactive applications, particularly on resource-constrained platforms, it may be more useful to have an inexpensive first pass, which provides some useful information to higher levels of the system, while allowing subsequent passes of recognition to produce detailed transcriptions. As an example, we propose to build a voice memo application which uses initial rough transcriptions to do real-time topic classification and indexing. Full transcriptions will then be generated in subsequent passes, which will either run "off-line" on the mobile device when charging, or on a personal computer connected via a wireless network (such as Bluetooth or 802.11).

In addition to postponing the use of expensive models to later, off-line passes of recognition, this framework also admits the use of adaptive models, such that subsequent passes can lead to continuous improvement.


Since the passes may be separated by an arbitrary amount of time, during which the user may interact with the results of recognition, it also admits the possibility of using both implicit and explicit feedback from the user to improve recognition results, as demonstrated in [16] with user-submitted notes in a meeting transcription system.

3.2 Load Balancing in Parallel Recognition

This idea also has a straightforward application to the problem of parallel and distributed speech recognition systems. A major problem with parallel speech recognition is that, while many basic operations such as acoustic density computation can easily be parallelized, there are serial dependencies in search which prevent a straightforward parallel implementation [40, 36]. One potential solution to this problem is to parallelize the system at a much higher level, by pipelining multiple passes of decoding [19, 20]. This can be accomplished if the output of each pass is roughly time-synchronous, that is, if the information used to constrain the search space can be derived from current, past, and a limited number of future acoustic observations. The shortlisting strategy used in the CMU Sphinx system, which is described in more detail in Section 4.2, relies only on a cross-section of active words within a small window of time, and therefore is amenable to a pipelined implementation.
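A rough sketch of such a pipeline is given below (our illustration; `first_pass` and `second_pass` are hypothetical callables standing in for partial recognizers). The first pass emits a time-synchronous shortlist for each block of frames, and the second pass consumes shortlists as they become available, so the two passes run concurrently in a producer-consumer relationship:

```python
import queue
import threading

def pipelined_decode(frames, first_pass, second_pass, window=20):
    """Pipeline two decoding passes over blocks of `window` frames.

    `first_pass(block)` returns a shortlist constraining the search space
    for that block; `second_pass(start, shortlist)` produces the refined
    result. Both are hypothetical placeholders for partial recognizers.
    """
    shortlists = queue.Queue()
    results = []

    def producer():
        for start in range(0, len(frames), window):
            shortlists.put((start, first_pass(frames[start:start + window])))
        shortlists.put(None)                     # end-of-utterance marker

    def consumer():
        while (item := shortlists.get()) is not None:
            start, shortlist = item
            results.append(second_pass(start, shortlist))

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```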

The problem, then, is balancing the computational load between the multiple passes of recognition. Unless it is possible to organize them so that they consume roughly equivalent amounts of CPU time, the overall speed of the system will be limited by the slowest pass. In fact, this is exactly the same as the basic problem of anytime decoding, since the initial pass of recognition is by nature much more resource-intensive than the others and must be "scaled down" to meet anytime constraints. If we consider a distributed implementation of such a system, then we face the additional problem of network bandwidth between the various nodes. Here, the root of the problem is the need to overgenerate intermediate representations so as to minimize errors in subsequent passes. In all of these cases, the fundamental question is whether it is possible to perform speech recognition through iterative refinement of the hypothesis space without explicitly generating the entire space.

To solve these problems, we propose a new approach to multiple pass speech recognition which does not rely on minimizing the lattice word error rate from the outset. In the following section, we propose ways for the system to recover from lattice errors, effectively reconstructing the missing parts of the search space that are lost when the previous pass has been optimized for performance, or for a specific, reduced vocabulary or task. These techniques will allow us to effectively balance computational load between passes in an "anytime" system, or between threads in a "pipelined" system.


Chapter 4

Preliminary Experiments

The experiments described in this section are preliminary investigations into the possibility of reconstructing missing information from the output of a reduced or "impoverished" first pass of decoding. Our principal aim is to show that this information is present in some form. That is, despite the presence of search or modeling errors at the word level, the acoustic or phonetic information in the speech signal has not actually been corrupted. This allows us to view these errors not as errors but rather as approximations. In this way, given a suitably reliable confidence measure, and a method for reconstructing likely alternatives in error regions, we are able to achieve the goal of deferring decisions by preserving uncertainty, which is the basis of multiple pass search.

In these experiments, we use the size of the vocabulary as the independent variable by which we "scale" the decoder. There are several reasons for this. The most important is that, although the computational load of a large-vocabulary continuous speech recognizer is dominated in roughly equal parts by acoustic model evaluation and search, the search module controls to a large degree the overall complexity of the system. If fewer arcs in the search graph (i.e. words or phonemes) are active at any given time, it is usually the case that fewer acoustic models will need to be evaluated. Therefore, the perplexity of the search graph plays a major role in determining the speed of the system as a whole. In a system using a statistical N-Gram language model, all else being equal, this is proportional to the perplexity of the language model λ with respect to the input w_1^N:

pplx(w_1^N; \lambda) = \exp \sum_{i=1}^{N} P(w_i | h_i, \lambda) \log \frac{1}{P(w_i | h_i, \lambda)}    (4.1)
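A literal transcription of Equation 4.1 into code might look as follows (our sketch; it assumes the per-word conditional probabilities P(w_i | h_i, λ) have already been computed, with out-of-vocabulary words mapped to the unknown type beforehand):

```python
import math

def pplx(word_probs):
    """Perplexity as written in Equation 4.1.

    `word_probs` holds P(w_i | h_i, lambda) for each word of the test set,
    where out-of-vocabulary words have already been mapped to the single
    "unknown" word type.
    """
    return math.exp(sum(p * math.log(1.0 / p) for p in word_probs))

# Hypothetical per-word probabilities for a short test utterance.
print(pplx([0.05, 0.20, 0.01, 0.10]))
```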

Test set perplexity is the principal measure used to evaluate the progress of language modeling research. In general, there are three ways to reduce the perplexity of a given data set with respect to a language model:

1. Make the language model closer to the data set, by training it on more representative data.


2. Improve the quality of the probability estimates which make up the language model, by training it on more data, or by using more advanced smoothing techniques.

3. Holding the training set constant, reduce the size of the vocabulary, mapping out-of-vocabulary words in the training and testing sets to a single "unknown" word type.

It is easy to see why reducing the vocabulary also reduces perplexity. As more words in the training set are mapped to the unknown word, the probability estimate for P(unk | h_unk, λ) increases for all histories h_unk, and likewise the information term log(1 / P(unk | h_unk, λ)) becomes smaller. As more words in the testing set are mapped to the unknown word, the number of terms in the summation in Equation 4.1 containing the unknown word increases. Since the information of the unknown word has decreased, and its relative weight has increased, the overall perplexity must also decrease.

Since the subject of this proposal is neither improved language modeling nor efficient acoustic model evaluation, reducing the vocabulary is the preferred method here for reducing the perplexity of the search graph, and thus speeding up the decoder. In addition, the out-of-vocabulary errors which result from a reduced vocabulary are largely independent of the characteristics of the decoder, unlike the types of search errors described in Section 2.5.2. For this reason, the techniques developed here are likely to generalize to ASR systems other than the ones used in our experiments.

4.1 Acoustic Reconstruction

The idea that it is possible to recover from errors in the context of a multiple-pass system is based on the fundamental observation that the acoustic matching performed by speech recognition engines is robust even at very high phoneme error rates [37]. In speech recognition, error analysis which goes beyond simple word error rate calculations [6] reveals that a similar connection holds between errors at the word and phoneme level. That is, for any given word in a hypothesis sentence which is marked as incorrect by a word-level alignment, it is frequently the case that most of its constituent phonemes are actually correct with respect to the reference sentence, both in their identity and their time alignment.

To demonstrate this, we performed a set of experiments using the Sphinx-III speech recognition system, where we resynthesized speech from the results of decoding at different levels of representation. We trained a set of continuous-density acoustic models from the WSJ0 corpus [35], using the same variety of mel-frequency cepstral coefficients used in HMM-based speech synthesis [30], along with their first and second time derivatives. There are two distinct cases we investigated: the first is in-domain though possibly out-of-vocabulary data, while the second is out-of-domain data. We are specifically interested in the robustness of different levels of representation in speech recognition (word, phoneme, HMM state) to out-of-vocabulary and out-of-domain data.


To resynthesize speech from the word and phone-level decoder output, we ran a pass of force alignment to obtain state segmentations, along with the posterior distribution over Gaussian mixture components for each state's output distribution. An initial sequence of acoustic feature vectors was generated by concatenating the means of the most likely mixture component for each state in the sequence. The maximum-likelihood parameter generation technique described in [44] was then used to smooth the MFCC coefficients with respect to their time derivatives. To produce reconstructed speech, we used the MLSA filter [18] with an excitation signal generated from a pulse-noise switch and frame-by-frame pitch estimates calculated from the original waveform using the YIN algorithm [8].

The acoustic distortion in the reconstructed speech can be measured objectively as the average mel-cepstral distortion between the original parameter vectors x and the reconstructed parameters y, as in Equation 4.2. This is commonly used as a rough evaluation measure for voice conversion and speech synthesis systems.

$$\mathrm{mcd}(\mathbf{x}, \mathbf{y}) = \frac{10}{\log 10} \sqrt{(\mathbf{x} - \mathbf{y})^{T} (\mathbf{x} - \mathbf{y})} \qquad (4.2)$$
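
A minimal sketch of Equation 4.2, assuming the original and resynthesized cepstral frames have already been paired frame by frame (the function names are illustrative only); the 10/ln 10 factor is the usual conversion to decibels:

```python
import numpy as np

def mel_cepstral_distortion(x, y):
    """Mel-cepstral distortion (Equation 4.2) for one frame, in dB.

    x, y: 1-D arrays of cepstral coefficients for the original and the
    reconstructed frame.  10 / ln(10) is the dB conversion factor."""
    d = np.asarray(x) - np.asarray(y)
    return (10.0 / np.log(10.0)) * np.sqrt(np.dot(d, d))

def average_mcd(original_frames, resynthesized_frames):
    """Average frame-level MCD over two equal-length frame sequences."""
    return float(np.mean([mel_cepstral_distortion(x, y)
                          for x, y in zip(original_frames,
                                          resynthesized_frames)]))
```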

We first evaluated the case of in-domain but out-of-vocabulary data by decoding the November '92 development set using reduced-vocabulary language models. The language models used were the standard Lincoln Labs 20,000 word and 5,000 word open vocabulary bigram models, along with reduced models trained from the WSJ0 language model training data using only the most frequent 3,000 and 500 word types. The results of this evaluation are shown in Table 4.1. Here, we compare the word error rate, the equivalent phoneme error rate, and the average distortion of resynthesis from the Viterbi state alignment and best mixture component, as described above. The phoneme error rates were obtained by expanding the pronunciation of each word in the hypothesis and aligning them against a reference phonetic transcription generated by force alignment. As expected, in a word-level decoder, the equivalent phoneme error rate is typically much lower than the word error rate, and phoneme error and word error are closely correlated. For example, moving from a 20,000 word to a 5,000 word vocabulary results in a 50% relative increase in word error and a 42% relative increase in phoneme error.

However, the acoustic distortion value barely increases even with catastrophic increases in the word and phoneme error rate. Amazingly, the 500 word vocabulary, despite an 83.23% word error rate, achieves an MCD value of 6.268 dB, which is within the range of acceptable results in voice conversion and synthesis. Furthermore, in listening to the resynthesized waveforms, there is little or no difference in the perceived word sequence, despite gross errors in the word-level transcription, such as those shown in Table 4.2. In order to verify that, despite the high error rates, the recognition results were phonetically accurate and useful at some level, we also ran a resynthesis experiment using a randomized version of the reference transcript, where the words of each sentence were replaced by words randomly sampled according to the unigram distribution of the language model training data.


Decoder                  Word Error   Phoneme Error   MCD
Word Bigram, 20k words   20.62%        8.09%          5.970
Word Bigram, 5k words    30.84%       11.52%          5.985
Word Bigram, 3k words    38.69%       14.80%          6.000
Word Bigram, 500 words   83.23%       34.72%          6.268
Random Transcript        99.66%       91.23%          8.156
Force Alignment           0.00%        0.00%          5.982
Phoneme Trigram            n/a        27.12%          5.963

Table 4.1: Mel-Cepstral Distortion, WSJ devel20k

In this case, the average MCD increases to 8.156 dB, and the output is no longer intelligible.

Voc   Text
ref   then he treats it with agree conditioner to add fragrance
20k   THAT he treats THAT WHEN THE GREEK conditioner to add fragrance
5k    THAT EACH REED SAID WHEN THEY agree CONDITION OR to add FREE GRANTS
3k    THAT THE TRADE SAID WHEN THEY agree CONDITION OR to add FOR A GRANT C
500   THAT THE TRADE SAID WHEN THEY AGREED TO THIS NOW to AT FOR A CURRENT C

Table 4.2: Examples of Word Error with Reduced Vocabularies

Returning to Table 4.1, we also ran experiments using the reference transcription as the input to resynthesis, as well as the output of the Sphinx-III phoneme decoder, using a phoneme trigram model trained from TIMIT and phonetically-expanded WSJ0 training transcripts. First, the MCD value of resynthesis from the “perfect” reference transcriptions is actually slightly higher than that obtained by using the best speech recognition results, with a 20% word error rate. As well, although the phoneme error rate of the phone decoder is significantly higher than all but the worst word decoder, resynthesis from its output achieves the lowest distortion value in this test. Both of these results seem to indicate that word or phoneme error is only one of several factors which contribute to acoustic mismatch. It appears that acoustic mismatch is also proportional to the degree to which the symbolic representation of the speech signal is constrained, be it by a word-based language model or by a reference transcription.

This result can be explained by returning to the idea of speech recognition as data compression mentioned in Section 2.3. For real-world data and compression algorithms there is typically a tradeoff between rate and distortion, such that the minimum distortion increases as we move to lower bit rates. By adding constraints to the output of the speech recognizer, we also inevitably reduce the entropy rate of this output. Equivalently, by adding words to the vocabulary with the language model training data set held constant, we increase the entropy rate.

To examine the case of out-of-domain data, we used recordings from the CMU Arctic speech synthesis corpora [25].


This data consists of carefully read recordings of selected sentences from out-of-copyright novels. We used 100 sentences of data from six speakers, and decoded using both the 20,000 word vocabulary WSJ model and the phoneme decoder, resynthesizing in the same fashion mentioned above. The results are shown in Table 4.3. The pattern here is similar to that shown above, where resynthesis from speech recognition transcripts, even at high levels of error, results in less distortion than using the original transcripts. The MCD numbers shown here are extremely high, for reasons that are not clear, since in all conditions the resynthesized speech is intelligible and matches the input text. This may be due to the acoustic mismatch between the acoustic model training data and the test set, a condition which is exacerbated by the fact that no cepstral mean normalization was performed in these experiments.

Decoder             Word Error   Phoneme Error   MCD
Word Bigram         49.41%       33.47%          10.098
Phoneme Trigram      n/a         45.98%          10.141
Force Alignment      0.00%        0.00%          10.372
Random Transcript   99.78%       82.74%          12.888

Table 4.3: Mel-Cepstral Distortion, CMU Arctic

Although the main goal of these experiments is to establish that the essential qualities of the speech signal are not diminished by speech recognition at high error rates, they are also potentially relevant to the “anytime” systems described previously. This is because the amount of information used to code the original utterance for resynthesis is quite small, and therefore may be effective as a compressed form for archiving speech data. In addition, the fact that the output speech remains intelligible even at fairly high phoneme error rates indicates that there is a considerable amount of phonetic redundancy in the acoustic model, such that likely mixture components with respect to a given frame of input can be found across a range of different base phonemes. It may be possible to compress the models or improve their discriminative ability by identifying and removing this redundancy.

4.2 Hypothesis Expansion

Having shown that the acoustic and phonetic characteristics of the speech are preserved at some level in the decoder, we still face the problem of how to access this information for the purpose of guiding or constraining the search algorithm in later passes of recognition. In order to demonstrate that this is possible, we have investigated a simple word-level approach based on expanding the hypothesis space between passes. In [42], an error-correction model based on statistical machine translation techniques was used to compensate for domain mismatch between the training and test sets.


The motivation for this work was to allow a speech recognition engine to be used as a “black box” in a variety of spoken dialog systems, that is, without requiring customized language and acoustic models for every domain.

We take a similar approach to correcting lattice errors. That is, given a first-pass lattice which fails to contain the top hypothesis, due to out-of-vocabulary words or search errors, we seek to correct these errors by reintroducing the missing words. This is actually somewhat easier than correcting errors in the single best hypothesis as in [42], since we rely on the second pass of decoding, which has access to acoustic models and observations, to make the final decision as to the best hypothesis. In this respect our approach is similar to the hypothesis driven lexical adaptation technique described in [12], where phonetic information from a first-pass lattice is used to add words to the pronunciation dictionary in order to improve coverage of out-of-vocabulary words in broadcast news transcription.

Furthermore, in the Sphinx-III system, the second pass of decoding does not rescore the lattice, strictly speaking, but rather uses a shortlisting strategy. This means that, as opposed to constructing the search graph directly from the first-pass lattice, a cross-section operation is done to generate a list of likely words at each frame, and this list of words is used as the dynamic vocabulary for decoding. It is possible for the second pass to generate word sequences which are not present in the first-pass lattice. Therefore, it is sufficient to simply insert the correct words into the first-pass lattice overlapping the correct time region, regardless of whether doing so actually reduces the lattice word error rate.

In order to achieve this, we minimally require a function that maps words in the errorful lattices obtained from reduced-vocabulary decoding to possible corrections. In an initial “cheating” experiment, we simply collected errors from the word lattices obtained by decoding the 20,000 word November '92 development set using the 5,000 word bigram language model and the PocketSphinx recognition system. This was done by using dynamic programming to find the minimum-error path through the lattice with respect to the reference transcription, backtracing to find an alignment, and constructing a mapping of incorrect to correct words.
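
The following sketch illustrates the final step under a simplifying assumption: instead of aligning the minimum-error path through the lattice, it aligns a single hypothesis word sequence (for example, the best path) against the reference, using the block matcher in Python's difflib as a stand-in for the dynamic-programming alignment, and harvests the substitution regions into a confusion mapping. The function name and data layout are illustrative only.

```python
from collections import defaultdict
from difflib import SequenceMatcher

def collect_confusions(pairs):
    """Build a mapping from incorrect to correct words by aligning each
    hypothesis word sequence against its reference transcription.

    pairs: iterable of (hypothesis_words, reference_words) lists.
    Note: this sketch aligns one hypothesis string per utterance; the
    experiment described above aligns the minimum-error lattice path."""
    confusions = defaultdict(set)
    for hyp, ref in pairs:
        matcher = SequenceMatcher(a=hyp, b=ref, autojunk=False)
        for op, i1, i2, j1, j2 in matcher.get_opcodes():
            if op == "replace":
                # A multi-word region on either side records a potential
                # split or merge; every hypothesis word in the region is
                # mapped to every reference word it overlaps.
                for wrong in hyp[i1:i2]:
                    confusions[wrong].update(ref[j1:j2])
    return confusions
```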

Some example mappings found by this method are shown in Table 4.4. It seems that they fall into five main categories, as shown here. The first category consists of homophones or near-homophones, which are generally also semantically related, as in the pair PRODUCTION / PRODUCTIVE. The second consists of unrelated words which are phonetically close. Typically these share a common phonetic subsequence, as in EXCEPTION / ACCEPTS, or a common metrical structure, as in ARABIA / CARIBBEAN. The third category is seemingly unrelated words. The fourth category is splits, nearly all of which are well-motivated phonetically. Finally, function words, which are typically short, highly confusable acoustically, and highly probable according to the language model, typically align to a variety of other words.

Although we did attempt to find “splits”, where one correct word is mapped to more than one incorrect word, we did not use these in the cheating experiment.


Incorrect          Correct
WORKERS            WORKERS'
PRODUCTION         PRODUCTIVE
ACCEPTABLE         UNACCEPTABLE
FALLING            FOLEY'S
EXCEPTION          ACCEPTS
ARABIA             CARIBBEAN
OPPOSITE           PARADOXICAL
DOWNTURN           GILBERT
STRESSED RATINGS   FRUSTRATING
WILL TRADED        INFILTRATED
BASED TRUST        DISTRUST
A                  ENGAGE, CONDITIONER, ACQUIRER, REBUILD, TO, HER, ORDER, STABLE

Table 4.4: Example Confusion Pairs and Splits, devel20k Using 5k Vocabulary

We then “corrected” these lattices by simply adding duplicate arcs for every incorrect word in the mapping, labeled with the correction words from the mapping. We then ran second-pass decoding using the Sphinx-III system, the results of which are shown in Table 4.5.

First Pass    WER (devel20k)   WER (test20k)
20k bigram    12.18%           12.08%
5k bigram     24.20%           24.96%
5k expanded   12.24%           24.85%

Table 4.5: Lattice Expansion, Cheating Experiment
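
A rough sketch of the expansion step itself, assuming a simplified arc record (word label, time span, score) rather than the actual Sphinx-III lattice format, and a confusion mapping like the one collected above:

```python
from collections import namedtuple

# A hypothetical, simplified lattice arc: word label, start/end frames,
# and the score carried over from the first pass.
Arc = namedtuple("Arc", "word start_frame end_frame score")

def expand_lattice(arcs, corrections):
    """Return the arcs of an expanded lattice: for every arc whose word
    appears in the confusion mapping, add duplicate arcs labeled with
    each correction, spanning the same time region with the same score."""
    expanded = list(arcs)
    for arc in arcs:
        for fix in corrections.get(arc.word, ()):
            expanded.append(arc._replace(word=fix))
    return expanded
```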

There are some obvious problems with this experiment, the first one being that our correction “model” is built from the development set and, as predicted, does not generalize to the test set. Nonetheless, it demonstrates the feasibility of “expanding” first-pass lattices to compensate for an impoverished vocabulary. It is also interesting to note that, despite no attempt being made to correct for splits, the second-pass error rate is very close to that achieved by using the full vocabulary in the first pass. The other major problem with this experiment is that it massively increases the size of the first-pass lattice. This is because the expansion pairs in the “model” include spurious mappings from function words to a large number of confusion words. This can easily be solved by excluding a small set of stopwords from the expansion process. This results in lattices which are roughly the same size as the unexpanded ones, and in the cheating experiment detailed above, has a minimal effect on the resulting error rate (12.39% versus 12.24%).

The generalization problem is more difficult, and is central to this proposal. Some solutions to it are suggested in literature related to speech-based information extraction and retrieval.


In [38], a phoneme decoder is used in conjunction with a phonetic distance measure defined over the vocabulary in order to generate ranked N-best lists of query words for a name recognition task. One chapter of [34] is devoted to the correction of errors in ASR output for information extraction. Here too, a phonetic distance measure was used to generate ranked lists of candidates for correction.

As a preliminary experiment to gauge the usefulness of phonetic distance in expanding lattices, we implemented a simple correction model using phonetic edit distance. The edit distance is the minimum number of insertions, deletions, and substitutions needed to transform one string of phones into another. For each hypothesis word in the 5k vocabulary, we generated a ranked list of candidate words for correction from the non-overlapping portion of the 20k vocabulary. The edit distance for each candidate was normalized by the number of phones in the hypothesis word, and those candidates under a threshold were retained for use in expansion. The results of this experiment are shown in Table 4.6.

Threshold   WER (devel20k)   WER (test20k)
0.0         24.20%           24.96%
0.1         24.01%           24.92%
0.2         22.27%           23.29%
0.3         20.37%           21.31%
0.4         18.15%           18.83%

Table 4.6: Lattice Expansion by Normalized Phonetic Edit Distance
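
The distance computation and candidate generation used in this experiment can be sketched as follows; the lexicon layout and function names are illustrative assumptions, not the actual implementation.

```python
def edit_distance(a, b):
    """Levenshtein distance between two phone sequences, with unit costs
    for insertions, deletions, and substitutions."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution
    return d[m][n]

def candidates(hyp_phones, oov_lexicon, threshold):
    """Rank out-of-vocabulary words by edit distance to the hypothesis
    word's phones, normalized by the number of phones in the hypothesis,
    keeping those under the threshold.  oov_lexicon maps each word
    outside the reduced vocabulary to its phone sequence."""
    scored = []
    for word, phones in oov_lexicon.items():
        score = edit_distance(hyp_phones, phones) / len(hyp_phones)
        if score <= threshold:
            scored.append((score, word))
    return [w for _, w in sorted(scored)]
```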

For this experiment, an identical, fixed cost was assigned to all substitutions, as well as to insertions and deletions. This creates the problem that, particularly for short words, a large number of candidate words have an equal ranking. As well, the degree of acoustic similarity between substitution pairs is not considered in the model, and thus the ranking of candidates is imprecise. The effect of both of these problems is that the candidate lists must be made longer in order to achieve the desired error correction effect, leading to very large expanded lattices. One solution to this is to use costs estimated from a phoneme confusion matrix, as in [38].

The results in Table 4.6 show that lattice expansion using phonetic distance measures is effective in reducing the second-pass error rate. However, it does not allow us to achieve comparable results to single-pass decoding using the full vocabulary and detailed acoustic model. Using a threshold of 0.4 to expand the first-pass lattice, the error rate is still roughly 50% higher than it would be had we run a first-pass search using the full vocabulary.

While some of this effect can be attributed to the shortcomings of the correction model noted above, it may also be the case that there are broad classes of errors which cannot be corrected by this model. In fact, in comparing the candidate lists generated by this model with the taxonomy of confusions shown in Table 4.4, it seems that only the first class, namely near-homophones, is captured with any regularity.


Hypothesis   Candidates
WORKERS      WORKER'S, WORKERS'
PRODUCTION   PRODUCTIONS, DEDUCTION, PERFECTION
ACCEPTABLE   ACCESSIBLE, SUSCEPTIBLE, UNACCEPTABLE, EXCEPTIONAL
FALLING      (none)
EXCEPTION    EXCEPTIONS, INCEPTION, DECEPTION, EXCEPTIONAL, RECEPTION
ARABIA       ARABIA'S, ARABIAN
OPPOSITE     (none)
DOWNTURN     DOWNTURNS

Table 4.7: Example Candidate Lists, Phonetic Edit Distance, Threshold 0.3

Candidate lists for some of the same words, using a distance threshold of 0.3, are shown in Table 4.7. The second class of errors from Table 4.4, namely words with similar phonetic structure, may be more easily captured by using an acoustically sensitive distance measure.

While the class of errors which results from word splits in the hypothesis is clearly phonetically motivated, it cannot be corrected by this model, since by nature it expands each hypothesis word into candidates of roughly the same duration. In the “cheating” experiment, splits were captured by the fact that all of the split hypothesis words were ultimately aligned to the correct word. Therefore, in the expanded lattice, they were all present in the short list throughout the duration of the original word and were thus available to the recognizer. There are obvious combinatorial problems with attempting to enumerate and model all possible splits, both in training the model and in expanding lattices at runtime. However, from the “cheating” results, it appears that many of the words which participate in splits are exact subsequences of the longer word for which they are confused.

We propose to address all of the aforementioned issues with this hypothesis expansion technique, as well as to make it suitable for real-time use as part of a pipelined or sequential multi-pass system. In particular, we intend to explore both dictionary-based and recognizer-based models for error correction. The former is the extension of the phonetic distance model described above, using acoustic and semantic distance measures to generate correction models. The latter is a generalization of the technique used in the “cheating” experiment of collecting confusion words from actual word lattices. In examining the baseline lattices generated by the full-vocabulary system, we find that most of the error words from our cheating model occur in close proximity to the correct word. Therefore we propose to use similarity measures and confidence scoring to find likely errors in a large body of full-vocabulary lattices.


Chapter 5

Proposed Work

This thesis proposal is premised on the idea that real-world speech recognition systems, particularly on mobile devices, involve an inevitable compromise between efficiency and accuracy. Although a wide variety of acoustic model evaluation and search techniques, e.g. [17], exist which improve the efficiency of the system with minimal effect on accuracy, ultimately design choices must be made which will result either in a higher error rate or greater computational complexity. Our preliminary experiments suggest that at some level the information in the speech signal is retained even by poor recognition, and that it is possible to model the uncertainty at other levels in order to recover from errors. The premise of the work proposed here is that the choice to sacrifice accuracy for efficiency can be made by introducing errors in a controlled fashion so that they can be recovered from more easily.

For reasons previously described, one of the most important design choices, and the primary one considered by this proposal, is the size and composition of the vocabulary of words recognized by the system. When using a reduced vocabulary for the first pass of recognition, we are faced with the choice of which words to include in this reduced vocabulary. In the two applications proposed here, namely anytime decoding and parallel decoding, the choice of a reduced vocabulary may be quite different. Therefore, in addition to the future work described in Chapter 4, we propose two distinct approaches to reducing, or scaling, the vocabulary while preserving the information in the speech signal such that it can be used to guide future passes of recognition.

In the case of anytime decoding for mobile applications, the reduced vocabulary is by nature a domain-specific one, where the goal is to minimize the concept error rate or the content word error rate. That is, we are most concerned about preserving the semantic content of the input, as conveyed by certain key words. Though this is an extremely naïve view of semantics, it corresponds closely to the one used in information retrieval, where properties such as topic and relevance are typically modeled based on unigram statistics. By contrast, in the case of load balancing for parallel decoding, we are concerned primarily with reducing the complexity of the first-pass search, with informational content or usefulness to the parent application being a secondary concern.


In this case the objective of vocabulary optimization is recoverability of the resulting out-of-vocabulary errors.

5.1 Multi-Resolution Decoding

In recent brief experiments in keyword spotting [24], it has been shown that good results can be achieved by using a hybrid word-phoneme decoder. This essentially takes the form of a decoder whose vocabulary consists of the keywords to be recognized, along with words for individual phonemes, and a statistical language model estimating the probability of transitions between phones and keywords. The output of such a decoder is interesting in that it consists of a mixture of words and phonemes, essentially “backing off” to phonetic decoding in regions where no keywords were recognized. This technique has also previously been applied to the problem of out-of-vocabulary word detection [11, 47]. In this case, the goal is to identify regions of speech which are outside, rather than inside, the given vocabulary.
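
As a rough sketch of how such a hybrid unit inventory might be assembled (the function and the pseudo-word naming convention are illustrative assumptions, not the setup used in [24]): the keywords keep their dictionary pronunciations, and each phoneme becomes a one-phone “word” over which the statistical language model can then be estimated.

```python
def hybrid_vocabulary(keywords, lexicon, phone_set):
    """Assemble the unit inventory for a hybrid word-phoneme decoder:
    the keywords to be recognized, plus one pseudo-word per phoneme so
    the decoder can back off to phonetic decoding elsewhere.

    keywords:  iterable of keyword strings
    lexicon:   dict mapping words to lists of phone symbols
    phone_set: iterable of phone symbols, e.g. from the acoustic model
    """
    units = {}
    for word in keywords:
        units[word] = lexicon[word]
    for phone in phone_set:
        # e.g. "AH" becomes a one-phone "word" spelled "_AH_"
        units["_%s_" % phone] = [phone]
    return units
```

A language model over these units would then be trained from transcripts in which non-keyword words have been expanded to their phone sequences.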

We propose to extend this approach to the problem of anytime recognition. That is, we propose to construct an efficient system which performs a combination of word and phoneme recognition, with the primary goal of recognizing a small- to medium-sized vocabulary of content words, and to use this as the first pass in a multi-pass recognition system. This use of phonetic information has a precedent in the phoneme look-ahead techniques described in papers such as [32], where phoneme posterior probability estimates are calculated in advance over a small window of speech using a simplified model, and are subsequently used to prune the set of active HMMs. This is also known as acoustic fast match, and was applied to a system running on a multicore CPU for mobile devices in [20].

To achieve this, the second-pass decoder will combine the hypothesis expansion and shortlisting techniques described in Section 4.2 at the word level with acoustic lookahead techniques at the phonetic level, that is, at the level of individual HMMs. Since the first-pass lattice will contain both phoneme and word hypotheses, the former can be used to generate phoneme posterior probabilities for use in pruning, while the latter can be used to determine the active vocabulary for search. We envision that this will allow the first-pass decoder to scale along a continuum from pure keyword spotting to medium-vocabulary speech recognition with out-of-vocabulary detection, while providing useful information to the second-pass decoder.

When doing keyword spotting, the second-pass error rate will depend largely on the false alarm rate of the first pass, since the existence of word hypotheses in the first pass will constrain the vocabulary for the second pass. If we take the first-pass lattice as the input for the second pass, then it may be necessary to prune the word components of it extensively, so as to remove this constraint from low-confidence regions.

At the other end of the continuum, namely medium-vocabulary speech recognition with out-of-vocabulary detection, the second-pass error rate will be constrained by the effectiveness of hypothesis expansion.


Since, as mentioned earlier, it is always possible to achieve effective hypothesis expansion by massively overgenerating confusable words (in the limit, we can simply “expand” to the entire vocabulary), the existence of extra phonetic information in the lattice can be used to constrain the set of expanded words.

The goal of this proposed work is to show that multiple levels of representation can be combined to produce effective heuristics for search in automatic speech recognition. We also intend to integrate keyword spotting and out-of-vocabulary word detection in a single framework for scalable recognition, which we expect to be useful for a variety of mobile and resource-constrained applications.

5.2 Vocabulary Optimization

The idea of using a two-pass system with a first-pass vocabulary optimized for phonetic coverage has mainly been studied in the context of vocabulary-independent recognition, where the goal is to provide accurate orthographic transcriptions of out-of-vocabulary words. This is particularly important for broadcast news transcription tasks, where the active vocabulary changes frequently due to current events, and where personal and place names are widespread. Some approaches to this problem have used phoneme decoding combined with grapheme-to-phoneme conversion models [10]. However, as in our preliminary experiments, it has been noted [5] that larger basic units of recognition such as syllables or words lead to lower phoneme error rates. Therefore, a number of studies have investigated the automatic generation of multi-phone lexical items for speech recognition. This problem is related to that of automatic speech recognition for morphologically complex languages such as Korean [26], where morphemes or other sub-word units have been used as the lexical basis for recognition.

In [46], the use of “particles” in first-pass decoding was proposed. These particles are units consisting of common subsequences of phonemes, which are determined automatically using a bottom-up clustering algorithm with the perplexity of the resulting particle-based language model as an objective function. After recognition using the particle-based language model, a word lattice was constructed by aligning sequences of particles to words in the vocabulary, and the resulting lattice was rescored using a word-based decoder. Although the goal here was to allow vocabulary independence, this method had a convenient side effect of being significantly faster than a word-based decoder, almost certainly due to the considerably smaller inventory of particles. Unfortunately, perplexity numbers for the particle versus the word language model were not reported. In a similar study [11], subword units consisting of grapheme-to-phoneme correspondences were created using a greedy algorithm based on mutual information, and the resulting units were used to generate orthographic representations for out-of-vocabulary words. This approach has the advantage of not requiring a separate phoneme-to-grapheme conversion model.


Our proposed research has a somewhat different goal than these previous studies. First, we are interested in having immediately useful results of first-pass decoding, rather than requiring a stage of alignment and rescoring before a word transcription can be obtained. Also, although recovery from out-of-vocabulary errors is central to this proposal, we are not concerned with unlimited-vocabulary or vocabulary-independent recognition, since the out-of-vocabulary words are assumed to be present in the lexicon used for a subsequent pass of recognition, from which the scaled-down lexicon has been derived.

For these reasons, we propose to use actual words, rather than subword or word-like units, as the basic units of recognition, and to optimize the vocabulary using an objective function based on phonetic similarity rather than language model perplexity. As a basic example, in [34], a vocabulary of names was reduced by 15% simply by removing all identical pronunciations. Doing so produced a set of homophone sets which could easily be used to recover from errors introduced by using the resulting reduced vocabulary in recognition. It is also suggested that similar equivalence classes could be constructed for morphological variants such as plurals and possessives, which involve minimal phonetic differences and are commonly confused in recognition.
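
A minimal sketch of this kind of reduction, assuming only a pronunciation dictionary mapping words to phone sequences: words sharing a pronunciation are collapsed to a single representative, and the resulting homophone classes are retained for error recovery in a later pass. The function and return layout are illustrative only.

```python
from collections import defaultdict

def homophone_classes(lexicon):
    """Group vocabulary words that share an identical pronunciation.

    lexicon: dict mapping each word to its phone sequence.
    Returns (reduced_vocabulary, classes), where reduced_vocabulary keeps
    one representative word per pronunciation and classes maps that
    representative back to the full homophone set for later recovery."""
    by_pron = defaultdict(list)
    for word, phones in lexicon.items():
        by_pron[tuple(phones)].append(word)
    reduced, classes = [], {}
    for phones, words in by_pron.items():
        rep = sorted(words)[0]   # arbitrary but deterministic choice
        reduced.append(rep)
        classes[rep] = words
    return reduced, classes
```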

We propose to study the effects on performance and accuracy of optimizing the vocabulary in this fashion as part of an anytime recognition system. We intend to investigate a variety of phonetic, morphological, and semantic features for building equivalence classes among vocabulary words. This work can be viewed as a logical extension of the hypothesis expansion work described in Section 4.2, where rather than building post facto models of likely word confusions in reduced vocabularies, we are using these models to construct these very vocabularies. The goal of this proposed research is to confirm the thesis that by jointly introducing and modeling uncertainty, we can more effectively recover from the errors resulting from this uncertainty.


Chapter 6

Contributions and Timeline

The main contributions outlined in this thesis proposal are as follows:

1. We will formulate a multiple-pass search strategy which presumes an impoverished or sparse first-pass lattice.

2. We will develop efficient error correction models for transforming or expanding these sparse lattices to constrain second-pass searches, minimizing the effect of search errors and out-of-vocabulary errors.

3. We will build on past work on hybrid language models to construct an efficient multi-resolution decoder, and incorporate this into a multi-pass system using acoustic fast match and shortlisting techniques.

4. We will develop effective clustering techniques to jointly produce error correction models and scale the first-pass vocabulary for optimal phonetic coverage.

5. We will build a parallel speech recognition system using all aforementioned techniques.

The initial phase of our work will concentrate on developing and improving the idea of hypothesis expansion, as described in Section 4.2. Specifically, we will investigate both data- and dictionary-driven techniques for training expansion models. We will carry out experiments using this approach in several domains including read speech, voicemail transcription [33], and multi-participant meeting speech using the SmartNotes corpus [3, 16]. Evaluation will use the Sphinx-III decoder, due to its greater flexibility. This task will be evaluated in terms of word error rate at a fixed real-time factor. The baseline will consist of a single-pass Sphinx-III system using the full vocabulary.

Having demonstrated the viability of this approach in a batch, sequential recognition system, the next phase of the thesis work will involve rearchitecting the PocketSphinx system to support the anytime and parallel use cases. Currently, this system is hard-wired for the three passes of recognition described in Section 2.6.


We propose to implement a general framework for multi-pass recognition which allows passes to run in separate threads and to be organized in a pipelined fashion. The effectiveness of this approach will be shown by a speed-accuracy tradeoff which scales with the number of CPU cores.
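
The intended organization can be illustrated with a small producer-consumer sketch; this is only a schematic of the pipelined arrangement (the actual framework is to be implemented inside PocketSphinx), with `first_pass` and `second_pass` standing in for the decoder passes.

```python
import queue
import threading

def run_pipeline(utterances, first_pass, second_pass, maxsize=4):
    """Run two recognition passes as a producer-consumer pipeline.

    first_pass(utt) -> lattice; second_pass(lattice) -> hypothesis.
    Both are placeholders for the real decoder passes.  The bounded
    queue provides backpressure so the first pass cannot run
    arbitrarily far ahead of the second."""
    lattices = queue.Queue(maxsize=maxsize)
    results = []

    def producer():
        for utt in utterances:
            lattices.put(first_pass(utt))
        lattices.put(None)  # sentinel: no more utterances

    def consumer():
        while True:
            lat = lattices.get()
            if lat is None:
                break
            results.append(second_pass(lat))

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```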

The preceding experiments will provide a baseline for the hybrid language model and vocabulary optimization work. For hybrid language models, we wish to show that success on the keyword spotting task can be combined with success in anytime recognition. With vocabulary optimization, our goal is to show an improvement in the speed-accuracy tradeoff achieved in the original results of multi-pass search with expansion, and improved scaling across multiple CPU cores in a parallel implementation.

An estimated timeline for the work proposed above follows.

• November 2008 – February 2009: Investigate error-correction models and hypothesis expansion algorithms. Evaluate these techniques on read and spontaneous speech corpora, using a variety of vocabulary sizes and step sizes between reduced and full vocabularies.

• February – April 2009: Rearchitect the PocketSphinx system to support pipelined multi-pass search and anytime recognition. Demonstrate multi-core scaling using the baseline multi-pass system.

• April – June 2009: Investigate clustering techniques for vocabulary optimization and their effect on hypothesis expansion. Integrate clustering and error-correction modeling, and evaluate the resulting system on the same corpora as above.

• June – September 2009: Implement scalable hybrid first-pass decoding for word-spotting and OOV detection. This will involve replicating previous results and integrating them into a common framework, as well as an efficient implementation based on PocketSphinx.

• September – December 2009: Develop techniques for integrating multi-resolution decoding results into heuristics for multi-pass search. Compare the results with purely word-based systems using vocabulary optimization. At this point, we expect to have experimental results comparing the effectiveness of all the techniques proposed. We also expect to have implementations with which we can evaluate the efficiency of these techniques and the viability of the original concept.

• December 2009 – April 2010: Further evaluation, as needed. Dissertation writing and defense preparation.


Bibliography

[1] Tasos Anastasakos, John McDonough, Richard Schwartz, and John Makhoul. A compact model for speaker-adaptive training. In Proceedings of ICSLP, 1996.

[2] L. R. Bahl, P. F. Brown, P. V. de Souza, and R. L. Mercer. Maximum Mutual Information Estimation of Hidden Markov Model Parameters for Speech Recognition. In Proceedings of ICASSP, 1986.

[3] S. Banerjee and A. I. Rudnicky. SmartNotes: Implicit labeling of meeting data through user note-taking and browsing. In Proceedings of NAACL-HLT, 2006.

[4] L. E. Baum and T. Petrie. Statistical inference for probabilistic functions of finite state Markov chains. Annals of Mathematical Statistics, 37:1554–1563, 1966.

[5] Issam Bazzi and James Glass. Heterogeneous lexical units for automatic speech recognition: Preliminary investigations. In Proceedings of ICASSP, 2000.

[6] Lin Chase. Error-Responsive Feedback Mechanisms for Speech Recognizers. PhD thesis, Carnegie Mellon University, 1997.

[7] Stanley F. Chen and Joshua Goodman. An empirical study of smoothing techniques for language modeling. Technical Report TR-10-98, Harvard University, 1998.

[8] A. de Cheveigné and H. Kawahara. YIN, a fundamental frequency estimator for speech and music. Journal of the Acoustical Society of America, 111(4):1917–1930, April 2002.

[9] Thomas Dean and Mark S. Boddy. An analysis of time-dependent planning. In AAAI, pages 49–54, 1988.

[10] Bart Decadt, Jacques Duchateau, Walter Daelemans, and Patrick Wambacq. Transcription of out-of-vocabulary words in large vocabulary speech recognition based on phoneme-to-grapheme conversion. In Proceedings of ICASSP, 2002.


[11] Lucian Galescu. Recognition of out-of-vocabulary words with sub-lexical language models. In Proceedings of Eurospeech, 2003.

[12] Petra Geutner, Michael Finke, and Alex Waibel. Phonetic-distance-based hypothesis driven lexical adaptation for transcribing multilingual broadcast news. In Proceedings of ICSLP, 1998.

[13] Reinhold Haeb-Umbach and Hermann Ney. Improvements in beam search for 10,000-word continuous-speech recognition. IEEE Transactions on Speech and Audio Processing, 2(2):353–356, 1994.

[14] Xuedong Huang, Fileno Alleva, Hsiao-Wuen Hon, Mei-Yuh Hwang, and Ronald Rosenfeld. The SPHINX-II speech recognition system: an overview. Computer Speech and Language, 7(2):137–148, 1993.

[15] David Huggins-Daines, Mohit Kumar, Arthur Chan, Alan W. Black, Mosur Ravishankar, and Alexander I. Rudnicky. PocketSphinx: A free, real-time continuous speech recognition system for hand-held devices. In Proceedings of ICASSP, 2006.

[16] David Huggins-Daines and Alexander I. Rudnicky. Implicitly supervised language model adaptation for meeting transcription. In Proceedings of HLT-NAACL, 2007.

[17] David Huggins-Daines and Alexander I. Rudnicky. Mixture pruning and roughening for scalable acoustic models. In Proceedings of Mobile NLP Workshop at ACL, 2008.

[18] S. Imai. Cepstral analysis synthesis on the mel frequency scale. In Proceedings of ICASSP, volume 8, pages 93–96, April 1983.

[19] T. Imai, A. Kobayashi, S. Sato, H. Tanaka, and A. Ando. Progressive 2-pass decoder for real-time broadcast news captioning. In Proceedings of ICASSP, 2000.

[20] S. Ishikawa, K. Yamabana, R. Isotani, and A. Okamura. Parallel LVCSR algorithm for cellphone-oriented multicore processors. In Proceedings of ICASSP, 2006.

[21] F. Itakura. Minimum prediction residual principle applied to speech recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 23(1):67–72, February 1975.

[22] Stephan Kanthak, Achim Sixtus, Sirko Molau, Ralf Schlüter, and Hermann Ney. Fast search for large vocabulary speech recognition. In ISCA ITRW Automatic Speech Recognition 2000, 2000.

[23] Slava M. Katz. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech and Signal Processing, 35(3):400–401, 1987.


[24] John Kominek. Personal communication, 2008.

[25] John Kominek and Alan Black. CMU ARCTIC databases for speech synthesis. Technical Report CMU-LTI-03-177, CMU Language Technologies Institute, 2003.

[26] Oh-Wook Kwon and Jun Park. Korean large vocabulary continuous speech recognition with morpheme-based recognition units. Speech Communication, 39(3-4):287–300, 2003.

[27] E. Levin, S. Narayanan, R. Pieraccini, K. Biatov, E. Bocchieri, G. Di Fabbrizio, W. Eckert, S. Lee, A. Pokrovsky, M. Rahim, P. Ruscitti, and M. Walker. The AT&T-DARPA Communicator mixed-initiative spoken dialog system. In Proceedings of ICSLP, 2000.

[28] E. Lin, K. Yu, R. Rutenbar, and T. Chen. Moving speech recognition from software to silicon: the In Silico Vox project. In Proceedings of Interspeech, 2006.

[29] Andrej Ljolje, Michael D. Riley, Donald M. Hindle, and Richard W. Sproat. The AT&T LVCSR-2000 system. In Proceedings of NIST Speech Transcription Workshop, 2000.

[30] T. Masuko, K. Tokuda, T. Kobayashi, and S. Imai. Speech synthesis from HMMs using dynamic features. In Proceedings of ICASSP, 1996.

[31] Hermann Ney and Xavier Aubert. A word graph algorithm for large vocabulary, continuous speech recognition. In Proceedings of ICSLP, 1994.

[32] S. Ortmanns, A. Eiden, H. Ney, and N. Coenen. Look-ahead techniques for fast beam search. In Proceedings of ICASSP, 1997.

[33] M. Padmanabhan, G. Saon, J. Huang, B. Kingsbury, and L. Mangu. Automatic speech recognition performance on a voicemail transcription task. Technical Report RC-22172, IBM, 2001.

[34] David Donald Palmer. Modeling Uncertainty for Information Extraction From Speech Data. PhD thesis, University of Washington, 2001.

[35] D. Paul and J. Baker. The design for the Wall Street Journal based CSR corpus. In Proceedings of the ACL workshop on Speech and Natural Language, 1992.

[36] Steven Phillips and Anne Rogers. Parallel speech recognition. International Journal of Parallel Programming, 27(4):257–288, 1999.

[37] J. Picone and G. Doddington. A phonetic vocoder. In Proceedings of ICASSP-89, pages 580–583, May 1989.

[38] E. Pusateri and J. M. Van Thong. N-best list generation using word and phoneme recognition fusion. In Proceedings of Eurospeech, 2001.


[39] Lawrence R. Rabiner. A tutorial on Hidden Markov Models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, February 1989.

[40] M. K. Ravishankar. Parallel implementation of fast beam search for speaker-independent continuous speech recognition. Technical report, Indian Institute of Science, Bangalore, 1993.

[41] M. K. Ravishankar. Efficient Algorithms for Speech Recognition. PhD thesis, Carnegie Mellon University, 1996.

[42] Eric K. Ringger and James F. Allen. Error correction via a post-processor for continuous speech recognition. In Proceedings of ICASSP, 1996.

[43] R. Schwartz, L. Nguyen, and J. Makhoul. Multiple-pass search. In Chin-Hui Lee, Frank K. Soong, and Kuldip K. Paliwal, editors, Automatic Speech and Speaker Recognition: Advanced Topics, chapter 18, pages 429–456. Springer, 1996.

[44] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proceedings of ICASSP, pages 1315–1318, Istanbul, Turkey, 2000.

[45] Sunil Vemuri, Philip DeCamp, Walter Bender, and Chris Schmandt. Improving speech playback using time-compression and speech recognition. In Proceedings of CHI, 2004.

[46] E. W. D. Whittaker, J. M. Van Thong, and P. J. Moreno. Vocabulary independent speech recognition using particles. In Proceedings of ASRU, 2001.

[47] Ali Yazgan and Murat Saraclar. Hybrid language models for out-of-vocabulary word detection in large vocabulary conversational speech recognition. In Proceedings of ICASSP, 2004.

[48] S. J. Young, N. H. Russell, and J. H. S. Thornton. Token passing: A simple conceptual model for connected speech recognition systems. Technical Report CUED/F-INFENG/TR38, Cambridge University, 1989.
