Mobile Speech Processing

David Huggins-Daines

Language Technologies Institute, Carnegie Mellon University

September 19, 2008





Outline

- Mobile Devices
  - What are they?
  - What would we like to do with them?
- Mobile Speech Applications
- Mobile Speech Technologies
- Current Research

Mobile Devices

- What is a “mobile device”?
  - A hammer is a device, and you can carry it around with you!
  - But no, that’s not what we mean here

Mobile Devices

- What is a “mobile device”?
  - A device that goes everywhere with you
  - ... which provides some or all of the functions of a computer
  - ... and some things a computer doesn’t, such as a cell phone or GPS

Speech on Mobile Devices

- Why do we care about speech processing on these devices?
  - Because they are the future of computers
  - Because speech is actually a useful way to interact with them, unlike full-sized computers
- What kind of speech processing do we care about?
  - Speech coding to improve voice quality for cellular and VoIP
  - Speech recognition for hands-free input to apps
  - Speech synthesis for eyes-free output from apps
- In some cases, speech is a natural and convenient modality
- In other cases, it is a necessity (e.g. in-car navigation)

Speech on Mobiles vs. Mobile Speech

- None of this necessarily implies doing actual speech processing (aside from coding) on the device itself
- Telephone dialog systems are “mobile” by any definition
  - Let’s Go: bus scheduling information
  - HealthLine: medical information for rural health workers
- But all synthesis and recognition is done on a server
  - This can be a good thing, especially in the latter case
  - You can’t run a speech recognizer on a Motofone or a Nokia 1010
- Speech processing on the device is useful for:
  - Multimodal applications
  - Disconnected applications
  - Access to local data

Some Mobile Speech Applications

- GPS navigation
  - Older systems used a small number of recorded prompts (“turn left”, “100 metres”, etc.)
  - More recently, TTS has been used to speak street names
  - Even more recently, ASR is used for input
- Voice dialing
  - Old systems used DTW and required training
  - Newer ones build models from your address book
  - Cactus for iPhone: uses CMU Flite and Sphinx
- Voice-driven search (local, web, etc.)
  - Nuance, Vlingo, TellMe, and Microsoft are all doing this
- Voice-to-text
  - Typically server-based; requires a data connection
  - “On-line”, ASR-based: Vlingo, Nuance
  - “Off-line”, human-assisted: SpinVox, Jott, ReQall
- Speech-to-speech translation

Mobile Speech Technologies

- Speech coding
  - Efficient digital representation of speech signals
  - Fundamental for 2G and 3G cell networks and VoIP
- Speech synthesis
  - Speech output for commands, directions
  - Text-to-speech for messages, books, other content
- Speech recognition
  - Command and control (“voice control”)
  - Dictation (speech-to-text for e-mail, SMS)
  - Search input (questions, keywords)
  - Dialogue

Speech Coding

- A fairly mature technology (started in the 1960s)
  - Early versions were mostly for military applications
  - Digital cell phone networks changed this dramatically
- Almost universally based on linear prediction and the source-filter model
  - Each sample is predicted as a weighted sum of the P previous samples
  - The weights are linear prediction coefficients (LPCs), calculated to minimize mean squared prediction error
  - Conveniently enough, this is actually a good model of the frequency response of the vocal tract (given enough LPCs)
  - An “excitation function” models the glottal source
- Everything else is just tweaking
  - Better excitation functions (CELP)
  - Variable bit rates (AMR)
  - Compression tricks (VAD + comfort noise)
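To make the linear-prediction idea concrete, here is a minimal pure-Python sketch of the autocorrelation method with the Levinson-Durbin recursion. The function names and the toy AR signal are illustrative, not taken from any particular codec; a real coder works frame by frame on windowed speech.

```python
def autocorrelation(x, order):
    """r[k] = sum_n x[n] * x[n-k], for lags k = 0..order."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(order + 1)]

def levinson_durbin(r, order):
    """Solve the LPC normal equations for A(z) = 1 + a1*z^-1 + ...,
    returning (coefficients, residual energy)."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for this prediction order.
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        a_prev = a[:]
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)
    return a, err

# Toy signal: impulse response of the all-pole filter
# x[n] = 0.9*x[n-1] - 0.2*x[n-2], excited by a unit impulse.
x = [1.0, 0.9]
for n in range(2, 100):
    x.append(0.9 * x[-1] - 0.2 * x[-2])

a, err = levinson_durbin(autocorrelation(x, 2), 2)
# For an all-pole impulse response the true coefficients satisfy the
# normal equations, so a recovers [1.0, -0.9, 0.2] (up to rounding).
```

Predicting each sample as `-(a[1]*x[n-1] + a[2]*x[n-2])` then leaves only the impulse as the residual, which is exactly the source-filter split the slide describes.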

Mobile Speech Synthesis

- Two traditional categories, one new one
  - Synthesis by rule, e.g. formant synthesis
  - Concatenative synthesis, e.g. diphone, unit selection
  - Statistical-parametric synthesis (“HMM synthesis”)
- We have had very efficient (often hardware-based) implementations of TTS for decades
  - They sound terrible (but are often quite intelligible)
- The challenges for mobile devices are:
  - Achieving natural-sounding speech
  - Dealing with very large, irregular vocabularies
  - Dealing with raw and diverse input text

Mobile Speech Synthesis

- Unit selection currently gives the most natural output
  - But it is very ill-suited to mobile implementations
  - The best systems use gigabytes of speech data
  - But, you say... I have an 8GB microSD card in my phone!
  - Search time: finding the right units of speech
  - Access time: loading them from the storage medium
- Signal generation can also be time-consuming if not efficiently implemented
- Some ways to improve efficiency:
  - Compress the speech database
  - Prune the speech database by discarding units that are infrequently or never used
  - Use approximate search algorithms (much like ASR)
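The usage-based pruning idea above can be sketched in a few lines. The unit IDs, the database layout, and the usage log are all made up for illustration; a real system would log which units the selection search actually chose over a large sample of synthesized text.

```python
from collections import Counter

def prune_units(unit_db, usage_log, min_count=1):
    """Keep only units selected at least `min_count` times
    while synthesizing a large, representative text sample."""
    counts = Counter(usage_log)
    return {uid: wav for uid, wav in unit_db.items()
            if counts[uid] >= min_count}

# Hypothetical database: unit id -> stored waveform data.
unit_db = {"ae_0012": b"...", "ae_0013": b"...", "t_0044": b"..."}
# Units actually chosen during a synthesis run of sample text.
usage_log = ["ae_0012", "t_0044", "ae_0012"]

pruned = prune_units(unit_db, usage_log)  # "ae_0013" is discarded
```

Raising `min_count` trades database size against coverage of rare contexts, which is exactly the size/quality tradeoff the slide is pointing at.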

Mobile Speech Synthesis

- Statistical-parametric synthesis is quite promising
  - Models are quite small (1-2 MB)
  - The search problem is nonexistent
  - Parameter and waveform generation are currently the most time-consuming parts
  - Requires higher-dimensional parameterizations than concatenative synthesis
  - Output parameters are smoothed using an iterative algorithm (similar to EM)
  - Waveform generation from MCEP is much slower than from LPC
- Dictionary compression and text normalization
  - The dictionary can be compressed by building letter-to-sound models and listing only the exceptions
  - Efficient finite-state transducer representations can be created for pronunciation and text-processing rules
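The exceptions-only dictionary scheme can be sketched as follows. The toy letter-to-sound rules and the phone strings here are illustrative only (a real system learns the rules, e.g. as decision trees or WFSTs, and uses a real phone set); the point is that only words the rules get wrong need explicit storage.

```python
# Hypothetical hand-written rules standing in for a trained
# letter-to-sound model.
def letter_to_sound(word):
    rules = {"ph": "F", "th": "TH", "sh": "SH"}
    phones, i = [], 0
    while i < len(word):
        digraph = word[i:i + 2]
        if digraph in rules:          # digraph rule fires
            phones.append(rules[digraph])
            i += 2
        else:                         # default: letter name
            phones.append(word[i].upper())
            i += 1
    return phones

# Only words the rules mispronounce are stored explicitly.
exceptions = {"colonel": ["K", "ER", "N", "AH", "L"]}

def pronounce(word):
    return exceptions.get(word, letter_to_sound(word))

pronounce("shop")     # handled by rule: ['SH', 'O', 'P']
pronounce("colonel")  # handled by the exception list
```

The compression win comes from the exception list being a small fraction of the full lexicon.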

Mobile Speech Recognition

- Challenges for mobile devices are:
  - Variable and noisy acoustic environments
  - Large vocabularies
  - Open-domain dictation input
- As with speech synthesis, simple ASR is not very resource-intensive, although it has not been as widely implemented
- Even with large vocabularies, ASR can be done efficiently
  - The most important factor is the complexity of the grammar
  - Commercial systems achieve impressive performance based on very constrained grammars
  - Systems tend to be extensively tuned for a given application

Mobile Speech Recognition: Acoustic Issues

- How do you talk to a device?
  - This depends on the application, user, and environment
  - Acoustic feature vectors can look very different
  - Microphones may not be optimized for all positions
- Noisy environments
  - Mobile devices are more likely to be used in noisy environments
  - Worse, they are more likely to be used in difficult ones: non-stationary noise, crosstalk, human babble
  - Array processing is not well suited to handheld devices
- On the bright side:
  - Usually a mobile device has only one user
  - Speaker adaptation can improve acoustic modeling
  - Speaker identification can be used to filter out babble and crosstalk

Mobile Speech Recognition: Computational Issues

- Acoustic feature extraction
  - Efficient, as long as it is implemented properly
  - Fixed-point arithmetic, data-parallel processing
- Most processing time is consumed, in roughly equal amounts, by:
  - Acoustic model evaluation
  - Search (hypothesis generation and evaluation)
- These can be made computationally efficient, but must also be made memory-efficient, search in particular
- This necessarily involves tuning heuristics, because a complete solution is intractable
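The fixed-point arithmetic mentioned above can be illustrated with the common Q15 format (1 sign bit, 15 fractional bits), sketched here in Python for clarity; on a real device this would be 16-bit integer C with hardware multiply-accumulate. The helper names are my own.

```python
Q = 15  # Q15: values in [-1, 1) scaled by 2^15

def to_q15(x):
    """Quantize a float in [-1, 1) to a 16-bit fixed-point integer."""
    return max(-32768, min(32767, int(round(x * (1 << Q)))))

def q15_mul(a, b):
    """Multiply two Q15 values; the 30-bit product is shifted
    back down to Q15, adding half an LSB first to round."""
    return (a * b + (1 << (Q - 1))) >> Q

def q15_dot(xs, ws):
    """Fixed-point dot product, e.g. one mel filter tap or a
    windowing step in feature extraction."""
    return sum(q15_mul(x, w) for x, w in zip(xs, ws))

x = [to_q15(v) for v in (0.5, -0.25)]
w = [to_q15(v) for v in (0.5, 0.5)]
y = q15_dot(x, w)  # fixed-point result for 0.5*0.5 + (-0.25)*0.5
```

The entire front end stays in integer registers this way, which is why feature extraction is cheap when "implemented properly."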

Mobile Speech Recognition: Acoustic Modeling

- Exact acoustic model evaluation is intractable:

  \[
  P(o \mid s_i, \lambda) = \sum_{k=1}^{K} w_{ik} \frac{1}{\sqrt{(2\pi)^D |\Sigma_{ik}|}} \exp\!\left( \sum_{d=1}^{D} \frac{-(o_d - \mu_{ikd})^2}{2\sigma_{ikd}^2} \right)
  \]

- Typical continuous-density acoustic model:
  - 5000 tied states, each with
  - 32 Gaussian densities, of
  - 39 dimensions
- Complete evaluation of all log-likelihoods for one 10ms frame:
  - 155,000 log-additions
  - 12,480,000 subtractions
  - 12,480,000 multiplications
- That’s 2500 million operations per second!
  - Your new MacBook Pro can do that, but just barely
  - (yes, its video card can do it easily)
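The totals above can be checked with a few lines of arithmetic. One accounting that reproduces them (an assumption on my part, not stated on the slide) charges each dimension of each density two subtractions (the difference and the accumulate) and two multiplications (the square and the precision scaling), plus 31 log-additions per state to combine its 32 mixture components:

```python
states, mixtures, dims = 5000, 32, 39
frames_per_second = 100  # one frame every 10 ms

# Per density, per dimension: (o - mu), square it, scale by the
# precision, and fold the term into the running score.
subs = states * mixtures * dims * 2    # difference + accumulate
mults = states * mixtures * dims * 2   # square + precision scaling
# Combining 32 component log-likelihoods per state: 31 log-adds.
logadds = states * (mixtures - 1)

per_frame = subs + mults + logadds     # ~25 million ops per frame
per_second = per_frame * frames_per_second
```

This lands at about 2.5 billion operations per second, matching the slide's "2500 million."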

Mobile Speech Recognition: Acoustic Modeling

- How do we make this fast enough?
  - Only evaluate densities for “active” phones in search
  - Predict which densities will score highly using a smaller, approximate model set, and only evaluate those
  - Use fewer densities and:
    - share them between all HMM states (semi-continuous HMM)
    - or between all the states for some phonetic class (phonetically-tied HMM)
  - Make density computation faster by quantizing acoustic features and parameters
  - Skip some frames in the input, either by:
    - blindly computing only multiples of N (usually 2 or 3), or
    - detecting “interesting” regions in the input and only computing densities there (landmark detection)
- Every ASR system in existence uses some combination of these
- However, too many approximations can make the system slower
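The "predict with a smaller model" trick can be sketched with toy 1-D Gaussians (the densities and the floor score are made up for illustration): rank all densities with a cheap approximate set, fully evaluate only the top few, and give the rest a constant floor.

```python
import math

def log_gauss(x, mean, var):
    """Log-density of a 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

# Hypothetical full model: (mean, variance) per density.
full = [(0.0, 1.0), (0.5, 0.5), (3.0, 1.0), (6.0, 2.0)]
# Cheaper approximate model: quantized means, shared variance.
approx = [(round(m), 1.0) for m, _ in full]

def shortlist_eval(x, top_n=2):
    """Rank densities with the approximate model, evaluate only
    the top-N with the full model, floor the rest."""
    ranked = sorted(range(len(full)),
                    key=lambda i: log_gauss(x, *approx[i]),
                    reverse=True)
    floor = -1e10
    scores = [floor] * len(full)
    for i in ranked[:top_n]:
        scores[i] = log_gauss(x, *full[i])
    return scores

scores = shortlist_eval(0.4)  # only two full evaluations performed
```

With 32 full densities per state and a shortlist of 4-8, most of the Mahalanobis arithmetic from the previous slide simply never happens.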

Mobile Speech Recognition: Search

- Search is not arithmetically intensive
  - It largely consists of adding up scores and comparing them to other scores
- However, it is very memory-intensive
  - The search module in an ASR system touches:
    - acoustic scores
    - language model scores
    - dictionary entries
    - Viterbi path scores and backpointers
    - backpointer table entries
  - In other words, pretty much every piece of memory except the acoustic model parameters
- Worse yet, there are sequential dependencies between all these memory accesses

Mobile Speech Recognition: Search

- Fundamentally, the speed of the recognizer is proportional to the number of different hypotheses it considers at once
- Optimizing search is entirely devoted to reducing this number without significantly affecting accuracy
- This includes:
  - Careful tuning of various thresholds (beams) for word transitions, phone transitions, etc.
  - Absolute pruning: hard limits on words per frame
  - Phonetic lookahead
  - Language model lookahead (factorization / weight pushing)
- Finite-state transducer systems can be very fast
  - Dictionary, grammar, and (part of) the acoustic model are composed into a single decoding network
  - Determinization allows exact language model search
  - Minimization merges common subpaths
  - Weight pushing is a more general kind of LM lookahead
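The two pruning styles above (a relative beam plus an absolute cap) can be sketched together; the word hypotheses and scores below are invented for illustration.

```python
def prune(hyps, beam=10.0, max_hyps=3):
    """Keep hypotheses within `beam` of the best log-score
    (beam pruning), capped at `max_hyps` (absolute pruning)."""
    best = max(score for _, score in hyps)
    survivors = [(w, s) for w, s in hyps if s >= best - beam]
    survivors.sort(key=lambda h: h[1], reverse=True)
    return survivors[:max_hyps]

# Hypothetical word hypotheses active in one frame, with
# log-probability scores (higher is better).
hyps = [("bus", -2.0), ("plus", -5.0), ("buzz", -11.0),
        ("was", -14.5), ("fuzz", -30.0)]

survivors = prune(hyps)  # "was" and "fuzz" fall outside the beam
```

Tightening `beam` or `max_hyps` shrinks the active hypothesis set, which per the slide is the single number that determines decoding speed.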

Common Problems for Mobile Speech Processing

- Moore’s Law works differently for mobile devices
  - Instead of getting faster, they get smaller and cheaper
  - Storage gets bigger; RAM doesn’t
  - Memory doesn’t get much faster
- Memory bandwidth is a major bottleneck
  - Making things smaller almost always makes them faster
  - Memory allocations can be very expensive (depending on the operating system)
- Audio input quality is often much lower
  - Typically 8kHz or 11kHz maximum sampling rate
  - Dubious microphones

Current and Future Research

- Incorporating user feedback in multimodal (speech + touch) applications
- Presenting information efficiently using speech synthesis
- Very low bitrate speech coding using ASR and TTS
- Distributed processing for mobile speech recognition
- Acoustic robustness for handheld mobile devices
- Voice and multimodal user interface design