Speech Recognition 2

Speech Recognition

From:
Judith A. Markowitz, Using Speech Recognition, Prentice Hall, NJ, 1996.
Guojun Lu, Multimedia Database Management Systems, Chapter 5, Artech House, 1999.

Speech Recognition

Preprocessing:
Digitize and represent waveforms.
Feature extraction (10 ms frames).
Important feature: Mel-frequency cepstral coefficients (MFCCs), developed based on how humans hear sound (see the sketch below).

Recognition: identify what the user has said. Three approaches:
Template matching
Acoustic-phonetic recognition (e.g., FastTalk)
Stochastic processing
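
As a concrete companion to the preprocessing step, here is a minimal MFCC extraction sketch in Python. The librosa library, the file name, and the 16 kHz rate are illustrative assumptions; the slides name no tooling.

    # Minimal MFCC extraction sketch (librosa and the parameters below
    # are illustrative assumptions, not from the slides).
    import librosa

    y, sr = librosa.load("utterance.wav", sr=16000)  # hypothetical file
    # 10 ms frame hop at 16 kHz = 160 samples per hop.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=160)
    print(mfcc.shape)  # (13, n_frames): one coefficient vector per frame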

Terminology

Phoneme: the smallest unit of sound that is distinctive, i.e., that distinguishes one word from another in a given language.

Example: the words seat, meat, beat, and cheat are different words because the initial sound is a separate phoneme in English.

There are about 40-50 phonemes in English.

Example transcription: Abnormal = AE B N AO R M AX L

Terminology (Cont'd)

The simplest sound is a pure tone, which has a sine waveform. Pure tones are rare.

Most sounds, including speech phonemes, are complex waves: a dominant or primary frequency, called the fundamental frequency, overlaid with secondary frequencies.

The fundamental frequency of speech is the rate at which the vocal cords flap against each other when producing a voiced phoneme.

Examples of Complex Waves for a Phoneme

[Figure: example complex waveforms for a phoneme and for noise]

Terminology (Cont'd)

Multi-frequency sounds like the phonemes of speech can be represented as complex waves.

Formants: bands of secondary frequencies that distinguish one phoneme from another.

Bandwidth of a complex wave: the range of frequencies in the waveform.

Sounds that produce acyclic waves are often called noise.

Co-articulation

Co-articulation effects: inter-phoneme influences.

Neighboring phonemes, the position of a phoneme within a word, and the position of the word in the sentence all influence the way a phoneme is uttered.

Because of co-articulation effects, a specific utterance or instance of a phoneme is called a phone.

Template Matching

Each word or phrase is stored as a separate template.

Idea: select the template that best matches the spoken input (frame-by-frame comparison), provided the dissimilarity is within a predetermined threshold.

Template matching is performed at the word level.

Temporal alignment is used to ensure that fast and slow utterances of the same word are not identified as different words. Dynamic time warping (DTW) is used for temporal alignment.

Dynamic Time Warping
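
A minimal DTW sketch, assuming Euclidean distance between feature frames (the slides do not fix the local distance measure):

    # Classic DTW recursion over two feature sequences (e.g. MFCC
    # frames).  Euclidean frame distance is an illustrative assumption.
    import numpy as np

    def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
        """a, b: (n_frames, n_features) arrays for two utterances."""
        n, m = len(a), len(b)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = np.linalg.norm(a[i - 1] - b[j - 1])
                # Cheapest way to reach (i, j): match, insert, or delete.
                D[i, j] = cost + min(D[i - 1, j - 1],
                                     D[i - 1, j],
                                     D[i, j - 1])
        return float(D[n, m])

In template matching, the input would be compared against every stored template with dtw_distance, and the closest template accepted only if its distance falls under the rejection threshold.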

Robust Templates

In early systems, each template was built from a single example (token).

To handle variability, many templates of the same word are stored.

A robust template is created from more than one token of the same word, using mathematical averaging and statistical clustering techniques (see the sketch below).
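
A deliberately simplified averaging sketch: tokens are linearly resampled to a common length and averaged frame-wise. Real systems align tokens (e.g., with DTW) before averaging or clustering; the fixed length here is an illustrative assumption.

    # Build a crude robust template by resampling each token to a
    # common length and averaging frame-wise (a simplification of the
    # average-multiple-tokens idea the slides describe).
    import numpy as np

    def robust_template(tokens, length=100):
        resampled = []
        for t in tokens:  # t: (n_frames, n_features) feature array
            idx = np.linspace(0, len(t) - 1, length).round().astype(int)
            resampled.append(t[idx])
        return np.mean(resampled, axis=0)  # frame-wise average template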

Template Matching

Advantages:
Performs well with small vocabularies of phonetically distinct words.
Mid-size vocabularies in the range of 1,000-10,000 words are possible if the number of vocabulary choices at any one time is kept small.

Disadvantages:
Must have at least one template for each word in the application vocabulary.
Not good with large vocabularies containing words that sound similar (confusable words, e.g., to and two).

Acoustic-Phonetic Recognition

Store only representations of the phonemes of a language.

Three steps:

I. Feature extraction.

II. Segmentation and labeling: segmentation determines when one phoneme ends and another begins; labeling identifies the phonemes. The output is a set of phoneme hypotheses that can be represented by a phoneme lattice, a decision tree, etc.

III. Word-level recognition: search for words matching the phoneme hypotheses; the word best matching a sequence of hypotheses is identified (see the sketch below).
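
A toy version of step III: enumerate paths through a phoneme lattice and keep those that match a pronunciation dictionary. The lattice format, the dictionary entries, and the exhaustive enumeration are illustrative assumptions; real systems score hypotheses rather than enumerate them.

    # Toy word-level search over a phoneme lattice (step III).
    from itertools import product

    # Hypothetical pronunciation dictionary.
    lexicon = {
        ("S", "IY", "T"): "seat",
        ("M", "IY", "T"): "meat",
        ("B", "IY", "T"): "beat",
    }

    # One candidate list per segment, as produced by segmentation and
    # labeling (step II).
    lattice = [["M", "B"], ["IY"], ["T", "D"]]

    # Every path through the lattice that is a dictionary word.
    matches = [lexicon[p] for p in product(*lattice) if p in lexicon]
    print(matches)  # ['meat', 'beat']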

Stochastic Processing

Use a hidden Markov model (HMM) to store the model of each of the items that will be recognized. Items: phonemes or subwords.

Each state of the HMM holds statistics for a segment of the word. The statistics describe the parameter values and the variation found in samples of the word.

A recognition system may have numerous HMMs or may combine them into one network of states and transitions.

[Figure: 3-state HMM of a triphone, obtained from training]

Stochastic processing using HMMs is accurate and flexible.

Subword Units

Training whole-word models is not practical for large vocabularies, so subword units are used instead.

The most popular subword unit is the triphone. A triphone (phoneme in context, PIC) consists of the current phoneme and its left and right phonemes.

A triphone is generally represented by a 3-state HMM:
The first state represents the left phoneme.
The middle state represents the current phoneme.
The last state represents the following phoneme.

The number of triphones for English is much larger than the number of phonemes (see the sketch below).
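
A small sketch expanding a phoneme string into triphones, reusing the deck's transcription of "abnormal". The left-current+right notation and the "sil" boundary padding are common conventions, assumed here for illustration.

    # Expand a phoneme sequence into triphones (left-current+right).
    def to_triphones(phones):
        padded = ["sil"] + phones + ["sil"]  # assumed silence padding
        return [f"{padded[i-1]}-{padded[i]}+{padded[i+1]}"
                for i in range(1, len(padded) - 1)]

    print(to_triphones(["AE", "B", "N", "AO", "R", "M", "AX", "L"]))
    # ['sil-AE+B', 'AE-B+N', 'B-N+AO', 'N-AO+R', 'AO-R+M',
    #  'R-M+AX', 'M-AX+L', 'AX-L+sil']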

The recognition system compares the input with the stored models. Two comparison approaches:

Baum-Welch maximum-likelihood algorithm: computes probability scores between the input and the stored models and selects the best match.

Viterbi algorithm: looks for the best path (see the sketch below).
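
A minimal log-domain Viterbi sketch over a discrete-observation HMM; the matrix shapes and the toy setting are illustrative assumptions, not trained models.

    # Viterbi decoding: the most likely state path through an HMM.
    import numpy as np

    def viterbi(log_pi, log_A, log_B, obs):
        """log_pi: (S,) initial log-probs; log_A: (S, S) transition
        log-probs; log_B: (S, O) emission log-probs; obs: sequence of
        observation indices.  Returns (best path, its log-score)."""
        S, T = len(log_pi), len(obs)
        delta = np.empty((T, S))           # best score ending in each state
        psi = np.zeros((T, S), dtype=int)  # back-pointers
        delta[0] = log_pi + log_B[:, obs[0]]
        for t in range(1, T):
            scores = delta[t - 1][:, None] + log_A  # prev state -> cur state
            psi[t] = scores.argmax(axis=0)
            delta[t] = scores.max(axis=0) + log_B[:, obs[t]]
        # Trace the best path backwards through the back-pointers.
        path = [int(delta[-1].argmax())]
        for t in range(T - 1, 0, -1):
            path.append(int(psi[t, path[-1]]))
        return path[::-1], float(delta[-1].max())

Replacing max/argmax with a sum over previous states turns this recursion into the forward pass that yields the likelihood scores the slides describe for the first approach.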

Evaluation of a Speech Recognition System

Factors to consider:
Vocabulary size and flexibility
Required sentence and application structures
The end users
Type and amount of noise
Stress placed upon the person using the application

Basic Classes of Errors

Deletion: dropping a word.
Substitution: replacing a word with another word.
Insertion: adding a word.
Rejection: the input cannot be recognized by the program.

A high threshold means more rejections; a low threshold means more substitution or insertion errors.

Variability

Sources of variability:
Co-articulation
Inter-speaker differences
Intra-speaker inconsistencies

Robustness of a system: how the system performs under variability.

Corpus: a reference database for training; it includes machine-readable dictionaries, word lists, and published materials from specific professions.

Homophones: words with the same pronunciation but different spellings, e.g., one and won.

Active vocabulary: the set of words the application expects to be spoken at any one time.

Grammars (models, scripts) are used to structure words to reduce perplexity, increase speed and accuracy, and enhance vocabulary flexibility:

Finite state grammars
Probabilistic models
Linguistics-based grammars

Recognition searches through the active vocabulary to find the best match.

Branching factor: the number of items in the active vocabulary at a single point in the recognition process.

Perplexity is often used to refer to the average branching factor.

A high branching factor means high recognition time.

Example: a total vocabulary of 1,000 words.

Input: Take the toll road to Milwaukee

With no grammar, the branching factor at each point is 1,000, so the perplexity is 1,000.

With a finite state grammar:
Take the TYPE road to PLACE
TYPE = high OR toll OR back OR rocky OR long
PLACE = Milwaukee OR Kokomo OR nowhere OR Arby OR Rio

The branching factor at the first, second, fourth, and fifth words is one; at the third and sixth it is five (see the sketch below).
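
A quick check of the example's numbers, taking "average branching factor" as the geometric mean (the usual convention for perplexity; the slide itself just says "average"):

    # Perplexity of "Take the TYPE road to PLACE": fixed words have a
    # branching factor of 1, TYPE and PLACE have 5 choices each.
    import math

    branching = [1, 1, 5, 1, 1, 5]
    perplexity = math.exp(sum(math.log(b) for b in branching)
                          / len(branching))
    print(round(perplexity, 2))  # ~1.71, versus 1000 with no grammar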

Weaknesses of Finite-State Grammars

Users cannot deviate from the patterns.
They cannot rank the probability of occurrence to improve speed and accuracy.

Statistical Models

Often used in dictation systems.
Specify what is likely instead of what is allowed.

Two forms of statistical modeling:
N-gram models
N-class models

N-gram Model

Identify the current (unknown) word by assuming that its identity depends on the previous N-1 words and on the acoustic information of the unknown word.

Example: a trigram (N = 3) uses the two words prior to the unknown word.

Example: This is my printer [unknown word]. The unknown word would be identified using the two prior words, "my printer", and the acoustic information of the current word.

Good for large-vocabulary dictation applications (see the sketch below).
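
A counting-based trigram sketch; the toy corpus is invented for illustration, and a real recognizer would combine these probabilities with the acoustic score of the unknown word.

    # Estimate P(word | two previous words) by counting trigrams.
    from collections import Counter, defaultdict

    corpus = "this is my printer cable this is my printer stand".split()

    trigram_counts = defaultdict(Counter)
    for w1, w2, w3 in zip(corpus, corpus[1:], corpus[2:]):
        trigram_counts[(w1, w2)][w3] += 1

    context = ("my", "printer")  # the two words before the unknown word
    total = sum(trigram_counts[context].values())
    for word, count in trigram_counts[context].items():
        print(word, count / total)  # cable 0.5, stand 0.5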

N-class Model

Extends the concept of N-gram modeling to syntactic categories.

Bi-class modeling calculates the probability that two categories will appear in succession.

Bi-class example:
Article: a, an, the
Countable noun: table, book, shoe
The model stores the probability of [article][countable noun].

Works well with a corpus much smaller than N-gram modeling requires (see the sketch below).
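
A bi-class counting sketch over a toy part-of-speech-tagged corpus; the tags and sentences are illustrative assumptions.

    # Estimate P(next category | current category) from tagged text.
    from collections import Counter

    tagged = [("the", "ART"), ("book", "NOUN"), ("is", "VERB"),
              ("a", "ART"), ("table", "NOUN")]

    tags = [tag for _, tag in tagged]
    bigrams = Counter(zip(tags, tags[1:]))
    context_totals = Counter(tags[:-1])

    # Probability that an article is followed by a (countable) noun.
    p = bigrams[("ART", "NOUN")] / context_totals["ART"]
    print(p)  # 1.0 in this toy corpus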

Linguistics-Based Grammars

Aim to understand what a user has said, as well as to identify the spoken words.

Context-free grammars are often used (see the sketch below).
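
A tiny context-free grammar sketch using NLTK (an assumed toolkit; the slides name none), reusing the deck's route-description pattern:

    # Parse a sentence with a small context-free grammar.
    import nltk

    grammar = nltk.CFG.fromstring("""
        S -> 'take' 'the' TYPE 'road' 'to' PLACE
        TYPE -> 'high' | 'toll' | 'back' | 'rocky' | 'long'
        PLACE -> 'milwaukee' | 'kokomo' | 'nowhere' | 'arby' | 'rio'
    """)
    parser = nltk.ChartParser(grammar)
    for tree in parser.parse("take the toll road to milwaukee".split()):
        print(tree)  # parse tree: structure beyond the word identities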