8/6/2019 Speech Recognition 2
1/27
Speech Recognition
From
Judith A. Markowitz, Using Speech Recognition, Prentice Hall, NJ, 1996
Guojun Lu, Multimedia Database Management Systems, Chapter 5, Artech House, 1999
Speech Recognition
Preprocessing
Digitize and represent waveforms
Feature extraction (10 ms frames)
Important feature: Mel-frequency cepstral coefficients (MFCCs), developed based on how humans hear sound
Recognition: identify what the user has said
Three approaches
Template matching
Acoustic-phonetic recognition (e.g., FastTalk)
Stochastic processing
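The 10 ms framing step above can be sketched with plain NumPy (a minimal illustration; a real front end would also apply windowing, frame overlap, and the mel filterbank and DCT stages that produce the MFCCs):

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=10):
    """Split a waveform into consecutive frames of frame_ms milliseconds.

    Only the framing step is shown; windowing and overlap are omitted.
    """
    frame_len = int(sample_rate * frame_ms / 1000)   # samples per frame
    n_frames = len(signal) // frame_len              # drop the partial tail
    return signal[:n_frames * frame_len].reshape(n_frames, frame_len)

# 1 second of a 440 Hz tone sampled at 16 kHz -> 100 frames of 160 samples
sr = 16000
t = np.arange(sr) / sr
wave = np.sin(2 * np.pi * 440 * t)
frames = frame_signal(wave, sr)
print(frames.shape)  # (100, 160)
```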
Terminology
Phoneme: the smallest unit of sound that is unique (distinguishing one word from another in a given language)
Example: the words seat, meat, beat, cheat are different words since the initial sound is a separate phoneme in English.
About 40-50 phonemes in English
Abnormal: AE B N AO R M AX L
Terminology (Contd)
The simplest sound is a pure tone, which has a sine waveform. Pure tones are rare.
Most sounds, including speech phonemes, are complex waves, having a dominant or primary frequency called the fundamental frequency overlaid with secondary frequencies.
The fundamental frequency for speech is the rate at which the vocal cords flap against each other when producing a voiced phoneme.
Examples of Complex Waves for a Phoneme
[Figure: waveform panels labeled Phoneme and Noise]
Terminology (Contd)
Formants: bands of secondary frequencies that distinguish one phoneme from another
Multi-frequency sounds like the phonemes of speech can be represented as complex waves.
Bandwidth of a complex wave: the range of frequencies in the waveform.
Sounds that produce acyclic waves are often called noise.
Co-articulation
Co-articulation effects: inter-phoneme influences
Neighboring phonemes, the position of a phoneme within a word, and the position of the word in the sentence all influence the way a phoneme is uttered.
Because of co-articulation effects, a specific utterance or instance of a phoneme is called a phone.
Template Matching
Each word or phrase is stored as a separate template.
Idea: select the template that best matches the spoken input (frame-by-frame comparison) and whose dissimilarity is within a predetermined threshold.
Template matching is performed at the word level.
Temporal alignment is used to ensure that fast or slow utterances of the same word are not identified as different words.
Dynamic Time Warping is used for temporal alignment.
Dynamic Time Warping
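A minimal sketch of Dynamic Time Warping over one-dimensional feature sequences (real systems compare multi-dimensional MFCC frames, but the recurrence is the same):

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic Time Warping distance between two 1-D feature sequences.

    Aligns a fast and a slow utterance of the same word by allowing each
    frame of `a` to match one or more frames of `b`, and vice versa.
    """
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of: diagonal match, stretch a, stretch b
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m]

fast = [1.0, 3.0, 5.0]
slow = [1.0, 1.0, 3.0, 3.0, 5.0, 5.0]  # same shape, uttered slowly
print(dtw_distance(fast, slow))  # 0.0 -- warping absorbs the tempo change
```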
Robust Template
In early systems, one template was stored per example (token).
To handle variability, many templates of the same word are stored.
A robust template is created from more than one token of the same word using mathematical averages and statistical clustering techniques.
Template Matching
Advantage
Performs well with small vocabularies of phonetically distinct words.
Midsize vocabularies in the range of 1,000-10,000 words are possible if the number of vocabulary choices at any one time is kept minimal.
Disadvantage
Must have at least one template for each word in the application vocabulary.
Not good with large vocabularies containing words that have similar sounds (confusable words, e.g., to and two).
Acoustic-Phonetic Recognition
Store only representations of phonemes for a language
Three steps
I. Feature extraction
II. Segmentation and labeling:
Segmentation determines when one phoneme ends and another begins
Labeling identifies phonemes
The output is a set of phoneme hypotheses that can be represented by a phoneme lattice, a decision tree, etc.
III. Word-level recognition: search for words matching the phoneme hypotheses. The word best matching a sequence of hypotheses is identified.
Stochastic Processing
Use a Hidden Markov Model (HMM) to store the model of each of the items that will be recognized.
Items: phonemes or subwords.
Each state of the HMM has statistics for a segment of the word.
The statistics describe the parameter values and variation that were found in samples of the word.
A recognition system may have numerous HMMs or may combine them into one network of states and transitions.
[Figure: 3-state HMM of a triphone obtained from training]
Stochastic processing using HMMs is accurate and flexible.
Subword Units
Training whole-word models is impractical for large vocabularies, so subword units are considered.
The most popular subword unit is the triphone.
A triphone (phoneme in context, PIC) consists of the current phoneme and its left and right phonemes.
A triphone is generally represented by a 3-state HMM:
The first state represents the left phoneme
The middle state represents the current phoneme
The last state represents the following phoneme
The number of triphones for English is much larger than the number of phonemes.
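The expansion of a phoneme string into triphones can be sketched as follows; the left-center+right notation and the '#' boundary marker are conventions assumed here for illustration, not a fixed standard:

```python
def triphones(phonemes):
    """Expand a phoneme sequence into triphone (phoneme-in-context) units.

    Each unit is written left-center+right; word boundaries are marked
    with '#' (an assumed convention for this sketch).
    """
    padded = ["#"] + list(phonemes) + ["#"]
    return [f"{padded[i-1]}-{padded[i]}+{padded[i+1]}"
            for i in range(1, len(padded) - 1)]

# "Abnormal": AE B N AO R M AX L
print(triphones(["AE", "B", "N", "AO", "R", "M", "AX", "L"]))
# ['#-AE+B', 'AE-B+N', 'B-N+AO', 'N-AO+R', 'AO-R+M', 'R-M+AX', 'M-AX+L', 'AX-L+#']
```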
The recognition system compares the input with stored models.
Two comparison approaches
Baum-Welch maximum likelihood algorithm: computes probability scores between the input and the stored models and selects the best match.
Viterbi algorithm: looks for the best path.
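The Viterbi algorithm's best-path search can be sketched as follows; the two-state HMM and its probabilities are made-up numbers for illustration:

```python
import numpy as np

def viterbi(obs, start_p, trans_p, emit_p):
    """Most likely state path through an HMM for an observation sequence.

    obs     : list of observation indices
    start_p : initial state probabilities, shape (S,)
    trans_p : transition probabilities,    shape (S, S)
    emit_p  : emission probabilities,      shape (S, O)
    """
    S, T = len(start_p), len(obs)
    V = np.zeros((T, S))          # best path probability ending in each state
    back = np.zeros((T, S), int)  # backpointers for path recovery
    V[0] = start_p * emit_p[:, obs[0]]
    for t in range(1, T):
        for s in range(S):
            scores = V[t - 1] * trans_p[:, s]
            back[t, s] = np.argmax(scores)
            V[t, s] = scores[back[t, s]] * emit_p[s, obs[t]]
    path = [int(np.argmax(V[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# toy left-to-right HMM: state 0 emits symbol 0, state 1 emits symbol 1
start = np.array([1.0, 0.0])
trans = np.array([[0.6, 0.4], [0.0, 1.0]])
emit = np.array([[0.9, 0.1], [0.1, 0.9]])
path = viterbi([0, 0, 1], start, trans, emit)
print(path)  # [0, 0, 1]
```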
Evaluation of Speech Recognition Systems
Vocabulary size and flexibility
Required sentence and application structures
The end users
Type and amount of noise
Stress placed upon the person using the application
Basic classes of errors
Deletion: dropping of words
Substitution: replacing a word with another word
Insertion: adding a word
Rejection: the input cannot be recognized by the program
High threshold → more rejection errors
Low threshold → more substitution or insertion errors
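Deletion, substitution, and insertion errors are usually counted by aligning the recognizer's output against a reference transcript with a minimum edit distance; a sketch (the example hypothesis sentence is invented):

```python
def wer_counts(reference, hypothesis):
    """Classify recognition errors as substitutions, deletions, and
    insertions via minimum-edit-distance alignment of two word sequences.
    Returns (total_edits, subs, dels, ins)."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j]: best (edits, subs, dels, ins) turning ref[:i] into hyp[:j]
    d = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        d[i][0] = (i, 0, i, 0)        # i deletions
    for j in range(1, len(hyp) + 1):
        d[0][j] = (j, 0, 0, j)        # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                d[i][j] = d[i - 1][j - 1]
            else:
                s, dl, ins = d[i - 1][j - 1], d[i - 1][j], d[i][j - 1]
                d[i][j] = min(
                    (s[0] + 1, s[1] + 1, s[2], s[3]),         # substitute
                    (dl[0] + 1, dl[1], dl[2] + 1, dl[3]),     # delete
                    (ins[0] + 1, ins[1], ins[2], ins[3] + 1)) # insert
    return d[-1][-1]

# reference vs. a hypothetical recognizer output
print(wer_counts("take the toll road", "take a toll toll road"))
# (2, 1, 0, 1): one substitution (the -> a), one insertion (extra toll)
```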
Variability
Co-articulation
Inter-speaker differences
Intra-speaker inconsistencies
Robustness of a system: how the system performs under variability
Corpus: reference database for training; it includes machine-readable dictionaries, word lists, and published materials from specific professions.
Homophones: same pronunciation, different spelling, e.g., one and won
Active vocabulary: set of words the application expects to be spoken at any one time.
Grammars (models, scripts) are used to structure words to reduce perplexity, increase speed and accuracy, and enhance vocabulary flexibility.
Finite state grammars
Probabilistic models
Linguistics-based grammars
Search through the vocabularies to find the best match.
Branching factor: the number of items in the active vocabulary at a single point in a recognition process.
Perplexity is often used to refer to the average branching factor.
High branching factor → high recognition time
A total vocabulary of 1,000 words
Input: Take the toll road to Milwaukee
With no grammar, the branching factor at each point is 1,000. The perplexity is 1,000.
With a finite state grammar:
Take the TYPE road to PLACE
TYPE = high OR toll OR back OR rocky OR long
PLACE = Milwaukee OR Kokomo OR nowhere OR Arby OR Rio
The branching factor of the first, second, fourth, and fifth words is one; the branching factor of the third and sixth is five.
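The grammar above can be sketched as a sequence of word slots. A caveat: perplexity is formally a geometric mean of branching factors, while the slide's "average branching factor" reads as an arithmetic mean, so the value computed here is only illustrative:

```python
import math

# The slide's finite-state grammar: Take the TYPE road to PLACE
TYPE = {"high", "toll", "back", "rocky", "long"}
PLACE = {"Milwaukee", "Kokomo", "nowhere", "Arby", "Rio"}
slots = [{"Take"}, {"the"}, TYPE, {"road"}, {"to"}, PLACE]

def matches(sentence):
    """True if the sentence fits the slot grammar word-for-word."""
    words = sentence.split()
    return len(words) == len(slots) and all(
        w in slot for w, slot in zip(words, slots))

branching = [len(slot) for slot in slots]          # [1, 1, 5, 1, 1, 5]
perplexity = math.prod(branching) ** (1 / len(branching))  # geometric mean
print(matches("Take the toll road to Milwaukee"))  # True
print(branching)
```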
Weakness of Finite-State Grammars
Users cannot deviate from the patterns
Cannot rank the probability of occurrence to improve speed and accuracy
Statistical Models
Often used in dictation systems
Specify what is likely instead of what is allowed.
Two forms of statistical modeling
N-gram models
N-class models
N-gram Model
Identify the current (unknown) word by assuming that its identity depends upon the previous N-1 words and the acoustic information of the unknown word.
Example: trigram (N=3) uses the two words prior to the unknown word
Example: This is my printer [unknown word]
The unknown word would be identified using the two prior words, my printer, and the acoustic information of the current word.
Good for large-vocabulary dictation applications.
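A toy sketch of trigram counting and scoring (the three-sentence corpus is invented; a real recognizer would combine this language-model score with the acoustic score):

```python
from collections import defaultdict

class TrigramModel:
    """Toy trigram language model: scores a candidate word given the
    previous two words, using counts from a training corpus."""

    def __init__(self, sentences):
        self.counts = defaultdict(lambda: defaultdict(int))
        for sentence in sentences:
            words = sentence.split()
            for a, b, c in zip(words, words[1:], words[2:]):
                self.counts[(a, b)][c] += 1

    def prob(self, w1, w2, candidate):
        """P(candidate | w1 w2) from raw counts (no smoothing)."""
        following = self.counts[(w1, w2)]
        total = sum(following.values())
        return following[candidate] / total if total else 0.0

corpus = [
    "this is my printer driver",
    "this is my printer cable",
    "where is my printer driver",
]
lm = TrigramModel(corpus)
# an acoustically ambiguous word after "my printer" is more likely "driver"
print(lm.prob("my", "printer", "driver"))  # 2/3
print(lm.prob("my", "printer", "cable"))   # 1/3
```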
N-class Model
Extends the concept of N-gram modeling to syntactic categories.
Bi-class modeling calculates the probability that two categories will appear in succession.
Example of bi-class:
Article: a, an, the
Countable noun: table, book, shoe
The probability of article → countable noun
Works with a corpus much smaller than N-gram modeling requires.
Linguistics-Based Grammars
Aim to understand what a user has said as well as identify the spoken words.
Context-free grammars are often used.