Speech Recognition 2

Speech Recognition

From:
Judith A. Markowitz, Using Speech Recognition, Prentice Hall, NJ, 1996.
Guojun Lu, Multimedia Database Management Systems, Chapter 5, Artech House, 1999.

Speech Recognition

Preprocessing:
Digitize and represent waveforms.
Feature extraction (10 ms frames).
Important feature: Mel-frequency cepstral coefficients (MFCCs), developed based on how humans hear sound (see the sketch below).

Recognition: identify what the user has said. Three approaches:
Template matching
Acoustic-phonetic recognition (e.g., FastTalk)
Stochastic processing
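
As a concrete companion to the preprocessing step, here is a minimal MFCC extraction sketch in Python. The librosa library, the file name, and the 16 kHz rate are illustrative assumptions; the slides name no tooling.

    # Minimal MFCC extraction sketch (librosa and the parameters below
    # are illustrative assumptions, not from the slides).
    import librosa

    y, sr = librosa.load("utterance.wav", sr=16000)  # hypothetical file
    # 10 ms frame hop at 16 kHz = 160 samples per hop.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=160)
    print(mfcc.shape)  # (13, n_frames): one coefficient vector per frame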

Terminology

Phoneme: the smallest unit of sound that is distinctive, i.e., that distinguishes one word from another in a given language.

Example: the words seat, meat, beat, and cheat are different words because the initial sound is a separate phoneme in English.

There are about 40-50 phonemes in English.

Example transcription: Abnormal = AE B N AO R M AX L

Terminology (Cont'd)

The simplest sound is a pure tone, which has a sine waveform. Pure tones are rare.

Most sounds, including speech phonemes, are complex waves: a dominant or primary frequency, called the fundamental frequency, overlaid with secondary frequencies.

The fundamental frequency of speech is the rate at which the vocal cords flap against each other when producing a voiced phoneme.

Examples of Complex Waves for a Phoneme

[Figure: example complex waveforms for a phoneme and for noise]

Terminology (Cont'd)

Multi-frequency sounds like the phonemes of speech can be represented as complex waves.

Formants: bands of secondary frequencies that distinguish one phoneme from another.

Bandwidth of a complex wave: the range of frequencies in the waveform.

Sounds that produce acyclic waves are often called noise.

Co-articulation

Co-articulation effects: inter-phoneme influences.

Neighboring phonemes, the position of a phoneme within a word, and the position of the word in the sentence all influence the way a phoneme is uttered.

Because of co-articulation effects, a specific utterance or instance of a phoneme is called a phone.

Template Matching

Each word or phrase is stored as a separate template.

Idea: select the template that best matches the spoken input (frame-by-frame comparison), provided the dissimilarity is within a predetermined threshold.

Template matching is performed at the word level.

Temporal alignment is used to ensure that fast and slow utterances of the same word are not identified as different words. Dynamic time warping (DTW) is used for temporal alignment.

Dynamic Time Warping
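
A minimal DTW sketch, assuming Euclidean distance between feature frames (the slides do not fix the local distance measure):

    # Classic DTW recursion over two feature sequences (e.g. MFCC
    # frames).  Euclidean frame distance is an illustrative assumption.
    import numpy as np

    def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
        """a, b: (n_frames, n_features) arrays for two utterances."""
        n, m = len(a), len(b)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = np.linalg.norm(a[i - 1] - b[j - 1])
                # Cheapest way to reach (i, j): match, insert, or delete.
                D[i, j] = cost + min(D[i - 1, j - 1],
                                     D[i - 1, j],
                                     D[i, j - 1])
        return float(D[n, m])

In template matching, the input would be compared against every stored template with dtw_distance, and the closest template accepted only if its distance falls under the rejection threshold.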

Robust Templates

In early systems, each template was built from a single example (token).

To handle variability, many templates of the same word are stored.

A robust template is created from more than one token of the same word, using mathematical averaging and statistical clustering techniques (see the sketch below).
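
A deliberately simplified averaging sketch: tokens are linearly resampled to a common length and averaged frame-wise. Real systems align tokens (e.g., with DTW) before averaging or clustering; the fixed length here is an illustrative assumption.

    # Build a crude robust template by resampling each token to a
    # common length and averaging frame-wise (a simplification of the
    # average-multiple-tokens idea the slides describe).
    import numpy as np

    def robust_template(tokens, length=100):
        resampled = []
        for t in tokens:  # t: (n_frames, n_features) feature array
            idx = np.linspace(0, len(t) - 1, length).round().astype(int)
            resampled.append(t[idx])
        return np.mean(resampled, axis=0)  # frame-wise average template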

Template Matching

Advantages:
Performs well with small vocabularies of phonetically distinct words.
Mid-size vocabularies in the range of 1,000-10,000 words are possible if the number of vocabulary choices at any one time is kept small.

Disadvantages:
Must have at least one template for each word in the application vocabulary.
Not good with large vocabularies containing words that sound similar (confusable words, e.g., to and two).

Acoustic-Phonetic Recognition

Store only representations of the phonemes of a language.

Three steps:

I. Feature extraction.

II. Segmentation and labeling: segmentation determines when one phoneme ends and another begins; labeling identifies the phonemes. The output is a set of phoneme hypotheses that can be represented by a phoneme lattice, a decision tree, etc.

III. Word-level recognition: search for words matching the phoneme hypotheses; the word best matching a sequence of hypotheses is identified (see the sketch below).
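
A toy version of step III: enumerate paths through a phoneme lattice and keep those that match a pronunciation dictionary. The lattice format, the dictionary entries, and the exhaustive enumeration are illustrative assumptions; real systems score hypotheses rather than enumerate them.

    # Toy word-level search over a phoneme lattice (step III).
    from itertools import product

    # Hypothetical pronunciation dictionary.
    lexicon = {
        ("S", "IY", "T"): "seat",
        ("M", "IY", "T"): "meat",
        ("B", "IY", "T"): "beat",
    }

    # One candidate list per segment, as produced by segmentation and
    # labeling (step II).
    lattice = [["M", "B"], ["IY"], ["T", "D"]]

    # Every path through the lattice that is a dictionary word.
    matches = [lexicon[p] for p in product(*lattice) if p in lexicon]
    print(matches)  # ['meat', 'beat']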

Stochastic Processing

Use a hidden Markov model (HMM) to store the model of each of the items that will be recognized. Items: phonemes or subwords.

Each state of the HMM holds statistics for a segment of the word. The statistics describe the parameter values and the variation found in samples of the word.

A recognition system may have numerous HMMs or may combine them into one network of states and transitions.

[Figure: 3-state HMM of a triphone, obtained from training]

Stochastic processing using HMMs is accurate and flexible.

Subword Units

Training whole-word models is not practical for large vocabularies, so subword units are used instead.

The most popular subword unit is the triphone. A triphone (phoneme in context, PIC) consists of the current phoneme and its left and right phonemes.

A triphone is generally represented by a 3-state HMM:
The first state represents the left phoneme.
The middle state represents the current phoneme.
The last state represents the following phoneme.

The number of triphones for English is much larger than the number of phonemes (see the sketch below).
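
A small sketch expanding a phoneme string into triphones, reusing the deck's transcription of "abnormal". The left-current+right notation and the "sil" boundary padding are common conventions, assumed here for illustration.

    # Expand a phoneme sequence into triphones (left-current+right).
    def to_triphones(phones):
        padded = ["sil"] + phones + ["sil"]  # assumed silence padding
        return [f"{padded[i-1]}-{padded[i]}+{padded[i+1]}"
                for i in range(1, len(padded) - 1)]

    print(to_triphones(["AE", "B", "N", "AO", "R", "M", "AX", "L"]))
    # ['sil-AE+B', 'AE-B+N', 'B-N+AO', 'N-AO+R', 'AO-R+M',
    #  'R-M+AX', 'M-AX+L', 'AX-L+sil']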

The recognition system compares the input with the stored models. Two comparison approaches:

Baum-Welch maximum-likelihood algorithm: computes probability scores between the input and the stored models and selects the best match.

Viterbi algorithm: looks for the best path (see the sketch below).
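
A minimal log-domain Viterbi sketch over a discrete-observation HMM; the matrix shapes and the toy setting are illustrative assumptions, not trained models.

    # Viterbi decoding: the most likely state path through an HMM.
    import numpy as np

    def viterbi(log_pi, log_A, log_B, obs):
        """log_pi: (S,) initial log-probs; log_A: (S, S) transition
        log-probs; log_B: (S, O) emission log-probs; obs: sequence of
        observation indices.  Returns (best path, its log-score)."""
        S, T = len(log_pi), len(obs)
        delta = np.empty((T, S))           # best score ending in each state
        psi = np.zeros((T, S), dtype=int)  # back-pointers
        delta[0] = log_pi + log_B[:, obs[0]]
        for t in range(1, T):
            scores = delta[t - 1][:, None] + log_A  # prev state -> cur state
            psi[t] = scores.argmax(axis=0)
            delta[t] = scores.max(axis=0) + log_B[:, obs[t]]
        # Trace the best path backwards through the back-pointers.
        path = [int(delta[-1].argmax())]
        for t in range(T - 1, 0, -1):
            path.append(int(psi[t, path[-1]]))
        return path[::-1], float(delta[-1].max())

Replacing max/argmax with a sum over previous states turns this recursion into the forward pass that yields the likelihood scores the slides describe for the first approach.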

Evaluation of a Speech Recognition System

Factors to consider:
Vocabulary size and flexibility
Required sentence and application structures
The end users
Type and amount of noise
Stress placed upon the person using the application

Basic Classes of Errors

Deletion: dropping a word.
Substitution: replacing a word with another word.
Insertion: adding a word.
Rejection: the input cannot be recognized by the program.

A high threshold means more rejections; a low threshold means more substitution or insertion errors.

Variability

Sources of variability:
Co-articulation
Inter-speaker differences
Intra-speaker inconsistencies

Robustness of a system: how the system performs under variability.

Corpus: a reference database for training; it includes machine-readable dictionaries, word lists, and published materials from specific professions.

Homophones: words with the same pronunciation but different spellings, e.g., one and won.

Active vocabulary: the set of words the application expects to be spoken at any one time.

Grammars (models, scripts) are used to structure words to reduce perplexity, increase speed and accuracy, and enhance vocabulary flexibility:

Finite state grammars
Probabilistic models
Linguistics-based grammars

Recognition searches through the active vocabulary to find the best match.

Branching factor: the number of items in the active vocabulary at a single point in the recognition process.

Perplexity is often used to refer to the average branching factor.

A high branching factor means high recognition time.

Example: a total vocabulary of 1,000 words.

Input: Take the toll road to Milwaukee

With no grammar, the branching factor at each point is 1,000, so the perplexity is 1,000.

With a finite state grammar:
Take the TYPE road to PLACE
TYPE = high OR toll OR back OR rocky OR long
PLACE = Milwaukee OR Kokomo OR nowhere OR Arby OR Rio

The branching factor at the first, second, fourth, and fifth words is one; at the third and sixth it is five (see the sketch below).
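
A quick check of the example's numbers, taking "average branching factor" as the geometric mean (the usual convention for perplexity; the slide itself just says "average"):

    # Perplexity of "Take the TYPE road to PLACE": fixed words have a
    # branching factor of 1, TYPE and PLACE have 5 choices each.
    import math

    branching = [1, 1, 5, 1, 1, 5]
    perplexity = math.exp(sum(math.log(b) for b in branching)
                          / len(branching))
    print(round(perplexity, 2))  # ~1.71, versus 1000 with no grammar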

Weaknesses of Finite-State Grammars

Users cannot deviate from the patterns.
They cannot rank the probability of occurrence to improve speed and accuracy.

Statistical Models

Often used in dictation systems.
Specify what is likely instead of what is allowed.

Two forms of statistical modeling:
N-gram models
N-class models

N-gram Model

Identify the current (unknown) word by assuming that its identity depends on the previous N-1 words and on the acoustic information of the unknown word.

Example: a trigram (N = 3) uses the two words prior to the unknown word.

Example: This is my printer [unknown word]. The unknown word would be identified using the two prior words, "my printer", and the acoustic information of the current word.

Good for large-vocabulary dictation applications (see the sketch below).
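
A counting-based trigram sketch; the toy corpus is invented for illustration, and a real recognizer would combine these probabilities with the acoustic score of the unknown word.

    # Estimate P(word | two previous words) by counting trigrams.
    from collections import Counter, defaultdict

    corpus = "this is my printer cable this is my printer stand".split()

    trigram_counts = defaultdict(Counter)
    for w1, w2, w3 in zip(corpus, corpus[1:], corpus[2:]):
        trigram_counts[(w1, w2)][w3] += 1

    context = ("my", "printer")  # the two words before the unknown word
    total = sum(trigram_counts[context].values())
    for word, count in trigram_counts[context].items():
        print(word, count / total)  # cable 0.5, stand 0.5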

N-class Model

Extends the concept of N-gram modeling to syntactic categories.

Bi-class modeling calculates the probability that two categories will appear in succession.

Bi-class example:
Article: a, an, the
Countable noun: table, book, shoe
The model stores the probability of [article][countable noun].

Works well with a corpus much smaller than N-gram modeling requires (see the sketch below).
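
A bi-class counting sketch over a toy part-of-speech-tagged corpus; the tags and sentences are illustrative assumptions.

    # Estimate P(next category | current category) from tagged text.
    from collections import Counter

    tagged = [("the", "ART"), ("book", "NOUN"), ("is", "VERB"),
              ("a", "ART"), ("table", "NOUN")]

    tags = [tag for _, tag in tagged]
    bigrams = Counter(zip(tags, tags[1:]))
    context_totals = Counter(tags[:-1])

    # Probability that an article is followed by a (countable) noun.
    p = bigrams[("ART", "NOUN")] / context_totals["ART"]
    print(p)  # 1.0 in this toy corpus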

Linguistics-Based Grammars

Aim to understand what a user has said, as well as to identify the spoken words.

Context-free grammars are often used (see the sketch below).
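
A tiny context-free grammar sketch using NLTK (an assumed toolkit; the slides name none), reusing the deck's route-description pattern:

    # Parse a sentence with a small context-free grammar.
    import nltk

    grammar = nltk.CFG.fromstring("""
        S -> 'take' 'the' TYPE 'road' 'to' PLACE
        TYPE -> 'high' | 'toll' | 'back' | 'rocky' | 'long'
        PLACE -> 'milwaukee' | 'kokomo' | 'nowhere' | 'arby' | 'rio'
    """)
    parser = nltk.ChartParser(grammar)
    for tree in parser.parse("take the toll road to milwaukee".split()):
        print(tree)  # parse tree: structure beyond the word identities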