
Methods Pattern Recognition - LIACS



LML Speech Recognition 2009

Speech Recognition: Signal Processing and Analysis

E.M. Bakker

Features for Speech Recognition and Audio Indexing

Parametric Representations
– Short Time Energy
– Zero Crossing Rates
– Level Crossing Rates
– Short Time Spectral Envelope

Spectral Analysis
– Filter Design
– Filter Bank Spectral Analysis Model
– Linear Predictive Coding (LPC)

Methods

Vector Quantization
– Finite code book of spectral shapes
– The code book codes for 'typical' spectral shapes
– A method applicable to all spectral representations (e.g. Filter Banks, LPC, ZCR, etc.)

Ensemble Interval Histogram (EIH) Model
– Auditory-Based Spectral Analysis Model
– More robust to noise and reverberation
– Expected to be an inherently better representation of the relevant spectral information because it models the mechanics of the human cochlea

Pattern Recognition

[Block diagram: Speech/Audio input → Parameter Measurements → Test/Query Pattern → Pattern Comparison (against Reference Patterns) → Decision Rules → Recognized Speech, Audio, …]


Pattern Recognition

[Block diagram: Speech/Audio input → Feature Detectors 1 … n → Feature Combiner and Decision Logic → Hypothesis Tester (using Reference Vocabulary Features) → Recognized Speech, Audio, …]

Spectral Analysis Models

Pattern Recognition Approach
1. Parameter Measurement => Pattern
2. Pattern Comparison
3. Decision Making

Parameter Measurements
– Bank of Filters Model
– Linear Predictive Coding Model

Band Pass Filter

[Diagram: Audio Signal s(n) → Bandpass Filter F() → Resulting Audio Signal F(s(n))]

Note that the bandpass filter can be defined as:

• a convolution with a filter response function in the time domain,

• a multiplication with a filter response function in the frequency domain
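
The equivalence of these two definitions can be checked numerically. A minimal NumPy sketch (not from the slides; the signal and the filter kernel h are arbitrary examples):

import numpy as np

rng = np.random.default_rng(0)
s = rng.standard_normal(256)          # example audio frame
h = np.hanning(32)                    # example (low-pass) filter impulse response

# Time domain: linear convolution of the signal with the filter response
y_time = np.convolve(s, h)

# Frequency domain: multiply the spectra (zero-padded to the full convolution length)
n = len(s) + len(h) - 1
y_freq = np.fft.irfft(np.fft.rfft(s, n) * np.fft.rfft(h, n), n)

print(np.allclose(y_time, y_freq))    # True: the two definitions agree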

Bank of Filters Analysis Model

[Figure: bank-of-filters analysis model]


Bank of Filters Analysis Model

Speech Signal: s(n), n = 0, 1, …
– Digital, with Fs the sampling frequency of s(n)

Bank of q Band Pass Filters: BPF_1, …, BPF_q
– Spanning a frequency range of, e.g., 100–3000 Hz or 100 Hz–16 kHz
– BPF_i(s(n)) = x_n(e^{jω_i}), where ω_i = 2πf_i/Fs is the normalized centre frequency f_i, for i = 1, …, q
– x_n(e^{jω_i}) is the short-time spectral representation of s(n) at time n, as seen through BPF_i with centre frequency ω_i, for i = 1, …, q

Note: each BPF independently processes s to produce the spectral representation x.
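
As a rough illustrative sketch (not part of the original slides), a small bank of Butterworth band-pass filters can be applied independently to a signal with SciPy; the band edges, filter order and test signal below are arbitrary example choices:

import numpy as np
from scipy.signal import butter, sosfiltfilt

def filterbank_energies(s, fs, bands, order=4):
    """Apply one band-pass filter per (low, high) band and return
    the energy of each filter output."""
    energies = []
    for low, high in bands:
        sos = butter(order, [low, high], btype="bandpass", fs=fs, output="sos")
        x = sosfiltfilt(sos, s)          # the signal as seen through BPF_i
        energies.append(np.sum(x ** 2))  # energy in this band
    return np.array(energies)

fs = 8000
t = np.arange(fs) / fs
s = np.sin(2 * np.pi * 300 * t) + 0.5 * np.sin(2 * np.pi * 2000 * t)
bands = [(100, 700), (700, 1500), (1500, 3000)]   # q = 3 example bands
print(filterbank_energies(s, fs, bands))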

Bank of Filters Front-End Processor

[Figure: bank-of-filters front-end processor]

Typical Speech Waveforms

[Figure: typical speech waveforms]

MFCCs

[Pipeline: Speech/Audio → Preemphasis → Windowing → Fast Fourier Transform → Mel-Scale Filter Bank → Log() → Discrete Cosine Transform → MFCCs (first 12 most significant coefficients)]

MFCCs are calculated using the formula:

C_i = Σ_{k=1}^{N} X_k · cos( i (k − 0.5) π / N ),  i = 1, …, P

where
• C_i is the i-th cepstral coefficient
• P is the order (12 in our case)
• K is the number of discrete Fourier transform magnitude coefficients
• X_k is the k-th log-energy output from the Mel-scale filter bank
• N is the number of filters
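
A minimal NumPy sketch of this pipeline for a single frame (illustrative only; the frame length, number of filters and filter-bank construction are example choices, not taken from the slides):

import numpy as np

def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, fs, n_filters=26, n_ceps=12):
    """MFCCs of one pre-emphasized speech frame, following the DCT formula above."""
    frame = frame * np.hamming(len(frame))                 # windowing
    mag = np.abs(np.fft.rfft(frame))                       # DFT magnitude coefficients
    # Triangular Mel-scale filter bank between 0 Hz and fs/2
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bin_edges = np.round(mel_to_hz(mel_edges) / fs * len(frame)).astype(int)
    X = np.zeros(n_filters)                                # log filter-bank energies X_k
    for j in range(n_filters):
        lo, mid, hi = bin_edges[j], bin_edges[j + 1], bin_edges[j + 2]
        left = (np.arange(lo, mid) - lo) / max(mid - lo, 1)
        right = (hi - np.arange(mid, hi)) / max(hi - mid, 1)
        weights = np.concatenate([left, right])
        X[j] = np.log(np.dot(weights, mag[lo:hi]) + 1e-10)
    # C_i = sum_k X_k * cos(i * (k - 0.5) * pi / N), i = 1..P
    k = np.arange(1, n_filters + 1)
    return np.array([np.sum(X * np.cos(i * (k - 0.5) * np.pi / n_filters))
                     for i in range(1, n_ceps + 1)])

fs = 16000
t = np.arange(400) / fs                                    # 25 ms frame
frame = np.sin(2 * np.pi * 440 * t)
pre = np.append(frame[0], frame[1:] - 0.97 * frame[:-1])   # preemphasis
print(mfcc_frame(pre, fs))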


Linear Predictive Coding Model

[Figure: linear predictive coding model]

Filter Response Functions

[Figure: filter response functions]

Some Examples of Ideal Band Filters

[Figure: ideal band filter examples]

Perceptually Based Critical Band Scale

[Figure: perceptually based critical band scale]


Short-Time Fourier Transform

X_n(e^{jω}) = Σ_m s(m) · w(n − m) · e^{−jωm}

• s(m): the signal
• w(n − m): a fixed low-pass (analysis) window
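
A compact NumPy sketch of a short-time Fourier analysis (illustrative only; frame length, hop size and the Hamming window are example choices):

import numpy as np

def stft(s, frame_len=256, hop=128):
    """Short-time Fourier transform: window each frame with a Hamming
    window w and take the DFT of s(m) * w(n - m)."""
    w = np.hamming(frame_len)
    frames = [s[start:start + frame_len] * w
              for start in range(0, len(s) - frame_len + 1, hop)]
    return np.array([np.fft.rfft(f) for f in frames])   # one spectrum per frame position n

fs = 8000
t = np.arange(fs) / fs
s = np.sin(2 * np.pi * 500 * t)
X = stft(s)
print(X.shape)   # (number of frames, frame_len // 2 + 1)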

Short-Time Fourier Transform
Long Hamming Window: 500 samples (= 50 msec)

[Figure: STFT of voiced speech]

Short-Time Fourier Transform
Short Hamming Window: 50 samples (= 5 msec)

[Figure: STFT of voiced speech]

Short-Time Fourier Transform
Long Hamming Window: 500 samples (= 50 msec)

[Figure: STFT of unvoiced speech]


Short-Time Fourier Transform
Short Hamming Window: 50 samples (= 5 msec)

[Figure: STFT of unvoiced speech]

Short-Time Fourier Transform: Linear Filter Interpretation

[Figure: linear filtering interpretation of the STFT]

Linear Predictive Coding (LPC) Model

Speech Signal: s(n), n = 0, 1, …
– Digital, with Fs the sampling frequency of s(n)

Spectral analysis on blocks of speech with an all-pole modeling constraint, LPC of analysis order p:
– s(n) is blocked into frames [n, m]
– Again consider x_n(e^{jω}), the short-time spectral representation of s(n) at time n (where ω = 2πf/Fs is the normalized frequency f)
– The spectral representation x_n(e^{jω}) is now constrained to be of the form σ/A(e^{jω}), where A(e^{jω}) is a p-th order polynomial with z-transform A(z) = 1 + a_1 z^{-1} + a_2 z^{-2} + … + a_p z^{-p}
– The output of the LPC parametric conversion on block [n, m] is the vector [a_1, …, a_p]
– It specifies parametrically the spectrum of an all-pole model that best matches the signal spectrum over the period of time in which the frame of speech samples was accumulated (a p-th order polynomial approximation of the signal)
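
A small sketch of one common way to estimate [a_1, …, a_p] for a frame, using the autocorrelation method with the Levinson-Durbin recursion (an illustrative implementation, not necessarily the exact procedure from the lecture):

import numpy as np

def lpc(frame, p):
    """Autocorrelation-method LPC of order p; returns (a, err) with
    A(z) = 1 + a[1] z^-1 + ... + a[p] z^-p and residual energy err."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + p]
    a = np.zeros(p + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, p + 1):                     # Levinson-Durbin recursion
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                            # reflection coefficient
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)
    return a, err

fs = 8000
n = np.arange(240)
frame = np.sin(2 * np.pi * 700 * n / fs) * np.hamming(240)
a, err = lpc(frame, p=10)
print(a[1:])        # the LPC vector [a_1, ..., a_p]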

Vector Quantization

Data is represented as feature vectors.
A VQ training set is used to determine a set of code words that constitute a code book.
Code words are centroids under a similarity or distance measure d.
The code words together with d divide the space into Voronoi regions.
A query vector falls into a Voronoi region and is represented by the respective code word.


Vector Quantization

Distance measures d(x, y):
– Euclidean distance
– Taxicab (Manhattan) distance
– Hamming distance
– etc.

Vector Quantization: Clustering the Training Vectors

Initialize: choose M arbitrary vectors out of the L vectors of the training set. This is the initial code book.
Nearest-neighbor search: for each training vector, find the code word in the current code book that is closest and assign that vector to the corresponding cell.
Centroid update: update the code word in each cell using the centroid of the training vectors that are assigned to that cell.
Iteration: repeat steps 2–3 until the average distance falls below a preset threshold.
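
A compact sketch of this clustering loop and of quantizing a new vector (illustrative; assumes plain Euclidean distance and a random initial code book):

import numpy as np

def train_codebook(train, M, n_iter=50, seed=0):
    """K-means style VQ training: returns an (M, dim) code book."""
    rng = np.random.default_rng(seed)
    codebook = train[rng.choice(len(train), M, replace=False)]   # step 1: initialize
    for _ in range(n_iter):
        # step 2: nearest-neighbor search (Euclidean distance)
        d = np.linalg.norm(train[:, None, :] - codebook[None, :, :], axis=-1)
        cell = d.argmin(axis=1)
        # step 3: centroid update
        for m in range(M):
            members = train[cell == m]
            if len(members):
                codebook[m] = members.mean(axis=0)
    return codebook

def quantize(v, codebook):
    """Return the index m* = argmin_i d(v, y_i) of the best code word."""
    return int(np.linalg.norm(codebook - v, axis=1).argmin())

rng = np.random.default_rng(1)
train = rng.standard_normal((500, 12))        # e.g. 500 feature vectors (12-dim)
cb = train_codebook(train, M=8)
print(quantize(train[0], cb))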

Vector Classification

For an M-vector code book CB with codes CB = { y_i | 1 ≤ i ≤ M }, the index m* of the best codebook entry for a given vector v is:

m* = argmin_{1 ≤ i ≤ M} d(v, y_i)

VQ for Classification

A code book CB_k = { y^k_i | 1 ≤ i ≤ M } can be used to define a class C_k.

Example: Audio Classification
– Classes 'crowd', 'car', 'silence', 'scream', 'explosion', etc.
– Classify by using a VQ code book CB_k for each of the classes.
– VQ is very often used as a baseline method for classification problems.
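
One possible way to use such per-class code books is to pick the class whose code book gives the lowest average quantization distortion over a clip's feature vectors. A toy sketch (illustrative; the class names and random "code books" are placeholders, in practice each CB_k would be trained on data from class C_k):

import numpy as np

def class_distortion(vectors, codebook):
    """Average distance from each vector to its nearest code word."""
    d = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=-1)
    return d.min(axis=1).mean()

def classify(vectors, codebooks):
    """Assign the class C_k whose code book CB_k explains the vectors best."""
    return min(codebooks, key=lambda k: class_distortion(vectors, codebooks[k]))

rng = np.random.default_rng(0)
codebooks = {"crowd": rng.normal(0.0, 1.0, (8, 12)),
             "silence": rng.normal(5.0, 1.0, (8, 12))}
clip = rng.normal(5.0, 1.0, (30, 12))           # feature vectors of a test clip
print(classify(clip, codebooks))                # -> "silence"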


Sound, DNA: Sequences!

DNA: a helix-shaped molecule whose constituents are two parallel strands of nucleotides.
DNA is usually represented by sequences of these four nucleotides.
This assumes only one strand is considered; the second strand is always derivable from the first by pairing A's with T's and C's with G's and vice versa.

Nucleotides (bases)
– Adenine (A)
– Cytosine (C)
– Guanine (G)
– Thymine (T)

Biological Information: From Genes to Proteins

[Diagram: Gene (DNA) → transcription → RNA → translation → Protein → protein folding; spanning genomics, molecular biology, structural biology and biophysics]

From Amino Acids to Protein Functions

DNA / amino acid sequence → 3D structure → protein functions

DNA (gene) → pre-RNA → RNA → Protein
(by RNA-polymerase, Spliceosome and Ribosome, respectively)

Example DNA sequence:
CGCCAGCTGGACGGGCACACCATGAGGCTGCTGACCCTCCTGGGCCTTCTG…

Example amino acid sequence:
TDQAAFDTNIVTLTRFVMEQGRKARGTGEMTQLLNSLCTAVKAISTAVRKAGIAHLYGIAGSTNVTGDQVKKLDVLSNDLVINVLKSSFATCVLVTEEDKNAIIVEPEKRGKYVVCFDPLDGSSNIDCLVSIGTIFGIYRKNSTDEPSEKDALQPGRNLVAAGYALYGSATML

Motivation for Markov Models

There are many cases in which we would like to represent the statistical regularities of some class of sequences:
– genes
– proteins in a given family
– sequences of audio features

Markov models are well suited to this type of task.


A Markov Chain Model

[Figure: state-transition diagram over the states a, c, g, t]

Transition probabilities
– Pr(x_i = a | x_{i-1} = g) = 0.16
– Pr(x_i = c | x_{i-1} = g) = 0.34
– Pr(x_i = g | x_{i-1} = g) = 0.38
– Pr(x_i = t | x_{i-1} = g) = 0.12

Σ_{x_i} Pr(x_i | x_{i-1} = g) = 1

Definition of Markov Chain Model

A Markov chain [1] model is defined by
– a set of states
  • some states emit symbols
  • other states (e.g., the begin state) are silent
– a set of transitions with associated probabilities
  • the transitions emanating from a given state define a distribution over the possible next states

[1] Markov, A. A., "Extension of the law of large numbers to quantities depending on each other", Bulletin of the Physico-Mathematical Society of Kazan University, 2nd series, Vol. 15 (1906), pp. 135–156.

Markov Chain Models: Properties

Given some sequence x of length L, we can ask how probable the sequence is given our model.
For any probabilistic model of sequences, we can write this probability as

Pr(x) = Pr(x_L, x_{L-1}, …, x_1)
      = Pr(x_L | x_{L-1}, …, x_1) · Pr(x_{L-1} | x_{L-2}, …, x_1) · … · Pr(x_1)

Key property of a (1st-order) Markov chain: the probability of each x_i depends only on the value of x_{i-1}, so

Pr(x) = Pr(x_L | x_{L-1}) · Pr(x_{L-1} | x_{L-2}) · … · Pr(x_2 | x_1) · Pr(x_1)
      = Pr(x_1) · Π_{i=2}^{L} Pr(x_i | x_{i-1})

The Probability of a Sequence for a Markov Chain Model

Pr(cggt) = Pr(c) · Pr(g|c) · Pr(g|g) · Pr(t|g)
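
A tiny sketch of this computation for a first-order chain over {a, c, g, t} (the initial and transition values below are arbitrary placeholders, not those from the slides):

states = "acgt"
# init[s]: Pr(x_1 = s);  trans[s][t]: Pr(x_i = t | x_{i-1} = s)  (example values)
init = {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25}
trans = {s: {"a": 0.16, "c": 0.34, "g": 0.38, "t": 0.12} for s in states}

def sequence_prob(x, init, trans):
    """Pr(x) = Pr(x_1) * prod_i Pr(x_i | x_{i-1}) for a 1st-order Markov chain."""
    p = init[x[0]]
    for prev, cur in zip(x, x[1:]):
        p *= trans[prev][cur]
    return p

print(sequence_prob("cggt", init, trans))   # Pr(c) Pr(g|c) Pr(g|g) Pr(t|g)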


Example Application: CpG Islands

CG di-nucleotides are rarer in eukaryotic genomes than expected given the marginal probabilities of C and G,
but the regions upstream of genes are richer in CG di-nucleotides than elsewhere – CpG islands.
CpG islands are useful evidence for finding genes.

Application: predict CpG islands with Markov chains
– one Markov chain to represent CpG islands
– another Markov chain to represent the rest of the genome

Markov Chains for Discrimination

Suppose we want to distinguish CpG islands from other sequence regions.
Given sequences from CpG islands, and sequences from other regions, we can construct
– a model to represent CpG islands
– a null model to represent the other regions

We can then score a test sequence by:

score(x) = log( Pr(x | CpG model) / Pr(x | null model) )
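
A small sketch of this log-odds scoring (illustrative; the two chain models below use made-up parameters, whereas in practice they would be estimated from CpG-island and background training sequences):

import math

def chain_log_prob(x, init, trans):
    """log Pr(x) under a first-order Markov chain (init, trans)."""
    lp = math.log(init[x[0]])
    for prev, cur in zip(x, x[1:]):
        lp += math.log(trans[prev][cur])
    return lp

def log_odds_score(x, cpg_model, null_model):
    """score(x) = log Pr(x | CpG model) - log Pr(x | null model)."""
    return chain_log_prob(x, *cpg_model) - chain_log_prob(x, *null_model)

bases = "acgt"
uniform = ({b: 0.25 for b in bases},
           {b: {c: 0.25 for c in bases} for b in bases})
cg_rich = ({b: 0.25 for b in bases},
           {b: {"a": 0.15, "c": 0.35, "g": 0.35, "t": 0.15} for b in bases})
print(log_odds_score("cgcgcg", cg_rich, uniform))   # > 0: looks like a CpG island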

Markov Chains for Discrimination

Why can we use

score(x) = log( Pr(x | CpG model) / Pr(x | null model) )  ?

According to Bayes' rule:

Pr(CpG | x) = Pr(x | CpG) · Pr(CpG) / Pr(x)
Pr(null | x) = Pr(x | null) · Pr(null) / Pr(x)

If we are not taking into account the prior probabilities (Pr(CpG) and Pr(null)) of the two classes, then from Bayes' rule it is clear that we just need to compare Pr(x | CpG) and Pr(x | null), as is done in our scoring function score().

Higher Order Markov Chains

The Markov property specifies that the probability of a state depends only on the probability of the previous state.
But we can build more "memory" into our states by using a higher-order Markov model.

In an n-th order Markov model the probability of the current state depends on the previous n states:

Pr(x_i | x_{i-1}, x_{i-2}, …, x_1) = Pr(x_i | x_{i-1}, …, x_{i-n})


Selecting the Order of a Markov Chain Model

But the number of parameters we need to estimate grows exponentially with the order
– for modeling DNA we need O(4^{n+1}) parameters for an n-th order model

The higher the order, the less reliable we can expect our parameter estimates to be
– estimating the parameters of a 2nd-order Markov chain from the complete genome of E. coli (5.44 × 10^6 bases), we'd see each word ~85,000 times on average (divide by 4^3)
– estimating the parameters of a 9th-order chain, we'd see each word ~5 times on average (divide by 4^10 ≈ 10^6)

Higher Order Markov Chains

An n-th order Markov chain over some alphabet A is equivalent to a first-order Markov chain over the alphabet A^n of n-tuples.

Example: a 2nd-order Markov model for DNA can be treated as a 1st-order Markov model over the alphabet
AA, AC, AG, AT
CA, CC, CG, CT
GA, GC, GG, GT
TA, TC, TG, TT
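
A small sketch of this equivalence: estimating a 2nd-order DNA chain by counting transitions out of overlapping 2-tuple states (illustrative only; the training sequences are toy examples):

from collections import defaultdict
from itertools import product

def second_order_as_first_order(seqs):
    """Estimate Pr(next base | previous pair) by treating overlapping
    2-tuples as the states of a first-order chain."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in seqs:
        for i in range(2, len(s)):
            counts[s[i - 2:i]][s[i]] += 1          # state 'xy' -> next base s[i]
    probs = {}
    for pair in ("".join(p) for p in product("ACGT", repeat=2)):
        total = sum(counts[pair].values())
        probs[pair] = {b: counts[pair][b] / total if total else 0.25 for b in "ACGT"}
    return probs

model = second_order_as_first_order(["ACGTACGTCGCG", "CGCGGCGC"])
print(model["CG"])     # Pr(next base | previous two bases were C, G)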

A Fifth Order Markov Chain

Pr(gctaca) = Pr(gctac) · Pr(a | gctac)

Hidden Markov Model: A Simple HMM

Given the observed sequence AGGCT, which state emits each item?

[Figure: two candidate HMMs, Model 1 and Model 2]


Tutorial on HMM

L.R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proceedings of the IEEE, Vol. 77, No. 2, February 1989.

HMM for Hidden Coin Tossing

[Figure: HMM with hidden coins emitting H/T; example observation sequence: H H T T H T H H T T H …]

Hidden State

We'll distinguish between the observed parts of a problem and the hidden parts.
In the Markov models we've considered previously, it is clear which state accounts for each part of the observed sequence.
In the model above, there are multiple states that could account for each part of the observed sequence
– this is the hidden part of the problem.

Learning and Prediction Tasks
(in general, i.e., these apply to both Markov models and HMMs)

Learning
– Given: a model, a set of training sequences
– Do: find model parameters that explain the training sequences with relatively high probability (the goal is to find a model that generalizes well to sequences we haven't seen before)

Classification
– Given: a set of models representing different sequence classes, and a test sequence
– Do: determine which model/class best explains the sequence

Segmentation
– Given: a model representing different sequence classes, and a test sequence
– Do: segment the sequence into subsequences, predicting the class of each subsequence


Algorithms for Learning & Prediction

Learning
– correct path known for each training sequence --> simple maximum likelihood or Bayesian estimation
– correct path not known --> Forward-Backward algorithm + ML or Bayesian estimation

Classification
– simple Markov model --> calculate the probability of the sequence along the single path for each model
– hidden Markov model --> Forward algorithm to calculate the probability of the sequence along all paths for each model

Segmentation
– hidden Markov model --> Viterbi algorithm to find the most probable path for the sequence

The Parameters of an HMM

Transition Probabilities
– Probability of a transition from state k to state l:
  a_{kl} = Pr(π_i = l | π_{i-1} = k)

Emission Probabilities
– Probability of emitting character b in state k:
  e_k(b) = Pr(x_i = b | π_i = k)

Note: HMMs can also be formulated using an emission probability associated with a transition from state k to state l.

An HMM Example

[Figure: example HMM; at each state the emission probabilities sum to 1 (Σ p_i = 1), and the outgoing transition probabilities sum to 1 (Σ p_i = 1)]

Three Important Questions
(See also L.R. Rabiner (1989))

How likely is a given sequence?
– The Forward algorithm

What is the most probable "path" for generating a given sequence?
– The Viterbi algorithm

How can we learn the HMM parameters given a set of sequences?
– The Forward-Backward (Baum-Welch) algorithm


How Likely is a Given Sequence?

The probability that a given path π is taken and the sequence x is generated:

Pr(x_1 … x_L, π) = a_{0 π_1} · Π_{i=1}^{L} e_{π_i}(x_i) · a_{π_i π_{i+1}}

(where π_{L+1} denotes the end state)

Example:
Pr(AAC, π) = a_{01} · e_1(A) · a_{11} · e_1(A) · a_{13} · e_3(C) · a_{35}
           = 0.5 × 0.4 × 0.2 × 0.4 × 0.8 × 0.3 × 0.6

How Likely is a Given Sequence?

The probability over all paths is

Pr(x_1 … x_L) = Σ_π Pr(x_1 … x_L, π)

but the number of paths can be exponential in the length of the sequence...
The Forward algorithm enables us to compute this efficiently.

The Forward Algorithm

Define f_k(i) to be the probability of being in state k having observed the first i characters of sequence x of length L.
We want to compute f_N(L), the probability of being in the end state having observed all of sequence x.
f_k(i) can be defined recursively and computed using dynamic programming.

The Forward Algorithm

f_k(i) is the probability of being in state k having observed the first i characters of sequence x.

Initialization
– f_0(0) = 1 for the start state; f_k(0) = 0 for every other state

Recursion
– For an emitting state (i = 1, …, L):
  f_l(i) = e_l(x_i) · Σ_k f_k(i − 1) · a_{kl}
– For a silent state:
  f_l(i) = Σ_k f_k(i) · a_{kl}

Termination
Pr(x) = Pr(x_1 … x_L) = f_N(L) = Σ_k f_k(L) · a_{kN}
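
A self-contained sketch of the Forward recursion for a small fully-emitting HMM (begin and end states handled implicitly through a_begin and a_end; the toy parameters are arbitrary, not those from the slides):

import numpy as np

symbols = {"A": 0, "C": 1, "G": 2, "T": 3}
a_begin = np.array([0.5, 0.5])                  # a_{0k}: begin -> emitting state k
a = np.array([[0.6, 0.3],                       # a_{kl} between emitting states
              [0.2, 0.7]])
a_end = np.array([0.1, 0.1])                    # a_{kN}: state k -> end
e = np.array([[0.4, 0.1, 0.1, 0.4],             # e_k(b)
              [0.1, 0.4, 0.4, 0.1]])

def forward(x):
    """Pr(x) = f_N(L): sum over all paths, computed by dynamic programming."""
    obs = [symbols[c] for c in x]
    f = a_begin * e[:, obs[0]]                  # f_k(1) = a_{0k} e_k(x_1)
    for o in obs[1:]:
        f = e[:, o] * (f @ a)                   # f_l(i) = e_l(x_i) sum_k f_k(i-1) a_{kl}
    return float(f @ a_end)                     # termination: sum_k f_k(L) a_{kN}

print(forward("TAGA"))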


Forward Algorithm Example

[Figure: example HMM with states 0–5]

Given the sequence x = TAGA

Forward Algorithm Example

Initialization
– f_0(0) = 1,  f_1(0) = 0, …, f_5(0) = 0

Computing other values
– f_1(1) = e_1(T) · (f_0(0) a_01 + f_1(0) a_11) = 0.3 · (1 · 0.5 + 0 · 0.2) = 0.15
– f_2(1) = 0.4 · (1 · 0.5 + 0 · 0.8)
– f_1(2) = e_1(A) · (f_0(1) a_01 + f_1(1) a_11) = 0.4 · (0 · 0.5 + 0.15 · 0.2)
…
– Pr(TAGA) = f_5(4) = f_3(4) a_35 + f_4(4) a_45

Three Important Questions

How likely is a given sequence?
What is the most probable "path" for generating a given sequence?
How can we learn the HMM parameters given a set of sequences?

Finding the Most Probable Path: The Viterbi Algorithm

Define v_k(i) to be the probability of the most probable path accounting for the first i characters of x and ending in state k.
We want to compute v_N(L), the probability of the most probable path accounting for all of the sequence and ending in the end state.
v_k(i) can be defined recursively, and again we can use dynamic programming to compute v_N(L) and find the most probable path efficiently.


Finding the Most Probable Path: The Viterbi Algorithm

Define v_k(i) to be the probability of the most probable path π accounting for the first i characters of x and ending in state k.

The Viterbi Algorithm:
1. Initialization (i = 0)
   v_0(0) = 1,  v_k(0) = 0 for k > 0
2. Recursion (i = 1, …, L)
   v_l(i) = e_l(x_i) · max_k ( v_k(i − 1) · a_{kl} )
   ptr_i(l) = argmax_k ( v_k(i − 1) · a_{kl} )
3. Termination:
   P(x, π*) = max_k ( v_k(L) · a_{k0} )
   π*_L = argmax_k ( v_k(L) · a_{k0} )
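
A self-contained sketch of this recursion with backtracking for a small fully-emitting HMM (same kind of toy parameters as the Forward sketch above; all values are illustrative):

import numpy as np

symbols = {"A": 0, "C": 1, "G": 2, "T": 3}
a_begin = np.array([0.5, 0.5])                    # begin -> first emitting state
a = np.array([[0.6, 0.3], [0.2, 0.7]])            # a_{kl}
a_end = np.array([0.1, 0.1])                      # transition to the end state
e = np.array([[0.4, 0.1, 0.1, 0.4],
              [0.1, 0.4, 0.4, 0.1]])              # e_k(b)

def viterbi(x):
    """Return (probability of the most probable path, that path) for x."""
    obs = [symbols[c] for c in x]
    v = a_begin * e[:, obs[0]]                    # v_k(1)
    ptr = []
    for o in obs[1:]:
        scores = v[:, None] * a                   # scores[k, l] = v_k(i-1) a_{kl}
        ptr.append(scores.argmax(axis=0))         # ptr_i(l)
        v = e[:, o] * scores.max(axis=0)          # v_l(i) = e_l(x_i) max_k ...
    last = int((v * a_end).argmax())              # termination
    best = [last]
    for back in reversed(ptr):                    # trace the pointers back
        best.append(int(back[best[-1]]))
    return float((v * a_end).max()), best[::-1]

print(viterbi("TAGA"))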

Three Important Questions

How likely is a given sequence?
What is the most probable "path" for generating a given sequence?
How can we learn the HMM parameters given a set of sequences?

Learning Without Hidden State

Learning is simple if we know the correct path for each sequence in our training set:
estimate the parameters by counting the number of times each parameter is used across the training set.

Learning With Hidden State

If we don't know the correct path for each sequence in our training set, consider all possible paths for the sequence.
Estimate parameters through a procedure that counts the expected number of times each parameter is used across the training set.


Learning Parameters: The Baum-Welch Algorithm

Also known as the Forward-Backward algorithm.
An Expectation-Maximization (EM) algorithm
– EM is a family of algorithms for learning probabilistic models in problems that involve hidden states.
In this context, the hidden state is the path that best explains each training sequence.

Learning Parameters: The Baum-Welch Algorithm

Algorithm sketch:
– initialize the parameters of the model
– iterate until convergence
  • calculate the expected number of times each transition or emission is used
  • adjust the parameters to maximize the likelihood of these expected values
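
A minimal, unscaled sketch of this EM loop for a fully-connected discrete HMM with an initial state distribution (illustrative only: no log-space scaling, no convergence test, toy data; long sequences would need scaled forward/backward variables):

import numpy as np

def baum_welch(seqs, n_states, n_symbols, n_iter=20, seed=0):
    """EM re-estimation of (pi, A, E); seqs are integer-encoded observation sequences."""
    rng = np.random.default_rng(seed)
    pi = np.full(n_states, 1.0 / n_states)
    A = rng.random((n_states, n_states)); A /= A.sum(axis=1, keepdims=True)
    E = rng.random((n_states, n_symbols)); E /= E.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        pi_num = np.zeros(n_states)
        A_num = np.zeros((n_states, n_states))
        E_num = np.zeros((n_states, n_symbols))
        for x in seqs:
            L = len(x)
            # E-step: forward (alpha) and backward (beta) probabilities
            alpha = np.zeros((L, n_states)); beta = np.zeros((L, n_states))
            alpha[0] = pi * E[:, x[0]]
            for t in range(1, L):
                alpha[t] = E[:, x[t]] * (alpha[t - 1] @ A)
            beta[-1] = 1.0
            for t in range(L - 2, -1, -1):
                beta[t] = A @ (E[:, x[t + 1]] * beta[t + 1])
            px = alpha[-1].sum()
            gamma = alpha * beta / px                      # expected state occupancy
            pi_num += gamma[0]
            for t in range(L - 1):                         # expected transition counts
                A_num += alpha[t][:, None] * A * E[:, x[t + 1]] * beta[t + 1] / px
            for t in range(L):                             # expected emission counts
                E_num[:, x[t]] += gamma[t]
        # M-step: re-normalize the expected counts
        pi = pi_num / pi_num.sum()
        A = A_num / A_num.sum(axis=1, keepdims=True)
        E = E_num / E_num.sum(axis=1, keepdims=True)
    return pi, A, E

seqs = [[0, 1, 1, 2, 3], [3, 2, 1, 0, 0, 1]]   # toy integer-coded sequences
pi, A, E = baum_welch(seqs, n_states=2, n_symbols=4)
print(A)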

Computational Complexity of HMM Algorithms

Given an HMM with S states and a sequence of length L, the complexity of the Forward, Backward and Viterbi algorithms is O(S²L)
– This assumes that the states are densely interconnected.

Given M sequences of length L, the complexity of Baum-Welch on each iteration is O(M S² L).

Markov Models Summary

We considered models that vary in terms of order and hidden state.
Three DP-based algorithms for HMMs: Forward, Backward and Viterbi.
We discussed three key tasks: learning, classification and segmentation.
The algorithms used for each task depend on whether there is hidden state in the problem or not (i.e., whether the correct path is known).


Summary

Markov chains and hidden Markov models are probabilistic models in which the probability of a state depends only on that of the previous state.
– Given a sequence of symbols x, the Forward algorithm finds the probability of obtaining x in the model.
– The Viterbi algorithm finds the most probable path (corresponding to x) through the model.
– The Baum-Welch algorithm learns or adjusts the model parameters (transition and emission probabilities) to best explain a set of training sequences.