
AUDIO FEATURES &MACHINE LEARNINGE.M. Bakker

API2016

FEATURES FOR SPEECH RECOGNITION AND AUDIO INDEXING

• Parametric Representations

• Short Time Energy

• Zero Crossing Rates

• Level Crossing Rates

• Short Time Spectral Envelope

• Spectral Analysis

• Filter Design

• Filter Bank Spectral Analysis Model

• Linear Predictive Coding (LPC)

• MFCCs


FEATURES FOR SPEECH RECOGNITION AND AUDIO INDEXING

• Parametric Representations

• Short Time Energy

• Zero Crossing Rates

• Level Crossing Rates

Example: Speech of length 0.01 sec.
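A minimal numpy sketch of the first two short-time features named above (short-time energy and zero-crossing rate). The frame and hop sizes and the test tone are illustrative choices of mine, not values from the slides.

```python
import numpy as np

def short_time_features(x, frame_len=400, hop=160):
    """Frame-wise short-time energy and zero-crossing rate.
    frame_len=400 / hop=160 correspond to 25 ms / 10 ms at 16 kHz (illustrative)."""
    energies, zcrs = [], []
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = x[start:start + frame_len].astype(float)
        energies.append(np.sum(frame ** 2))                        # short-time energy
        zcrs.append(np.mean(np.abs(np.diff(np.sign(frame))) > 0))  # zero-crossing rate
    return np.array(energies), np.array(zcrs)

# Example: 0.01 s of a 100 Hz tone sampled at 16 kHz
fs = 16000
t = np.arange(int(0.01 * fs)) / fs
x = np.sin(2 * np.pi * 100 * t)
E, Z = short_time_features(x, frame_len=80, hop=40)
```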

FEATURES FOR SPEECH RECOGNITION AND AUDIO INDEXING

• Spectral Analysis

• Fourier Transform

• Filter Design

• Filter Bank Spectral Analysis Model

• Linear Predictive Coding (LPC)

• MFCCs


FOURIER TRANSFORM PRACTICALITIES

Fourier transform equations:

$H(f) = \int_{-\infty}^{\infty} h(t)\, e^{2\pi i f t}\, dt$

$h(t) = \int_{-\infty}^{\infty} H(f)\, e^{-2\pi i f t}\, df$

If t is in seconds, then f is in cycles per second.

[3, pp 501-503; pp 563-564]

FOURIER TRANSFORM PAIRS

Homogeneity:        $a \cdot h(t) \;\leftrightarrow\; a \cdot H(f)$
Additivity:         $g(t) + h(t) \;\leftrightarrow\; G(f) + H(f)$
Linearity:          $a \cdot g(t) + b \cdot h(t) \;\leftrightarrow\; a \cdot G(f) + b \cdot H(f)$
Multiplication:     $g(t) \cdot h(t) \;\leftrightarrow\; (G * H)(f) = \int_{-\infty}^{\infty} G(\varphi)\, H(f-\varphi)\, d\varphi$
Convolution:        $(g * h)(t) = \int_{-\infty}^{\infty} g(\tau)\, h(t-\tau)\, d\tau \;\leftrightarrow\; G(f) \cdot H(f)$
Time shifting:      $h(t - t_0) \;\leftrightarrow\; H(f) \cdot e^{2\pi i f t_0}$
Frequency shifting: $h(t) \cdot e^{-2\pi i f_0 t} \;\leftrightarrow\; H(f - f_0)$
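The multiplication/convolution pair can be checked numerically with the DFT, where the convolution is circular. A small numpy sketch (random test signals are my own choice):

```python
import numpy as np

rng = np.random.default_rng(0)
g = rng.standard_normal(256)
h = rng.standard_normal(256)

# Discrete convolution theorem: DFT(g circularly convolved with h) = G · H
circ_conv = np.real(np.fft.ifft(np.fft.fft(g) * np.fft.fft(h)))

# Direct circular convolution: (g ⊛ h)[k] = sum_m g[m] · h[(k - m) mod N]
direct = np.array([np.sum(g * np.roll(h[::-1], k + 1)) for k in range(256)])

print(np.allclose(circ_conv, direct))   # True
```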


SHORT TIME FOURIER TRANSFORM
SHORT HAMMING WINDOW: 50 SAMPLES (= 5 MSEC)

Voiced Speech

SHORT TIME FOURIER TRANSFORM
LONG HAMMING WINDOW: 500 SAMPLES (= 50 MSEC)

Voiced Speech


SHORT TIME FOURIER TRANSFORM
SHORT HAMMING WINDOW: 50 SAMPLES (= 5 MSEC)

Unvoiced Speech

SHORT TIME FOURIER TRANSFORM
LONG HAMMING WINDOW: 500 SAMPLES (= 50 MSEC)

Unvoiced Speech
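A minimal numpy sketch of the short-time Fourier transform behind these figures, comparing a short (50-sample, 5 ms at 10 kHz) and a long (500-sample, 50 ms) Hamming window. The two-tone test signal is illustrative; real voiced/unvoiced speech frames would replace it.

```python
import numpy as np

def stft_mag(x, win_len, hop):
    """Magnitude short-time Fourier transform with a Hamming window."""
    w = np.hamming(win_len)
    frames = [x[s:s + win_len] * w
              for s in range(0, len(x) - win_len + 1, hop)]
    return np.abs(np.fft.rfft(frames, axis=1))   # shape: (num_frames, win_len//2 + 1)

fs = 10000                       # 10 kHz, so 50 samples = 5 ms and 500 samples = 50 ms
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 120 * t) + 0.5 * np.sin(2 * np.pi * 1800 * t)

S_short = stft_mag(x, win_len=50,  hop=25)    # good time resolution, coarse frequency
S_long  = stft_mag(x, win_len=500, hop=250)   # fine frequency resolution, smeared in time
```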


FOURIER TRANSFORM CONVOLUTION AND CORRELATION

• [3, pp. 501-505]

$g(t) \cdot h(t) \;\leftrightarrow\; (G * H)(f) = \int_{-\infty}^{\infty} G(\varphi)\, H(f-\varphi)\, d\varphi$

$(g * h)(t) = \int_{-\infty}^{\infty} g(\tau)\, h(t-\tau)\, d\tau \;\leftrightarrow\; G(f) \cdot H(f)$

BAND PASS FILTER

Block diagram: audio signal g(t) → band-pass filter h(t) → resulting audio signal (g * h)(t).

Note that the band-pass filter can be defined as:

• a convolution with a filter response function h(t) in the time domain
• a multiplication with a filter response function H(f) in the frequency domain

$(g * h)(t) = \int_{-\infty}^{\infty} g(\tau)\, h(t-\tau)\, d\tau \;\leftrightarrow\; G(f) \cdot H(f)$
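A numpy/scipy sketch of these two equivalent views of band-pass filtering. The FIR design via scipy.signal.firwin, the 1-2 kHz pass band and the two-tone test signal are illustrative assumptions, not taken from the slides.

```python
import numpy as np
from scipy import signal

fs = 16000
t = np.arange(fs) / fs
g = np.sin(2 * np.pi * 300 * t) + np.sin(2 * np.pi * 3000 * t)   # two tones

# FIR band-pass response h(t): keep roughly 1-2 kHz (illustrative band)
h = signal.firwin(numtaps=101, cutoff=[1000, 2000], pass_zero=False, fs=fs)

# 1) convolution with h(t) in the time domain
y_time = np.convolve(g, h, mode="full")

# 2) multiplication with H(f) in the frequency domain (zero-padded to avoid wrap-around)
n = len(g) + len(h) - 1
y_freq = np.real(np.fft.ifft(np.fft.fft(g, n) * np.fft.fft(h, n)))

print(np.allclose(y_time, y_freq))   # True: (g * h)(t) <-> G(f)·H(f)
```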


BANK OF FILTERS ANALYSIS MODEL

(Figure: a bank of band-pass filters laid out along the central-frequency axis.)

• Speech signal: s(n), n = 0, 1, …; sampling frequency $F_s$
• Band-pass filters with central frequencies $f_1, \ldots, f_q$
• $\mathrm{Filter}_k(s(n)) = X_n(e^{i\omega_k})$, where $\omega_k = 2\pi f_k / F_s$ is the normalized frequency corresponding to $f_k$, for $k = 1, \ldots, q$.
• $X_n(e^{i\omega_k})$ is the short-time spectral representation of s(n) at time n, as seen through $\mathrm{Filter}_k$ with centre frequency $\omega_k$, $k = 1, \ldots, q$.
• Note: each band-pass filter independently processes the speech signal s to produce the spectral representation (see the sketch below).
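A minimal numpy/scipy sketch of this bank-of-filters idea: each band-pass filter is applied to s(n) independently, and per-band short-time energies serve as the spectral representation. The FIR filter design, bandwidths, center frequencies and frame sizes are my own illustrative choices.

```python
import numpy as np
from scipy import signal

def filter_bank_energies(s, fs, centers, bw=200.0, numtaps=101, frame=400, hop=160):
    """Band-pass s(n) around each center frequency f_k and return
    per-band short-time energies (one row per filter)."""
    out = []
    for fk in centers:
        h = signal.firwin(numtaps, [fk - bw / 2, fk + bw / 2],
                          pass_zero=False, fs=fs)             # band-pass filter k
        xk = signal.lfilter(h, [1.0], s)                      # filter k applied to s(n)
        energies = [np.sum(xk[i:i + frame] ** 2)              # short-time representation
                    for i in range(0, len(xk) - frame + 1, hop)]
        out.append(energies)
    return np.array(out)        # shape: (q, num_frames)

fs = 16000
s = np.random.default_rng(1).standard_normal(fs)              # 1 s of noise as a stand-in
X = filter_bank_energies(s, fs, centers=[300, 600, 1200, 2400, 4800])
```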


MEL-CEPSTRUM [4]

Auditory characteristics

• Mel-scaled filter banks

De-correlating properties

• by applying a discrete cosine transform (which is close to a Karhunen-Loeve transform) a de-correlation of the mel-scale filter log-energies results

• => probabilistic modeling on these de-correlated coefficients will be more effective.

One of the most successful features for speech recognition, speaker recognition, and other speech-related recognition tasks.

[1, pp. 712-717]

MFCCS

Pipeline: speech / audio → pre-emphasis → windowing → Fast Fourier Transform → mel-scale filter bank → log(·) → Discrete Cosine Transform → MFCCs (the first 12 most significant coefficients).
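A compact numpy sketch of this MFCC pipeline (pre-emphasis → windowing → FFT → mel filter bank → log → DCT). The frame sizes, FFT length, number of filters, pre-emphasis coefficient and the triangular-filter construction are common defaults, not values prescribed by the slides.

```python
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):  return 2595.0 * np.log10(1.0 + f / 700.0)
def mel_to_hz(m):  return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(x, fs=16000, frame=400, hop=160, nfft=512, nfilt=26, ncep=12, pre=0.97):
    """Minimal MFCC sketch with illustrative defaults."""
    # Pre-emphasis
    x = np.append(x[0], x[1:] - pre * x[:-1])
    # Windowing (Hamming) + power spectrum per frame
    w = np.hamming(frame)
    frames = np.array([x[i:i + frame] * w
                       for i in range(0, len(x) - frame + 1, hop)])
    power = np.abs(np.fft.rfft(frames, nfft, axis=1)) ** 2       # (T, nfft//2 + 1)
    # Triangular mel-scale filter bank
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(fs / 2), nfilt + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((nfilt, nfft // 2 + 1))
    for m in range(1, nfilt + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # Log filter-bank energies, then DCT; keep the first `ncep` coefficients
    logE = np.log(power @ fbank.T + 1e-10)
    return dct(logE, type=2, axis=1, norm="ortho")[:, :ncep]
```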


JFA, I-VECTORS [5, 6]

Both very successful in Speaker Verification Systems

Joint Factor Analysis (JFA)

• Models speaker and channel variability in separate low-dimensional subspaces.

I-Vectors

• Define a single low-dimensional total-variability space that combines channel and speaker variability at the same time.


MACHINE LEARNING METHODS

• Vector Quantization
  • Finite code book of spectral shapes
  • The code book codes for ‘typical’ spectral shapes
  • A method applicable to all spectral representations (e.g. filter banks, LPC, ZCR, etc.)

• Support Vector Machines

• Markov Models

• Hidden Markov Models

• Neural Networks, etc.


PATTERN RECOGNITION

Block diagram: speech / audio → parameter measurements → test (query) pattern → pattern comparison against reference patterns → decision rules → recognized speech / audio.

VECTOR QUANTIZATION

• Data represented as feature vectors.

• Use a Vector Quantization (VQ) Training set to determine a set of code words that constitute a code book.

• Code words are centroids using a similarity or distance measure d.

• Code words together with measure d divide the space into Voronoi regions.

• A query vector falls into a Voronoi region and will be represented by the respective code word.

[2, pp. 466 – 467]


VECTOR QUANTIZATION

Distance measures d(x,y):

• Euclidean distance

• Taxi cab distance

• Hamming distance

• etc.

VECTOR QUANTIZATION

Let a training set of L vectors be given for a certain class of objects.

Assume a codebook of M code words is wanted for this class.

Initialize:

• choose M arbitrary vectors of the L vectors of the training set.

• This is the initial code book.

Nearest Neighbor Search:

• for each training vector v, find the code word w in the current code book that is closest and assign v to the corresponding cell of w.

Centroid Update:

• For each cell with code word w determine the centroid c of the training vectors that are assigned to the cell of w.

• Update the code word w with the new vector c.

Iteration:

• repeat the steps Nearest Neighbor Search and Centroid Update until the average distance between the new and previous code words falls below a preset threshold.
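A minimal numpy sketch of this codebook training procedure (a k-means style iteration: initialize with M training vectors, then alternate nearest-neighbour search and centroid update), plus the classification rule discussed on the next slides. Distance measure (Euclidean), stopping threshold and initialization are illustrative defaults.

```python
import numpy as np

def vq_train(train, M, iters=50, tol=1e-4, seed=0):
    """Train a codebook of M code words from a (L, dim) training matrix."""
    rng = np.random.default_rng(seed)
    code = train[rng.choice(len(train), M, replace=False)].copy()   # initial code book
    for _ in range(iters):
        # Nearest Neighbor Search (Euclidean distance d)
        d = np.linalg.norm(train[:, None, :] - code[None, :, :], axis=2)
        cell = np.argmin(d, axis=1)
        # Centroid Update (keep old code word if its cell is empty)
        new_code = np.array([train[cell == m].mean(axis=0) if np.any(cell == m)
                             else code[m] for m in range(M)])
        # Iteration: stop when the code words barely move
        moved = np.mean(np.linalg.norm(new_code - code, axis=1))
        code = new_code
        if moved < tol:
            break
    return code

def vq_classify(v, codebooks):
    """Assign v to the class whose codebook contains the closest code word."""
    dists = [np.min(np.linalg.norm(cb - v, axis=1)) for cb in codebooks]
    return int(np.argmin(dists))
```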


VECTOR CLASSIFICATION

For an M-vector code book CB with codes

CB = {yi | 1 ≤ i ≤ M} ,

the index m* of the best codebook entry for a given vector v is:

$m^* = \arg\min_{1 \le i \le M} d(v, y_i)$

VQ FOR CLASSIFICATION

A code book CBk = {yki | 1 ≤ i ≤ M} can be used to define a class Ck.

Example: audio classification

• Classes ‘crowd’, ‘car’, ‘silence’, ‘scream’, ‘explosion’, etc.

• Classify a query by comparing it against the VQ code books CBk trained for each of the respective classes Ck.

• VQ is very often used as a baseline method for classification problems.


SUPPORT VECTOR MACHINES

• A generalization of linear decision boundaries for classification.

• Necessary when classes overlap when using linear decision boundaries (non separable classes).

Find a hyperplane P: $x^T\beta + \beta_0 = 0$ such that $\|\beta\|$ is minimized subject to

$y_i (x_i^T \beta + \beta_0) \ge 1 - \varepsilon_i \quad \forall i, \qquad \varepsilon_i \ge 0, \quad \sum_i \varepsilon_i \le \text{constant},$

where $\varepsilon = (\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_N)$ are the slack variables, i.e., the amount by which points are on the wrong side of their margin $C = 1/\|\beta\|$ from the hyperplane P.

=> The problem is quadratic with linear inequality constraints.

[2, pp 377-389]

SUPPORT VECTOR MACHINE (SVM)

In this method so called support vectors define decision boundaries for classification and regression.

An example where a straight line separates the two classes: a linear classifier.

Images from: www.statsoft.com.


SUPPORT VECTOR MACHINE (SVM)

In general classification is not that simple.

SVM is a method that can handle the more complex cases where the decision boundary requires a curve.

SVM uses a set of mapping functions (kernels) to map the feature space into a transformed space so that hyperplanes can be used for the classification.

SUPPORT VECTOR MACHINE (SVM)

SVM uses a set of mapping functions (kernels) to map the feature space into a transformed space so that hyperplanes can be used for the classification.


SUPPORT VECTOR MACHINE (SVM)

Training of an SVM is an iterative process:

• optimize the mapping function while minimizing an error function

• the error function should capture the penalties for misclassified, i.e., non-separable, data points.

SUPPORT VECTOR MACHINE (SVM)

SVM uses kernels that define the mapping function used in the method. Kernels can be:

• Linear

• Polynomial

• RBF

• Sigmoid

• Etc.

• RBF (radial basis function) is the most popular kernel, again with different possible basis functions.

• The final choice depends on characteristics of the classification task.
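A hedged illustration of an SVM classifier with the RBF kernel named above, assuming scikit-learn is available; the synthetic 12-dimensional feature vectors merely stand in for e.g. MFCC vectors of two audio classes.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Toy two-class problem standing in for two audio classes
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (200, 12)), rng.normal(1.5, 1.0, (200, 12))])
y = np.array([0] * 200 + [1] * 200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# RBF kernel; C controls the penalty on the slack variables
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```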


From: T. Joachims, SIGIR 2003 Tutorial.


NEURAL NETWORKS


RECENT BREAKTHROUGHS

F. Seide, G. Li, D. Yu, Conversational Speech Transcription Using Context-Dependent Deep Neural Networks, INTERSPEECH, August 2011.

• Out-of-the-box speaker-independent speech recognition is especially important for mobile devices like smartphones.

• Until recently, artificial neural networks did not reach the state-of-the-art performance of automatic speech recognition systems based on context-dependent Gaussian mixture model HMMs.

• Context-dependent deep neural network HMMs (CD-DNN-HMMs).

• DNNs are able to extract discriminative internal representations that are robust to the many sources of variability in speech signals. [D. Yu et al. 2013]

• On the Switchboard benchmark a word error rate of 18.5% was obtained, a 33% relative improvement with respect to conventional state-of-the-art speech recognition systems.


NEURAL NETWORKS FOR SPEECH RECOGNITION

Neural Networks have been visited many times:

• Adaline digit recognition, L.R. Talbert et al, Stanford 1963

• Speaker dependent digit recognizer

• Multi-layer perceptrons (late 1980s, early 1990s: Kohonen, Makino, Kammerer, and many more)

• Around 2000, NN-based ASR did not improve on the by-then matured Gaussian mixture approaches

• G. Hinton et al.'s general unsupervised approach using multiple layers sparked a renewed use of DNNs for ASR

N. Morgan, Multilayer perceptrons for speech recognition: There and Back Again. See: http://www.superlectures.com/asru2013/

GEORGE SAON, TOM SERCU, STEVEN RENNIE AND HONG-KWANG J. KUO, THE IBM 2016 ENGLISH CONVERSATIONAL TELEPHONE SPEECH RECOGNITION

SYSTEM, INTERSPEECH 2016.

In the past, building good ASR systems required:

• Speaker adaptation

• Phonetic context modeling

• Discriminative feature processing, etc.

Now neural network architectures ease the work thanks to:

• ASR that can be trained agnostically on large amounts of data

• neural network toolkits that implement models out-of-the-box

• the availability of powerful GPU-based hardware.


End-to-end ASR will become readily available to non-ASR specialists as:

1. Front-end processing has been simplified by CNNs which treat the log-mel spectral representation as an image and don’t require any extra processing steps.

2. End-to-end ASR systems such as DeepSpeech and Eesen do not use phonetic context decision trees and HMMs, and directly map the sequence of acoustic features to a sequence of characters or context-independent phones.

3. New training algorithms don't require an initial alignment of the training data, which is typically done with a GMM-based baseline model.

• Three different acoustic models were trained on 2000 hours of English conversational telephone speech:

1. Recurrent nets with maxout activations and annealed dropout.

2. Very deep convolutional nets with 3×3 kernels.

3. Bidirectional long short-term memory nets operating on FMLLR and i-vector features.

• All models are used in a hybrid HMM decoding scenario.

Result:

The word error rate of the presented English conversational telephone LVCSR system is a record 6.6% on the Switchboard subset of the Hub5 2000 evaluation test set.


https://arxiv.org/pdf/1610.05256v1.pdf

W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu and G. Zweig ACHIEVING HUMAN PARITY IN CONVERSATIONAL SPEECH RECOGNITION, October 2016.

Human WER NIST 2000 Test Set for professional transcriptionists:

• 5.9% for the Switchboard portion of the data, in which newly acquainted pairs of people discuss an assigned topic,

• 11.3% for the CallHome portion where friends and family members have open-ended conversations.

• They measure the human error rate on the widely used NIST 2000 test set and find that their latest automated system has reached human parity.

In this study:

• convolutional and LSTM (long short-term memory) neural networks, combined with a novel spatial smoothing method and lattice-free acoustic training, give word error rates on par with the human transcribers.

SEQUENCES: SOUND AND DNA

• DNA: a helix-shaped molecule whose constituents are two parallel strands of nucleotides.

• DNA is usually represented by sequences of these four nucleotides.

• This assumes only one strand is considered; the second strand is always derivable from the first by pairing A's with T's and C's with G's, and vice versa.

Nucleotides (bases)

Adenine (A)

Cytosine (C)

Guanine (G)

Thymine (T)


BIOLOGICAL INFORMATION: FROM GENES TO PROTEINS

Pipeline: DNA (gene) → transcription → RNA → translation → protein → protein folding.

Related fields: genomics, molecular biology, structural biology, biophysics.

MOTIVATION FOR MARKOV MODELS

• There are many cases in which we would like to represent the statistical regularities of some class of sequences

• genes

• proteins in a given family

• Sequences of audio features

• Markov models are well suited to this type of task


A MARKOV CHAIN MODEL

Transition probabilities (example, out of state g):

• $\Pr(x_i = a \mid x_{i-1} = g) = 0.16$
• $\Pr(x_i = c \mid x_{i-1} = g) = 0.34$
• $\Pr(x_i = g \mid x_{i-1} = g) = 0.38$
• $\Pr(x_i = t \mid x_{i-1} = g) = 0.12$

$\sum_{x} \Pr(x_i = x \mid x_{i-1} = g) = 1$

DEFINITION OF MARKOV CHAIN MODEL

• A Markov chain[1] model is defined by

• a set of states

• some states emit symbols (for each symbol only one state)

• other states (e.g., the begin state) are silent

• a set of transitions with associated probabilities

• the transitions emanating from a given state define a

distribution over the possible next states

[1] A. A. Markov, "Extension of the law of large numbers to quantities depending on each other," Izvestiya of the Physico-Mathematical Society at Kazan University, 2nd series, vol. 15 (1906), pp. 135-156.


MARKOV CHAIN MODELS: PROPERTIES

Given some sequence x of length L

How probable is the sequence given our model?

For any probabilistic model of sequences, we can write this probability as

$\Pr(x) = \Pr(x_L, x_{L-1}, \ldots, x_1) = \Pr(x_L \mid x_{L-1}, \ldots, x_1)\, \Pr(x_{L-1} \mid x_{L-2}, \ldots, x_1) \cdots \Pr(x_1)$

Key property of a (1st order) Markov chain: the probability of each $x_i$ depends only on the value of $x_{i-1}$, so

$\Pr(x) = \Pr(x_L \mid x_{L-1})\, \Pr(x_{L-1} \mid x_{L-2}) \cdots \Pr(x_2 \mid x_1)\, \Pr(x_1) = \Pr(x_1) \prod_{i=2}^{L} \Pr(x_i \mid x_{i-1})$

THE PROBABILITY OF A SEQUENCE FOR A MARKOV CHAIN MODEL

$\Pr(cggt) = \Pr(c)\,\Pr(g \mid c)\,\Pr(g \mid g)\,\Pr(t \mid g)$
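A small sketch that evaluates this factorization. The transition probabilities out of state g are the ones from the example above; the remaining rows and the start distribution are illustrative assumptions.

```python
# First-order Markov chain over the DNA alphabet.
P = {  # P[prev][next] = Pr(x_i = next | x_{i-1} = prev)
    "a": {"a": 0.30, "c": 0.20, "g": 0.30, "t": 0.20},   # illustrative
    "c": {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25},   # illustrative
    "g": {"a": 0.16, "c": 0.34, "g": 0.38, "t": 0.12},   # from the example above
    "t": {"a": 0.20, "c": 0.30, "g": 0.30, "t": 0.20},   # illustrative
}
start = {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25}     # Pr(x_1), assumed uniform

def markov_prob(seq, start, P):
    """Pr(x) = Pr(x_1) * prod_i Pr(x_i | x_{i-1}) for a 1st-order chain."""
    p = start[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= P[prev][cur]
    return p

print(markov_prob("cggt", start, P))   # Pr(c)·Pr(g|c)·Pr(g|g)·Pr(t|g)
```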


EXAMPLE APPLICATIONS

Isolated Word Recognition

• A spectral vector is modeled by a state in a Markov chain

• An utterance is represented by a sequence of states

Algorithmic Music Composition

• States are note or pitch values

• Probabilities for transitions to other notes are given (can be 1st order, 2nd order, etc.)

MARKOV CHAINS FOR DISCRIMINATION

Suppose we want to distinguish CpG islands from other sequence regions

• Given sequences from CpG islands, and sequences from other regions, we can construct

• a model to represent CpG islands

• a null model to represent the other regions

• Using the models we can now score a test sequence by:

$\mathrm{score}(x) = \log \dfrac{\Pr(x \mid \text{CpG model})}{\Pr(x \mid \text{null model})}$


MARKOV CHAINS FOR DISCRIMINATION

• Why can we use

$\mathrm{score}(x) = \log \dfrac{\Pr(x \mid \text{CpG model})}{\Pr(x \mid \text{null model})}$ ?

• The real score should be: assume x is observed; what is the probability that x was generated by the CpG model compared to the probability that x was generated by the null model, i.e., $\Pr(\text{CpG} \mid x) / \Pr(\text{null} \mid x)$.

• According to Bayes' rule:

$\Pr(\text{CpG} \mid x) = \dfrac{\Pr(x \mid \text{CpG})\,\Pr(\text{CpG})}{\Pr(x)}, \qquad \Pr(\text{null} \mid x) = \dfrac{\Pr(x \mid \text{null})\,\Pr(\text{null})}{\Pr(x)}$

• If we are not taking into account the prior probabilities Pr(CpG) and Pr(null) of the two classes, then from Bayes' rule it is clear that we just need to compare Pr(x|CpG) and Pr(x|null), as is done in our scoring function score().
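A short sketch of this log-odds score, reusing the hypothetical markov_prob() helper from the earlier Markov chain sketch; cpg_P/cpg_start and null_P/null_start stand for transition tables and start distributions estimated from CpG-island and background sequences, respectively.

```python
import math

def log_odds(seq, cpg_start, cpg_P, null_start, null_P):
    """score(x) = log Pr(x | CpG model) - log Pr(x | null model)."""
    return (math.log(markov_prob(seq, cpg_start, cpg_P))
            - math.log(markov_prob(seq, null_start, null_P)))

# score > 0: x is more likely under the CpG-island model than under the null model.
```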

HIGHER ORDER MARKOV CHAINS

• The Markov property specifies that the probability of a state

depends only on the probability of the previous state

• But we can build more “memory” into our states by using a

higher order Markov model

• In an n-th order Markov model, the probability of the current state depends on the previous n states:

$\Pr(x_i \mid x_{i-1}, x_{i-2}, \ldots, x_1) = \Pr(x_i \mid x_{i-1}, \ldots, x_{i-n})$


SELECTING THE ORDER OF A MARKOV CHAIN MODEL

• But the number of parameters we need to estimate grows exponentially with the order: for modeling DNA we need $O(4^{n+1})$ parameters for an n-th order model.

• The higher the order, the less reliable we can expect our parameter estimates to be:
  • estimating the parameters of a 2nd order Markov chain from the complete genome of E. coli ($5.44 \times 10^6$ bases), we would see each word about 85,000 times on average (divide by $4^3$);
  • estimating the parameters of a 9th order chain, we would see each word about 5 times on average (divide by $4^{10} \approx 10^6$).

HIGHER ORDER MARKOV CHAINS

• An n-th order Markov chain over some alphabet A is equivalent to a first-order Markov chain over the alphabet $A^n$ of n-tuples.

• Example: a 2nd order Markov model for DNA can be treated as a 1st order Markov model over the alphabet

CA, CC, CG, CT

GA, GC, GG, GT

TA, TC, TG, TT


A FIFTH ORDER MARKOV CHAIN

$\Pr(gctaca) = \Pr(gctac) \cdot \Pr(a \mid gctac)$

HIDDEN MARKOV MODEL: A SIMPLE HMM

Given observed sequence AGGCT, which state emits every item?

Model 1 Model 2


TUTORIAL ON HMM

L.R. Rabiner, A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, Proceedings of the IEEE, Vol. 77, No. 2, February 1989.

HMM FOR HIDDEN COIN TOSSING

(Figure: an HMM with hidden coin states emitting an observed sequence of heads and tails, e.g. … H H T T H T H H T T H.)


HIDDEN STATE

• We'll distinguish between the observed parts of a problem and the hidden parts.

• In the Markov models we've considered previously, it is clear which state accounts for each part of the observed sequence.

• In the model above, there are multiple states that could account for each part of the observed sequence:
  • this is the hidden part of the problem.

LEARNING AND PREDICTION TASKS (IN GENERAL, I.E., THEY APPLY TO BOTH MM AND HMM)

• Learning
  • Given: a model, a set of training sequences
  • Do: find model parameters that explain the training sequences with relatively high probability (the goal is a model that generalizes well to sequences we haven't seen before)

• Classification
  • Given: a set of models representing different sequence classes, and a test sequence
  • Do: determine which model/class best explains the sequence

• Segmentation
  • Given: a model representing different sequence classes, and a test sequence
  • Do: segment the sequence into subsequences, predicting the class of each subsequence


ALGORITHMS FOR LEARNING & PREDICTION

• Learning
  • correct path known for each training sequence -> simple maximum likelihood or Bayesian estimation
  • correct path not known -> Forward-Backward algorithm + ML or Bayesian estimation

• Classification
  • simple Markov model -> calculate the probability of the sequence along the single path for each model
  • hidden Markov model -> Forward algorithm to calculate the probability of the sequence along all paths for each model

• Segmentation
  • hidden Markov model -> Viterbi algorithm to find the most probable path for the sequence

THE PARAMETERS OF AN HMM

• Transition probabilities: the probability of a transition from state k to state l

$a_{kl} = \Pr(\pi_i = l \mid \pi_{i-1} = k)$

• Emission probabilities: the probability of emitting character b in state k

$e_k(b) = \Pr(x_i = b \mid \pi_i = k)$

Note: HMMs can also be formulated with an emission probability associated with a transition from state k to state l.


AN HMM EXAMPLE

(Figure: an HMM with emission probabilities summing to 1 per state and transition probabilities summing to 1 per state.)

THREE IMPORTANT QUESTIONS

(SEE ALSO L.R. RABINER (1989))

• How likely is a given sequence?

• The Forward algorithm

• What is the most probable “path” for generating a given sequence?

• The Viterbi algorithm

• How can we learn the HMM parameters

given a set of sequences?

• The Forward-Backward (Baum-Welch) algorithm


HOW LIKELY IS A GIVEN SEQUENCE?

• The probability that a given path $\pi$ is taken and the sequence $x_1 \ldots x_L$ is generated:

$\Pr(x_1 \ldots x_L, \pi) = a_{0\pi_1} \prod_{i=1}^{L} e_{\pi_i}(x_i)\, a_{\pi_i \pi_{i+1}}$

(with $\pi_{L+1}$ the end state). For example:

$\Pr(AAC, \pi) = a_{01}\, e_1(A)\, a_{11}\, e_1(A)\, a_{13}\, e_3(C)\, a_{35} = 0.5 \cdot 0.4 \cdot 0.2 \cdot 0.4 \cdot 0.8 \cdot 0.3 \cdot 0.6$

HOW LIKELY IS A GIVEN SEQUENCE?

• But we need the probability over all paths:

$\Pr(x_1 \ldots x_L) = \sum_{\pi} \Pr(x_1 \ldots x_L, \pi)$

• The number of paths can be exponential in the length of the sequence...

• The Forward algorithm enables us to compute this efficiently


THE FORWARD ALGORITHM

• Define $f_k(i)$ to be the probability of being in state k having observed the first i characters of sequence x of length L.

• We want to compute $f_N(L)$, the probability of being in the end state having observed all of sequence x.

• It can be defined recursively.

• Compute it using dynamic programming.

THE FORWARD ALGORITHM

• $f_k(i)$ is the probability of being in state k having observed the first i characters of sequence x

• Initialization:
  • $f_0(0) = 1$ for the start state; $f_k(0) = 0$ for every other state

• Recursion:
  • for an emitting state ($i = 1, \ldots, L$): $f_l(i) = e_l(x_i) \sum_k f_k(i-1)\, a_{kl}$
  • for a silent state: $f_l(i) = \sum_k f_k(i)\, a_{kl}$

• Termination:

$\Pr(x) = \Pr(x_1 \ldots x_L) = f_N(L) = \sum_k f_k(L)\, a_{kN}$


FORWARD ALGORITHM EXAMPLE

Given the sequence x=TAGA

FORWARD ALGORITHM EXAMPLE

• Initialization: $f_0(0) = 1$, $f_1(0) = \ldots = f_5(0) = 0$

• Computing other values:
  • $f_1(1) = e_1(T) \cdot (f_0(0)\, a_{01} + f_1(0)\, a_{11}) = 0.3 \cdot (1 \cdot 0.5 + 0 \cdot 0.2) = 0.15$
  • $f_2(1) = 0.4 \cdot (1 \cdot 0.5 + 0 \cdot 0.8)$
  • $f_1(2) = e_1(A) \cdot (f_0(1)\, a_{01} + f_1(1)\, a_{11}) = 0.4 \cdot (0 \cdot 0.5 + 0.15 \cdot 0.2)$
  • …

• $\Pr(TAGA) = f_5(4) = f_3(4)\, a_{35} + f_4(4)\, a_{45}$
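A minimal numpy sketch of the Forward algorithm. The 2-state toy model, its transition and emission values, and the explicit begin/end-state handling are illustrative assumptions, not the model from the figure above.

```python
import numpy as np

def forward(obs, a0, A, aN, E, symbols):
    """Forward algorithm: Pr(x) summed over all paths.
    a0[k]  : transition probability from the (silent) begin state to state k
    A[k,l] : transition probability a_kl between emitting states
    aN[k]  : transition probability from state k to the (silent) end state
    E[k,b] : emission probability e_k(b); `symbols` maps symbol -> column index."""
    S = len(a0)
    f = np.zeros((len(obs), S))
    # First observation: f_k(1) = e_k(x_1) * a_{0k}
    f[0] = E[:, symbols[obs[0]]] * a0
    # Recursion: f_l(i) = e_l(x_i) * sum_k f_k(i-1) * a_kl
    for i in range(1, len(obs)):
        f[i] = E[:, symbols[obs[i]]] * (f[i - 1] @ A)
    # Termination: Pr(x) = sum_k f_k(L) * a_kN
    return float(f[-1] @ aN)

# Toy 2-state example over {A, C, G, T}; all numbers are illustrative.
symbols = {"A": 0, "C": 1, "G": 2, "T": 3}
a0 = np.array([0.5, 0.5])
A  = np.array([[0.8, 0.1],          # each row sums to 0.9; the rest goes to the end state
               [0.1, 0.8]])
aN = np.array([0.1, 0.1])
E  = np.array([[0.4, 0.1, 0.1, 0.4],
               [0.1, 0.4, 0.4, 0.1]])
print(forward("TAGA", a0, A, aN, E, symbols))
```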


THREE IMPORTANT QUESTIONS

• How likely is a given sequence?

• What is the most probable “path” for

generating a given sequence?

• How can we learn the HMM parameters

given a set of sequences?

FINDING THE MOST PROBABLE

PATH: THE VITERBI ALGORITHM

• Define vk(i) to be the probability of the most probable

path accounting for the first i characters of x and

ending in state k

• We want to compute vN(L), the probability of the most

probable path accounting for all of the sequence and

ending in the end state

• Can be defined recursively

• Again we can use dynamic programming to compute $v_N(L)$ and find the most probable path efficiently.


FINDING THE MOST PROBABLE

PATH: THE VITERBI ALGORITHM

Define $v_k(i)$ to be the probability of the most probable path $\pi$ accounting for the first i characters of x and ending in state k.

The Viterbi algorithm:

1. Initialization ($i = 0$): $v_0(0) = 1$, $v_k(0) = 0$ for $k > 0$

2. Recursion ($i = 1, \ldots, L$):
   $v_l(i) = e_l(x_i) \cdot \max_k \big(v_k(i-1) \cdot a_{kl}\big)$
   $\mathrm{ptr}_i(l) = \arg\max_k \big(v_k(i-1) \cdot a_{kl}\big)$  (keep a pointer to the previous best state)

3. Termination:
   $\Pr(x, \pi^*) = \max_k \big(v_k(L) \cdot a_{k0}\big)$
   $\pi^*_L = \arg\max_k \big(v_k(L) \cdot a_{k0}\big)$
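A matching Viterbi sketch for the same (assumed) model layout as the forward() example: it returns the probability of the most probable path and the path itself, recovered by tracing back the stored pointers.

```python
import numpy as np

def viterbi(obs, a0, A, aN, E, symbols):
    """Most probable state path for an observation sequence."""
    S, L = len(a0), len(obs)
    v   = np.zeros((L, S))
    ptr = np.zeros((L, S), dtype=int)
    v[0] = E[:, symbols[obs[0]]] * a0
    for i in range(1, L):
        # v_l(i) = e_l(x_i) * max_k v_k(i-1) * a_kl, with a pointer to the best k
        scores = v[i - 1][:, None] * A              # scores[k, l]
        ptr[i] = np.argmax(scores, axis=0)
        v[i]   = E[:, symbols[obs[i]]] * np.max(scores, axis=0)
    # Termination: include the transition into the end state
    last = int(np.argmax(v[-1] * aN))
    best_prob = float(v[-1, last] * aN[last])
    # Trace back the most probable path
    path = [last]
    for i in range(L - 1, 0, -1):
        path.append(int(ptr[i, path[-1]]))
    return best_prob, path[::-1]
```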

THREE IMPORTANT QUESTIONS

• How likely is a given sequence?

• What is the most probable “path” for generating a

given sequence?

• How can we learn the HMM parameters given a set of

sequences?


LEARNING WITHOUT HIDDEN STATE

• Learning is simple if we know the correct path for each sequence in our training set

• estimate parameters by counting the number of times each parameter is used across the training set

LEARNING WITH HIDDEN STATE

• If we don’t know the correct path for each sequence in our training set, consider all possible paths for the sequence

• Estimate parameters through a procedure that counts the expected number of times each parameter is used across the training set


LEARNING PARAMETERS: THE BAUM-WELCH ALGORITHM

• Also known as the Forward-Backward

algorithm

• An Expectation Maximization (EM) algorithm

• EM is a family of algorithms for learning

probabilistic models in problems that involve

hidden states

• In this context, the hidden state is the path

that best explains each training sequence

LEARNING PARAMETERS: THE BAUM-WELCH ALGORITHM

• Algorithm sketch:
  • initialize the parameters of the model
  • iterate until convergence:
    • calculate the expected number of times each transition or emission is used
    • adjust the parameters to maximize the likelihood of these expected values
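A single-iteration Baum-Welch sketch for one training sequence, reusing the model layout assumed in the forward()/viterbi() sketches above. A real implementation would sum the expected counts over all training sequences and work in log space or with scaling to avoid underflow.

```python
import numpy as np

def baum_welch_step(obs, a0, A, aN, E, symbols):
    """One EM iteration: expected transition/emission counts, then re-estimation."""
    x = np.array([symbols[c] for c in obs])
    S, L = len(a0), len(x)

    # E step, part 1: forward probabilities f_k(i)
    f = np.zeros((L, S))
    f[0] = E[:, x[0]] * a0
    for i in range(1, L):
        f[i] = E[:, x[i]] * (f[i - 1] @ A)
    px = float(f[-1] @ aN)                              # Pr(x)

    # E step, part 2: backward probabilities b_k(i)
    b = np.zeros((L, S))
    b[-1] = aN
    for i in range(L - 2, -1, -1):
        b[i] = A @ (E[:, x[i + 1]] * b[i + 1])

    # Expected number of times each transition / emission is used
    T0 = f[0] * b[0] / px                               # begin -> k
    Tend = f[-1] * aN / px                              # k -> end
    T = np.zeros((S, S))                                # k -> l
    for i in range(L - 1):
        T += np.outer(f[i], E[:, x[i + 1]] * b[i + 1]) * A / px
    Em = np.zeros_like(E)                               # symbol b emitted in state k
    for i in range(L):
        Em[:, x[i]] += f[i] * b[i] / px

    # M step: re-estimate the parameters from the expected counts
    a0_new = T0 / T0.sum()
    denom = T.sum(axis=1) + Tend
    A_new = T / denom[:, None]
    aN_new = Tend / denom
    E_new = Em / Em.sum(axis=1, keepdims=True)
    return a0_new, A_new, aN_new, E_new
```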


COMPUTATIONAL COMPLEXITY OF HMM ALGORITHMS

• Given an HMM with S states and a sequence of length L, the complexity of the Forward, Backward and Viterbi algorithms is $O(S^2 L)$.
  • This assumes that the states are densely interconnected.

• Given M sequences of length L, the complexity of Baum-Welch on each iteration is $O(M S^2 L)$.

REFERENCES

1. T.F. Quatieri, Discrete-Time Speech Signal Processing: Principles and Practice, Prentice-Hall, Inc., 2002.

2. T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, 2001.

3. W.H. Press, S.A. Teukolsky, W.T. Vetterling, B.P. Flannery, Numerical Recipes in C++: The Art of Scientific Computing, 2nd Edition, Cambridge University Press, 2002.

4. S.B. Davis, P. Mermelstein, Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences, IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP-28, no. 4, pp. 357-366, Aug. 1980.


5. P. Kenny, "Joint Factor Analysis of Speaker and Session Variability: Theory and Algorithms," Tech. Report CRIM-06/08-13, 2005. Available: http://www.crim.ca/perso/patrick.kenny

6. N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," IEEE Trans. Audio, Speech, Lang. Process., vol. 19, no. 4, pp. 788-798, May 2011.