
Toolkits for ASR; Sphinx

Samudravijaya K

08-MAR-2011

Workshop on Fundamentals of Automatic Speech Recognition
CDAC Noida, 08-MAR-2011


A Block Diagram of an ASR System

[Figure: block diagram. Labels: Signal; Feature Extraction; Acoustic Model; Matching (acoustic domain); Symbol sequence; Language Model; Matching (symbolic domain); Sentence Hypothesis; Training; Testing.]


Hierarchy of Units in an Utterance

source: “state of art ASR” by Steve Young, 2000


Sentence HMM is composed of Phone HMMs




Toolkits for Automatic Speech Recognition

(1) Training, (2) Testing, (3) Performance Evaluation

There are several public domain toolkits that help to build an ASR system:

• HTK: Hidden Markov Model ToolKit [1]. Public domain, but the decoder cannot be distributed (C).

• Sphinxes [2]: Open source (C, C++, Java).

• ISIP Production system [3]. Public domain (without any restrictions) (C++).

• Julius Open-Source Large Vocabulary CSR Engine [4]. It uses Acoustic Models in HTK format, and Grammar files in its own format. Open license (no limitations on distribution) (C++).

• HMM toolbox for Matlab [5]. Useful for Isolated Word Recognition.


What is CMU Sphinx?

According to Arthur Chan (the editor of Hieroglyphs [6], the Sphinx manual in book form), there are two definitions of Sphinx:

• A large vocabulary speech recognizer with high accuracy and speed performance.

• A collection of tools and resources that enables developers/researchers to build successful speech recognizers.


Pocketsphinx

source: “SphinxLunch20041021.ppt” by Arthur Chan, 2004




Language model training

source: “state of art ASR” by Steve Young, 2000


CMU-Cambridge SLM toolkit


Lexicon (Pronunciation Dictionary)

source: “Ph.D. thesis” of Ravi Shankar M., CMU [7]


source: “http://speech.tifr.res.in/resources/data/labelSetASR100815.pdf”


source: “www.liacs.nl/ erwin/SR2003/Sphinx.ppt”


Feature Extraction (Frontend processing)

* The wave2feat program computes 13 MFCCs from speech files stored in any of the wav, NIST or raw formats.
* Caution: use the -dither yes option. Excise long silences.
* cepView s0001.cep prints the cepstral coefficients.

source: “Ph.D. thesis” of Ravi Shankar M., CMU [7].


SphinxTrain: Training sub-word HMMs

Stages of training (Reference: http://www.speech.cs.cmu.edu/sphinxman/fr4.html):

1 Training context Independent phone HMMs

2 Training context Dependent phone HMMs

3 Decision tree building

4 Training context Dependent tied phone HMMs

5 Recursive Gaussian splitting




Training Context Independent phone HMMs

2 steps: Initialization and Embedded re-estimation.

Inputs:
* Feature vector sequences
* Word-level transcriptions
* Pronunciation dictionary

(I) Initialization:

1 Make a prototype HMM (5-state, left-to-right, skipping 1 state permitted); copy it to all phone HMMs.

2 Compute the mean and variance of all training feature vectors.

3 Initialise the Gaussians of all states of the phone HMMs with the global mean and variance.

4 For every utterance, generate phone-level transcriptions from the word-level transcriptions using the pronunciation dictionary.
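The initialization steps above can be sketched in Python. This is only an illustration under stated assumptions, not SphinxTrain code: the function names, the equal initial transition probabilities, and the dictionary layout of a model are all hypothetical choices made for the example.

```python
def prototype_transitions(num_states=5, allow_skip=True):
    """Left-to-right transition matrix: self-loop, next state and,
    optionally, a skip over one state. Equal initial probabilities are
    an assumption; the exit transition out of the last state is omitted."""
    max_jump = 2 if allow_skip else 1
    matrix = []
    for i in range(num_states):
        row = [0.0] * num_states
        targets = [j for j in range(i, i + max_jump + 1) if j < num_states]
        for j in targets:
            row[j] = 1.0 / len(targets)
        matrix.append(row)
    return matrix


def global_mean_variance(utterances):
    """Per-dimension mean and variance over all frames of all utterances."""
    frames = [frame for utt in utterances for frame in utt]
    n, dim = len(frames), len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    var = [sum((f[d] - mean[d]) ** 2 for f in frames) / n for d in range(dim)]
    return mean, var


def flat_start(phones, utterances, num_states=5):
    """Copy the prototype to every phone; all states share the global statistics."""
    mean, var = global_mean_variance(utterances)
    trans = prototype_transitions(num_states)
    return {
        phone: {
            "trans": [row[:] for row in trans],
            "means": [mean[:] for _ in range(num_states)],
            "vars": [var[:] for _ in range(num_states)],
        }
        for phone in phones
    }
```

After this step every state of every phone HMM is identical; the models only become distinct during re-estimation.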




Training subword HMMs

An iterative algorithm (Baum-Welch, also known as Forward-Backward) is used. The Maximum Likelihood approach guarantees that the likelihood of the training data given the trained model does not decrease from one iteration to the next. To begin with, an initial estimate of the HMM parameters (A, B, π) is required.

Q: How to get an initial estimate of λ = {A, B, π}?

A: We can estimate the parameters if we know the boundaries of every subword HMM in the training utterances.

Practical solution: Assume that the durations of all units (phones) are equal. If there are N phones in a training utterance, divide the feature vector sequence into N equal parts. Assign each part to a phoneme in the phoneme sequence corresponding to the transcription of the utterance. Repeat for all training utterances.




Initial estimation of HMM parameters: an illustration

Let the transcription of the 1st wave file be the following sequence of words: mera bhaarat mahaan

Let the relevant lines in the dictionary be as follows:
bhaarata bh aa r a t
mahaana m a h aa n
mera m e r aa

The phoneme HMM sequence (of length 16) corresponding to this sentence is
sil m e r aa bh aa r a t m a h aa n sil

If the duration of the wave file is 1.0 sec, there will be 98 feature vectors (frame shift = 10 msec and frame size = 25 msec).

Assign the first 6 feature vectors to the “sil” HMM; the next 6 (7 through 12) to “m”; the next 6 (13 through 18) to “e”; ... ; the last 8 feature vectors to “sil”. If each HMM has 3 states, assign 2 feature vectors to each state; compute the mean and SD. Assume a_ij = 0.5 if j = i or j = i+1; else assign 0.
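The arithmetic of this illustration can be reproduced with a short Python sketch. The function names are hypothetical, and the dictionary keys are assumed to be spelled exactly as the words in the transcription (the slide lists them with a trailing "a"). Giving the leftover frames to the last phone follows the example above.

```python
def num_frames(duration_ms, frame_size_ms=25, frame_shift_ms=10):
    """Number of full analysis frames that fit into the signal."""
    if duration_ms < frame_size_ms:
        return 0
    return (duration_ms - frame_size_ms) // frame_shift_ms + 1


def phone_sequence(words, dictionary, silence="sil"):
    """Expand words into phones, with silence at both ends."""
    phones = [silence]
    for word in words:
        phones.extend(dictionary[word])
    phones.append(silence)
    return phones


def uniform_segments(n_frames, phones):
    """Give every phone an equal share of frames; the last phone absorbs
    the remainder. Returns (phone, first, last), 1-based and inclusive."""
    share = n_frames // len(phones)
    if share == 0:
        raise ValueError("fewer frames than phones")
    segments, start = [], 1
    for k, phone in enumerate(phones):
        end = n_frames if k == len(phones) - 1 else start + share - 1
        segments.append((phone, start, end))
        start = end + 1
    return segments


# Assumed spelling of the dictionary keys (see the note above).
dictionary = {
    "mera": ["m", "e", "r", "aa"],
    "bhaarat": ["bh", "aa", "r", "a", "t"],
    "mahaan": ["m", "a", "h", "aa", "n"],
}
phones = phone_sequence("mera bhaarat mahaan".split(), dictionary)
segments = uniform_segments(num_frames(1000), phones)
```

With these inputs, `segments` starts with `("sil", 1, 6)`, `("m", 7, 12)`, `("e", 13, 18)` and ends with `("sil", 91, 98)`.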




Embedded Re-estimation

(II) Embedded re-estimation:

1 For each utterance, do the following:
• Using the phone-level transcriptions, compose a sentence HMM out of phone HMMs.
• Forward-Backward algorithm: compute the likelihood of each feature vector being generated by each state of each phone HMM in the sentence HMM.
• Accumulate likelihoods of feature vectors being generated by each state.

2 For each state: re-estimate HMM parameters using the accumulated likelihoods.

Repeat the Embedded Re-estimation a few times.
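One pass of the procedure above can be sketched for a single utterance as follows. This is a toy illustration, not SphinxTrain's implementation: it assumes one-dimensional features, one Gaussian per state, a path that starts in the first state and ends in the last state of the sentence HMM, and it re-estimates only the means. All function names are hypothetical.

```python
import math


def gaussian(x, mean, var):
    """Density of a one-dimensional Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def state_occupancies(obs, trans, means, variances):
    """Scaled forward-backward pass.
    Returns gamma[t][s] = P(state s at time t | all observations)."""
    T, S = len(obs), len(means)
    b = [[gaussian(obs[t], means[s], variances[s]) for s in range(S)]
         for t in range(T)]

    # Forward pass, rescaled at every frame to avoid underflow.
    alpha = [[0.0] * S for _ in range(T)]
    scale = [0.0] * T
    alpha[0][0] = b[0][0]                      # path must start in state 0
    scale[0] = sum(alpha[0])
    alpha[0] = [a / scale[0] for a in alpha[0]]
    for t in range(1, T):
        for j in range(S):
            alpha[t][j] = b[t][j] * sum(alpha[t - 1][i] * trans[i][j]
                                        for i in range(S))
        scale[t] = sum(alpha[t])
        alpha[t] = [a / scale[t] for a in alpha[t]]

    # Backward pass, using the same scale factors.
    beta = [[0.0] * S for _ in range(T)]
    beta[T - 1][S - 1] = 1.0                   # path must end in the last state
    for t in range(T - 2, -1, -1):
        for i in range(S):
            beta[t][i] = sum(trans[i][j] * b[t + 1][j] * beta[t + 1][j]
                             for j in range(S)) / scale[t + 1]

    gamma = []
    for t in range(T):
        row = [alpha[t][s] * beta[t][s] for s in range(S)]
        total = sum(row)
        gamma.append([g / total for g in row])
    return gamma


def reestimate_means(obs, gamma):
    """New mean of each state: occupancy-weighted average of the features."""
    S = len(gamma[0])
    return [sum(g[s] * x for g, x in zip(gamma, obs)) / sum(g[s] for g in gamma)
            for s in range(S)]
```

In full training the numerators and denominators would be accumulated over all utterances before the division, which is what step 2 describes.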


Training Context Dependent phone HMMs

1 Initialise N³ triphone models, where N is the number of phones.

2 Compose the sentence HMM out of triphone (CD) models instead of monophone (CI) models.

3 Carry out the Embedded Re-estimation for a few iterations.

The sequence of CI HMMs was
sil m e r aa bh aa r a t m a h aa n sil
The sequence of CD HMMs (triphones) is
sil sil-m+e m-e+r e-r+aa r-aa+bh ...
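The mapping from the CI sequence to the CD sequence above can be sketched in a few lines of Python. The function name is hypothetical. The sketch assumes that silence stays context independent and that contexts extend across word boundaries, as the example suggests; the left-phone, minus, centre-phone, plus, right-phone notation is the one used on the slide.

```python
def to_triphones(phones, silence="sil"):
    """Rewrite each phone as left-centre+right; silence and the
    utterance-initial/final units are left context independent."""
    triphones = []
    last = len(phones) - 1
    for k, phone in enumerate(phones):
        if phone == silence or k == 0 or k == last:
            triphones.append(phone)
        else:
            triphones.append(f"{phones[k - 1]}-{phone}+{phones[k + 1]}")
    return triphones


ci_sequence = "sil m e r aa bh aa r a t m a h aa n sil".split()
cd_sequence = to_triphones(ci_sequence)
```

Here `cd_sequence` begins with `sil`, `sil-m+e`, `m-e+r`, `e-r+aa`, `r-aa+bh` and ends with `aa-n+sil`, `sil`.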








If N = 50 and each HMM has 3 states, there may be up to 50³ × 3 = 375,000 states. Each state is associated with one Gaussian. A huge amount of speech data is needed for robust estimation of the parameters (µ, Σ) of 375,000 Gaussians!

Reduce the number of states by state-tying; use Decision Trees.
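The size of the problem can be checked with a small calculation. The state count follows directly from the figures above; the parameter count additionally assumes 39-dimensional features (13 MFCCs with first and second differences) and diagonal covariances, neither of which is stated on the slide.

```python
def triphone_state_count(num_phones, states_per_hmm):
    """Upper bound on untied states: one HMM per (left, centre, right) triple."""
    return num_phones ** 3 * states_per_hmm


def gaussian_parameter_count(num_states, feature_dim):
    """One Gaussian per state; a mean and a diagonal variance per dimension."""
    return num_states * 2 * feature_dim


states = triphone_state_count(50, 3)             # 375,000 states
params = gaussian_parameter_count(states, 39)    # assumed 39-dim features
```

Under these assumptions there would be about 29 million free parameters before any tying.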

Training Context Dependent tied phone HMMs




* Build Decision Trees for parameter sharing.
* One decision tree is built for each state position (5 decision trees if the HMMs have 5 emitting states).

The first step is to generate Linguistic Questions. Two methods:

1 Manually create linguistic questions using phonetic knowledge.

2 Run the make_quests program to automatically form phone groups.

The first few lines of a “linguistic-questions” file may look like this.

SIL sil h s sh
VOWELS a aa i ii u uu e ee o oo
NASAL m n ng
LABPLO p ph b bh


Decision trees are used to decide which of the HMM states of all the triphones (seen and unseen) are similar to each other, so that data from all these states are collected together and used to train one global state, which is called a senone (also called a tied state). Example: Left states of 1st and 3rd triphones above would be similar.
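Each node of a decision tree asks a linguistic question such as "does the left context belong to the NASAL group?" and splits the triphone states accordingly. The Python sketch below illustrates one such split. It assumes the file layout shown above (a group name followed by its phones on each line); the function names are hypothetical and the choice of the best question, which a real tree builder makes from the training data, is not shown.

```python
QUESTIONS_TEXT = """\
SIL sil h s sh
VOWELS a aa i ii u uu e ee o oo
NASAL m n ng
LABPLO p ph b bh
"""


def parse_questions(text):
    """Map each question (group) name to the set of phones it contains."""
    questions = {}
    for line in text.splitlines():
        fields = line.split()
        if fields:
            questions[fields[0]] = set(fields[1:])
    return questions


def split_by_left_context(triphones, phone_group):
    """Partition triphones written as left-centre+right by whether
    the left context phone belongs to the given group."""
    yes, no = [], []
    for triphone in triphones:
        left = triphone.split("-", 1)[0]
        (yes if left in phone_group else no).append(triphone)
    return yes, no


questions = parse_questions(QUESTIONS_TEXT)
nasal_left, other_left = split_by_left_context(
    ["m-e+r", "e-r+aa", "sil-m+e", "n-a+t"], questions["NASAL"])
```

States that end up in the same leaf after a sequence of such splits share one senone.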



1 Prune the Decision trees so that the number of senones (tied states) is commensurate with the amount of training data.

2 Create the CD tied model definition file that (a) lists all triphones seen during training, and (b) identifies the states of these triphones with senones from the pruned trees (state-senone mapping).

3 Carry out the Embedded Re-estimation (tied CD models) for a few iterations.

4 Generate Gaussian mixtures for each senone (tied state) and re-train. Repeat this step till the desired number (say 8) of mixture components is created for each GMM (senone).

5 One can carry out discriminative training following the Maximum Mutual Information Estimation scheme (maximises the posterior probability of the correct word sequence relative to all possible word sequences) [9].
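Step 4 can be pictured as repeated doubling: 1, 2, 4, 8 Gaussians per senone, with re-training after each split. The sketch below uses a common splitting heuristic, shifting the means of the two copies by a fraction of the standard deviation in opposite directions. That heuristic, the value of `epsilon` and the function names are assumptions for illustration; SphinxTrain's own procedure may differ.

```python
def split_mixture(weights, means, variances, epsilon=0.2):
    """Replace every Gaussian by two copies with halved weight and means
    shifted by +/- epsilon standard deviations (diagonal covariance)."""
    new_w, new_m, new_v = [], [], []
    for w, m, v in zip(weights, means, variances):
        offset = [epsilon * (x ** 0.5) for x in v]
        new_w += [w / 2.0, w / 2.0]
        new_m += [[a + o for a, o in zip(m, offset)],
                  [a - o for a, o in zip(m, offset)]]
        new_v += [v[:], v[:]]
    return new_w, new_m, new_v


def grow_mixture(mean, variance, target=8, retrain=None):
    """Start from one Gaussian and double until `target` components exist.
    `target` is assumed to be a power of two. `retrain` stands in for the
    embedded re-estimation that would follow every split."""
    weights, means, variances = [1.0], [mean[:]], [variance[:]]
    while len(weights) < target:
        weights, means, variances = split_mixture(weights, means, variances)
        if retrain is not None:
            weights, means, variances = retrain(weights, means, variances)
    return weights, means, variances
```

Without re-training the copies stay symmetric around the original mean; it is the re-estimation between splits that lets the components move to fit the data.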




Inputs to sphinx3 decoder

source: “www.liacs.nl/ erwin/SR2003/Sphinx.ppt”

Sphinx3 decoders



Output of recogniser



source: “SphinxLunch20041021.ppt” by Arthur Chan, 2004

Sphinx4

Sphinx-4 is a state-of-the-art speech recognition system written entirely in the Java programming language [10].

• Generalized pluggable front end architecture: MFCC, CMN

• Generalized pluggable language model architecture: trigram, JSGF and ARPA-format FST grammars.

• Generalized acoustic model architecture: Sphinx-3 acoustic models.

• Generalized search management: breadth first and word pruning

• Post-processing recognition results: obtaining confidence scores, generating lattices.

• Standalone tools: displaying waveforms and spectrograms; generating features from audio.


Comparison of Performance of Sphinxes

source: [10].

PocketSphinx [11]: It is a small-footprint continuous speech recognition system, suitable for handheld and desktop applications.


Sphinx, the eternal mystery

source: [10].


Bibliography

[1] Cambridge University, UK; Entropic; Microsoft. “HTK, Hidden Markov Model ToolKit”. http://htk.eng.cam.ac.uk/

[2] Project by Carnegie Mellon University. “The CMU Sphinx group open source speech recognition engines”. http://cmusphinx.sourceforge.net/html/cmusphinx.php

[3] Joe Picone et al. “ISIP Production system” (r02 n02) (23-JUL-2009). http://www.isip.piconepress.com/projects/speech/software/

[4] Japanese Universities and Laboratories. “Open-Source Large Vocabulary CSR Engine: Julius”. http://julius.sourceforge.jp/en/

[5] Kevin Murphy. “HMM toolbox for Matlab”. http://www.cs.ubc.ca/ murphyk/Software/HMM/hmm.html

[6] Arthur Chan. “Hieroglyphs: Building Speech Application Using Sphinx and Related Resources”, (3rd Draft) 11-MAR-2007. http://www.cs.cmu.edu/archan/sphinxDoc.html

[7] Ravishankar M. “Efficient Algorithms for Speech Recognition”. Ph.D Thesis, Carnegie Mellon University, May 1996, Tech Report CMU-CS-96-143. http://www.cs.cmu.edu/ rkm/th/th.pdf

[8] Cambridge University, UK; Entropic; Microsoft. “HTK Book”, Documentation of HTK. http://htk.eng.cam.ac.uk/docs/docs.shtml

[9] L Qin and A Rudnicky. “Implementing and Improving MMIE Training in SphinxTrain”. CMU Sphinx Workshop 2010, 13 March 2010, Dallas, USA. http://www.cs.cmu.edu/ sphinx/Sphinx2010/papers/107.unblinded.p

[10] Bhiksharaj et al. A speech recognizer written entirely in the Java programming language. http://cmusphinx.sourceforge.net/sphinx4/

[11] A small-footprint continuous speech recognition system. http://cmusphinx.sourceforge.net/2010/03/pocketsphinx-0-6-release/

