Progress of Sphinx 3.X From X=5 to X=6

Arthur Chan, Evandro Gouvea, David J. Huggins-Daines, Alex I. Rudnicky, Mosur Ravishankar, Yitao Sun


Page 1: Progress of Sphinx 3.X From X=5 to X=6

Progress of Sphinx 3.X From X=5 to X=6

Arthur Chan, Evandro Gouvea, David J. Huggins-Daines, Alex I. Rudnicky, Mosur Ravishankar, Yitao Sun

Page 2: Progress of Sphinx 3.X From X=5 to X=6

If you want to leave now…… Take-home message 1

Sphinx 3.6 Rocks!

Page 3: Progress of Sphinx 3.X From X=5 to X=6

Here is another one…… Take-home message 2

We need Better Acoustic Models.

Page 4: Progress of Sphinx 3.X From X=5 to X=6

This talk (~37 pages)

- Overview (6 pages)
- Better Software Architecture (9 pages)
- Speed of Sphinx 3.6 (3 pages)
- Accuracy Improvement (7 pages)
- Functionalities Improvement (3 pages)
- Documentation (4 pages)
- Sphinx 3.X (X>6) and Conclusion (~5 pages)
- Discussion (10 mins?)

Page 5: Progress of Sphinx 3.X From X=5 to X=6

Overview of CMU Sphinx

Page 6: Progress of Sphinx 3.X From X=5 to X=6

What is CMU Sphinx?

- Definition 1: Large-vocabulary speech recognizers with high accuracy and speed performance.
- Definition 2: A collection of tools and resources that enables developers/researchers to build successful speech recognition systems.

Page 7: Progress of Sphinx 3.X From X=5 to X=6

Family of CMU Sphinx

- Decoders
  - Sphinx {II – IV}
  - PocketSphinx (by Dave, October 2005)
- Acoustic Model Trainer
  - SphinxTrain
- Documentation
  - Hieroglyphs
  - Robust/SphinxTrain Tutorial

Page 8: Progress of Sphinx 3.X From X=5 to X=6

Sphinx Developers

- Sphinx is maintained by volunteer programmers/researchers who like speech recognition
  - Funded by different projects
  - Motivated by different reasons
- All contributions go to the same codebase
- Goal: sustainable development of Sphinx
- Sphinx Developer Meetings are held regularly (and secretly) to decide the way to go in Sphinx

Page 9: Progress of Sphinx 3.X From X=5 to X=6

What is Sphinx 3.X?

- An extension of Sphinx 3’s recognizers
- “Sphinx 3.X (X=6)” means “Sphinx 3.6”
- Provides more functionalities, such as
  - Real-time speech recognition
  - Speaker adaptation
  - Developer Application Interfaces (APIs)
  - Different search algorithms
- 3.X (X>3) is motivated by Project CALO and GALE

Page 10: Progress of Sphinx 3.X From X=5 to X=6

Development History of Sphinx 3.X

[Timeline figure; labels recovered from the diagram]

- S3: Sphinx 3 flat-lexicon recognizer (s3 slow)
- S3.2: Sphinx 3 tree-lexicon recognizer (s3 fast)
- S3.3: live-mode demo
- S3.4: fast GMM, class-based LM, dynamic LM
- S3.5: some support for speaker adaptation; live-mode APIs
- 3.6:
  - 3.X/3.0 merge
  - Better search architecture/implementation
  - More support for speaker adaptation
  - Gentle re-factoring of the code base
  - Some support for FSG decoding and confidence
  - Better documentation/tutorial
  - lm_convert (lm3g2dmp)
  - dp

Page 11: Progress of Sphinx 3.X From X=5 to X=6

This talk: Progress of Sphinx 3.6

- From the perspective of
  - a developer
  - an observer
- Sphinx 3.6
  - Where are we now?
  - Where will we go?
- Summary of 5 talks
  - http://www.cs.cmu.edu/~archan/sphinxPresentation.html

Page 12: Progress of Sphinx 3.X From X=5 to X=6

Software Architecture of Sphinx 3.X (X=6)

Page 13: Progress of Sphinx 3.X From X=5 to X=6

Motivation of Re-Architecting Sphinx 3.X

- We started to need new search algorithms
  - Developing a new search algorithm carries risk.
  - We don’t want to throw away the old one.
  - Mere replacement could cause backward compatibility problems.
- The code has grown to a stage where some changes could be very hard.
- Multiple programmers became active at the same time
  - CVS conflicts could become frequent if things are controlled by “if-else” structures

Page 14: Progress of Sphinx 3.X From X=5 to X=6

Architecture of Sphinx 3.X (X<6)

- Batch sequential architecture (Shaw 96)
- Each executable has customized sub-routines
  - decode, livepretend, decode_anytopo, align, allphone

[Diagram: four parallel, separate stacks, one per executable, numbered 1 to 4. Each stack has its own copy of every component:]

- Command Line 1, 2, 3, 4
- Initialization 1 (kb and kbcore), 2, 3, 4
- GMM Computation 1 (approx_cont_mgau); GMM Computation 2, 3, 4 (using gauden & senone, Methods 1, 2, 3)
- Search 1, 2, 3, 4
- Process Controller 1, 2, 3, 4

Page 15: Progress of Sphinx 3.X From X=5 to X=6

Architecture Diagram of Sphinx 3.6

[Layered diagram; boxes grouped by layer]

- Applications: decode, decode (anytopo), livepretend, align, allphone, dag, astar, livedecode API, User Defined Applications
- Controllers/Abstractions: Search Controller, Process Controller, Search Initializer, Command Line Processor
- Implementations: Fast Single Stream GMM Computation, Multi Stream GMM Computation, FSG Search, Flat Lexicon Search, Tree Lexicon Search
- Libraries: Dictionary Library, Search Library, LM Library, AM Library, Utility Library, Feature Library, Miscellaneous Library

Page 16: Progress of Sphinx 3.X From X=5 to X=6

Separation of Mechanism and Implementation

Search Mechanism Module (srch.c)

- A class that provides Atomic Search Operations (ASOs) in the form of function pointers
- Configured by just setting function pointers
- A single interface for applications

Search Implementation Modules (srch_????.c)

- Could have many of them
- Possibilities:
  - A. Decoding with different implementations
  - B. A concept of search that includes
    - alignment
    - phoneme recognition
    - keyword spotting
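The "configured by just setting function pointers" idea can be sketched roughly as below. This is only an illustration in Python, where plain callables stand in for C function pointers; `SearchOps`, `run_search` and the two toy implementations are hypothetical names invented for this sketch and are not taken from srch.c.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SearchOps:
    """Hypothetical table of atomic search operations (ASOs).

    In C each field would be a function pointer; here each is a callable.
    """
    init: Callable[[], dict]
    search_one_frame: Callable[[dict, float], None]
    finish: Callable[[dict], str]


def run_search(ops: SearchOps, frames: List[float]) -> str:
    """Single interface for applications: drives whichever table it is given."""
    state = ops.init()
    for frame in frames:
        ops.search_one_frame(state, frame)
    return ops.finish(state)


def make_threshold_search(threshold: float) -> SearchOps:
    """Toy implementation 1: label each frame by comparing it to a threshold."""
    def init() -> dict:
        return {"labels": []}

    def search_one_frame(state: dict, frame: float) -> None:
        state["labels"].append("speech" if frame >= threshold else "silence")

    def finish(state: dict) -> str:
        return " ".join(state["labels"])

    return SearchOps(init, search_one_frame, finish)


def make_counting_search() -> SearchOps:
    """Toy implementation 2: only count the frames it has seen."""
    def init() -> dict:
        return {"n": 0}

    def search_one_frame(state: dict, frame: float) -> None:
        state["n"] += 1

    def finish(state: dict) -> str:
        return f"{state['n']} frames"

    return SearchOps(init, search_one_frame, finish)


# Choosing an implementation is just choosing which table to pass in.
IMPLEMENTATIONS: Dict[str, SearchOps] = {
    "threshold": make_threshold_search(0.5),
    "count": make_counting_search(),
}
```

The application code (`run_search`) never changes; adding a new kind of search means adding one more table.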

Page 17: Progress of Sphinx 3.X From X=5 to X=6

Search Mechanism Module: What does it do?

[Diagram: computation of one frame; boxes grouped by stage]

- GMM Compute
  - Select active CD senones
  - Compute approximate GMM score (CI senone)
  - Compute detailed GMM score (CD senone)
- Search for One Frame
  - Compute detailed HMM score (CD)
  - Propagate graph (phone level)
  - Rescoring at word end using high-level KS (e.g. LM)
  - Propagate graph (word level)

Page 18: Progress of Sphinx 3.X From X=5 to X=6

Search Implementations

- Implemented (-op_mode)
  - Finite State Grammar Search (Mode 2)
  - Flat Lexicon Search (Mode 3)
  - Tree Search (Mode 4)
- Not in 3.6
  - Aligner (Mode 0)
  - Phoneme recognition (Mode 1)
  - A new tree search (Mode 5)

Page 19: Progress of Sphinx 3.X From X=5 to X=6

Different ways to implement search implementations

1. Use the default implementation
   - Just specify all the atomic search operations (ASOs) provided
2. Override “search_one_frame”
   - Only need to specify GMM computation and how to “search_one_frame”
3. Override the whole mechanism
   - For people who dislike the default so much
   - Override how to “search”
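The three levels of overriding can be sketched with a small class hierarchy. This is a hedged Python illustration only: the class and method names (`DefaultSearch`, `hmm_compute`, and so on) are hypothetical, loosely named after the boxes in the one-frame diagram, and do not reproduce the actual srch.c interface.

```python
from typing import List


class DefaultSearch:
    """Way 1: keep the default mechanism and supply only the ASOs."""

    def __init__(self) -> None:
        self.trace: List[str] = []  # records which operations ran, in order

    # Atomic search operations (hypothetical names).
    def gmm_compute(self, frame: int) -> None:
        self.trace.append("gmm_compute")

    def hmm_compute(self, frame: int) -> None:
        self.trace.append("hmm_compute")

    def propagate_phone_level(self, frame: int) -> None:
        self.trace.append("propagate_phone_level")

    def rescore_word_ends(self, frame: int) -> None:
        self.trace.append("rescore_word_ends")

    def propagate_word_level(self, frame: int) -> None:
        self.trace.append("propagate_word_level")

    # Default per-frame search built from the ASOs.
    def search_one_frame(self, frame: int) -> None:
        self.hmm_compute(frame)
        self.propagate_phone_level(frame)
        self.rescore_word_ends(frame)
        self.propagate_word_level(frame)

    # Default whole-utterance mechanism.
    def search(self, frames: List[int]) -> List[str]:
        for frame in frames:
            self.gmm_compute(frame)
            self.search_one_frame(frame)
        return self.trace


class FrameOverrideSearch(DefaultSearch):
    """Way 2: keep GMM computation, replace how one frame is searched."""

    def search_one_frame(self, frame: int) -> None:
        self.trace.append("custom_frame_search")


class FullOverrideSearch(DefaultSearch):
    """Way 3: replace the whole mechanism."""

    def search(self, frames: List[int]) -> List[str]:
        self.trace.append(f"custom_search({len(frames)} frames)")
        return self.trace
```

The further down the list, the less of the default mechanism is reused and the more the implementer has to write.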

Page 20: Progress of Sphinx 3.X From X=5 to X=6

Consequence of Re-factoring

- Calling decode
  - Could use flat-lexicon decoding as well
- decode_anytopo still exists
  - For backward compatibility
  - decode_anytopo = decode
- allphone, align, decode_anytopo could use fast GMM computation
- decode could use S3’s SCHMM
- Command-line is now synchronized

Page 21: Progress of Sphinx 3.X From X=5 to X=6

Summary on the Architecture of Sphinx 3.6

- A gentle re-factoring has been carried out.
- A more flexible architecture
- A better playground for AM and search people
  - S2 SCHMM computation routine?
  - NN, SVM, ML techniques for AM?

Page 22: Progress of Sphinx 3.X From X=5 to X=6

Speed of Sphinx 3.6

Page 23: Progress of Sphinx 3.X From X=5 to X=6

Speed in Sphinx 3.6

- Further work on Context-Independent Senone-based GMM Selection (CIGMMS)
  - 20-30% speed-up
- 3 tricks were proposed
  - Fixed amount of CD senone computation
  - Use of the best Gaussian index
  - Tightening factor of the CI-phone beam
- Published in “On Improvements of CI-based GMM Selection” (Chan et al. 2005)
  - but not very well received
  - Alright, there is some accuracy loss
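A rough sketch of the general idea behind CI-based GMM selection, as commonly described: score the cheap context-independent (CI) senones first, compute a context-dependent (CD) senone in detail only when its parent CI senone scores within a beam of the best CI score, and otherwise back off to the CI score. The optional cap loosely mirrors the "fixed amount of CD senone computation" trick. The function name, its parameters and the toy data layout are assumptions made for this sketch; it is not the algorithm from the paper or the Sphinx 3.6 code.

```python
from typing import Callable, Dict, Optional


def ci_gmm_selection(
    ci_scores: Dict[str, float],             # log scores of CI senones for this frame
    cd_to_ci: Dict[str, str],                # each CD senone mapped to its parent CI senone
    detailed_score: Callable[[str], float],  # expensive CD GMM evaluation
    beam: float,                             # log-domain beam width (positive)
    max_cd: Optional[int] = None,            # optional cap on detailed computations
) -> Dict[str, float]:
    best_ci = max(ci_scores.values())
    # Visit CD senones whose parent CI senone scored best first.
    ranked = sorted(cd_to_ci, key=lambda cd: ci_scores[cd_to_ci[cd]], reverse=True)

    scores: Dict[str, float] = {}
    computed = 0
    for cd in ranked:
        parent_score = ci_scores[cd_to_ci[cd]]
        within_beam = parent_score >= best_ci - beam
        under_cap = max_cd is None or computed < max_cd
        if within_beam and under_cap:
            scores[cd] = detailed_score(cd)  # full computation
            computed += 1
        else:
            scores[cd] = parent_score        # back off to the CI score
    return scores
```

A narrower beam or a smaller cap means fewer detailed evaluations, which is where both the speed-up and the accuracy loss come from.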

Page 24: Progress of Sphinx 3.X From X=5 to X=6

A note on Sphinx 3.6 Speed Performance

- Sphinx 3.X runs under 1xRT in most tasks
  - E.g. Smartnote/Sphinx integration
  - Broadcast News, UNTUNED RESULT: 1.5xRT
- Sphinx 3.X is still slower than Sphinx 2
  - Fast setup of Sphinx 2: uses a 256-codeword SCHMM
  - Fast setup of Sphinx 3: uses a 2000-6000 senone FCHMM
  - Historical note: a comparable SCHMM setup has 4096 codewords
  - Need benchmarking to truly judge
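For readers unfamiliar with the "xRT" unit: the real-time factor is decoding time divided by audio duration, so 1.5xRT means 1.5 seconds of computation per second of speech. A minimal sketch follows; the function name is made up here and the numbers in the comment are illustrative, not measurements from the talk.

```python
def real_time_factor(decode_seconds: float, audio_seconds: float) -> float:
    """Decoding time divided by audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return decode_seconds / audio_seconds


# Illustrative only: 90 s of decoding for 60 s of audio is 1.5xRT.
print(real_time_factor(90.0, 60.0))
```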

Page 25: Progress of Sphinx 3.X From X=5 to X=6

Speed - Conclusion

- Sphinx 3.X is at a reasonable level
  - Sphinx 2 should still be used in speed-critical conditions
- Further work
  - GALE/CALO will still be around in 3.6/3.7
  - Accuracy becomes a stronger motivation than speed

Page 26: Progress of Sphinx 3.X From X=5 to X=6

Accuracy Improvement During Sphinx 3.6

Page 27: Progress of Sphinx 3.X From X=5 to X=6

Our Immediate Problem

- What helps us more in accuracy?
  - Acoustic modeling?
  - Speaker adaptation?
  - Search improvement?

Page 28: Progress of Sphinx 3.X From X=5 to X=6

Accuracy Improvement of Sphinx 3.6 – Speaker Adaptation

- Speaker adaptation techniques are shown to be crucial
  - Even in tough tasks (e.g. CALO)
  - 10-15% relative improvement
  - Gain similar to LM/AM modeling work
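"Relative improvement" is usually read as the fraction of the baseline error that was removed, assuming word error rate (WER) is the metric. A small sketch of that arithmetic; the function name and the example numbers are hypothetical and are not results from the talk.

```python
def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction: the fraction of the baseline error that was removed."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers: going from 40.0% to 35.0% WER is a 12.5% relative gain,
# even though the absolute gain is only 5 points.
print(relative_improvement(40.0, 35.0))
```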

Page 29: Progress of Sphinx 3.X From X=5 to X=6

Accuracy Improvement of Sphinx 3.6 – Speaker Adaptation (cont.)

- Dave has done a great job on
  - Multiple-class MLLR
  - MAP adaptation
- Things to watch
  - Ziad’s VTLN implementation

Page 30: Progress of Sphinx 3.X From X=5 to X=6

Conclusion in Speaker Adaptation

- Observation in 3.6
  - Speaker adaptation is very important.
- What we still need:
  - Maximum likelihood linear transformation (MLLT)
  - Combination of MLLT, MLLR, MAP and VTLN
    - Proved to be additive

Page 31: Progress of Sphinx 3.X From X=5 to X=6

Accuracy Improvement of Sphinx 3.6 - Search

- Our attempts in the flat lexicon decoder
  - Full triphones
    - 2.5% rel. gain
    - But 100xRT
  - Full trigram
    - Will give another 5-10 times slowdown
- Difference between tree and flat lexicon decoders
  - 5% relative
- Conclusion: further improvement in search is limited

Page 32: Progress of Sphinx 3.X From X=5 to X=6

Accuracy Improvement in Sphinx 3.6 - Modeling

- Mainly on
  - addition of data (major contributor)
  - interpolation of LM (very decent gain)
- Things to watch: Yi’s LDA
- Yet to explore
  - Speaker Adaptive Training (SAT)
  - Semi-tied Covariance (STC) matrix
- Conclusion:
  - Commodity techniques are still not widely used in Sphinx (bad sign).
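"Interpolation of LM" is taken here to mean linear interpolation, the most common form: the combined probability of a word is a weighted sum of the probabilities given by two language models. This is an assumption about what the slide refers to, and the function below is a hypothetical sketch with made-up numbers.

```python
def interpolate(p_a: float, p_b: float, weight_a: float) -> float:
    """Linear interpolation of two LM probabilities for the same word and history."""
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must be between 0 and 1")
    return weight_a * p_a + (1.0 - weight_a) * p_b


# Illustrative only: a small in-domain LM (weight 0.25) mixed with a large general LM.
p = interpolate(p_a=0.2, p_b=0.4, weight_a=0.25)
```

In practice the weight is typically tuned on held-out text rather than fixed by hand.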

Page 33: Progress of Sphinx 3.X From X=5 to X=6

Conclusion of Accuracy Improvement in 3.6

- 3.6 has a healthy development in speaker adaptation
- Improvement in search is hard
- Need 10x effort on acoustic modeling
  - Commodity techniques are still not there
  - Three final keywords: MLLT, SAT, STC
- Priorities: Adaptation > AM, LM > 2 stage Search >> 1st Stage

Page 34: Progress of Sphinx 3.X From X=5 to X=6

Other Extensions in Sphinx 3.6

Page 35: Progress of Sphinx 3.X From X=5 to X=6

FSG search

- 3.6 supports FSG search
  - Adapted from Sphinx 2’s implementation
- Current issues
  - No lextree implementation
  - Static allocation of all HMMs; not allocated “on demand”
  - FSG transitions represented by an NxN matrix
- Other wish list
  - No histogram pruning
  - No state-based implementation
- Need more testing
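Why an NxN transition matrix is listed as an issue can be seen in a small sketch: for a grammar with N states the matrix always holds N*N cells, while an adjacency list stores only the transitions that exist. The adjacency list is shown as a common alternative, not as what Sphinx does; all names here are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Transition = Tuple[int, int, str]  # (from_state, to_state, word)


def dense_transitions(n_states: int,
                      transitions: List[Transition]) -> List[List[Optional[str]]]:
    """NxN matrix: one cell per state pair, whether or not a transition exists."""
    matrix: List[List[Optional[str]]] = [[None] * n_states for _ in range(n_states)]
    for src, dst, word in transitions:
        matrix[src][dst] = word
    return matrix


def sparse_transitions(transitions: List[Transition]) -> Dict[int, List[Tuple[int, str]]]:
    """Adjacency list: storage grows with the number of transitions only."""
    adjacency: Dict[int, List[Tuple[int, str]]] = {}
    for src, dst, word in transitions:
        adjacency.setdefault(src, []).append((dst, word))
    return adjacency


# A 3-state toy grammar accepting "hello world".
toy = [(0, 1, "hello"), (1, 2, "world")]
dense = dense_transitions(3, toy)   # 9 cells, 7 of them empty
sparse = sparse_transitions(toy)    # 2 stored transitions
```

The single-cell matrix also cannot hold two different words between the same pair of states, which the adjacency list handles naturally.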

Page 36: Progress of Sphinx 3.X From X=5 to X=6

Confidence Annotation

- conf
  - Adapted from Rong with permission
  - Computes the posterior probability of a word given a lattice
  - Still under work
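A word posterior over a lattice is typically obtained with a forward-backward pass: the posterior of an edge is the total weight of all paths through that edge divided by the total weight of all paths. The sketch below shows that textbook computation on a toy lattice with plain probabilities. It is not the conf program's method; the data layout and names are assumptions, and real systems usually work in the log domain.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Edge = Tuple[int, int, str, float]  # (from_node, to_node, word, path weight)


def edge_posteriors(edges: List[Edge], start: int,
                    end: int) -> Dict[Tuple[int, int, str], float]:
    """Posterior of each lattice edge via forward-backward.

    Assumes node ids are topologically ordered: every edge goes from a
    lower-numbered node to a higher-numbered one.
    """
    # Forward pass: total weight of all partial paths from start to each node.
    forward: Dict[int, float] = defaultdict(float)
    forward[start] = 1.0
    for src, dst, _word, weight in sorted(edges, key=lambda e: e[0]):
        forward[dst] += forward[src] * weight

    # Backward pass: total weight of all partial paths from each node to end.
    backward: Dict[int, float] = defaultdict(float)
    backward[end] = 1.0
    for src, dst, _word, weight in sorted(edges, key=lambda e: e[1], reverse=True):
        backward[src] += weight * backward[dst]

    total = forward[end]
    if total == 0.0:
        raise ValueError("no path from start to end")

    return {
        (src, dst, word): forward[src] * weight * backward[dst] / total
        for src, dst, word, weight in edges
    }
```

An edge that every path must cross gets posterior 1.0; competing edges over the same span share the probability mass in proportion to the paths that use them.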

Page 37: Progress of Sphinx 3.X From X=5 to X=6

Language Model Related

- Now fully supports
  - Text-based LM reading
  - Inter-conversion of LM in TXT & DMP format
    - lm_convert = lm3g2dmp++
  - LM switching API in live_decode_API

Page 38: Progress of Sphinx 3.X From X=5 to X=6

Documentation/Tutorial

Page 39: Progress of Sphinx 3.X From X=5 to X=6

Hieroglyphs

- A collection of documentation on using Sphinx 3, SphinxTrain and the CMU LM Toolkit
- 1st draft is completed
  - All chapters are filled with information.
  - Writing the 2nd draft
- “Chief Editor”: Arthur Chan
- Does it even exist?

Page 40: Progress of Sphinx 3.X From X=5 to X=6

Cover of Hieroglyphs

Page 41: Progress of Sphinx 3.X From X=5 to X=6

Hieroglyphs: An outline

- Chapter 1: Licensing of Sphinx, SphinxTrain and LM Toolkit
- Chapter 2: Introduction to Sphinx
- Chapter 3: Introduction to Speech Recognition
- Chapter 4: Recipe of Building Speech Application using Sphinx
- Chapter 5: Different Software Toolkits of Sphinx
- Chapter 6: Acoustic Model Training
- Chapter 7: Language Model Training
- Chapter 8: Search Structure and Speed-up of the Speech Recognizer
- Chapter 9: Speaker Adaptation
- Chapter 10: Research using Sphinx
- Chapter 11: Development using Sphinx
- Appendix A: Command Line Information
- Appendix B: FAQ

Page 42: Progress of Sphinx 3.X From X=5 to X=6

Book Reviews of Hieroglyphs

- “You wrote the worst preface I have ever seen in my life.” Dr. Evandro Gouvea
- “The content is o.k., but the writing is still ……” Prof. Alex I. Rudnicky
- “Wow, it is thick. And, oh…… there are no blank spaces! You are not supposed to add contents in any CMU open source manuals, don’t you know?” Dr. Alan W. Black

Page 43: Progress of Sphinx 3.X From X=5 to X=6

Other Documents

- Robust Tutorial (aka Sphinx 101)
  - Thanks to Evandro
  - Now could be used for
    - archive_s3
    - Sphinx 2
    - Sphinx 3
  - http://www.cs.cmu.edu/~robust/Tutorial/
- Doxygen documentation for Sphinx 3.x is fully available
  - http://www.speech.cs.cmu.edu/sphinx/sphinx3/doxygen/html/

Page 44: Progress of Sphinx 3.X From X=5 to X=6

Sphinx 3.X (X>6) and Conclusion

Page 45: Progress of Sphinx 3.X From X=5 to X=6

What is important?

- Keep the current design priorities:
  1. Accuracy
     - We are just OK and we badly need to improve it.
  2. Speed
     - We are OK and it doesn’t hurt to improve it
  3. Functionalities
     - Still a pain to use Sphinx 3, but it is constantly improving
     - Usability eventually implies distributing models.
- Accuracy should take priority over speed
  - No excuse in 3.7

Page 46: Progress of Sphinx 3.X From X=5 to X=6

Roadmap: In X=7……

- For GALE/CALO
  - Speaker Clustering/SAT
    - Bridging SI and SA
  - VTLN
  - LDA
- 0.5 x CALO may need further speed improvement
  - BBI
  - More secret ideas in GMM computation

Page 47: Progress of Sphinx 3.X From X=5 to X=6

Roadmap (cont.)

- X=8
  - D.T.
    - MMIE, MCE
  - STC
  - Interface with HTK model
- X=9
  - D.T. + S.A.
- X>10
  - Time to fire Arthur Chan and hire an assistant professor

Page 48: Progress of Sphinx 3.X From X=5 to X=6

Sphinx in Other Languages?

Page 49: Progress of Sphinx 3.X From X=5 to X=6

Other Possibilities of Sphinx? [You fill in this part]

Page 50: Progress of Sphinx 3.X From X=5 to X=6

We need your help!

- Project Manager: Enable development of Sphinx
  - Translation: Kick/fix people and be kicked/fixed by Evandro
- Developers: Incorporate state-of-the-art speech technology into Sphinx
  - Translation: Fix 1 bug and generate 5 more
- Maintainer: Ensure the integrity of Sphinx code and resources
  - Translation: You become the so-called “Grand Janitor of Sphinx”.
- Tester: Enable test-based development in Sphinx
  - Translation: You will learn a lot of Zen Buddhism.

Page 51: Progress of Sphinx 3.X From X=5 to X=6

Our Current Motto (Subject to Change)

“Don’t ever underestimate yourself…… You never know what kind of a mess you could make.”

-Dr. Evandro Gouvea

Page 52: Progress of Sphinx 3.X From X=5 to X=6

Conclusion for Sphinx 3.X

- We have done something
- We are making some sense in the system development now
- We have healthy growth in accuracy
- But we still need more

Page 53: Progress of Sphinx 3.X From X=5 to X=6

Q & A

Page 54: Progress of Sphinx 3.X From X=5 to X=6

Thank you

Acknowledgement

- Rich/Alan: for your constant encouragement
- Alex: for your understanding of Yin/Yang
- Rong: for contributing the confidence estimation program
- Bano: for reminding me I could die at any time when we were in Lake Arthur
  - -> Hieroglyphs 1st draft’s progress sped up.
- Sphinx developers: without you, I wouldn’t be the “Grand Janitor”.
- Sphinx users: for your capabilities of giving me nightmares

Page 55: Progress of Sphinx 3.X From X=5 to X=6

Reserved Slides

Page 56: Progress of Sphinx 3.X From X=5 to X=6

Pros/Cons of Batch Sequential Architecture

- Pros:
  - Great flexibility for individual programmers
    - No assumptions; data structures are usually optimized for the application.
    - Align and allphone have their own optimizations.
  - Crafting in individual applications has high quality
- Cons:
  - Great difficulty in maintenance
    - Most changes need to be carried out 5-6 times.
  - Spreads the disease of code duplication
    - Code with the same functionality was duplicated multiple times