Progress of Sphinx 3.X: From X=5 to X=6
Arthur Chan, Evandro Gouvea, David J. Huggins-Daines, Alex I. Rudnicky, Mosur Ravishankar, Yitao Sun
If you want to leave now…… Take home message 1
  Sphinx 3.6 Rocks!
Here is another one…… Take home message 2
  We need Better Acoustic Models.
This talk (~37 pages)
  Overview (6 pages)
  Better Software Architecture (9 pages)
  Speed of Sphinx 3.6 (3 pages)
  Accuracy Improvement (7 pages)
  Functionalities Improvement (3 pages)
  Documentation (4 pages)
  Sphinx 3.X (X>6) and Conclusion (~5 pages)
  Discussion (10 mins?)
Overview of CMU Sphinx
What is CMU Sphinx?
  Definition 1: Large vocabulary speech recognizers with high accuracy and speed performance.
  Definition 2: A collection of tools and resources that enables developers/researchers to build successful speech recognition systems.
Family of CMU Sphinx
  Decoders: Sphinx {II – IV}, PocketSphinx (by Dave, Oct 2005)
  Acoustic Model Trainer: SphinxTrain
  Documentation: Hieroglyphs, Robust/SphinxTrain Tutorial
Sphinx Developers
  Sphinx is maintained by volunteer programmers/researchers who like speech recognition
    Funded by different projects
    Motivated by different reasons
  All contributions go to the same codebase
  Goal: Sustainable development of Sphinx
  Sphinx Developer Meetings are held regularly (secretly) to decide the way to go in Sphinx
What is Sphinx 3.X?
  An extension of Sphinx 3’s recognizers
  “Sphinx 3.X (X=6)” means “Sphinx 3.6”
  Provides more functionalities, such as
    Real-time speech recognition
    Speaker adaptation
    Developer Application Interfaces (APIs)
    Different search algorithms
  3.X (X>3) is motivated by Projects CALO and GALE
Development History of Sphinx 3.X
S3 - Sphinx 3 flat-lexicon recognizer (s3 slow)
S3.2 - Sphinx 3 tree-lexicon recognizer (s3 fast)
S3.3 - live-mode demo
S3.4 - fast GMM, class-based LM, dynamic LM
S3.5 - some support for speaker adaptation; live-mode APIs
3.X/3.0 merge
- Better search architecture/implementation
- More support for speaker adaptation
- Gentle refactoring of the code base
- Some support for FSG decoding and confidence
- Better documentation/tutorial
lm_convert(lm3g2dmp)
dp3.6
This talk – Progress of Sphinx 3.6
  From the perspective of
    a developer
    an observer
  Sphinx 3.6: Where are we now? Where will we go?
  Summary of 5 talks: http://www.cs.cmu.edu/~archan/sphinxPresentation.html
Software Architecture of Sphinx 3.X (X=6)
Motivation of Re-Architecting Sphinx 3.X
  We started to need new search algorithms
    Developing a new search algorithm carries risk.
    We don’t want to throw away the old one.
    Mere replacement could cause backward compatibility problems.
  Code has grown to a stage where
    Some changes could be very hard.
    Multiple programmers become active at the same time
    CVS conflicts could become frequent if things are controlled by an “if-else” structure
Architecture of Sphinx 3.X (X<6)
  Batch sequential architecture (Shaw 96)
  Each executable has customized sub-routines: decode, livepretend, decode_anytopo, align, allphone
  [Diagram: four parallel stacks, numbered 1 to 4, each with its own Command Line, Initialization, GMM Computation, Search and Process Controller]
    Initialization 1 uses kb and kbcore
    GMM Computation 1 uses approx_cont_mgau
    GMM Computation 2, 3 and 4 use gauden & senone (Methods 1, 2 and 3)
Architecture Diagram of Sphinx 3.6
  [Diagram: four layers]
  Applications: decode, decode (anytopo), livepretend, align, allphone, dag, astar, livedecode API, User Defined Applications
  Controllers/Abstractions: Search Controller, Process Controller, Search Initializer, Command Line Processor
  Implementations: Fast Single Stream GMM Computation, Multi Stream GMM Computation, FSG Search, Flat Lexicon Search, Tree Lexicon Search
  Libraries: Dictionary Library, Search Library, LM Library, AM Library, Utility Library, Feature Library, Miscellaneous Library
Separation of Mechanism and Implementation
  Search Mechanism Module (srch.c)
    - A class that provides Atomic Search Operations (ASOs) in the form of function pointers
    - Configured by just setting function pointers
    - A single interface for applications
  Search Implementation Modules (srch_????.c)
    - Could have many of them
    - Possibilities:
      A, Decoding with different implementations
      B, Concept of search including alignment, phoneme recognition and keyword spotting
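The split described above can be illustrated with a small sketch. It is written in Python rather than C, and every name in it (SearchOps, run_search, peak_ops, count_ops) is hypothetical: it shows the idea of a mechanism that only knows a table of function pointers, not the actual srch.c interface.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SearchOps:
    """Table of atomic search operations, one callable per step (hypothetical)."""
    init: Callable[[], dict]              # create per-utterance state
    step: Callable[[dict, float], None]   # consume one frame
    finish: Callable[[dict], str]         # produce a result


def run_search(ops: SearchOps, frames: List[float]) -> str:
    """The single interface applications call; it only knows the table."""
    state = ops.init()
    for frame in frames:
        ops.step(state, frame)
    return ops.finish(state)


# Two toy "implementations" plugged into the same mechanism.
peak_ops = SearchOps(
    init=lambda: {"best": float("-inf")},
    step=lambda s, f: s.update(best=max(s["best"], f)),
    finish=lambda s: f"peak={s['best']}",
)
count_ops = SearchOps(
    init=lambda: {"n": 0},
    step=lambda s, f: s.update(n=s["n"] + 1),
    finish=lambda s: f"frames={s['n']}",
)

frames = [0.1, 0.7, 0.3]
print(run_search(peak_ops, frames))    # peak=0.7
print(run_search(count_ops, frames))   # frames=3
```

Swapping the implementation means passing a different table; the calling code does not change, which is the property the slide attributes to the mechanism module.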
Search Mechanism Module – What does it do?
  Computation of One Frame
  [Diagram: the steps below, grouped under “GMM Compute” and “Search For One Frame”]
    Select Active CD Senone
    Compute Approx. GMM Score (CI senone)
    Compute Detail GMM Score (CD senone)
    Compute Detail HMM Score (CD)
    Propagate Graph (Phone-Level)
    Rescoring at Word End using High-Level KS (e.g. LM)
    Propagate Graph (Word-Level)
Search Implementations
  Implemented (-op_mode)
    Finite State Grammar Search (Mode 2)
    Flat Lexicon Search (Mode 3)
    Tree Search (Mode 4)
  Not in 3.6
    Aligner (Mode 0)
    Phoneme recognition (Mode 1)
    A new tree search (Mode 5)
Different ways to implement search implementations
  1, Use the default implementation
     Just specify all the atomic search operations (ASOs) provided
  2, Override “search_one_frame”
     Only need to specify GMM computation and how to “search_one_frame”
  3, Override the whole mechanism
     For people who dislike the default so much
     Override how to “search”
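A minimal sketch of how the three options might be resolved at run time. It is an illustration in Python with hypothetical names (SearchMechanism, gmm_compute, asos); the real mechanism is C code configured through function pointers, and its signatures are not reproduced here.

```python
from typing import Callable, List, Optional, Sequence

Log = List[str]


class SearchMechanism:
    """Toy mechanism with three levels of customization (names hypothetical)."""

    def __init__(
        self,
        gmm_compute: Optional[Callable[[float], float]] = None,
        asos: Sequence[Callable[[float, Log], None]] = (),
        search_one_frame: Optional[Callable[[float, Log], None]] = None,
        search: Optional[Callable[[List[float], Log], None]] = None,
    ) -> None:
        self.gmm_compute = gmm_compute
        self.asos = list(asos)
        self.search_one_frame = search_one_frame
        self.search = search

    def run(self, frames: List[float]) -> Log:
        log: Log = []
        if self.search is not None:
            # Way 3: the implementation replaces the whole mechanism.
            self.search(frames, log)
            return log
        if self.gmm_compute is None:
            raise ValueError("ways 1 and 2 need a GMM computation")
        for frame in frames:
            score = self.gmm_compute(frame)
            if self.search_one_frame is not None:
                # Way 2: custom per-frame search after GMM computation.
                self.search_one_frame(score, log)
            else:
                # Way 1: default frame loop runs each ASO in order.
                for op in self.asos:
                    op(score, log)
        return log


def gmm(frame: float) -> float:
    return frame * 2.0


way1 = SearchMechanism(
    gmm_compute=gmm,
    asos=[lambda s, log: log.append(f"hmm({s})"),
          lambda s, log: log.append(f"propagate({s})")],
)
way2 = SearchMechanism(
    gmm_compute=gmm,
    search_one_frame=lambda s, log: log.append(f"custom_frame({s})"),
)
way3 = SearchMechanism(
    search=lambda frames, log: log.append(f"custom_search({len(frames)} frames)"),
)

print(way1.run([1.0]))        # ['hmm(2.0)', 'propagate(2.0)']
print(way2.run([1.0]))        # ['custom_frame(2.0)']
print(way3.run([1.0, 2.0]))   # ['custom_search(2 frames)']
```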
Consequence of Re-factoring
  Calling decode
    Could use flat-lexicon decoding as well
  decode_anytopo still exists
    For backward compatibility
    decode_anytopo = decode
  allphone, align, decode_anytopo could use fast GMM computation
  decode could use S3’s SCHMM
  Command-line is now synchronized
Summary on the Architecture of Sphinx 3.6
  A gentle refactoring has been carried out.
  A more flexible architecture
  A better playground for AM and search people
    S2 SCHMM computation routine?
    NN, SVM, ML techniques for AM?
Speed of Sphinx 3.6
Speed in Sphinx 3.6
  Further work on Context-Independent Senone-based GMM Selection (CIGMMS)
    20-30% speed-up
  3 tricks were proposed
    Fixed amount of CD senone computation
    Use of best Gaussian index
    Tightening factor of CI-phone beam
  Published in “On Improvements of CI-based GMM Selection” (Chan et al. 2005)
    but not very well received
    Alright, there is some accuracy loss
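The sketch below illustrates the basic idea behind CI-based GMM selection as it is commonly described: score the context-independent (CI) senones first, then compute detailed context-dependent (CD) scores only for senones whose CI parent falls within a beam of the best CI score, backing off to the CI score otherwise. It is a toy, not code from Sphinx or from the cited paper, and it does not implement the three tricks listed above. The function name, the dictionaries and the beam value are assumptions for illustration.

```python
from typing import Callable, Dict, List


def cigmms_scores(
    ci_scores: Dict[str, float],             # log scores of CI senones
    cd_to_ci: Dict[str, str],                # CD senone -> its CI parent
    detailed_score: Callable[[str], float],  # expensive CD GMM evaluation
    ci_beam: float,                          # beam width in the log domain
) -> Dict[str, float]:
    """Compute CD scores only where the CI parent is within the beam."""
    best_ci = max(ci_scores.values())
    result: Dict[str, float] = {}
    for cd, ci in cd_to_ci.items():
        if ci_scores[ci] >= best_ci - ci_beam:
            result[cd] = detailed_score(cd)  # full computation
        else:
            result[cd] = ci_scores[ci]       # back off to the CI score
    return result


# Hypothetical scores, for illustration only.
ci = {"AA": -10.0, "B": -50.0}
parents = {"AA_1": "AA", "B_1": "B"}
detailed = {"AA_1": -8.0, "B_1": -45.0}
computed: List[str] = []


def detailed_score(cd: str) -> float:
    computed.append(cd)  # record which CD senones were actually evaluated
    return detailed[cd]


scores = cigmms_scores(ci, parents, detailed_score, ci_beam=20.0)
print(scores)    # {'AA_1': -8.0, 'B_1': -50.0}
print(computed)  # ['AA_1']
```

The saving comes from the senones that are never passed to detailed_score; the accuracy loss mentioned on the slide comes from the same place, since a backed-off CI score is only an approximation.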
A note on Sphinx 3.6 Speed Performance
  Sphinx 3.X works under 1xRT in most tasks.
    E.g. Smartnote/Sphinx Integration
    Broadcast News UNTUNED RESULT: 1.5xRT
  Sphinx 3.X is still slower than Sphinx 2
    Fast setup of Sphinx 2: uses a 256-codeword SCHMM
    Fast setup of Sphinx 3: uses a 2000-6000 senone FCHMM
    Historical note: a comparable SCHMM setup has 4096 codewords
    Need benchmarking to truly judge
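For readers unfamiliar with the xRT notation: the real-time factor is processing time divided by audio duration, so values below 1 mean faster than real time. A small helper, with made-up timings that are not measurements from the talk:

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return the xRT figure: decoding time divided by audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


# Hypothetical timings, for illustration only.
print(real_time_factor(90.0, 60.0))  # 1.5, i.e. 1.5xRT
print(real_time_factor(30.0, 60.0))  # 0.5, i.e. under 1xRT
```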
Speed - Conclusion
  Sphinx 3.X is at a reasonable level
    Sphinx 2 should still be used in speed-critical conditions
  Further work
    GALE/CALO will still be around in 3.6/3.7
    Accuracy becomes a stronger motivation than speed
Accuracy Improvement During Sphinx 3.6
Our Immediate Problem
  What helps us more in accuracy?
    Acoustic modeling?
    Speaker adaptation?
    Search improvement?
Accuracy Improvement of Sphinx 3.6 – Speaker Adaptation
  Speaker adaptation techniques are shown to be crucial
    Even in tough tasks (e.g. CALO)
    10-15% relative improvement
    Gain similar to LM/AM modeling work
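A note on what “relative improvement” usually means, assuming it is measured on word error rate (WER), which the slides do not state explicitly. The WER values below are invented for the arithmetic and are not results from the talk.

```python
def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline error removed by the new system."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical: WER drops from 40.0% to 35.0%.
print(relative_improvement(40.0, 35.0))  # 0.125, i.e. 12.5% relative
```

In this made-up case the absolute gain is 5 points while the relative gain is 12.5%, which is why the two figures should not be compared directly.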
Accuracy Improvement of Sphinx 3.6 – Speaker Adaptation (cont.)
  Dave has done a great job on
    Multiple-class MLLR
    MAP adaptation
  Things to watch
    Ziad’s VTLN implementation
Conclusion in Speaker Adaptation
  Observation in 3.6
    Speaker adaptation is very important.
  What we still need:
    Maximum likelihood linear transformation (MLLT)
    Combination of MLLT, MLLR, MAP and VTLN
      Proved to be additive
Accuracy Improvement of Sphinx 3.6 - Search
  Our attempts in the Flat Lexicon Decoder
    Full triphones
      2.5% rel. gain
      But 100xRT
    Full trigram
      Will give another 5-10 times slowdown
  Difference between Tree and Flat Lexicon Decoders
    5% relative
  Conclusion: Further improvement in search is limited
Accuracy Improvement in Sphinx 3.6 - Modeling
  Mainly on
    addition of data (major contributor)
    interpolation of LM (very decent gain)
  Things to watch: Yi’s LDA
  Yet to explore
    Speaker Adaptive Training (SAT)
    Semi-tied Covariance (STC) Matrix
  Conclusion: Commodity techniques are still not widely used in Sphinx (bad sign).
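The slides do not say which form of LM interpolation was used. A minimal sketch, assuming plain linear interpolation of two unigram distributions with a single weight; the function name, vocabularies and probabilities are invented for illustration:

```python
from typing import Dict


def interpolate(p_a: Dict[str, float], p_b: Dict[str, float], lam: float) -> Dict[str, float]:
    """Linear interpolation: P(w) = lam * P_a(w) + (1 - lam) * P_b(w)."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("interpolation weight must be in [0, 1]")
    vocab = set(p_a) | set(p_b)
    # A word missing from one model contributes probability 0 from that model.
    return {w: lam * p_a.get(w, 0.0) + (1.0 - lam) * p_b.get(w, 0.0) for w in vocab}


# Hypothetical unigram models.
lm_a = {"hello": 0.5, "world": 0.5}
lm_b = {"hello": 0.25, "sphinx": 0.75}
mixed = interpolate(lm_a, lm_b, lam=0.5)
print(sorted(mixed.items()))
```

If both inputs sum to 1, the mixture does too, so no renormalization is needed. In practice the weight is typically tuned on held-out data, and n-gram models interpolate conditional probabilities rather than unigrams.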
Conclusion of Accuracy Improvement in 3.6
  3.6 has a healthy development in speaker adaptation
  Improvement in search is hard
  Need 10x effort on acoustic modeling
    Commodity techniques are still not there
    Three final keywords: MLLT, SAT, STC
  Priorities: Adaptation > AM, LM > 2nd Stage Search >> 1st Stage
Other Extensions in Sphinx 3.6
FSG search
  3.6 supports FSG search
    Adapted from Sphinx 2’s implementation
  Current issues
    No lextree implementation
    Static allocation of all HMMs; not allocated “on demand”
    FSG transitions represented by NxN matrix
  Other wish list
    No histogram pruning
    No state-based implementation
  Need more testing
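To see why an NxN transition matrix is listed as an issue, the sketch below stores the same small grammar twice: as a dense matrix and as an adjacency list. The adjacency list is one common alternative, not something the slides propose, and the grammar, function names and representation are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Arc = Tuple[int, int, str]  # (from_state, to_state, word)


def dense_transitions(n_states: int, arcs: List[Arc]) -> List[List[Optional[str]]]:
    """NxN matrix: one cell per state pair, whether or not an arc exists."""
    matrix: List[List[Optional[str]]] = [[None] * n_states for _ in range(n_states)]
    for src, dst, word in arcs:
        matrix[src][dst] = word
    return matrix


def sparse_transitions(arcs: List[Arc]) -> Dict[int, List[Tuple[int, str]]]:
    """Adjacency list: storage grows with the number of arcs only."""
    adjacency: Dict[int, List[Tuple[int, str]]] = defaultdict(list)
    for src, dst, word in arcs:
        adjacency[src].append((dst, word))
    return dict(adjacency)


# Hypothetical 4-state grammar with 3 arcs.
arcs = [(0, 1, "go"), (1, 2, "left"), (1, 3, "right")]
dense = dense_transitions(4, arcs)
sparse = sparse_transitions(arcs)

print(sum(len(row) for row in dense))            # 16 cells for 3 arcs
print(sum(len(out) for out in sparse.values()))  # 3 entries
```

The dense form costs memory quadratic in the number of states even when most state pairs have no arc, which matters as grammars grow.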
Confidence Annotation
  conf
    Adapted from Rong, with permission
    Computes the word posterior probability of a word given a lattice
    Still under work
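A word posterior on a lattice is typically obtained with a forward-backward pass: the posterior of an arc is the total weight of all paths through it divided by the total weight of all paths. The sketch below is a generic illustration of that computation, not Rong's program or the conf tool. It assumes node ids are topologically ordered, uses plain probabilities instead of log scores for readability, and all names and numbers are made up.

```python
from collections import defaultdict
from typing import DefaultDict, List, Tuple

Edge = Tuple[int, int, str, float]  # (from_node, to_node, word, weight)


def edge_posteriors(edges: List[Edge], start: int, end: int) -> List[float]:
    """Posterior of each edge, assuming from_node < to_node for every edge."""
    # Forward pass: alpha[n] = total weight of paths from start to n.
    alpha: DefaultDict[int, float] = defaultdict(float)
    alpha[start] = 1.0
    for src, dst, _, weight in sorted(edges, key=lambda e: e[0]):
        alpha[dst] += alpha[src] * weight

    # Backward pass: beta[n] = total weight of paths from n to end.
    beta: DefaultDict[int, float] = defaultdict(float)
    beta[end] = 1.0
    for src, dst, _, weight in sorted(edges, key=lambda e: e[1], reverse=True):
        beta[src] += weight * beta[dst]

    total = alpha[end]
    if total == 0.0:
        raise ValueError("no path from start to end")
    return [alpha[src] * weight * beta[dst] / total for src, dst, _, weight in edges]


# Hypothetical lattice: two competing first words, one second word.
lattice = [(0, 1, "hello", 0.3), (0, 1, "yellow", 0.1), (1, 2, "world", 0.5)]
for (_, _, word, _), post in zip(lattice, edge_posteriors(lattice, start=0, end=2)):
    print(word, round(post, 3))
```

Here "hello" gets 0.75 and "yellow" 0.25 because they compete for the same span, while "world" lies on every path and gets 1.0. A real implementation would work in the log domain to avoid underflow on long utterances.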
Language Model Related
  Now fully supports
    Text-based LM reading
    Inter-conversion of LM in TXT & DMP format
      lm_convert = lm3g2dmp++
    LM switching API in live_decode_API
Documentation/Tutorial
Hieroglyphs
  A collection of documentation on using Sphinx 3, SphinxTrain and the CMU LM Toolkit
  1st draft is completed
    All chapters are filled with information.
    Writing the 2nd draft
  “Chief Editor”: Arthur Chan
  Does it even exist?
Cover of Hieroglyphs
Hieroglyphs: An outline
  Chapter 1: Licensing of Sphinx, SphinxTrain and LM Toolkit
  Chapter 2: Introduction to Sphinx
  Chapter 3: Introduction to Speech Recognition
  Chapter 4: Recipe of Building Speech Application using Sphinx
  Chapter 5: Different Software Toolkits of Sphinx
  Chapter 6: Acoustic Model Training
  Chapter 7: Language Model Training
  Chapter 8: Search Structure and Speed-up of the Speech Recognizer
  Chapter 9: Speaker Adaptation
  Chapter 10: Research using Sphinx
  Chapter 11: Development using Sphinx
  Appendix A: Command Line Information
  Appendix B: FAQ
Book Reviews of Hieroglyphs
  “You wrote the worst preface I have ever seen in my life.”
  -Dr. Evandro Gouvea
  “The content is o.k., but the writing is still ……”
  -Prof. Alex I. Rudnicky
  “Wow, it is thick. And, oh…… there are no blank spaces! You are not supposed to add contents in any CMU open source manuals, don’t you know?”
  -Dr. Alan W. Black
Other Documents
  Robust Tutorial (aka Sphinx 101)
    Thanks to Evandro
    Now could be used for archive_s3, Sphinx 2 and Sphinx 3
    http://www.cs.cmu.edu/~robust/Tutorial/
  Doxygen documentation for Sphinx 3.x is fully available
    http://www.speech.cs.cmu.edu/sphinx/sphinx3/doxygen/html/
Sphinx 3.X (X>6) and Conclusion
What is important?
  Keep the current design priorities:
    1, Accuracy
       We are just OK and we badly need to improve it.
    2, Speed
       We are OK and it doesn’t hurt to improve it
    3, Functionalities
       Still a pain to use Sphinx 3, but it is constantly improving
       Usability eventually implies distributing models.
  Accuracy should take priority over Speed
    No excuse in 3.7
Roadmap: In X=7……
  For GALE/CALO
    Speaker Clustering/SAT
      Bridging SI and SA
    VTLN
    LDA
  0.5 x CALO may need further speed improvement
    BBI
    More secret ideas in GMM computation
Roadmap (cont.)
  X=8
    D.T. MMIE, MCE
    STC
    Interface with HTK model
  X=9
    D.T. + S.A.
  X>10
    Time to fire Arthur Chan and hire an assistant professor
Sphinx in Other Languages?
Other Possibilities of Sphinx? [You fill in this part]
We need your help!
  Project Manager: Enable development of Sphinx
    Translation: Kick/Fix people and be Kicked/Fixed by Evandro
  Developers: Incorporate state-of-the-art speech technology into Sphinx
    Translation: Fix 1 bug and generate 5 more
  Maintainer: Ensure integrity of Sphinx code and resources
    Translation: You become the so-called “Grand Janitor of Sphinx”.
  Tester: Enable test-based development in Sphinx
    Translation: You will learn a lot of Zen Buddhism.
Our Current Motto (Subject to Change)
“Don’t ever underestimate yourself…… You never know what kind of a mess you could make.”
-Dr. Evandro Gouvea
Conclusion for Sphinx 3.X
  We have done something
    We are making some sense in the system development now
    We have healthy growth in accuracy
  But we still need more
Q & A
Thank you
Acknowledgement
  Rich/Alan: for your constant encouragement
  Alex: for your understanding of Yin/Yang
  Rong: for contributing the confidence estimation program
  Bano: for reminding me I could die at any time when we were in Lake Arthur -> Hieroglyphs 1st draft’s progress sped up.
  Sphinx developers: without you, I wouldn’t be the “Grand Janitor”.
  Sphinx users: for your capability of giving me nightmares
Reserved Slides
Pros/Cons of Batch Sequential Architecture
  Pros:
    Great flexibility for individual programmers
      No assumptions; data structures are usually optimized for the application.
      Align and allphone have optimizations.
    Crafting in individual applications has high quality
  Cons:
    Great difficulty in maintenance
      Most changes need to be carried out 5-6 times.
    Spreads the disease of code duplication
      Code with the same functionality was duplicated multiple times