CMU Sphinx Speech Recognition Engine

Reporter: Chun-Feng Liao, NCCU Dept. of Computer Science

Intelligent Media Lab

Purposes of this project

• Find out how an efficient speech recognition engine can be implemented.

• Examine the source code of Sphinx2 to find out the role and function of each component.

• Read key chapters of Dr. Mosur K. Ravishankar’s thesis as a reference.

• Present some demo programs during the oral presentation.

Presentation Agenda

• Project Summary / Agenda / Goal (in English).

• Introduction.

• Basics of Speech Recognition.

• Architecture of CMU Sphinx.
– Acoustic Model and HMM.
– Language Model.

• Java™ Platform Issues.

• Demo.

• Conclusion.

Voice Technologies

• In the mid- to late 1990s, personal computers started to become powerful enough to support automatic speech recognition (ASR).

• The two key underlying technologies behind these advances are speech recognition (SR) and text-to-speech synthesis (TTS).

Basics of Speech Recognition

Speech Recognition

• Capturing speech (analog) signals.

• Digitizing the sound waves and converting them to basic language units, or phonemes.

• Constructing words from phonemes, and contextually analyzing the words to ensure correct spelling for words that sound alike (such as "write" and "right").

Speech Recognition Process Flow

Source: Microsoft Speech.NET Home (http://www.microsoft.com/speech/)

Recognition Process Flow Summary

• Step 1: User Input
– The system captures the user’s voice in the form of an analog acoustic signal.

• Step 2: Digitization
– The analog acoustic signal is digitized.

• Step 3: Phonetic Breakdown
– The signal is broken into phonemes.

Recognition Process Flow Summary (2)

• Step 4: Statistical Modeling
– Phonemes are mapped to their phonetic representation using a statistical model.

• Step 5: Matching
– Using the grammar, the phonetic representation, and the dictionary, the system returns an n-best list (i.e., candidate words, each with a confidence score).
– Grammar: the set of words or phrases that constrains the range of input or output in the voice application.
– Dictionary: the table that maps phonetic representations to words (e.g., "thu" and "thee" both map to "the").
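The matching step above can be illustrated with a small sketch. This is not how Sphinx implements matching; it only shows the idea of combining a dictionary, a grammar, and confidence scores into an n-best list. The names `DICTIONARY`, `GRAMMAR`, and `n_best` are hypothetical, and apart from "thu" and "thee" the entries and scores are made up.

```python
# Toy pronunciation dictionary: phonetic representation -> word.
# "thu" and "thee" come from the slide; "kat" is a made-up entry.
DICTIONARY = {"thu": "the", "thee": "the", "kat": "cat"}

# Toy grammar: the set of words the application accepts (an assumption).
GRAMMAR = {"the", "cat"}


def n_best(hypotheses, n=3):
    """Return up to n (word, confidence) pairs, best first.

    hypotheses: list of (phonetic_representation, confidence) pairs, as a
    hypothetical statistical-modeling step might produce them.
    """
    best = {}
    for phones, confidence in hypotheses:
        word = DICTIONARY.get(phones)
        if word is None or word not in GRAMMAR:
            continue  # not in the dictionary, or ruled out by the grammar
        # Keep the highest confidence seen for each word.
        best[word] = max(confidence, best.get(word, 0.0))
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]
```

For example, `n_best([("thu", 0.6), ("thee", 0.8), ("kat", 0.3)])` ranks "the" ahead of "cat", because both pronunciations of "the" collapse into a single word entry that keeps the higher score.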

Architecture of CMU Sphinx.

Introduction to CMU Sphinx

• A speech recognition system developed at Carnegie Mellon University.

• Consists of a set of libraries
– core speech recognition functions
– low-level audio capture

• Continuous speech decoding

• Speaker-independent

Brief History of CMU Sphinx

• Sphinx-I (1987)
– The world’s first speaker-independent, high-performance ASR system.
– Written in C by Dr. Kai-Fu Lee (currently Chief Technical Advisor / Vice President at Microsoft Asia).

• Sphinx-II (1992)
– Written in C by Dr. Xuedong Huang (currently head of the Microsoft Speech.NET team).
– 5-state HMM / N-gram LM.

• (We can speculate that the core technology of CMU Sphinx strongly influenced the Microsoft Speech SDK.)

Brief History of CMU Sphinx (2)

• Sphinx 3 (1996)
– Built by Eric Thayer and Mosur Ravishankar.
– Slower than Sphinx-II, but the design is more flexible.

• Sphinx 4 (originally Sphinx 3j)
– Refactored from Sphinx 3.
– Fully implemented in Java.
– Not finished yet.

Components of CMU Sphinx

Front End

• libsphinx2fe.lib / libsphinx2ad.lib

• Low-level audio access

• Continuous listening and silence filtering

• Front End API overview.
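As a rough illustration of silence filtering, the sketch below drops low-energy frames using a fixed threshold. It does not use the Sphinx2 front-end API, and the slides do not describe the actual algorithm; the function names, frame size, and threshold are assumptions chosen for the example.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame of samples."""
    return sum(s * s for s in frame) / len(frame)


def drop_silence(samples, frame_size=4, threshold=0.01):
    """Split samples into fixed-size frames and keep the non-silent ones.

    A frame counts as silence when its energy is at or below `threshold`.
    Trailing samples that do not fill a whole frame are ignored.
    """
    frames = [samples[i:i + frame_size]
              for i in range(0, len(samples) - frame_size + 1, frame_size)]
    return [f for f in frames if frame_energy(f) > threshold]
```

A real front end would typically adapt the threshold to background noise and keep a little context around speech; the fixed threshold here only shows the basic idea.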

Knowledge Base

• The data that drives the decoder.

• Three sets of data
– Acoustic Model.
– Language Model.
– Lexicon (Dictionary).

Acoustic Model

• /model/hmm/6k

• A database of statistical models.

• Each statistical model represents a phoneme.

• Acoustic models are trained by analyzing large amounts of speech data.

HMM in Acoustic Model

• HMMs represent each unit of speech in the Acoustic Model.

• A typical HMM uses 3 to 5 states to model a phoneme.

• Each HMM state is represented by a set of Gaussian mixture density functions.

• Sphinx2 default phone set.

Gaussian Mixtures

• Refer to textbook p. 33, eq. 38.

• Represent each state in an HMM.

• Each set of Gaussian mixtures is called a "senone".

• HMMs can share senones.
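A minimal sketch of how a Gaussian mixture scores an observation, and how two HMM states can share one senone. It uses one-dimensional features for readability, whereas real acoustic features are multi-dimensional vectors. The tables `SENONES` and `STATE_TO_SENONE`, the phone labels, and all parameter values are hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian at x."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def mixture_density(x, components):
    """Weighted sum of Gaussian densities.

    components: list of (weight, mean, variance); weights should sum to 1.
    """
    return sum(w * gaussian_pdf(x, m, v) for w, m, v in components)


# Hypothetical senone table: senone id -> mixture components.
SENONES = {"s1": [(0.5, 0.0, 1.0), (0.5, 2.0, 1.0)]}

# Two states of different (illustrative) phone models share senone "s1".
STATE_TO_SENONE = {("AA", 0): "s1", ("AE", 0): "s1"}


def state_density(state, x):
    """Score observation x against the senone tied to the given HMM state."""
    return mixture_density(x, SENONES[STATE_TO_SENONE[state]])
```

Because both states point to the same senone, its parameters are stored and evaluated once, which is the practical benefit of sharing.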

Language Model

• Describes what is likely to be spoken in a particular context.

• Word transitions are defined in terms of transition probabilities.

• Helps to constrain the search space.

• See examples of LM.

N-gram Language Model

• The probability of word N depends on words N-1, N-2, ...

• Bigrams and trigrams are most commonly used.

• Used for large-vocabulary applications such as dictation.

• Typically trained on a very large corpus (millions of words).
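To make the N-gram idea concrete, here is a small sketch of a bigram model estimated by simple counting (maximum likelihood, no smoothing). It is only an illustration: the function names and the `<s>` / `</s>` boundary markers are assumptions, and a production language model would add smoothing or backoff for unseen word pairs.

```python
from collections import Counter


def train_bigram(sentences):
    """Count context words and bigrams over tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        tokens = ["<s>"] + list(words) + ["</s>"]
        unigrams.update(tokens[:-1])                  # every token used as a context
        bigrams.update(zip(tokens[:-1], tokens[1:]))  # (previous word, word) pairs
    return unigrams, bigrams


def bigram_prob(prev, word, unigrams, bigrams):
    """Maximum-likelihood estimate of P(word | prev); 0.0 for unseen contexts."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]
```

With the two training sentences "i like speech" and "i like java", the estimate of P(like | i) is 1.0 and P(speech | like) is 0.5, which shows how the model constrains which word is likely to come next.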

Decoder

• Selects the next set of likely states.

• Scores incoming features against these states.

• Drops low-scoring states.

• Generates results.
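The decoder loop above can be sketched as one step of a beam-pruned Viterbi search. This is a generic illustration under stated assumptions, not the Sphinx decoder: the function name `decode_step`, its data structures, and the beam width are all hypothetical.

```python
import math


def decode_step(active, transitions, emission_logprob, beam=5.0):
    """Advance the search by one frame of features.

    active: dict mapping state -> best log score so far.
    transitions: dict mapping state -> list of (next_state, log transition prob).
    emission_logprob: function(state) -> log score of the current frame.
    beam: states scoring more than `beam` below the best are dropped.
    """
    candidates = {}
    for state, score in active.items():
        # Select the next set of likely states and score the frame against them.
        for nxt, log_trans in transitions.get(state, []):
            new_score = score + log_trans + emission_logprob(nxt)
            if new_score > candidates.get(nxt, -math.inf):
                candidates[nxt] = new_score  # keep the best path into nxt
    if not candidates:
        return {}
    # Drop low-scoring states relative to the current best.
    best = max(candidates.values())
    return {s: sc for s, sc in candidates.items() if sc >= best - beam}
```

Calling `decode_step` once per frame and reading off the best surviving state at the end yields the result; a narrower beam is faster but risks pruning the correct path.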

Speech in Java™ Platform

Sun Java Speech API

• First released on October 26, 1998.

• The Java™ Speech API allows Java applications to incorporate speech technology into their user interfaces.

• Defines a cross-platform API to support command and control recognizers, dictation systems and speech synthesizers.

Implementations of Java Speech API

• Open Source
– FreeTTS / CMU Sphinx4.

• IBM Speech for Java.

• Cloud Garden.

• L&H TTS for Java Speech API.

• Conversa Web 3.0.

FreeTTS

• Fully implemented in Java.

• Based upon Flite 1.1, a small run-time speech synthesis engine developed at CMU.

• Partial support for JSAPI 1.0.
– Speech Recognition functions.
– JSML.

Sphinx 4 (Sphinx 3j)

• Fully implemented in Java.

• Speed is equal to or faster than Sphinx 3.

• The acoustic model and language model are under construction.

• Source code is available via CVS (but you cannot run any applications without models!).

For example, to check out Sphinx 4, you can use the following command:

cvs -z3 -d:pserver:anonymous@cvs.sourceforge.net:/cvsroot/cmusphinx co sphinx4

Java™ Platform Issues

• GC makes managing data much easier.

• Native engines typically optimize inner loops for the CPU; that cannot be done on the Java platform.

• Native engines arrange data to optimize cache hits; that cannot really be done on the Java platform either.

Demo

• Sphinx-II batch mode.

• Sphinx-II live mode.

• Sphinx-II Client / Server mode.

• A simple FreeTTS application.

• (Java-based) TTS vs. (C-based) SR.

• Motion Planner with FreeTTS, using Java Web Start™ (a GRA course final project).

Summary

• Sphinx is an open source speech recognition engine developed at CMU.

• The front end (FE), knowledge base (KB), and decoder form the core of the SR system.

• The FE receives and processes the speech signal.

• The Knowledge Base provides data for the Decoder.

• The Decoder searches the states and returns the results.

• Speech recognition is a challenging problem for the Java platform.

Reference

• Mosur K. Ravishankar, Efficient Algorithms for Speech Recognition, CMU, 1996.

• Mosur K. Ravishankar, Kevin A. Lenzo, Sphinx-II User Guide, CMU, 2001.

• Xuedong Huang, Alex Acero, Hsiao-Wuen Hon, Spoken Language Processing, Prentice Hall, 2000.

Reference (on-line)

• On-line documents of Java™ Speech API
– http://java.sun.com/products/java-media/spee

• On-line documents of FreeTTS
– http://freetts.sourceforge.net/docs/

• On-line documents of Sphinx-II
– http://www.speech.cs.cmu.edu/sphinx/
