Introduction to Speech Signal Processing
Dr. Zhang Sen
Chinese Academy of Sciences, Beijing, China
2014/9/22
• Introduction
  – Sampling and quantization
  – Speech coding
• Features and Analysis
  – Main features
  – Some transformations
• Speech-to-Text
  – State of the art
  – Main approaches
• Text-to-Speech
  – State of the art
  – Main approaches
• Applications
  – Human-machine dialogue systems
• Some useful websites for ASR tools
  – http://htk.eng.cam.ac.uk
    • Free, available since 2000, relation with MS
    • Over 12,000 users; versions 2.1, 3.0, 3.1, 3.2
    • Includes source code and the HTK books
    • A set of tools for training, decoding, and evaluation
    • Steve Young, Cambridge University
  – http://www.cs.cmu.edu
    • Free for research and education
    • Sphinx 2 and 3
    • Tools, source code, speech databases
    • Reddy, CMU
Research on speech recognition in the world
• Carnegie Mellon University
  – CMU SCS Speech Group
  – Interact Lab
• Oregon Graduate Institute
  – Center for Spoken Language Understanding
• MIT
  – Lab for Computer Science, Spoken Language Systems
  – Acoustics & Vibration Lab
  – AI Lab
  – Lincoln Lab, Speech Systems Technology Group
• Stanford University
  – Center for Computer Research in Music and Acoustics
  – Center for the Study of Language and Information
• University of California
  – Berkeley, Santa Cruz, Los Angeles
• Boston University
  – Signal Processing and Interpretation Lab
• Georgia Institute of Technology
  – Digital Signal Processing Lab
• Johns Hopkins University
  – Center for Language and Speech Processing
• Brown University
  – Lab for Engineering Man-Machine Systems
• Mississippi State University
• Colorado University
• Cornell University
• Cambridge University
  – Speech, Vision and Robotics Group
• Edinburgh University
  – Human Communication Research Centre
  – Centre for Speech Technology Research
• University College London
  – Phonetics and Linguistics
• University of Essex
  – Dept. of Language and Linguistics
• LIMSI, France
• INRIA
  – Institut National de Recherche en Informatique et en Automatique
• University of Karlsruhe, Germany
  – Interactive Systems Lab
• DFKI
  – German Research Center for Artificial Intelligence
• KTH, Speech Communication & Music Acoustics
• CSELT, Italy
  – Centro Studi e Laboratori Telecomunicazioni, Torino
• IRST
  – Istituto per la Ricerca Scientifica e Tecnologica, Trento
• ATR, Japan
• AT&T, Advanced Speech Products Group
• Lucent Technologies, Bell Laboratories
• IBM, IBM VoiceType
• Texas Instruments Incorporated
• National Institute of Standards and Technology
• Apple Computer Co.
• Digital Equipment Corporation (DEC)
• SRI International
• Dragon Systems Co.
• Sun Microsystems Labs, speech applications
• Microsoft Corporation, speech technology (SAPI)
• Entropic Research Laboratory, Inc.
• Important conferences and journals
  – IEEE Trans. on ASSP
  – ICASSP (every year)
  – EUROSPEECH (every odd year)
  – ICSLP (every even year)
  – STAR: Speech Technology and Research at SRI
Brief history and state-of-the-art of the research on speech recognition
ASR Progress Overview
• 50's: isolated digit recognition (Bell Labs)
• 60's: hardware speech segmenters (Japan); dynamic programming (U.S.S.R.)
• 70's: clustering algorithms (speaker independence); DTW
• 80's: HMM, DARPA, SPHINX
• 90's: adaptation, robustness
1952 Bell Labs Digits
• First word (digit) recognizer
• Approximates energy in formants (vocal tract resonances) over the word
• Already has some robust ideas (insensitive to amplitude, timing variation)
• Worked very well
• Main weakness was technological (resistors and capacitors)
The 60’s
• Better digit recognition
• Breakthroughs: spectrum estimation (FFT, cepstra, LPC), dynamic time warping (DTW), and hidden Markov model (HMM) theory
• Hardware speech segmenters (Japan)
1971-76 ARPA Project
• Focus on speech understanding
• Main work at 3 sites: System Development Corporation, CMU, and BBN
• Other work at Lincoln, SRI, Berkeley
• Goal was 1000-word ASR, a few speakers, connected speech, constrained grammar, less than 10% semantic error
Results
• Only CMU's Harpy fulfilled the goals - used LPC, segments, lots of high-level knowledge; learned from Dragon* (Baker)

* The CMU system done in the early '70s, as opposed to the company formed in the '80s
Achieved by 1976
• Spectral and cepstral features, LPC
• Some work with phonetic features
• Incorporating syntax and semantics
• Initial Neural Network approaches
• DTW-based systems (many)
• HMM-based systems (Dragon, IBM)
Dynamic Time Warp
• Optimal time normalization with dynamic programming
• Proposed by Sakoe and Chiba, circa 1970
• Similar proposal around the same time by Itakura
• Probably Vintsyuk was first (1968)
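The alignment idea can be sketched in a few lines of Python. This is a generic textbook DTW over 1-D sequences with unit-weight steps, not the exact Sakoe-Chiba formulation (which adds slope constraints and band windows); the inputs are invented toy sequences:

```python
import numpy as np

def dtw(x, y, dist=lambda a, b: abs(a - b)):
    """Dynamic time warping distance between two 1-D sequences."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(x[i - 1], y[j - 1])
            # allow match, insertion, and deletion steps
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m]

# identical sequences align with zero cost
print(dtw([1, 2, 3], [1, 2, 3]))        # 0.0
# a time-stretched copy also aligns with zero cost
print(dtw([1, 2, 3], [1, 1, 2, 2, 3]))  # 0.0
```

The warping is what makes template matching tolerant of the rate variation mentioned on the earlier slides.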
HMMs for Speech
• Math from Baum and others, 1966-1972
• Applied to speech by Baker in the original CMU Dragon system (1974)
• Developed by IBM (Baker, Jelinek, Bahl, Mercer, …) (1970-1993)
• Extended by others in the mid-1980’s
The 1980’s
• Collection of large standard corpora
• Front ends: auditory models, dynamics
• Engineering: scaling to large-vocabulary continuous speech
• Second major (D)ARPA ASR project
• HMMs become ready for prime time
Standard Corpora Collection
• Before 1984, chaos
• TIMIT
• RM (later WSJ)
• ATIS
• NIST, ARPA, LDC
Front Ends in the 1980’s
• Mel cepstrum (Bridle, Mermelstein)
• PLP (Hermansky)
• Delta cepstrum (Furui)
• Auditory models (Seneff, Ghitza, others)
Dynamic Speech Features
• Temporal dynamics are useful for ASR
• Local time derivatives of cepstra
• “Delta” features estimated over multiple frames (typically 5)
• Usually augment the static features
• Can be viewed as a temporal filter
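The regression-style delta estimate over a 5-frame window can be sketched as follows; this assumes HTK-style edge padding, and the toy input is invented:

```python
import numpy as np

def delta(features, window=2):
    """Regression-based delta features over 2*window+1 frames.

    features: (T, D) array of static features (e.g. cepstra).
    window:   frames on each side (window=2 -> the typical 5-frame window).
    """
    T = len(features)
    # repeat the first/last frame at the edges, as HTK does
    padded = np.pad(features, ((window, window), (0, 0)), mode="edge")
    denom = 2 * sum(k * k for k in range(1, window + 1))
    # d_t = sum_k k * (c_{t+k} - c_{t-k}) / (2 * sum_k k^2)
    return np.stack([
        sum(k * (padded[t + window + k] - padded[t + window - k])
            for k in range(1, window + 1)) / denom
        for t in range(T)
    ])

static = np.array([[0.0], [1.0], [2.0], [3.0]])
print(delta(static).ravel())  # edge padding tapers the slope: [0.5 0.8 0.8 0.5]
```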
HMMs for Continuous Speech
• Using dynamic programming for continuous speech (Vintsyuk, Bridle, Sakoe, Ney, …)
• Application of Baker-Jelinek ideas to continuous speech (IBM, BBN, Philips, …)
• Multiple groups developing major HMM systems (CMU, SRI, Lincoln, BBN, ATT)
• Engineering development - coping with data, fast computers
2nd (D)ARPA Project
• Common task
• Frequent evaluations
• Convergence to good, but similar, systems
• Lots of engineering development - now up to 60,000-word recognition, in real time, on a workstation, with less than 10% word error
• Competition inspired others not in the project - Cambridge did HTK, now widely distributed
Some 1990’s Issues
• Independence to long-term spectrum
• Adaptation
• Effects of spontaneous speech
• Information retrieval/extraction with broadcast material
• Query-style systems (e.g., ATIS)
• Applying ASR technology to related areas (language ID, speaker verification)
Real Uses
• Telephone: phone company services (collect versus credit card)
• Telephone: call centers for query information (e.g., stock quotes, parcel tracking)
• Dictation products: continuous recognition, speaker dependent/adaptive
State-of-the-art of ASR
• Tremendous technical advances in the last few years
• From small to large vocabularies
  – 5,000-10,000 word vocabulary
  – 10,000-60,000 word vocabulary
• From isolated words to spontaneous talk
  – Continuous speech recognition
  – Conversational and spontaneous speech recognition
• From speaker-dependent to speaker-independent
  – Modern ASR is fully speaker independent
SOTA ASR Systems
• IBM, ViaVoice
  – Speaker-independent, continuous command recognition
  – Large-vocabulary recognition
  – Text-to-speech confirmation
  – Barge-in (the ability to interrupt an audio prompt as it is playing)
• Microsoft, Whisper, Dr Who
SOTA ASR Systems
• DARPA
  – 1982
  – Goals:
    • High accuracy
    • Real-time performance
    • Understanding capability
    • Continuous speech recognition
  – DARPA databases:
    • 997 words (RM)
    • Above 100 speakers
    • TIMIT
SOTA ASR Systems
• SPHINX II
  – CMU
  – HMM-based speech recognition
  – Bigram, word-pair language models
  – Generalized triphones
  – DARPA database
  – 97% recognition (perplexity 20)
• SPHINX III
  – CHMM-based
  – WER about 15% on WSJ
ASR Advances

• 1985 - noise environment: quiet room, fixed high-quality mic; speech style: careful reading; user population: speaker-dependent; complexity: application-specific speech and language
• 1995 - noise environment: normal office, various microphones, telephone; speech style: planned speech; user population: speaker independent and adaptive; complexity: expert years to create an application-specific language model
• 2000 - noise environment: vehicle noise, radio, cell phones; speech style: natural human-machine dialog (user can adapt); user population: regional accents, native speakers, competent foreign speakers; complexity: some application-specific data and one engineer-year
• 2005 - noise environment: wherever speech occurs; speech style: all styles, including human-human (unaware); user population: all speakers of the language, including foreign; complexity: application independent or adaptive
But
• Still <97% accurate on “yes” for telephone
• Unexpected rate of speech causes doubling or tripling of the error rate
• Unexpected accent hurts badly
• Accuracy on unrestricted speech at 60%
• Don’t know when we know
• Few advances in basic understanding
How to Measure the Performance?
• What benchmarks?
  – DARPA
  – NIST (hub-4, hub-5, …)
• What was the training set? What was the test set? Were they independent?
• The vocabulary and the sample size?
• Was the noise added or coincident with the speech? What kind of noise?
• Spontaneous telephone speech is still a “grand challenge”.
• Telephone-quality speech is still central to the problem.
• Broadcast news is a very dynamic domain.
[Figure: ASR Performance - word error rate (0%-40%) versus level of difficulty, increasing from digits, continuous digits, command and control, and letters and numbers, through read speech and broadcast news, to conversational speech]
[Figure: Wall Street Journal with additive noise - word error rate (0%-20%) versus speech-to-noise ratio (10 dB, 16 dB, 22 dB, quiet), comparing machines against a committee of human listeners]
• Human performance exceeds machine performance by a factor ranging from 4x to 10x depending on the task.
• On some tasks, such as credit card number recognition, machine performance exceeds humans due to human memory retrieval capacity.
• The nature of the noise is as important as the SNR (e.g., cellular phones).
• A primary failure mode for humans is inattention.
• A second major failure mode is the lack of familiarity with the domain (i.e., business terms and corporation names).
Machine vs Human Performance
Core technology for ASR
Why is ASR Hard?
• Natural speech is continuous
• Natural speech has disfluencies
• Natural speech is variable over: global rate, local rate, pronunciation within speaker, pronunciation across speakers, phonemes in different contexts
Why is ASR Hard? (continued)
• Large vocabularies are confusable
• Out-of-vocabulary words are inevitable
• Recorded speech is variable over: room acoustics, channel characteristics, background noise
• Large training times are not practical
• User expectations are for equal-to or greater-than “human performance”
Main Causes of Speech Variability
• Environment
  – Speech-correlated noise: reverberation, reflection
  – Uncorrelated noise: additive noise (stationary, nonstationary)
• Speaker
  – Attributes of speakers: dialect, gender, age
  – Manner of speaking: breath & lip noise, stress, Lombard effect, rate, level, pitch, cooperativeness
• Input equipment
  – Microphone (transmitter), distance from microphone, filter
  – Transmission system: distortion, noise, echo
  – Recording equipment
ASR Dimensions
• Speaker dependent, independent
• Isolated, continuous, keywords
• Lexicon size and difficulty
• Task constraints, perplexity
• Adverse or easy conditions
• Natural or read speech
Telephone Speech
• Limited bandwidth (F vs S)
• Large speaker variability
• Large noise variability
• Channel distortion
• Different handset microphones
• Mobile and handsfree acoustics
What is Speech Recognition?

Speech signal → speech recognition → words (“How are you?”)

• Related areas:
  – Who is the talker? (speaker recognition, identification)
  – What language did he speak? (language recognition)
  – What is his meaning? (speech understanding)
What is the problem?

Find the most likely word sequence Ŵ among all possible sequences given the acoustic evidence A. A tractable reformulation of the problem is:

Ŵ = argmax_W P(W | A) = argmax_W P(A | W) · P(W)

where P(W) is the language model, P(A | W) is the acoustic model, and the maximization over all word sequences is a daunting search task.
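A toy illustration of this argmax (the hypotheses and log scores below are invented, not from any real recognizer): the acoustic model slightly prefers the wrong string, but the language model overrules it:

```python
# Noisy-channel decoding sketch: pick the word sequence W maximizing
# log P(A|W) + log P(W).  All numbers are made-up illustrative scores.
acoustic_logprob = {          # log P(A | W), from the acoustic model
    "how are you": -10.0,
    "how or you":  -9.5,      # acoustically slightly better...
}
language_logprob = {          # log P(W), from the language model
    "how are you": -2.0,
    "how or you":  -6.0,      # ...but linguistically much less likely
}

def decode(hypotheses):
    return max(hypotheses,
               key=lambda w: acoustic_logprob[w] + language_logprob[w])

print(decode(["how are you", "how or you"]))  # how are you
```

In a real decoder the sum runs over every path through the models, which is why the search is the hard part.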
View ASR as Pattern Recognition

Analog speech → front end → observation sequence O1 O2 … OT → decoder (using the acoustic model, dictionary, and language model) → best word sequence W1 W2 … WT
View ASR in Hierarchy

Speech waveform → feature extraction (signal processing) → spectral feature vectors → phone likelihood estimation (Gaussians or neural networks) → phone likelihoods P(o|q) → decoding (Viterbi or stack decoder, using the HMM lexicon and an N-gram grammar) → words
Front-End Processing

[Figure: front-end processing pipeline with dynamic features, after K.F. Lee]
Feature Extraction
• Goals:
  – Less computation & memory
  – Simple representation of the signal
• Methods:
  – Fourier-spectrum based
    • MFCC (mel-frequency cepstrum coefficients)
    • LFCC (linear-frequency cepstrum coefficients)
    • Filter-bank energies
  – Linear-prediction-spectrum based
    • LPC (linear predictive coding)
    • LPCC (linear predictive cepstrum coefficients)
  – Others
    • Zero crossings, pitch, formants, amplitude
Cepstrum Computation

• The cepstrum is the inverse Fourier transform of the log spectrum:

c(n) = (1/2π) ∫_{-π}^{π} log S(e^{jω}) e^{jωn} dω,  n = 0, 1, …, L-1

• In computation the IDFT takes the form of a weighted DCT, as in HTK
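The integral can be approximated on a DFT grid. This sketch computes the real cepstrum of one windowed frame; the frame contents and the number of coefficients kept are arbitrary choices for illustration:

```python
import numpy as np

def real_cepstrum(frame, n_coeffs):
    """Real cepstrum: inverse DFT of the log magnitude spectrum."""
    spectrum = np.fft.fft(frame)
    log_mag = np.log(np.abs(spectrum) + 1e-10)   # small floor avoids log(0)
    cepstrum = np.fft.ifft(log_mag).real          # imaginary part is ~0 for real input
    return cepstrum[:n_coeffs]

# one Hamming-windowed frame of a toy sinusoid
frame = np.hamming(64) * np.sin(2 * np.pi * np.arange(64) * 0.1)
c = real_cepstrum(frame, 13)
print(c.shape)  # (13,)
```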
Mel Cepstral Coefficients

• Construct the mel-frequency domain using triangularly-shaped weighting functions applied to mel-transformed log-magnitude spectral samples (FFT and log, then a DCT)
• Filter bank: linear below 1 kHz, logarithmic above 1 kHz
• Motivated by human auditory response characteristics
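A single-frame sketch of this recipe (triangular mel-spaced filters over a power spectrum, log, then DCT). The filter count, coefficient count, and 8 kHz rate are typical but arbitrary choices, not values fixed by the slides:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_from_power_spectrum(power, sample_rate, n_filters=20, n_ceps=13):
    """MFCCs from one frame's power spectrum (textbook-style sketch)."""
    n_fft = (len(power) - 1) * 2          # assumes `power` came from rfft
    # filter center frequencies spaced evenly on the mel scale
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, len(power)))
    for i in range(n_filters):            # build triangular filters
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):
            fbank[i, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fbank[i, k] = (r - k) / max(r - c, 1)
    log_energy = np.log(fbank @ power + 1e-10)
    # DCT-II decorrelates the log filter-bank energies
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1)) / (2 * n_filters))
    return dct @ log_energy

power = np.abs(np.fft.rfft(np.random.randn(256))) ** 2
print(mfcc_from_power_spectrum(power, 8000).shape)  # (13,)
```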
Cepstrum as Vector Space Features

[Figure: speech analyzed in overlapping frames, each frame producing one cepstral feature vector]
Other Features
• LPC - linear predictive coefficients
• PLP - perceptual linear prediction
• Though MFCC has been used successfully, what is the robust speech feature?
Acoustic Models
• Template-based AM, used in DTW; obsolete
• Hidden Markov Model based AM; popular now
• Other AMs
  – Articulatory AM
  – Knowledge-based approach: spectrogram reading (expert systems)
  – Connectionist approach: TDNN
Template-based Approach
• Dynamic programming algorithm
• Distance measure
• Isolated words
• Scaling invariance
• Time warping
• Clustering methods
A SSR presentation: 8.2 Definition of the Hidden Markov Model
Definition of HMM

A formal definition of an HMM:
• An output observation alphabet O = {o1, o2, …, oM}
• A set of states Ω = {1, 2, …, N}
• A transition probability matrix A = {aij}, where aij = P(st = j | st-1 = i)
• An output probability matrix B = {bi(k)}, where bi(k) = P(Xt = ok | st = i)
• An initial state distribution π = {πi}, where πi = P(s0 = i)

Assumptions:
• Markov assumption
• Output independence assumption
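The five elements can be written out concretely. The numbers below describe an invented two-state, two-symbol model, used only to show the stochastic constraints each element must satisfy:

```python
# Minimal container matching the formal HMM definition (illustrative values).
hmm = {
    "states": [0, 1],                 # Omega = {1, ..., N}, here N = 2
    "obs_alphabet": ["H", "T"],       # O = {o1, ..., oM}, here M = 2
    "A": [[0.7, 0.3],                 # A[i][j] = P(s_t = j | s_{t-1} = i)
          [0.4, 0.6]],
    "B": [[0.9, 0.1],                 # B[i][k] = P(X_t = o_k | s_t = i)
          [0.2, 0.8]],
    "pi": [0.6, 0.4],                 # pi[i] = P(s_0 = i)
}

# Stochastic constraints: every row of A and B, and pi itself, sums to 1.
for row in hmm["A"] + hmm["B"] + [hmm["pi"]]:
    assert abs(sum(row) - 1.0) < 1e-12
print("valid HMM")
```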
Three Problems of HMM

Given a model Φ and a sequence of observations:
• The Evaluation Problem: how to compute the probability of the observation sequence? → Forward algorithm
• The Decoding Problem: how to find the optimal state sequence associated with a given observation sequence? → Viterbi algorithm
• The Training/Learning Problem: how can we adjust the model parameters to maximize the joint probability? → Baum-Welch algorithm (forward-backward algorithm)
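A minimal forward algorithm for the evaluation problem, run on an invented two-state model. Real systems work in log space or rescale alpha to avoid underflow on long sequences; this sketch omits that:

```python
def forward(pi, A, B, obs):
    """Forward algorithm: P(observation sequence | model), summing over all paths."""
    N = len(pi)
    # initialize: alpha_0(i) = pi_i * b_i(o_0)
    alpha = [pi[i] * B[i][obs[0]] for i in range(N)]
    # induct: alpha_t(j) = (sum_i alpha_{t-1}(i) * a_ij) * b_j(o_t)
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][o]
                 for j in range(N)]
    return sum(alpha)

pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]      # observation symbols are 0 and 1
print(forward(pi, A, B, [0, 1, 0]))  # ≈ 0.10893
```

Note that the probabilities of all length-1 sequences sum to 1, a useful sanity check.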
Advantages of HMM
• Isolated & continuous speech recognition
• No attempt to find word boundaries
• Recovery from erroneous assumptions
• Scaling invariance, time warping, learning capability
Limitations of HMM
• HMMs assume the state duration follows an exponential (geometric) distribution
• The transition probability depends only on the origin and destination states
• All observation frames depend only on the state that generated them, not on the neighboring observation frames (the output independence assumption)
HMM-based AM
• Hidden Markov Models (HMMs)
  – Probabilistic state machines - the state sequence is unknown; only the feature-vector outputs are observed
  – Each state has an output symbol distribution
  – Each state has a transition probability distribution
  – Issues:
    • What topology is proper?
    • How many states in a model?
    • How many mixtures in a state?
• Acoustic models encode the temporal evolution of the features (spectrum).
• Gaussian mixture distributions are used to account for variations in speaker, accent, and pronunciation.
• Phonetic model topologies are simple left-to-right structures.
• Skip states (time-warping) and multiple paths (alternate pronunciations) are also common features of models.
• Sharing model parameters (tied) is a common strategy to reduce complexity.
Hidden Markov Models
• Closed-loop data-driven modeling supervised only from a word-level transcription.
• The expectation/maximization (EM) algorithm is used to improve our parameter estimates.
• Computationally efficient training algorithms (Forward-Backward) have been crucial.
• Batch mode parameter updates are typically preferred.
• Decision trees are used to optimize parameter-sharing, system complexity, and the use of additional linguistic knowledge.
AM Parameter Estimation
• Initialization
• Single Gaussian estimation
• 2-way split
• Mixture distribution re-estimation
• 4-way split
• Re-estimation
• …
Basic Speech Units
• Recognition units:
  – Phoneme
  – Word
  – Syllable
  – Demisyllable
  – Triphone
  – Diphone
Basic Units Selection
• Create a set of HMMs representing the basic sounds (phones) of a language
  – English has about 40 distinct phonemes
  – Chinese has about 22 Initials + 37 Finals
  – Need a “lexicon” for pronunciations
  – Letter-to-sound rules for unusual words
  – Co-articulation effects must be modeled
• Triphones - each phone modified by its onset and trailing context phones (1k-2k used in English), e.g. pl-c+pr
Language Models
• What is a language model?
  – A quantitative ordering of the likelihood of word sequences (statistical viewpoint)
  – A set of rules specifying how to create word sequences or sentences (grammar viewpoint)
• Why use language models?
  – Not all word sequences are equally likely
  – Search space optimization (*)
  – Improve accuracy (multiple passes)
  – Word lattice to n-best
Finite-State Language Model
• Write a grammar of possible sentence patterns
• Advantages:
  – Long history/context
  – No need for a large text database (rapid prototyping)
  – Integrated syntactic parsing
• Problems:
  – Work to write the grammars
  – Word sequences not covered by the grammar cannot be recognized
  – Used in small-vocabulary ASR, not for LVCSR
• Example network: (show me | display) (any | the next | the last) (page | picture | text file)
Statistical Language Models
• Predict the next word based on the current word and history
• Probability of the next word is given by:
  – Trigram: P(wi | wi-1, wi-2)
  – Bigram: P(wi | wi-1)
  – Unigram: P(wi)
• Advantages:
  – Trainable on large text databases
  – “Soft” prediction (probabilities)
  – Can be directly combined with the AM in decoding
• Problems:
  – Need a large text database for each domain
  – Sparseness problems; smoothing approaches
    • Backoff approach
    • Word-class approach
• Used in LVCSR
Statistical LM Performance
ASR Decoding Levels

Decoding proceeds through several levels: states → phonemes → words → sentences. For example, the phoneme sequences /w/ /ah/ /ts/ and /th/ /ax/ are mapped by the dictionary to the words “what's” and “the”, and the language model scores sentences built from entries such as display, kirk's, willamette's, sterett's, location, longitude, latitude. Acoustic models cover the state level, the dictionary the phoneme-to-word level, and the language model the sentence level.
Decoding Algorithms
• Given the observations, how do we determine the most probable utterance/word sequence? (DTW in template-based matching)
• The dynamic programming (DP) algorithm was proposed by Bellman in the 50's for multistep decision processes; the “principle of optimality” is divide and conquer
• DP-based search algorithms have been used in speech recognition decoders to return n-best paths or a word lattice through the acoustic model and the language model
• A complete search is usually impossible since the search space is too large, so beam search is required to prune less probable paths and save computation
• Issues: computation underflow; balance of LM and AM
Viterbi Search
• Uses Viterbi decoding
  – Takes MAX, not SUM (Viterbi vs. Forward)
  – Finds the optimal state sequence, not the optimal word sequence
  – Computation load: O(T·N²)
• Time synchronous
  – Extends all paths at each time step
  – All paths have the same length (no need to normalize to compare scores, but A* decoding needs to)
Viterbi Search Algorithm

Function Viterbi(observations of length T, state-graph) returns best-path
  num-states ← num-of-states(state-graph)
  Create a path probability matrix viterbi[num-states+2, T+2]
  viterbi[0, 0] ← 1.0
  for each time step t from 0 to T do
    for each state s from 0 to num-states do
      for each transition s′ from s in state-graph do
        new-score ← viterbi[s, t] · a[s, s′] · b_s′(o_t)
        if (viterbi[s′, t+1] = 0) or (viterbi[s′, t+1] < new-score) then
          viterbi[s′, t+1] ← new-score
          back-pointer[s′, t+1] ← s
  Backtrace from the highest-probability state in the final column of viterbi[] and return the path
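This translates into a compact log-space Python version (log space sidesteps the underflow issue noted earlier). The two-state model is invented, and word-level backtracking is omitted; this recovers the state path only:

```python
import math

def viterbi(pi, A, B, obs):
    """Viterbi decoding: most likely state sequence for an observation sequence."""
    N = len(pi)
    score = [math.log(pi[i]) + math.log(B[i][obs[0]]) for i in range(N)]
    backptr = []
    for o in obs[1:]:
        prev = score
        score, bp = [], []
        for j in range(N):
            # best predecessor state for j (the MAX that replaces Forward's SUM)
            best_i = max(range(N), key=lambda i: prev[i] + math.log(A[i][j]))
            bp.append(best_i)
            score.append(prev[best_i] + math.log(A[best_i][j]) + math.log(B[j][o]))
        backptr.append(bp)
    # backtrace from the best final state
    state = max(range(N), key=lambda i: score[i])
    path = [state]
    for bp in reversed(backptr):
        state = bp[state]
        path.append(state)
    return list(reversed(path))

pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]      # state 0 favors symbol 0, state 1 favors symbol 1
print(viterbi(pi, A, B, [0, 0, 1, 1]))  # [0, 0, 1, 1]
```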
Viterbi Search Trellis

[Figure: trellis over time steps 0-3 with word models W1 and W2; each column holds the scores of all states at that time step]
Viterbi Search Insight

[Figure: trellis detail from time t to t+1 for states S1-S3 of Word 1 and Word 2. A within-word transition scores OldProb(S1) · OutProb · TransProb; a cross-word transition scores OldProb(S3) · P(W2 | W1). Each cell stores a score, a back-pointer, and a parameter pointer.]
Backtracking
• Find the best association between words and the signal
• Compose words from phones using the dictionary
• Backtracking finds the best state sequence

[Figure: alignment of the phones /th/ and /e/ against the signal from t1 to tn]
N-Best Speech Results
• Use the grammar to guide recognition
• Post-processing based on the grammar/LM
• Word lattice to n-best conversion

Example: from the speech waveform, ASR with a grammar returns an N-best list:
  N=1: “Get me two movie tickets…”
  N=2: “I want to movie trips…”
  N=3: “My car's too groovy”
Complexity of Search
• Lexicon: contains all the words in the system's vocabulary along with their pronunciations (often there are multiple pronunciations per word; # of items in the lexicon)
• Acoustic Models: HMMs that represent the basic sound units the system is capable of recognizing (# of models, # of states per model, # of mixtures per state)
• Language Model: determines the possible word sequences allowed by the system (fan-out, perplexity, entropy)
ASR vs Modern AI
• ASR is based on AI techniques
  – Knowledge representation & manipulation
    • AM and LM, lexicon, observation vectors
  – Machine learning
    • Baum-Welch for HMMs
    • Nearest neighbor & k-means clustering for signal identification
  – “Soft” probabilistic reasoning / Bayes rule
    • Manages uncertainty mapping among signal, phone, word
  – ASR is an expert system
ASR Summary
• Performance criterion is WER (word error rate)
• Three main knowledge sources– Acoustic Model (Gaussian Mixture Models)– Language Model (N-Grams, FS Grammars)– Dictionary (Context-dependent sub-phonetic units)
• Decoding– Viterbi Decoder– Time-synchronous– A* decoding (stack decoding, IBM, X.D. Huang)
We still need
• We still need science
• Need language, intelligence
• Acoustic robustness is still poor
• Perceptual research, models
• Fundamentals of statistical pattern recognition for sequences
• Robustness to accent, stress, rate of speech, …
Future Directions

Conclusions:
• Supervised training is a good machine learning technique
• Large databases are essential for the development of robust statistics

Challenges:
• Discrimination vs. representation
• Generalization vs. memorization
• Pronunciation modeling
• Human-centered language modeling

The algorithmic issues for the next decade:
• Better features by extracting articulatory information?
• Bayesian statistics? Bayesian networks?
• Decision trees? Information-theoretic measures?
• Nonlinear dynamics? Chaos?

[Timeline, 1960-2004: analog filter banks, then dynamic time warping, then hidden Markov models]
References
• Speech & Language Processing - Jurafsky & Martin, Prentice Hall, 2000
• Spoken Language Processing - X. D. Huang et al., Prentice Hall, 2000
• Statistical Methods for Speech Recognition - Jelinek, MIT Press, 1999
• Foundations of Statistical Natural Language Processing - Manning & Schütze, MIT Press, 1999
• Fundamentals of Speech Recognition - L. R. Rabiner and B. H. Juang, Prentice-Hall, 1993
• Dr. J. Picone - Speech Website, www.isip.msstate.edu
Test
• Mode
  – A final 4-page report, or
  – A 30-min presentation
• Content
  – Review of speech processing
  – Speech features and processing approaches
  – Review of TTS or ASR
  – Audio in computer engineering
THANKS