Exemplar-Based Processing for Speech Recognition
presented by
Andreas Gaich
SS 2013 SPSC
Advanced Signal Processing Seminar 2
Institut für Signalverarbeitung und Sprachkommunikation
Graz 29.04.2013 Advanced Signal Processing Seminar 2
2 Andreas Gaich
Overview
Introduction
• Speech Recognition
• Frame- and segment-based models
• Global-data vs. exemplar-based models
State-of-the-art techniques
• Overview
• k-NN Classification
• Sparse Representations
• Template Matching
• Latent Perceptual Mapping
Experimental results
Conclusions
Introduction – Speech Recognition
Speech Recognition Problem
• Find a principled way of modeling the physical phenomena generating the observed data and the uncertainty in it.
• Uncertainties: e.g. dialects, vocal tract variations, corruption by noise, etc.
Approaches of modeling
• Eager (offline) learning => GLOBAL-DATA MODELING
Uses all available training data to build a model before the test sample is seen
• Lazy (memory-based) learning => EXEMPLAR-BASED MODELING
Selects a subset of exemplars from the training data to build a local model specifically for every test sample
Test sample informs the construction of the model
Introduction – Speech Recognition
The sequence recognition problem
• Find the sequence of words that corresponds to a waveform
• Done by maximizing the posterior probability
U … sequence of subword units; X … sequence of observations
• Subword units "u" are representations of words "w"
=> …Language model; …Pronunciation model; …Acoustic model
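For reference, the textbook form of this maximization can be sketched with the symbols defined above. The notation is an assumption, not necessarily the one used on the slide: W denotes the word sequence, and the factorization assumes that X depends on W only through U. The term P(X) is dropped because it does not depend on W.

```latex
\begin{align}
W^{*} &= \arg\max_{W} P(W \mid X) \\
      &= \arg\max_{W} \sum_{U} P(X \mid U)\, P(U \mid W)\, P(W)
\end{align}
```

In this form, P(W) plays the role of the language model, P(U | W) the pronunciation model, and P(X | U) the acoustic model.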
Introduction – Frame- and segment-based models
Frame-based models
• Observations are a temporal sequence of acoustic feature vectors computed at regular time intervals
• HMMs are a popular way to compute the acoustic model
• Efficient probabilistic acoustic models result from making a "first-order Markov assumption" between states and assuming "frame-by-frame independence" between observations
!!! FRAME-BY-FRAME INDEPENDENCE ASSUMPTION IS UNREALISTIC !!!
Introduction – Frame- and segment-based models
Segment-based models
• Introduction of an unobservable segmentation variable S
• Allows for modeling dependencies of multiple frames in each segment
• Difficulty of defining segments
Boundaries e.g. at large spectral changes in the frame-level observations X
• Calculation of
State trajectory modeling
Converting all observations within a segment to a new "segmental feature vector" [2]
Introduction – Global-data vs. exemplar-based models
• Virtually all speech recognition systems use global-data models
e.g. GMMs, NNs, SVMs as the underlying computational engine
• Global models must simplify the process of speech production by making independence assumptions
e.g. gender, dialect, environmental noise, etc.
• Global models seek an average representation that is reliable
• This is questionable for at least two reasons
1.) A lower-dimensional representation of speech feature vectors is possible (around 40 suggested for frame-based systems)
2.) Model parameters can be unreliable if there are not enough training samples for the specific class
Introduction – Global-data vs. exemplar-based models
• Exemplar-based techniques build local models using a few relevant training examples
=> Do not suffer from data sparsity issues
Techniques – Overview
Workflow
• Feature Extraction
Potentially apply feature transformation to reduce the influence of noisy and irrelevant features
Acoustic inventory can be composed of fixed-length feature vectors (frames), or variable length sequences of such vectors (templates)
• Exemplar Selection
Identify the instances from the training data that are most relevant to each test instance
• Instance Modeling
• Frame- or segment-based decoding
Compute the acoustic score
Perform e.g. a Viterbi search
Techniques – Overview
Exemplar Selection
Techniques – kNN Classification
Techniques – kNN Classification
Exemplar Selection
• Obtain a set C containing the indices of the k nearest exemplars from the whole training data, which is collected in a matrix H of dimension D × N
Instance Modeling
• Estimate class posteriors
… rows of G corresponding to class q
G … binary matrix that associates each exemplar in H with class labels
i(C) … indicator vector representing the kNNs in H of a test instance
• Weighted kNN also possible
E.g. with weights that depend on the distance between test and training instance
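The posterior estimate described above can be sketched as follows. This is a minimal illustration, not the implementation from the cited work: the function name, the Euclidean distance, and the toy data are assumptions, and the posterior is taken to be the fraction of the k neighbours that carry a given class label.

```python
import numpy as np

def knn_class_posteriors(H, labels, y, k, n_classes):
    """Estimate class posteriors for test vector y from its k nearest exemplars.

    H: (D, N) matrix with one training exemplar per column.
    labels: length-N integer class labels of the columns of H.
    """
    H = np.asarray(H, dtype=float)
    labels = np.asarray(labels)
    N = H.shape[1]

    # Binary matrix G (n_classes x N): G[q, n] = 1 if exemplar n has label q.
    G = np.zeros((n_classes, N))
    G[labels, np.arange(N)] = 1.0

    # Set C: indices of the k exemplars closest to y (Euclidean distance, an assumption).
    dists = np.linalg.norm(H - np.asarray(y, dtype=float)[:, None], axis=0)
    C = np.argsort(dists)[:k]

    # Indicator vector i(C): 1 for the selected neighbours, 0 elsewhere.
    i_C = np.zeros(N)
    i_C[C] = 1.0

    # Posterior of class q = fraction of the k neighbours that carry label q.
    return G @ i_C / k


# Toy data: three exemplars of class 0 near the origin, two of class 1 near (1, 1).
H = np.array([[0.0, 0.1, 0.2, 1.0, 1.1],
              [0.0, 0.1, 0.0, 1.0, 0.9]])
labels = [0, 0, 0, 1, 1]
posteriors = knn_class_posteriors(H, labels, y=[0.9, 1.0], k=3, n_classes=2)
predicted = int(np.argmax(posteriors))  # classification by majority vote
```

A weighted variant would replace the ones in i(C) by distance-dependent weights and normalize by their sum instead of k.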
Techniques – kNN Classification
Decoding
• Classification by majority vote
• For speech recognition, the kNN scores need to be converted to frame-based observation likelihoods
Normalize them so that they sum to 1; they are then a direct measure of the observation likelihoods
• kNN can also be used in "multiple-frame" and "segment-based" frameworks
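As a rough sketch of frame-based decoding with such scores, the example below normalizes per-frame kNN votes and feeds them to a small Viterbi search. The transition matrix, the initial probabilities, and the vote counts are hypothetical values chosen for illustration; a real recognizer would use HMM topologies and a language model.

```python
import numpy as np

def viterbi(obs_scores, trans, init):
    """Return the most likely state path.

    obs_scores: (T, Q) per-frame class scores (e.g. kNN vote counts).
    trans: (Q, Q) transition probabilities; init: (Q,) initial probabilities.
    """
    obs = np.asarray(obs_scores, dtype=float)
    # Normalize each frame so the scores sum to 1 and can act as observation scores.
    obs = obs / obs.sum(axis=1, keepdims=True)
    eps = 1e-12  # avoids log(0)
    log_obs = np.log(obs + eps)
    log_trans = np.log(np.asarray(trans, dtype=float) + eps)
    T, Q = obs.shape

    delta = np.log(np.asarray(init, dtype=float) + eps) + log_obs[0]
    backptr = np.zeros((T, Q), dtype=int)
    for t in range(1, T):
        # cand[i, j]: best score of a path that is in state i at t-1 and moves to j.
        cand = delta[:, None] + log_trans
        backptr[t] = np.argmax(cand, axis=0)
        delta = cand[backptr[t], np.arange(Q)] + log_obs[t]

    # Trace the best path backwards from the best final state.
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t][path[-1]]))
    return path[::-1]


votes = np.array([[3, 0], [2, 1], [1, 2], [0, 3]])  # kNN votes per frame, k = 3
trans = np.array([[0.8, 0.2], [0.2, 0.8]])
path = viterbi(votes, trans, init=[0.5, 0.5])
```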
Techniques – Sparse representations
Formulation
• Concatenate the training samples of class i into a matrix
x … feature vector from the training set of class i
• Then a test sample from the same class can be represented as a linear combination
• A priori the membership of y is unknown, so define a matrix with the whole training set containing all k classes
• H is an overcomplete dictionary in terms of solving
=> The coefficient vector should be sparse and non-zero only for the elements in H that belong to the same class as y
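Written out, the formulation above can be sketched as follows. The symbols n_i (number of training samples of class i), D (feature dimension), N (total number of exemplars), and β (coefficient vector) are assumed notation, chosen to agree with the D × N matrix H used for kNN.

```latex
\begin{align}
H_i &= [\,x_{i,1},\, x_{i,2},\, \dots,\, x_{i,n_i}\,] \in \mathbb{R}^{D \times n_i} \\
y   &= H_i\, \beta_i \\
H   &= [\,H_1,\, H_2,\, \dots,\, H_k\,] \in \mathbb{R}^{D \times N},
       \qquad N = \sum_{i=1}^{k} n_i \\
y   &= H \beta
\end{align}
```

Because N is typically much larger than D, the system y = Hβ has many solutions, which is why a sparsity constraint on β is needed to single one out.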
Techniques – Sparse representations
• Sparse solution is found by means of the Lasso (refer to ASP SE1 WS12/13)
Minimize L2 distance while enforcing sparsity using an L1 constraint
Exemplar Selection
• Intelligent choice of dictionary H is necessary
kNN search
Random sampling of training data
Instance Modeling
• Estimate class posteriors (same procedure as kNN)
• If the posteriors represent phone classes, they are referred to as "SR phone identification features"
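A minimal sketch of the Lasso step and a subsequent posterior estimate is given below. It uses iterative soft-thresholding (ISTA) as a simple stand-in solver, which is not necessarily the solver used in the cited work. The penalty weight lam, the toy dictionary, and the choice to measure class evidence by the absolute coefficient mass per class are all assumptions.

```python
import numpy as np

def lasso_ista(H, y, lam=0.05, n_iter=500):
    """Minimize 0.5 * ||y - H b||_2^2 + lam * ||b||_1 by iterative soft-thresholding."""
    H = np.asarray(H, dtype=float)
    y = np.asarray(y, dtype=float)
    # Step size 1/L, where L (squared spectral norm) bounds the gradient's Lipschitz constant.
    step = 1.0 / (np.linalg.norm(H, 2) ** 2)
    b = np.zeros(H.shape[1])
    for _ in range(n_iter):
        z = b - step * (H.T @ (H @ b - y))  # gradient step on the L2 term
        b = np.sign(z) * np.maximum(np.abs(z) - lam * step, 0.0)  # soft threshold (L1 term)
    return b

def sr_class_posteriors(b, labels, n_classes):
    """Share of the absolute coefficient mass that falls on each class."""
    mass = np.zeros(n_classes)
    np.add.at(mass, np.asarray(labels), np.abs(b))
    return mass / mass.sum()


# Toy dictionary with D = 2 and N = 4: columns 0-1 are class 0, columns 2-3 are class 1.
H = np.array([[1.0, 0.9, 0.0, 0.1],
              [0.0, 0.1, 1.0, 0.9]])
labels = [0, 0, 1, 1]
y = np.array([0.05, 1.0])  # test sample that resembles the class-1 exemplars
beta = lasso_ista(H, y)
posteriors = sr_class_posteriors(beta, labels, n_classes=2)
```

In this toy case most of the coefficient mass lands on the class-1 columns, which is the behaviour the sparsity argument above relies on.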
Techniques – Sparse representations
Decoding
• Same procedure as in the kNN chapter
• Alternatively construct new features to train a GMM
Noise robustness using SR
• Sparse imputation (SI)
Some features stay relatively uncorrupted in noisy speech
Define a "binary mask" m that represents the uncorrupted dimensions
Obtain SR by solving
• Source separation
Describe x as linear combination
Requires a representation where noise and speech add linearly
Estimate class posteriors
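One common way to write these two ideas is sketched below. The notation is assumed rather than taken from the slides: y is the noisy observation, ⊙ is the element-wise product with the binary mask m, λ is the sparsity weight, and H_s and H_n are hypothetical speech and noise dictionaries with coefficient vectors β_s and β_n.

```latex
\begin{align}
\hat{\beta} &= \arg\min_{\beta}\; \lVert m \odot (y - H\beta) \rVert_2^2
               + \lambda \lVert \beta \rVert_1
            && \text{(sparse imputation)} \\
\hat{x} &= H \hat{\beta}
            && \text{(estimate of all feature dimensions)} \\
y &\approx H_s \beta_s + H_n \beta_n
            && \text{(source separation)}
\end{align}
```

In the first case only the dimensions marked reliable by m constrain the fit; in the second, the speech part of the sparse solution carries the class information.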
Techniques – Template matching
• Nonparametric method
• Compares reference templates directly with the observed features X
Make use of the entire database
Use the meta-information attached to the training data (labels, gender, etc.)
Exemplar Selection
• First generate a rough word graph to keep the number of candidate segments and corresponding class labels manageable
Done by an HMM system or bottom-up template selection
Augmented to subword segmentation
• Collect a set of k-NN templates for each word arc that match the word identity u and resemble the sequence of acoustic features X
Techniques – Template matching
Modeling
• Template matching uses variable-length units
=> Dynamic Time Warping (DTW)
Sum up local distances between template and input and minimize the overall cost to find the right trajectory
• Within-template alignment:
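A minimal sketch of DTW between one template and an input sequence is shown here. The Euclidean local distance, the three allowed step directions, and the toy templates are assumptions; practical systems add path constraints and the boundary costs discussed later.

```python
import math

def dtw_cost(template, observed):
    """Minimum accumulated local distance between two sequences of feature vectors."""
    n, m = len(template), len(observed)
    inf = float("inf")
    # acc[i][j]: cheapest cost of aligning template[:i] with observed[:j].
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(template[i - 1], observed[j - 1])  # Euclidean local distance
            acc[i][j] = local + min(
                acc[i - 1][j],      # observed frame j also matched the previous template frame
                acc[i][j - 1],      # template frame i also matched the previous observed frame
                acc[i - 1][j - 1],  # both sequences advance by one frame
            )
    return acc[n][m]


# Two hypothetical one-dimensional templates and a time-stretched input.
templates = {"rise": [[0.0], [1.0], [2.0]], "fall": [[2.0], [1.0], [0.0]]}
observed = [[0.0], [0.0], [1.0], [2.0], [2.0]]
costs = {name: dtw_cost(t, observed) for name, t in templates.items()}
best = min(costs, key=costs.get)
```

The input is a stretched version of the "rise" template, so warping absorbs the duration difference and that template has the lower cost.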
Techniques – Template matching
• Consider additional costs at boundaries
Template transition cost
Language model cost
• Template transition costs
Penalty for incorrect acoustic-phonetic context
Incorporate non-verbal information to find consistent paths (e.g. male vs. female)
Techniques – Template matching
Decoding
• Single best Viterbi decoding
Remains sensitive to errors in the training database (incorrect annotations, bad segmentations, highly unusual pronunciations, etc.)
Improvement by "data sharpening"
• Simpler and yet better:
Average scores before decoding
Then use the Viterbi decoder again
Techniques – Template matching
Techniques – Latent perceptual mapping (LPM)
• Operates with data-driven acoustic units (no prior model assumed)
• Speech segments are treated as a bag of acoustic units drawn from a limited acoustic vocabulary
• Training comprises three main steps:
Extracting relevant "units" from a given set of phoneme instances
Deriving a unit-document co-occurrence matrix
Mapping the phoneme instances to a dimensionality-reduced latent space after singular value decomposition (SVD) of the co-occurrence matrix
LPM is related to SRs and Template Matching
Techniques – Latent perceptual mapping (LPM)
Exemplar Selection
• Unsupervised clustering of feature vectors from m phoneme segments
• After vector quantization, the resulting sequence of units is broken into n-gram units (1 ≤ n ≤ 20)
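The two steps above can be sketched as follows, assuming a codebook has already been obtained by unsupervised clustering. The function names, the two-entry codebook, and the toy segment are hypothetical.

```python
import math
from collections import Counter

def quantize(frames, codebook):
    """Map each feature vector to the index of its nearest codeword."""
    return [min(range(len(codebook)), key=lambda c: math.dist(f, codebook[c]))
            for f in frames]

def ngram_units(symbols, max_n=20):
    """Count all n-grams with 1 <= n <= max_n in a sequence of codeword indices."""
    counts = Counter()
    for n in range(1, min(max_n, len(symbols)) + 1):
        for start in range(len(symbols) - n + 1):
            counts[tuple(symbols[start:start + n])] += 1
    return counts


codebook = [(0.0, 0.0), (1.0, 1.0)]            # assumed result of clustering
segment = [(0.1, 0.0), (0.9, 1.1), (1.0, 0.8), (0.0, 0.2)]
symbols = quantize(segment, codebook)           # one codeword index per frame
units = ngram_units(symbols, max_n=3)           # n-gram units with their counts
```

The counts per unit are what a unit-document co-occurrence matrix would later collect, with one column per phoneme instance.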
Techniques – Latent perceptual mapping (LPM)
Exemplar Selection – cont'd
• Choose the most informative units based on empirical measures
Indexing Power
Empirical probability
Techniques – Latent perceptual mapping (LPM)
Exemplar Selection – cont'd
• Compute the co-occurrence matrix F by counting the number of times each unit appears in a phoneme instance
• Dimensionality reduction of matrix F by SVD
R approximates the rank of F
Phoneme segments from the training set are mapped to vectors in the latent space and then used as acoustic prototypes
A test segment X is mapped onto the latent space as well, and classification is done by computing distances to the acoustic prototypes
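The mapping and classification steps can be sketched with a truncated SVD. This is an illustration under assumptions: the toy matrix F, the phone labels, the Euclidean distance in the latent space, and the projection of a test count vector by the left singular vectors are not taken from the cited work.

```python
import numpy as np

def train_lpm(F, R):
    """Truncated SVD of the (units x instances) co-occurrence matrix F."""
    U, s, Vt = np.linalg.svd(np.asarray(F, dtype=float), full_matrices=False)
    U_R = U[:, :R]                      # basis of the R-dimensional latent space
    prototypes = s[:R, None] * Vt[:R]   # latent vector of each training instance (columns)
    return U_R, prototypes

def classify(counts, U_R, prototypes, labels):
    """Project a unit-count vector into the latent space and return the nearest prototype's label."""
    v = U_R.T @ np.asarray(counts, dtype=float)
    dists = np.linalg.norm(prototypes - v[:, None], axis=0)  # Euclidean distance (assumption)
    return labels[int(np.argmin(dists))]


# Rows: 4 hypothetical units; columns: 4 phoneme instances with known labels.
F = np.array([[3, 2, 0, 0],
              [2, 3, 0, 1],
              [0, 0, 3, 2],
              [0, 1, 2, 3]])
labels = ["aa", "aa", "iy", "iy"]
U_R, prototypes = train_lpm(F, R=2)
predicted = classify([2, 2, 0, 0], U_R, prototypes, labels)
```

The prototypes equal the projection of the columns of F by the same basis, so training and test vectors live in one consistent latent space.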
Techniques – Link between different approaches
Experimental results
• Most results are evaluated on TIMIT
A small-vocabulary phonetic corpus
Recorded and transcribed by TI and MIT
Provides standardized training, development, and test sets as well as time-aligned phonetic transcriptions
• Exemplar-based methods are compared to the state-of-the-art implementation of the "best" classical approach
Experimental results – kNN
Classification
• Classification error of about 21%, compared with 21.6% for a GMM classifier
Recognition
• Same performance as the GMM/HMM-based approach for small vocabulary
• Improvement over the GMM/HMM-based approach if fewer than 3 hours of training data are used in a large-vocabulary task
Experimental results – Sparse representations
Classification
• Classification error less than 15%, the best result reported on TIMIT
Recognition
• Phonetic error rate (PER) of 18.6% on TIMIT; 0.8% absolute improvement over GMM/HMM systems
• Word error rate (WER) of 18.7%; 0.3% improvement over state-of-the-art GMM/HMM systems
Experimental results – Template matching
Classification
• Classification error of 20.7% (GMM classifier about 21.6%)
Recognition
• Small vocabulary: outperforms an HMM-based system (3.07% versus 3.35% WER)
• Large vocabulary: 21% relative improvement, which results overall in a 7.6% absolute WER
Experimental results – Latent Perceptual Mapping
Classification
• Early experiments with LPM focused on dimensionality reduction rather than accuracy improvements
• By retaining 10% of the maximum dimensionality of the latent space, frame-based LPM operating on vector-quantized phone segments performs on a par with both DTW and discrete-parameter HMM systems
• Template-based LPM using short, variable-length units achieves the same level of performance at a dimensionality less than or comparable to that of the original acoustic space
Conclusions
• Exemplar-based techniques stay closer to the underlying speech process by building local models, while at the same time keeping the number of parameters parsimonious
• Exemplar-based methods complement, in a robust manner, the information captured within global-data models
• Exemplar-based processing could potentially support inference from any corpus enriched with information not immediately usable by global-data models, such as prosody
• It is critical to keep on improving the computational efficiency of exemplar-based methods due to growing data sets in speech recognition
References
[1] B. Ramabhadran, D. Nahamoo, D. Kanevsky, D. Van Compernolle, K. Demuynck, J.F. Gemmeke, J.R. Bellegarda, S. Sundaram, "Exemplar-Based Processing for Speech Recognition", IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 98-113, Nov. 2012.
[2] J.R. Glass, "A probabilistic framework for segment-based speech recognition", Comput. Speech Lang., vol. 17, no. 2-3, pp. 137-152, Apr.-July 2003.
[3] T.N. Sainath, B. Ramabhadran, D. Nahamoo, D. Kanevsky, and A. Sethy, "Exemplar-based sparse representation features for speech recognition", in Proc. Interspeech, 2010, pp. 2254-2257.
[4] K. Demuynck, D. Seppi, H. Van hamme, and D. Van Compernolle, "Progress in example-based automatic speech recognition", in Proc. Int. Conf. Acoustics, Speech and Signal Processing, 2011, pp. 4692-4695.
[5] M. De Wachter, K. Demuynck, D. Van Compernolle, and P. Wambacq, "Data driven example based continuous speech recognition", in Proc. European Conf. on Speech Communication and Technology, 2003, pp. 1133-1136.
[6] S. Sundaram and J. Bellegarda, "Latent perceptual mapping with data-driven variable-length acoustic units for template-based speech recognition", in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), Mar. 2012, pp. 4125-4128.