Representing Acoustics with Mel Frequency Cepstral Coefficients
Lecture 7 Spoken Language Processing Prof. Andrew Rosenberg
Slide 3
Representing Acoustic Information
16-bit samples at a 44.1 kHz sampling rate: ~86 kB/sec, ~5 MB/min.
Waves repeat, so much of this data is redundant.
A good representation of speech (for recognition):
keeps all of the information needed to discriminate between phones;
is compact, i.e. gets rid of everything else.
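The data-rate figures can be checked with a few lines of arithmetic. This is a minimal sketch that assumes a single (mono) channel and binary units (1 kB = 1024 bytes); the slide states neither, and the constant names are only illustrative.

```python
# Raw storage cost of uncompressed audio at the slide's settings.
# Assumption: one channel (mono); the slide does not say.
BITS_PER_SAMPLE = 16
SAMPLE_RATE_HZ = 44_100

bytes_per_second = SAMPLE_RATE_HZ * BITS_PER_SAMPLE // 8
kib_per_second = bytes_per_second / 1024
mib_per_minute = bytes_per_second * 60 / 1024 ** 2

print(f"{bytes_per_second} bytes/s")
print(f"{kib_per_second:.1f} KiB/s, {mib_per_minute:.2f} MiB/min")
```

Rounded, this gives roughly 86 kB per second and 5 MB per minute, matching the slide.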
Slide 4
Frame-Based Analysis
Using a short analysis window, analyze the waveform every 10 ms (or at another analysis rate).
Usually performed with overlapping windows, e.g. FFT and spectrogram.
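Frame-based analysis can be sketched as slicing the sample sequence into overlapping frames. The 10 ms step comes from the slide; the 25 ms window length, the 16 kHz sampling rate, and the function name `frame_signal` are assumptions chosen for illustration.

```python
def frame_signal(samples, sample_rate, frame_ms=25.0, step_ms=10.0):
    """Split samples into overlapping frames, starting one every step_ms.

    frame_ms (window length) is an assumed value; the slide only gives
    the 10 ms analysis step.
    """
    frame_len = int(round(sample_rate * frame_ms / 1000))
    step = int(round(sample_rate * step_ms / 1000))
    frames = []
    # Stop when a full-length frame no longer fits in the signal.
    for start in range(0, len(samples) - frame_len + 1, step):
        frames.append(samples[start:start + frame_len])
    return frames


# One second of silence at an assumed 16 kHz sampling rate.
signal = [0.0] * 16000
frames = frame_signal(signal, 16000)
print(len(frames), "frames of", len(frames[0]), "samples")
```

Because the window (25 ms) is longer than the step (10 ms), consecutive frames share samples, which is the overlap the slide refers to.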
Slide 5
Overlapping Frames
Spectrograms allow for visual inspection of spectral information.
We are looking for a compact, numerical representation.
[Figure: overlapping frames; label: 10 ms]
Slide 6
Example Spectrogram
Slide 7
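Standard Representation in the Field
Mel Frequency Cepstral Coefficients (MFCC)
Pipeline: pre-emphasis -> window -> FFT -> mel filter bank -> log -> inverse FFT (FFT^-1) -> deltas
Energy is computed alongside the cepstral coefficients.
Output: 12 MFCC + 1 energy, each with its delta and delta-delta (12 + 12 + 12 MFCC, 1 + 1 + 1 energy).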
Slide 8
Pre-emphasis
In the spectrum of voiced segments, there is more energy at lower frequencies than at higher frequencies.
Boosting the high frequencies makes the high-frequency information more available.
Pre-emphasis is done with a first-order high-pass filter.
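A first-order high-pass filter of this kind is often written as y[n] = x[n] - a * x[n-1]. The sketch below assumes that form and a coefficient of a = 0.97, a commonly used value that the slide does not specify; `pre_emphasize` is an illustrative name.

```python
def pre_emphasize(samples, alpha=0.97):
    """First-order high-pass filter: y[n] = x[n] - alpha * x[n-1].

    alpha = 0.97 is an assumed, commonly used value.
    """
    if not samples:
        return []
    out = [samples[0]]  # the first sample has no predecessor
    for n in range(1, len(samples)):
        out.append(samples[n] - alpha * samples[n - 1])
    return out


# A constant (zero-frequency) signal is almost cancelled, while a signal
# that alternates sign every sample (highest frequency) is boosted.
slow = pre_emphasize([1.0, 1.0, 1.0, 1.0])
fast = pre_emphasize([1.0, -1.0, 1.0, -1.0])
print(slow)
print(fast)
```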
Slide 9
Windowing
Overlapping windows allow analysis centered at a frame point while using more information.
Slide 10
Hamming Windowing
Discontinuities at the edges of the window can cause problems for the FFT.
The Hamming window smooths out the edges.
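The standard Hamming window is w[n] = 0.54 - 0.46 * cos(2 * pi * n / (N - 1)). The sketch below applies it to one frame; the helper names `hamming` and `apply_window` are illustrative, not from the lecture.

```python
import math


def hamming(length):
    """Hamming window: w[n] = 0.54 - 0.46 * cos(2*pi*n / (length - 1))."""
    if length == 1:
        return [1.0]
    return [
        0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
        for n in range(length)
    ]


def apply_window(frame):
    """Taper a frame toward its edges by multiplying with the window."""
    return [s * w for s, w in zip(frame, hamming(len(frame)))]


window = hamming(5)
tapered = apply_window([1.0] * 5)
print(window)
```

The window is close to zero at both ends and one in the middle, so the frame's edges are shrunk toward zero before the FFT sees them.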
Slide 12
Discrete Fourier Transform
The Fast Fourier Transform (FFT) is an efficient algorithm for computing the Discrete Fourier Transform (DFT).
[Figure: Australian male /i:/ from "heed", FFT analysis window 12.8 ms]
http://clas.mq.edu.au/acoustics/speech_spectra/fft_lpc_settings.html
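To make the DFT/FFT distinction concrete, the sketch below computes the same transform twice: directly from the DFT definition, and with a recursive radix-2 FFT. The radix-2 variant is one common FFT, chosen here for brevity; it only handles power-of-two lengths, and the function names are illustrative.

```python
import cmath
import math


def dft(x):
    """Direct DFT from the definition; O(N^2) operations."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def fft(x):
    """Recursive radix-2 FFT; O(N log N), power-of-two lengths only."""
    n = len(x)
    if n <= 1:
        return list(x)
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


# A cosine that completes exactly 2 cycles in 8 samples.
x = [math.cos(2 * math.pi * 2 * t / 8) for t in range(8)]
magnitudes = [abs(v) for v in fft(x)]
print([round(m, 6) for m in magnitudes])
```

Both functions return the same spectrum; the energy of the test cosine shows up in bin 2 and its mirror image, bin 6.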
Slide 13
Mel Filter Bank and Log
Human hearing is not equally sensitive in all frequency regions.
Modeling human hearing sensitivity helps phone recognition.
MFCC approach: warp frequencies from Hz to the mel frequency scale.
The mel is a unit of pitch: pairs of sounds that are perceptually equidistant in pitch are separated by an equal number of mels.
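The slide does not give a conversion formula. One widely used approximation is mel(f) = 1127 * ln(1 + f / 700); the sketch below assumes that formula, and the function names are illustrative.

```python
import math


def hz_to_mel(hz):
    """Assumed formula: mel = 1127 * ln(1 + hz / 700)."""
    return 1127.0 * math.log(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (math.exp(mel / 1127.0) - 1.0)


for hz in (250, 500, 1000, 2000, 4000, 8000):
    print(hz, "Hz ->", round(hz_to_mel(hz)), "mel")
```

With this formula 1000 Hz maps to about 1000 mel, and doubling the frequency above that point adds far fewer than 1000 mel, which reflects the reduced pitch resolution of hearing at high frequencies.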
Slide 14
Mel Frequency Filter Bank
Create a bank of filters, each collecting energy from one frequency band:
10 filters linearly spaced below 1000 Hz, and the remaining filters logarithmically spaced above 1000 Hz.
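A filter bank along these lines can be sketched with triangular filters. Note the swap: instead of the slide's explicit "10 linear below 1000 Hz, logarithmic above" layout, this sketch spaces the filter edges equally on the mel scale using the assumed formula mel = 1127 * ln(1 + f / 700), which gives a similar near-linear-then-logarithmic spacing. The filter count (26), FFT size (512), and sampling rate (16 kHz) in the usage lines are assumptions as well.

```python
import math


def hz_to_mel(hz):
    return 1127.0 * math.log(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (math.exp(mel / 1127.0) - 1.0)


def mel_filter_bank(num_filters, fft_size, sample_rate):
    """Triangular filters over the fft_size // 2 + 1 non-negative bins."""
    num_bins = fft_size // 2 + 1
    top_mel = hz_to_mel(sample_rate / 2)
    # num_filters + 2 edge frequencies, equally spaced in mels.
    edges_hz = [
        mel_to_hz(top_mel * i / (num_filters + 1))
        for i in range(num_filters + 2)
    ]
    bin_hz = [k * sample_rate / fft_size for k in range(num_bins)]
    bank = []
    for m in range(1, num_filters + 1):
        lo, mid, hi = edges_hz[m - 1], edges_hz[m], edges_hz[m + 1]
        row = []
        for f in bin_hz:
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))   # rising slope
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))   # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def log_filter_bank_energies(power_spectrum, bank, floor=1e-10):
    """Weighted energy per filter, then log (floored to avoid log(0))."""
    return [
        math.log(max(floor, sum(w * p for w, p in zip(row, power_spectrum))))
        for row in bank
    ]


bank = mel_filter_bank(26, 512, 16000)
flat_spectrum = [1.0] * 257
log_energies = log_filter_bank_energies(flat_spectrum, bank)
print(len(bank), "filters,", len(bank[0]), "bins each")
```

Each filter sums the power in its band, so a 257-bin spectrum is reduced to one log energy per filter.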
Slide 15
Cepstrum
Separation of source and filter:
source differences are speaker dependent;
filter differences are phone dependent.
The cepstrum is the "spectrum of the log of the spectrum": the inverse DFT of the log magnitude of the DFT of the signal.
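The definition translates almost word for word into code. The sketch below uses a direct O(N^2) DFT to stay dependency-free (a real system would use an FFT), and the test signal, an impulse plus a half-amplitude echo 8 samples later, is a made-up example standing in for a periodic source.

```python
import cmath
import math


def dft(x, inverse=False):
    """Direct DFT, or inverse DFT (with 1/N scaling) if inverse=True."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [
        sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
            for t in range(n))
        for k in range(n)
    ]
    return [v / n for v in out] if inverse else out


def real_cepstrum(frame, floor=1e-10):
    """Inverse DFT of the log magnitude of the DFT of the frame."""
    spectrum = dft(frame)
    log_magnitude = [math.log(max(abs(v), floor)) for v in spectrum]
    return [v.real for v in dft(log_magnitude, inverse=True)]


# Hypothetical test signal: an impulse and a weaker copy 8 samples later.
frame = [0.0] * 64
frame[0] = 1.0
frame[8] = 0.5
cepstrum = real_cepstrum(frame)
peak = max(range(1, 32), key=lambda n: cepstrum[n])
print("largest cepstral value in 1..31 is at index", peak)
```

The repetition period of the signal appears as a peak at the matching index of the cepstrum, which is how a periodic source separates from the slowly varying filter.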
Slide 16
Cepstrum Visualization
The peak at 120 samples represents the glottal pulse, corresponding to the F0.
Large values closer to zero correspond to the vocal tract filter (tongue position, jaw opening, etc.).
It is common to take the first 12 coefficients.
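A cepstral peak position is a period in samples, so it converts to F0 by dividing the sampling rate by it. The slide does not state the sampling rate of the plotted signal; the 16 kHz used below is purely an assumption for illustration, as is the function name.

```python
def quefrency_to_f0(peak_index, sample_rate):
    """Convert a cepstral peak position (in samples) to F0 in Hz."""
    return sample_rate / peak_index


# Assumption: 16 kHz sampling rate (not given on the slide).
f0 = quefrency_to_f0(120, 16000)
print(f"F0 is about {f0:.1f} Hz")
```

Under that assumption, a peak at 120 samples corresponds to a pitch period of 7.5 ms.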
Slide 17
Deltas and Energy
Energy within a frame is just the sum of the power of the samples.
The spectrum of some phones changes over time, e.g. from stop closure to stop burst, or along the slope of a formant.
Taking the delta (velocity) and double delta (acceleration) of each feature incorporates this information.
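Both quantities are short computations. The sketch below takes frame energy as the sum of squared samples and uses the simplest delta, d[t] = (c[t+1] - c[t-1]) / 2, repeating the first and last frames at the edges. The delta formula and the edge handling are assumptions; systems often use a regression over a wider span of frames.

```python
def frame_energy(frame):
    """Sum of squared sample values in one frame."""
    return sum(s * s for s in frame)


def deltas(track):
    """Simple delta: (c[t+1] - c[t-1]) / 2, with edge frames repeated.

    track is one feature's value in each successive frame.
    """
    n = len(track)
    out = []
    for t in range(n):
        previous = track[max(t - 1, 0)]
        following = track[min(t + 1, n - 1)]
        out.append((following - previous) / 2)
    return out


energy = frame_energy([1.0, -2.0, 3.0])
track = [0.0, 1.0, 4.0, 9.0, 16.0]   # one coefficient over five frames
velocity = deltas(track)              # delta
acceleration = deltas(velocity)       # double delta
print(energy, velocity, acceleration)
```

The double delta is simply the delta applied a second time, to the delta track.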
Slide 18
Summary: MFCC
Commonly, MFCCs have 39 features:
12 cepstral coefficients
12 delta cepstral coefficients
12 delta-delta cepstral coefficients
1 energy coefficient
1 delta energy coefficient
1 delta-delta energy coefficient
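The 39 features per frame can be assembled by stacking the 13 static values (12 cepstral coefficients plus energy) with their deltas and delta-deltas. In this sketch the ordering within the vector, the simple two-point delta, and the function names are assumptions; the input frames are dummy values.

```python
def deltas(track):
    """Simple delta: (c[t+1] - c[t-1]) / 2, with edge frames repeated."""
    n = len(track)
    return [
        (track[min(t + 1, n - 1)] - track[max(t - 1, 0)]) / 2
        for t in range(n)
    ]


def stack_features(static_frames):
    """Turn 13-value frames into 39-value frames.

    Assumed layout: 13 static values, then 13 deltas, then 13 delta-deltas.
    """
    num_dims = len(static_frames[0])
    # One track per feature dimension, running across frames.
    tracks = [[frame[d] for frame in static_frames] for d in range(num_dims)]
    first = [deltas(track) for track in tracks]
    second = [deltas(track) for track in first]
    return [
        list(frame)
        + [first[d][t] for d in range(num_dims)]
        + [second[d][t] for d in range(num_dims)]
        for t, frame in enumerate(static_frames)
    ]


# Dummy input: 5 frames of 12 cepstral coefficients + 1 energy value.
static = [[float(t)] * 13 for t in range(5)]
features = stack_features(static)
print(len(features), "frames x", len(features[0]), "features")
```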
Slide 19
Next Class
Introduction to Statistical Modeling and Classification
Reading: J&M 9.4, optional 6.6