Audio Processing for Ubiquitous Computing
Uichin Lee, KAIST KSE
Audio processing apps
• Speech recognition (e.g., Google voice search)
• Situation awareness
  – Conversation detection
  – Location (environment) classification
    • E.g., home, office, bar, beach, car, street, etc.
  – Dietary intake of a person (eating habits)
  – Everyday sound logging (and event detection), as in SoundSense
Discrete representation of signal
• Represent a continuous signal in discrete form.
Time domain audio waveform
Vertical axis: amplitude (relative sound pressure); typical unit: µPa (micro-pascals). A digital signal is usually unitless, with quantized values (-32768 to 32767).
Horizontal axis: time; typical unit: msec (milliseconds); sampling rates of 8000, 16000, or 44.1K samples/sec.
Digitizing audio signal
• Sampling
  – Measuring the amplitude of the signal at time t
  – 16,000 Hz (samples/sec): microphone (“wideband”)
  – 8,000 Hz (samples/sec): telephone
  – Why?
    • Need at least 2 samples per cycle (Nyquist)
    • Max measurable frequency is half the sampling rate
    • Human speech < 10,000 Hz, so at most 20K samples/sec is needed
    • Telephone speech is filtered at 4K, so 8K is enough
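The Nyquist reasoning above can be checked with a few lines of arithmetic. This is a minimal sketch; the function names are illustrative and not from the slides, and `aliased_freq` assumes an ideal sampler applied to a single pure tone.

```python
def nyquist_rate(max_freq_hz: float) -> float:
    """Minimum sampling rate that gives two samples per cycle of max_freq_hz."""
    return 2.0 * max_freq_hz


def max_measurable_freq(sample_rate_hz: float) -> float:
    """Highest frequency that can be represented at the given sampling rate."""
    return sample_rate_hz / 2.0


def aliased_freq(freq_hz: float, sample_rate_hz: float) -> float:
    """Apparent frequency of a pure tone after sampling (ideal sampler assumed)."""
    folded = freq_hz % sample_rate_hz
    return min(folded, sample_rate_hz - folded)


telephone_rate = nyquist_rate(4000.0)           # telephone band filtered at 4 kHz
wideband_limit = max_measurable_freq(16000.0)   # 16 kHz microphone sampling
alias = aliased_freq(5000.0, 8000.0)            # a 5 kHz tone sampled at 8 kHz
```

A tone above half the sampling rate does not disappear; it folds back to a lower frequency, which is why the telephone channel is filtered at 4K before sampling at 8K.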
Digitizing audio signal
• Quantization
  – Representing the real value of each amplitude as an integer
  – 8-bit (-128 to 127) or 16-bit (-32768 to 32767)
• Formats:
  – 16-bit PCM
  – 8-bit mu-law; log compression
• LSB (Intel) vs. MSB (Sun, Apple)
  – Little-endian vs. big-endian
• Headers: meta info such as sampling rate and recording conditions
  – Raw (no header)
  – Microsoft wav
  – Sun .au
[Figure label: 40-byte header]
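A minimal sketch of the quantization, log-compression, and byte-order points above, using only the Python standard library. It assumes amplitudes normalized to [-1.0, 1.0] and mu = 255; `mu_law_8bit` is a simplified illustration of log companding, not the exact bit layout of any telephony codec.

```python
import math
import struct


def quantize_16bit(x: float) -> int:
    """Map a real amplitude in [-1.0, 1.0] to a signed 16-bit integer."""
    x = max(-1.0, min(1.0, x))
    return max(-32768, min(32767, int(round(x * 32768))))


def mu_law_compress(x: float, mu: int = 255) -> float:
    """Logarithmic companding of x in [-1, 1]; the result is also in [-1, 1]."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def mu_law_8bit(x: float, mu: int = 255) -> int:
    """Compress, then quantize to the signed 8-bit range (simplified)."""
    y = mu_law_compress(x, mu)
    return max(-128, min(127, int(round(y * 128))))


sample = quantize_16bit(0.5)
little_endian = struct.pack("<h", sample)  # least significant byte first
big_endian = struct.pack(">h", sample)     # most significant byte first
```

Log compression spends more of the 8-bit range on quiet samples: `mu_law_compress(0.1)` is well above 0.1, so small amplitudes keep more resolution than they would under linear 8-bit quantization.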
Visualization of audio signals
• What makes one phoneme, /aa/, sound different from another phoneme, /iy/?
• Different shapes of the vocal tract: /aa/ is produced with the tongue low and in the back of the mouth; /iy/ is produced with the tongue high and toward the front.
• The different shapes of the vocal tract produce different “resonant frequencies”, or frequencies at which energy in the signal is concentrated. (Simple example of resonant energy: a tuning fork may have a resonant frequency of 440 Hz, the note “A”.)
• Resonant frequencies in speech (or other sounds) can be displayed by computing a “power spectrum” or “spectrogram,” showing the power in the signal at different frequencies.
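Power spectrum
• A time-domain signal can be expressed in terms of sinusoids at a range of frequencies using the Fourier transform:

Continuous:
$$X(f) = \int_{-\infty}^{\infty} x(t)\, e^{-j 2\pi f t}\, dt = \int_{-\infty}^{\infty} x(t)\,\left[\cos(2\pi f t) - j \sin(2\pi f t)\right] dt$$

Discrete (N samples):
$$X(n) = \frac{1}{N} \sum_{k=0}^{N-1} x(k)\, e^{-j 2\pi k n / N} = \frac{1}{N} \sum_{k=0}^{N-1} x(k)\left[\cos\!\left(\frac{2\pi k n}{N}\right) - j \sin\!\left(\frac{2\pi k n}{N}\right)\right], \quad \text{for } n = 0, \ldots, N-1$$

• Power Spectral Density (PSD) shows the power distribution in the frequency domain
  – PSD = Fourier transform of the autocorrelation function of the signal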
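The discrete transform can be computed directly from its definition. The sketch below is a naive O(N^2) DFT with the 1/N normalization, written for clarity rather than speed (a real system would use an FFT). The test tone, the 8000 Hz rate, and N = 64 are illustrative assumptions. It also checks the PSD statement in its finite, circular form: the DFT of the circular autocorrelation matches the squared magnitude of the DFT.

```python
import cmath
import math


def dft(x):
    """X(n) = (1/N) * sum_k x(k) * exp(-j*2*pi*k*n/N), for n = 0..N-1."""
    N = len(x)
    return [
        sum(x[k] * cmath.exp(-2j * math.pi * k * n / N) for k in range(N)) / N
        for n in range(N)
    ]


def power_spectrum(x):
    """Squared magnitude of each DFT coefficient."""
    return [abs(c) ** 2 for c in dft(x)]


def circular_autocorrelation(x):
    """r(m) = (1/N) * sum_k x(k) * x((k + m) mod N)."""
    N = len(x)
    return [sum(x[k] * x[(k + m) % N] for k in range(N)) / N for m in range(N)]


# Illustrative input (assumed values): a 1000 Hz sine sampled at 8000 Hz.
N = 64
SAMPLE_RATE = 8000.0
tone = [math.sin(2 * math.pi * 1000.0 * k / SAMPLE_RATE) for k in range(N)]

power = power_spectrum(tone)
peak_bin = max(range(N // 2), key=lambda n: power[n])
peak_hz = peak_bin * SAMPLE_RATE / N

# PSD computed the other way: transform of the autocorrelation.
psd_from_autocorr = [c.real for c in dft(circular_autocorrelation(tone))]
```

Bin n corresponds to frequency n * SAMPLE_RATE / N, so the peak lands on the bin for 1000 Hz. Only the first N/2 bins are searched because the spectrum of a real signal is mirrored above half the sampling rate.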
Power spectrum
• The power spectrum can be plotted like this (vowel /aa/):
[Figure: time-domain amplitude (512 samples) and spectral power (dB) versus frequency, 0 Hz to 4000 Hz]
Automatic speech recognition: noisy channel analogy
• Search through space of all possible “source” sentences
• Choose the one which has the highest probability of generating the “noisy” sentence
[Figure: the source sentence “If music be the food of love..” passing through a noisy channel]
Noisy channel model
• What is the most likely sentence out of all sentences in the language L given some acoustic input O?
• Treat the acoustic input O as a sequence of individual observations:
  – O = o1, o2, o3, …, ot
• Define a sentence as a sequence of words:
  – W = w1, w2, w3, …, wn
Noisy channel model
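• Probabilistic implication: pick the sentence with the highest probability:
$$\hat{W} = \operatorname*{argmax}_{W \in L} P(W \mid O)$$
• We can use Bayes' rule to rewrite this:
$$\hat{W} = \operatorname*{argmax}_{W \in L} \frac{P(O \mid W)\, P(W)}{P(O)}$$
• Since the denominator is the same for each candidate sentence W, we can ignore it for the argmax:
$$\hat{W} = \operatorname*{argmax}_{W \in L} P(O \mid W)\, P(W)$$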
Noisy channel model
$$\hat{W} = \operatorname*{argmax}_{W \in L} \underbrace{P(O \mid W)}_{\text{likelihood}}\; \underbrace{P(W)}_{\text{prior}}$$
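A toy sketch of the argmax over candidate sentences, combining likelihood and prior in the log domain to avoid underflow. The two candidate sentences and all probabilities are made-up illustrative numbers, not outputs of a real acoustic or language model.

```python
import math

# Hypothetical candidates with invented probabilities (for illustration only).
ACOUSTIC_PROB = {                      # P(O | W): how well W explains the audio
    "if music be the food of love": 0.20,
    "if music bee the food of dove": 0.25,
}
LANGUAGE_PROB = {                      # P(W): how plausible W is as a sentence
    "if music be the food of love": 1e-6,
    "if music bee the food of dove": 1e-9,
}


def score(sentence: str) -> float:
    """log P(O | W) + log P(W); P(O) is dropped because it is the same for every W."""
    return math.log(ACOUSTIC_PROB[sentence]) + math.log(LANGUAGE_PROB[sentence])


def decode(candidates):
    """Return the candidate with the highest combined score."""
    return max(candidates, key=score)


best = decode(ACOUSTIC_PROB.keys())
acoustic_only = max(ACOUSTIC_PROB, key=ACOUSTIC_PROB.get)
```

In this toy setup the acoustically best candidate is not the final answer: the prior pulls the decision toward the sentence that is more plausible in the language.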
Noisy channel model
• Ignoring the denominator leaves us with two factors: P(Source) and P(Signal|Source)
[Figure: noisy channel with the source sentence “If music be the food of love..” as input]
Speech recognition
• Feature extraction (or signal processing):
  – The acoustic waveform is sampled into frames (usually of 10, 15, or 20 milliseconds) and transformed into spectral features (mostly MFCC)
• Acoustic model (or phone recognition):
  – Compute the likelihood of the observed spectral feature vectors given linguistic units (e.g., words, phones) [used for decoding (and also training)]
  – Example: Gaussian Mixture Model (GMM) classifier
    • For each HMM state q, corresponding to a phone or subphone, compute the likelihood p(o|q) of a feature vector o given that state
• Decoding:
  – Take: acoustic model (sequence of acoustic likelihoods) + HMM dictionary of word pronunciations + language model P(W)
  – Output: the most likely sequence of words
[Figure: speech recognizer block diagram with Feature Extraction (output O), Acoustic Model (P(O|W)), Language Model (P(W)), and Decoding (output W)]
Feature extraction
• Mel-Frequency Cepstral Coefficients (MFCC)
  – Most widely used spectral representation
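The “mel” in MFCC refers to a perceptual frequency scale. The sketch below uses one commonly cited conversion formula (2595 * log10(1 + f/700)); other variants exist, and the slides do not say which one is intended. The helper `mel_spaced_centers` is a hypothetical name showing how filter centers spaced evenly in mel become progressively wider apart in Hz.

```python
import math


def hz_to_mel(freq_hz: float) -> float:
    """One common Hz-to-mel mapping (assumed here; variants exist)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_spaced_centers(low_hz: float, high_hz: float, n_filters: int):
    """Center frequencies (Hz) of n_filters filters spaced evenly on the mel scale."""
    low_mel, high_mel = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high_mel - low_mel) / (n_filters + 1)
    return [mel_to_hz(low_mel + step * (i + 1)) for i in range(n_filters)]


# Illustrative choice: 10 filters covering 0 Hz to 4000 Hz.
centers = mel_spaced_centers(0.0, 4000.0, 10)
gaps = [b - a for a, b in zip(centers, centers[1:])]
```

The growing gaps mean low frequencies get finer resolution than high frequencies, which is the motivation for using a mel-warped spectrum before taking cepstral coefficients.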
Feature extraction
• Window size: 25 ms
• Window shift: 10 ms
• Pre-emphasis coefficient: 0.97
• MFCC:
  – 12 MFCCs (mel-frequency cepstral coefficients)
  – 1 energy feature
  – 12 delta MFCC features
  – 12 double-delta MFCC features
  – 1 delta energy feature
  – 1 double-delta energy feature
• Total: 39-dimensional features
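The front-end parameters above translate into a few lines of code. This is a minimal sketch of pre-emphasis and framing only (no windowing function, FFT, or cepstral step); the 16,000 Hz sampling rate and the one-second dummy signal are assumptions for illustration.

```python
def pre_emphasis(signal, coeff=0.97):
    """y[0] = x[0]; y[i] = x[i] - coeff * x[i-1]. Boosts high frequencies."""
    if not signal:
        return []
    return [signal[0]] + [
        signal[i] - coeff * signal[i - 1] for i in range(1, len(signal))
    ]


def frame_signal(signal, sample_rate, window_ms=25, shift_ms=10):
    """Cut the signal into overlapping frames; a trailing partial frame is dropped."""
    window = int(sample_rate * window_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    return [
        signal[start:start + window]
        for start in range(0, len(signal) - window + 1, shift)
    ]


# Feature dimensionality: (12 MFCCs + 1 energy) for static, delta, double-delta.
N_STATIC = 12 + 1
FEATURE_DIM = 3 * N_STATIC

# Illustrative input: one second of a constant dummy signal at an assumed 16 kHz.
SAMPLE_RATE = 16000
one_second = [1.0] * SAMPLE_RATE
frames = frame_signal(pre_emphasis(one_second), SAMPLE_RATE)
```

With a 25 ms window and a 10 ms shift, consecutive frames overlap by 15 ms, so each second of audio yields roughly 100 feature vectors of 39 dimensions each.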
Hidden Markov models
[Figure: a Bakis (left-to-right) network and an ergodic (fully connected) network]
Hidden Markov models: example
• State: Hot or Cold day
• Emission: number of ice creams eaten
Three basic problems for HMMs
• Problem 1 (Evaluation): Given the observation sequence O = (o1 o2 … oT) and an HMM λ = (A, B), how do we efficiently compute P(O|λ), the probability of the observation sequence given the model?
• Problem 2 (Decoding): Given the observation sequence O = (o1 o2 … oT) and an HMM λ = (A, B), how do we choose a corresponding state sequence Q = (q1 q2 … qT) that is optimal in some sense (i.e., best explains the observations)?
• Problem 3 (Learning): How do we adjust the model parameters λ = (A, B) to maximize P(O|λ)?
Evaluation
• Evaluation: how likely is the sequence 3 1 3?
• Computing the observation likelihood for a given hidden state sequence
  – Suppose we knew the weather and wanted to predict how much ice cream Jason would eat: i.e., P(3 1 3 | H H C)
• Summing over all possible hidden state sequences
  – But N states and T observations give O(N^T) combinations (intractable)
• Dynamic programming: use a table to store intermediate values (this is called the forward algorithm)
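A minimal sketch of the evaluation problem on the Hot/Cold example. The slides do not give numeric parameters, so the start, transition, and emission probabilities below are assumed illustrative values. The code computes P(3 1 3 | H H C) for a known state sequence, the brute-force sum over all N^T state sequences, and the forward algorithm, which gets the same total in O(N^2 T) time.

```python
from itertools import product

# Assumed illustrative parameters (not from the slides).
STATES = ("H", "C")
START = {"H": 0.8, "C": 0.2}
TRANS = {"H": {"H": 0.6, "C": 0.4}, "C": {"H": 0.5, "C": 0.5}}
EMIT = {"H": {1: 0.2, 2: 0.4, 3: 0.4}, "C": {1: 0.5, 2: 0.4, 3: 0.1}}


def likelihood_given_states(obs, states):
    """P(O | Q): product of emission probabilities along a known state sequence."""
    prob = 1.0
    for o, q in zip(obs, states):
        prob *= EMIT[q][o]
    return prob


def joint(obs, states):
    """P(O, Q): start, transition, and emission probabilities multiplied together."""
    prob = START[states[0]] * EMIT[states[0]][obs[0]]
    for t in range(1, len(obs)):
        prob *= TRANS[states[t - 1]][states[t]] * EMIT[states[t]][obs[t]]
    return prob


def brute_force(obs):
    """Sum P(O, Q) over all N^T state sequences (only feasible for tiny T)."""
    return sum(joint(obs, q) for q in product(STATES, repeat=len(obs)))


def forward(obs):
    """Forward algorithm: alpha[q] holds P(o_1..o_t, state at time t = q)."""
    alpha = {q: START[q] * EMIT[q][obs[0]] for q in STATES}
    for o in obs[1:]:
        alpha = {
            q: EMIT[q][o] * sum(alpha[p] * TRANS[p][q] for p in STATES)
            for q in STATES
        }
    return sum(alpha.values())


OBS = (3, 1, 3)
p_known_states = likelihood_given_states(OBS, "HHC")
p_forward = forward(OBS)
p_brute = brute_force(OBS)
```

The table the forward algorithm stores is just `alpha`, one value per state, updated once per observation; each update reuses the previous column instead of re-enumerating every path.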
Decoding
• Given
  – an observation sequence; e.g., 3 1 3
  – an HMM (N states, observed over T time steps)
• The task of the decoder is to find the best hidden state sequence
• Again, the number of possible sequences is N^T (intractable)
  – E.g., P(3 1 3 | * * *)
• Instead:
  – Viterbi algorithm (dynamic programming)
  – Uses a technique very similar to the one in Evaluation
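A minimal sketch of Viterbi decoding on the Hot/Cold example. As with any numeric version of this example, the start, transition, and emission probabilities are assumed illustrative values, since the slides do not list them. The structure mirrors the forward algorithm, with the sum over previous states replaced by a max plus a backpointer.

```python
from itertools import product

# Assumed illustrative parameters (not from the slides).
STATES = ("H", "C")
START = {"H": 0.8, "C": 0.2}
TRANS = {"H": {"H": 0.6, "C": 0.4}, "C": {"H": 0.5, "C": 0.5}}
EMIT = {"H": {1: 0.2, 2: 0.4, 3: 0.4}, "C": {1: 0.5, 2: 0.4, 3: 0.1}}


def viterbi(obs):
    """Return (best state sequence, its joint probability with obs)."""
    delta = {q: START[q] * EMIT[q][obs[0]] for q in STATES}
    backpointers = []
    for o in obs[1:]:
        new_delta, pointer = {}, {}
        for q in STATES:
            # Best predecessor of state q at this time step.
            best_prev = max(STATES, key=lambda p: delta[p] * TRANS[p][q])
            new_delta[q] = delta[best_prev] * TRANS[best_prev][q] * EMIT[q][o]
            pointer[q] = best_prev
        delta = new_delta
        backpointers.append(pointer)
    # Trace back from the best final state.
    last = max(STATES, key=lambda q: delta[q])
    path = [last]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    path.reverse()
    return path, delta[last]


def joint(obs, states):
    """P(O, Q) for one state sequence; used only for the brute-force check."""
    prob = START[states[0]] * EMIT[states[0]][obs[0]]
    for t in range(1, len(obs)):
        prob *= TRANS[states[t - 1]][states[t]] * EMIT[states[t]][obs[t]]
    return prob


OBS = (3, 1, 3)
best_path, best_prob = viterbi(OBS)
brute_path = max(product(STATES, repeat=len(OBS)), key=lambda q: joint(OBS, q))
```

For longer sequences a practical implementation would work with log probabilities, since the product of many small numbers underflows quickly.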
Digit recognition example
• Based on the lexicon, build an HMM for each digit
  – HMM parameters are trained, e.g., the likelihoods p(o|q) and the transition probabilities
• Given input observations, use Viterbi to find the best-matching digit
[Figure: lexicon]
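A toy sketch of the digit recognition idea: one left-to-right HMM per lexicon entry, each scored with Viterbi in the log domain, and the highest-scoring word wins. Everything numeric here is an assumption for illustration: observations are discrete phone-like symbols standing in for real feature vectors, and `SELF_LOOP` and `P_MATCH` are made-up constants rather than trained parameters.

```python
import math

# Toy lexicon: word -> phone sequence (one HMM state per phone).
LEXICON = {"one": ("w", "ah", "n"), "two": ("t", "uw")}
SYMBOLS = ("w", "ah", "n", "t", "uw")
SELF_LOOP = 0.5   # assumed probability of staying in the same state
P_MATCH = 0.8     # assumed probability that a state emits its own symbol
NEG_INF = float("-inf")


def emit_logprob(phone, symbol):
    """log p(o|q): P_MATCH for the state's own symbol, the rest spread evenly."""
    if symbol == phone:
        return math.log(P_MATCH)
    return math.log((1.0 - P_MATCH) / (len(SYMBOLS) - 1))


def viterbi_log_score(phones, obs):
    """Best-path log probability; the path starts in the first state and ends in the last."""
    n = len(phones)
    delta = [NEG_INF] * n
    delta[0] = emit_logprob(phones[0], obs[0])
    for o in obs[1:]:
        new_delta = [NEG_INF] * n
        for j in range(n):
            stay = delta[j] + math.log(SELF_LOOP)
            move = delta[j - 1] + math.log(1.0 - SELF_LOOP) if j > 0 else NEG_INF
            new_delta[j] = max(stay, move) + emit_logprob(phones[j], o)
        delta = new_delta
    return delta[-1]


def recognize(obs):
    """Pick the lexicon word whose HMM gives the highest Viterbi score."""
    return max(LEXICON, key=lambda word: viterbi_log_score(LEXICON[word], obs))


first = recognize(["w", "w", "ah", "ah", "n"])
second = recognize(["t", "uw", "uw"])
```

Self-loops let a state absorb several consecutive frames, which is how one model can match the same word spoken at different speeds.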
Speech recognition: summary
[Figure: summary diagram with Feature Extraction, Extracted Features, Acoustic Model, and Decoding]