Audio Processing for Ubiquitous Computing Uichin Lee KAIST KSE


Audio processing apps

• Speech recognition (e.g., Google voice search)
• Situation awareness
– Conversation detection
– Location (environment) classification
• E.g., home, office, bar, beach, car, street, etc.
– Dietary intake of a person (eating habits)
– Everyday sound logging (and event detection), as in SoundSense

Discrete representation of signal

• Represent a continuous signal in discrete form.

[Figure: time-domain audio waveform]
• Vertical axis: amplitude, relative sound pressure
– Typical unit: µPa (micropascals); a digital signal is usually unitless
– Quantization (-32768 to 32767)
• Horizontal axis: time
– Typical unit: msec (milliseconds)
– Sampling (8000, 16000, 44.1K samples/sec)

Digitizing audio signal

• Sampling
– Measuring the amplitude of the signal at time t
– 16,000 Hz (samples/sec): microphone ("wideband")
– 8,000 Hz (samples/sec): telephone
– Why?
• Need at least 2 samples per cycle (Nyquist)
• Max measurable frequency is half the sampling rate
• Human speech < 10,000 Hz, so need max 20K
• Telephone filtered at 4K, so 8K is enough
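The Nyquist argument can be checked numerically. The following is a minimal sketch, not part of the original slides: the helper names (`sample_tone`, `nyquist_frequency`, `alias_frequency`) and the 5 kHz test tone are assumptions chosen for illustration. It shows that a tone above half the sampling rate yields the same samples as a lower-frequency tone, which is why the telephone band is filtered at 4 kHz before sampling at 8 kHz.

```python
import math

def sample_tone(freq_hz, sample_rate_hz, num_samples):
    """Sample a unit-amplitude cosine of freq_hz at sample_rate_hz."""
    return [math.cos(2 * math.pi * freq_hz * n / sample_rate_hz)
            for n in range(num_samples)]

def nyquist_frequency(sample_rate_hz):
    """Highest frequency that can be represented: half the sampling rate."""
    return sample_rate_hz / 2

def alias_frequency(freq_hz, sample_rate_hz):
    """Frequency in [0, fs/2] at which a tone of freq_hz appears after sampling."""
    folded = freq_hz % sample_rate_hz
    return min(folded, sample_rate_hz - folded)

fs = 8000  # telephone sampling rate from the slide
tone_5k = sample_tone(5000, fs, 16)  # above the 4 kHz limit (illustrative choice)
tone_3k = sample_tone(3000, fs, 16)  # its alias below the limit

# The two sampled sequences are numerically indistinguishable.
max_diff = max(abs(a - b) for a, b in zip(tone_5k, tone_3k))
print("Nyquist frequency:", nyquist_frequency(fs))
print("5 kHz tone appears at:", alias_frequency(5000, fs))
print("samples identical:", max_diff < 1e-9)
```

Once sampled, the 5 kHz and 3 kHz tones cannot be told apart, so any energy above half the sampling rate has to be removed before digitizing.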

Digitizing audio signal

• Quantization
– Representing the real value of each amplitude as an integer
– 8-bit (-128 to 127) or 16-bit (-32768 to 32767)
• Formats:
– 16-bit PCM
– 8-bit mu-law; log compression
• LSB (Intel) vs. MSB (Sun, Apple)
– Little-endian vs. big-endian
• Headers: meta info such as sampling rate, recording conditions
– Raw (no header)
– Microsoft .wav
– Sun .au

[Figure label: 40-byte header]
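Quantization, log compression, and byte order can be illustrated together. This is a sketch under stated assumptions: amplitudes are normalized to [-1.0, 1.0], the function names are hypothetical, and the mu-law functions use the continuous companding formula with mu = 255 rather than the exact bit layout of any telephony codec.

```python
import math
import struct

def quantize_pcm16(x):
    """Map a real amplitude in [-1.0, 1.0] to a 16-bit integer (linear PCM)."""
    x = max(-1.0, min(1.0, x))
    return max(-32768, min(32767, int(round(x * 32767))))

def mu_law_compress(x, mu=255):
    """Logarithmic compression of an amplitude in [-1.0, 1.0]."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)

def mu_law_expand(y, mu=255):
    """Inverse of mu_law_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)

def quantize_mu_law8(x):
    """8-bit integer code (-128..127) taken after mu-law compression."""
    return max(-128, min(127, int(round(mu_law_compress(x) * 127))))

quiet = 0.01  # a low-amplitude sample (illustrative value)
print("linear 8-bit code:", int(round(quiet * 127)))
print("mu-law 8-bit code:", quantize_mu_law8(quiet))

# Byte order: the same 16-bit sample stored LSB-first and MSB-first.
sample = quantize_pcm16(0.5)
little = struct.pack("<h", sample)  # little-endian
big = struct.pack(">h", sample)     # big-endian
print(little.hex(), big.hex())
```

Log compression gives quiet samples far more distinct codes than linear 8-bit quantization does, which is the motivation for mu-law on narrow channels. The two `struct.pack` calls produce the same two bytes in opposite order, so a raw file read with the wrong endianness sounds like noise.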

Visualization of audio signals

• What makes one phoneme, /aa/, sound different from another phoneme, /iy/?

• Different shapes of the vocal tract… /aa/ is produced with the tongue low and in the back of the mouth; /iy/ is produced with the tongue high and toward the front.

• The different shapes of the vocal tract produce different “resonant frequencies”, or frequencies at which energy in the signal is concentrated. (Simple example of resonant energy: a tuning fork may have resonant frequency equal to 440 Hz or “A”)

• Resonant frequencies in speech (or other sounds) can be displayed by computing a “power spectrum” or “spectrogram,” showing the power in the signal at different frequencies

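Power spectrum

• A time-domain signal can be expressed in terms of sinusoids at a range of frequencies using the Fourier transform:

Continuous:

$$X(f) = \int_{-\infty}^{\infty} x(t)\, e^{-j 2\pi f t}\, dt = \int_{-\infty}^{\infty} x(t)\,[\cos(2\pi f t) - j \sin(2\pi f t)]\, dt$$

Discrete (N samples):

$$X(n) = \frac{1}{N} \sum_{k=0}^{N-1} x(k)\, e^{-j 2\pi k n / N} = \frac{1}{N} \sum_{k=0}^{N-1} x(k) \left[\cos\!\left(\frac{2\pi k n}{N}\right) - j \sin\!\left(\frac{2\pi k n}{N}\right)\right], \quad n = 0, 1, \ldots, N-1$$

• Power Spectral Density (PSD) shows the power distribution in the frequency domain
– PSD = Fourier transform of the autocorrelation function of the signal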
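The discrete transform can be computed directly from its definition. The sketch below is illustrative only: it assumes the negative-exponent convention with 1/N scaling, uses a hypothetical 1000 Hz test tone sampled at 8000 Hz, and evaluates the sum in O(N^2) time rather than with a fast Fourier transform.

```python
import cmath
import math

def dft(x):
    """X(n) = (1/N) * sum_k x(k) * exp(-j*2*pi*k*n/N) for n = 0..N-1."""
    N = len(x)
    return [sum(x[k] * cmath.exp(-2j * math.pi * k * n / N) for k in range(N)) / N
            for n in range(N)]

def power_spectrum_db(x, floor=1e-12):
    """Power |X(n)|^2 in dB for bins 0..N/2 (the non-redundant half)."""
    X = dft(x)
    half = len(x) // 2
    return [10 * math.log10(max(abs(X[n]) ** 2, floor)) for n in range(half + 1)]

fs = 8000   # sampling rate in Hz (assumed for this example)
N = 64      # number of samples
f0 = 1000   # test tone; falls exactly on bin f0 * N / fs = 8

signal = [math.sin(2 * math.pi * f0 * k / fs) for k in range(N)]
spectrum = power_spectrum_db(signal)

# The bin with the most power, converted back to Hz.
peak_bin = max(range(len(spectrum)), key=lambda n: spectrum[n])
peak_hz = peak_bin * fs / N
print("peak at", peak_hz, "Hz")
```

Bin n corresponds to frequency n * fs / N, so the frequency axis of a power spectrum plot runs from 0 Hz to fs / 2.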

Power spectrum

• The power spectrum can be plotted like this (vowel /aa/):

[Figure: time-domain waveform (amplitude) and its power spectrum (spectral power in dB, 512 samples); frequency axis from 0 Hz to 4000 Hz]

Automatic speech recognition: noisy channel analogy

• Search through space of all possible “source” sentences

• Choose the one which has the highest probability of generating the “noisy” sentence

[Figure: noisy channel; example source sentence: "If music be the food of love.."]

Noisy channel model

• What is the most likely sentence out of all sentences in the language L given some acoustic input O?

• Treat the acoustic input O as a sequence of individual observations
– O = o1, o2, o3, …, ot

• Define a sentence as a sequence of words:
– W = w1, w2, w3, …, wn

Noisy channel model

• Probabilistic implication: pick the sentence with the highest probability:

$$\hat{W} = \arg\max_{W \in L} P(W \mid O)$$

• We can use Bayes' rule to rewrite this:

$$\hat{W} = \arg\max_{W \in L} \frac{P(O \mid W)\, P(W)}{P(O)}$$

• Since the denominator is the same for each candidate sentence W, we can ignore it for the argmax:

$$\hat{W} = \arg\max_{W \in L} P(O \mid W)\, P(W)$$

Noisy channel model

$$\hat{W} = \arg\max_{W \in L} \underbrace{P(O \mid W)}_{\text{likelihood}}\; \underbrace{P(W)}_{\text{prior}}$$

Noisy channel model

• Ignoring the denominator leaves us with two factors: P(Source) and P(Signal|Source)
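A toy decoder makes the role of the two factors concrete. This is only a sketch: the candidate sentences and every probability below are invented for illustration, and a real recognizer searches a far larger space than three hand-listed candidates.

```python
def decode(candidates):
    """Return the sentence W that maximizes P(O|W) * P(W).

    candidates maps each sentence to a pair (likelihood P(O|W), prior P(W)).
    """
    return max(candidates, key=lambda w: candidates[w][0] * candidates[w][1])

def posterior(candidates):
    """P(W|O) = P(O|W) P(W) / P(O), with P(O) summed over the candidates."""
    p_o = sum(lik * prior for lik, prior in candidates.values())
    return {w: lik * prior / p_o for w, (lik, prior) in candidates.items()}

# Hypothetical acoustic likelihoods and language-model priors.
candidates = {
    "if music be the food of love": (0.020, 0.0010),
    "if music bee the food of love": (0.025, 0.0001),
    "if muse sick be the food of love": (0.022, 0.00001),
}

best = decode(candidates)
post = posterior(candidates)
print(best)
```

Dividing by P(O) rescales every candidate by the same constant, so the sentence with the largest posterior is the same one `decode` returns. In this example the second candidate has the highest acoustic likelihood, but the prior from the language model still favors the first.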


Speech recognition

• Feature extraction (or signal processing):
– The acoustic waveform is sampled into frames (usually of 10, 15, or 20 milliseconds) and transformed into spectral features (mostly MFCC)
• Acoustic model (or phone recognition):
– Compute the likelihood of the observed spectral feature vectors given linguistic units (e.g., words, phones) [used for decoding (and also training)]
– Example: Gaussian Mixture Model (GMM) classifier
• For each HMM state q, corresponding to a phone or subphone, compute the likelihood p(o|q) of a given feature vector o
• Decoding:
– Take: acoustic model (sequence of acoustic likelihoods) + HMM dictionary of word pronunciations + language model P(W)
– Output: the most likely sequence of words
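The acoustic likelihood p(o|q) for a GMM can be sketched as follows. The diagonal covariance, the two-dimensional feature vectors, and all weights, means, and variances are assumptions made for this example; trained models would use 39-dimensional features and many more mixture components.

```python
import math

def log_gaussian_diag(o, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector o."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
               for x, m, v in zip(o, mean, var))

def gmm_log_likelihood(o, weights, means, variances):
    """log p(o|q) = log sum_m w_m * N(o; mean_m, var_m), using log-sum-exp."""
    terms = [math.log(w) + log_gaussian_diag(o, mu, var)
             for w, mu, var in zip(weights, means, variances)]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

# Hypothetical GMMs for two HMM states: (weights, means, variances).
state_gmms = {
    "aa": ([0.6, 0.4], [[1.0, 0.0], [1.5, 0.5]], [[0.5, 0.5], [0.5, 0.5]]),
    "iy": ([1.0], [[-2.0, 2.0]], [[1.0, 1.0]]),
}

o = [1.1, 0.1]  # one observed feature vector (made up)
scores = {q: gmm_log_likelihood(o, *params) for q, params in state_gmms.items()}
best_state = max(scores, key=scores.get)
print(best_state)
```

Working in the log domain avoids numerical underflow when many frame likelihoods are multiplied together during decoding.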

[Figure: speech recognizer block diagram. Blocks: Feature Extraction, Acoustic Model, Language Model, Decoding. Labels: O, P(O|W), P(W), W]

Feature extraction

• Mel-Frequency Cepstral Coefficient (MFCC)
– Most widely used spectral representation

Feature extraction

• Window size: 25 ms
• Window shift: 10 ms
• Pre-emphasis coefficient: 0.97
• MFCC:
– 12 MFCC (mel frequency cepstral coefficients)
– 1 energy feature
– 12 delta MFCC features
– 12 double-delta MFCC features
– 1 delta energy feature
– 1 double-delta energy feature

• Total 39-dimensional features
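The framing parameters translate into sample counts as sketched below. This covers only pre-emphasis, framing, and the feature count, not the MFCC computation itself. The 16 kHz sampling rate, the first-order filter form y[n] = x[n] - 0.97 x[n-1], and the function names are assumptions for illustration.

```python
def pre_emphasis(x, coeff=0.97):
    """First-order high-pass: y[n] = x[n] - coeff * x[n-1]; first sample unchanged."""
    if not x:
        return []
    return [x[0]] + [x[n] - coeff * x[n - 1] for n in range(1, len(x))]

def frame_signal(x, sample_rate_hz, window_ms=25, shift_ms=10):
    """Split x into overlapping frames; a trailing partial frame is dropped."""
    window = sample_rate_hz * window_ms // 1000
    shift = sample_rate_hz * shift_ms // 1000
    return [x[start:start + window]
            for start in range(0, len(x) - window + 1, shift)]

fs = 16000                  # assumed sampling rate
one_second = [0.0] * fs     # placeholder audio: one second of silence
frames = frame_signal(pre_emphasis(one_second), fs)
print(len(frames[0]), "samples per frame,", len(frames), "frames per second")

# Feature vector layout from the list above.
feature_dims = {
    "mfcc": 12, "energy": 1,
    "delta_mfcc": 12, "double_delta_mfcc": 12,
    "delta_energy": 1, "double_delta_energy": 1,
}
total_dims = sum(feature_dims.values())
print(total_dims, "features per frame")
```

Because the 10 ms shift is shorter than the 25 ms window, consecutive frames overlap by 15 ms, and each frame yields one feature vector.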

Hidden Markov models

[Figure: HMM topologies: Bakis (left-to-right) network; ergodic (fully-connected) network]

Hidden Markov models: example

• State: Hot or Cold day
• Emission: # of ice creams

Three basic problems for HMMs

• Problem 1 (Evaluation): Given the observation sequence O = (o1 o2 … oT) and an HMM model λ = (A, B), how do we efficiently compute P(O|λ), the probability of the observation sequence given the model?

• Problem 2 (Decoding): Given the observation sequence O = (o1 o2 … oT) and an HMM model λ = (A, B), how do we choose a corresponding state sequence Q = (q1 q2 … qT) that is optimal in some sense (i.e., best explains the observations)?

• Problem 3 (Learning): How do we adjust the model parameters λ = (A, B) to maximize P(O|λ)?

Evaluation

• Evaluation: how likely is the sequence 3 1 3?
• Computing the observation likelihood for a given hidden state sequence
– Suppose we knew the weather and wanted to predict how much ice cream Jason would eat: i.e., P(3 1 3 | H H C)
• Summing over all possible hidden state sequences
– But N states and T observations give O(N^T) combinations (intractable)
• Dynamic programming: use a table to store intermediate values (called the forward algorithm)
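The evaluation steps can be sketched for the ice cream example. The start, transition, and emission probabilities below are assumed values chosen for illustration (the slides do not list them), and the start distribution is an extra assumption beyond λ = (A, B). The brute-force function is included only to confirm that the forward algorithm computes the same sum.

```python
from itertools import product

STATES = ("H", "C")
# Assumed illustrative parameters, not taken from the slides.
START = {"H": 0.8, "C": 0.2}
TRANS = {"H": {"H": 0.6, "C": 0.4}, "C": {"H": 0.5, "C": 0.5}}
EMIT = {"H": {1: 0.2, 2: 0.4, 3: 0.4}, "C": {1: 0.5, 2: 0.4, 3: 0.1}}

def likelihood_given_states(obs, states):
    """P(O | Q): product of emission probabilities along a known state path."""
    p = 1.0
    for o, q in zip(obs, states):
        p *= EMIT[q][o]
    return p

def forward(obs):
    """P(O | lambda) in O(N^2 * T) time, keeping one trellis column at a time."""
    alpha = {q: START[q] * EMIT[q][obs[0]] for q in STATES}
    for o in obs[1:]:
        alpha = {q: sum(alpha[p] * TRANS[p][q] for p in STATES) * EMIT[q][o]
                 for q in STATES}
    return sum(alpha.values())

def brute_force(obs):
    """Same quantity by summing over all N^T state sequences."""
    total = 0.0
    for path in product(STATES, repeat=len(obs)):
        p = START[path[0]] * EMIT[path[0]][obs[0]]
        for prev, cur, o in zip(path, path[1:], obs[1:]):
            p *= TRANS[prev][cur] * EMIT[cur][o]
        total += p
    return total

obs = (3, 1, 3)
print(likelihood_given_states(obs, "HHC"))
print(forward(obs), brute_force(obs))
```

The forward pass touches N^2 transitions per time step, while the brute-force sum enumerates N^T paths, which is feasible here only because N = 2 and T = 3.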

Decoding

• Given
– an observation sequence; e.g., 3 1 3
– an HMM with N states (the observation sequence has length T)

• The task of the decoder is to find the best hidden state sequence

• Again, the number of possible state sequences is N^T (intractable)
– E.g., P(3 1 3 | * * *)

• Instead:
– Viterbi algorithm (dynamic programming)
– Uses a technique very similar to the one in Evaluation
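A minimal Viterbi sketch for the same ice cream setting follows. As before, the start, transition, and emission probabilities are assumed values for illustration only. The structure mirrors the forward algorithm, with the sum over previous states replaced by a max and backpointers kept to recover the path.

```python
from itertools import product

STATES = ("H", "C")
# Assumed illustrative parameters, not taken from the slides.
START = {"H": 0.8, "C": 0.2}
TRANS = {"H": {"H": 0.6, "C": 0.4}, "C": {"H": 0.5, "C": 0.5}}
EMIT = {"H": {1: 0.2, 2: 0.4, 3: 0.4}, "C": {1: 0.5, 2: 0.4, 3: 0.1}}

def path_probability(obs, path):
    """Joint probability P(O, Q) of an observation sequence and a state path."""
    p = START[path[0]] * EMIT[path[0]][obs[0]]
    for prev, cur, o in zip(path, path[1:], obs[1:]):
        p *= TRANS[prev][cur] * EMIT[cur][o]
    return p

def viterbi(obs):
    """Return (best state path, its joint probability) by dynamic programming."""
    delta = {q: START[q] * EMIT[q][obs[0]] for q in STATES}
    backpointers = []
    for o in obs[1:]:
        step, new_delta = {}, {}
        for q in STATES:
            # Best predecessor of state q at this time step.
            best_prev = max(STATES, key=lambda p: delta[p] * TRANS[p][q])
            step[q] = best_prev
            new_delta[q] = delta[best_prev] * TRANS[best_prev][q] * EMIT[q][o]
        backpointers.append(step)
        delta = new_delta
    last = max(STATES, key=lambda q: delta[q])
    path = [last]
    for step in reversed(backpointers):
        path.append(step[path[-1]])
    path.reverse()
    return path, delta[last]

obs = (3, 1, 3)
best_path, best_prob = viterbi(obs)
print(best_path, best_prob)
```

The best path need not pass through the individually most likely state at each time step, because it is chosen by the joint probability of the whole sequence.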

Digit recognition example

• Based on the lexicon, build an HMM for each digit
– HMM parameters are trained, e.g., emission likelihoods p(o|q) and transition probabilities

• Given input observations, use Viterbi to find the best matching digit

[Figure: lexicon]
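The per-digit scoring idea can be sketched with toy models. Everything specific here is hypothetical: the two-entry lexicon, the discrete phone-like observation symbols (standing in for real feature vectors), the self-loop probability, and the emission probabilities are hand-set rather than trained.

```python
ALPHABET = ("w", "ah", "n", "t", "uw")
# Hypothetical lexicon: digit -> phone sequence.
LEXICON = {"one": ("w", "ah", "n"), "two": ("t", "uw")}

def make_word_hmm(phones, self_loop=0.5, p_correct=0.8):
    """Left-to-right HMM with one state per phone and hand-set parameters."""
    n = len(phones)
    trans = [[0.0] * n for _ in range(n)]
    for i in range(n):
        trans[i][i] = self_loop
        if i + 1 < n:
            trans[i][i + 1] = 1.0 - self_loop
    p_wrong = (1.0 - p_correct) / (len(ALPHABET) - 1)
    emit = [{s: (p_correct if s == ph else p_wrong) for s in ALPHABET}
            for ph in phones]
    return {"trans": trans, "emit": emit}

def viterbi_score(model, obs):
    """Probability of the best path that starts in state 0 and ends in the last state."""
    trans, emit = model["trans"], model["emit"]
    n = len(emit)
    delta = [emit[0][obs[0]]] + [0.0] * (n - 1)
    for o in obs[1:]:
        delta = [max(delta[p] * trans[p][q] for p in range(n)) * emit[q][o]
                 for q in range(n)]
    return delta[-1]

MODELS = {digit: make_word_hmm(phones) for digit, phones in LEXICON.items()}

def recognize(obs):
    """Pick the digit whose HMM gives the highest Viterbi score."""
    return max(MODELS, key=lambda d: viterbi_score(MODELS[d], obs))

print(recognize(["w", "w", "ah", "ah", "n"]))
print(recognize(["t", "uw", "uw"]))
```

Self-loops let each phone state absorb a variable number of frames, which is how a left-to-right model handles differences in speaking rate.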

Speech recognition: summary

[Figure: speech recognition pipeline. Blocks: Feature Extraction, Acoustic Model, Decoding. Label: Extracted Features]
