
Audio and (Multimedia) Methods for Video Analysis

Dr. Gerald Friedland, fractor@icsi.berkeley.edu

CS-298 Seminar, Fall 2011

1

More on Sound

2

• Responses to Email Questions
• Useful Filters
• More Features
• Typical Machine Learning on Audio
• More on Tools

Q&A

3

Why didn’t you talk about Loudspeakers?

Loudspeaker Characteristics

Frequency dependent! (As are the response patterns of Microphones!)

Q&A

6

- Speakers don’t really help with video analysis...

- Exception: Multichannel!

Q&A

7

How is the directivity (polar) pattern of a microphone generated?

- According to IEC 60268-4

Q&A

8

Does lossy compression influence audio analysis?

- Yes, but the degree is unknown.
- Common sense: it’s better to train your models on data that went through the same compression algorithm.

Q&A

9

Why is the CD sampling rate 44100Hz and not just 44000?

- Because 44100 is divisible by both 25 video frames per second (PAL) and 30 fps (NTSC); 44000 is not divisible by 30.
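The arithmetic is easy to check. (The historical aside in the comment is the commonly given fuller explanation: early CD masters were stored on video recorders.)

```python
# 44100 is divisible by both common video frame rates; 44000 is not.
# Commonly cited background: 44100 = 3 * 245 * 60 (NTSC) = 3 * 294 * 50 (PAL)
# samples per second fit exactly into the active lines of a video recording.
for rate in (44100, 44000):
    print(rate, "PAL:", rate % 25 == 0, "NTSC:", rate % 30 == 0)
```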

Useful Filters

10

• Low-, High-, Band-pass, Graphic Equalizer
• Dynamic Range Compression
• Filter by Example: Spectral Subtraction, Wiener Filter
• Handling Multiple Channels

Remember: Fourier Transform

11

DFT

iDFT

More Information: “Introduction to Multimedia Computing”, Draft Chapter 8

Remember: Fast Fourier Transform

12

More Information: “Introduction to Multimedia Computing”, Draft Chapter 8

// Input: An array X of N samples. N must be an integer power of two.
// s is the stride.
// Output: The Discrete Fourier Transform of X in the array Y.
FFT(X, N, s)
    IF (N == 1) THEN
        Y[0] := X[0]  // Trivial size-1 case: sample = DC coefficient
    ELSE
        Y[0,...,N/2-1] := FFT(X, N/2, 2*s)             // DFT of (X[0], X[2s], X[4s], ...)
        Y[N/2,...,N-1] := FFT(X[s,...,N-1], N/2, 2*s)  // DFT of (X[s], X[s+2s], X[s+4s], ...)
        FOR k := 0 TO N/2-1  // Combine the two halves into one
            t := Y[k]
            Y[k]     := t + exp(-2*pi*i*k/N) * Y[k+N/2]
            Y[k+N/2] := t - exp(-2*pi*i*k/N) * Y[k+N/2]
    FFT := Y
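The pseudocode above translates directly into runnable Python (a sketch using list slicing instead of an explicit stride argument):

```python
import cmath

def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]          # trivial size-1 case: sample = DC coefficient
    even = fft(x[0::2])                 # DFT of the even-indexed samples
    odd = fft(x[1::2])                  # DFT of the odd-indexed samples
    y = [0j] * n
    for k in range(n // 2):             # combine the two halves into one
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        y[k] = even[k] + t
        y[k + n // 2] = even[k] - t
    return y
```

For example, `fft([1, 1, 1, 1])` yields `[4, 0, 0, 0]` up to rounding: a constant signal puts all its energy in the DC bin.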

Remember: Convolution Theorem

13

More Information: “Introduction to Multimedia Computing”, Draft Chapter 8

Let f and g be two functions and f * g their convolution. Then:

F{f * g} = F{f} · F{g}

where F denotes the Fourier transform: convolution in the time domain becomes pointwise multiplication in the frequency domain.
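A quick numerical check of the theorem (NumPy sketch; the length-8 random vectors and circular convolution are just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.standard_normal(8)
g = rng.standard_normal(8)

# Circular convolution computed directly from the definition...
direct = np.array([sum(f[m] * g[(n - m) % 8] for m in range(8)) for n in range(8)])
# ...and via the convolution theorem: multiply the spectra, transform back.
via_fft = np.fft.ifft(np.fft.fft(f) * np.fft.fft(g)).real

print(np.allclose(direct, via_fft))
```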

Bandpass Filters

14

Dynamic Range Compression

15

Dynamic Range Compression

16

Dark blue: Original, Light Blue: Dynamic Range Compressed
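A dynamic range compressor attenuates the loud parts of a signal relative to the quiet parts. A minimal hard-knee sketch; the threshold and ratio values are illustrative, and real compressors add attack/release smoothing:

```python
import numpy as np

def compress(x, threshold=0.5, ratio=4.0):
    """Scale the portion of each sample's amplitude above `threshold` down by `ratio`."""
    mag = np.abs(x)
    over = mag > threshold
    out = x.astype(float).copy()
    out[over] = np.sign(x[over]) * (threshold + (mag[over] - threshold) / ratio)
    return out
```

For instance, `compress(np.array([0.2, 0.9, -1.0]))` leaves 0.2 untouched and maps 0.9 to 0.6 and -1.0 to -0.625.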

Filter by Example: Spectral Subtraction

17

Dark blue: Original with humming noise, Light blue: Corrected using spectral subtraction of humming.
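Spectral subtraction needs an example of the noise alone (here, a noise-only stretch of the recording). A frame-by-frame sketch, with the frame size and the zero floor as illustrative choices:

```python
import numpy as np

def spectral_subtraction(noisy, noise_example, frame=256):
    """Subtract the noise magnitude spectrum from each frame, keep the noisy phase."""
    noise_mag = np.abs(np.fft.rfft(noise_example[:frame]))
    out = np.zeros_like(noisy, dtype=float)
    for i in range(0, len(noisy) - frame + 1, frame):
        spec = np.fft.rfft(noisy[i:i + frame])
        mag = np.maximum(np.abs(spec) - noise_mag, 0.0)   # floor negative magnitudes at zero
        out[i:i + frame] = np.fft.irfft(mag * np.exp(1j * np.angle(spec)), frame)
    return out
```

Real implementations overlap the frames and over-subtract slightly to suppress "musical noise" artifacts.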

Filter by Example: Wiener Filter

18

Assume:

ŝ(t) = g(t) * (s(t) + n(t))

where:
• s(t) is the original signal,
• n(t) is the noise footprint,
• ŝ(t) is the estimated signal, and
• g(t) is the Wiener filter.

Filter by Example: Wiener Filter

19

Define the error:

e(α) = s(t + α) − ŝ(t + α)

with α being the delay. Thus:

e²(α) = (s(t + α) − ŝ(t + α))²  (squared error)


Filter by Example: Wiener Filter

21

Results in:

G(f) = X(f) / (X(f) + N(f))

where:
• X(f) is the power spectrum of the signal s(t),
• N(f) is the power spectrum of the noise, and
• G(f) is the frequency-space Wiener filter.

Problem: Do we know N(f)?
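If the noise spectrum is known (in practice it is often estimated from stretches of the recording that contain only noise), applying the filter is a few lines. A sketch assuming both power spectra are given per rfft bin:

```python
import numpy as np

def wiener(noisy, signal_power, noise_power):
    """Frequency-domain Wiener filter G = X / (X + N) applied to one signal block."""
    G = signal_power / (signal_power + noise_power)
    return np.fft.irfft(G * np.fft.rfft(noisy), len(noisy))
```

With zero noise power, G = 1 everywhere and the signal passes unchanged; with overwhelming noise power, G ≈ 0 and the output is muted.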

Multichannel Audio

22

• Microphone array recordings are popular with the speech community.
• Stereo has been popular since the 60s.
• Most cameras have stereo microphones.
• Most DVDs are encoded in Dolby Digital 5.1, with Dolby Surround on the stereo track.

Dolby Digital

23

Source: http://www.aspecttv.co.uk/surround-sound-setup/

Dolby Surround

24

How to Cope with Multiple Channels?

25

• Treat them individually (different channel = different content)
• Combine them, ideally while reducing noise
• Cross-infer between channels
• Any combination of the above


Combining Channels

27

• Beamforming, two major techniques:
  - Delay-Sum
  - Frequency-Domain Beamformer

Delay-Sum Beamforming

28

More Information: http://www.xavieranguera.com/phdthesis/

Delay-Sum Beamforming

29

Output:
- Sum of all channels (lower noise)
- Delay between microphones (directional signal)
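Given the delays (in practice estimated by cross-correlating the channels, as BeamformIt does), the sum itself is simple. A sketch assuming integer sample delays:

```python
import numpy as np

def delay_and_sum(channels, delays):
    """Shift each channel by its delay in samples, then average.
    The coherent source adds up; uncorrelated noise partially cancels."""
    shifted = [np.roll(ch, -d) for ch, d in zip(channels, delays)]
    return np.mean(shifted, axis=0)
```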

More on Features

30

• Mel-Frequency Cepstral Coefficients (MFCC)

Other (not explained here):
• LPC (Linear Prediction Coefficients)
• PLP (Perceptual Linear Predictive) features
• RASTA (see Morgan et al.)
• MSG (Modulation Spectrogram)

MFCC: Idea

31

MFCC computes the power cepstrum of the signal:

Audio Signal → Pre-emphasis → Windowing → FFT → Mel-Scale Filterbank → Log-Scale → DCT → MFCC
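The pipeline can be sketched end to end for a single frame. All parameter values below (0.97 pre-emphasis, 26 filters, 13 coefficients) are common defaults, not prescribed here:

```python
import numpy as np

def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, sr=16000, n_filters=26, n_coeffs=13):
    """Pre-emphasis -> windowing -> FFT -> mel filterbank -> log -> DCT."""
    n = len(frame)
    x = np.append(frame[0], frame[1:] - 0.97 * frame[:-1])   # pre-emphasis
    x = x * np.hamming(n)                                    # windowing
    power = np.abs(np.fft.rfft(x)) ** 2                      # FFT power spectrum

    # Triangular filters spaced evenly on the mel scale
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n + 1) * mel_to_hz(mels) / sr).astype(int)
    fbank = np.zeros((n_filters, len(power)))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    log_energy = np.log(fbank @ power + 1e-10)               # log-scale mel energies
    k = np.arange(n_filters)
    # DCT-II of the log energies; keep the first n_coeffs coefficients
    return np.array([np.sum(log_energy * np.cos(np.pi * q * (2 * k + 1) / (2 * n_filters)))
                     for q in range(n_coeffs)])
```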

MFCC: Mel Scale

32

MFCC: Result

33

MFCC Variants and Derivatives

34

Derivatives:
• LFCC (no mel scale)
• AMFCC (anti-mel scale)

Parameters:
• MFCC12 (often used for ASR)
• MFCC19 (often used in speaker ID, diarization)
• “delta”: differences between consecutive coefficient frames (“first derivative”)
• “delta-delta”: “second derivative”
• Short-term: usually calculated on a 10–50 ms window

Typical Machine Learning for Audio Analysis

35

•Gaussian Mixture Models (today)

•HMMs•Bayesian Information Criterion•Supervector Approaches

Reminder: K-Means

36

Choose k initial means µi at random
loop
    for all samples xj: assign xj to the closest mean µi
    for all means µi: calculate a new µi by averaging all samples xj assigned to it
until the means µi are no longer updated significantly

Algorithm Outline (Expectation Maximization)
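The outline above in runnable form (NumPy sketch; samples are rows of `x`):

```python
import numpy as np

def kmeans(x, k, iters=100, seed=0):
    """K-means via alternating assignment and mean updates."""
    rng = np.random.default_rng(seed)
    means = x[rng.choice(len(x), k, replace=False)]     # k initial means at random
    labels = np.zeros(len(x), dtype=int)
    for _ in range(iters):
        # assign each sample to its closest mean
        labels = np.argmin(np.linalg.norm(x[:, None] - means[None], axis=-1), axis=1)
        # recompute each mean by averaging the samples assigned to it
        new = np.array([x[labels == j].mean(axis=0) if np.any(labels == j)
                        else means[j] for j in range(k)])
        if np.allclose(new, means):                     # means no longer updated: stop
            break
        means = new
    return means, labels
```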

Reminder: Gaussian Mixtures

37

Reminder: Training of Mixture Models

38

Goal: Find the weights ai for

p(x) = Σi ai · N(x | µi, Σi)

Expectation: compute the responsibility of component i for each sample xj:

γij = ai N(xj | µi, Σi) / Σk ak N(xj | µk, Σk)

Maximization: re-estimate the parameters from the responsibilities:

ai = (1/n) Σj γij,   µi = Σj γij xj / Σj γij,   Σi = Σj γij (xj − µi)(xj − µi)ᵀ / Σj γij
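The two steps can be sketched for a one-dimensional mixture (a toy setup; real audio systems use multivariate Gaussians over MFCC vectors):

```python
import numpy as np

def em_step(x, a, mu, var):
    """One EM iteration for a 1-D Gaussian mixture model."""
    # Expectation: responsibility gamma[j, i] of component i for sample x[j]
    dens = a * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    gamma = dens / dens.sum(axis=1, keepdims=True)
    # Maximization: re-estimate weights, means, and variances
    nk = gamma.sum(axis=0)
    a = nk / len(x)
    mu = (gamma * x[:, None]).sum(axis=0) / nk
    var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return a, mu, var
```

Iterating from a rough initialization pulls the component means onto the clusters in the data.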

Magic Duo

39

~ 90% of audio papers use the combination of MFCCs and Gaussian Mixture Models to model audio signals!

More Tools: HTK

40

Open Source Speech Recognition Toolkit: http://htk.eng.cam.ac.uk/

More Tools: A nice collection of them

41

Princeton SoundLab: http://soundlab.cs.princeton.edu/software/

More Tools: BeamformIT

42

Open Source Beamformer: http://beamformit.sourceforge.net/

Next Week

43

•High-Level Design of Typical Audio Analytics

•Measuring Error•Multimedia

Organizing the Presentations

44

1) Shaunack Chatterjee
2) David Sun
3) David Sheffield
4) Armin Samii
5) Jonathan Kolker
6) Venkye Enkambaram
7) Tim Althoff
8) Giulia Fanti
9) Shuo-Yiin Chang
10) Ekatarina Gonina
11) Jaeyoung Choi

Recommended