44
Audio and (Multimedia) Methods for Video Analysis Dr. Gerald Friedland, [email protected] CS-298 Seminar Fall 2011 1

Audio and (Multimedia) Methods for Video Analysisfractor/fall2011/cs298-3.pdf · Audio and (Multimedia) Methods for Video Analysis Dr. Gerald Friedland, [email protected]

  • Upload
    others

  • View
    4

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Audio and (Multimedia) Methods for Video Analysisfractor/fall2011/cs298-3.pdf · Audio and (Multimedia) Methods for Video Analysis Dr. Gerald Friedland, fractor@icsi.berkeley.edu

Audio and (Multimedia) Methods for Video Analysis

Dr. Gerald Friedland, [email protected]

CS-298 Seminar Fall 2011•

1

Page 2: Audio and (Multimedia) Methods for Video Analysisfractor/fall2011/cs298-3.pdf · Audio and (Multimedia) Methods for Video Analysis Dr. Gerald Friedland, fractor@icsi.berkeley.edu

More on Sound

2

•Responses to Email Questions•Useful Filters•More Features•Typical Machine Learning on

Audio•More on Tools

Page 3: Audio and (Multimedia) Methods for Video Analysisfractor/fall2011/cs298-3.pdf · Audio and (Multimedia) Methods for Video Analysis Dr. Gerald Friedland, fractor@icsi.berkeley.edu

Q&A

3

Why didn’t you talk about Loudspeakers?

Page 5: Audio and (Multimedia) Methods for Video Analysisfractor/fall2011/cs298-3.pdf · Audio and (Multimedia) Methods for Video Analysis Dr. Gerald Friedland, fractor@icsi.berkeley.edu

Loudspeaker Characteristics

Frequency dependent! (As are the response patterns of Microphones!)

Page 6: Audio and (Multimedia) Methods for Video Analysisfractor/fall2011/cs298-3.pdf · Audio and (Multimedia) Methods for Video Analysis Dr. Gerald Friedland, fractor@icsi.berkeley.edu

Q&A

6

- Speakers don’t really help with video analysis...

- Exception: Multichannel!

Page 7: Audio and (Multimedia) Methods for Video Analysisfractor/fall2011/cs298-3.pdf · Audio and (Multimedia) Methods for Video Analysis Dr. Gerald Friedland, fractor@icsi.berkeley.edu

Q&A

7

How is the directionality graph for Microphones generated?

- According to IEC 60268-4

Page 8: Audio and (Multimedia) Methods for Video Analysisfractor/fall2011/cs298-3.pdf · Audio and (Multimedia) Methods for Video Analysis Dr. Gerald Friedland, fractor@icsi.berkeley.edu

Q&A

8

Does lossy compression influence audio analysis?

- Yes, but degree unknown- Common Sense: It’s better to train your models based on the same data compression algorithm.

Page 9: Audio and (Multimedia) Methods for Video Analysisfractor/fall2011/cs298-3.pdf · Audio and (Multimedia) Methods for Video Analysis Dr. Gerald Friedland, fractor@icsi.berkeley.edu

Q&A

9

Why is the CD sampling rate 44100Hz and not just 44000?

- Because it is dividable by both 25 video frames per second (PAL) and 30 fps (NTSC)

Page 10: Audio and (Multimedia) Methods for Video Analysisfractor/fall2011/cs298-3.pdf · Audio and (Multimedia) Methods for Video Analysis Dr. Gerald Friedland, fractor@icsi.berkeley.edu

Useful Filters

10

•Low-,High-, Band-pass, Graphic Equalizer•Dynamic Range Compression•Filter by Example: Spectral Subtraction, Wiener Filter•Handling Multiple Channels

Page 11: Audio and (Multimedia) Methods for Video Analysisfractor/fall2011/cs298-3.pdf · Audio and (Multimedia) Methods for Video Analysis Dr. Gerald Friedland, fractor@icsi.berkeley.edu

Remember: Fourier Transform

11

DFT

iDFT

More Information: “Introduction to Multimedia Computing”, Draft Chapter 8

Page 12: Audio and (Multimedia) Methods for Video Analysisfractor/fall2011/cs298-3.pdf · Audio and (Multimedia) Methods for Video Analysis Dr. Gerald Friedland, fractor@icsi.berkeley.edu

Remember: Fast Fourier Transform

12

More Information: “Introduction to Multimedia Computing”, Draft Chapter 8

// Input: A number N of samples X. N must be an integer power of two.// s the stride// Output: The Discrete Fourier Transform of X in the array Y.FFT(X, N, s)

IF (N==1) THENY[0]:=X{0] // Trivial size-1 case: Sample = DC coefficient

ELSEY[0,...,N/2]:=FFT(X,N/2,2*s) // DFT of (X[0],X[2s},X[4s},...)Y[N/2,...,N-1]:=FFT(X[s,...,N-1],N/2,2*s)// DFT of (X[s],X[s+2s],X[s+4s],...)

FOR k:=0 TO N/2 // Combine the two halves into onet:=Y[k]Y[k]:=t+exp(-2*pi*k/N)*Y[k=N/2]Y[k+N/2]:=t-exp(-2*pi*k/N)*Y[k=N/2]

FFT := Y

Page 13: Audio and (Multimedia) Methods for Video Analysisfractor/fall2011/cs298-3.pdf · Audio and (Multimedia) Methods for Video Analysis Dr. Gerald Friedland, fractor@icsi.berkeley.edu

Remember: Convolution Theorem

13

More Information: “Introduction to Multimedia Computing”, Draft Chapter 8

Let f and g be two functions and f * g their convolution. Then:

Page 14: Audio and (Multimedia) Methods for Video Analysisfractor/fall2011/cs298-3.pdf · Audio and (Multimedia) Methods for Video Analysis Dr. Gerald Friedland, fractor@icsi.berkeley.edu

Bandpass Filters

14

Page 15: Audio and (Multimedia) Methods for Video Analysisfractor/fall2011/cs298-3.pdf · Audio and (Multimedia) Methods for Video Analysis Dr. Gerald Friedland, fractor@icsi.berkeley.edu

Dynamic Range Compression

15

Page 16: Audio and (Multimedia) Methods for Video Analysisfractor/fall2011/cs298-3.pdf · Audio and (Multimedia) Methods for Video Analysis Dr. Gerald Friedland, fractor@icsi.berkeley.edu

Dynamic Range Compression

16

Dark blue: Original, Light Blue: Dynamic Range Compressed

Page 17: Audio and (Multimedia) Methods for Video Analysisfractor/fall2011/cs298-3.pdf · Audio and (Multimedia) Methods for Video Analysis Dr. Gerald Friedland, fractor@icsi.berkeley.edu

Filter by Example: Spectral Subtraction

17

Dark blue: Original with humming noise, Light blue: Corrected using spectral subtraction of humming.

Page 18: Audio and (Multimedia) Methods for Video Analysisfractor/fall2011/cs298-3.pdf · Audio and (Multimedia) Methods for Video Analysis Dr. Gerald Friedland, fractor@icsi.berkeley.edu

Filter by Example: Wiener Filter

18

Assume:

where:•s(t) is original signal•n(t) is noise footprint•s(t) is estimated signal•g(t) is Wiener Filter

Page 19: Audio and (Multimedia) Methods for Video Analysisfractor/fall2011/cs298-3.pdf · Audio and (Multimedia) Methods for Video Analysis Dr. Gerald Friedland, fractor@icsi.berkeley.edu

Filter by Example: Wiener Filter

19

Define Error:

with being the delay. Thus:

(squared error)

α

Page 20: Audio and (Multimedia) Methods for Video Analysisfractor/fall2011/cs298-3.pdf · Audio and (Multimedia) Methods for Video Analysis Dr. Gerald Friedland, fractor@icsi.berkeley.edu

Filter by Example: Wiener Filter

20

Define Error:

with being the delay. Thus:

(squared error)

α

Page 21: Audio and (Multimedia) Methods for Video Analysisfractor/fall2011/cs298-3.pdf · Audio and (Multimedia) Methods for Video Analysis Dr. Gerald Friedland, fractor@icsi.berkeley.edu

Filter by Example: Wiener Filter

21

Results in:

where: •X(s(t)) is the power spectrum of the signal,•s(t) and N(s(t)) is the power spectrum of the noise, and •G(s(t)) is the frequency-space Wiener filter.

Problem: Do we know N(s(t))?

Page 22: Audio and (Multimedia) Methods for Video Analysisfractor/fall2011/cs298-3.pdf · Audio and (Multimedia) Methods for Video Analysis Dr. Gerald Friedland, fractor@icsi.berkeley.edu

Multichannel Audio

22

•Microphone Array Recordings popular with Speech Community•Stereo popular since the 60s.•Most cameras have stereo microphones•Most DVDs are encoded in Dolby Digital 5.1 and in Dolby Surround on Stereo Track

Page 23: Audio and (Multimedia) Methods for Video Analysisfractor/fall2011/cs298-3.pdf · Audio and (Multimedia) Methods for Video Analysis Dr. Gerald Friedland, fractor@icsi.berkeley.edu

Dolby Digital

23

Source: http://www.aspecttv.co.uk/surround-sound-setup/

Page 24: Audio and (Multimedia) Methods for Video Analysisfractor/fall2011/cs298-3.pdf · Audio and (Multimedia) Methods for Video Analysis Dr. Gerald Friedland, fractor@icsi.berkeley.edu

Dolby Surround

24

Page 25: Audio and (Multimedia) Methods for Video Analysisfractor/fall2011/cs298-3.pdf · Audio and (Multimedia) Methods for Video Analysis Dr. Gerald Friedland, fractor@icsi.berkeley.edu

How to Cope with Multiple Channels?

25

•Treat them individually (different channel=different content)•Combine them, ideally while reducing noise•Cross-Infer between channels•Any combination of the above

Page 26: Audio and (Multimedia) Methods for Video Analysisfractor/fall2011/cs298-3.pdf · Audio and (Multimedia) Methods for Video Analysis Dr. Gerald Friedland, fractor@icsi.berkeley.edu

How to Cope with Multiple Channels?

26

•Treat them individually (different channel=different content)•Combine them, ideally while reducing noise•Cross-Infer between channels•Any combination of the above

Page 27: Audio and (Multimedia) Methods for Video Analysisfractor/fall2011/cs298-3.pdf · Audio and (Multimedia) Methods for Video Analysis Dr. Gerald Friedland, fractor@icsi.berkeley.edu

Combining Channels

27

•Beamforming, two major techniques:- Delay-Sum- Frequency-Domain Beamformer

Page 28: Audio and (Multimedia) Methods for Video Analysisfractor/fall2011/cs298-3.pdf · Audio and (Multimedia) Methods for Video Analysis Dr. Gerald Friedland, fractor@icsi.berkeley.edu

Delay-Sum Beamforming

28

More Information: http://www.xavieranguera.com/phdthesis/

Page 29: Audio and (Multimedia) Methods for Video Analysisfractor/fall2011/cs298-3.pdf · Audio and (Multimedia) Methods for Video Analysis Dr. Gerald Friedland, fractor@icsi.berkeley.edu

Delay-Sum Beamforming

29

Output:- Sum of all channels (lower noise)- Delay between microphones (directional signal)

Page 30: Audio and (Multimedia) Methods for Video Analysisfractor/fall2011/cs298-3.pdf · Audio and (Multimedia) Methods for Video Analysis Dr. Gerald Friedland, fractor@icsi.berkeley.edu

More on Features

30

• Mel-Frequency-Scaled Coefficients (MFCC)

Other (not explained here):• LPC (Linear Prediction Coefficients)• PLP (Perceptual Linear Predictive) Features• RASTA (see Morgan et al)• MSG (Modulation Spectrogram)

Page 31: Audio and (Multimedia) Methods for Video Analysisfractor/fall2011/cs298-3.pdf · Audio and (Multimedia) Methods for Video Analysis Dr. Gerald Friedland, fractor@icsi.berkeley.edu

MFCC: Idea

31

power cepstrum of signal

Pre-emphasis

Windowing

FFT

Mel-Scale

Filterbank

Log-Scale

DCT

Audio Signal

MFCC

Page 32: Audio and (Multimedia) Methods for Video Analysisfractor/fall2011/cs298-3.pdf · Audio and (Multimedia) Methods for Video Analysis Dr. Gerald Friedland, fractor@icsi.berkeley.edu

MFCC: Mel Scale

32

Page 33: Audio and (Multimedia) Methods for Video Analysisfractor/fall2011/cs298-3.pdf · Audio and (Multimedia) Methods for Video Analysis Dr. Gerald Friedland, fractor@icsi.berkeley.edu

MFCC: Result

33

Page 34: Audio and (Multimedia) Methods for Video Analysisfractor/fall2011/cs298-3.pdf · Audio and (Multimedia) Methods for Video Analysis Dr. Gerald Friedland, fractor@icsi.berkeley.edu

MFCC Variants and Derivates

34

Derivates: •LFCC (no Mel scale)•AMFCC (anti Mel scale)

Parameters: •MFCC12 (often used for ASR)•MFCC19 (often used in speaker id, diarization)•“delta”: coefficients subtracted (“first derivative”)•“deltadelta”: “second derivative”•Short term: Usually calculated on 10-50ms window

Page 35: Audio and (Multimedia) Methods for Video Analysisfractor/fall2011/cs298-3.pdf · Audio and (Multimedia) Methods for Video Analysis Dr. Gerald Friedland, fractor@icsi.berkeley.edu

Typical Machine Learning for Audio Analysis

35

•Gaussian Mixture Models (today)

•HMMs•Bayesian Information Criterion•Supervector Approaches

Page 36: Audio and (Multimedia) Methods for Video Analysisfractor/fall2011/cs298-3.pdf · Audio and (Multimedia) Methods for Video Analysis Dr. Gerald Friedland, fractor@icsi.berkeley.edu

Reminder: K-Means

36

Choose k initial means µi at randomloop for all samples xj: assign membership of each element to a mean (closest mean) for all means µi calculate a new µi by averaging all values xj that were assigned membersuntil means µi are not updated significantly anymore

Algorithm Outline (Expectation Maximization)

Page 37: Audio and (Multimedia) Methods for Video Analysisfractor/fall2011/cs298-3.pdf · Audio and (Multimedia) Methods for Video Analysis Dr. Gerald Friedland, fractor@icsi.berkeley.edu

Reminder: Gaussian Mixtures

37

Page 38: Audio and (Multimedia) Methods for Video Analysisfractor/fall2011/cs298-3.pdf · Audio and (Multimedia) Methods for Video Analysis Dr. Gerald Friedland, fractor@icsi.berkeley.edu

Reminder: Training of Mixture Models

38

Goal: Find ai for

Expectation:

Maximization:

Page 39: Audio and (Multimedia) Methods for Video Analysisfractor/fall2011/cs298-3.pdf · Audio and (Multimedia) Methods for Video Analysis Dr. Gerald Friedland, fractor@icsi.berkeley.edu

Magic Duo

39

~ 90% of audio papers use the combination of MFCCs and Gaussian Mixture Models to model audio signals!

Page 40: Audio and (Multimedia) Methods for Video Analysisfractor/fall2011/cs298-3.pdf · Audio and (Multimedia) Methods for Video Analysis Dr. Gerald Friedland, fractor@icsi.berkeley.edu

More Tools: HTK

40

Open Source Speech Recognition Toolkit: http://htk.eng.cam.ac.uk/

Page 41: Audio and (Multimedia) Methods for Video Analysisfractor/fall2011/cs298-3.pdf · Audio and (Multimedia) Methods for Video Analysis Dr. Gerald Friedland, fractor@icsi.berkeley.edu

More Tools: A nice collection of them

41

Princeton SoundLab:http://soundlab.cs.princeton.edu/software/

Page 42: Audio and (Multimedia) Methods for Video Analysisfractor/fall2011/cs298-3.pdf · Audio and (Multimedia) Methods for Video Analysis Dr. Gerald Friedland, fractor@icsi.berkeley.edu

More Tools: BeamformIT

42

Open Source Beamformer: http://beamformit.sourceforge.net/

Page 43: Audio and (Multimedia) Methods for Video Analysisfractor/fall2011/cs298-3.pdf · Audio and (Multimedia) Methods for Video Analysis Dr. Gerald Friedland, fractor@icsi.berkeley.edu

Next Week

43

•High-Level Design of Typical Audio Analytics

•Measuring Error•Multimedia

Page 44: Audio and (Multimedia) Methods for Video Analysisfractor/fall2011/cs298-3.pdf · Audio and (Multimedia) Methods for Video Analysis Dr. Gerald Friedland, fractor@icsi.berkeley.edu

Organizing the Presentations

44

1) Shaunack Chatterjee2) David Sun3) David Sheffield4) Armin Samii5) Jonathan Kolker6) Venkye Enkambaram7) Tim Althoff8) Giulia Fanti9) Shuo-Yiin Chang10) Ekatarina Gonina11) Jaeyoung Choi