SOUND SOURCE RECOGNITION AND MODELING

CASA seminar, summer 2000
Antti Eronen
Audio Research Group, TUT

Contents:
• Basics of human sound source recognition
• Timbre
• Voice recognition
• Recognition of environmental sounds and events
• Musical instrument recognition
Human sound source recognition abilities

• Different acoustic properties of sound producing objects enable us to recognize sound sources by listening
• These properties are the result of the production process
• The produced sound waves are different at each sound producing event
• Acoustic properties change over time
• The acoustic world is linear: sound waves from different sound sources combine together and result in larger mixtures
• Combination and interaction of the properties of single objects in the mix generate new, emergent properties belonging to the larger sound producing system
Timbre (Finnish "äänen väri", "the colour of sound")

• The perceptual qualities of objects and events; that is, "what it sounds like"
• ANSI 1973: "The quality of a sound by which a listener can tell that two sounds of the same loudness and pitch are dissimilar"
• There are many stable and time-varying acoustic properties affecting timbre
• It is unlikely that any one property or combination of properties uniquely determines timbre
• The sense of timbre comes from the emergent, interactive properties of the vibration pattern
• The identification is the result of
  • the apprehension of acoustical invariants (the bowing of a violin sounds like this)
  • inferences made according to learned experience (we learn what the violin sounds like in different acoustic environments)
Source-filter model of sound production

• The source is excited by energy to generate a vibration pattern
• The filter acts as a resonator, having different vibration modes
• Each mode can be characterized by its resonant frequency and by its damping or quality factor Q
• When the excitation is imposed on the filter, it modifies the relative amplitudes of the components of the source input
• This results in peaks in the frequency spectrum of the signal at the resonant frequencies
• Damping of the vibration modes is a measure of the sharpness of tuning and temporal response
• A lightly damped mode (high Q) results in a sharp peak in the spectrum and a longer time delay into the signal (and vice versa)
• We can hear both the change in the sound spectrum and the time differences (if they are more than a few milliseconds)
• The final sound is the result of effects resulting from the excitation, resonators and radiation characteristics
• In sound producing mechanisms that can be modeled as linear systems, the transfer function of the resulting signal is the product of the transfer functions of the partial systems (if they are in cascade); mathematically,

  Y(z) = X(z) \prod_{i=1}^{N} H_i(z),   (1)

  where Y(z) and X(z) are the z-transforms of the output and excitation signal, respectively, and H_i(z) are the z-transforms of the N subsystems (for instance, the vocal tract and the reflections at the lips)
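As a quick illustration of Eq. (1), the following Python sketch cascades two made-up linear subsystems and checks that the overall transfer function equals the product of the individual ones; the filter coefficients are arbitrary examples, not values from the seminar material.

```python
# Sketch of Eq. (1): cascading two linear subsystems multiplies their
# transfer functions. All coefficients below are arbitrary illustrative values.
import numpy as np
from scipy import signal

# Subsystems H1(z) and H2(z) as (numerator b, denominator a) pairs
b1, a1 = [1.0, 0.5], [1.0, -0.9]
b2, a2 = [1.0], [1.0, -1.2, 0.81]

# Cascaded system: numerator and denominator polynomials multiply
b, a = np.convolve(b1, b2), np.convolve(a1, a2)

w = np.linspace(0, np.pi, 512, endpoint=False)
_, H1 = signal.freqz(b1, a1, worN=w)
_, H2 = signal.freqz(b2, a2, worN=w)
_, H = signal.freqz(b, a, worN=w)
assert np.allclose(H, H1 * H2)            # H(z) = H1(z) * H2(z)

# Filtering an excitation through the cascade gives the same output
x = np.random.randn(1024)
y_cascade = signal.lfilter(b2, a2, signal.lfilter(b1, a1, x))
y_combined = signal.lfilter(b, a, x)
assert np.allclose(y_cascade, y_combined)
```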
Machine sound source recognition

A good sound source recognition system should:

• Exhibit generalization. Different instances of the same kind of sound should be recognized as similar (for instance, musical instruments played in different environments or by different players)
• Handle real-world complexity. Should be able to work with realistic recording conditions, with noise, reverberation and even competing sound sources
• Be scalable. Ability to learn to recognize additional sounds, and how this affects performance
• Exhibit graceful degradation. The system's performance should worsen gradually as noise, the degree of reverberation and the number of competing sound sources increase
• Employ a flexible learning strategy. It should be able to introduce new categories as necessary and refine its classification criteria
• Simplicity, computational efficiency. The simpler of two systems performing equally well is better (memory and processing requirements, how easy it is to understand how the system works)
A typical sound source recognition system

• Preprocessing (filtering, noise removal)
• Feature extraction
• Training and learning (supervised or unsupervised)
• Classification (pattern recognition, neural networks, stochastic models)
• Is able to work with a limited number of sound classes and test data
Features for sound source recognition

Frequency spectrum

• Spectral centroid measures the spectral energy distribution and corresponds to the perceived "brightness":

  f_c = \frac{\sum_{k=1}^{N} A(k) f(k)}{\sum_{k=1}^{N} A(k)},   (2)

  where k is the spectral component, and f(k) and A(k) its frequency and amplitude, respectively. Also a normalized version can be used, f_c^{norm} = f_c / f_0, where f_0 is the fundamental frequency of a harmonic sound.
• f_c is the same as the first moment; also higher-order moments have been used as features
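A minimal Python sketch of Eq. (2); the signal, sampling rate and test tone used below are illustrative placeholders.

```python
# Sketch of the spectral centroid in Eq. (2) for one frame of a signal x
# sampled at rate sr; both the frame and the rate are placeholders.
import numpy as np

def spectral_centroid(x, sr):
    X = np.fft.rfft(x * np.hanning(len(x)))    # windowed spectrum
    A = np.abs(X)                              # amplitudes A(k)
    f = np.fft.rfftfreq(len(x), d=1.0 / sr)    # frequencies f(k) in Hz
    return np.sum(A * f) / np.sum(A)           # first moment of the spectrum

# Example: a 440 Hz tone with a weaker second harmonic
sr = 44100
t = np.arange(2048) / sr
x = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)
print(spectral_centroid(x, sr))                # lies between 440 and 880 Hz
```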
• The power spectrum across a set of critical bands or successive frequency regions
• The power spectrum of a signal x(n) is the Fourier transform of the autocorrelation sequence r(n):

  P(\omega) = \sum_{n=-\infty}^{\infty} r(n) e^{-j\omega n}.   (3)

• This can be calculated as the magnitude squared Fourier transform of the signal x(n):

  P(\omega) = |X(\omega)|^2,   (4)

  where

  X(\omega) = \sum_{n=-\infty}^{\infty} x(n) e^{-j\omega n}.   (5)
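The relation between Eqs. (3)-(5) can be checked numerically in the discrete (circular) case; the sketch below uses an arbitrary random signal.

```python
# Sketch relating Eqs. (3)-(5): the power spectrum as the squared-magnitude
# DFT of x(n), compared with the DFT of the circular autocorrelation.
import numpy as np

x = np.random.randn(256)
X = np.fft.fft(x)
P = np.abs(X) ** 2                    # Eq. (4): P(w) = |X(w)|^2

# Circular autocorrelation via the Wiener-Khinchin relation
r = np.fft.ifft(P).real
P_from_r = np.fft.fft(r)              # Eq. (3): Fourier transform of r(n)
assert np.allclose(P, P_from_r.real)
```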
• Spectral irregularity
  • Corresponds to the standard deviation of the time-averaged harmonic amplitudes from a spectral envelope:

  IRR = 20 \log \sum_{k=2}^{n-1} \left| A_k - \frac{A_{k-1} + A_k + A_{k+1}}{3} \right|   (6)

• Even and odd harmonic content in the signal spectrum
  • Even harmonic content:

  h_{even} = \frac{A_2^2 + A_4^2 + A_6^2 + \dots}{A_1^2 + A_2^2 + A_3^2 + \dots} = \frac{\sum_{k=1}^{M} A_{2k}^2}{\sum_{n=1}^{N} A_n^2}, \quad M = \frac{N}{2}   (7)
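A small sketch of Eqs. (6)-(7) computed from a made-up set of harmonic amplitudes; the absolute differences in Eq. (6) keep the logarithm's argument positive.

```python
# Sketch of Eqs. (6)-(7) from an array of time-averaged harmonic amplitudes,
# where A[0] holds A_1; the amplitude values are made up for illustration.
import numpy as np

A = np.array([1.0, 0.6, 0.45, 0.3, 0.25, 0.15, 0.1, 0.05])   # A_1 ... A_N

# Eq. (6): deviation of each amplitude from the mean of itself and its neighbours
smooth = (A[:-2] + A[1:-1] + A[2:]) / 3.0
irregularity = 20 * np.log10(np.sum(np.abs(A[1:-1] - smooth)))

# Eq. (7): energy of the even-numbered harmonics over the total harmonic energy
power = A ** 2
h_even = np.sum(power[1::2]) / np.sum(power)   # A_2, A_4, ... live at A[1::2]

print(irregularity, h_even)
```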
• Odd harmonic content:

  h_{odd} = \frac{A_1^2 + A_3^2 + A_5^2 + \dots}{A_1^2 + A_2^2 + A_3^2 + \dots} = \frac{\sum_{k=0}^{L} A_{2k+1}^2}{\sum_{n=1}^{N} A_n^2}, \quad L = \frac{N}{2} - 1   (8)

• Formants
  • Spectral prominences created by one or more resonances in the filter of the sound source
  • A robust feature for measuring formants is the set of cepstral coefficients
  • The cepstrum of a signal x(n) is defined as

  c(n) = \mathcal{F}^{-1}\{\log \mathcal{F}\{x(n)\}\}   (9)
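A corresponding sketch of Eqs. (8)-(9); the harmonic amplitudes and the random frame standing in for x(n) are illustrative, and the log is taken of the magnitude spectrum so that the cepstrum stays real-valued.

```python
# Sketch of Eqs. (8)-(9): odd harmonic content from an array of harmonic
# amplitudes (A[0] holding A_1), and the real cepstrum of a signal frame.
import numpy as np

A = np.array([1.0, 0.6, 0.45, 0.3, 0.25, 0.15, 0.1, 0.05])   # A_1 ... A_N
power = A ** 2
h_odd = np.sum(power[0::2]) / np.sum(power)    # odd-numbered harmonics A_1, A_3, ...

def real_cepstrum(x, eps=1e-12):
    X = np.fft.fft(x)
    log_mag = np.log(np.abs(X) + eps)          # log magnitude spectrum, Eq. (9)
    return np.fft.ifft(log_mag).real

frame = np.random.randn(1024)                  # stand-in for a windowed audio frame
c = real_cepstrum(frame)
print(h_odd, c[:5])                            # low-order cepstral coefficients
```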
• In practice the coefficients may be obtained, for instance, with linear prediction (LP); see the sketch after the figure
• In LP, the filter of the sound source is approximated with an all-pole filter
• The coefficients of the all-pole filter can be solved
• These coefficients describe the magnitude spectrum of the sound source filter
• These coefficients are converted into cepstral coefficients, which behave nicely for recognition purposes

Figure. The normalized input u(n) (pulse train or white noise) is scaled by the gain G and filtered with an all-pole filter 1/A(z) to produce s(n).
Figure. Magnitude spectrum of a 40 ms frame of a guitar tone, and an approximating LPC spectrum of order 15 (magnitude [dB] vs. frequency [Hz]).
Figure. Average LPC spectrum of a violin and a trumpet tone, respectively (magnitude [dB] vs. frequency [Hz]).
Onset and offset transients

• Rise time (the duration of attack)
  • The time interval between the onset and the instant of maximal amplitude
  • Usually some kind of energy thresholds are used to locate these points from an overall amplitude envelope (see the sketch below)
• Onset asynchrony
  • Calculate the individual rise times of different harmonics or different frequency ranges
  • Onset harmonic skew: a linear fit to the onset times of the harmonic partials as a function of frequency
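A possible rise-time sketch along the lines described above; the 10%/90% thresholds and the synthetic test tone are assumptions for illustration, not values from the seminar.

```python
# Sketch of a rise-time estimate from an amplitude envelope, using simple
# energy thresholds to locate the onset and the point of (near) maximal amplitude.
import numpy as np
from scipy.signal import hilbert

def rise_time(x, sr, low=0.1, high=0.9):
    env = np.abs(hilbert(x))                 # amplitude envelope
    peak = env.max()
    onset = np.argmax(env >= low * peak)     # first sample above the low threshold
    top = np.argmax(env >= high * peak)      # first sample near maximal amplitude
    return (top - onset) / sr                # attack duration in seconds

# Example: a tone with a 50 ms linear attack and a slow decay
sr = 22050
t = np.arange(int(0.5 * sr)) / sr
shape = np.minimum(t / 0.05, 1.0) * np.exp(-2.0 * t)
x = shape * np.sin(2 * np.pi * 440 * t)
print(rise_time(x, sr))                      # roughly 0.04 s (10%-90% of the attack)
```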
Modulations

• Frequency modulation
  • Vibrato (periodic), jitter (random)
  • Difficult to measure reliably
  • Presence/absence/degree of periodic and random modulations
• Amplitude modulation
  • Tremolo
  • Presence/absence/degree of periodic and random modulations
• These features can be extracted from an amplitude envelope (see the sketch below)
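One way to sketch the measurement of periodic amplitude modulation from an envelope is to take the strongest envelope component within a plausible tremolo-rate range; the 2-10 Hz band, the synthetic signal and its 5 Hz modulation below are assumptions for illustration.

```python
# Sketch: estimate a tremolo rate from the amplitude envelope of a
# synthetic amplitude-modulated tone (5 Hz modulation, depth 0.3).
import numpy as np
from scipy.signal import hilbert

sr = 22050
t = np.arange(2 * sr) / sr
x = (1.0 + 0.3 * np.sin(2 * np.pi * 5.0 * t)) * np.sin(2 * np.pi * 440 * t)

env = np.abs(hilbert(x))
env = env - env.mean()                          # remove the DC component
E = np.abs(np.fft.rfft(env * np.hanning(len(env))))
f = np.fft.rfftfreq(len(env), d=1.0 / sr)

band = (f >= 2.0) & (f <= 10.0)                 # plausible tremolo rates
rate = f[band][np.argmax(E[band])]
print(rate)                                     # close to 5 Hz
```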
Figure. Intensity [dB] as a function of time [s] and Bark frequency for flute, violin, trumpet and clarinet tones.
Classification

• Pattern recognition
  • The data is presented as N-dimensional feature vectors, which are assigned to different classes or clusters
  • Supervised classification: the input pattern is identified as a member of a predefined class
  • Unsupervised classification: the pattern is assigned to a hitherto unknown class (e.g. clustering); a small sketch of both cases follows the figure below

Figure. Two-dimensional feature space (Feature 1 vs. Feature 2) with samples from two classes (class 1, class 2).
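A small sketch contrasting the two cases; scikit-learn and the synthetic two-class data are assumptions added for illustration.

```python
# Sketch of supervised vs. unsupervised classification on two-dimensional
# feature vectors; the data and class locations are synthetic.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
class1 = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
class2 = rng.normal(loc=[2.0, 1.5], scale=0.5, size=(50, 2))
X = np.vstack([class1, class2])
y = np.array([0] * 50 + [1] * 50)

# Supervised: the input pattern is identified as a member of a predefined class
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[1.8, 1.4]]))                # -> label of the second class

# Unsupervised: patterns are grouped into clusters without labels
kmeans = KMeans(n_clusters=2, n_init=10).fit(X)
print(kmeans.labels_[:5], kmeans.labels_[-5:])  # two discovered clusters
```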
Speaker recognition

• The most studied sound source recognition problem
• Recognition and verification

Three major approaches

• Long-term averages of acoustic features
  • Average out phonetic variations affecting the features, leaving only the speaker-dependent component
  • The earliest approach; has been used successfully in demanding applications
  • Discards much speaker-dependent information
  • Can require long speech utterances to derive stable long-term statistics
• Model the speaker-dependent features within phonetic sounds
  • Compare within similar phonetic sounds in the train and test utterances
  • Explicit segmentation: a Hidden Markov Model (HMM)-based continuous speech recognizer as a front end -> little or no improvement in performance
  • Implicit segmentation: unsupervised clustering of acoustic features during training and recognition (Gaussian mixture models, GMM)
• Discriminative neural networks (NN)
  • NNs are trained to model the decision function which best discriminates speakers within a known set
• Problems
  • Fundamental frequency information not used
  • Speech rhythm not used
  • Lack of generality: do not work well when acoustic conditions vary from those used in training
  • Cannot deal with mixtures of sounds
  • Performance suffers as population size grows
Case Reynolds 1995

• 20 mel-frequency cepstral coefficients from 20 ms frames
• Given a recorded utterance, a probabilistic model is formed based on Gaussian distributions (see the sketch below)
• Motivations for using Gaussian Mixture Models (GMM) in speaker recognition:
  • The individual component Gaussians in a speaker-dependent GMM are interpreted to represent some broad acoustic classes
  • A Gaussian mixture density is able to model smoothly the long-term sample distribution
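A rough sketch in the spirit of this approach: per-speaker GMMs are fit on MFCC frames and a test utterance is scored against each model. librosa and scikit-learn are assumed to be available, and the file names, mixture size and covariance type are placeholders rather than details from Reynolds (1995).

```python
# Sketch of a GMM speaker model: MFCC frames from training speech fit a
# per-speaker Gaussian mixture; a test utterance is scored against each model.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_frames(path, n_mfcc=20):
    y, sr = librosa.load(path, sr=None)
    # One feature vector per analysis frame, shape (n_frames, n_mfcc)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

# Train one GMM per speaker on that speaker's training speech (placeholder files)
speakers = {"alice": "alice_train.wav", "bob": "bob_train.wav"}
models = {}
for name, path in speakers.items():
    gmm = GaussianMixture(n_components=16, covariance_type="diag", max_iter=200)
    models[name] = gmm.fit(mfcc_frames(path))

# Score a test utterance against every model; the highest average
# log-likelihood decides the speaker
test = mfcc_frames("unknown.wav")
scores = {name: gmm.score(test) for name, gmm in models.items()}
print(max(scores, key=scores.get))
```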
• The performance of the system depends on
  • The noise characteristics of the signal
  • The population size
• Nearly perfect performance with pristine recordings (630 talkers)
• Under varying acoustic conditions (e.g. using different telephone handsets during testing and training)
  • 94% with a population of 10 talkers
  • 83% with a population of 113 talkers
Automatic noise recognition

Case Gaunard 1998

• Car, truck, moped, aircraft, train
• 12 cepstral coefficients from 50-100 ms frames
• 1-5 state HMM
• Recognition performance
  • 90-95% with cepstral coefficients as features
  • 80% with a 1/3-octave filter bank as front end
Case El-Maleh 1999

• Frame-level noise classification for mobile environments
• Car, voice babble, street, bus and factory
• Line spectral frequencies (LSFs) based on order-10 LPC analysis as features
• 89% average performance
• Shows some ability to generalize and robustness
  • New noises were classified as the most similar training noises (restaurant babble, music -> babble or bus noise)
  • Human speech-like noise (superimposed independent speech signals) was classified as speech with a low number of superimpositions
  • As the number of superimposed signals increased, it was classified more as babble than as speech
Musical instrument recognition

• Difficulties
  • Wide pitch ranges
  • Variety of playing techniques
  • Properties of sounds may change completely with different techniques and different notes
  • Interfering sounds in polyphony
  • Different recording conditions
  • Differences between instrument pieces (Stradivarius vs. cheap violin)
• Psychological research as a starting point for finding features
  • Lots of work has been done in order to resolve what makes musical instrument sounds distinguishable (timbre)
• This knowledge has been used in musical instrument recognition systems
• Also lots of work with human voices
• Much less knowledge of environmental sounds
• The state of the art is still quite low
  • Good results with isolated tones, but with only one example of a particular instrument
  • Good results with monophonic phrases, but with only four instruments
  • Not so good results with monophonic phrases with several instruments
  • Some first attempts towards polyphonic recognition