SOUND SOURCE RECOGNITION AND MODELING

CASA seminar, summer 2000
Antti Eronen
Audio Research Group, TUT

Contents:
• Basics of human sound source recognition
• Timbre
• Voice recognition
• Recognition of environmental sounds and events
• Musical instrument recognition
Human sound source recognition abilities

• Different acoustic properties of sound producing objects enable us to recognize sound sources by listening
• These properties are the result of the production process
• The produced sound waves are different at each sound producing event
• Acoustic properties change over time
• The acoustic world is linear: sound waves from different sound sources combine together and result in larger mixtures
• Combination and interaction of the properties of single objects in the mix generate new, emergent properties belonging to the larger sound producing system
Timbre (Finnish "äänen väri", "the colour of sound")

• The perceptual qualities of objects and events; that is, "what it sounds like"
• ANSI 1973: "The quality of a sound by which a listener can tell that two sounds of the same loudness and pitch are dissimilar"
• There are many stable and time-varying acoustic properties affecting timbre
• It is unlikely that any one property or combination of properties uniquely determines timbre
• The sense of timbre comes from the emergent, interactive properties of the vibration pattern
• The identification is the result of
  • the apprehension of acoustical invariants (the bowing of a violin sounds like this)
  • inferences made according to learned experience (we learn what the violin sounds like in different acoustic environments)
Source-filter model of sound production

• The source is excited by energy to generate a vibration pattern
• The filter acts as a resonator, having different vibration modes
• Each mode can be characterized by its resonant frequency and by its damping or quality factor Q
• When the excitation is imposed on the filter, it modifies the relative amplitudes of the components of the source input
• This results in peaks in the frequency spectrum of the signal at the resonant frequencies
• Damping of the vibration modes is a measure of the sharpness of tuning and temporal response
• A lightly damped mode (high Q) results in a sharp peak in the spectrum and a longer time delay into the signal (and vice versa)
• We can hear both the change in the sound spectrum and the time differences (if they are more than a few milliseconds)
• The final sound is the result of effects resulting from the excitation, resonators and radiation characteristics
• In sound producing mechanisms that can be modeled as linear systems, the transfer function of the resulting signal is the product of the transfer functions of the partial systems (if they are in cascade); mathematically,

  Y(z) = X(z) \prod_{i=1}^{N} H_i(z),   (1)

  where Y(z) and X(z) are the z-transforms of the output and excitation signal, respectively, and H_i(z) are the z-transforms of the N subsystems (for instance, the vocal tract and the reflections at the lips)
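As a quick illustration of Eq. (1), the following Python sketch cascades two made-up linear subsystems and checks that the overall transfer function equals the product of the individual ones; the filter coefficients are arbitrary examples, not values from the seminar material.

```python
# Sketch of Eq. (1): cascading two linear subsystems multiplies their
# transfer functions. All coefficients below are arbitrary illustrative values.
import numpy as np
from scipy import signal

# Subsystems H1(z) and H2(z) as (numerator b, denominator a) pairs
b1, a1 = [1.0, 0.5], [1.0, -0.9]
b2, a2 = [1.0], [1.0, -1.2, 0.81]

# Cascaded system: numerator and denominator polynomials multiply
b, a = np.convolve(b1, b2), np.convolve(a1, a2)

w = np.linspace(0, np.pi, 512, endpoint=False)
_, H1 = signal.freqz(b1, a1, worN=w)
_, H2 = signal.freqz(b2, a2, worN=w)
_, H = signal.freqz(b, a, worN=w)
assert np.allclose(H, H1 * H2)            # H(z) = H1(z) * H2(z)

# Filtering an excitation through the cascade gives the same output
x = np.random.randn(1024)
y_cascade = signal.lfilter(b2, a2, signal.lfilter(b1, a1, x))
y_combined = signal.lfilter(b, a, x)
assert np.allclose(y_cascade, y_combined)
```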
Machine sound source recognition

A good sound source recognition system should:

• Exhibit generalization. Different instances of the same kind of sound should be recognized as similar (for instance, musical instruments played in different environments or by different players)
• Handle real-world complexity. Should be able to work with realistic recording conditions, with noise, reverberation and even competing sound sources
• Be scalable. Ability to learn to recognize additional sounds, and how this affects performance
• Exhibit graceful degradation. The system's performance should worsen gradually as noise, the degree of reverberation and the number of competing sound sources increase
• Employ a flexible learning strategy. It should be able to introduce new categories as necessary and refine its classification criteria
• Simplicity, computational efficiency. The simpler of two systems performing equally well is better (memory and processing requirements, how easy it is to understand how the system works)
A typical sound source recognition system

• Preprocessing (filtering, noise removal)
• Feature extraction
• Training and learning (supervised or unsupervised)
• Classification (pattern recognition, neural networks, stochastic models)
• Is able to work with a limited number of sound classes and test data
Features for sound source recognition

Frequency spectrum

• Spectral centroid measures the spectral energy distribution and corresponds to the perceived "brightness":

  f_c = \frac{\sum_{k=1}^{N} A(k) f(k)}{\sum_{k=1}^{N} A(k)},   (2)

  where k is the spectral component, and f(k) and A(k) its frequency and amplitude, respectively. Also a normalized version can be used, f_c^{norm} = f_c / f_0, where f_0 is the fundamental frequency of a harmonic sound.
• f_c is the same as the first moment; also higher-order moments have been used as features
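A minimal Python sketch of Eq. (2); the signal, sampling rate and test tone used below are illustrative placeholders.

```python
# Sketch of the spectral centroid in Eq. (2) for one frame of a signal x
# sampled at rate sr; both the frame and the rate are placeholders.
import numpy as np

def spectral_centroid(x, sr):
    X = np.fft.rfft(x * np.hanning(len(x)))    # windowed spectrum
    A = np.abs(X)                              # amplitudes A(k)
    f = np.fft.rfftfreq(len(x), d=1.0 / sr)    # frequencies f(k) in Hz
    return np.sum(A * f) / np.sum(A)           # first moment of the spectrum

# Example: a 440 Hz tone with a weaker second harmonic
sr = 44100
t = np.arange(2048) / sr
x = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)
print(spectral_centroid(x, sr))                # lies between 440 and 880 Hz
```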
• The power spectrum across a set of critical bands or successive frequency regions
• The power spectrum of a signal x(n) is the Fourier transform of the autocorrelation sequence r(n):

  P(\omega) = \sum_{n=-\infty}^{\infty} r(n) e^{-j\omega n}.   (3)

• This can be calculated as the magnitude squared Fourier transform of the signal x(n):

  P(\omega) = |X(\omega)|^2,   (4)

  where

  X(\omega) = \sum_{n=-\infty}^{\infty} x(n) e^{-j\omega n}.   (5)
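The relation between Eqs. (3)-(5) can be checked numerically in the discrete (circular) case; the sketch below uses an arbitrary random signal.

```python
# Sketch relating Eqs. (3)-(5): the power spectrum as the squared-magnitude
# DFT of x(n), compared with the DFT of the circular autocorrelation.
import numpy as np

x = np.random.randn(256)
X = np.fft.fft(x)
P = np.abs(X) ** 2                    # Eq. (4): P(w) = |X(w)|^2

# Circular autocorrelation via the Wiener-Khinchin relation
r = np.fft.ifft(P).real
P_from_r = np.fft.fft(r)              # Eq. (3): Fourier transform of r(n)
assert np.allclose(P, P_from_r.real)
```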
• Spectral irregularity
  • Corresponds to the standard deviation of the time-averaged harmonic amplitudes from a spectral envelope:

  IRR = 20 \log \sum_{k=2}^{n-1} \left| A_k - \frac{A_{k-1} + A_k + A_{k+1}}{3} \right|   (6)

• Even and odd harmonic content in the signal spectrum
  • Even harmonic content:

  h_{even} = \frac{A_2^2 + A_4^2 + A_6^2 + \dots}{A_1^2 + A_2^2 + A_3^2 + \dots} = \frac{\sum_{k=1}^{M} A_{2k}^2}{\sum_{n=1}^{N} A_n^2}, \quad M = \frac{N}{2}   (7)
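A small sketch of Eqs. (6)-(7) computed from a made-up set of harmonic amplitudes; the absolute differences in Eq. (6) keep the logarithm's argument positive.

```python
# Sketch of Eqs. (6)-(7) from an array of time-averaged harmonic amplitudes,
# where A[0] holds A_1; the amplitude values are made up for illustration.
import numpy as np

A = np.array([1.0, 0.6, 0.45, 0.3, 0.25, 0.15, 0.1, 0.05])   # A_1 ... A_N

# Eq. (6): deviation of each amplitude from the mean of itself and its neighbours
smooth = (A[:-2] + A[1:-1] + A[2:]) / 3.0
irregularity = 20 * np.log10(np.sum(np.abs(A[1:-1] - smooth)))

# Eq. (7): energy of the even-numbered harmonics over the total harmonic energy
power = A ** 2
h_even = np.sum(power[1::2]) / np.sum(power)   # A_2, A_4, ... live at A[1::2]

print(irregularity, h_even)
```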
• Odd harmonic content:

  h_{odd} = \frac{A_1^2 + A_3^2 + A_5^2 + \dots}{A_1^2 + A_2^2 + A_3^2 + \dots} = \frac{\sum_{k=0}^{L} A_{2k+1}^2}{\sum_{n=1}^{N} A_n^2}, \quad L = \frac{N}{2} - 1   (8)

• Formants
  • Spectral prominences created by one or more resonances in the filter of the sound source
  • A robust feature for measuring formants is the set of cepstral coefficients
  • The cepstrum of a signal x(n) is defined as

  c(n) = \mathcal{F}^{-1}\{\log \mathcal{F}\{x(n)\}\}   (9)
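A corresponding sketch of Eqs. (8)-(9); the harmonic amplitudes and the random frame standing in for x(n) are illustrative, and the log is taken of the magnitude spectrum so that the cepstrum stays real-valued.

```python
# Sketch of Eqs. (8)-(9): odd harmonic content from an array of harmonic
# amplitudes (A[0] holding A_1), and the real cepstrum of a signal frame.
import numpy as np

A = np.array([1.0, 0.6, 0.45, 0.3, 0.25, 0.15, 0.1, 0.05])   # A_1 ... A_N
power = A ** 2
h_odd = np.sum(power[0::2]) / np.sum(power)    # odd-numbered harmonics A_1, A_3, ...

def real_cepstrum(x, eps=1e-12):
    X = np.fft.fft(x)
    log_mag = np.log(np.abs(X) + eps)          # log magnitude spectrum, Eq. (9)
    return np.fft.ifft(log_mag).real

frame = np.random.randn(1024)                  # stand-in for a windowed audio frame
c = real_cepstrum(frame)
print(h_odd, c[:5])                            # low-order cepstral coefficients
```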
• In practice the coefficients may be obtained, for instance, with linear prediction (LP); see the sketch after the figure
• In LP, the filter of the sound source is approximated with an all-pole filter
• The coefficients of the all-pole filter can be solved
• These coefficients describe the magnitude spectrum of the sound source filter
• These coefficients are converted into cepstral coefficients, which behave nicely for recognition purposes

Figure. The normalized input u(n) (pulse train or white noise) is scaled by the gain G and filtered with an all-pole filter 1/A(z) to produce s(n).
Figure. Magnitude spectrum of a 40 ms frame of a guitar tone, and an approximating LPC spectrum of order 15 (magnitude [dB] vs. frequency [Hz]).
Figure. Average LPC spectrum of a violin and a trumpet tone, respectively (magnitude [dB] vs. frequency [Hz]).
Onset and offset transients

• Rise time (the duration of attack)
  • The time interval between the onset and the instant of maximal amplitude
  • Usually some kind of energy thresholds are used to locate these points from an overall amplitude envelope (see the sketch below)
• Onset asynchrony
  • Calculate the individual rise times of different harmonics or different frequency ranges
  • Onset harmonic skew: a linear fit to the onset times of the harmonic partials as a function of frequency
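A possible rise-time sketch along the lines described above; the 10%/90% thresholds and the synthetic test tone are assumptions for illustration, not values from the seminar.

```python
# Sketch of a rise-time estimate from an amplitude envelope, using simple
# energy thresholds to locate the onset and the point of (near) maximal amplitude.
import numpy as np
from scipy.signal import hilbert

def rise_time(x, sr, low=0.1, high=0.9):
    env = np.abs(hilbert(x))                 # amplitude envelope
    peak = env.max()
    onset = np.argmax(env >= low * peak)     # first sample above the low threshold
    top = np.argmax(env >= high * peak)      # first sample near maximal amplitude
    return (top - onset) / sr                # attack duration in seconds

# Example: a tone with a 50 ms linear attack and a slow decay
sr = 22050
t = np.arange(int(0.5 * sr)) / sr
shape = np.minimum(t / 0.05, 1.0) * np.exp(-2.0 * t)
x = shape * np.sin(2 * np.pi * 440 * t)
print(rise_time(x, sr))                      # roughly 0.04 s (10%-90% of the attack)
```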
Modulations

• Frequency modulation
  • Vibrato (periodic), jitter (random)
  • Difficult to measure reliably
  • Presence/absence/degree of periodic and random modulations
• Amplitude modulation
  • Tremolo
  • Presence/absence/degree of periodic and random modulations
• These features can be extracted from an amplitude envelope (see the sketch below)
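One way to sketch the measurement of periodic amplitude modulation from an envelope is to take the strongest envelope component within a plausible tremolo-rate range; the 2-10 Hz band, the synthetic signal and its 5 Hz modulation below are assumptions for illustration.

```python
# Sketch: estimate a tremolo rate from the amplitude envelope of a
# synthetic amplitude-modulated tone (5 Hz modulation, depth 0.3).
import numpy as np
from scipy.signal import hilbert

sr = 22050
t = np.arange(2 * sr) / sr
x = (1.0 + 0.3 * np.sin(2 * np.pi * 5.0 * t)) * np.sin(2 * np.pi * 440 * t)

env = np.abs(hilbert(x))
env = env - env.mean()                          # remove the DC component
E = np.abs(np.fft.rfft(env * np.hanning(len(env))))
f = np.fft.rfftfreq(len(env), d=1.0 / sr)

band = (f >= 2.0) & (f <= 10.0)                 # plausible tremolo rates
rate = f[band][np.argmax(E[band])]
print(rate)                                     # close to 5 Hz
```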
Figure. Intensity [dB] as a function of time [s] and Bark frequency for flute, violin, trumpet and clarinet tones.
Classification

• Pattern recognition
  • The data is presented as N-dimensional feature vectors, which are assigned to different classes or clusters
  • Supervised classification: the input pattern is identified as a member of a predefined class
  • Unsupervised classification: the pattern is assigned to a hitherto unknown class (e.g. clustering); a small sketch of both cases follows the figure below

Figure. Two-dimensional feature space (Feature 1 vs. Feature 2) with samples from two classes (class 1, class 2).
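A small sketch contrasting the two cases; scikit-learn and the synthetic two-class data are assumptions added for illustration.

```python
# Sketch of supervised vs. unsupervised classification on two-dimensional
# feature vectors; the data and class locations are synthetic.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
class1 = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
class2 = rng.normal(loc=[2.0, 1.5], scale=0.5, size=(50, 2))
X = np.vstack([class1, class2])
y = np.array([0] * 50 + [1] * 50)

# Supervised: the input pattern is identified as a member of a predefined class
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[1.8, 1.4]]))                # -> label of the second class

# Unsupervised: patterns are grouped into clusters without labels
kmeans = KMeans(n_clusters=2, n_init=10).fit(X)
print(kmeans.labels_[:5], kmeans.labels_[-5:])  # two discovered clusters
```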
Speaker recognition

• The most studied sound source recognition problem
• Recognition and verification

Three major approaches

• Long-term averages of acoustic features
  • Average out phonetic variations affecting the features, leaving only the speaker-dependent component
  • The earliest approach; has been used successfully in demanding applications
  • Discards much speaker-dependent information
  • Can require long speech utterances to derive stable long-term statistics
• Model the speaker-dependent features within phonetic sounds
  • Compare within similar phonetic sounds in the train and test utterances
  • Explicit segmentation: a Hidden Markov Model (HMM)-based continuous speech recognizer as a front end -> little or no improvement in performance
  • Implicit segmentation: unsupervised clustering of acoustic features during training and recognition (Gaussian mixture models, GMM)
• Discriminative neural networks (NN)
  • NNs are trained to model the decision function which best discriminates speakers within a known set
• Problems
  • Fundamental frequency information not used
  • Speech rhythm not used
  • Lack of generality: do not work well when acoustic conditions vary from those used in training
  • Cannot deal with mixtures of sounds
  • Performance suffers as population size grows
Case Reynolds 1995

• 20 mel-frequency cepstral coefficients from 20 ms frames
• Given a recorded utterance, a probabilistic model is formed based on Gaussian distributions (see the sketch below)
• Motivations for using Gaussian Mixture Models (GMM) in speaker recognition:
  • The individual component Gaussians in a speaker-dependent GMM are interpreted to represent some broad acoustic classes
  • A Gaussian mixture density is able to model smoothly the long-term sample distribution
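A rough sketch in the spirit of this approach: per-speaker GMMs are fit on MFCC frames and a test utterance is scored against each model. librosa and scikit-learn are assumed to be available, and the file names, mixture size and covariance type are placeholders rather than details from Reynolds (1995).

```python
# Sketch of a GMM speaker model: MFCC frames from training speech fit a
# per-speaker Gaussian mixture; a test utterance is scored against each model.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_frames(path, n_mfcc=20):
    y, sr = librosa.load(path, sr=None)
    # One feature vector per analysis frame, shape (n_frames, n_mfcc)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

# Train one GMM per speaker on that speaker's training speech (placeholder files)
speakers = {"alice": "alice_train.wav", "bob": "bob_train.wav"}
models = {}
for name, path in speakers.items():
    gmm = GaussianMixture(n_components=16, covariance_type="diag", max_iter=200)
    models[name] = gmm.fit(mfcc_frames(path))

# Score a test utterance against every model; the highest average
# log-likelihood decides the speaker
test = mfcc_frames("unknown.wav")
scores = {name: gmm.score(test) for name, gmm in models.items()}
print(max(scores, key=scores.get))
```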
• The performance of the system depends on
  • The noise characteristics of the signal
  • The population size
• Nearly perfect performance with pristine recordings (630 talkers)
• Under varying acoustic conditions (e.g. using different telephone handsets during testing and training)
  • 94% with a population of 10 talkers
  • 83% with a population of 113 talkers
Automatic noise recognition

Case Gaunard 1998

• Car, truck, moped, aircraft, train
• 12 cepstral coefficients from 50-100 ms frames
• 1-5 state HMM
• Recognition performance
  • 90-95% with cepstral coefficients as features
  • 80% with a 1/3-octave filter bank as front end
Case El-Maleh 1999

• Frame-level noise classification for mobile environments
• Car, voice babble, street, bus and factory
• Line spectral frequencies (LSFs) based on order-10 LPC analysis as features
• 89% average performance
• Shows some ability to generalize and robustness
  • New noises were classified as the most similar training noises (restaurant babble, music -> babble or bus noise)
  • Human speech-like noise (superimposed independent speech signals) was classified as speech with a low number of superimpositions
  • As the number of superimposed signals increased, it was classified more as babble than as speech
Musical instrument recognition

• Difficulties
  • Wide pitch ranges
  • Variety of playing techniques
  • Properties of sounds may change completely with different techniques and different notes
  • Interfering sounds in polyphony
  • Different recording conditions
  • Differences between instrument pieces (Stradivarius vs. cheap violin)
• Psychological research as a starting point for finding features
  • Lots of work has been done in order to resolve what makes musical instrument sounds distinguishable (timbre)
• This knowledge has been used in musical instrument recognition systems
• Also lots of work with human voices
• Much less knowledge of environmental sounds
• The state of the art is still quite low
  • Good results with isolated tones, but with only one example of a particular instrument
  • Good results with monophonic phrases, but with only four instruments
  • Not so good results with monophonic phrases with several instruments
  • Some first attempts towards polyphonic recognition