A Brief Overview of Audio Information Retrieval
Unjung Nam, CCRMA, Stanford University
December 2000 by Unjung Nam 2
Outline
• What is AIR?
• Motivation
• Related Field of Research
• Elements of AIR
• Experiments and discussion
  – Music Classification System
What is AIR?
• Audio Information Retrieval (AIR):
  – Audio information: speech, music, natural sounds, etc.
  – To develop methods for recognizing audio information
[Figure: audio presented to a human and to a computer; candidate labels: Speech? Music? Animal sound? Clock alarm?]
What is AIR? Applications
• Speech-related retrieval
  – Recognizing and transcribing the content of radio programs, telephone conversations, and recorded meetings
• Music-related retrieval
  – Music similarity, music style classification, instrument recognition
• Other audio retrieval applications
  – Alarms, animal sounds, natural sounds, etc.
What is AIR? Muscle Fish Audio Retrieval
http://www.musclefish.com
Motivation
• The amount of multimedia information stored on computer systems keeps increasing because of the Internet.
• Multimedia databases are classified and retrieved manually, a process that is often subjective and inaccurate when describing audio.
• Multimedia databases should instead be handled by “automatic” methods of analysis, segmentation, indexing, and retrieval.
Related Field of Research
• Automatic Speech Recognition
  – Overview of the process
• Audio/Video Classification
  – Video Mail Retrieval
• Computer Vision
  – Image Retrieval, Face Recognition
• Multimedia Database Management
Automatic Speech Recognition: Overview
Audio/Video Classification: Video Mail Retrieval
Computer Vision: Image Retrieval
Computer Vision: Face Recognition
Multimedia Database Management: Characteristics of Multimedia Information Retrieval
• Content-based retrieval
• Automatic indexing
• Similarity matching
  – Similar content may have different representations
  – Data filtering rather than exact matching (data selection)
• Browsing and relevance feedback
  – No ideal mathematical model defines similarity, so human feedback is required
Elements of AIR
Building the classification database:
Audio Input → Feature Extraction → Feature Classification/Clustering → Classification Model Space
Retrieval of the best-matching audio:
Audio Signal → Feature Extraction → Projection to Model Space → Find Matching Model Space → Retrieval of Best Match
Feature Extraction
• Time domain feature modules
  – Short-Time Energy and Average Magnitude
  – Short-Time Average Zero-Crossing Rate
  – Linear Prediction
  – Pulse metric, etc.
• Spectral domain feature modules
  – STFT
  – Spectral Centroid
  – Harmony Analysis
  – MFCC
  – Constant Q, etc.
Classification/Clustering Methods
• Deterministic Methods
  – Minimum Distance Classifier
  – k-nearest neighbor (k-NN)
  – Discriminant functions
  – Generalized Discriminators, etc.
• Statistical Methods
  – Class-related Probability Functions
  – Minimum Error Classification
  – Likelihood-based MAP Classification
  – Approximating a Bayes Classifier
  – Parameterization and Probability Estimation: Hidden Markov Model, etc.
Experiments: Music Classification System
Genre category  File No.  Filename        Duration (secs.)  MATLAB graphic notation
Jazz            1         dance.wav       11                'go'
Jazz            2         rocky.wav       13                'gx'
Jazz            3         band.wav        46                'g+'
Jazz            4         queen.wav       3                 'gd'
Pop/Rock        5         susqnet.wav     16                '*'
Pop/Rock        6         pop.wav         26                'v'
Pop/Rock        7         latin.wav       26                '*'
Pop/Rock        8         latin2.wav      16                '+'
Pop/Rock        9         reggae.wav      26                'd'
Pop/Rock        10        reggae2.wav     26                '^'
Classic         11        phantom.wav     6                 'ro'
Classic         12        quartet4.wav    10                'rv'
Classic         13        quartet3.wav    7                 'rx'
Classic         14        quartet2.wav    8                 'r*'
Classic         15        quartet1.wav    10                'r^'
Classic         16        musicnight.wav  14                'r<'
Classic         17        angel.wav       9                 'r>'
Classic         18        piacel.wav      8                 'r+'
Classic         19        piavio.wav      7                 'rx'
Classic         20        piavio1.wav     7                 'rs'
Test Signal               b3.wav          7                 'k.'
Experiments: Feature Modules
• Spectral Centroid
  – Center of gravity of the spectrum: the brightness of a sound
[Diagram: sound file input → preprocessing and windowing → frames → STFT → spectral centroid for each frame]
  – The centroid of an individual spectral frame is defined as the average frequency weighted by amplitudes, divided by the sum of the amplitudes, or:

$$\text{Spectral Centroid} = \frac{\sum_{k=1}^{N} k\,F[k]}{\sum_{k=1}^{N} F[k]}$$

  – Here, F[k] is the amplitude corresponding to bin k in the DFT spectrum.
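The centroid formula above can be sketched in a few lines of Python. This is a minimal illustration, not the MATLAB code used in the experiments: the function names are hypothetical, the DFT is a naive one written out for self-containment, and restricting the sum to the positive-frequency bins 1..N/2 is an assumption made here.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Return |DFT| for bins 1..len(frame)//2 of one frame (naive O(N^2) DFT).

    Assumption: only positive-frequency bins are kept, and bin 0 (DC) is skipped.
    """
    n = len(frame)
    mags = []
    for k in range(1, n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        mags.append(abs(acc))
    return mags


def spectral_centroid(mags):
    """Amplitude-weighted mean bin index: sum(k * F[k]) / sum(F[k]), k starting at 1."""
    total = sum(mags)
    if total == 0:
        return 0.0  # silent frame: the centroid is undefined, 0.0 is returned by convention
    return sum(k * f for k, f in enumerate(mags, start=1)) / total


# A sinusoid that completes 4 cycles in a 32-sample frame has its energy in bin 4.
frame = [math.sin(2 * math.pi * 4 * t / 32) for t in range(32)]
centroid = spectral_centroid(magnitude_spectrum(frame))
```

The result is a bin index; multiplying by the sampling rate divided by the frame length would convert it to Hz.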
Experiments: Spectral Centroid
[Figure: spectral centroid plots in three panels: Jazz, Pop/Rock, Classic]
Experiments: Spectral Centroid
[Figure: spectral centroid (y-axis) for each frame of the 20 sound files]
The weighted average spectral centroid of each frame in the 20 sound files.
Green: jazz. Blue: pop/rock. Red: classic.
• Pop/rock is higher than classical
• Classical fluctuates a lot
Experiments: Feature Modules
• Short-Time Energy Function
  – Amplitude variation over time
  – Rhythm and periodicity information
[Diagram: sound file input → preprocessing and windowing → frames → energy change for each frame]

$$E_n = \frac{1}{N} \sum_{m} \left[ x(m)\, w(n-m) \right]^2, \qquad w(x) = \begin{cases} 1, & 0 \le x \le N-1, \\ 0, & \text{otherwise.} \end{cases}$$

  – Here x(m) is the discrete-time audio signal, n is the time index of the short-time energy, and w(m) is a rectangular window.
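A minimal Python sketch of the short-time energy with a rectangular window of length N follows. The function names and the `hop` parameter (how far the window advances between frames) are assumptions for illustration; the slides do not state the frame spacing used in the experiments.

```python
def short_time_energy(x, n, N):
    """E_n = (1/N) * sum over m of [x(m) * w(n - m)]^2 for a rectangular window.

    w(n - m) is 1 only for n - N + 1 <= m <= n, so the sum runs over that range.
    Samples before the start of the signal are treated as zero.
    """
    start = max(0, n - N + 1)
    return sum(x[m] ** 2 for m in range(start, n + 1)) / N


def energy_contour(x, N, hop):
    """Short-time energy at every `hop`-th index where a full window fits (hop is assumed)."""
    return [short_time_energy(x, n, N) for n in range(N - 1, len(x), hop)]


signal = [1, -2, 3, 0, 1]
contour = energy_contour(signal, N=2, hop=1)
```

Plotting such a contour over time gives the amplitude-variation curve that the energy figures describe.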
Experiments: Short-Time Energy Function
[Figure: short-time energy of a jazz sample, dance.wav]
Experiments: Short-Time Energy Function
The pop/rock music samples show the most fluctuating energy, while the classical music samples show stable energy. The energy changes of the jazz samples seem to show medium fluctuation.
Experiments: Feature Modules
• Short-Time Average Zero-Crossing Rate
  – The zero-crossing rate (ZCR) measures how often the signal crosses zero per unit time.
  – A zero crossing occurs when successive samples have different signs.
  – The rate at which zero crossings occur is a simple measure of the frequency content of a signal.

$$Z_n = \frac{1}{2} \sum_{m} \left| \operatorname{sgn}[x(m)] - \operatorname{sgn}[x(m-1)] \right| w(n-m), \qquad \text{where } \operatorname{sgn}[x(n)] = \begin{cases} 1, & x(n) \ge 0, \\ -1, & x(n) < 0. \end{cases}$$

  – w(n) is a rectangular window of length N.
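The ZCR definition can be illustrated with a short Python sketch. The function names are hypothetical. With the factor 1/2 and a rectangular window, Z_n simply counts the sign changes inside the window ending at index n.

```python
def sgn(v):
    """Sign function as defined on the slide: 1 for v >= 0, -1 for v < 0."""
    return 1 if v >= 0 else -1


def short_time_zcr(x, n, N):
    """Z_n = 1/2 * sum over m of |sgn[x(m)] - sgn[x(m-1)]| * w(n - m).

    The rectangular window keeps m in n - N + 1 .. n; m starts at 1 at the
    earliest because x(m - 1) must exist.
    """
    start = max(1, n - N + 1)
    return 0.5 * sum(abs(sgn(x[m]) - sgn(x[m - 1])) for m in range(start, n + 1))


signal = [1, -1, 1, 1, -1, -1]
crossings_full = short_time_zcr(signal, n=5, N=6)   # window covers the whole signal
crossings_tail = short_time_zcr(signal, n=5, N=2)   # window covers the last two samples
```

Dividing the count by the window duration would turn it into a rate per unit time.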
Experiments: Short-Time Average ZCR
[Figure: short-time average ZCR of a classical sample, quartet4.wav]
Experiments: Short-Time Average ZCR
The short-time average ZCR does not seem to give any indication that separates the three music genres.
Experiments: Classification/Clustering Methods
• K-means clustering algorithm
  – K-means cluster analysis programs begin by creating the K clusters according to some arbitrary procedure.
  – The program calculates the mean, or centroid, of each cluster.
  – If an observation is closer to the centroid of another cluster, it is made a member of that cluster.
• K-Nearest Neighbour Classifier (KNN)
  – Classifies a feature space from a given set of sample data by evaluating the k nearest sample points of each point in the feature space.
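The K-means steps listed above can be sketched as follows. This is an illustrative sketch only: the function name, the `max_iter` cap, and passing the arbitrary initial centroids in as an argument are assumptions, and the toy 2-D points are not data from the experiments.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Plain K-means: assign each point to its nearest centroid, recompute means, repeat.

    points: list of equal-length tuples. centroids: arbitrary initial centroids.
    Stops when the centroids no longer change or after max_iter rounds.
    """
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: each observation joins the cluster with the closest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: recompute each centroid; an empty cluster keeps its old centroid.
        new_centroids = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
final_centroids, final_clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
```

Because the starting clusters are arbitrary, different initial centroids can lead to different final clusters.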
Experiments: Classification/Clustering Methods
[Figures: K-means clustering; KNN classifier]
Experiments: Classification/Clustering Methods
Feature vectors of the 20 input files are extracted in each frame and plotted. Red: classical. Blue: pop/rock. Green: jazz.
The first figure plots spectral centroid on the x-axis and short-time energy on the y-axis. The second plots short-time energy on the x-axis and short-time ZCR on the y-axis. The third figure shows the 3-dimensional space.
Experiments: Classification/Clustering Methods
The means of the feature vectors of the 20 music samples are plotted in 3-dimensional space.
Experiments: Classification/Clustering Methods
The feature vectors of the test signal b3.wav are plotted as black dots in 2-dimensional space.
Experiments: Classification/Clustering Methods
The nearest neighbours are determined using Euclidean distance. Each mean of the 20 sound samples gets a predicted class label, an index from 1 to 20. Each feature vector of the test signal is assigned to one of the 20 means. For the test signal in the figure above, the greatest number of feature vectors was assigned to mean 13. The following shows the result in MATLAB.
[Figure labels: Test; 13. Quartet3.wav]
>> classfier
ans =
class is 13
>> classPoint
classPoint =
16556
14002
12326
27 % 27 feature vectors assigned to 13 th class5
1222007
>>
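The voting procedure described for the test signal can be sketched in Python as follows. This is not the MATLAB `classfier` script from the experiments. The function name is hypothetical and the three toy means stand in for the 20 sample means. Ties are resolved in favour of the lowest class index, which is an assumption.

```python
import math
from collections import Counter


def classify_by_vote(test_vectors, class_means):
    """Assign every test feature vector to its nearest mean (Euclidean distance),
    then pick the class that collected the most vectors.

    Returns (winning class index, per-class counts); class indices start at 1.
    """
    votes = Counter()
    for v in test_vectors:
        nearest = min(range(len(class_means)), key=lambda i: math.dist(v, class_means[i]))
        votes[nearest + 1] += 1
    counts = [votes[i] for i in range(1, len(class_means) + 1)]
    winner = counts.index(max(counts)) + 1  # first maximum wins on a tie
    return winner, counts


means = [(0, 0), (5, 5), (10, 0)]                 # toy stand-ins for the sample means
test_vectors = [(4, 4), (6, 5), (0, 1), (5, 6)]   # toy stand-ins for test-signal frames
winner, counts = classify_by_vote(test_vectors, means)
```

Here `counts` plays the role of the per-class tally and `winner` the reported class.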
Experiments: Discussion
• Although the test signal and quartet3.wav do not fall into the same category, they sound similar in rhythm and tempo. This system does not seem to classify timbre information effectively.
• The number of feature modules in this system is limited.
• The variance of the feature vectors is not considered.
• More samples are needed for further experiments.