12
Hindawi Publishing Corporation e Scientific World Journal Volume 2013, Article ID 752464, 11 pages http://dx.doi.org/10.1155/2013/752464 Research Article Music Identification System Using MPEG-7 Audio Signature Descriptors Shingchern D. You, 1 Wei-Hwa Chen, 2 and Woei-Kae Chen 1 1 Department of Computer Science and Information Engineering, National Taipei University of Technology, Taipei 104, Taiwan 2 Hon-Hai Precision Industry Co. Ltd, Tucheng District, New Taipei City 236, Taiwan Correspondence should be addressed to Shingchern You; [email protected] Received 3 January 2013; Accepted 24 January 2013 Academic Editors: P. Melin, J. Pav´ on, and K. Polat Copyright © 2013 Shingchern D. You et al. is is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. is paper describes a multiresolution system based on MPEG-7 audio signature descriptors for music identification. Such an identification system may be used to detect illegally copied music circulated over the Internet. In the proposed system, low- resolution descriptors are used to search likely candidates, and then full-resolution descriptors are used to identify the unknown (query) audio. With this arrangement, the proposed system achieves both high speed and high accuracy. To deal with the problem that a piece of query audio may not be inside the system’s database, we suggest two different methods to find the decision threshold. Simulation results show that the proposed method II can achieve an accuracy of 99.4% for query inputs both inside and outside the database. Overall, it is highly possible to use the proposed system for copyright control. 1. Introduction With the SOPA (stop online piracy act) bill [1] proposed in 2011, the protection of copyrighted intellectual property, such as digital content, once again brought to public attention. Despite the controversial issues of the SOPA bill, it is commonly agreed that copyrighted digital content should be protected. However, the first step toward the protection of copyrighted content is to identify whether a piece of digital content is copyrighted, and if so, who owns it. In this regard, it is important to identify (detect) whether a digital work is copyrighted or not. Among digital content, soundtracks (usually in the form of audio files) are one type of content that is easily to be ille- gally reproduced. Owing to the advanced techniques in audio compression, music soundtracks are usually distributed over the Internet in compressed form rather than in uncompressed form. erefore, any approach for copyright detection must be able to deal with both compressed and uncompressed audio files. A typical method to attach the copyright information to a piece of music is by embedding watermarks [2]. ough effective, this method has some limitations, such as the watermarks must be embedded into the source soundtracks before release. erefore, it is not possible to identify the rights owner of a piece of music without watermarks. Another concern is that the embedding process usually introduces distortion. us, the quality of the embedded audio may be degraded. In addition to the watermarking technique, it is also possible to identify rights owner by comparison. For example, if an unknown soundtrack is very similar to a soundtrack owned by a company, then the unknown soundtrack is highly likely copyrighted in that company’s name. is type of approach is especially suitable for audio files because, in prac- tice, tremendous amount of records currently available do not embed watermarks or any kind of copyright information. When comparing a piece of music with a music database, the comparison may be accomplished based on the melody (i.e., musical notes) of the music [3]. For this type of comparison, however, if two persons sing the same song, these two works will be recognized as the same one. Since different artists may perform the same song, known as the cover version, a comparison based on melody cannot solve this problem. Another type of comparison is based on the waveform of the music. is technique is also known as music identifica- tion. In this case, the same song performed by different artists

Research Article Music Identification System Using MPEG-7 ...downloads.hindawi.com/journals/tswj/2013/752464.pdf · Gracenote s MusicID [ ]. According to [ ], there are more ... one

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Research Article Music Identification System Using MPEG-7 ...downloads.hindawi.com/journals/tswj/2013/752464.pdf · Gracenote s MusicID [ ]. According to [ ], there are more ... one

Hindawi Publishing CorporationThe Scientific World JournalVolume 2013 Article ID 752464 11 pageshttpdxdoiorg1011552013752464

Research ArticleMusic Identification System Using MPEG-7 AudioSignature Descriptors

Shingchern D You1 Wei-Hwa Chen2 and Woei-Kae Chen1

1 Department of Computer Science and Information Engineering National Taipei University of Technology Taipei 104 Taiwan2Hon-Hai Precision Industry Co Ltd Tucheng District New Taipei City 236 Taiwan

Correspondence should be addressed to Shingchern You youcsientutedutw

Received 3 January 2013 Accepted 24 January 2013

Academic Editors P Melin J Pavon and K Polat

Copyright copy 2013 Shingchern D You et al This is an open access article distributed under the Creative Commons AttributionLicense which permits unrestricted use distribution and reproduction in any medium provided the original work is properlycited

This paper describes a multiresolution system based on MPEG-7 audio signature descriptors for music identification Such anidentification system may be used to detect illegally copied music circulated over the Internet In the proposed system low-resolution descriptors are used to search likely candidates and then full-resolution descriptors are used to identify the unknown(query) audio With this arrangement the proposed system achieves both high speed and high accuracy To deal with the problemthat a piece of query audio may not be inside the systemrsquos database we suggest two different methods to find the decision thresholdSimulation results show that the proposed method II can achieve an accuracy of 994 for query inputs both inside and outsidethe database Overall it is highly possible to use the proposed system for copyright control

1 Introduction

With the SOPA (stop online piracy act) bill [1] proposed in2011 the protection of copyrighted intellectual property suchas digital content once again brought to public attentionDespite the controversial issues of the SOPA bill it iscommonly agreed that copyrighted digital content should beprotected However the first step toward the protection ofcopyrighted content is to identify whether a piece of digitalcontent is copyrighted and if so who owns it In this regardit is important to identify (detect) whether a digital work iscopyrighted or not

Among digital content soundtracks (usually in the formof audio files) are one type of content that is easily to be ille-gally reproduced Owing to the advanced techniques in audiocompression music soundtracks are usually distributed overthe Internet in compressed form rather than in uncompressedform Therefore any approach for copyright detection mustbe able to deal with both compressed and uncompressedaudio files

A typical method to attach the copyright information toa piece of music is by embedding watermarks [2] Thougheffective this method has some limitations such as thewatermarks must be embedded into the source soundtracks

before release Therefore it is not possible to identify therights owner of a piece ofmusicwithoutwatermarks Anotherconcern is that the embedding process usually introducesdistortion Thus the quality of the embedded audio may bedegraded

In addition to the watermarking technique it is alsopossible to identify rights owner by comparison For exampleif an unknown soundtrack is very similar to a soundtrackowned by a company then the unknown soundtrack is highlylikely copyrighted in that companyrsquos name This type ofapproach is especially suitable for audio files because in prac-tice tremendous amount of records currently available do notembed watermarks or any kind of copyright information

When comparing a piece of music with a music databasethe comparison may be accomplished based on the melody(ie musical notes) of the music [3] For this type ofcomparison however if two persons sing the same songthese two works will be recognized as the same one Sincedifferent artists may perform the same song known as thecover version a comparison based on melody cannot solvethis problem

Another type of comparison is based on the waveform ofthe music This technique is also known as music identifica-tion In this case the same song performed by different artists

2 The Scientific World Journal

generally does not have the same waveforms and thereforethey can be correctly identifiedThough conceptually simpleit is not plausible to directly compare PCM samples of twopieces of music because it would take too much time forthe comparison For example a typical compact disc (CD)has about 600M bytes of PCM samples to store about tensongs If a database contains 10000 different songs then thePCM samples occupy about 600G bytes of space A piece ofunknownmusic with duration of ten seconds has about 880 kbytes of data It is obvious that it requires a huge amountof computation to sequentially compare the 880 k bytes ofdata with the 600G bytes of data in the database Thereforedimension-reduced representations of the PCM waveformsknown as fingerprints are used in comparison Amongthe fingerprints most of them are defined by individualcompanies or groups Some of them are briefly explained inthe following

Researchers in Google develop a fingerprinting schemefor audio calledWaveprint [4] based onwaveletsWith the aidof wavelets the fingerprint is invariant to timescale changeIn other words whether the audio piece is played faster orslower than the normal speed the fingerprint is unchangedThe fingerprint of a piece of 4-minute music is around 64 kbytes equivalently 2133 bits per second

Shazam [5] is a company (and service) dedicated formusic identification Its database contains around elevenmillion soundtracks As described in [6] the fingerprintsused are sets of triplets based on spectrogram peaks Forexample if (119905

1 1198911) and (119905

2 1198912) are two peaks at time 119905

1and

1199052and frequency 119891

1and 119891

2 then the triplet ((119905

2minus 1199051) 1198911

(1198912minus 1198911)) is a feature Based on the realization of [7] the

fingerprint in this scheme uses 400 bits per secondResearchers in Philips also propose a fingerprinting

scheme [8] The computation of the fingerprints includesframing windowing (von Hann window) FFT (fast fouriertransform) band decision energy computation and thenquantization (into binary) In the typical setting one secondof audio has around 2730 bits of fingerprint

Microsoftrsquos Robust Audio Recognition Engine (RARE)[9] divides the incoming audio into overlapping frames Eachframe is converted to spectral domain by MCLT (modulatedcomplex lapped transform)The spectral values are applied totwo layers of OPCA (oriented principle component analysis)to reduce the dimensionality of the spectral data For thismethod 344 features (11008 bits if one feature is stored in4 bytes in a floating point) are obtained per second

In addition to the above methods there are actuallymany more different types of audio fingerprinting schemesavailable such as Music Brainz [10] Audible Magic [11] andGracenotersquos MusicID [12] According to [13] there are morethan ten different audio fingerprinting schemes available

Since there are vast amount of different fingerprintingschemes available some researchers then conducted exper-iments to compare the relative performance among someof them The results show that if the schemes use thesame number of bits to represent fingerprints they havecomparable performance [14] Therefore the selection of thefingerprinting schemes should also consider other factors

(such as interoperability to be addressed below) rather thanmerely the minor performance difference

With the ever-increasing amount of multimedia contentover the Internet and in the multimedia databases it is animportant task to exchange multimedia content To respondthe public demands ISOrsquos (International StandardizationOrganization) working group developed MPEG-7 standard[15 16] In the audio part of the standard [17] a high-level tool is developed for audio identification called audiosignature description scheme The fingerprints used in thescheme are called audio signature descriptors and theyhave good identification accuracy [18 19] In the followingwe interchangeably use descriptors and fingerprints withoutdistinction

Although proprietary audio fingerprints have excellentidentification performance the MPEG-7 audio descriptorsoffer some advantages First being an international standardensures the open and fair use (subject to license fee) of thetechnology Second such an international standard makesthe interoperability possible For example if a mobile phoneinstalls an application program to convert a piece of recordedaudio to MPEG-7 descriptors the descriptors can be sent toany website accepting the descriptors On the other handit is not possible to send proprietary fingerprints used inone company to database systems owned by competitorsThird different companies may share or exchange theiraudio descriptors (fingerprints) in their databases withoutany difficulties In the current situation each company has tocompute fingerprints for newly released albumsWith the useof MPEG-7 descriptors the redundant efforts of computingfingerprints can be minimized

Although a music identification system based on audiofingerprints has several applications [8] we concentrate onthe issue of detecting if a piece of circulated music is highlysimilar to a copyrighted work or not In a typical case thesimilarity is measured by a distance metric If the distanceis shorter than a threshold the two pieces of music areconsidered as similar Although it is not trivial to determine asuitable threshold [20] this problem is not fully studied Forexample [20] does not indicate any approach to determinethe threshold In addition the audio files to be comparedmay be very large therefore it is very important to reducethe comparison time while maintaining high identificationaccuracy Since not many papers address these two issuesbased on MPEG-7 descriptors we report in this paper ourapproaches and experimental results

This paper is organized as follows Section 2 is anoverview of the MPEG-7 audio signature descriptorsSection 3 is the system model for music identificationSection 4 describes the dimensionality reduction methodused in the paper Section 5 is the proposed strategy todetermine the threshold Section 6 covers the experimentsand results Section 7 is the conclusion

2 Overview of MPEG-7 AudioSignature Descriptors

Part 4 [17] of the MPEG-7 standard includes low-leveland high-level descriptors for various applications Low-level descriptors are derived from the temporal and spec-tral characteristics of the waveform whereas the high-level

The Scientific World Journal 3

descriptors are constructed based on low-level descriptors Inthis section we will briefly describe the descriptors related tomusic identification

21 Low-Level and High-Level Descriptors of the MPEG-7Audio There are 17 low-level audio descriptors defined inthe standard All of these low-level descriptors are derivedfrom the waveform of the music They can be dividedinto six different categories basic basic spectral signalparameter temporal timbral spectral timbral and spectralbasis representationThese low-level features may be directlyused or may serve as the basics for constructing high-leveldescriptors

Based on low-level descriptors MPEG-7 audio standardalso defines high-level description schemes for various appli-cations These include audio signature description instru-ment timbre description general sound recognition andindexing description and spoken content description Theaudio signature descriptors one type of fingerprints are usedfor music identification

22 MPEG-7 Audio Signature Descriptors Although low-level descriptors in MPEG-7 audio may also be used toidentifymusic it is shown that the audio signature descriptorsprovide better identification performance [18 19] Thereforethe audio signature descriptors are adopted in the proposedsystem

TheMPEG-7 audio signature descriptors are computed asfollows

(1) Time-to-frequency conversion this step is based onaudio spectrum envelope descriptors including thefollowing substeps

(i) Determine the hop length between two con-secutive windows in the unit of samples Thedefault value is equivalent to 30ms

(ii) Define the window length 119897119908 This value is set to

three times the hop length The chosen windowis Hamming window

(iii) Determine the length of FFT denoted as 119873FFTTo reduce the computational complexity 119873FFTis the smallest power-of-2 number equal to orgreater than 119897

119908 For example if the sample rate of

the audio is 44100 ss then 119897119908= 44100sdot003sdot3 =

3969Therefore119873FFT =4096The extra samplesafter 119897

119908are padded with zeros

(iv) Perform the FFT on the windowed samples

(2) Divide the FFT coefficients into subbands with eachsubband having a bandwidth of one-fourth of anoctave In addition the bandwidths of two consecu-tive subbands should overlap each other by 10Thatis the computed bandwidth for each subbandmust bemultiplied by 11 The beginning frequency of the firstsubband denoted as loEdge is fixed at 250Hz Table 1lists the frequency range of the first three subbandsbefore and after (spectral) overlapping

Subband

middot middot middot middot middot middot

119860 =

11988612 119886124

11988621 11988622 119886224

119886301 119886302 1198863024

119886311 119886312 1198863124

middot middot middot middot middot middot

middot middot middot middot middot middot

middot middot middot middot middot middot

11988611

Time

Figure 1 MPEG-7 audio signature descriptors in a matrix

(3) Find the flatness measure F(b) for subband b by

119865 (119887) =

ℎ(119887)minus119897(119887)+1radicprodℎ(119887)

119894=119897(119887)119888 (119894)

(1 (ℎ (119887) minus 119897 (119887) + 1))sumℎ(119887)

119894=119897(119887)119888 (119894)

(1)

where c(i) is the power spectrum computed by FFT(in step 1) and h(b) and l(b) are the lower and upperindices of c(i) within subband b

(4) Find themean and variance of F(b) for subband b overa certain number of successive FFT windows calledscaling ratio The default value of the scaling ratio is16

(5) The series of mean and variance values are the audiosignature descriptors

With the computational steps of the audio signaturedescriptor we may compute the number of descriptors ina piece of 15-second music In the time domain about(15minus009)003 = 497windows are used to cover the 15-secondsignal because the hop size is 30ms Since the scaling ratio isset to 16 there are 49716 asymp 31 values per subband In thespectral domain since there are four subbands per octave andwe use six octaves (250Hz sim 16 kHz) there are totally 24subbands in the spectral domain Thus the 15-second signalproduces 24 times 31 = 744 mean values and other 744 variancevalues The descriptors are arranged in a matrix form asshown in Figure 1

Since the mean values are sufficient for the identificationpurpose [21] we will not consider the variance descriptorsin the following The obtained descriptors are referred to ashigh-resolution (or full-dimensional) descriptors Accordingto [21] if a piece of query music is to be compared with areference piece it should be done with a sliding comparisonas shown in Figure 2 In the figure one segment of linerepresents descriptors in a piece of music arranged in onedimensional structure Using the representation in Figure 1the query segment is arranged as

119876 = [11990211 sdot sdot sdot 119902124 sdot sdot sdot 119902311 sdot sdot sdot 1199023124] (2)

Since only descriptors from the same subband are to becompared the hop size between two query segments inFigure 2 is 24 descriptors or 003 sdot 16 = 048 second Inother words one descriptor (in a subband) represents 048second of audio samples The smallest Euclidean distance

4 The Scientific World Journal

middot middot middot

middot middot middot

119886124 11988621 119886224 1198863124middot middot middot

middot middot middot

11988611 11988631 middot middot middot 119886324 middot middot middot 1198863224 1198866224

11990211 119902124 11990222411990221 1199023124

middot middot middot11990211 119902124 11990222411990221 1199023124middot middot middot

middot middot middot11990211 119902124 1199023124middot middot middot

middot middot middot

middot middot middot

middot middot middot

Figure 2 The sliding operation to compare the query input Q (15seconds) and the reference music A (30 seconds)

obtained during the sliding operation is recorded as thedistance between the query input and the reference musicFor example if the query Q is to be compared with referenceA then we may obtain the Euclidean distance 119864

119860119861(119896) for the

k-th sliding comparison as

119864119876119860(119896) =

31

sum

119894=1

24

sum

119895=1

10038161003816100381610038161003816119902119894119895minus 119886(119894+119896)119895

10038161003816100381610038161003816 (3)

The recorded distance between these two pieces of music is

119889119876119860

= min119896

119864119876119860(119896) (4)

Although using audio signature descriptors greatlyreduces the computational burden for comparison therequired computation using (3) is still very large Accordingto [21] suppose that a database contains 1000 audio files witheach one having duration of 30 seconds and the music tobe identified has a duration of 15 seconds then 47 616 000arithmetic operations are required Therefore it is beneficialto further reduce the number of comparison which can beaccomplished by employing a multiresolution strategy

3 Music Identification Based onMultiresolution Strategy

As discussed previously the computational cost is still veryhigh even if we use the MPEG-7 descriptors Thereforewe will use the multiresolution strategy to reduce thecomputational complexity To do so in addition to theabove-mentioned high-resolution descriptors we also needto generate low-resolution descriptors (see also Section 4)Therefore the music database contains high-resolution andlow-resolution descriptors for each soundtrack to be identi-fied This step can be accomplished during the setup of thedatabase In addition a training process is conducted to finda distance threshold to determine whether the query inputis in the database or not Section 5 has a detailed descriptionabout the training process

Once the database is constructed as shown in Figure 3the music identification procedure consists of the followingsteps

(1) When a query input is sent to the system it com-putes the MPEG-7 audio signature descriptors andlow-resolution descriptors for the query music If amobile device is used to record the music usually the

Table 1 Bandwidth of the first three subbands

Bandwidth of subband(nonoverlapped)

Bandwidth of subband(overlapped)

250ndash2973Hz 2375ndash3122Hz2973ndash3536Hz 2824ndash3712Hz3535ndash4204Hz 3359ndash4415Hz

descriptors (fingerprints) are sent instead of the PCMsamples to reduce the size of the transmitted data

(2) The computed low-resolution descriptors are com-pared with those in the database Based on thedistance metric a list of candidates is obtainedSince the query music may start at any position ofthe soundtrack a sliding comparison as shown inFigure 2 is necessary

(3) After obtaining the candidate list high-resolutiondescriptors are used to find the distances between thequery input and the candidates The shortest distanceand its associated soundtrack are recorded

(4) If the shortest distance is less than the threshold thequery input is considered as highly similar to therecorded soundtrack in the database Otherwise thequery input is not in the database

During the low-resolution comparison it is also possibleto use an existing algorithm to reduce the comparison timeFor example we may arrange the descriptors in k-d tree (k-dimensional binary search tree) [22 23] structure to reducethe search time However using the k-d tree structure impliesthat every segment of music in the database has the samenumber of descriptors or same duration Note that the unitto be compared is one segment and one soundtrack canbe divided into many overlapped segments as shown inFigure 2 Therefore if the duration of the segment is set to 15seconds then the querymusic must also have a duration of 15seconds to use the system In a practical situation if the queryinput is longer than 15 seconds then only 15 seconds of themusic is used for computing the descriptors and comparison

4 Dimensionality Reduction Method forMPEG-7 Audio Signature Descriptors

This section briefly explains the dimensionality reductiontechnique used in the experiments Although a block-averagemethod is proposed in [21] for this purpose we use a differenttechnique in this paper

41 Problems of Reducing Dimensionality Using Scaling RatioConceptually we may increase the scaling ratio (given inSection 2) to reduce the number of descriptors duringcomputing them For example by increasing the scalingratio from 16 to 256 the number of descriptors is reducedby 16 times Unfortunately this approach does not yieldsatisfactory results because of insufficient time resolutionRecall that the descriptors are derived based on thewindowedwaveform Therefore if two segments of the soundtrack are

The Scientific World Journal 5

Query

Database

Dim reduction

In database

Reject Result

Threshold

Candidate

MPEG-7descriptors

Fullresolution

search

Lowresolution

search

YesNo

Figure 3 Procedure for multiresolution music identification

Descriptors

Descriptors Descriptors1198861198991 11988611989924 119886119899+11 119886119899+124

1199021198991 11990211989924

Figure 4 The audio samples and the descriptors In a real applica-tion 119886

119899119896is stored in the database whereas 119902

119899119896is computed from

the query input Usually 119886119899119896

is not equal to 119902119899119896

due to differentscopes of the windows even though they are all derived from thesame soundtrack

not highly similar their corresponding descriptors usuallyhave large differences With a scaling ratio of 256 eachdescriptor represents around 77 seconds of audio samplesTherefore unless the query input also starts at a point veryclose to the segment boundary of a soundtrack a comparisonbased on these descriptors may fail As illustrated in Figure 4descriptors of 119886

119899119896(or 119886119899+1119896

) and 119902119899119896

are quite different andtherefore the descriptor-based identification scheme cannotidentify the query inputTherefore a suitable time resolutionfor example 048 second should be maintained

42 Proposed Dimensionality Reduction Method In contrastto reduce the number of descriptors by increasing the scalingratio we may reduce them in each temporal-spectral block(ie the matrix in Figure 1) representing a segment of (15-second) music However to maintain a high time resolutionsuccessive temporal-spectral block should have a small timedifference as shown in Figure 5 With this arrangement weachieve both high identification rate and low comparisoncomplexity

For each temporal-spectral block we use PCA (princi-pal component analysis) [21 24] to reduce the number ofdescriptors Since it is difficult to directly use PCA to obtaingood low-resolution descriptors [21] we use an alternativeapproach Its basic idea is to partition the entire block into

119860 =

middot middot middot

119886124

11988621 119886224

119886311

119886321

1198863124

middot middot middot

11988611

11988631 119886324middot middot middot

middot middot middot

middot middot middot

middot middot middot

1198863224

119902331 1199023324

middot middot middot

middot middot middot

middot middot middot 1198866224119886621

11989111 11989118

11989121 11989128

11989131 11989138

Figure 5 Low-resolution descriptors In the figure each temporal-spectral block corresponds to descriptors from 15-second music

119860 = =11986011 1198601211986021 11986022

middot middot middot middot middot middot

middot middot middot

middot middot middot

middot middot middot

middot middot middot

middot middot middot

middot middot middot

middot middot middot

middot middot middot

11988611 11988618

119886151

119886161

119886291

119886158

119886168

119886298

11988619

119886159

119886169

119886299

119886116

1198861516

1198861616

1198862916 1198862917 1198862924

119886124119886117

Figure 6 Partition of a temporal-spectral block into four subblocksin the experiments

four subblocks as shown in Figure 6 and then use PCA toreduce the number of descriptors of each subblock into twovalues (descriptors) called low-resolution descriptors Alsohigh-frequency descriptors in the original temporal-spectralblock are all discarded because they are susceptible to noise[21] With this arrangement totally eight descriptors are usedto represent a segment of 15-second music Note that theactual duration of a segment used in the experiments is 1404seconds (though we still say 15 seconds) because some audiosamples are lost after MP-3 compression and decompressionin the experiments

6 The Scientific World Journal

Table 2 Identification accuracy using high-resolution features

Uncompressed 192 k MP-3 96 k MP-3High-frequencydescriptors Used Not used Used Not used Used Not used

Match in firstone 100 100 100 100 94 100

Match in first 15 100 100 100 100 98 100

Wenowdescribe how to calculate low-resolution descrip-tors Since using PCA for dimensionality reduction is a well-known approach we only describe how to construct thecovariance matrix for PCA computation and omit the com-putation of finding principal components Suppose that thereare N segments of music pieces in the database with theirdescriptor matrices denoted as 119860(1) to 119860(119873) To simplify theargument we consider the subblock matrix119860(119896)

11(referring to

Figure 6) in the following Other subblocks can be computedby the same procedure First collect all subblocks from 119860

(119896)

11

to form a big matrix as follows

119861 =

[[[

[

119886(1)

11sdot sdot sdot 119886

(1)

18119886(2)

11sdot sdot sdot 119886

(2)

18sdot sdot sdot 119886(119873)

11sdot sdot sdot 119886(119873)

18

sdot sdot sdot

119886(1)

151sdot sdot sdot 119886(1)

158119886(2)

151sdot sdot sdot 119886(2)

158sdot sdot sdot 119886(119873)

151sdot sdot sdot 119886(119873)

158

]]]

]

= [

[

uarr sdot sdot sdot uarr

b1sdot sdot sdot b8119873

darr sdot sdot sdot darr

]

]

(5)

Then we may consider column vectors of matrix B as b119894

vectors Having the data vectors b119894 the covariance matrix

and subsequently the principal components can be foundBy keeping only two principal components we can obtainc119894= [11988811989411198881198942] from b

119894 Since there are eight column vectors in

a subblock descriptors in119860(119896)11

are reduced to eight c119894(8119896minus7 le

119894 le 8119896) vectors Next we can rearrange the obtained c119894vectors

as

119862 =[[

[

11988811

11988812

11988891

11988892

sdot sdot sdot 1198888119873minus71

1198888119873minus72

sdot sdot sdot

11988881

11988882

119888161

119888162

sdot sdot sdot 11988881198731

11988881198732

]]

]

(6)

Again by treating each row of matrix C as the data to beprocessed by PCA we can compute the covariance matrixand finally reduce each column vector to one value Sincethere are two columns originally from one subblock thereare totally two (low-resolution) descriptors per subblockEquivalently a segment of 15-second music is representedby eight descriptors During the computation the principalcomponents obtained in the first and the second steps shouldbe stored in the database When a query input (in the formof high-resolution descriptors) is supplied these componentsare used to obtain the low-resolution descriptors for theinput

Table 3 Comparison of identification accuracies using low-resolution descriptors obtained by the proposed approach and bythe averaging (avg) method [21]

Uncompressed 192 k MP-3 96 k MP-3Method Proposed Avg Proposed Avg Proposed AvgMatch in firstone 993 995 991 993 901 898

Match in first15 100 100 100 100 996 995

5 Threshold to Determine the Membership ofthe Query Input

Since in practice the database cannot collect all musicsoundtracks in the world we have to have a strategy todetermine if the query input is actually in the database ornot In our case the decision is accomplished with the aid ofa threshold If the shortest distances between the input andthe candidates are greater than a threshold then the queryinput is not in the database otherwise it is in the databaseThough the concept is simple it is not trivial to determinethe threshold [20]

Recall that when a query input is applied to the proposedsystem the system computes the Euclidean distances betweenthe low-resolution descriptors of the query input and thosein the database Accordingly m (say m = 20) segments ofsoundtracks from distinct titles are selected based on thecomputed distances Next the Euclidean distances betweenthe query and the selected segments are computed againusing full-resolution descriptors By sorting the distancesfrom small to large the system creates a candidate list for thequery input

To compute the previously mentioned threshold weexamine two different methods In the first method (methodI) we define the ldquofirstrdquo distance 119889

1as the (full-resolution)

distance associated with the first candidate in the list Simi-larly the ldquosecondrdquo distance 119889

2is the distance associated with

the second candidate as shown in Figure 7 The methodto compute the first and second distances is used both intraining and in identification For the training phase assumethat there are 119873

119879soundtracks used For a segment from a

soundtrack indexed n let the first distance be 1198891(119899) and the

second distance be 1198892(119899) The threshold 119879

119860is then computed

as

119879119860= 05 sdot (

119873119879

sum

119899=1

1198891(119899) +

119873119879

sum

119899=1

1198892(119899)) (7)

During the identification phase the first distance forthe query input is computed If the first distance is smallerthan the threshold 119879

119860 the query input is determined to be

highly similar to the top soundtrack title in the candidate listOtherwise the query input is not inside the database

In addition to method I we also propose an alternativemethod (method II) to compute the first and seconddistances

The Scientific World Journal 7

Firs

t dist

ance

Seco

nd d

istan

ce

Figure 7 The first and second distances in method I

and the threshold Referring to Figure 8 the first distance1198891015840

1(119899) of method II for a training segment in soundtrack n is

1198891015840

1(119899) = 119889

1(119899) divide

sum119872

119898=2119889119898(119899)

119872 minus 1 (8)

where 119889119898(119899) is the distance associatedwith themth candidate

in the list andM is a constant Based on our experimentM =10 is sufficient By the same arrangement the second distance1198891015840

2(119899) is calculated as

1198891015840

2(119899) = 119889

2(119899) divide

sum119872+1

119898=3119889119898(119899)

119872 minus 1 (9)

Similar to (7) the threshold 1198791015840119860is determined as

1198791015840

119860= 05 sdot (

119873119879

sum

119899=1

1198891015840

1(119899) +

119873119879

sum

119899=1

1198891015840

2(119899)) (10)

During identification phase if the computed distance 11988910158401

is greater than1198791015840119860 the query input is determined as not in the

database Otherwise the query input is highly similar to thetop soundtrack title in the candidate list

6 Experiments and Results

Before conducting the experiments we collect 750 sound-tracks frommany CD titles for constructing the database andfor identification In the experiments 30 seconds of musicis excerpted from the soundtracks as the reference items tobe identified Then low- and full-resolution descriptors ofthe reference items are calculated and stored in the databaseWe also randomly excerpt 15-second query items from thereference items Note that a query item may start fromany sample on the first half of the reference To test theidentification accuracy for compressed audio the 15-secondquery inputs are also encoded and then decoded with anMP-3 coder with bitrates of 192 k and 96 k respectivelyHowever the database only contains descriptors from theuncompressed items In addition the principal componentsare also obtained using uncompressed items

Table 4 Comparison of search time between 119896-119889 tree search andlinear search

119870-119889 tree Linear searchAverage time 0255 sec 323 sec

Several experiments are conducted to examine the per-formance of the proposed systemThefirst experiment checksthe identification accuracy using high-resolution descriptorsThe second one compares the relative identification accuracybetween the proposed method (in Section 42) and themethod given in [21] The third experiment compares thesearch speed by using the linear search and k-d tree searchThe next experiment intends to determine a suitable valueof M used in (8) and (9) Having the value of M we reportthe identification accuracy using a multiresolution strategyin experiment five In this experiment both method I andmethod II are examined To have a complete comparisonbetween method I and II we also report the results in termsof the ROC (receiver operating characteristics) [25] curveand the DET-like (detection error tradeoff) [26] curve butwithout logarithm

61 Experiment One Identification Accuracy Using High-Resolution Descriptors The first experiment is to examinethe identification accuracy using the high-resolutionMPEG-7 audio signature descriptors (without reduction) To evaluatethe influences of high-frequency descriptors we also exam-ine the accuracies with and without using high-frequency(greater than 25 kHz) descriptors The results are given inTable 2 In the table ldquomatch in first onerdquo is the rate that thequery input is correctly identified with the shortest distanceamong all reference items Similarly ldquomatch in first 15rdquo is therate that the query is correctly identified within a list of 15 ref-erence items sorted by distanceThe results indicate that high-resolution descriptors have very good identification accuracyAlso for uncompressed or 192 k MP-3 items whether usinghigh-frequency descriptors does not affect the identificationaccuracy However for 96 k MP-3 items discarding high-frequency descriptors greatly improves the accuracy

8 The Scientific World Journal

Seco

nd d

istan

ce

Firs

t dist

ance

AvgAvg

dividedivide

Figure 8 The first and second distances in method II

62 Experiment Two Identification Accuracy Using Low-Resolution Descriptors The second experiment compares therelative identification accuracies of the proposed dimension-ality reduction technique and the averaging method pro-posed in [21] In this experiment eight descriptors are used torepresent one segment of music (15 seconds) The results areshown in Table 3 For this experiment the accuracy rate inldquomatch in first 15rdquo ismore important than the figure in ldquomatchin first onerdquo because the low-resolution descriptors are usedto produce a candidate list Therefore it is acceptable as longas the correct item is within the list as the high-resolutiondescriptors are to be used in the second stage In this regardthe proposed approach is slightly better than the averagingmethod proposed in [21] for 96 k MP-3 query inputs

63 Experiment Three Time for k-d Tree Search and LinearSearch The third experiment compares the time to generatea list of 20 candidates by using k-d tree search and linear(exhausted) search for low-resolution descriptors The com-puting time is obtained by averaging the required time in tentrials For each trial 500 query items are randomly selectedamong the 750 available items The experimental platform isa personal computer with a 16GHz PentiumCPU and 15 GBmemory The program is written in C with a Borland C++compiler The results are shown in Table 4 The results showthat we can increase the search time by 10-fold if the low-resolution descriptors are embedded in the k-d tree structure

64 Experiment Four Determination of M in Method II Thefourth experiment is to determine the numberM in methodII Recall that the thresholds in both methods are computedbased on high-resolution descriptors To determine the opti-mal value of M we vary this value from one to eighteen in(8) and (9) to compute the threshold 1198791015840

119860and then to perform

identification Note that method II is degenerated to methodI if M = 1 In this experiment 96 k MP-3-coded items areused as the query In addition we deliberately keep the high-frequency descriptors in computing the distances becausekeeping them makes it easier to examine the influence of Mversus identification rate (defined in experiment five ) Theexperimental results are given in Figure 9 The results showthat whenM is equal or greater than 10 the identification rate

085408560858

0860862086408660868

087087208740876

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18

Iden

tifica

tion

rate

119872 value

Figure 9The identification rate versus the numberM for 96 kMP-3query inputs

remains almost constant Therefore M = 10 is sufficient formethod II This value is then used in experiment five

65 Experiment Five Comparison between Method I andMethod II The fifth experiment compares the identificationrates between method I and method II The thresholds arecalculated using (7) and (10) respectively In this experiment375 of the 750 15-second items are randomly chosen fortraining (ie to obtain the thresholds) The rest of 375 itemsare for query Note that all of the items are excerpted from thereference items in the database In addition we test other 500query items (also with duration of 15 sec) not excerpted fromdatabase references

Since some query inputs are not in the database wealso need to consider the situations of falsely identifying anoutside database item as one inside the database and viceversa Suppose that 119879

119889query items are actually inside the

database and 119879119900query items are not inside the database

Assume that the identification system correctly identifies119873119889

inside-database query item erroneously rejects 119873119903inside-

database query items and erroneously accepts 119873119886outside

database query items Then the IDR (identification rate)

The Scientific World Journal 9

Table 5 Comparison of identification performance using method I and method II with the use of high-frequency descriptors

Uncompressed 192 k MP-3 96 k MP-3Method I Method II Method I Method II Method I Method II

IDR 988 999 99 997 841 874FAR 21 02 18 02 04 02FRR 00 00 00 05 366 296ACC 984 999 989 995 718 768

Table 6 Comparison of method I and method II for IDR FAR and FRR with high-frequency descriptors removed

Uncompressed 192 k MP-3 96 k MP-3Method I Method II Method I Method II Method I Method II

IDR 991 997 989 997 990 997FAR 15 02 19 02 16 02FRR 03 04 02 04 03 06ACC 987 995 985 995 986 994

FAR (false-accept rate) and FRR (false-rejection rate) arecomputed as follows

IDR =119873119889

119879119889minus 119873119903

(11)

FAR =119873119886

119879119900

(12)

FRR =119873119903

119879119889

(13)

In addition the accuracy of the system is computed as

ACC =119873119889+ 119873119900

119879119889+ 119879119900

(14)

where119873119900is the number of query items correctly identified as

outside-database items and can be computed as119873119900= 119879119900minus119873119886

In this experiment the low-resolution descriptors areused to find 20 candidates Next high-resolution descriptorsare used to find the best-matched one in the list If thedistance associated with the best-match is greater than thethreshold the query input is determined as not in thedatabase Otherwise the query input is identified as the best-matched one In this experiment high-frequency descriptorsare kept in the comparison With the use of (11) to (14) theaverage values after four trials are given in Table 5 Fromthe results we know that the system does not perform wellfor identifying 96 k MP-3 query items This is mainly dueto high distortion in high-frequency descriptors A similarphenomenon can also be observed in Table 2 Later onwe will repeat this experiment but discard high-frequencydescriptors

To further examine the performance differences betweenmethod I and method II we again use 96 k MP-3 items asthe query inputs But this time we vary the threshold values(for both methods) and then record the IDR FAR and FRRvalues The results are presented as ROC curves [25] andDET-like curves [26] (but without logarithm) as shown in

Figures 10 and 11 respectively The ROC curve plots FARversus TPR (true positive rate) which is defined as

TPR =119873119889

119879119889

(15)

Conceptually a better classifier should have a higher TPRfor a fixed FAR Therefore a better classifier should havean ROC curve closer to the left-upper corner With thisinterpretation from Figure 10 we know that method II is abetter classifier

In this experiment we also show DET-like curves Aldquotruerdquo DET curve plots FAR versus FRR using logarithmscales In our case however we do not use logarithm scales toexhibit the similarity between ROC and DET curves Againconceptually we know that a better classifier should have alower FRR over a fixed FAR Thus a better classifier shouldhave a curve closer to the left-bottom corner By examiningFigure 11 we again confirm thatmethod II is a better method

Since the high-frequency descriptors actually affect theidentification rate at low bitrates we again repeat this exper-iment without using high-frequency descriptors The resultsare given in Table 6 By comparing the results in Tables 5 and6 we know that removing high-frequency descriptors slightlyreduces the accuracy of uncompressed items However itgreatly improves the accuracy for 96 k MP-3 items Sincemethod II has an accuracy of at least 994 in all cases it ishighly plausible to identify copyrighted audiomaterials usingthis method

7 Conclusion

In this paper we propose a system using a multiresolutionstrategy to identify whether a piece of unknown music isidentical to one of the pieces in the database In the systemhigh-resolution descriptors are MPEG-7 audio signaturedescriptors and the low-resolution descriptors are obtainedfrom high-resolution descriptors with the aid of PCA toreduce their dimensionality Experimental results show that

10 The Scientific World Journal

0 20 40 60 80 1000

10

20

30

40

50

60

70

80

90

100

FAR ()

TPR

()

ROC curve (96 k MP-3)

Method IMethod II

Figure 10 The ROC curves for both methods

0 20 40 60 80 1000

10

20

30

40

50

60

70

80

90

100

FAR ()

FRR

()

Method IMethod II

DET curve (96 k MP-3)

Figure 11 The DET-like curves without logarithm for both meth-ods

low-resolution descriptors still have high identification accu-racies To reduce the time to generate the candidate list weuse the k-d tree structure to store low-resolution descriptorsExperimental results show that using k-d tree structureincreases the search time by ten-folds Since not every pieceof query input is within the database we also proposed twomethods to determine the distance thresholds Experimentalresults show that the proposed method II provides an accu-racy of 994 Therefore the proposed system can be usedin real applications such as identifying copyrighted audiofiles circulated over the Internet As it can be easily extendedto operate with multiple computers the proposed system

is a plausible starting point to construct a large operabledatabase

Acknowledgment

This work was supported in part by National Science Councilof Taiwan through Grants NSC 94-2213-E-027-042 and 99-2221-E-027-097

References

[1] For a brief description of the bill please refer to httpenwikipediaorgwikiSOPA

[2] R Eklund ldquoAudio watermarking techniquesrdquo httpwwwmusemagiccompaperswatermarkhtml

[3] J S R Jang and H R Lee ldquoA general framework of progressivefiltering and its application to query by singinghummingrdquoIEEE Transactions on Audio Speech and Language Processingvol 16 no 2 pp 350ndash358 2008

[4] S Baluja and M Covell ldquoAudio fingerprinting combiningcomputer visionamp data streamprocessingrdquo in Proceedings of theIEEE International Conference on Acoustics Speech and SignalProcessing (ICASSP rsquo07) pp II213ndashII216 Honolulu HawaiiUSA April 2007

[5] Shazam httpwwwshazamcommusicwebhomehtml[6] A Wang ldquoThe Shazam music recognition servicerdquo Communi-

cations of the ACM vol 49 no 8 pp 44ndash48 2006[7] V Chandrasekha M Sharifi and D A Ross ldquoSurvey and eval-

uation of audio fingerprinting schemes for mobile query-by-example applicationsrdquo in Proceedinjgs of the 12th InternationalConference onMusic Information Retrieval pp 801ndash806MiamiFla USA October 2011

[8] J A Haitsma and T Kalker ldquoA highly robust audio fingerprint-ing systemrdquo in Proceedinjgs of the 12th International Conferenceon Music Information Retrieval pp 107ndash115 Paris FranceOctober 2002

[9] C J C Burges J C Platt and S Jana ldquoDistortion discriminantanalysis for audio fingerprintingrdquo IEEE Transactions on Speechand Audio Processing vol 11 no 3 pp 165ndash174 2003

[10] Official website httpmusicbrainzorg[11] Official website httpwwwaudiblemagiccom[12] Official website httpwwwgracenotecom[13] httpenwikipediaorgwikiAcoustic fingerprint[14] P J O Doets M M Gisbert and R L Lagendijk ldquoOn

the comparison of audio fingerprints for extracting qualityparameters of compressed audiordquo in Security Steganographyand Watermarking of Multimedia Contents VIII vol 6072 ofProceedings of SPIE pp 60720L-1ndash60720L-12 San Jose CalifUSA January 2006

[15] F Nack and A T Lindsay ldquoEverything you wanted to knowaboutMPEG-7 part 1rdquo IEEEMultimedia vol 6 no 3 pp 65ndash771999

[16] F Nack and A Lindsay ldquoEverything you wanted to know aboutMPEG-7 part 2rdquo IEEEMultimedia vol 6 no 4 pp 64ndash73 1999

[17] ISOIEC Information TechnologymdashMultimedia ContentDescription Interface -Part 4 Audio IS 15938-4 2002

[18] H Crysandt ldquoMusic identification with MPEG-7rdquo in Proceed-ings of the 115th AES Convention October 2003 Paper 5967

The Scientific World Journal 11

[19] OHellmuth E AllamanceMCremerHGrossmann J Herreand T Kastner ldquoUsing MPEG-7 audio fingerprinting in real-world applicationrdquo in Proceedings of the 115th AES ConventionOctober 2003 Paper 5961

[20] H-G Kim N Moreau and T Sikora MPEG-7 Audio andBeyond John Wiley amp Sons 2005

[21] J-Y Lee and S D You ldquoDimension-reduction technique forMPEG-7 Audio Descriptorsrdquo in Proceedings of the 6th PacificRim Conference on Multimedia (PCM rsquo05) vol 3768 of LectureNotes in Computer Science pp 526ndash537 2005

[22] J L Bentley ldquoMultidimensional binary search trees used forassociative searchingrdquo Communications of the ACM vol 18 no9 pp 509ndash517 1975

[23] Open source program for k-d tree httpwwwcsumdedusimmountANN

[24] J Shlens ldquoA tutorial on principal component analysisrdquohttpwwwsnlsalkedusimshlenspcapdf

[25] T Fawcell ldquoROC graphs notes and practical considerations forresearchersrdquo Tech Rep HP Lab 2004

[26] A Martin G Doddington T Kamm M Ordowski and MPrzybocki ldquoThe DET curve in assessment of detection taskperformancerdquo in Proceedings of the Eurospeech vol 4 pp 1899ndash1903 Rhodes Greece September 1997

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 2: Research Article Music Identification System Using MPEG-7 ...downloads.hindawi.com/journals/tswj/2013/752464.pdf · Gracenote s MusicID [ ]. According to [ ], there are more ... one

2 The Scientific World Journal

generally does not have the same waveforms and thereforethey can be correctly identifiedThough conceptually simpleit is not plausible to directly compare PCM samples of twopieces of music because it would take too much time forthe comparison For example a typical compact disc (CD)has about 600M bytes of PCM samples to store about tensongs If a database contains 10000 different songs then thePCM samples occupy about 600G bytes of space A piece ofunknownmusic with duration of ten seconds has about 880 kbytes of data It is obvious that it requires a huge amountof computation to sequentially compare the 880 k bytes ofdata with the 600G bytes of data in the database Thereforedimension-reduced representations of the PCM waveformsknown as fingerprints are used in comparison Amongthe fingerprints most of them are defined by individualcompanies or groups Some of them are briefly explained inthe following

Researchers in Google develop a fingerprinting schemefor audio calledWaveprint [4] based onwaveletsWith the aidof wavelets the fingerprint is invariant to timescale changeIn other words whether the audio piece is played faster orslower than the normal speed the fingerprint is unchangedThe fingerprint of a piece of 4-minute music is around 64 kbytes equivalently 2133 bits per second

Shazam [5] is a company (and service) dedicated formusic identification Its database contains around elevenmillion soundtracks As described in [6] the fingerprintsused are sets of triplets based on spectrogram peaks Forexample if (119905

1 1198911) and (119905

2 1198912) are two peaks at time 119905

1and

1199052and frequency 119891

1and 119891

2 then the triplet ((119905

2minus 1199051) 1198911

(1198912minus 1198911)) is a feature Based on the realization of [7] the

fingerprint in this scheme uses 400 bits per secondResearchers in Philips also propose a fingerprinting

scheme [8] The computation of the fingerprints includesframing windowing (von Hann window) FFT (fast fouriertransform) band decision energy computation and thenquantization (into binary) In the typical setting one secondof audio has around 2730 bits of fingerprint

Microsoftrsquos Robust Audio Recognition Engine (RARE)[9] divides the incoming audio into overlapping frames Eachframe is converted to spectral domain by MCLT (modulatedcomplex lapped transform)The spectral values are applied totwo layers of OPCA (oriented principle component analysis)to reduce the dimensionality of the spectral data For thismethod 344 features (11008 bits if one feature is stored in4 bytes in a floating point) are obtained per second

In addition to the above methods there are actuallymany more different types of audio fingerprinting schemesavailable such as Music Brainz [10] Audible Magic [11] andGracenotersquos MusicID [12] According to [13] there are morethan ten different audio fingerprinting schemes available

Since there are vast amount of different fingerprintingschemes available some researchers then conducted exper-iments to compare the relative performance among someof them The results show that if the schemes use thesame number of bits to represent fingerprints they havecomparable performance [14] Therefore the selection of thefingerprinting schemes should also consider other factors

(such as interoperability to be addressed below) rather thanmerely the minor performance difference

With the ever-increasing amount of multimedia contentover the Internet and in the multimedia databases it is animportant task to exchange multimedia content To respondthe public demands ISOrsquos (International StandardizationOrganization) working group developed MPEG-7 standard[15 16] In the audio part of the standard [17] a high-level tool is developed for audio identification called audiosignature description scheme The fingerprints used in thescheme are called audio signature descriptors and theyhave good identification accuracy [18 19] In the followingwe interchangeably use descriptors and fingerprints withoutdistinction

Although proprietary audio fingerprints have excellentidentification performance the MPEG-7 audio descriptorsoffer some advantages First being an international standardensures the open and fair use (subject to license fee) of thetechnology Second such an international standard makesthe interoperability possible For example if a mobile phoneinstalls an application program to convert a piece of recordedaudio to MPEG-7 descriptors the descriptors can be sent toany website accepting the descriptors On the other handit is not possible to send proprietary fingerprints used inone company to database systems owned by competitorsThird different companies may share or exchange theiraudio descriptors (fingerprints) in their databases withoutany difficulties In the current situation each company has tocompute fingerprints for newly released albumsWith the useof MPEG-7 descriptors the redundant efforts of computingfingerprints can be minimized

Although a music identification system based on audiofingerprints has several applications [8] we concentrate onthe issue of detecting if a piece of circulated music is highlysimilar to a copyrighted work or not In a typical case thesimilarity is measured by a distance metric If the distanceis shorter than a threshold the two pieces of music areconsidered as similar Although it is not trivial to determine asuitable threshold [20] this problem is not fully studied Forexample [20] does not indicate any approach to determinethe threshold In addition the audio files to be comparedmay be very large therefore it is very important to reducethe comparison time while maintaining high identificationaccuracy Since not many papers address these two issuesbased on MPEG-7 descriptors we report in this paper ourapproaches and experimental results

This paper is organized as follows Section 2 is anoverview of the MPEG-7 audio signature descriptorsSection 3 is the system model for music identificationSection 4 describes the dimensionality reduction methodused in the paper Section 5 is the proposed strategy todetermine the threshold Section 6 covers the experimentsand results Section 7 is the conclusion

2 Overview of MPEG-7 AudioSignature Descriptors

Part 4 [17] of the MPEG-7 standard includes low-leveland high-level descriptors for various applications Low-level descriptors are derived from the temporal and spec-tral characteristics of the waveform whereas the high-level

The Scientific World Journal 3

descriptors are constructed based on low-level descriptors Inthis section we will briefly describe the descriptors related tomusic identification

21 Low-Level and High-Level Descriptors of the MPEG-7Audio There are 17 low-level audio descriptors defined inthe standard All of these low-level descriptors are derivedfrom the waveform of the music They can be dividedinto six different categories basic basic spectral signalparameter temporal timbral spectral timbral and spectralbasis representationThese low-level features may be directlyused or may serve as the basics for constructing high-leveldescriptors

Based on low-level descriptors MPEG-7 audio standardalso defines high-level description schemes for various appli-cations These include audio signature description instru-ment timbre description general sound recognition andindexing description and spoken content description Theaudio signature descriptors one type of fingerprints are usedfor music identification

22 MPEG-7 Audio Signature Descriptors Although low-level descriptors in MPEG-7 audio may also be used toidentifymusic it is shown that the audio signature descriptorsprovide better identification performance [18 19] Thereforethe audio signature descriptors are adopted in the proposedsystem

TheMPEG-7 audio signature descriptors are computed asfollows

(1) Time-to-frequency conversion this step is based onaudio spectrum envelope descriptors including thefollowing substeps

(i) Determine the hop length between two con-secutive windows in the unit of samples Thedefault value is equivalent to 30ms

(ii) Define the window length 119897119908 This value is set to

three times the hop length The chosen windowis Hamming window

(iii) Determine the length of FFT denoted as 119873FFTTo reduce the computational complexity 119873FFTis the smallest power-of-2 number equal to orgreater than 119897

119908 For example if the sample rate of

the audio is 44100 ss then 119897119908= 44100sdot003sdot3 =

3969Therefore119873FFT =4096The extra samplesafter 119897

119908are padded with zeros

(iv) Perform the FFT on the windowed samples

(2) Divide the FFT coefficients into subbands with eachsubband having a bandwidth of one-fourth of anoctave In addition the bandwidths of two consecu-tive subbands should overlap each other by 10Thatis the computed bandwidth for each subbandmust bemultiplied by 11 The beginning frequency of the firstsubband denoted as loEdge is fixed at 250Hz Table 1lists the frequency range of the first three subbandsbefore and after (spectral) overlapping

Subband

middot middot middot middot middot middot

119860 =

11988612 119886124

11988621 11988622 119886224

119886301 119886302 1198863024

119886311 119886312 1198863124

middot middot middot middot middot middot

middot middot middot middot middot middot

middot middot middot middot middot middot

11988611

Time

Figure 1 MPEG-7 audio signature descriptors in a matrix

(3) Find the flatness measure F(b) for subband b by

119865 (119887) =

ℎ(119887)minus119897(119887)+1radicprodℎ(119887)

119894=119897(119887)119888 (119894)

(1 (ℎ (119887) minus 119897 (119887) + 1))sumℎ(119887)

119894=119897(119887)119888 (119894)

(1)

where c(i) is the power spectrum computed by FFT(in step 1) and h(b) and l(b) are the lower and upperindices of c(i) within subband b

(4) Find themean and variance of F(b) for subband b overa certain number of successive FFT windows calledscaling ratio The default value of the scaling ratio is16

(5) The series of mean and variance values are the audiosignature descriptors

With the computational steps of the audio signaturedescriptor we may compute the number of descriptors ina piece of 15-second music In the time domain about(15minus009)003 = 497windows are used to cover the 15-secondsignal because the hop size is 30ms Since the scaling ratio isset to 16 there are 49716 asymp 31 values per subband In thespectral domain since there are four subbands per octave andwe use six octaves (250Hz sim 16 kHz) there are totally 24subbands in the spectral domain Thus the 15-second signalproduces 24 times 31 = 744 mean values and other 744 variancevalues The descriptors are arranged in a matrix form asshown in Figure 1

Since the mean values are sufficient for the identificationpurpose [21] we will not consider the variance descriptorsin the following The obtained descriptors are referred to ashigh-resolution (or full-dimensional) descriptors Accordingto [21] if a piece of query music is to be compared with areference piece it should be done with a sliding comparisonas shown in Figure 2 In the figure one segment of linerepresents descriptors in a piece of music arranged in onedimensional structure Using the representation in Figure 1the query segment is arranged as

119876 = [11990211 sdot sdot sdot 119902124 sdot sdot sdot 119902311 sdot sdot sdot 1199023124] (2)

Since only descriptors from the same subband are to becompared the hop size between two query segments inFigure 2 is 24 descriptors or 003 sdot 16 = 048 second Inother words one descriptor (in a subband) represents 048second of audio samples The smallest Euclidean distance

4 The Scientific World Journal

middot middot middot

middot middot middot

119886124 11988621 119886224 1198863124middot middot middot

middot middot middot

11988611 11988631 middot middot middot 119886324 middot middot middot 1198863224 1198866224

11990211 119902124 11990222411990221 1199023124

middot middot middot11990211 119902124 11990222411990221 1199023124middot middot middot

middot middot middot11990211 119902124 1199023124middot middot middot

middot middot middot

middot middot middot

middot middot middot

Figure 2 The sliding operation to compare the query input Q (15seconds) and the reference music A (30 seconds)

obtained during the sliding operation is recorded as thedistance between the query input and the reference musicFor example if the query Q is to be compared with referenceA then we may obtain the Euclidean distance 119864

119860119861(119896) for the

k-th sliding comparison as

119864119876119860(119896) =

31

sum

119894=1

24

sum

119895=1

10038161003816100381610038161003816119902119894119895minus 119886(119894+119896)119895

10038161003816100381610038161003816 (3)

The recorded distance between these two pieces of music is

119889119876119860

= min119896

119864119876119860(119896) (4)

Although using audio signature descriptors greatlyreduces the computational burden for comparison therequired computation using (3) is still very large Accordingto [21] suppose that a database contains 1000 audio files witheach one having duration of 30 seconds and the music tobe identified has a duration of 15 seconds then 47 616 000arithmetic operations are required Therefore it is beneficialto further reduce the number of comparison which can beaccomplished by employing a multiresolution strategy

3 Music Identification Based onMultiresolution Strategy

As discussed previously the computational cost is still veryhigh even if we use the MPEG-7 descriptors Thereforewe will use the multiresolution strategy to reduce thecomputational complexity To do so in addition to theabove-mentioned high-resolution descriptors we also needto generate low-resolution descriptors (see also Section 4)Therefore the music database contains high-resolution andlow-resolution descriptors for each soundtrack to be identi-fied This step can be accomplished during the setup of thedatabase In addition a training process is conducted to finda distance threshold to determine whether the query inputis in the database or not Section 5 has a detailed descriptionabout the training process

Once the database is constructed as shown in Figure 3the music identification procedure consists of the followingsteps

(1) When a query input is sent to the system it com-putes the MPEG-7 audio signature descriptors andlow-resolution descriptors for the query music If amobile device is used to record the music usually the

Table 1 Bandwidth of the first three subbands

Bandwidth of subband(nonoverlapped)

Bandwidth of subband(overlapped)

250ndash2973Hz 2375ndash3122Hz2973ndash3536Hz 2824ndash3712Hz3535ndash4204Hz 3359ndash4415Hz

descriptors (fingerprints) are sent instead of the PCMsamples to reduce the size of the transmitted data

(2) The computed low-resolution descriptors are com-pared with those in the database Based on thedistance metric a list of candidates is obtainedSince the query music may start at any position ofthe soundtrack a sliding comparison as shown inFigure 2 is necessary

(3) After obtaining the candidate list high-resolutiondescriptors are used to find the distances between thequery input and the candidates The shortest distanceand its associated soundtrack are recorded

(4) If the shortest distance is less than the threshold thequery input is considered as highly similar to therecorded soundtrack in the database Otherwise thequery input is not in the database

During the low-resolution comparison it is also possibleto use an existing algorithm to reduce the comparison timeFor example we may arrange the descriptors in k-d tree (k-dimensional binary search tree) [22 23] structure to reducethe search time However using the k-d tree structure impliesthat every segment of music in the database has the samenumber of descriptors or same duration Note that the unitto be compared is one segment and one soundtrack canbe divided into many overlapped segments as shown inFigure 2 Therefore if the duration of the segment is set to 15seconds then the querymusic must also have a duration of 15seconds to use the system In a practical situation if the queryinput is longer than 15 seconds then only 15 seconds of themusic is used for computing the descriptors and comparison

4 Dimensionality Reduction Method forMPEG-7 Audio Signature Descriptors

This section briefly explains the dimensionality reductiontechnique used in the experiments Although a block-averagemethod is proposed in [21] for this purpose we use a differenttechnique in this paper

41 Problems of Reducing Dimensionality Using Scaling RatioConceptually we may increase the scaling ratio (given inSection 2) to reduce the number of descriptors duringcomputing them For example by increasing the scalingratio from 16 to 256 the number of descriptors is reducedby 16 times Unfortunately this approach does not yieldsatisfactory results because of insufficient time resolutionRecall that the descriptors are derived based on thewindowedwaveform Therefore if two segments of the soundtrack are

The Scientific World Journal 5

Query

Database

Dim reduction

In database

Reject Result

Threshold

Candidate

MPEG-7descriptors

Fullresolution

search

Lowresolution

search

YesNo

Figure 3 Procedure for multiresolution music identification

Descriptors

Descriptors Descriptors1198861198991 11988611989924 119886119899+11 119886119899+124

1199021198991 11990211989924

Figure 4 The audio samples and the descriptors In a real applica-tion 119886

119899119896is stored in the database whereas 119902

119899119896is computed from

the query input Usually 119886119899119896

is not equal to 119902119899119896

due to differentscopes of the windows even though they are all derived from thesame soundtrack

not highly similar their corresponding descriptors usuallyhave large differences With a scaling ratio of 256 eachdescriptor represents around 77 seconds of audio samplesTherefore unless the query input also starts at a point veryclose to the segment boundary of a soundtrack a comparisonbased on these descriptors may fail As illustrated in Figure 4descriptors of 119886

119899119896(or 119886119899+1119896

) and 119902119899119896

are quite different andtherefore the descriptor-based identification scheme cannotidentify the query inputTherefore a suitable time resolutionfor example 048 second should be maintained

42 Proposed Dimensionality Reduction Method In contrastto reduce the number of descriptors by increasing the scalingratio we may reduce them in each temporal-spectral block(ie the matrix in Figure 1) representing a segment of (15-second) music However to maintain a high time resolutionsuccessive temporal-spectral block should have a small timedifference as shown in Figure 5 With this arrangement weachieve both high identification rate and low comparisoncomplexity

For each temporal-spectral block we use PCA (princi-pal component analysis) [21 24] to reduce the number ofdescriptors Since it is difficult to directly use PCA to obtaingood low-resolution descriptors [21] we use an alternativeapproach Its basic idea is to partition the entire block into

119860 =

middot middot middot

119886124

11988621 119886224

119886311

119886321

1198863124

middot middot middot

11988611

11988631 119886324middot middot middot

middot middot middot

middot middot middot

middot middot middot

1198863224

119902331 1199023324

middot middot middot

middot middot middot

middot middot middot 1198866224119886621

11989111 11989118

11989121 11989128

11989131 11989138

Figure 5 Low-resolution descriptors In the figure each temporal-spectral block corresponds to descriptors from 15-second music

119860 = =11986011 1198601211986021 11986022

middot middot middot middot middot middot

middot middot middot

middot middot middot

middot middot middot

middot middot middot

middot middot middot

middot middot middot

middot middot middot

middot middot middot

11988611 11988618

119886151

119886161

119886291

119886158

119886168

119886298

11988619

119886159

119886169

119886299

119886116

1198861516

1198861616

1198862916 1198862917 1198862924

119886124119886117

Figure 6 Partition of a temporal-spectral block into four subblocksin the experiments

four subblocks as shown in Figure 6 and then use PCA toreduce the number of descriptors of each subblock into twovalues (descriptors) called low-resolution descriptors Alsohigh-frequency descriptors in the original temporal-spectralblock are all discarded because they are susceptible to noise[21] With this arrangement totally eight descriptors are usedto represent a segment of 15-second music Note that theactual duration of a segment used in the experiments is 1404seconds (though we still say 15 seconds) because some audiosamples are lost after MP-3 compression and decompressionin the experiments

6 The Scientific World Journal

Table 2 Identification accuracy using high-resolution features

Uncompressed 192 k MP-3 96 k MP-3High-frequencydescriptors Used Not used Used Not used Used Not used

Match in firstone 100 100 100 100 94 100

Match in first 15 100 100 100 100 98 100

Wenowdescribe how to calculate low-resolution descrip-tors Since using PCA for dimensionality reduction is a well-known approach we only describe how to construct thecovariance matrix for PCA computation and omit the com-putation of finding principal components Suppose that thereare N segments of music pieces in the database with theirdescriptor matrices denoted as 119860(1) to 119860(119873) To simplify theargument we consider the subblock matrix119860(119896)

11(referring to

Figure 6) in the following Other subblocks can be computedby the same procedure First collect all subblocks from 119860

(119896)

11

to form a big matrix as follows

119861 =

[[[

[

119886(1)

11sdot sdot sdot 119886

(1)

18119886(2)

11sdot sdot sdot 119886

(2)

18sdot sdot sdot 119886(119873)

11sdot sdot sdot 119886(119873)

18

sdot sdot sdot

119886(1)

151sdot sdot sdot 119886(1)

158119886(2)

151sdot sdot sdot 119886(2)

158sdot sdot sdot 119886(119873)

151sdot sdot sdot 119886(119873)

158

]]]

]

= [

[

uarr sdot sdot sdot uarr

b1sdot sdot sdot b8119873

darr sdot sdot sdot darr

]

]

(5)

Then we may consider column vectors of matrix B as b119894

vectors Having the data vectors b119894 the covariance matrix

and subsequently the principal components can be foundBy keeping only two principal components we can obtainc119894= [11988811989411198881198942] from b

119894 Since there are eight column vectors in

a subblock descriptors in119860(119896)11

are reduced to eight c119894(8119896minus7 le

119894 le 8119896) vectors Next we can rearrange the obtained c119894vectors

as

119862 =[[

[

11988811

11988812

11988891

11988892

sdot sdot sdot 1198888119873minus71

1198888119873minus72

sdot sdot sdot

11988881

11988882

119888161

119888162

sdot sdot sdot 11988881198731

11988881198732

]]

]

(6)

Again by treating each row of matrix C as the data to beprocessed by PCA we can compute the covariance matrixand finally reduce each column vector to one value Sincethere are two columns originally from one subblock thereare totally two (low-resolution) descriptors per subblockEquivalently a segment of 15-second music is representedby eight descriptors During the computation the principalcomponents obtained in the first and the second steps shouldbe stored in the database When a query input (in the formof high-resolution descriptors) is supplied these componentsare used to obtain the low-resolution descriptors for theinput

Table 3 Comparison of identification accuracies using low-resolution descriptors obtained by the proposed approach and bythe averaging (avg) method [21]

Uncompressed 192 k MP-3 96 k MP-3Method Proposed Avg Proposed Avg Proposed AvgMatch in firstone 993 995 991 993 901 898

Match in first15 100 100 100 100 996 995

5 Threshold to Determine the Membership ofthe Query Input

Since in practice the database cannot collect all musicsoundtracks in the world we have to have a strategy todetermine if the query input is actually in the database ornot In our case the decision is accomplished with the aid ofa threshold If the shortest distances between the input andthe candidates are greater than a threshold then the queryinput is not in the database otherwise it is in the databaseThough the concept is simple it is not trivial to determinethe threshold [20]

Recall that when a query input is applied to the proposedsystem the system computes the Euclidean distances betweenthe low-resolution descriptors of the query input and thosein the database Accordingly m (say m = 20) segments ofsoundtracks from distinct titles are selected based on thecomputed distances Next the Euclidean distances betweenthe query and the selected segments are computed againusing full-resolution descriptors By sorting the distancesfrom small to large the system creates a candidate list for thequery input

To compute the previously mentioned threshold weexamine two different methods In the first method (methodI) we define the ldquofirstrdquo distance 119889

1as the (full-resolution)

distance associated with the first candidate in the list Simi-larly the ldquosecondrdquo distance 119889

2is the distance associated with

the second candidate as shown in Figure 7 The methodto compute the first and second distances is used both intraining and in identification For the training phase assumethat there are 119873

119879soundtracks used For a segment from a

soundtrack indexed n let the first distance be 1198891(119899) and the

second distance be 1198892(119899) The threshold 119879

119860is then computed

as

119879119860= 05 sdot (

119873119879

sum

119899=1

1198891(119899) +

119873119879

sum

119899=1

1198892(119899)) (7)

During the identification phase the first distance forthe query input is computed If the first distance is smallerthan the threshold 119879

119860 the query input is determined to be

highly similar to the top soundtrack title in the candidate listOtherwise the query input is not inside the database

In addition to method I we also propose an alternativemethod (method II) to compute the first and seconddistances

The Scientific World Journal 7

Firs

t dist

ance

Seco

nd d

istan

ce

Figure 7 The first and second distances in method I

and the threshold Referring to Figure 8 the first distance1198891015840

1(119899) of method II for a training segment in soundtrack n is

1198891015840

1(119899) = 119889

1(119899) divide

sum119872

119898=2119889119898(119899)

119872 minus 1 (8)

where 119889119898(119899) is the distance associatedwith themth candidate

in the list andM is a constant Based on our experimentM =10 is sufficient By the same arrangement the second distance1198891015840

2(119899) is calculated as

1198891015840

2(119899) = 119889

2(119899) divide

sum119872+1

119898=3119889119898(119899)

119872 minus 1 (9)

Similar to (7) the threshold 1198791015840119860is determined as

1198791015840

119860= 05 sdot (

119873119879

sum

119899=1

1198891015840

1(119899) +

119873119879

sum

119899=1

1198891015840

2(119899)) (10)

During identification phase if the computed distance 11988910158401

is greater than1198791015840119860 the query input is determined as not in the

database Otherwise the query input is highly similar to thetop soundtrack title in the candidate list

6 Experiments and Results

Before conducting the experiments we collect 750 sound-tracks frommany CD titles for constructing the database andfor identification In the experiments 30 seconds of musicis excerpted from the soundtracks as the reference items tobe identified Then low- and full-resolution descriptors ofthe reference items are calculated and stored in the databaseWe also randomly excerpt 15-second query items from thereference items Note that a query item may start fromany sample on the first half of the reference To test theidentification accuracy for compressed audio the 15-secondquery inputs are also encoded and then decoded with anMP-3 coder with bitrates of 192 k and 96 k respectivelyHowever the database only contains descriptors from theuncompressed items In addition the principal componentsare also obtained using uncompressed items

Table 4 Comparison of search time between 119896-119889 tree search andlinear search

119870-119889 tree Linear searchAverage time 0255 sec 323 sec

Several experiments are conducted to examine the per-formance of the proposed systemThefirst experiment checksthe identification accuracy using high-resolution descriptorsThe second one compares the relative identification accuracybetween the proposed method (in Section 42) and themethod given in [21] The third experiment compares thesearch speed by using the linear search and k-d tree searchThe next experiment intends to determine a suitable valueof M used in (8) and (9) Having the value of M we reportthe identification accuracy using a multiresolution strategyin experiment five In this experiment both method I andmethod II are examined To have a complete comparisonbetween method I and II we also report the results in termsof the ROC (receiver operating characteristics) [25] curveand the DET-like (detection error tradeoff) [26] curve butwithout logarithm

61 Experiment One Identification Accuracy Using High-Resolution Descriptors The first experiment is to examinethe identification accuracy using the high-resolutionMPEG-7 audio signature descriptors (without reduction) To evaluatethe influences of high-frequency descriptors we also exam-ine the accuracies with and without using high-frequency(greater than 25 kHz) descriptors The results are given inTable 2 In the table ldquomatch in first onerdquo is the rate that thequery input is correctly identified with the shortest distanceamong all reference items Similarly ldquomatch in first 15rdquo is therate that the query is correctly identified within a list of 15 ref-erence items sorted by distanceThe results indicate that high-resolution descriptors have very good identification accuracyAlso for uncompressed or 192 k MP-3 items whether usinghigh-frequency descriptors does not affect the identificationaccuracy However for 96 k MP-3 items discarding high-frequency descriptors greatly improves the accuracy

8 The Scientific World Journal

Seco

nd d

istan

ce

Firs

t dist

ance

AvgAvg

dividedivide

Figure 8 The first and second distances in method II

62 Experiment Two Identification Accuracy Using Low-Resolution Descriptors The second experiment compares therelative identification accuracies of the proposed dimension-ality reduction technique and the averaging method pro-posed in [21] In this experiment eight descriptors are used torepresent one segment of music (15 seconds) The results areshown in Table 3 For this experiment the accuracy rate inldquomatch in first 15rdquo ismore important than the figure in ldquomatchin first onerdquo because the low-resolution descriptors are usedto produce a candidate list Therefore it is acceptable as longas the correct item is within the list as the high-resolutiondescriptors are to be used in the second stage In this regardthe proposed approach is slightly better than the averagingmethod proposed in [21] for 96 k MP-3 query inputs

63 Experiment Three Time for k-d Tree Search and LinearSearch The third experiment compares the time to generatea list of 20 candidates by using k-d tree search and linear(exhausted) search for low-resolution descriptors The com-puting time is obtained by averaging the required time in tentrials For each trial 500 query items are randomly selectedamong the 750 available items The experimental platform isa personal computer with a 16GHz PentiumCPU and 15 GBmemory The program is written in C with a Borland C++compiler The results are shown in Table 4 The results showthat we can increase the search time by 10-fold if the low-resolution descriptors are embedded in the k-d tree structure

64 Experiment Four Determination of M in Method II Thefourth experiment is to determine the numberM in methodII Recall that the thresholds in both methods are computedbased on high-resolution descriptors To determine the opti-mal value of M we vary this value from one to eighteen in(8) and (9) to compute the threshold 1198791015840

119860and then to perform

identification Note that method II is degenerated to methodI if M = 1 In this experiment 96 k MP-3-coded items areused as the query In addition we deliberately keep the high-frequency descriptors in computing the distances becausekeeping them makes it easier to examine the influence of Mversus identification rate (defined in experiment five ) Theexperimental results are given in Figure 9 The results showthat whenM is equal or greater than 10 the identification rate

085408560858

0860862086408660868

087087208740876

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18

Iden

tifica

tion

rate

119872 value

Figure 9The identification rate versus the numberM for 96 kMP-3query inputs

remains almost constant Therefore M = 10 is sufficient formethod II This value is then used in experiment five

65 Experiment Five Comparison between Method I andMethod II The fifth experiment compares the identificationrates between method I and method II The thresholds arecalculated using (7) and (10) respectively In this experiment375 of the 750 15-second items are randomly chosen fortraining (ie to obtain the thresholds) The rest of 375 itemsare for query Note that all of the items are excerpted from thereference items in the database In addition we test other 500query items (also with duration of 15 sec) not excerpted fromdatabase references

Since some query inputs are not in the database wealso need to consider the situations of falsely identifying anoutside database item as one inside the database and viceversa Suppose that 119879

119889query items are actually inside the

database and 119879119900query items are not inside the database

Assume that the identification system correctly identifies119873119889

inside-database query item erroneously rejects 119873119903inside-

database query items and erroneously accepts 119873119886outside

database query items Then the IDR (identification rate)

The Scientific World Journal 9

Table 5 Comparison of identification performance using method I and method II with the use of high-frequency descriptors

Uncompressed 192 k MP-3 96 k MP-3Method I Method II Method I Method II Method I Method II

IDR 988 999 99 997 841 874FAR 21 02 18 02 04 02FRR 00 00 00 05 366 296ACC 984 999 989 995 718 768

Table 6 Comparison of method I and method II for IDR FAR and FRR with high-frequency descriptors removed

Uncompressed 192 k MP-3 96 k MP-3Method I Method II Method I Method II Method I Method II

IDR 991 997 989 997 990 997FAR 15 02 19 02 16 02FRR 03 04 02 04 03 06ACC 987 995 985 995 986 994

FAR (false-accept rate) and FRR (false-rejection rate) arecomputed as follows

IDR =119873119889

119879119889minus 119873119903

(11)

FAR =119873119886

119879119900

(12)

FRR =119873119903

119879119889

(13)

In addition the accuracy of the system is computed as

ACC =119873119889+ 119873119900

119879119889+ 119879119900

(14)

where119873119900is the number of query items correctly identified as

outside-database items and can be computed as119873119900= 119879119900minus119873119886

In this experiment the low-resolution descriptors areused to find 20 candidates Next high-resolution descriptorsare used to find the best-matched one in the list If thedistance associated with the best-match is greater than thethreshold the query input is determined as not in thedatabase Otherwise the query input is identified as the best-matched one In this experiment high-frequency descriptorsare kept in the comparison With the use of (11) to (14) theaverage values after four trials are given in Table 5 Fromthe results we know that the system does not perform wellfor identifying 96 k MP-3 query items This is mainly dueto high distortion in high-frequency descriptors A similarphenomenon can also be observed in Table 2 Later onwe will repeat this experiment but discard high-frequencydescriptors

To further examine the performance differences betweenmethod I and method II we again use 96 k MP-3 items asthe query inputs But this time we vary the threshold values(for both methods) and then record the IDR FAR and FRRvalues The results are presented as ROC curves [25] andDET-like curves [26] (but without logarithm) as shown in

Figures 10 and 11 respectively The ROC curve plots FARversus TPR (true positive rate) which is defined as

TPR =119873119889

119879119889

(15)

Conceptually a better classifier should have a higher TPRfor a fixed FAR Therefore a better classifier should havean ROC curve closer to the left-upper corner With thisinterpretation from Figure 10 we know that method II is abetter classifier

In this experiment we also show DET-like curves Aldquotruerdquo DET curve plots FAR versus FRR using logarithmscales In our case however we do not use logarithm scales toexhibit the similarity between ROC and DET curves Againconceptually we know that a better classifier should have alower FRR over a fixed FAR Thus a better classifier shouldhave a curve closer to the left-bottom corner By examiningFigure 11 we again confirm thatmethod II is a better method

Since the high-frequency descriptors actually affect theidentification rate at low bitrates we again repeat this exper-iment without using high-frequency descriptors The resultsare given in Table 6 By comparing the results in Tables 5 and6 we know that removing high-frequency descriptors slightlyreduces the accuracy of uncompressed items However itgreatly improves the accuracy for 96 k MP-3 items Sincemethod II has an accuracy of at least 994 in all cases it ishighly plausible to identify copyrighted audiomaterials usingthis method

7 Conclusion

In this paper we propose a system using a multiresolutionstrategy to identify whether a piece of unknown music isidentical to one of the pieces in the database In the systemhigh-resolution descriptors are MPEG-7 audio signaturedescriptors and the low-resolution descriptors are obtainedfrom high-resolution descriptors with the aid of PCA toreduce their dimensionality Experimental results show that

10 The Scientific World Journal

0 20 40 60 80 1000

10

20

30

40

50

60

70

80

90

100

FAR ()

TPR

()

ROC curve (96 k MP-3)

Method IMethod II

Figure 10 The ROC curves for both methods

0 20 40 60 80 1000

10

20

30

40

50

60

70

80

90

100

FAR ()

FRR

()

Method IMethod II

DET curve (96 k MP-3)

Figure 11 The DET-like curves without logarithm for both meth-ods

low-resolution descriptors still have high identification accu-racies To reduce the time to generate the candidate list weuse the k-d tree structure to store low-resolution descriptorsExperimental results show that using k-d tree structureincreases the search time by ten-folds Since not every pieceof query input is within the database we also proposed twomethods to determine the distance thresholds Experimentalresults show that the proposed method II provides an accu-racy of 994 Therefore the proposed system can be usedin real applications such as identifying copyrighted audiofiles circulated over the Internet As it can be easily extendedto operate with multiple computers the proposed system

is a plausible starting point to construct a large operabledatabase

Acknowledgment

This work was supported in part by National Science Councilof Taiwan through Grants NSC 94-2213-E-027-042 and 99-2221-E-027-097

References

[1] For a brief description of the bill please refer to httpenwikipediaorgwikiSOPA

[2] R Eklund ldquoAudio watermarking techniquesrdquo httpwwwmusemagiccompaperswatermarkhtml

[3] J S R Jang and H R Lee ldquoA general framework of progressivefiltering and its application to query by singinghummingrdquoIEEE Transactions on Audio Speech and Language Processingvol 16 no 2 pp 350ndash358 2008

[4] S Baluja and M Covell ldquoAudio fingerprinting combiningcomputer visionamp data streamprocessingrdquo in Proceedings of theIEEE International Conference on Acoustics Speech and SignalProcessing (ICASSP rsquo07) pp II213ndashII216 Honolulu HawaiiUSA April 2007

[5] Shazam httpwwwshazamcommusicwebhomehtml[6] A Wang ldquoThe Shazam music recognition servicerdquo Communi-

cations of the ACM vol 49 no 8 pp 44ndash48 2006[7] V Chandrasekha M Sharifi and D A Ross ldquoSurvey and eval-

uation of audio fingerprinting schemes for mobile query-by-example applicationsrdquo in Proceedinjgs of the 12th InternationalConference onMusic Information Retrieval pp 801ndash806MiamiFla USA October 2011

[8] J A Haitsma and T Kalker ldquoA highly robust audio fingerprint-ing systemrdquo in Proceedinjgs of the 12th International Conferenceon Music Information Retrieval pp 107ndash115 Paris FranceOctober 2002

[9] C J C Burges J C Platt and S Jana ldquoDistortion discriminantanalysis for audio fingerprintingrdquo IEEE Transactions on Speechand Audio Processing vol 11 no 3 pp 165ndash174 2003

[10] Official website httpmusicbrainzorg[11] Official website httpwwwaudiblemagiccom[12] Official website httpwwwgracenotecom[13] httpenwikipediaorgwikiAcoustic fingerprint[14] P J O Doets M M Gisbert and R L Lagendijk ldquoOn

the comparison of audio fingerprints for extracting qualityparameters of compressed audiordquo in Security Steganographyand Watermarking of Multimedia Contents VIII vol 6072 ofProceedings of SPIE pp 60720L-1ndash60720L-12 San Jose CalifUSA January 2006

[15] F Nack and A T Lindsay ldquoEverything you wanted to knowaboutMPEG-7 part 1rdquo IEEEMultimedia vol 6 no 3 pp 65ndash771999

[16] F Nack and A Lindsay ldquoEverything you wanted to know aboutMPEG-7 part 2rdquo IEEEMultimedia vol 6 no 4 pp 64ndash73 1999

[17] ISOIEC Information TechnologymdashMultimedia ContentDescription Interface -Part 4 Audio IS 15938-4 2002

[18] H Crysandt ldquoMusic identification with MPEG-7rdquo in Proceed-ings of the 115th AES Convention October 2003 Paper 5967

The Scientific World Journal 11

[19] OHellmuth E AllamanceMCremerHGrossmann J Herreand T Kastner ldquoUsing MPEG-7 audio fingerprinting in real-world applicationrdquo in Proceedings of the 115th AES ConventionOctober 2003 Paper 5961

[20] H-G Kim N Moreau and T Sikora MPEG-7 Audio andBeyond John Wiley amp Sons 2005

[21] J-Y Lee and S D You ldquoDimension-reduction technique forMPEG-7 Audio Descriptorsrdquo in Proceedings of the 6th PacificRim Conference on Multimedia (PCM rsquo05) vol 3768 of LectureNotes in Computer Science pp 526ndash537 2005

[22] J L Bentley ldquoMultidimensional binary search trees used forassociative searchingrdquo Communications of the ACM vol 18 no9 pp 509ndash517 1975

[23] Open source program for k-d tree httpwwwcsumdedusimmountANN

[24] J Shlens ldquoA tutorial on principal component analysisrdquohttpwwwsnlsalkedusimshlenspcapdf

[25] T Fawcell ldquoROC graphs notes and practical considerations forresearchersrdquo Tech Rep HP Lab 2004

[26] A Martin G Doddington T Kamm M Ordowski and MPrzybocki ldquoThe DET curve in assessment of detection taskperformancerdquo in Proceedings of the Eurospeech vol 4 pp 1899ndash1903 Rhodes Greece September 1997

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 3: Research Article Music Identification System Using MPEG-7 ...downloads.hindawi.com/journals/tswj/2013/752464.pdf · Gracenote s MusicID [ ]. According to [ ], there are more ... one

The Scientific World Journal 3

descriptors are constructed based on low-level descriptors Inthis section we will briefly describe the descriptors related tomusic identification

21 Low-Level and High-Level Descriptors of the MPEG-7Audio There are 17 low-level audio descriptors defined inthe standard All of these low-level descriptors are derivedfrom the waveform of the music They can be dividedinto six different categories basic basic spectral signalparameter temporal timbral spectral timbral and spectralbasis representationThese low-level features may be directlyused or may serve as the basics for constructing high-leveldescriptors

Based on low-level descriptors MPEG-7 audio standardalso defines high-level description schemes for various appli-cations These include audio signature description instru-ment timbre description general sound recognition andindexing description and spoken content description Theaudio signature descriptors one type of fingerprints are usedfor music identification

22 MPEG-7 Audio Signature Descriptors Although low-level descriptors in MPEG-7 audio may also be used toidentifymusic it is shown that the audio signature descriptorsprovide better identification performance [18 19] Thereforethe audio signature descriptors are adopted in the proposedsystem

TheMPEG-7 audio signature descriptors are computed asfollows

(1) Time-to-frequency conversion this step is based onaudio spectrum envelope descriptors including thefollowing substeps

(i) Determine the hop length between two con-secutive windows in the unit of samples Thedefault value is equivalent to 30ms

(ii) Define the window length 119897119908 This value is set to

three times the hop length The chosen windowis Hamming window

(iii) Determine the length of FFT denoted as 119873FFTTo reduce the computational complexity 119873FFTis the smallest power-of-2 number equal to orgreater than 119897

119908 For example if the sample rate of

the audio is 44100 ss then 119897119908= 44100sdot003sdot3 =

3969Therefore119873FFT =4096The extra samplesafter 119897

119908are padded with zeros

(iv) Perform the FFT on the windowed samples

(2) Divide the FFT coefficients into subbands with eachsubband having a bandwidth of one-fourth of anoctave In addition the bandwidths of two consecu-tive subbands should overlap each other by 10Thatis the computed bandwidth for each subbandmust bemultiplied by 11 The beginning frequency of the firstsubband denoted as loEdge is fixed at 250Hz Table 1lists the frequency range of the first three subbandsbefore and after (spectral) overlapping

Subband

middot middot middot middot middot middot

119860 =

11988612 119886124

11988621 11988622 119886224

119886301 119886302 1198863024

119886311 119886312 1198863124

middot middot middot middot middot middot

middot middot middot middot middot middot

middot middot middot middot middot middot

11988611

Time

Figure 1 MPEG-7 audio signature descriptors in a matrix

(3) Find the flatness measure F(b) for subband b by

119865 (119887) =

ℎ(119887)minus119897(119887)+1radicprodℎ(119887)

119894=119897(119887)119888 (119894)

(1 (ℎ (119887) minus 119897 (119887) + 1))sumℎ(119887)

119894=119897(119887)119888 (119894)

(1)

where c(i) is the power spectrum computed by FFT(in step 1) and h(b) and l(b) are the lower and upperindices of c(i) within subband b

(4) Find themean and variance of F(b) for subband b overa certain number of successive FFT windows calledscaling ratio The default value of the scaling ratio is16

(5) The series of mean and variance values are the audiosignature descriptors

With the computational steps of the audio signaturedescriptor we may compute the number of descriptors ina piece of 15-second music In the time domain about(15minus009)003 = 497windows are used to cover the 15-secondsignal because the hop size is 30ms Since the scaling ratio isset to 16 there are 49716 asymp 31 values per subband In thespectral domain since there are four subbands per octave andwe use six octaves (250Hz sim 16 kHz) there are totally 24subbands in the spectral domain Thus the 15-second signalproduces 24 times 31 = 744 mean values and other 744 variancevalues The descriptors are arranged in a matrix form asshown in Figure 1

Since the mean values are sufficient for the identificationpurpose [21] we will not consider the variance descriptorsin the following The obtained descriptors are referred to ashigh-resolution (or full-dimensional) descriptors Accordingto [21] if a piece of query music is to be compared with areference piece it should be done with a sliding comparisonas shown in Figure 2 In the figure one segment of linerepresents descriptors in a piece of music arranged in onedimensional structure Using the representation in Figure 1the query segment is arranged as

119876 = [11990211 sdot sdot sdot 119902124 sdot sdot sdot 119902311 sdot sdot sdot 1199023124] (2)

Since only descriptors from the same subband are to becompared the hop size between two query segments inFigure 2 is 24 descriptors or 003 sdot 16 = 048 second Inother words one descriptor (in a subband) represents 048second of audio samples The smallest Euclidean distance

4 The Scientific World Journal

middot middot middot

middot middot middot

119886124 11988621 119886224 1198863124middot middot middot

middot middot middot

11988611 11988631 middot middot middot 119886324 middot middot middot 1198863224 1198866224

11990211 119902124 11990222411990221 1199023124

middot middot middot11990211 119902124 11990222411990221 1199023124middot middot middot

middot middot middot11990211 119902124 1199023124middot middot middot

middot middot middot

middot middot middot

middot middot middot

Figure 2 The sliding operation to compare the query input Q (15seconds) and the reference music A (30 seconds)

obtained during the sliding operation is recorded as thedistance between the query input and the reference musicFor example if the query Q is to be compared with referenceA then we may obtain the Euclidean distance 119864

119860119861(119896) for the

k-th sliding comparison as

119864119876119860(119896) =

31

sum

119894=1

24

sum

119895=1

10038161003816100381610038161003816119902119894119895minus 119886(119894+119896)119895

10038161003816100381610038161003816 (3)

The recorded distance between these two pieces of music is

119889119876119860

= min119896

119864119876119860(119896) (4)

Although using audio signature descriptors greatlyreduces the computational burden for comparison therequired computation using (3) is still very large Accordingto [21] suppose that a database contains 1000 audio files witheach one having duration of 30 seconds and the music tobe identified has a duration of 15 seconds then 47 616 000arithmetic operations are required Therefore it is beneficialto further reduce the number of comparison which can beaccomplished by employing a multiresolution strategy

3 Music Identification Based onMultiresolution Strategy

As discussed previously the computational cost is still veryhigh even if we use the MPEG-7 descriptors Thereforewe will use the multiresolution strategy to reduce thecomputational complexity To do so in addition to theabove-mentioned high-resolution descriptors we also needto generate low-resolution descriptors (see also Section 4)Therefore the music database contains high-resolution andlow-resolution descriptors for each soundtrack to be identi-fied This step can be accomplished during the setup of thedatabase In addition a training process is conducted to finda distance threshold to determine whether the query inputis in the database or not Section 5 has a detailed descriptionabout the training process

Once the database is constructed as shown in Figure 3the music identification procedure consists of the followingsteps

(1) When a query input is sent to the system it com-putes the MPEG-7 audio signature descriptors andlow-resolution descriptors for the query music If amobile device is used to record the music usually the

Table 1 Bandwidth of the first three subbands

Bandwidth of subband(nonoverlapped)

Bandwidth of subband(overlapped)

250ndash2973Hz 2375ndash3122Hz2973ndash3536Hz 2824ndash3712Hz3535ndash4204Hz 3359ndash4415Hz

descriptors (fingerprints) are sent instead of the PCMsamples to reduce the size of the transmitted data

(2) The computed low-resolution descriptors are com-pared with those in the database Based on thedistance metric a list of candidates is obtainedSince the query music may start at any position ofthe soundtrack a sliding comparison as shown inFigure 2 is necessary

(3) After obtaining the candidate list high-resolutiondescriptors are used to find the distances between thequery input and the candidates The shortest distanceand its associated soundtrack are recorded

(4) If the shortest distance is less than the threshold thequery input is considered as highly similar to therecorded soundtrack in the database Otherwise thequery input is not in the database

During the low-resolution comparison it is also possibleto use an existing algorithm to reduce the comparison timeFor example we may arrange the descriptors in k-d tree (k-dimensional binary search tree) [22 23] structure to reducethe search time However using the k-d tree structure impliesthat every segment of music in the database has the samenumber of descriptors or same duration Note that the unitto be compared is one segment and one soundtrack canbe divided into many overlapped segments as shown inFigure 2 Therefore if the duration of the segment is set to 15seconds then the querymusic must also have a duration of 15seconds to use the system In a practical situation if the queryinput is longer than 15 seconds then only 15 seconds of themusic is used for computing the descriptors and comparison

4 Dimensionality Reduction Method forMPEG-7 Audio Signature Descriptors

This section briefly explains the dimensionality reductiontechnique used in the experiments Although a block-averagemethod is proposed in [21] for this purpose we use a differenttechnique in this paper

41 Problems of Reducing Dimensionality Using Scaling RatioConceptually we may increase the scaling ratio (given inSection 2) to reduce the number of descriptors duringcomputing them For example by increasing the scalingratio from 16 to 256 the number of descriptors is reducedby 16 times Unfortunately this approach does not yieldsatisfactory results because of insufficient time resolutionRecall that the descriptors are derived based on thewindowedwaveform Therefore if two segments of the soundtrack are

The Scientific World Journal 5

Query

Database

Dim reduction

In database

Reject Result

Threshold

Candidate

MPEG-7descriptors

Fullresolution

search

Lowresolution

search

YesNo

Figure 3 Procedure for multiresolution music identification

Descriptors

Descriptors Descriptors1198861198991 11988611989924 119886119899+11 119886119899+124

1199021198991 11990211989924

Figure 4 The audio samples and the descriptors In a real applica-tion 119886

119899119896is stored in the database whereas 119902

119899119896is computed from

the query input Usually 119886119899119896

is not equal to 119902119899119896

due to differentscopes of the windows even though they are all derived from thesame soundtrack

not highly similar their corresponding descriptors usuallyhave large differences With a scaling ratio of 256 eachdescriptor represents around 77 seconds of audio samplesTherefore unless the query input also starts at a point veryclose to the segment boundary of a soundtrack a comparisonbased on these descriptors may fail As illustrated in Figure 4descriptors of 119886

119899119896(or 119886119899+1119896

) and 119902119899119896

are quite different andtherefore the descriptor-based identification scheme cannotidentify the query inputTherefore a suitable time resolutionfor example 048 second should be maintained

42 Proposed Dimensionality Reduction Method In contrastto reduce the number of descriptors by increasing the scalingratio we may reduce them in each temporal-spectral block(ie the matrix in Figure 1) representing a segment of (15-second) music However to maintain a high time resolutionsuccessive temporal-spectral block should have a small timedifference as shown in Figure 5 With this arrangement weachieve both high identification rate and low comparisoncomplexity

For each temporal-spectral block we use PCA (princi-pal component analysis) [21 24] to reduce the number ofdescriptors Since it is difficult to directly use PCA to obtaingood low-resolution descriptors [21] we use an alternativeapproach Its basic idea is to partition the entire block into

119860 =

middot middot middot

119886124

11988621 119886224

119886311

119886321

1198863124

middot middot middot

11988611

11988631 119886324middot middot middot

middot middot middot

middot middot middot

middot middot middot

1198863224

119902331 1199023324

middot middot middot

middot middot middot

middot middot middot 1198866224119886621

11989111 11989118

11989121 11989128

11989131 11989138

Figure 5 Low-resolution descriptors In the figure each temporal-spectral block corresponds to descriptors from 15-second music

119860 = =11986011 1198601211986021 11986022

middot middot middot middot middot middot

middot middot middot

middot middot middot

middot middot middot

middot middot middot

middot middot middot

middot middot middot

middot middot middot

middot middot middot

11988611 11988618

119886151

119886161

119886291

119886158

119886168

119886298

11988619

119886159

119886169

119886299

119886116

1198861516

1198861616

1198862916 1198862917 1198862924

119886124119886117

Figure 6 Partition of a temporal-spectral block into four subblocksin the experiments

four subblocks as shown in Figure 6 and then use PCA toreduce the number of descriptors of each subblock into twovalues (descriptors) called low-resolution descriptors Alsohigh-frequency descriptors in the original temporal-spectralblock are all discarded because they are susceptible to noise[21] With this arrangement totally eight descriptors are usedto represent a segment of 15-second music Note that theactual duration of a segment used in the experiments is 1404seconds (though we still say 15 seconds) because some audiosamples are lost after MP-3 compression and decompressionin the experiments

6 The Scientific World Journal

Table 2 Identification accuracy using high-resolution features

Uncompressed 192 k MP-3 96 k MP-3High-frequencydescriptors Used Not used Used Not used Used Not used

Match in firstone 100 100 100 100 94 100

Match in first 15 100 100 100 100 98 100

Wenowdescribe how to calculate low-resolution descrip-tors Since using PCA for dimensionality reduction is a well-known approach we only describe how to construct thecovariance matrix for PCA computation and omit the com-putation of finding principal components Suppose that thereare N segments of music pieces in the database with theirdescriptor matrices denoted as 119860(1) to 119860(119873) To simplify theargument we consider the subblock matrix119860(119896)

11(referring to

Figure 6) in the following Other subblocks can be computedby the same procedure First collect all subblocks from 119860

(119896)

11

to form a big matrix as follows

119861 =

[[[

[

119886(1)

11sdot sdot sdot 119886

(1)

18119886(2)

11sdot sdot sdot 119886

(2)

18sdot sdot sdot 119886(119873)

11sdot sdot sdot 119886(119873)

18

sdot sdot sdot

119886(1)

151sdot sdot sdot 119886(1)

158119886(2)

151sdot sdot sdot 119886(2)

158sdot sdot sdot 119886(119873)

151sdot sdot sdot 119886(119873)

158

]]]

]

= [

[

uarr sdot sdot sdot uarr

b1sdot sdot sdot b8119873

darr sdot sdot sdot darr

]

]

(5)

Then we may consider column vectors of matrix B as b119894

vectors Having the data vectors b119894 the covariance matrix

and subsequently the principal components can be foundBy keeping only two principal components we can obtainc119894= [11988811989411198881198942] from b

119894 Since there are eight column vectors in

a subblock descriptors in119860(119896)11

are reduced to eight c119894(8119896minus7 le

119894 le 8119896) vectors Next we can rearrange the obtained c119894vectors

as

119862 =[[

[

11988811

11988812

11988891

11988892

sdot sdot sdot 1198888119873minus71

1198888119873minus72

sdot sdot sdot

11988881

11988882

119888161

119888162

sdot sdot sdot 11988881198731

11988881198732

]]

]

(6)

Again by treating each row of matrix C as the data to beprocessed by PCA we can compute the covariance matrixand finally reduce each column vector to one value Sincethere are two columns originally from one subblock thereare totally two (low-resolution) descriptors per subblockEquivalently a segment of 15-second music is representedby eight descriptors During the computation the principalcomponents obtained in the first and the second steps shouldbe stored in the database When a query input (in the formof high-resolution descriptors) is supplied these componentsare used to obtain the low-resolution descriptors for theinput

Table 3 Comparison of identification accuracies using low-resolution descriptors obtained by the proposed approach and bythe averaging (avg) method [21]

Uncompressed 192 k MP-3 96 k MP-3Method Proposed Avg Proposed Avg Proposed AvgMatch in firstone 993 995 991 993 901 898

Match in first15 100 100 100 100 996 995

5 Threshold to Determine the Membership ofthe Query Input

Since in practice the database cannot collect all musicsoundtracks in the world we have to have a strategy todetermine if the query input is actually in the database ornot In our case the decision is accomplished with the aid ofa threshold If the shortest distances between the input andthe candidates are greater than a threshold then the queryinput is not in the database otherwise it is in the databaseThough the concept is simple it is not trivial to determinethe threshold [20]

Recall that when a query input is applied to the proposedsystem the system computes the Euclidean distances betweenthe low-resolution descriptors of the query input and thosein the database Accordingly m (say m = 20) segments ofsoundtracks from distinct titles are selected based on thecomputed distances Next the Euclidean distances betweenthe query and the selected segments are computed againusing full-resolution descriptors By sorting the distancesfrom small to large the system creates a candidate list for thequery input

To compute the previously mentioned threshold weexamine two different methods In the first method (methodI) we define the ldquofirstrdquo distance 119889

1as the (full-resolution)

distance associated with the first candidate in the list Simi-larly the ldquosecondrdquo distance 119889

2is the distance associated with

the second candidate as shown in Figure 7 The methodto compute the first and second distances is used both intraining and in identification For the training phase assumethat there are 119873

119879soundtracks used For a segment from a

soundtrack indexed n let the first distance be 1198891(119899) and the

second distance be 1198892(119899) The threshold 119879

119860is then computed

as

119879119860= 05 sdot (

119873119879

sum

119899=1

1198891(119899) +

119873119879

sum

119899=1

1198892(119899)) (7)

During the identification phase the first distance forthe query input is computed If the first distance is smallerthan the threshold 119879

119860 the query input is determined to be

highly similar to the top soundtrack title in the candidate listOtherwise the query input is not inside the database

In addition to method I we also propose an alternativemethod (method II) to compute the first and seconddistances

The Scientific World Journal 7

Firs

t dist

ance

Seco

nd d

istan

ce

Figure 7 The first and second distances in method I

and the threshold Referring to Figure 8 the first distance1198891015840

1(119899) of method II for a training segment in soundtrack n is

1198891015840

1(119899) = 119889

1(119899) divide

sum119872

119898=2119889119898(119899)

119872 minus 1 (8)

where 119889119898(119899) is the distance associatedwith themth candidate

in the list andM is a constant Based on our experimentM =10 is sufficient By the same arrangement the second distance1198891015840

2(119899) is calculated as

1198891015840

2(119899) = 119889

2(119899) divide

sum119872+1

119898=3119889119898(119899)

119872 minus 1 (9)

Similar to (7) the threshold 1198791015840119860is determined as

1198791015840

119860= 05 sdot (

119873119879

sum

119899=1

1198891015840

1(119899) +

119873119879

sum

119899=1

1198891015840

2(119899)) (10)

During identification phase if the computed distance 11988910158401

is greater than1198791015840119860 the query input is determined as not in the

database Otherwise the query input is highly similar to thetop soundtrack title in the candidate list

6 Experiments and Results

Before conducting the experiments we collect 750 sound-tracks frommany CD titles for constructing the database andfor identification In the experiments 30 seconds of musicis excerpted from the soundtracks as the reference items tobe identified Then low- and full-resolution descriptors ofthe reference items are calculated and stored in the databaseWe also randomly excerpt 15-second query items from thereference items Note that a query item may start fromany sample on the first half of the reference To test theidentification accuracy for compressed audio the 15-secondquery inputs are also encoded and then decoded with anMP-3 coder with bitrates of 192 k and 96 k respectivelyHowever the database only contains descriptors from theuncompressed items In addition the principal componentsare also obtained using uncompressed items

Table 4 Comparison of search time between 119896-119889 tree search andlinear search

119870-119889 tree Linear searchAverage time 0255 sec 323 sec

Several experiments are conducted to examine the per-formance of the proposed systemThefirst experiment checksthe identification accuracy using high-resolution descriptorsThe second one compares the relative identification accuracybetween the proposed method (in Section 42) and themethod given in [21] The third experiment compares thesearch speed by using the linear search and k-d tree searchThe next experiment intends to determine a suitable valueof M used in (8) and (9) Having the value of M we reportthe identification accuracy using a multiresolution strategyin experiment five In this experiment both method I andmethod II are examined To have a complete comparisonbetween method I and II we also report the results in termsof the ROC (receiver operating characteristics) [25] curveand the DET-like (detection error tradeoff) [26] curve butwithout logarithm

61 Experiment One Identification Accuracy Using High-Resolution Descriptors The first experiment is to examinethe identification accuracy using the high-resolutionMPEG-7 audio signature descriptors (without reduction) To evaluatethe influences of high-frequency descriptors we also exam-ine the accuracies with and without using high-frequency(greater than 25 kHz) descriptors The results are given inTable 2 In the table ldquomatch in first onerdquo is the rate that thequery input is correctly identified with the shortest distanceamong all reference items Similarly ldquomatch in first 15rdquo is therate that the query is correctly identified within a list of 15 ref-erence items sorted by distanceThe results indicate that high-resolution descriptors have very good identification accuracyAlso for uncompressed or 192 k MP-3 items whether usinghigh-frequency descriptors does not affect the identificationaccuracy However for 96 k MP-3 items discarding high-frequency descriptors greatly improves the accuracy

8 The Scientific World Journal

Seco

nd d

istan

ce

Firs

t dist

ance

AvgAvg

dividedivide

Figure 8 The first and second distances in method II

62 Experiment Two Identification Accuracy Using Low-Resolution Descriptors The second experiment compares therelative identification accuracies of the proposed dimension-ality reduction technique and the averaging method pro-posed in [21] In this experiment eight descriptors are used torepresent one segment of music (15 seconds) The results areshown in Table 3 For this experiment the accuracy rate inldquomatch in first 15rdquo ismore important than the figure in ldquomatchin first onerdquo because the low-resolution descriptors are usedto produce a candidate list Therefore it is acceptable as longas the correct item is within the list as the high-resolutiondescriptors are to be used in the second stage In this regardthe proposed approach is slightly better than the averagingmethod proposed in [21] for 96 k MP-3 query inputs

63 Experiment Three Time for k-d Tree Search and LinearSearch The third experiment compares the time to generatea list of 20 candidates by using k-d tree search and linear(exhausted) search for low-resolution descriptors The com-puting time is obtained by averaging the required time in tentrials For each trial 500 query items are randomly selectedamong the 750 available items The experimental platform isa personal computer with a 16GHz PentiumCPU and 15 GBmemory The program is written in C with a Borland C++compiler The results are shown in Table 4 The results showthat we can increase the search time by 10-fold if the low-resolution descriptors are embedded in the k-d tree structure

64 Experiment Four Determination of M in Method II Thefourth experiment is to determine the numberM in methodII Recall that the thresholds in both methods are computedbased on high-resolution descriptors To determine the opti-mal value of M we vary this value from one to eighteen in(8) and (9) to compute the threshold 1198791015840

119860and then to perform

identification Note that method II is degenerated to methodI if M = 1 In this experiment 96 k MP-3-coded items areused as the query In addition we deliberately keep the high-frequency descriptors in computing the distances becausekeeping them makes it easier to examine the influence of Mversus identification rate (defined in experiment five ) Theexperimental results are given in Figure 9 The results showthat whenM is equal or greater than 10 the identification rate

085408560858

0860862086408660868

087087208740876

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18

Iden

tifica

tion

rate

119872 value

Figure 9The identification rate versus the numberM for 96 kMP-3query inputs

remains almost constant Therefore M = 10 is sufficient formethod II This value is then used in experiment five

65 Experiment Five Comparison between Method I andMethod II The fifth experiment compares the identificationrates between method I and method II The thresholds arecalculated using (7) and (10) respectively In this experiment375 of the 750 15-second items are randomly chosen fortraining (ie to obtain the thresholds) The rest of 375 itemsare for query Note that all of the items are excerpted from thereference items in the database In addition we test other 500query items (also with duration of 15 sec) not excerpted fromdatabase references

Since some query inputs are not in the database wealso need to consider the situations of falsely identifying anoutside database item as one inside the database and viceversa Suppose that 119879

119889query items are actually inside the

database and 119879119900query items are not inside the database

Assume that the identification system correctly identifies119873119889

inside-database query item erroneously rejects 119873119903inside-

database query items and erroneously accepts 119873119886outside

database query items Then the IDR (identification rate)

The Scientific World Journal 9

Table 5 Comparison of identification performance using method I and method II with the use of high-frequency descriptors

Uncompressed 192 k MP-3 96 k MP-3Method I Method II Method I Method II Method I Method II

IDR 988 999 99 997 841 874FAR 21 02 18 02 04 02FRR 00 00 00 05 366 296ACC 984 999 989 995 718 768

Table 6 Comparison of method I and method II for IDR FAR and FRR with high-frequency descriptors removed

Uncompressed 192 k MP-3 96 k MP-3Method I Method II Method I Method II Method I Method II

IDR 991 997 989 997 990 997FAR 15 02 19 02 16 02FRR 03 04 02 04 03 06ACC 987 995 985 995 986 994

FAR (false-accept rate) and FRR (false-rejection rate) arecomputed as follows

IDR =119873119889

119879119889minus 119873119903

(11)

FAR =119873119886

119879119900

(12)

FRR =119873119903

119879119889

(13)

In addition the accuracy of the system is computed as

ACC =119873119889+ 119873119900

119879119889+ 119879119900

(14)

where119873119900is the number of query items correctly identified as

outside-database items and can be computed as119873119900= 119879119900minus119873119886

In this experiment the low-resolution descriptors areused to find 20 candidates Next high-resolution descriptorsare used to find the best-matched one in the list If thedistance associated with the best-match is greater than thethreshold the query input is determined as not in thedatabase Otherwise the query input is identified as the best-matched one In this experiment high-frequency descriptorsare kept in the comparison With the use of (11) to (14) theaverage values after four trials are given in Table 5 Fromthe results we know that the system does not perform wellfor identifying 96 k MP-3 query items This is mainly dueto high distortion in high-frequency descriptors A similarphenomenon can also be observed in Table 2 Later onwe will repeat this experiment but discard high-frequencydescriptors

To further examine the performance differences betweenmethod I and method II we again use 96 k MP-3 items asthe query inputs But this time we vary the threshold values(for both methods) and then record the IDR FAR and FRRvalues The results are presented as ROC curves [25] andDET-like curves [26] (but without logarithm) as shown in

Figures 10 and 11 respectively The ROC curve plots FARversus TPR (true positive rate) which is defined as

TPR =119873119889

119879119889

(15)

Conceptually a better classifier should have a higher TPRfor a fixed FAR Therefore a better classifier should havean ROC curve closer to the left-upper corner With thisinterpretation from Figure 10 we know that method II is abetter classifier

In this experiment we also show DET-like curves Aldquotruerdquo DET curve plots FAR versus FRR using logarithmscales In our case however we do not use logarithm scales toexhibit the similarity between ROC and DET curves Againconceptually we know that a better classifier should have alower FRR over a fixed FAR Thus a better classifier shouldhave a curve closer to the left-bottom corner By examiningFigure 11 we again confirm thatmethod II is a better method

Since the high-frequency descriptors actually affect theidentification rate at low bitrates we again repeat this exper-iment without using high-frequency descriptors The resultsare given in Table 6 By comparing the results in Tables 5 and6 we know that removing high-frequency descriptors slightlyreduces the accuracy of uncompressed items However itgreatly improves the accuracy for 96 k MP-3 items Sincemethod II has an accuracy of at least 994 in all cases it ishighly plausible to identify copyrighted audiomaterials usingthis method

7 Conclusion

In this paper we propose a system using a multiresolutionstrategy to identify whether a piece of unknown music isidentical to one of the pieces in the database In the systemhigh-resolution descriptors are MPEG-7 audio signaturedescriptors and the low-resolution descriptors are obtainedfrom high-resolution descriptors with the aid of PCA toreduce their dimensionality Experimental results show that

10 The Scientific World Journal

0 20 40 60 80 1000

10

20

30

40

50

60

70

80

90

100

FAR ()

TPR

()

ROC curve (96 k MP-3)

Method IMethod II

Figure 10 The ROC curves for both methods

0 20 40 60 80 1000

10

20

30

40

50

60

70

80

90

100

FAR ()

FRR

()

Method IMethod II

DET curve (96 k MP-3)

Figure 11 The DET-like curves without logarithm for both meth-ods

low-resolution descriptors still have high identification accu-racies To reduce the time to generate the candidate list weuse the k-d tree structure to store low-resolution descriptorsExperimental results show that using k-d tree structureincreases the search time by ten-folds Since not every pieceof query input is within the database we also proposed twomethods to determine the distance thresholds Experimentalresults show that the proposed method II provides an accu-racy of 994 Therefore the proposed system can be usedin real applications such as identifying copyrighted audiofiles circulated over the Internet As it can be easily extendedto operate with multiple computers the proposed system

is a plausible starting point to construct a large operabledatabase

Acknowledgment

This work was supported in part by National Science Councilof Taiwan through Grants NSC 94-2213-E-027-042 and 99-2221-E-027-097

References

[1] For a brief description of the bill please refer to httpenwikipediaorgwikiSOPA

[2] R Eklund ldquoAudio watermarking techniquesrdquo httpwwwmusemagiccompaperswatermarkhtml

[3] J S R Jang and H R Lee ldquoA general framework of progressivefiltering and its application to query by singinghummingrdquoIEEE Transactions on Audio Speech and Language Processingvol 16 no 2 pp 350ndash358 2008

[4] S Baluja and M Covell ldquoAudio fingerprinting combiningcomputer visionamp data streamprocessingrdquo in Proceedings of theIEEE International Conference on Acoustics Speech and SignalProcessing (ICASSP rsquo07) pp II213ndashII216 Honolulu HawaiiUSA April 2007

[5] Shazam httpwwwshazamcommusicwebhomehtml[6] A Wang ldquoThe Shazam music recognition servicerdquo Communi-

cations of the ACM vol 49 no 8 pp 44ndash48 2006[7] V Chandrasekha M Sharifi and D A Ross ldquoSurvey and eval-

uation of audio fingerprinting schemes for mobile query-by-example applicationsrdquo in Proceedinjgs of the 12th InternationalConference onMusic Information Retrieval pp 801ndash806MiamiFla USA October 2011

[8] J A Haitsma and T Kalker ldquoA highly robust audio fingerprint-ing systemrdquo in Proceedinjgs of the 12th International Conferenceon Music Information Retrieval pp 107ndash115 Paris FranceOctober 2002

[9] C J C Burges J C Platt and S Jana ldquoDistortion discriminantanalysis for audio fingerprintingrdquo IEEE Transactions on Speechand Audio Processing vol 11 no 3 pp 165ndash174 2003

[10] Official website httpmusicbrainzorg[11] Official website httpwwwaudiblemagiccom[12] Official website httpwwwgracenotecom[13] httpenwikipediaorgwikiAcoustic fingerprint[14] P J O Doets M M Gisbert and R L Lagendijk ldquoOn

the comparison of audio fingerprints for extracting qualityparameters of compressed audiordquo in Security Steganographyand Watermarking of Multimedia Contents VIII vol 6072 ofProceedings of SPIE pp 60720L-1ndash60720L-12 San Jose CalifUSA January 2006

[15] F Nack and A T Lindsay ldquoEverything you wanted to knowaboutMPEG-7 part 1rdquo IEEEMultimedia vol 6 no 3 pp 65ndash771999

[16] F Nack and A Lindsay ldquoEverything you wanted to know aboutMPEG-7 part 2rdquo IEEEMultimedia vol 6 no 4 pp 64ndash73 1999

[17] ISOIEC Information TechnologymdashMultimedia ContentDescription Interface -Part 4 Audio IS 15938-4 2002

[18] H Crysandt ldquoMusic identification with MPEG-7rdquo in Proceed-ings of the 115th AES Convention October 2003 Paper 5967

The Scientific World Journal 11

[19] OHellmuth E AllamanceMCremerHGrossmann J Herreand T Kastner ldquoUsing MPEG-7 audio fingerprinting in real-world applicationrdquo in Proceedings of the 115th AES ConventionOctober 2003 Paper 5961

[20] H-G Kim N Moreau and T Sikora MPEG-7 Audio andBeyond John Wiley amp Sons 2005

[21] J-Y Lee and S D You ldquoDimension-reduction technique forMPEG-7 Audio Descriptorsrdquo in Proceedings of the 6th PacificRim Conference on Multimedia (PCM rsquo05) vol 3768 of LectureNotes in Computer Science pp 526ndash537 2005

[22] J L Bentley ldquoMultidimensional binary search trees used forassociative searchingrdquo Communications of the ACM vol 18 no9 pp 509ndash517 1975

[23] Open source program for k-d tree httpwwwcsumdedusimmountANN

[24] J Shlens ldquoA tutorial on principal component analysisrdquohttpwwwsnlsalkedusimshlenspcapdf

[25] T Fawcell ldquoROC graphs notes and practical considerations forresearchersrdquo Tech Rep HP Lab 2004

[26] A Martin G Doddington T Kamm M Ordowski and MPrzybocki ldquoThe DET curve in assessment of detection taskperformancerdquo in Proceedings of the Eurospeech vol 4 pp 1899ndash1903 Rhodes Greece September 1997

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 4: Research Article Music Identification System Using MPEG-7 ...downloads.hindawi.com/journals/tswj/2013/752464.pdf · Gracenote s MusicID [ ]. According to [ ], there are more ... one

4 The Scientific World Journal

middot middot middot

middot middot middot

119886124 11988621 119886224 1198863124middot middot middot

middot middot middot

11988611 11988631 middot middot middot 119886324 middot middot middot 1198863224 1198866224

11990211 119902124 11990222411990221 1199023124

middot middot middot11990211 119902124 11990222411990221 1199023124middot middot middot

middot middot middot11990211 119902124 1199023124middot middot middot

middot middot middot

middot middot middot

middot middot middot

Figure 2 The sliding operation to compare the query input Q (15seconds) and the reference music A (30 seconds)

obtained during the sliding operation is recorded as thedistance between the query input and the reference musicFor example if the query Q is to be compared with referenceA then we may obtain the Euclidean distance 119864

119860119861(119896) for the

k-th sliding comparison as

119864119876119860(119896) =

31

sum

119894=1

24

sum

119895=1

10038161003816100381610038161003816119902119894119895minus 119886(119894+119896)119895

10038161003816100381610038161003816 (3)

The recorded distance between these two pieces of music is

119889119876119860

= min119896

119864119876119860(119896) (4)

Although using audio signature descriptors greatlyreduces the computational burden for comparison therequired computation using (3) is still very large Accordingto [21] suppose that a database contains 1000 audio files witheach one having duration of 30 seconds and the music tobe identified has a duration of 15 seconds then 47 616 000arithmetic operations are required Therefore it is beneficialto further reduce the number of comparison which can beaccomplished by employing a multiresolution strategy

3 Music Identification Based onMultiresolution Strategy

As discussed previously the computational cost is still veryhigh even if we use the MPEG-7 descriptors Thereforewe will use the multiresolution strategy to reduce thecomputational complexity To do so in addition to theabove-mentioned high-resolution descriptors we also needto generate low-resolution descriptors (see also Section 4)Therefore the music database contains high-resolution andlow-resolution descriptors for each soundtrack to be identi-fied This step can be accomplished during the setup of thedatabase In addition a training process is conducted to finda distance threshold to determine whether the query inputis in the database or not Section 5 has a detailed descriptionabout the training process

Once the database is constructed as shown in Figure 3the music identification procedure consists of the followingsteps

(1) When a query input is sent to the system it com-putes the MPEG-7 audio signature descriptors andlow-resolution descriptors for the query music If amobile device is used to record the music usually the

Table 1 Bandwidth of the first three subbands

Bandwidth of subband(nonoverlapped)

Bandwidth of subband(overlapped)

250ndash2973Hz 2375ndash3122Hz2973ndash3536Hz 2824ndash3712Hz3535ndash4204Hz 3359ndash4415Hz

descriptors (fingerprints) are sent instead of the PCMsamples to reduce the size of the transmitted data

(2) The computed low-resolution descriptors are com-pared with those in the database Based on thedistance metric a list of candidates is obtainedSince the query music may start at any position ofthe soundtrack a sliding comparison as shown inFigure 2 is necessary

(3) After obtaining the candidate list high-resolutiondescriptors are used to find the distances between thequery input and the candidates The shortest distanceand its associated soundtrack are recorded

(4) If the shortest distance is less than the threshold thequery input is considered as highly similar to therecorded soundtrack in the database Otherwise thequery input is not in the database

During the low-resolution comparison it is also possibleto use an existing algorithm to reduce the comparison timeFor example we may arrange the descriptors in k-d tree (k-dimensional binary search tree) [22 23] structure to reducethe search time However using the k-d tree structure impliesthat every segment of music in the database has the samenumber of descriptors or same duration Note that the unitto be compared is one segment and one soundtrack canbe divided into many overlapped segments as shown inFigure 2 Therefore if the duration of the segment is set to 15seconds then the querymusic must also have a duration of 15seconds to use the system In a practical situation if the queryinput is longer than 15 seconds then only 15 seconds of themusic is used for computing the descriptors and comparison

4 Dimensionality Reduction Method forMPEG-7 Audio Signature Descriptors

This section briefly explains the dimensionality reductiontechnique used in the experiments Although a block-averagemethod is proposed in [21] for this purpose we use a differenttechnique in this paper

41 Problems of Reducing Dimensionality Using Scaling RatioConceptually we may increase the scaling ratio (given inSection 2) to reduce the number of descriptors duringcomputing them For example by increasing the scalingratio from 16 to 256 the number of descriptors is reducedby 16 times Unfortunately this approach does not yieldsatisfactory results because of insufficient time resolutionRecall that the descriptors are derived based on thewindowedwaveform Therefore if two segments of the soundtrack are

The Scientific World Journal 5

Query

Database

Dim reduction

In database

Reject Result

Threshold

Candidate

MPEG-7descriptors

Fullresolution

search

Lowresolution

search

YesNo

Figure 3 Procedure for multiresolution music identification

Descriptors

Descriptors Descriptors1198861198991 11988611989924 119886119899+11 119886119899+124

1199021198991 11990211989924

Figure 4 The audio samples and the descriptors In a real applica-tion 119886

119899119896is stored in the database whereas 119902

119899119896is computed from

the query input Usually 119886119899119896

is not equal to 119902119899119896

due to differentscopes of the windows even though they are all derived from thesame soundtrack

not highly similar their corresponding descriptors usuallyhave large differences With a scaling ratio of 256 eachdescriptor represents around 77 seconds of audio samplesTherefore unless the query input also starts at a point veryclose to the segment boundary of a soundtrack a comparisonbased on these descriptors may fail As illustrated in Figure 4descriptors of 119886

119899119896(or 119886119899+1119896

) and 119902119899119896

are quite different andtherefore the descriptor-based identification scheme cannotidentify the query inputTherefore a suitable time resolutionfor example 048 second should be maintained

42 Proposed Dimensionality Reduction Method In contrastto reduce the number of descriptors by increasing the scalingratio we may reduce them in each temporal-spectral block(ie the matrix in Figure 1) representing a segment of (15-second) music However to maintain a high time resolutionsuccessive temporal-spectral block should have a small timedifference as shown in Figure 5 With this arrangement weachieve both high identification rate and low comparisoncomplexity

For each temporal-spectral block we use PCA (princi-pal component analysis) [21 24] to reduce the number ofdescriptors Since it is difficult to directly use PCA to obtaingood low-resolution descriptors [21] we use an alternativeapproach Its basic idea is to partition the entire block into

119860 =

middot middot middot

119886124

11988621 119886224

119886311

119886321

1198863124

middot middot middot

11988611

11988631 119886324middot middot middot

middot middot middot

middot middot middot

middot middot middot

1198863224

119902331 1199023324

middot middot middot

middot middot middot

middot middot middot 1198866224119886621

11989111 11989118

11989121 11989128

11989131 11989138

Figure 5 Low-resolution descriptors In the figure each temporal-spectral block corresponds to descriptors from 15-second music

119860 = =11986011 1198601211986021 11986022

middot middot middot middot middot middot

middot middot middot

middot middot middot

middot middot middot

middot middot middot

middot middot middot

middot middot middot

middot middot middot

middot middot middot

11988611 11988618

119886151

119886161

119886291

119886158

119886168

119886298

11988619

119886159

119886169

119886299

119886116

1198861516

1198861616

1198862916 1198862917 1198862924

119886124119886117

Figure 6 Partition of a temporal-spectral block into four subblocksin the experiments

four subblocks as shown in Figure 6 and then use PCA toreduce the number of descriptors of each subblock into twovalues (descriptors) called low-resolution descriptors Alsohigh-frequency descriptors in the original temporal-spectralblock are all discarded because they are susceptible to noise[21] With this arrangement totally eight descriptors are usedto represent a segment of 15-second music Note that theactual duration of a segment used in the experiments is 1404seconds (though we still say 15 seconds) because some audiosamples are lost after MP-3 compression and decompressionin the experiments

6 The Scientific World Journal

Table 2 Identification accuracy using high-resolution features

Uncompressed 192 k MP-3 96 k MP-3High-frequencydescriptors Used Not used Used Not used Used Not used

Match in firstone 100 100 100 100 94 100

Match in first 15 100 100 100 100 98 100

Wenowdescribe how to calculate low-resolution descrip-tors Since using PCA for dimensionality reduction is a well-known approach we only describe how to construct thecovariance matrix for PCA computation and omit the com-putation of finding principal components Suppose that thereare N segments of music pieces in the database with theirdescriptor matrices denoted as 119860(1) to 119860(119873) To simplify theargument we consider the subblock matrix119860(119896)

11(referring to

Figure 6) in the following Other subblocks can be computedby the same procedure First collect all subblocks from 119860

(119896)

11

to form a big matrix as follows

119861 =

[[[

[

119886(1)

11sdot sdot sdot 119886

(1)

18119886(2)

11sdot sdot sdot 119886

(2)

18sdot sdot sdot 119886(119873)

11sdot sdot sdot 119886(119873)

18

sdot sdot sdot

119886(1)

151sdot sdot sdot 119886(1)

158119886(2)

151sdot sdot sdot 119886(2)

158sdot sdot sdot 119886(119873)

151sdot sdot sdot 119886(119873)

158

]]]

]

= [

[

uarr sdot sdot sdot uarr

b1sdot sdot sdot b8119873

darr sdot sdot sdot darr

]

]

(5)

Then we may consider column vectors of matrix B as b119894

vectors Having the data vectors b119894 the covariance matrix

and subsequently the principal components can be foundBy keeping only two principal components we can obtainc119894= [11988811989411198881198942] from b

119894 Since there are eight column vectors in

a subblock descriptors in119860(119896)11

are reduced to eight c119894(8119896minus7 le

119894 le 8119896) vectors Next we can rearrange the obtained c119894vectors

as

119862 =[[

[

11988811

11988812

11988891

11988892

sdot sdot sdot 1198888119873minus71

1198888119873minus72

sdot sdot sdot

11988881

11988882

119888161

119888162

sdot sdot sdot 11988881198731

11988881198732

]]

]

(6)

Again by treating each row of matrix C as the data to beprocessed by PCA we can compute the covariance matrixand finally reduce each column vector to one value Sincethere are two columns originally from one subblock thereare totally two (low-resolution) descriptors per subblockEquivalently a segment of 15-second music is representedby eight descriptors During the computation the principalcomponents obtained in the first and the second steps shouldbe stored in the database When a query input (in the formof high-resolution descriptors) is supplied these componentsare used to obtain the low-resolution descriptors for theinput

Table 3 Comparison of identification accuracies using low-resolution descriptors obtained by the proposed approach and bythe averaging (avg) method [21]

Uncompressed 192 k MP-3 96 k MP-3Method Proposed Avg Proposed Avg Proposed AvgMatch in firstone 993 995 991 993 901 898

Match in first15 100 100 100 100 996 995

5 Threshold to Determine the Membership ofthe Query Input

Since in practice the database cannot collect all musicsoundtracks in the world we have to have a strategy todetermine if the query input is actually in the database ornot In our case the decision is accomplished with the aid ofa threshold If the shortest distances between the input andthe candidates are greater than a threshold then the queryinput is not in the database otherwise it is in the databaseThough the concept is simple it is not trivial to determinethe threshold [20]

Recall that when a query input is applied to the proposedsystem the system computes the Euclidean distances betweenthe low-resolution descriptors of the query input and thosein the database Accordingly m (say m = 20) segments ofsoundtracks from distinct titles are selected based on thecomputed distances Next the Euclidean distances betweenthe query and the selected segments are computed againusing full-resolution descriptors By sorting the distancesfrom small to large the system creates a candidate list for thequery input

To compute the previously mentioned threshold weexamine two different methods In the first method (methodI) we define the ldquofirstrdquo distance 119889

1as the (full-resolution)

distance associated with the first candidate in the list Simi-larly the ldquosecondrdquo distance 119889

2is the distance associated with

the second candidate as shown in Figure 7 The methodto compute the first and second distances is used both intraining and in identification For the training phase assumethat there are 119873

119879soundtracks used For a segment from a

soundtrack indexed n let the first distance be 1198891(119899) and the

second distance be 1198892(119899) The threshold 119879

119860is then computed

as

119879119860= 05 sdot (

119873119879

sum

119899=1

1198891(119899) +

119873119879

sum

119899=1

1198892(119899)) (7)

During the identification phase the first distance forthe query input is computed If the first distance is smallerthan the threshold 119879

119860 the query input is determined to be

highly similar to the top soundtrack title in the candidate listOtherwise the query input is not inside the database

In addition to method I we also propose an alternativemethod (method II) to compute the first and seconddistances

The Scientific World Journal 7

Firs

t dist

ance

Seco

nd d

istan

ce

Figure 7 The first and second distances in method I

and the threshold Referring to Figure 8 the first distance1198891015840

1(119899) of method II for a training segment in soundtrack n is

1198891015840

1(119899) = 119889

1(119899) divide

sum119872

119898=2119889119898(119899)

119872 minus 1 (8)

where 119889119898(119899) is the distance associatedwith themth candidate

in the list andM is a constant Based on our experimentM =10 is sufficient By the same arrangement the second distance1198891015840

2(119899) is calculated as

1198891015840

2(119899) = 119889

2(119899) divide

sum119872+1

119898=3119889119898(119899)

119872 minus 1 (9)

Similar to (7) the threshold 1198791015840119860is determined as

1198791015840

119860= 05 sdot (

119873119879

sum

119899=1

1198891015840

1(119899) +

119873119879

sum

119899=1

1198891015840

2(119899)) (10)

During identification phase if the computed distance 11988910158401

is greater than1198791015840119860 the query input is determined as not in the

database Otherwise the query input is highly similar to thetop soundtrack title in the candidate list

6 Experiments and Results

Before conducting the experiments we collect 750 sound-tracks frommany CD titles for constructing the database andfor identification In the experiments 30 seconds of musicis excerpted from the soundtracks as the reference items tobe identified Then low- and full-resolution descriptors ofthe reference items are calculated and stored in the databaseWe also randomly excerpt 15-second query items from thereference items Note that a query item may start fromany sample on the first half of the reference To test theidentification accuracy for compressed audio the 15-secondquery inputs are also encoded and then decoded with anMP-3 coder with bitrates of 192 k and 96 k respectivelyHowever the database only contains descriptors from theuncompressed items In addition the principal componentsare also obtained using uncompressed items

Table 4 Comparison of search time between 119896-119889 tree search andlinear search

119870-119889 tree Linear searchAverage time 0255 sec 323 sec

Several experiments are conducted to examine the per-formance of the proposed systemThefirst experiment checksthe identification accuracy using high-resolution descriptorsThe second one compares the relative identification accuracybetween the proposed method (in Section 42) and themethod given in [21] The third experiment compares thesearch speed by using the linear search and k-d tree searchThe next experiment intends to determine a suitable valueof M used in (8) and (9) Having the value of M we reportthe identification accuracy using a multiresolution strategyin experiment five In this experiment both method I andmethod II are examined To have a complete comparisonbetween method I and II we also report the results in termsof the ROC (receiver operating characteristics) [25] curveand the DET-like (detection error tradeoff) [26] curve butwithout logarithm

61 Experiment One Identification Accuracy Using High-Resolution Descriptors The first experiment is to examinethe identification accuracy using the high-resolutionMPEG-7 audio signature descriptors (without reduction) To evaluatethe influences of high-frequency descriptors we also exam-ine the accuracies with and without using high-frequency(greater than 25 kHz) descriptors The results are given inTable 2 In the table ldquomatch in first onerdquo is the rate that thequery input is correctly identified with the shortest distanceamong all reference items Similarly ldquomatch in first 15rdquo is therate that the query is correctly identified within a list of 15 ref-erence items sorted by distanceThe results indicate that high-resolution descriptors have very good identification accuracyAlso for uncompressed or 192 k MP-3 items whether usinghigh-frequency descriptors does not affect the identificationaccuracy However for 96 k MP-3 items discarding high-frequency descriptors greatly improves the accuracy

8 The Scientific World Journal

Seco

nd d

istan

ce

Firs

t dist

ance

AvgAvg

dividedivide

Figure 8 The first and second distances in method II

62 Experiment Two Identification Accuracy Using Low-Resolution Descriptors The second experiment compares therelative identification accuracies of the proposed dimension-ality reduction technique and the averaging method pro-posed in [21] In this experiment eight descriptors are used torepresent one segment of music (15 seconds) The results areshown in Table 3 For this experiment the accuracy rate inldquomatch in first 15rdquo ismore important than the figure in ldquomatchin first onerdquo because the low-resolution descriptors are usedto produce a candidate list Therefore it is acceptable as longas the correct item is within the list as the high-resolutiondescriptors are to be used in the second stage In this regardthe proposed approach is slightly better than the averagingmethod proposed in [21] for 96 k MP-3 query inputs

63 Experiment Three Time for k-d Tree Search and LinearSearch The third experiment compares the time to generatea list of 20 candidates by using k-d tree search and linear(exhausted) search for low-resolution descriptors The com-puting time is obtained by averaging the required time in tentrials For each trial 500 query items are randomly selectedamong the 750 available items The experimental platform isa personal computer with a 16GHz PentiumCPU and 15 GBmemory The program is written in C with a Borland C++compiler The results are shown in Table 4 The results showthat we can increase the search time by 10-fold if the low-resolution descriptors are embedded in the k-d tree structure

64 Experiment Four Determination of M in Method II Thefourth experiment is to determine the numberM in methodII Recall that the thresholds in both methods are computedbased on high-resolution descriptors To determine the opti-mal value of M we vary this value from one to eighteen in(8) and (9) to compute the threshold 1198791015840

119860and then to perform

identification Note that method II is degenerated to methodI if M = 1 In this experiment 96 k MP-3-coded items areused as the query In addition we deliberately keep the high-frequency descriptors in computing the distances becausekeeping them makes it easier to examine the influence of Mversus identification rate (defined in experiment five ) Theexperimental results are given in Figure 9 The results showthat whenM is equal or greater than 10 the identification rate

085408560858

0860862086408660868

087087208740876

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18

Iden

tifica

tion

rate

119872 value

Figure 9The identification rate versus the numberM for 96 kMP-3query inputs

remains almost constant Therefore M = 10 is sufficient formethod II This value is then used in experiment five

65 Experiment Five Comparison between Method I andMethod II The fifth experiment compares the identificationrates between method I and method II The thresholds arecalculated using (7) and (10) respectively In this experiment375 of the 750 15-second items are randomly chosen fortraining (ie to obtain the thresholds) The rest of 375 itemsare for query Note that all of the items are excerpted from thereference items in the database In addition we test other 500query items (also with duration of 15 sec) not excerpted fromdatabase references

Since some query inputs are not in the database wealso need to consider the situations of falsely identifying anoutside database item as one inside the database and viceversa Suppose that 119879

119889query items are actually inside the

database and 119879119900query items are not inside the database

Assume that the identification system correctly identifies119873119889

inside-database query item erroneously rejects 119873119903inside-

database query items and erroneously accepts 119873119886outside

database query items Then the IDR (identification rate)

The Scientific World Journal 9

Table 5 Comparison of identification performance using method I and method II with the use of high-frequency descriptors

Uncompressed 192 k MP-3 96 k MP-3Method I Method II Method I Method II Method I Method II

IDR 988 999 99 997 841 874FAR 21 02 18 02 04 02FRR 00 00 00 05 366 296ACC 984 999 989 995 718 768

Table 6 Comparison of method I and method II for IDR FAR and FRR with high-frequency descriptors removed

Uncompressed 192 k MP-3 96 k MP-3Method I Method II Method I Method II Method I Method II

IDR 991 997 989 997 990 997FAR 15 02 19 02 16 02FRR 03 04 02 04 03 06ACC 987 995 985 995 986 994

FAR (false-accept rate) and FRR (false-rejection rate) arecomputed as follows

IDR =119873119889

119879119889minus 119873119903

(11)

FAR =119873119886

119879119900

(12)

FRR =119873119903

119879119889

(13)

In addition the accuracy of the system is computed as

ACC =119873119889+ 119873119900

119879119889+ 119879119900

(14)

where119873119900is the number of query items correctly identified as

outside-database items and can be computed as119873119900= 119879119900minus119873119886

In this experiment the low-resolution descriptors areused to find 20 candidates Next high-resolution descriptorsare used to find the best-matched one in the list If thedistance associated with the best-match is greater than thethreshold the query input is determined as not in thedatabase Otherwise the query input is identified as the best-matched one In this experiment high-frequency descriptorsare kept in the comparison With the use of (11) to (14) theaverage values after four trials are given in Table 5 Fromthe results we know that the system does not perform wellfor identifying 96 k MP-3 query items This is mainly dueto high distortion in high-frequency descriptors A similarphenomenon can also be observed in Table 2 Later onwe will repeat this experiment but discard high-frequencydescriptors

To further examine the performance differences betweenmethod I and method II we again use 96 k MP-3 items asthe query inputs But this time we vary the threshold values(for both methods) and then record the IDR FAR and FRRvalues The results are presented as ROC curves [25] andDET-like curves [26] (but without logarithm) as shown in

Figures 10 and 11 respectively The ROC curve plots FARversus TPR (true positive rate) which is defined as

TPR =119873119889

119879119889

(15)

Conceptually a better classifier should have a higher TPRfor a fixed FAR Therefore a better classifier should havean ROC curve closer to the left-upper corner With thisinterpretation from Figure 10 we know that method II is abetter classifier

In this experiment we also show DET-like curves Aldquotruerdquo DET curve plots FAR versus FRR using logarithmscales In our case however we do not use logarithm scales toexhibit the similarity between ROC and DET curves Againconceptually we know that a better classifier should have alower FRR over a fixed FAR Thus a better classifier shouldhave a curve closer to the left-bottom corner By examiningFigure 11 we again confirm thatmethod II is a better method

Since the high-frequency descriptors actually affect theidentification rate at low bitrates we again repeat this exper-iment without using high-frequency descriptors The resultsare given in Table 6 By comparing the results in Tables 5 and6 we know that removing high-frequency descriptors slightlyreduces the accuracy of uncompressed items However itgreatly improves the accuracy for 96 k MP-3 items Sincemethod II has an accuracy of at least 994 in all cases it ishighly plausible to identify copyrighted audiomaterials usingthis method

7 Conclusion

In this paper we propose a system using a multiresolutionstrategy to identify whether a piece of unknown music isidentical to one of the pieces in the database In the systemhigh-resolution descriptors are MPEG-7 audio signaturedescriptors and the low-resolution descriptors are obtainedfrom high-resolution descriptors with the aid of PCA toreduce their dimensionality Experimental results show that

10 The Scientific World Journal

0 20 40 60 80 1000

10

20

30

40

50

60

70

80

90

100

FAR ()

TPR

()

ROC curve (96 k MP-3)

Method IMethod II

Figure 10 The ROC curves for both methods

0 20 40 60 80 1000

10

20

30

40

50

60

70

80

90

100

FAR ()

FRR

()

Method IMethod II

DET curve (96 k MP-3)

Figure 11 The DET-like curves without logarithm for both meth-ods

low-resolution descriptors still have high identification accu-racies To reduce the time to generate the candidate list weuse the k-d tree structure to store low-resolution descriptorsExperimental results show that using k-d tree structureincreases the search time by ten-folds Since not every pieceof query input is within the database we also proposed twomethods to determine the distance thresholds Experimentalresults show that the proposed method II provides an accu-racy of 994 Therefore the proposed system can be usedin real applications such as identifying copyrighted audiofiles circulated over the Internet As it can be easily extendedto operate with multiple computers the proposed system

is a plausible starting point to construct a large operabledatabase

Acknowledgment

This work was supported in part by National Science Councilof Taiwan through Grants NSC 94-2213-E-027-042 and 99-2221-E-027-097

References

[1] For a brief description of the bill please refer to httpenwikipediaorgwikiSOPA

[2] R Eklund ldquoAudio watermarking techniquesrdquo httpwwwmusemagiccompaperswatermarkhtml

[3] J S R Jang and H R Lee ldquoA general framework of progressivefiltering and its application to query by singinghummingrdquoIEEE Transactions on Audio Speech and Language Processingvol 16 no 2 pp 350ndash358 2008

[4] S Baluja and M Covell ldquoAudio fingerprinting combiningcomputer visionamp data streamprocessingrdquo in Proceedings of theIEEE International Conference on Acoustics Speech and SignalProcessing (ICASSP rsquo07) pp II213ndashII216 Honolulu HawaiiUSA April 2007

[5] Shazam httpwwwshazamcommusicwebhomehtml[6] A Wang ldquoThe Shazam music recognition servicerdquo Communi-

cations of the ACM vol 49 no 8 pp 44ndash48 2006[7] V Chandrasekha M Sharifi and D A Ross ldquoSurvey and eval-

uation of audio fingerprinting schemes for mobile query-by-example applicationsrdquo in Proceedinjgs of the 12th InternationalConference onMusic Information Retrieval pp 801ndash806MiamiFla USA October 2011

[8] J A Haitsma and T Kalker ldquoA highly robust audio fingerprint-ing systemrdquo in Proceedinjgs of the 12th International Conferenceon Music Information Retrieval pp 107ndash115 Paris FranceOctober 2002

[9] C J C Burges J C Platt and S Jana ldquoDistortion discriminantanalysis for audio fingerprintingrdquo IEEE Transactions on Speechand Audio Processing vol 11 no 3 pp 165ndash174 2003

[10] Official website httpmusicbrainzorg[11] Official website httpwwwaudiblemagiccom[12] Official website httpwwwgracenotecom[13] httpenwikipediaorgwikiAcoustic fingerprint[14] P J O Doets M M Gisbert and R L Lagendijk ldquoOn

the comparison of audio fingerprints for extracting qualityparameters of compressed audiordquo in Security Steganographyand Watermarking of Multimedia Contents VIII vol 6072 ofProceedings of SPIE pp 60720L-1ndash60720L-12 San Jose CalifUSA January 2006

[15] F Nack and A T Lindsay ldquoEverything you wanted to knowaboutMPEG-7 part 1rdquo IEEEMultimedia vol 6 no 3 pp 65ndash771999

[16] F Nack and A Lindsay ldquoEverything you wanted to know aboutMPEG-7 part 2rdquo IEEEMultimedia vol 6 no 4 pp 64ndash73 1999

[17] ISOIEC Information TechnologymdashMultimedia ContentDescription Interface -Part 4 Audio IS 15938-4 2002

[18] H Crysandt ldquoMusic identification with MPEG-7rdquo in Proceed-ings of the 115th AES Convention October 2003 Paper 5967

The Scientific World Journal 11

[19] OHellmuth E AllamanceMCremerHGrossmann J Herreand T Kastner ldquoUsing MPEG-7 audio fingerprinting in real-world applicationrdquo in Proceedings of the 115th AES ConventionOctober 2003 Paper 5961

[20] H-G Kim N Moreau and T Sikora MPEG-7 Audio andBeyond John Wiley amp Sons 2005

[21] J-Y Lee and S D You ldquoDimension-reduction technique forMPEG-7 Audio Descriptorsrdquo in Proceedings of the 6th PacificRim Conference on Multimedia (PCM rsquo05) vol 3768 of LectureNotes in Computer Science pp 526ndash537 2005

[22] J L Bentley ldquoMultidimensional binary search trees used forassociative searchingrdquo Communications of the ACM vol 18 no9 pp 509ndash517 1975

[23] Open source program for k-d tree httpwwwcsumdedusimmountANN

[24] J Shlens ldquoA tutorial on principal component analysisrdquohttpwwwsnlsalkedusimshlenspcapdf

[25] T Fawcell ldquoROC graphs notes and practical considerations forresearchersrdquo Tech Rep HP Lab 2004

[26] A Martin G Doddington T Kamm M Ordowski and MPrzybocki ldquoThe DET curve in assessment of detection taskperformancerdquo in Proceedings of the Eurospeech vol 4 pp 1899ndash1903 Rhodes Greece September 1997

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 5: Research Article Music Identification System Using MPEG-7 ...downloads.hindawi.com/journals/tswj/2013/752464.pdf · Gracenote s MusicID [ ]. According to [ ], there are more ... one

The Scientific World Journal 5

Query

Database

Dim reduction

In database

Reject Result

Threshold

Candidate

MPEG-7descriptors

Fullresolution

search

Lowresolution

search

YesNo

Figure 3 Procedure for multiresolution music identification

Descriptors

Descriptors Descriptors1198861198991 11988611989924 119886119899+11 119886119899+124

1199021198991 11990211989924

Figure 4 The audio samples and the descriptors In a real applica-tion 119886

119899119896is stored in the database whereas 119902

119899119896is computed from

the query input Usually 119886119899119896

is not equal to 119902119899119896

due to differentscopes of the windows even though they are all derived from thesame soundtrack

not highly similar their corresponding descriptors usuallyhave large differences With a scaling ratio of 256 eachdescriptor represents around 77 seconds of audio samplesTherefore unless the query input also starts at a point veryclose to the segment boundary of a soundtrack a comparisonbased on these descriptors may fail As illustrated in Figure 4descriptors of 119886

119899119896(or 119886119899+1119896

) and 119902119899119896

are quite different andtherefore the descriptor-based identification scheme cannotidentify the query inputTherefore a suitable time resolutionfor example 048 second should be maintained

42 Proposed Dimensionality Reduction Method In contrastto reduce the number of descriptors by increasing the scalingratio we may reduce them in each temporal-spectral block(ie the matrix in Figure 1) representing a segment of (15-second) music However to maintain a high time resolutionsuccessive temporal-spectral block should have a small timedifference as shown in Figure 5 With this arrangement weachieve both high identification rate and low comparisoncomplexity

For each temporal-spectral block we use PCA (princi-pal component analysis) [21 24] to reduce the number ofdescriptors Since it is difficult to directly use PCA to obtaingood low-resolution descriptors [21] we use an alternativeapproach Its basic idea is to partition the entire block into

119860 =

middot middot middot

119886124

11988621 119886224

119886311

119886321

1198863124

middot middot middot

11988611

11988631 119886324middot middot middot

middot middot middot

middot middot middot

middot middot middot

1198863224

119902331 1199023324

middot middot middot

middot middot middot

middot middot middot 1198866224119886621

11989111 11989118

11989121 11989128

11989131 11989138

Figure 5 Low-resolution descriptors In the figure each temporal-spectral block corresponds to descriptors from 15-second music

119860 = =11986011 1198601211986021 11986022

middot middot middot middot middot middot

middot middot middot

middot middot middot

middot middot middot

middot middot middot

middot middot middot

middot middot middot

middot middot middot

middot middot middot

11988611 11988618

119886151

119886161

119886291

119886158

119886168

119886298

11988619

119886159

119886169

119886299

119886116

1198861516

1198861616

1198862916 1198862917 1198862924

119886124119886117

Figure 6 Partition of a temporal-spectral block into four subblocksin the experiments

four subblocks as shown in Figure 6 and then use PCA toreduce the number of descriptors of each subblock into twovalues (descriptors) called low-resolution descriptors Alsohigh-frequency descriptors in the original temporal-spectralblock are all discarded because they are susceptible to noise[21] With this arrangement totally eight descriptors are usedto represent a segment of 15-second music Note that theactual duration of a segment used in the experiments is 1404seconds (though we still say 15 seconds) because some audiosamples are lost after MP-3 compression and decompressionin the experiments

6 The Scientific World Journal

Table 2 Identification accuracy using high-resolution features

Uncompressed 192 k MP-3 96 k MP-3High-frequencydescriptors Used Not used Used Not used Used Not used

Match in firstone 100 100 100 100 94 100

Match in first 15 100 100 100 100 98 100

Wenowdescribe how to calculate low-resolution descrip-tors Since using PCA for dimensionality reduction is a well-known approach we only describe how to construct thecovariance matrix for PCA computation and omit the com-putation of finding principal components Suppose that thereare N segments of music pieces in the database with theirdescriptor matrices denoted as 119860(1) to 119860(119873) To simplify theargument we consider the subblock matrix119860(119896)

11(referring to

Figure 6) in the following Other subblocks can be computedby the same procedure First collect all subblocks from 119860

(119896)

11

to form a big matrix as follows

119861 =

[[[

[

119886(1)

11sdot sdot sdot 119886

(1)

18119886(2)

11sdot sdot sdot 119886

(2)

18sdot sdot sdot 119886(119873)

11sdot sdot sdot 119886(119873)

18

sdot sdot sdot

119886(1)

151sdot sdot sdot 119886(1)

158119886(2)

151sdot sdot sdot 119886(2)

158sdot sdot sdot 119886(119873)

151sdot sdot sdot 119886(119873)

158

]]]

]

= [

[

uarr sdot sdot sdot uarr

b1sdot sdot sdot b8119873

darr sdot sdot sdot darr

]

]

(5)

Then we may consider column vectors of matrix B as b119894

vectors Having the data vectors b119894 the covariance matrix

and subsequently the principal components can be foundBy keeping only two principal components we can obtainc119894= [11988811989411198881198942] from b

119894 Since there are eight column vectors in

a subblock descriptors in119860(119896)11

are reduced to eight c119894(8119896minus7 le

119894 le 8119896) vectors Next we can rearrange the obtained c119894vectors

as

119862 =[[

[

11988811

11988812

11988891

11988892

sdot sdot sdot 1198888119873minus71

1198888119873minus72

sdot sdot sdot

11988881

11988882

119888161

119888162

sdot sdot sdot 11988881198731

11988881198732

]]

]

(6)

Again by treating each row of matrix C as the data to beprocessed by PCA we can compute the covariance matrixand finally reduce each column vector to one value Sincethere are two columns originally from one subblock thereare totally two (low-resolution) descriptors per subblockEquivalently a segment of 15-second music is representedby eight descriptors During the computation the principalcomponents obtained in the first and the second steps shouldbe stored in the database When a query input (in the formof high-resolution descriptors) is supplied these componentsare used to obtain the low-resolution descriptors for theinput

Table 3 Comparison of identification accuracies using low-resolution descriptors obtained by the proposed approach and bythe averaging (avg) method [21]

Uncompressed 192 k MP-3 96 k MP-3Method Proposed Avg Proposed Avg Proposed AvgMatch in firstone 993 995 991 993 901 898

Match in first15 100 100 100 100 996 995

5 Threshold to Determine the Membership ofthe Query Input

Since in practice the database cannot collect all musicsoundtracks in the world we have to have a strategy todetermine if the query input is actually in the database ornot In our case the decision is accomplished with the aid ofa threshold If the shortest distances between the input andthe candidates are greater than a threshold then the queryinput is not in the database otherwise it is in the databaseThough the concept is simple it is not trivial to determinethe threshold [20]

Recall that when a query input is applied to the proposedsystem the system computes the Euclidean distances betweenthe low-resolution descriptors of the query input and thosein the database Accordingly m (say m = 20) segments ofsoundtracks from distinct titles are selected based on thecomputed distances Next the Euclidean distances betweenthe query and the selected segments are computed againusing full-resolution descriptors By sorting the distancesfrom small to large the system creates a candidate list for thequery input

To compute the previously mentioned threshold weexamine two different methods In the first method (methodI) we define the ldquofirstrdquo distance 119889

1as the (full-resolution)

distance associated with the first candidate in the list Simi-larly the ldquosecondrdquo distance 119889

2is the distance associated with

the second candidate as shown in Figure 7 The methodto compute the first and second distances is used both intraining and in identification For the training phase assumethat there are 119873

119879soundtracks used For a segment from a

soundtrack indexed n let the first distance be 1198891(119899) and the

second distance be 1198892(119899) The threshold 119879

119860is then computed

as

119879119860= 05 sdot (

119873119879

sum

119899=1

1198891(119899) +

119873119879

sum

119899=1

1198892(119899)) (7)

During the identification phase the first distance forthe query input is computed If the first distance is smallerthan the threshold 119879

119860 the query input is determined to be

highly similar to the top soundtrack title in the candidate listOtherwise the query input is not inside the database

In addition to method I we also propose an alternativemethod (method II) to compute the first and seconddistances

The Scientific World Journal 7

Firs

t dist

ance

Seco

nd d

istan

ce

Figure 7 The first and second distances in method I

and the threshold Referring to Figure 8 the first distance1198891015840

1(119899) of method II for a training segment in soundtrack n is

1198891015840

1(119899) = 119889

1(119899) divide

sum119872

119898=2119889119898(119899)

119872 minus 1 (8)

where 119889119898(119899) is the distance associatedwith themth candidate

in the list andM is a constant Based on our experimentM =10 is sufficient By the same arrangement the second distance1198891015840

2(119899) is calculated as

1198891015840

2(119899) = 119889

2(119899) divide

sum119872+1

119898=3119889119898(119899)

119872 minus 1 (9)

Similar to (7) the threshold 1198791015840119860is determined as

1198791015840

119860= 05 sdot (

119873119879

sum

119899=1

1198891015840

1(119899) +

119873119879

sum

119899=1

1198891015840

2(119899)) (10)

During identification phase if the computed distance 11988910158401

is greater than1198791015840119860 the query input is determined as not in the

database Otherwise the query input is highly similar to thetop soundtrack title in the candidate list

6 Experiments and Results

Before conducting the experiments we collect 750 sound-tracks frommany CD titles for constructing the database andfor identification In the experiments 30 seconds of musicis excerpted from the soundtracks as the reference items tobe identified Then low- and full-resolution descriptors ofthe reference items are calculated and stored in the databaseWe also randomly excerpt 15-second query items from thereference items Note that a query item may start fromany sample on the first half of the reference To test theidentification accuracy for compressed audio the 15-secondquery inputs are also encoded and then decoded with anMP-3 coder with bitrates of 192 k and 96 k respectivelyHowever the database only contains descriptors from theuncompressed items In addition the principal componentsare also obtained using uncompressed items

Table 4 Comparison of search time between 119896-119889 tree search andlinear search

119870-119889 tree Linear searchAverage time 0255 sec 323 sec

Several experiments are conducted to examine the per-formance of the proposed systemThefirst experiment checksthe identification accuracy using high-resolution descriptorsThe second one compares the relative identification accuracybetween the proposed method (in Section 42) and themethod given in [21] The third experiment compares thesearch speed by using the linear search and k-d tree searchThe next experiment intends to determine a suitable valueof M used in (8) and (9) Having the value of M we reportthe identification accuracy using a multiresolution strategyin experiment five In this experiment both method I andmethod II are examined To have a complete comparisonbetween method I and II we also report the results in termsof the ROC (receiver operating characteristics) [25] curveand the DET-like (detection error tradeoff) [26] curve butwithout logarithm

61 Experiment One Identification Accuracy Using High-Resolution Descriptors The first experiment is to examinethe identification accuracy using the high-resolutionMPEG-7 audio signature descriptors (without reduction) To evaluatethe influences of high-frequency descriptors we also exam-ine the accuracies with and without using high-frequency(greater than 25 kHz) descriptors The results are given inTable 2 In the table ldquomatch in first onerdquo is the rate that thequery input is correctly identified with the shortest distanceamong all reference items Similarly ldquomatch in first 15rdquo is therate that the query is correctly identified within a list of 15 ref-erence items sorted by distanceThe results indicate that high-resolution descriptors have very good identification accuracyAlso for uncompressed or 192 k MP-3 items whether usinghigh-frequency descriptors does not affect the identificationaccuracy However for 96 k MP-3 items discarding high-frequency descriptors greatly improves the accuracy

8 The Scientific World Journal

Seco

nd d

istan

ce

Firs

t dist

ance

AvgAvg

dividedivide

Figure 8 The first and second distances in method II

62 Experiment Two Identification Accuracy Using Low-Resolution Descriptors The second experiment compares therelative identification accuracies of the proposed dimension-ality reduction technique and the averaging method pro-posed in [21] In this experiment eight descriptors are used torepresent one segment of music (15 seconds) The results areshown in Table 3 For this experiment the accuracy rate inldquomatch in first 15rdquo ismore important than the figure in ldquomatchin first onerdquo because the low-resolution descriptors are usedto produce a candidate list Therefore it is acceptable as longas the correct item is within the list as the high-resolutiondescriptors are to be used in the second stage In this regardthe proposed approach is slightly better than the averagingmethod proposed in [21] for 96 k MP-3 query inputs

63 Experiment Three Time for k-d Tree Search and LinearSearch The third experiment compares the time to generatea list of 20 candidates by using k-d tree search and linear(exhausted) search for low-resolution descriptors The com-puting time is obtained by averaging the required time in tentrials For each trial 500 query items are randomly selectedamong the 750 available items The experimental platform isa personal computer with a 16GHz PentiumCPU and 15 GBmemory The program is written in C with a Borland C++compiler The results are shown in Table 4 The results showthat we can increase the search time by 10-fold if the low-resolution descriptors are embedded in the k-d tree structure

64 Experiment Four Determination of M in Method II Thefourth experiment is to determine the numberM in methodII Recall that the thresholds in both methods are computedbased on high-resolution descriptors To determine the opti-mal value of M we vary this value from one to eighteen in(8) and (9) to compute the threshold 1198791015840

119860and then to perform

identification Note that method II is degenerated to methodI if M = 1 In this experiment 96 k MP-3-coded items areused as the query In addition we deliberately keep the high-frequency descriptors in computing the distances becausekeeping them makes it easier to examine the influence of Mversus identification rate (defined in experiment five ) Theexperimental results are given in Figure 9 The results showthat whenM is equal or greater than 10 the identification rate

085408560858

0860862086408660868

087087208740876

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18

Iden

tifica

tion

rate

119872 value

Figure 9The identification rate versus the numberM for 96 kMP-3query inputs

remains almost constant Therefore M = 10 is sufficient formethod II This value is then used in experiment five

65 Experiment Five Comparison between Method I andMethod II The fifth experiment compares the identificationrates between method I and method II The thresholds arecalculated using (7) and (10) respectively In this experiment375 of the 750 15-second items are randomly chosen fortraining (ie to obtain the thresholds) The rest of 375 itemsare for query Note that all of the items are excerpted from thereference items in the database In addition we test other 500query items (also with duration of 15 sec) not excerpted fromdatabase references

Since some query inputs are not in the database wealso need to consider the situations of falsely identifying anoutside database item as one inside the database and viceversa Suppose that 119879

119889query items are actually inside the

database and 119879119900query items are not inside the database

Assume that the identification system correctly identifies119873119889

inside-database query item erroneously rejects 119873119903inside-

database query items and erroneously accepts 119873119886outside

database query items Then the IDR (identification rate)

The Scientific World Journal 9

Table 5 Comparison of identification performance using method I and method II with the use of high-frequency descriptors

Uncompressed 192 k MP-3 96 k MP-3Method I Method II Method I Method II Method I Method II

IDR 988 999 99 997 841 874FAR 21 02 18 02 04 02FRR 00 00 00 05 366 296ACC 984 999 989 995 718 768

Table 6 Comparison of method I and method II for IDR FAR and FRR with high-frequency descriptors removed

Uncompressed 192 k MP-3 96 k MP-3Method I Method II Method I Method II Method I Method II

IDR 991 997 989 997 990 997FAR 15 02 19 02 16 02FRR 03 04 02 04 03 06ACC 987 995 985 995 986 994

FAR (false-accept rate) and FRR (false-rejection rate) arecomputed as follows

IDR =119873119889

119879119889minus 119873119903

(11)

FAR =119873119886

119879119900

(12)

FRR =119873119903

119879119889

(13)

In addition the accuracy of the system is computed as

ACC =119873119889+ 119873119900

119879119889+ 119879119900

(14)

where119873119900is the number of query items correctly identified as

outside-database items and can be computed as119873119900= 119879119900minus119873119886

In this experiment the low-resolution descriptors areused to find 20 candidates Next high-resolution descriptorsare used to find the best-matched one in the list If thedistance associated with the best-match is greater than thethreshold the query input is determined as not in thedatabase Otherwise the query input is identified as the best-matched one In this experiment high-frequency descriptorsare kept in the comparison With the use of (11) to (14) theaverage values after four trials are given in Table 5 Fromthe results we know that the system does not perform wellfor identifying 96 k MP-3 query items This is mainly dueto high distortion in high-frequency descriptors A similarphenomenon can also be observed in Table 2 Later onwe will repeat this experiment but discard high-frequencydescriptors

To further examine the performance differences betweenmethod I and method II we again use 96 k MP-3 items asthe query inputs But this time we vary the threshold values(for both methods) and then record the IDR FAR and FRRvalues The results are presented as ROC curves [25] andDET-like curves [26] (but without logarithm) as shown in

Figures 10 and 11 respectively The ROC curve plots FARversus TPR (true positive rate) which is defined as

TPR =119873119889

119879119889

(15)

Conceptually a better classifier should have a higher TPRfor a fixed FAR Therefore a better classifier should havean ROC curve closer to the left-upper corner With thisinterpretation from Figure 10 we know that method II is abetter classifier

In this experiment we also show DET-like curves Aldquotruerdquo DET curve plots FAR versus FRR using logarithmscales In our case however we do not use logarithm scales toexhibit the similarity between ROC and DET curves Againconceptually we know that a better classifier should have alower FRR over a fixed FAR Thus a better classifier shouldhave a curve closer to the left-bottom corner By examiningFigure 11 we again confirm thatmethod II is a better method

Since the high-frequency descriptors actually affect theidentification rate at low bitrates we again repeat this exper-iment without using high-frequency descriptors The resultsare given in Table 6 By comparing the results in Tables 5 and6 we know that removing high-frequency descriptors slightlyreduces the accuracy of uncompressed items However itgreatly improves the accuracy for 96 k MP-3 items Sincemethod II has an accuracy of at least 994 in all cases it ishighly plausible to identify copyrighted audiomaterials usingthis method

7 Conclusion

In this paper we propose a system using a multiresolutionstrategy to identify whether a piece of unknown music isidentical to one of the pieces in the database In the systemhigh-resolution descriptors are MPEG-7 audio signaturedescriptors and the low-resolution descriptors are obtainedfrom high-resolution descriptors with the aid of PCA toreduce their dimensionality Experimental results show that

10 The Scientific World Journal

0 20 40 60 80 1000

10

20

30

40

50

60

70

80

90

100

FAR ()

TPR

()

ROC curve (96 k MP-3)

Method IMethod II

Figure 10 The ROC curves for both methods

0 20 40 60 80 1000

10

20

30

40

50

60

70

80

90

100

FAR ()

FRR

()

Method IMethod II

DET curve (96 k MP-3)

Figure 11 The DET-like curves without logarithm for both meth-ods

low-resolution descriptors still have high identification accu-racies To reduce the time to generate the candidate list weuse the k-d tree structure to store low-resolution descriptorsExperimental results show that using k-d tree structureincreases the search time by ten-folds Since not every pieceof query input is within the database we also proposed twomethods to determine the distance thresholds Experimentalresults show that the proposed method II provides an accu-racy of 994 Therefore the proposed system can be usedin real applications such as identifying copyrighted audiofiles circulated over the Internet As it can be easily extendedto operate with multiple computers the proposed system

is a plausible starting point to construct a large operabledatabase

Acknowledgment

This work was supported in part by National Science Councilof Taiwan through Grants NSC 94-2213-E-027-042 and 99-2221-E-027-097

References

[1] For a brief description of the bill please refer to httpenwikipediaorgwikiSOPA

[2] R Eklund ldquoAudio watermarking techniquesrdquo httpwwwmusemagiccompaperswatermarkhtml

[3] J S R Jang and H R Lee ldquoA general framework of progressivefiltering and its application to query by singinghummingrdquoIEEE Transactions on Audio Speech and Language Processingvol 16 no 2 pp 350ndash358 2008

[4] S Baluja and M Covell ldquoAudio fingerprinting combiningcomputer visionamp data streamprocessingrdquo in Proceedings of theIEEE International Conference on Acoustics Speech and SignalProcessing (ICASSP rsquo07) pp II213ndashII216 Honolulu HawaiiUSA April 2007

[5] Shazam httpwwwshazamcommusicwebhomehtml[6] A Wang ldquoThe Shazam music recognition servicerdquo Communi-

cations of the ACM vol 49 no 8 pp 44ndash48 2006[7] V Chandrasekha M Sharifi and D A Ross ldquoSurvey and eval-

uation of audio fingerprinting schemes for mobile query-by-example applicationsrdquo in Proceedinjgs of the 12th InternationalConference onMusic Information Retrieval pp 801ndash806MiamiFla USA October 2011

[8] J A Haitsma and T Kalker ldquoA highly robust audio fingerprint-ing systemrdquo in Proceedinjgs of the 12th International Conferenceon Music Information Retrieval pp 107ndash115 Paris FranceOctober 2002

[9] C J C Burges J C Platt and S Jana ldquoDistortion discriminantanalysis for audio fingerprintingrdquo IEEE Transactions on Speechand Audio Processing vol 11 no 3 pp 165ndash174 2003

[10] Official website httpmusicbrainzorg[11] Official website httpwwwaudiblemagiccom[12] Official website httpwwwgracenotecom[13] httpenwikipediaorgwikiAcoustic fingerprint[14] P J O Doets M M Gisbert and R L Lagendijk ldquoOn

the comparison of audio fingerprints for extracting qualityparameters of compressed audiordquo in Security Steganographyand Watermarking of Multimedia Contents VIII vol 6072 ofProceedings of SPIE pp 60720L-1ndash60720L-12 San Jose CalifUSA January 2006

[15] F Nack and A T Lindsay ldquoEverything you wanted to knowaboutMPEG-7 part 1rdquo IEEEMultimedia vol 6 no 3 pp 65ndash771999

[16] F Nack and A Lindsay ldquoEverything you wanted to know aboutMPEG-7 part 2rdquo IEEEMultimedia vol 6 no 4 pp 64ndash73 1999

[17] ISOIEC Information TechnologymdashMultimedia ContentDescription Interface -Part 4 Audio IS 15938-4 2002

[18] H Crysandt ldquoMusic identification with MPEG-7rdquo in Proceed-ings of the 115th AES Convention October 2003 Paper 5967

The Scientific World Journal 11

[19] OHellmuth E AllamanceMCremerHGrossmann J Herreand T Kastner ldquoUsing MPEG-7 audio fingerprinting in real-world applicationrdquo in Proceedings of the 115th AES ConventionOctober 2003 Paper 5961

[20] H-G Kim N Moreau and T Sikora MPEG-7 Audio andBeyond John Wiley amp Sons 2005

[21] J-Y Lee and S D You ldquoDimension-reduction technique forMPEG-7 Audio Descriptorsrdquo in Proceedings of the 6th PacificRim Conference on Multimedia (PCM rsquo05) vol 3768 of LectureNotes in Computer Science pp 526ndash537 2005

[22] J L Bentley ldquoMultidimensional binary search trees used forassociative searchingrdquo Communications of the ACM vol 18 no9 pp 509ndash517 1975

[23] Open source program for k-d tree httpwwwcsumdedusimmountANN

[24] J Shlens ldquoA tutorial on principal component analysisrdquohttpwwwsnlsalkedusimshlenspcapdf

[25] T Fawcell ldquoROC graphs notes and practical considerations forresearchersrdquo Tech Rep HP Lab 2004

[26] A Martin G Doddington T Kamm M Ordowski and MPrzybocki ldquoThe DET curve in assessment of detection taskperformancerdquo in Proceedings of the Eurospeech vol 4 pp 1899ndash1903 Rhodes Greece September 1997

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 6: Research Article Music Identification System Using MPEG-7 ...downloads.hindawi.com/journals/tswj/2013/752464.pdf · Gracenote s MusicID [ ]. According to [ ], there are more ... one

6 The Scientific World Journal

Table 2 Identification accuracy using high-resolution features

Uncompressed 192 k MP-3 96 k MP-3High-frequencydescriptors Used Not used Used Not used Used Not used

Match in firstone 100 100 100 100 94 100

Match in first 15 100 100 100 100 98 100

Wenowdescribe how to calculate low-resolution descrip-tors Since using PCA for dimensionality reduction is a well-known approach we only describe how to construct thecovariance matrix for PCA computation and omit the com-putation of finding principal components Suppose that thereare N segments of music pieces in the database with theirdescriptor matrices denoted as 119860(1) to 119860(119873) To simplify theargument we consider the subblock matrix119860(119896)

11(referring to

Figure 6) in the following Other subblocks can be computedby the same procedure First collect all subblocks from 119860

(119896)

11

to form a big matrix as follows

119861 =

[[[

[

119886(1)

11sdot sdot sdot 119886

(1)

18119886(2)

11sdot sdot sdot 119886

(2)

18sdot sdot sdot 119886(119873)

11sdot sdot sdot 119886(119873)

18

sdot sdot sdot

119886(1)

151sdot sdot sdot 119886(1)

158119886(2)

151sdot sdot sdot 119886(2)

158sdot sdot sdot 119886(119873)

151sdot sdot sdot 119886(119873)

158

]]]

]

= [

[

uarr sdot sdot sdot uarr

b1sdot sdot sdot b8119873

darr sdot sdot sdot darr

]

]

(5)

Then we may consider column vectors of matrix B as b119894

vectors Having the data vectors b119894 the covariance matrix

and subsequently the principal components can be foundBy keeping only two principal components we can obtainc119894= [11988811989411198881198942] from b

119894 Since there are eight column vectors in

a subblock descriptors in119860(119896)11

are reduced to eight c119894(8119896minus7 le

119894 le 8119896) vectors Next we can rearrange the obtained c119894vectors

as

119862 =[[

[

11988811

11988812

11988891

11988892

sdot sdot sdot 1198888119873minus71

1198888119873minus72

sdot sdot sdot

11988881

11988882

119888161

119888162

sdot sdot sdot 11988881198731

11988881198732

]]

]

(6)

Again by treating each row of matrix C as the data to beprocessed by PCA we can compute the covariance matrixand finally reduce each column vector to one value Sincethere are two columns originally from one subblock thereare totally two (low-resolution) descriptors per subblockEquivalently a segment of 15-second music is representedby eight descriptors During the computation the principalcomponents obtained in the first and the second steps shouldbe stored in the database When a query input (in the formof high-resolution descriptors) is supplied these componentsare used to obtain the low-resolution descriptors for theinput

Table 3 Comparison of identification accuracies using low-resolution descriptors obtained by the proposed approach and bythe averaging (avg) method [21]

Uncompressed 192 k MP-3 96 k MP-3Method Proposed Avg Proposed Avg Proposed AvgMatch in firstone 993 995 991 993 901 898

Match in first15 100 100 100 100 996 995

5 Threshold to Determine the Membership ofthe Query Input

Since in practice the database cannot collect all musicsoundtracks in the world we have to have a strategy todetermine if the query input is actually in the database ornot In our case the decision is accomplished with the aid ofa threshold If the shortest distances between the input andthe candidates are greater than a threshold then the queryinput is not in the database otherwise it is in the databaseThough the concept is simple it is not trivial to determinethe threshold [20]

Recall that when a query input is applied to the proposedsystem the system computes the Euclidean distances betweenthe low-resolution descriptors of the query input and thosein the database Accordingly m (say m = 20) segments ofsoundtracks from distinct titles are selected based on thecomputed distances Next the Euclidean distances betweenthe query and the selected segments are computed againusing full-resolution descriptors By sorting the distancesfrom small to large the system creates a candidate list for thequery input

To compute the previously mentioned threshold weexamine two different methods In the first method (methodI) we define the ldquofirstrdquo distance 119889

1as the (full-resolution)

distance associated with the first candidate in the list Simi-larly the ldquosecondrdquo distance 119889

2is the distance associated with

the second candidate as shown in Figure 7 The methodto compute the first and second distances is used both intraining and in identification For the training phase assumethat there are 119873

119879soundtracks used For a segment from a

soundtrack indexed n let the first distance be 1198891(119899) and the

second distance be 1198892(119899) The threshold 119879

119860is then computed

as

119879119860= 05 sdot (

119873119879

sum

119899=1

1198891(119899) +

119873119879

sum

119899=1

1198892(119899)) (7)

During the identification phase the first distance forthe query input is computed If the first distance is smallerthan the threshold 119879

119860 the query input is determined to be

highly similar to the top soundtrack title in the candidate listOtherwise the query input is not inside the database

In addition to method I we also propose an alternativemethod (method II) to compute the first and seconddistances

The Scientific World Journal 7

Firs

t dist

ance

Seco

nd d

istan

ce

Figure 7 The first and second distances in method I

and the threshold Referring to Figure 8 the first distance1198891015840

1(119899) of method II for a training segment in soundtrack n is

1198891015840

1(119899) = 119889

1(119899) divide

sum119872

119898=2119889119898(119899)

119872 minus 1 (8)

where 119889119898(119899) is the distance associatedwith themth candidate

in the list andM is a constant Based on our experimentM =10 is sufficient By the same arrangement the second distance1198891015840

2(119899) is calculated as

1198891015840

2(119899) = 119889

2(119899) divide

sum119872+1

119898=3119889119898(119899)

119872 minus 1 (9)

Similar to (7) the threshold 1198791015840119860is determined as

1198791015840

119860= 05 sdot (

119873119879

sum

119899=1

1198891015840

1(119899) +

119873119879

sum

119899=1

1198891015840

2(119899)) (10)

During identification phase if the computed distance 11988910158401

is greater than1198791015840119860 the query input is determined as not in the

database Otherwise the query input is highly similar to thetop soundtrack title in the candidate list

6 Experiments and Results

Before conducting the experiments we collect 750 sound-tracks frommany CD titles for constructing the database andfor identification In the experiments 30 seconds of musicis excerpted from the soundtracks as the reference items tobe identified Then low- and full-resolution descriptors ofthe reference items are calculated and stored in the databaseWe also randomly excerpt 15-second query items from thereference items Note that a query item may start fromany sample on the first half of the reference To test theidentification accuracy for compressed audio the 15-secondquery inputs are also encoded and then decoded with anMP-3 coder with bitrates of 192 k and 96 k respectivelyHowever the database only contains descriptors from theuncompressed items In addition the principal componentsare also obtained using uncompressed items

Table 4 Comparison of search time between 119896-119889 tree search andlinear search

119870-119889 tree Linear searchAverage time 0255 sec 323 sec

Several experiments are conducted to examine the per-formance of the proposed systemThefirst experiment checksthe identification accuracy using high-resolution descriptorsThe second one compares the relative identification accuracybetween the proposed method (in Section 42) and themethod given in [21] The third experiment compares thesearch speed by using the linear search and k-d tree searchThe next experiment intends to determine a suitable valueof M used in (8) and (9) Having the value of M we reportthe identification accuracy using a multiresolution strategyin experiment five In this experiment both method I andmethod II are examined To have a complete comparisonbetween method I and II we also report the results in termsof the ROC (receiver operating characteristics) [25] curveand the DET-like (detection error tradeoff) [26] curve butwithout logarithm

61 Experiment One Identification Accuracy Using High-Resolution Descriptors The first experiment is to examinethe identification accuracy using the high-resolutionMPEG-7 audio signature descriptors (without reduction) To evaluatethe influences of high-frequency descriptors we also exam-ine the accuracies with and without using high-frequency(greater than 25 kHz) descriptors The results are given inTable 2 In the table ldquomatch in first onerdquo is the rate that thequery input is correctly identified with the shortest distanceamong all reference items Similarly ldquomatch in first 15rdquo is therate that the query is correctly identified within a list of 15 ref-erence items sorted by distanceThe results indicate that high-resolution descriptors have very good identification accuracyAlso for uncompressed or 192 k MP-3 items whether usinghigh-frequency descriptors does not affect the identificationaccuracy However for 96 k MP-3 items discarding high-frequency descriptors greatly improves the accuracy

8 The Scientific World Journal

Seco

nd d

istan

ce

Firs

t dist

ance

AvgAvg

dividedivide

Figure 8 The first and second distances in method II

62 Experiment Two Identification Accuracy Using Low-Resolution Descriptors The second experiment compares therelative identification accuracies of the proposed dimension-ality reduction technique and the averaging method pro-posed in [21] In this experiment eight descriptors are used torepresent one segment of music (15 seconds) The results areshown in Table 3 For this experiment the accuracy rate inldquomatch in first 15rdquo ismore important than the figure in ldquomatchin first onerdquo because the low-resolution descriptors are usedto produce a candidate list Therefore it is acceptable as longas the correct item is within the list as the high-resolutiondescriptors are to be used in the second stage In this regardthe proposed approach is slightly better than the averagingmethod proposed in [21] for 96 k MP-3 query inputs

63 Experiment Three Time for k-d Tree Search and LinearSearch The third experiment compares the time to generatea list of 20 candidates by using k-d tree search and linear(exhausted) search for low-resolution descriptors The com-puting time is obtained by averaging the required time in tentrials For each trial 500 query items are randomly selectedamong the 750 available items The experimental platform isa personal computer with a 16GHz PentiumCPU and 15 GBmemory The program is written in C with a Borland C++compiler The results are shown in Table 4 The results showthat we can increase the search time by 10-fold if the low-resolution descriptors are embedded in the k-d tree structure

64 Experiment Four Determination of M in Method II Thefourth experiment is to determine the numberM in methodII Recall that the thresholds in both methods are computedbased on high-resolution descriptors To determine the opti-mal value of M we vary this value from one to eighteen in(8) and (9) to compute the threshold 1198791015840

119860and then to perform

identification Note that method II is degenerated to methodI if M = 1 In this experiment 96 k MP-3-coded items areused as the query In addition we deliberately keep the high-frequency descriptors in computing the distances becausekeeping them makes it easier to examine the influence of Mversus identification rate (defined in experiment five ) Theexperimental results are given in Figure 9 The results showthat whenM is equal or greater than 10 the identification rate

085408560858

0860862086408660868

087087208740876

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18

Iden

tifica

tion

rate

119872 value

Figure 9The identification rate versus the numberM for 96 kMP-3query inputs

remains almost constant Therefore M = 10 is sufficient formethod II This value is then used in experiment five

65 Experiment Five Comparison between Method I andMethod II The fifth experiment compares the identificationrates between method I and method II The thresholds arecalculated using (7) and (10) respectively In this experiment375 of the 750 15-second items are randomly chosen fortraining (ie to obtain the thresholds) The rest of 375 itemsare for query Note that all of the items are excerpted from thereference items in the database In addition we test other 500query items (also with duration of 15 sec) not excerpted fromdatabase references

Since some query inputs are not in the database wealso need to consider the situations of falsely identifying anoutside database item as one inside the database and viceversa Suppose that 119879

119889query items are actually inside the

database and 119879119900query items are not inside the database

Assume that the identification system correctly identifies119873119889

inside-database query item erroneously rejects 119873119903inside-

database query items and erroneously accepts 119873119886outside

database query items Then the IDR (identification rate)

The Scientific World Journal 9

Table 5 Comparison of identification performance using method I and method II with the use of high-frequency descriptors

Uncompressed 192 k MP-3 96 k MP-3Method I Method II Method I Method II Method I Method II

IDR 988 999 99 997 841 874FAR 21 02 18 02 04 02FRR 00 00 00 05 366 296ACC 984 999 989 995 718 768

Table 6 Comparison of method I and method II for IDR FAR and FRR with high-frequency descriptors removed

Uncompressed 192 k MP-3 96 k MP-3Method I Method II Method I Method II Method I Method II

IDR 991 997 989 997 990 997FAR 15 02 19 02 16 02FRR 03 04 02 04 03 06ACC 987 995 985 995 986 994

FAR (false-accept rate) and FRR (false-rejection rate) arecomputed as follows

IDR =119873119889

119879119889minus 119873119903

(11)

FAR =119873119886

119879119900

(12)

FRR =119873119903

119879119889

(13)

In addition the accuracy of the system is computed as

ACC =119873119889+ 119873119900

119879119889+ 119879119900

(14)

where119873119900is the number of query items correctly identified as

outside-database items and can be computed as119873119900= 119879119900minus119873119886

In this experiment the low-resolution descriptors areused to find 20 candidates Next high-resolution descriptorsare used to find the best-matched one in the list If thedistance associated with the best-match is greater than thethreshold the query input is determined as not in thedatabase Otherwise the query input is identified as the best-matched one In this experiment high-frequency descriptorsare kept in the comparison With the use of (11) to (14) theaverage values after four trials are given in Table 5 Fromthe results we know that the system does not perform wellfor identifying 96 k MP-3 query items This is mainly dueto high distortion in high-frequency descriptors A similarphenomenon can also be observed in Table 2 Later onwe will repeat this experiment but discard high-frequencydescriptors

To further examine the performance differences betweenmethod I and method II we again use 96 k MP-3 items asthe query inputs But this time we vary the threshold values(for both methods) and then record the IDR FAR and FRRvalues The results are presented as ROC curves [25] andDET-like curves [26] (but without logarithm) as shown in

Figures 10 and 11 respectively The ROC curve plots FARversus TPR (true positive rate) which is defined as

TPR =119873119889

119879119889

(15)

Conceptually a better classifier should have a higher TPRfor a fixed FAR Therefore a better classifier should havean ROC curve closer to the left-upper corner With thisinterpretation from Figure 10 we know that method II is abetter classifier

In this experiment we also show DET-like curves Aldquotruerdquo DET curve plots FAR versus FRR using logarithmscales In our case however we do not use logarithm scales toexhibit the similarity between ROC and DET curves Againconceptually we know that a better classifier should have alower FRR over a fixed FAR Thus a better classifier shouldhave a curve closer to the left-bottom corner By examiningFigure 11 we again confirm thatmethod II is a better method

Since the high-frequency descriptors actually affect theidentification rate at low bitrates we again repeat this exper-iment without using high-frequency descriptors The resultsare given in Table 6 By comparing the results in Tables 5 and6 we know that removing high-frequency descriptors slightlyreduces the accuracy of uncompressed items However itgreatly improves the accuracy for 96 k MP-3 items Sincemethod II has an accuracy of at least 994 in all cases it ishighly plausible to identify copyrighted audiomaterials usingthis method

7 Conclusion

In this paper we propose a system using a multiresolutionstrategy to identify whether a piece of unknown music isidentical to one of the pieces in the database In the systemhigh-resolution descriptors are MPEG-7 audio signaturedescriptors and the low-resolution descriptors are obtainedfrom high-resolution descriptors with the aid of PCA toreduce their dimensionality Experimental results show that

10 The Scientific World Journal

0 20 40 60 80 1000

10

20

30

40

50

60

70

80

90

100

FAR ()

TPR

()

ROC curve (96 k MP-3)

Method IMethod II

Figure 10 The ROC curves for both methods

0 20 40 60 80 1000

10

20

30

40

50

60

70

80

90

100

FAR ()

FRR

()

Method IMethod II

DET curve (96 k MP-3)

Figure 11 The DET-like curves without logarithm for both meth-ods

low-resolution descriptors still have high identification accu-racies To reduce the time to generate the candidate list weuse the k-d tree structure to store low-resolution descriptorsExperimental results show that using k-d tree structureincreases the search time by ten-folds Since not every pieceof query input is within the database we also proposed twomethods to determine the distance thresholds Experimentalresults show that the proposed method II provides an accu-racy of 994 Therefore the proposed system can be usedin real applications such as identifying copyrighted audiofiles circulated over the Internet As it can be easily extendedto operate with multiple computers the proposed system

is a plausible starting point to construct a large operabledatabase

Acknowledgment

This work was supported in part by National Science Councilof Taiwan through Grants NSC 94-2213-E-027-042 and 99-2221-E-027-097

References

[1] For a brief description of the bill please refer to httpenwikipediaorgwikiSOPA

[2] R Eklund ldquoAudio watermarking techniquesrdquo httpwwwmusemagiccompaperswatermarkhtml

[3] J S R Jang and H R Lee ldquoA general framework of progressivefiltering and its application to query by singinghummingrdquoIEEE Transactions on Audio Speech and Language Processingvol 16 no 2 pp 350ndash358 2008

[4] S Baluja and M Covell ldquoAudio fingerprinting combiningcomputer visionamp data streamprocessingrdquo in Proceedings of theIEEE International Conference on Acoustics Speech and SignalProcessing (ICASSP rsquo07) pp II213ndashII216 Honolulu HawaiiUSA April 2007

[5] Shazam httpwwwshazamcommusicwebhomehtml[6] A Wang ldquoThe Shazam music recognition servicerdquo Communi-

cations of the ACM vol 49 no 8 pp 44ndash48 2006[7] V Chandrasekha M Sharifi and D A Ross ldquoSurvey and eval-

uation of audio fingerprinting schemes for mobile query-by-example applicationsrdquo in Proceedinjgs of the 12th InternationalConference onMusic Information Retrieval pp 801ndash806MiamiFla USA October 2011

[8] J A Haitsma and T Kalker ldquoA highly robust audio fingerprint-ing systemrdquo in Proceedinjgs of the 12th International Conferenceon Music Information Retrieval pp 107ndash115 Paris FranceOctober 2002

[9] C J C Burges J C Platt and S Jana ldquoDistortion discriminantanalysis for audio fingerprintingrdquo IEEE Transactions on Speechand Audio Processing vol 11 no 3 pp 165ndash174 2003

[10] Official website httpmusicbrainzorg[11] Official website httpwwwaudiblemagiccom[12] Official website httpwwwgracenotecom[13] httpenwikipediaorgwikiAcoustic fingerprint[14] P J O Doets M M Gisbert and R L Lagendijk ldquoOn

the comparison of audio fingerprints for extracting qualityparameters of compressed audiordquo in Security Steganographyand Watermarking of Multimedia Contents VIII vol 6072 ofProceedings of SPIE pp 60720L-1ndash60720L-12 San Jose CalifUSA January 2006

[15] F Nack and A T Lindsay ldquoEverything you wanted to knowaboutMPEG-7 part 1rdquo IEEEMultimedia vol 6 no 3 pp 65ndash771999

[16] F Nack and A Lindsay ldquoEverything you wanted to know aboutMPEG-7 part 2rdquo IEEEMultimedia vol 6 no 4 pp 64ndash73 1999

[17] ISOIEC Information TechnologymdashMultimedia ContentDescription Interface -Part 4 Audio IS 15938-4 2002

[18] H Crysandt ldquoMusic identification with MPEG-7rdquo in Proceed-ings of the 115th AES Convention October 2003 Paper 5967

The Scientific World Journal 11

[19] OHellmuth E AllamanceMCremerHGrossmann J Herreand T Kastner ldquoUsing MPEG-7 audio fingerprinting in real-world applicationrdquo in Proceedings of the 115th AES ConventionOctober 2003 Paper 5961

[20] H-G Kim N Moreau and T Sikora MPEG-7 Audio andBeyond John Wiley amp Sons 2005

[21] J-Y Lee and S D You ldquoDimension-reduction technique forMPEG-7 Audio Descriptorsrdquo in Proceedings of the 6th PacificRim Conference on Multimedia (PCM rsquo05) vol 3768 of LectureNotes in Computer Science pp 526ndash537 2005

[22] J L Bentley ldquoMultidimensional binary search trees used forassociative searchingrdquo Communications of the ACM vol 18 no9 pp 509ndash517 1975

[23] Open source program for k-d tree httpwwwcsumdedusimmountANN

[24] J Shlens ldquoA tutorial on principal component analysisrdquohttpwwwsnlsalkedusimshlenspcapdf

[25] T Fawcell ldquoROC graphs notes and practical considerations forresearchersrdquo Tech Rep HP Lab 2004

[26] A Martin G Doddington T Kamm M Ordowski and MPrzybocki ldquoThe DET curve in assessment of detection taskperformancerdquo in Proceedings of the Eurospeech vol 4 pp 1899ndash1903 Rhodes Greece September 1997

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 7: Research Article Music Identification System Using MPEG-7 ...downloads.hindawi.com/journals/tswj/2013/752464.pdf · Gracenote s MusicID [ ]. According to [ ], there are more ... one

The Scientific World Journal 7

Firs

t dist

ance

Seco

nd d

istan

ce

Figure 7 The first and second distances in method I

and the threshold Referring to Figure 8 the first distance1198891015840

1(119899) of method II for a training segment in soundtrack n is

1198891015840

1(119899) = 119889

1(119899) divide

sum119872

119898=2119889119898(119899)

119872 minus 1 (8)

where 119889119898(119899) is the distance associatedwith themth candidate

in the list andM is a constant Based on our experimentM =10 is sufficient By the same arrangement the second distance1198891015840

2(119899) is calculated as

1198891015840

2(119899) = 119889

2(119899) divide

sum119872+1

119898=3119889119898(119899)

119872 minus 1 (9)

Similar to (7) the threshold 1198791015840119860is determined as

1198791015840

119860= 05 sdot (

119873119879

sum

119899=1

1198891015840

1(119899) +

119873119879

sum

119899=1

1198891015840

2(119899)) (10)

During identification phase if the computed distance 11988910158401

is greater than1198791015840119860 the query input is determined as not in the

database Otherwise the query input is highly similar to thetop soundtrack title in the candidate list

6 Experiments and Results

Before conducting the experiments we collect 750 sound-tracks frommany CD titles for constructing the database andfor identification In the experiments 30 seconds of musicis excerpted from the soundtracks as the reference items tobe identified Then low- and full-resolution descriptors ofthe reference items are calculated and stored in the databaseWe also randomly excerpt 15-second query items from thereference items Note that a query item may start fromany sample on the first half of the reference To test theidentification accuracy for compressed audio the 15-secondquery inputs are also encoded and then decoded with anMP-3 coder with bitrates of 192 k and 96 k respectivelyHowever the database only contains descriptors from theuncompressed items In addition the principal componentsare also obtained using uncompressed items

Table 4 Comparison of search time between 119896-119889 tree search andlinear search

119870-119889 tree Linear searchAverage time 0255 sec 323 sec

Several experiments are conducted to examine the per-formance of the proposed systemThefirst experiment checksthe identification accuracy using high-resolution descriptorsThe second one compares the relative identification accuracybetween the proposed method (in Section 42) and themethod given in [21] The third experiment compares thesearch speed by using the linear search and k-d tree searchThe next experiment intends to determine a suitable valueof M used in (8) and (9) Having the value of M we reportthe identification accuracy using a multiresolution strategyin experiment five In this experiment both method I andmethod II are examined To have a complete comparisonbetween method I and II we also report the results in termsof the ROC (receiver operating characteristics) [25] curveand the DET-like (detection error tradeoff) [26] curve butwithout logarithm

61 Experiment One Identification Accuracy Using High-Resolution Descriptors The first experiment is to examinethe identification accuracy using the high-resolutionMPEG-7 audio signature descriptors (without reduction) To evaluatethe influences of high-frequency descriptors we also exam-ine the accuracies with and without using high-frequency(greater than 25 kHz) descriptors The results are given inTable 2 In the table ldquomatch in first onerdquo is the rate that thequery input is correctly identified with the shortest distanceamong all reference items Similarly ldquomatch in first 15rdquo is therate that the query is correctly identified within a list of 15 ref-erence items sorted by distanceThe results indicate that high-resolution descriptors have very good identification accuracyAlso for uncompressed or 192 k MP-3 items whether usinghigh-frequency descriptors does not affect the identificationaccuracy However for 96 k MP-3 items discarding high-frequency descriptors greatly improves the accuracy

8 The Scientific World Journal

Seco

nd d

istan

ce

Firs

t dist

ance

AvgAvg

dividedivide

Figure 8 The first and second distances in method II

62 Experiment Two Identification Accuracy Using Low-Resolution Descriptors The second experiment compares therelative identification accuracies of the proposed dimension-ality reduction technique and the averaging method pro-posed in [21] In this experiment eight descriptors are used torepresent one segment of music (15 seconds) The results areshown in Table 3 For this experiment the accuracy rate inldquomatch in first 15rdquo ismore important than the figure in ldquomatchin first onerdquo because the low-resolution descriptors are usedto produce a candidate list Therefore it is acceptable as longas the correct item is within the list as the high-resolutiondescriptors are to be used in the second stage In this regardthe proposed approach is slightly better than the averagingmethod proposed in [21] for 96 k MP-3 query inputs

63 Experiment Three Time for k-d Tree Search and LinearSearch The third experiment compares the time to generatea list of 20 candidates by using k-d tree search and linear(exhausted) search for low-resolution descriptors The com-puting time is obtained by averaging the required time in tentrials For each trial 500 query items are randomly selectedamong the 750 available items The experimental platform isa personal computer with a 16GHz PentiumCPU and 15 GBmemory The program is written in C with a Borland C++compiler The results are shown in Table 4 The results showthat we can increase the search time by 10-fold if the low-resolution descriptors are embedded in the k-d tree structure

64 Experiment Four Determination of M in Method II Thefourth experiment is to determine the numberM in methodII Recall that the thresholds in both methods are computedbased on high-resolution descriptors To determine the opti-mal value of M we vary this value from one to eighteen in(8) and (9) to compute the threshold 1198791015840

119860and then to perform

identification Note that method II is degenerated to methodI if M = 1 In this experiment 96 k MP-3-coded items areused as the query In addition we deliberately keep the high-frequency descriptors in computing the distances becausekeeping them makes it easier to examine the influence of Mversus identification rate (defined in experiment five ) Theexperimental results are given in Figure 9 The results showthat whenM is equal or greater than 10 the identification rate

085408560858

0860862086408660868

087087208740876

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18

Iden

tifica

tion

rate

119872 value

Figure 9The identification rate versus the numberM for 96 kMP-3query inputs

remains almost constant Therefore M = 10 is sufficient formethod II This value is then used in experiment five

65 Experiment Five Comparison between Method I andMethod II The fifth experiment compares the identificationrates between method I and method II The thresholds arecalculated using (7) and (10) respectively In this experiment375 of the 750 15-second items are randomly chosen fortraining (ie to obtain the thresholds) The rest of 375 itemsare for query Note that all of the items are excerpted from thereference items in the database In addition we test other 500query items (also with duration of 15 sec) not excerpted fromdatabase references

Since some query inputs are not in the database wealso need to consider the situations of falsely identifying anoutside database item as one inside the database and viceversa Suppose that 119879

119889query items are actually inside the

database and 119879119900query items are not inside the database

Assume that the identification system correctly identifies119873119889

inside-database query item erroneously rejects 119873119903inside-

database query items and erroneously accepts 119873119886outside

database query items Then the IDR (identification rate)

The Scientific World Journal 9

Table 5 Comparison of identification performance using method I and method II with the use of high-frequency descriptors

Uncompressed 192 k MP-3 96 k MP-3Method I Method II Method I Method II Method I Method II

IDR 988 999 99 997 841 874FAR 21 02 18 02 04 02FRR 00 00 00 05 366 296ACC 984 999 989 995 718 768

Table 6 Comparison of method I and method II for IDR FAR and FRR with high-frequency descriptors removed

Uncompressed 192 k MP-3 96 k MP-3Method I Method II Method I Method II Method I Method II

IDR 991 997 989 997 990 997FAR 15 02 19 02 16 02FRR 03 04 02 04 03 06ACC 987 995 985 995 986 994

FAR (false-accept rate) and FRR (false-rejection rate) arecomputed as follows

IDR =119873119889

119879119889minus 119873119903

(11)

FAR =119873119886

119879119900

(12)

FRR =119873119903

119879119889

(13)

In addition the accuracy of the system is computed as

ACC =119873119889+ 119873119900

119879119889+ 119879119900

(14)

where119873119900is the number of query items correctly identified as

outside-database items and can be computed as119873119900= 119879119900minus119873119886

In this experiment the low-resolution descriptors areused to find 20 candidates Next high-resolution descriptorsare used to find the best-matched one in the list If thedistance associated with the best-match is greater than thethreshold the query input is determined as not in thedatabase Otherwise the query input is identified as the best-matched one In this experiment high-frequency descriptorsare kept in the comparison With the use of (11) to (14) theaverage values after four trials are given in Table 5 Fromthe results we know that the system does not perform wellfor identifying 96 k MP-3 query items This is mainly dueto high distortion in high-frequency descriptors A similarphenomenon can also be observed in Table 2 Later onwe will repeat this experiment but discard high-frequencydescriptors

To further examine the performance differences betweenmethod I and method II we again use 96 k MP-3 items asthe query inputs But this time we vary the threshold values(for both methods) and then record the IDR FAR and FRRvalues The results are presented as ROC curves [25] andDET-like curves [26] (but without logarithm) as shown in

Figures 10 and 11 respectively The ROC curve plots FARversus TPR (true positive rate) which is defined as

TPR =119873119889

119879119889

(15)

Conceptually a better classifier should have a higher TPRfor a fixed FAR Therefore a better classifier should havean ROC curve closer to the left-upper corner With thisinterpretation from Figure 10 we know that method II is abetter classifier

In this experiment we also show DET-like curves Aldquotruerdquo DET curve plots FAR versus FRR using logarithmscales In our case however we do not use logarithm scales toexhibit the similarity between ROC and DET curves Againconceptually we know that a better classifier should have alower FRR over a fixed FAR Thus a better classifier shouldhave a curve closer to the left-bottom corner By examiningFigure 11 we again confirm thatmethod II is a better method

Since the high-frequency descriptors actually affect theidentification rate at low bitrates we again repeat this exper-iment without using high-frequency descriptors The resultsare given in Table 6 By comparing the results in Tables 5 and6 we know that removing high-frequency descriptors slightlyreduces the accuracy of uncompressed items However itgreatly improves the accuracy for 96 k MP-3 items Sincemethod II has an accuracy of at least 994 in all cases it ishighly plausible to identify copyrighted audiomaterials usingthis method

7 Conclusion

In this paper we propose a system using a multiresolutionstrategy to identify whether a piece of unknown music isidentical to one of the pieces in the database In the systemhigh-resolution descriptors are MPEG-7 audio signaturedescriptors and the low-resolution descriptors are obtainedfrom high-resolution descriptors with the aid of PCA toreduce their dimensionality Experimental results show that

10 The Scientific World Journal

0 20 40 60 80 1000

10

20

30

40

50

60

70

80

90

100

FAR ()

TPR

()

ROC curve (96 k MP-3)

Method IMethod II

Figure 10 The ROC curves for both methods

0 20 40 60 80 1000

10

20

30

40

50

60

70

80

90

100

FAR ()

FRR

()

Method IMethod II

DET curve (96 k MP-3)

Figure 11 The DET-like curves without logarithm for both meth-ods

low-resolution descriptors still have high identification accu-racies To reduce the time to generate the candidate list weuse the k-d tree structure to store low-resolution descriptorsExperimental results show that using k-d tree structureincreases the search time by ten-folds Since not every pieceof query input is within the database we also proposed twomethods to determine the distance thresholds Experimentalresults show that the proposed method II provides an accu-racy of 994 Therefore the proposed system can be usedin real applications such as identifying copyrighted audiofiles circulated over the Internet As it can be easily extendedto operate with multiple computers the proposed system

is a plausible starting point to construct a large operabledatabase

Acknowledgment

This work was supported in part by National Science Councilof Taiwan through Grants NSC 94-2213-E-027-042 and 99-2221-E-027-097

References

[1] For a brief description of the bill please refer to httpenwikipediaorgwikiSOPA

[2] R Eklund ldquoAudio watermarking techniquesrdquo httpwwwmusemagiccompaperswatermarkhtml

[3] J S R Jang and H R Lee ldquoA general framework of progressivefiltering and its application to query by singinghummingrdquoIEEE Transactions on Audio Speech and Language Processingvol 16 no 2 pp 350ndash358 2008

[4] S Baluja and M Covell ldquoAudio fingerprinting combiningcomputer visionamp data streamprocessingrdquo in Proceedings of theIEEE International Conference on Acoustics Speech and SignalProcessing (ICASSP rsquo07) pp II213ndashII216 Honolulu HawaiiUSA April 2007

[5] Shazam httpwwwshazamcommusicwebhomehtml[6] A Wang ldquoThe Shazam music recognition servicerdquo Communi-

cations of the ACM vol 49 no 8 pp 44ndash48 2006[7] V Chandrasekha M Sharifi and D A Ross ldquoSurvey and eval-

uation of audio fingerprinting schemes for mobile query-by-example applicationsrdquo in Proceedinjgs of the 12th InternationalConference onMusic Information Retrieval pp 801ndash806MiamiFla USA October 2011

[8] J A Haitsma and T Kalker ldquoA highly robust audio fingerprint-ing systemrdquo in Proceedinjgs of the 12th International Conferenceon Music Information Retrieval pp 107ndash115 Paris FranceOctober 2002

[9] C J C Burges J C Platt and S Jana ldquoDistortion discriminantanalysis for audio fingerprintingrdquo IEEE Transactions on Speechand Audio Processing vol 11 no 3 pp 165ndash174 2003

[10] Official website httpmusicbrainzorg[11] Official website httpwwwaudiblemagiccom[12] Official website httpwwwgracenotecom[13] httpenwikipediaorgwikiAcoustic fingerprint[14] P J O Doets M M Gisbert and R L Lagendijk ldquoOn

the comparison of audio fingerprints for extracting qualityparameters of compressed audiordquo in Security Steganographyand Watermarking of Multimedia Contents VIII vol 6072 ofProceedings of SPIE pp 60720L-1ndash60720L-12 San Jose CalifUSA January 2006

[15] F Nack and A T Lindsay ldquoEverything you wanted to knowaboutMPEG-7 part 1rdquo IEEEMultimedia vol 6 no 3 pp 65ndash771999

[16] F Nack and A Lindsay ldquoEverything you wanted to know aboutMPEG-7 part 2rdquo IEEEMultimedia vol 6 no 4 pp 64ndash73 1999

[17] ISOIEC Information TechnologymdashMultimedia ContentDescription Interface -Part 4 Audio IS 15938-4 2002

[18] H Crysandt ldquoMusic identification with MPEG-7rdquo in Proceed-ings of the 115th AES Convention October 2003 Paper 5967

The Scientific World Journal 11

[19] OHellmuth E AllamanceMCremerHGrossmann J Herreand T Kastner ldquoUsing MPEG-7 audio fingerprinting in real-world applicationrdquo in Proceedings of the 115th AES ConventionOctober 2003 Paper 5961

[20] H-G Kim N Moreau and T Sikora MPEG-7 Audio andBeyond John Wiley amp Sons 2005

[21] J-Y Lee and S D You ldquoDimension-reduction technique forMPEG-7 Audio Descriptorsrdquo in Proceedings of the 6th PacificRim Conference on Multimedia (PCM rsquo05) vol 3768 of LectureNotes in Computer Science pp 526ndash537 2005

[22] J L Bentley ldquoMultidimensional binary search trees used forassociative searchingrdquo Communications of the ACM vol 18 no9 pp 509ndash517 1975

[23] Open source program for k-d tree httpwwwcsumdedusimmountANN

[24] J Shlens ldquoA tutorial on principal component analysisrdquohttpwwwsnlsalkedusimshlenspcapdf

[25] T Fawcell ldquoROC graphs notes and practical considerations forresearchersrdquo Tech Rep HP Lab 2004

[26] A Martin G Doddington T Kamm M Ordowski and MPrzybocki ldquoThe DET curve in assessment of detection taskperformancerdquo in Proceedings of the Eurospeech vol 4 pp 1899ndash1903 Rhodes Greece September 1997

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 8: Research Article Music Identification System Using MPEG-7 ...downloads.hindawi.com/journals/tswj/2013/752464.pdf · Gracenote s MusicID [ ]. According to [ ], there are more ... one

8 The Scientific World Journal

Seco

nd d

istan

ce

Firs

t dist

ance

AvgAvg

dividedivide

Figure 8 The first and second distances in method II

62 Experiment Two Identification Accuracy Using Low-Resolution Descriptors The second experiment compares therelative identification accuracies of the proposed dimension-ality reduction technique and the averaging method pro-posed in [21] In this experiment eight descriptors are used torepresent one segment of music (15 seconds) The results areshown in Table 3 For this experiment the accuracy rate inldquomatch in first 15rdquo ismore important than the figure in ldquomatchin first onerdquo because the low-resolution descriptors are usedto produce a candidate list Therefore it is acceptable as longas the correct item is within the list as the high-resolutiondescriptors are to be used in the second stage In this regardthe proposed approach is slightly better than the averagingmethod proposed in [21] for 96 k MP-3 query inputs

63 Experiment Three Time for k-d Tree Search and LinearSearch The third experiment compares the time to generatea list of 20 candidates by using k-d tree search and linear(exhausted) search for low-resolution descriptors The com-puting time is obtained by averaging the required time in tentrials For each trial 500 query items are randomly selectedamong the 750 available items The experimental platform isa personal computer with a 16GHz PentiumCPU and 15 GBmemory The program is written in C with a Borland C++compiler The results are shown in Table 4 The results showthat we can increase the search time by 10-fold if the low-resolution descriptors are embedded in the k-d tree structure

64 Experiment Four Determination of M in Method II Thefourth experiment is to determine the numberM in methodII Recall that the thresholds in both methods are computedbased on high-resolution descriptors To determine the opti-mal value of M we vary this value from one to eighteen in(8) and (9) to compute the threshold 1198791015840

119860and then to perform

identification Note that method II is degenerated to methodI if M = 1 In this experiment 96 k MP-3-coded items areused as the query In addition we deliberately keep the high-frequency descriptors in computing the distances becausekeeping them makes it easier to examine the influence of Mversus identification rate (defined in experiment five ) Theexperimental results are given in Figure 9 The results showthat whenM is equal or greater than 10 the identification rate

085408560858

0860862086408660868

087087208740876

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18

Iden

tifica

tion

rate

119872 value

Figure 9The identification rate versus the numberM for 96 kMP-3query inputs

remains almost constant Therefore M = 10 is sufficient formethod II This value is then used in experiment five

65 Experiment Five Comparison between Method I andMethod II The fifth experiment compares the identificationrates between method I and method II The thresholds arecalculated using (7) and (10) respectively In this experiment375 of the 750 15-second items are randomly chosen fortraining (ie to obtain the thresholds) The rest of 375 itemsare for query Note that all of the items are excerpted from thereference items in the database In addition we test other 500query items (also with duration of 15 sec) not excerpted fromdatabase references

Since some query inputs are not in the database wealso need to consider the situations of falsely identifying anoutside database item as one inside the database and viceversa Suppose that 119879

119889query items are actually inside the

database and 119879119900query items are not inside the database

Assume that the identification system correctly identifies119873119889

inside-database query item erroneously rejects 119873119903inside-

database query items and erroneously accepts 119873119886outside

database query items Then the IDR (identification rate)

The Scientific World Journal 9

Table 5 Comparison of identification performance using method I and method II with the use of high-frequency descriptors

Uncompressed 192 k MP-3 96 k MP-3Method I Method II Method I Method II Method I Method II

IDR 988 999 99 997 841 874FAR 21 02 18 02 04 02FRR 00 00 00 05 366 296ACC 984 999 989 995 718 768

Table 6 Comparison of method I and method II for IDR FAR and FRR with high-frequency descriptors removed

Uncompressed 192 k MP-3 96 k MP-3Method I Method II Method I Method II Method I Method II

IDR 991 997 989 997 990 997FAR 15 02 19 02 16 02FRR 03 04 02 04 03 06ACC 987 995 985 995 986 994

FAR (false-accept rate) and FRR (false-rejection rate) arecomputed as follows

IDR =119873119889

119879119889minus 119873119903

(11)

FAR =119873119886

119879119900

(12)

FRR =119873119903

119879119889

(13)

In addition the accuracy of the system is computed as

ACC =119873119889+ 119873119900

119879119889+ 119879119900

(14)

where119873119900is the number of query items correctly identified as

outside-database items and can be computed as119873119900= 119879119900minus119873119886

In this experiment the low-resolution descriptors areused to find 20 candidates Next high-resolution descriptorsare used to find the best-matched one in the list If thedistance associated with the best-match is greater than thethreshold the query input is determined as not in thedatabase Otherwise the query input is identified as the best-matched one In this experiment high-frequency descriptorsare kept in the comparison With the use of (11) to (14) theaverage values after four trials are given in Table 5 Fromthe results we know that the system does not perform wellfor identifying 96 k MP-3 query items This is mainly dueto high distortion in high-frequency descriptors A similarphenomenon can also be observed in Table 2 Later onwe will repeat this experiment but discard high-frequencydescriptors

To further examine the performance differences betweenmethod I and method II we again use 96 k MP-3 items asthe query inputs But this time we vary the threshold values(for both methods) and then record the IDR FAR and FRRvalues The results are presented as ROC curves [25] andDET-like curves [26] (but without logarithm) as shown in

Figures 10 and 11 respectively The ROC curve plots FARversus TPR (true positive rate) which is defined as

TPR =119873119889

119879119889

(15)

Conceptually a better classifier should have a higher TPRfor a fixed FAR Therefore a better classifier should havean ROC curve closer to the left-upper corner With thisinterpretation from Figure 10 we know that method II is abetter classifier

In this experiment we also show DET-like curves Aldquotruerdquo DET curve plots FAR versus FRR using logarithmscales In our case however we do not use logarithm scales toexhibit the similarity between ROC and DET curves Againconceptually we know that a better classifier should have alower FRR over a fixed FAR Thus a better classifier shouldhave a curve closer to the left-bottom corner By examiningFigure 11 we again confirm thatmethod II is a better method

Since the high-frequency descriptors actually affect theidentification rate at low bitrates we again repeat this exper-iment without using high-frequency descriptors The resultsare given in Table 6 By comparing the results in Tables 5 and6 we know that removing high-frequency descriptors slightlyreduces the accuracy of uncompressed items However itgreatly improves the accuracy for 96 k MP-3 items Sincemethod II has an accuracy of at least 994 in all cases it ishighly plausible to identify copyrighted audiomaterials usingthis method

7 Conclusion

In this paper we propose a system using a multiresolutionstrategy to identify whether a piece of unknown music isidentical to one of the pieces in the database In the systemhigh-resolution descriptors are MPEG-7 audio signaturedescriptors and the low-resolution descriptors are obtainedfrom high-resolution descriptors with the aid of PCA toreduce their dimensionality Experimental results show that

10 The Scientific World Journal

0 20 40 60 80 1000

10

20

30

40

50

60

70

80

90

100

FAR ()

TPR

()

ROC curve (96 k MP-3)

Method IMethod II

Figure 10 The ROC curves for both methods

0 20 40 60 80 1000

10

20

30

40

50

60

70

80

90

100

FAR ()

FRR

()

Method IMethod II

DET curve (96 k MP-3)

Figure 11 The DET-like curves without logarithm for both meth-ods

low-resolution descriptors still have high identification accu-racies To reduce the time to generate the candidate list weuse the k-d tree structure to store low-resolution descriptorsExperimental results show that using k-d tree structureincreases the search time by ten-folds Since not every pieceof query input is within the database we also proposed twomethods to determine the distance thresholds Experimentalresults show that the proposed method II provides an accu-racy of 994 Therefore the proposed system can be usedin real applications such as identifying copyrighted audiofiles circulated over the Internet As it can be easily extendedto operate with multiple computers the proposed system

is a plausible starting point to construct a large operabledatabase

Acknowledgment

This work was supported in part by National Science Councilof Taiwan through Grants NSC 94-2213-E-027-042 and 99-2221-E-027-097

References

[1] For a brief description of the bill please refer to httpenwikipediaorgwikiSOPA

[2] R Eklund ldquoAudio watermarking techniquesrdquo httpwwwmusemagiccompaperswatermarkhtml

[3] J S R Jang and H R Lee ldquoA general framework of progressivefiltering and its application to query by singinghummingrdquoIEEE Transactions on Audio Speech and Language Processingvol 16 no 2 pp 350ndash358 2008

[4] S Baluja and M Covell ldquoAudio fingerprinting combiningcomputer visionamp data streamprocessingrdquo in Proceedings of theIEEE International Conference on Acoustics Speech and SignalProcessing (ICASSP rsquo07) pp II213ndashII216 Honolulu HawaiiUSA April 2007

[5] Shazam httpwwwshazamcommusicwebhomehtml[6] A Wang ldquoThe Shazam music recognition servicerdquo Communi-

cations of the ACM vol 49 no 8 pp 44ndash48 2006[7] V Chandrasekha M Sharifi and D A Ross ldquoSurvey and eval-

uation of audio fingerprinting schemes for mobile query-by-example applicationsrdquo in Proceedinjgs of the 12th InternationalConference onMusic Information Retrieval pp 801ndash806MiamiFla USA October 2011

[8] J A Haitsma and T Kalker ldquoA highly robust audio fingerprint-ing systemrdquo in Proceedinjgs of the 12th International Conferenceon Music Information Retrieval pp 107ndash115 Paris FranceOctober 2002

[9] C J C Burges J C Platt and S Jana ldquoDistortion discriminantanalysis for audio fingerprintingrdquo IEEE Transactions on Speechand Audio Processing vol 11 no 3 pp 165ndash174 2003

[10] Official website httpmusicbrainzorg[11] Official website httpwwwaudiblemagiccom[12] Official website httpwwwgracenotecom[13] httpenwikipediaorgwikiAcoustic fingerprint[14] P J O Doets M M Gisbert and R L Lagendijk ldquoOn

the comparison of audio fingerprints for extracting qualityparameters of compressed audiordquo in Security Steganographyand Watermarking of Multimedia Contents VIII vol 6072 ofProceedings of SPIE pp 60720L-1ndash60720L-12 San Jose CalifUSA January 2006

[15] F Nack and A T Lindsay ldquoEverything you wanted to knowaboutMPEG-7 part 1rdquo IEEEMultimedia vol 6 no 3 pp 65ndash771999

[16] F Nack and A Lindsay ldquoEverything you wanted to know aboutMPEG-7 part 2rdquo IEEEMultimedia vol 6 no 4 pp 64ndash73 1999

[17] ISOIEC Information TechnologymdashMultimedia ContentDescription Interface -Part 4 Audio IS 15938-4 2002

[18] H Crysandt ldquoMusic identification with MPEG-7rdquo in Proceed-ings of the 115th AES Convention October 2003 Paper 5967

The Scientific World Journal 11

[19] OHellmuth E AllamanceMCremerHGrossmann J Herreand T Kastner ldquoUsing MPEG-7 audio fingerprinting in real-world applicationrdquo in Proceedings of the 115th AES ConventionOctober 2003 Paper 5961

[20] H-G Kim N Moreau and T Sikora MPEG-7 Audio andBeyond John Wiley amp Sons 2005

[21] J-Y Lee and S D You ldquoDimension-reduction technique forMPEG-7 Audio Descriptorsrdquo in Proceedings of the 6th PacificRim Conference on Multimedia (PCM rsquo05) vol 3768 of LectureNotes in Computer Science pp 526ndash537 2005

[22] J L Bentley ldquoMultidimensional binary search trees used forassociative searchingrdquo Communications of the ACM vol 18 no9 pp 509ndash517 1975

[23] Open source program for k-d tree httpwwwcsumdedusimmountANN

[24] J Shlens ldquoA tutorial on principal component analysisrdquohttpwwwsnlsalkedusimshlenspcapdf

[25] T Fawcell ldquoROC graphs notes and practical considerations forresearchersrdquo Tech Rep HP Lab 2004

[26] A Martin G Doddington T Kamm M Ordowski and MPrzybocki ldquoThe DET curve in assessment of detection taskperformancerdquo in Proceedings of the Eurospeech vol 4 pp 1899ndash1903 Rhodes Greece September 1997

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 9: Research Article Music Identification System Using MPEG-7 ...downloads.hindawi.com/journals/tswj/2013/752464.pdf · Gracenote s MusicID [ ]. According to [ ], there are more ... one

The Scientific World Journal 9

Table 5 Comparison of identification performance using method I and method II with the use of high-frequency descriptors

Uncompressed 192 k MP-3 96 k MP-3Method I Method II Method I Method II Method I Method II

IDR 988 999 99 997 841 874FAR 21 02 18 02 04 02FRR 00 00 00 05 366 296ACC 984 999 989 995 718 768

Table 6 Comparison of method I and method II for IDR FAR and FRR with high-frequency descriptors removed

Uncompressed 192 k MP-3 96 k MP-3Method I Method II Method I Method II Method I Method II

IDR 991 997 989 997 990 997FAR 15 02 19 02 16 02FRR 03 04 02 04 03 06ACC 987 995 985 995 986 994

FAR (false-accept rate) and FRR (false-rejection rate) arecomputed as follows

IDR =119873119889

119879119889minus 119873119903

(11)

FAR =119873119886

119879119900

(12)

FRR =119873119903

119879119889

(13)

In addition the accuracy of the system is computed as

ACC =119873119889+ 119873119900

119879119889+ 119879119900

(14)

where119873119900is the number of query items correctly identified as

outside-database items and can be computed as119873119900= 119879119900minus119873119886

In this experiment the low-resolution descriptors areused to find 20 candidates Next high-resolution descriptorsare used to find the best-matched one in the list If thedistance associated with the best-match is greater than thethreshold the query input is determined as not in thedatabase Otherwise the query input is identified as the best-matched one In this experiment high-frequency descriptorsare kept in the comparison With the use of (11) to (14) theaverage values after four trials are given in Table 5 Fromthe results we know that the system does not perform wellfor identifying 96 k MP-3 query items This is mainly dueto high distortion in high-frequency descriptors A similarphenomenon can also be observed in Table 2 Later onwe will repeat this experiment but discard high-frequencydescriptors

To further examine the performance differences betweenmethod I and method II we again use 96 k MP-3 items asthe query inputs But this time we vary the threshold values(for both methods) and then record the IDR FAR and FRRvalues The results are presented as ROC curves [25] andDET-like curves [26] (but without logarithm) as shown in

Figures 10 and 11 respectively The ROC curve plots FARversus TPR (true positive rate) which is defined as

TPR =119873119889

119879119889

(15)

Conceptually a better classifier should have a higher TPRfor a fixed FAR Therefore a better classifier should havean ROC curve closer to the left-upper corner With thisinterpretation from Figure 10 we know that method II is abetter classifier

In this experiment we also show DET-like curves Aldquotruerdquo DET curve plots FAR versus FRR using logarithmscales In our case however we do not use logarithm scales toexhibit the similarity between ROC and DET curves Againconceptually we know that a better classifier should have alower FRR over a fixed FAR Thus a better classifier shouldhave a curve closer to the left-bottom corner By examiningFigure 11 we again confirm thatmethod II is a better method

Since the high-frequency descriptors actually affect theidentification rate at low bitrates we again repeat this exper-iment without using high-frequency descriptors The resultsare given in Table 6 By comparing the results in Tables 5 and6 we know that removing high-frequency descriptors slightlyreduces the accuracy of uncompressed items However itgreatly improves the accuracy for 96 k MP-3 items Sincemethod II has an accuracy of at least 994 in all cases it ishighly plausible to identify copyrighted audiomaterials usingthis method

7 Conclusion

In this paper we propose a system using a multiresolutionstrategy to identify whether a piece of unknown music isidentical to one of the pieces in the database In the systemhigh-resolution descriptors are MPEG-7 audio signaturedescriptors and the low-resolution descriptors are obtainedfrom high-resolution descriptors with the aid of PCA toreduce their dimensionality Experimental results show that

10 The Scientific World Journal

0 20 40 60 80 1000

10

20

30

40

50

60

70

80

90

100

FAR ()

TPR

()

ROC curve (96 k MP-3)

Method IMethod II

Figure 10 The ROC curves for both methods

0 20 40 60 80 1000

10

20

30

40

50

60

70

80

90

100

FAR ()

FRR

()

Method IMethod II

DET curve (96 k MP-3)

Figure 11 The DET-like curves without logarithm for both meth-ods

low-resolution descriptors still have high identification accu-racies To reduce the time to generate the candidate list weuse the k-d tree structure to store low-resolution descriptorsExperimental results show that using k-d tree structureincreases the search time by ten-folds Since not every pieceof query input is within the database we also proposed twomethods to determine the distance thresholds Experimentalresults show that the proposed method II provides an accu-racy of 994 Therefore the proposed system can be usedin real applications such as identifying copyrighted audiofiles circulated over the Internet As it can be easily extendedto operate with multiple computers the proposed system

is a plausible starting point to construct a large operabledatabase

Acknowledgment

This work was supported in part by National Science Councilof Taiwan through Grants NSC 94-2213-E-027-042 and 99-2221-E-027-097

References

[1] For a brief description of the bill please refer to httpenwikipediaorgwikiSOPA

[2] R Eklund ldquoAudio watermarking techniquesrdquo httpwwwmusemagiccompaperswatermarkhtml

[3] J S R Jang and H R Lee ldquoA general framework of progressivefiltering and its application to query by singinghummingrdquoIEEE Transactions on Audio Speech and Language Processingvol 16 no 2 pp 350ndash358 2008

[4] S Baluja and M Covell ldquoAudio fingerprinting combiningcomputer visionamp data streamprocessingrdquo in Proceedings of theIEEE International Conference on Acoustics Speech and SignalProcessing (ICASSP rsquo07) pp II213ndashII216 Honolulu HawaiiUSA April 2007

[5] Shazam httpwwwshazamcommusicwebhomehtml[6] A Wang ldquoThe Shazam music recognition servicerdquo Communi-

cations of the ACM vol 49 no 8 pp 44ndash48 2006[7] V Chandrasekha M Sharifi and D A Ross ldquoSurvey and eval-

uation of audio fingerprinting schemes for mobile query-by-example applicationsrdquo in Proceedinjgs of the 12th InternationalConference onMusic Information Retrieval pp 801ndash806MiamiFla USA October 2011

[8] J A Haitsma and T Kalker ldquoA highly robust audio fingerprint-ing systemrdquo in Proceedinjgs of the 12th International Conferenceon Music Information Retrieval pp 107ndash115 Paris FranceOctober 2002

[9] C J C Burges J C Platt and S Jana ldquoDistortion discriminantanalysis for audio fingerprintingrdquo IEEE Transactions on Speechand Audio Processing vol 11 no 3 pp 165ndash174 2003

[10] Official website httpmusicbrainzorg[11] Official website httpwwwaudiblemagiccom[12] Official website httpwwwgracenotecom[13] httpenwikipediaorgwikiAcoustic fingerprint[14] P J O Doets M M Gisbert and R L Lagendijk ldquoOn

the comparison of audio fingerprints for extracting qualityparameters of compressed audiordquo in Security Steganographyand Watermarking of Multimedia Contents VIII vol 6072 ofProceedings of SPIE pp 60720L-1ndash60720L-12 San Jose CalifUSA January 2006

[15] F Nack and A T Lindsay ldquoEverything you wanted to knowaboutMPEG-7 part 1rdquo IEEEMultimedia vol 6 no 3 pp 65ndash771999

[16] F Nack and A Lindsay ldquoEverything you wanted to know aboutMPEG-7 part 2rdquo IEEEMultimedia vol 6 no 4 pp 64ndash73 1999

[17] ISOIEC Information TechnologymdashMultimedia ContentDescription Interface -Part 4 Audio IS 15938-4 2002

[18] H Crysandt ldquoMusic identification with MPEG-7rdquo in Proceed-ings of the 115th AES Convention October 2003 Paper 5967

The Scientific World Journal 11

[19] OHellmuth E AllamanceMCremerHGrossmann J Herreand T Kastner ldquoUsing MPEG-7 audio fingerprinting in real-world applicationrdquo in Proceedings of the 115th AES ConventionOctober 2003 Paper 5961

[20] H-G Kim N Moreau and T Sikora MPEG-7 Audio andBeyond John Wiley amp Sons 2005

[21] J-Y Lee and S D You ldquoDimension-reduction technique forMPEG-7 Audio Descriptorsrdquo in Proceedings of the 6th PacificRim Conference on Multimedia (PCM rsquo05) vol 3768 of LectureNotes in Computer Science pp 526ndash537 2005

[22] J L Bentley ldquoMultidimensional binary search trees used forassociative searchingrdquo Communications of the ACM vol 18 no9 pp 509ndash517 1975

[23] Open source program for k-d tree httpwwwcsumdedusimmountANN

[24] J Shlens ldquoA tutorial on principal component analysisrdquohttpwwwsnlsalkedusimshlenspcapdf

[25] T Fawcell ldquoROC graphs notes and practical considerations forresearchersrdquo Tech Rep HP Lab 2004

[26] A Martin G Doddington T Kamm M Ordowski and MPrzybocki ldquoThe DET curve in assessment of detection taskperformancerdquo in Proceedings of the Eurospeech vol 4 pp 1899ndash1903 Rhodes Greece September 1997

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 10: Research Article Music Identification System Using MPEG-7 ...downloads.hindawi.com/journals/tswj/2013/752464.pdf · Gracenote s MusicID [ ]. According to [ ], there are more ... one

10 The Scientific World Journal

0 20 40 60 80 1000

10

20

30

40

50

60

70

80

90

100

FAR ()

TPR

()

ROC curve (96 k MP-3)

Method IMethod II

Figure 10 The ROC curves for both methods

0 20 40 60 80 1000

10

20

30

40

50

60

70

80

90

100

FAR ()

FRR

()

Method IMethod II

DET curve (96 k MP-3)

Figure 11 The DET-like curves without logarithm for both meth-ods

low-resolution descriptors still have high identification accu-racies To reduce the time to generate the candidate list weuse the k-d tree structure to store low-resolution descriptorsExperimental results show that using k-d tree structureincreases the search time by ten-folds Since not every pieceof query input is within the database we also proposed twomethods to determine the distance thresholds Experimentalresults show that the proposed method II provides an accu-racy of 994 Therefore the proposed system can be usedin real applications such as identifying copyrighted audiofiles circulated over the Internet As it can be easily extendedto operate with multiple computers the proposed system

is a plausible starting point to construct a large operabledatabase

Acknowledgment

This work was supported in part by National Science Councilof Taiwan through Grants NSC 94-2213-E-027-042 and 99-2221-E-027-097

References

[1] For a brief description of the bill please refer to httpenwikipediaorgwikiSOPA

[2] R Eklund ldquoAudio watermarking techniquesrdquo httpwwwmusemagiccompaperswatermarkhtml

[3] J S R Jang and H R Lee ldquoA general framework of progressivefiltering and its application to query by singinghummingrdquoIEEE Transactions on Audio Speech and Language Processingvol 16 no 2 pp 350ndash358 2008

[4] S Baluja and M Covell ldquoAudio fingerprinting combiningcomputer visionamp data streamprocessingrdquo in Proceedings of theIEEE International Conference on Acoustics Speech and SignalProcessing (ICASSP rsquo07) pp II213ndashII216 Honolulu HawaiiUSA April 2007

[5] Shazam httpwwwshazamcommusicwebhomehtml[6] A Wang ldquoThe Shazam music recognition servicerdquo Communi-

cations of the ACM vol 49 no 8 pp 44ndash48 2006[7] V Chandrasekha M Sharifi and D A Ross ldquoSurvey and eval-

uation of audio fingerprinting schemes for mobile query-by-example applicationsrdquo in Proceedinjgs of the 12th InternationalConference onMusic Information Retrieval pp 801ndash806MiamiFla USA October 2011

[8] J A Haitsma and T Kalker ldquoA highly robust audio fingerprint-ing systemrdquo in Proceedinjgs of the 12th International Conferenceon Music Information Retrieval pp 107ndash115 Paris FranceOctober 2002

[9] C J C Burges J C Platt and S Jana ldquoDistortion discriminantanalysis for audio fingerprintingrdquo IEEE Transactions on Speechand Audio Processing vol 11 no 3 pp 165ndash174 2003

[10] Official website httpmusicbrainzorg[11] Official website httpwwwaudiblemagiccom[12] Official website httpwwwgracenotecom[13] httpenwikipediaorgwikiAcoustic fingerprint[14] P J O Doets M M Gisbert and R L Lagendijk ldquoOn

the comparison of audio fingerprints for extracting qualityparameters of compressed audiordquo in Security Steganographyand Watermarking of Multimedia Contents VIII vol 6072 ofProceedings of SPIE pp 60720L-1ndash60720L-12 San Jose CalifUSA January 2006

[15] F Nack and A T Lindsay ldquoEverything you wanted to knowaboutMPEG-7 part 1rdquo IEEEMultimedia vol 6 no 3 pp 65ndash771999

[16] F Nack and A Lindsay ldquoEverything you wanted to know aboutMPEG-7 part 2rdquo IEEEMultimedia vol 6 no 4 pp 64ndash73 1999

[17] ISOIEC Information TechnologymdashMultimedia ContentDescription Interface -Part 4 Audio IS 15938-4 2002

[18] H Crysandt ldquoMusic identification with MPEG-7rdquo in Proceed-ings of the 115th AES Convention October 2003 Paper 5967

The Scientific World Journal 11

[19] OHellmuth E AllamanceMCremerHGrossmann J Herreand T Kastner ldquoUsing MPEG-7 audio fingerprinting in real-world applicationrdquo in Proceedings of the 115th AES ConventionOctober 2003 Paper 5961

[20] H-G Kim N Moreau and T Sikora MPEG-7 Audio andBeyond John Wiley amp Sons 2005

[21] J-Y Lee and S D You ldquoDimension-reduction technique forMPEG-7 Audio Descriptorsrdquo in Proceedings of the 6th PacificRim Conference on Multimedia (PCM rsquo05) vol 3768 of LectureNotes in Computer Science pp 526ndash537 2005

[22] J L Bentley ldquoMultidimensional binary search trees used forassociative searchingrdquo Communications of the ACM vol 18 no9 pp 509ndash517 1975

[23] Open source program for k-d tree httpwwwcsumdedusimmountANN

[24] J Shlens ldquoA tutorial on principal component analysisrdquohttpwwwsnlsalkedusimshlenspcapdf

[25] T Fawcell ldquoROC graphs notes and practical considerations forresearchersrdquo Tech Rep HP Lab 2004

[26] A Martin G Doddington T Kamm M Ordowski and MPrzybocki ldquoThe DET curve in assessment of detection taskperformancerdquo in Proceedings of the Eurospeech vol 4 pp 1899ndash1903 Rhodes Greece September 1997

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 11: Research Article Music Identification System Using MPEG-7 ...downloads.hindawi.com/journals/tswj/2013/752464.pdf · Gracenote s MusicID [ ]. According to [ ], there are more ... one

The Scientific World Journal 11

[19] OHellmuth E AllamanceMCremerHGrossmann J Herreand T Kastner ldquoUsing MPEG-7 audio fingerprinting in real-world applicationrdquo in Proceedings of the 115th AES ConventionOctober 2003 Paper 5961

[20] H-G Kim N Moreau and T Sikora MPEG-7 Audio andBeyond John Wiley amp Sons 2005

[21] J-Y Lee and S D You ldquoDimension-reduction technique forMPEG-7 Audio Descriptorsrdquo in Proceedings of the 6th PacificRim Conference on Multimedia (PCM rsquo05) vol 3768 of LectureNotes in Computer Science pp 526ndash537 2005

[22] J L Bentley ldquoMultidimensional binary search trees used forassociative searchingrdquo Communications of the ACM vol 18 no9 pp 509ndash517 1975

[23] Open source program for k-d tree httpwwwcsumdedusimmountANN

[24] J Shlens ldquoA tutorial on principal component analysisrdquohttpwwwsnlsalkedusimshlenspcapdf

[25] T Fawcell ldquoROC graphs notes and practical considerations forresearchersrdquo Tech Rep HP Lab 2004

[26] A Martin G Doddington T Kamm M Ordowski and MPrzybocki ldquoThe DET curve in assessment of detection taskperformancerdquo in Proceedings of the Eurospeech vol 4 pp 1899ndash1903 Rhodes Greece September 1997

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 12: Research Article Music Identification System Using MPEG-7 ...downloads.hindawi.com/journals/tswj/2013/752464.pdf · Gracenote s MusicID [ ]. According to [ ], there are more ... one

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014