Fingerprint and quality-based audio track retrieval




UNIVERSITÀ DEGLI STUDI DI MILANO BICOCCA

FACOLTÀ DI SCIENZE MATEMATICHE, FISICHE E NATURALI

Corso di Laurea Magistrale in Informatica (MSc in Computer Science)

FINGERPRINT AND QUALITY-BASED AUDIO TRACK RETRIEVAL

SUPERVISORS:

dott.ssa F. Gasparini (advisor)

dott. S. Bianco (co-advisor)

Submitted by: Riccardo Vincenzo Vincelli (709588)

[email protected]

AA 2011-2012 - third session, 20/11/2012


Contents

1 MP3
    1.1 Encoding
        1.1.1 PCM
        1.1.2 Analysis polyphase filter bank
        1.1.3 FFT
        1.1.4 Psychoacoustic model
        1.1.5 MDCT with Windowing
        1.1.6 Quantization
        1.1.7 Huffman Coding
        1.1.8 Bit stream formatting and CRC word generation
    1.2 Decoding
        1.2.1 Synchronization and Error Checking
        1.2.2 Huffman decoding and Huffman info decoding
        1.2.3 Scale-factor decoding and Requantization
        1.2.4 Reordering
        1.2.5 Joint stereo decoding, Alias reduction and IMDCT
2 The fingerprinting technique
    2.1 Some audio fingerprint techniques
    2.2 Synopsis of the technique
    2.3 Forward algorithm
    2.4 Backward algorithm
    2.5 Pseudocode
        2.5.1 Blocks generation
        2.5.2 Frequency rearrangement and sub-band division
        2.5.3 SBE
        2.5.4 PMF
        2.5.5 Entropy
        2.5.6 Bit stream output
        2.5.7 BER
    2.6 Implementation
3 Testing
    3.1 Basic distortions
        3.1.1 White noise
        3.1.2 Echo
        3.1.3 Pitch shift
        3.1.4 Voice
    3.2 SNR
    3.3 Testing infrastructure
    3.4 Results
        3.4.1 White noise
        3.4.2 Echo
        3.4.3 Pitch shift
        3.4.4 Voice
4 Conclusions

References


    Introduction

An audio fingerprinting technique is, in its most general form, a pair of algorithms: the fingerprinting algorithm and the matching algorithm. The fingerprinting algorithm examines and processes a set of salient features of a given input audio track, generating a small digest from them. The matching algorithm is used to identify an unknown audio track, by computing the fingerprint for a small sample of the track itself and comparing it with a set of known full-length fingerprints. The idea can be applied to any digital media content, but audio identification is of great interest in practical implementations, with much commercial and non-commercial software available for use. The literature on the topic is large and many robust audio fingerprinting algorithms have been implemented in both commercial and free software. A strong mathematical modeling drawing from psychoacoustics, Fourier theory, information and coding theory, statistics and probability is at the cornerstone of the most successful techniques, and great emphasis is also placed on the complexity and performance of the algorithms, since in the most common scenario the client side is operated on portable devices (e.g. smartphones).

A successful technique exhibits robustness to common degradations, collectively known as noise, that can deteriorate the quality of the track to be identified. The key to robust performance lies in the ability to identify features that are to some extent invariant to noise: salient information the track retains even if its quality is noticeably degraded. In order to achieve this, a deep knowledge of how our auditory system works is imperative; for example, it is important to observe that not all frequency contents are equally important, and well-defined sensitivity peaks along the spectrum exist. The fingerprinting (forward) and matching (backward) algorithms are both deterministic, but since a good technique seeks a tradeoff between robustness and efficiency, false positives within an estimated tolerance are accepted. Otherwise stated, a robust yet efficient algorithm returns the correct answer with a probability close to 1 and takes an acceptable time to compute it. The following is the typical use case scenario for audio fingerprinting techniques:

a DMS (Digital Media Service) maintains a large database of popular audio tracks and implements, as a free-of-charge service, an intelligent query and retrieval system for the database

every time a new song is added, it is processed with the forward algorithm; a second associated database table stores, for each track, the corresponding fingerprint

the client side consists of an implementation of the forward algorithm, and the input audio sample to be identified is fed through the computer/device microphone or as an existing MP3/WAV file

the communication medium is the Internet, and once the song has been identified by the server, additional information is returned (e.g. ID3 fields, lyrics, pictures...)

Curious analogies between the fingerprint digest and a DNA strand or a hash code help to better understand the whole picture. Just like DNA material, a fingerprint can be said to belong to a precise person but, even in the case that all laboratory operations on tissue samples are carried out with great care (i.e. no contaminations), false positives are still possible; this is one of the main reasons why, even though DNA tests are a precious and widely used forensic resource, in law enforcement investigations they are rarely the only piece of evidence. Like the output of a hash function, fingerprints are easy to compute, and can legitimately be used as references for the original input file (i.e. when searching through a database).

In this work the MP3 audio fingerprinting technique proposed in [1] is analyzed, and implemented with localized yet relevant improvements. The implementation is thoroughly tested, with test cases designed to evaluate the stability of the technique with respect to noise, at different intensity levels. Basically, the fingerprinting algorithm divides the input file into blocks of MP3 granules and generates a bit stream by examining, by means of the entropy statistic, the variation of the information contributed by the blocks. Important audio features are not explicitly isolated, but their existence affects the statistics of the track and emerges in the computation of entropy differences. From this perspective, two tracks having an almost equal fingerprint share a certain trend, since a bit set to 0 in the i-th position of the bit stream means that the i-th block is somewhat less informative, for example less rich in sound timbre, than the one to follow. One of the most important improvements contributed by this implementation is the ability to process not only single channel (mono) MP3 tracks but also stereo ones, maximizing the information taken from the stream. While this is not relevant for a large part of the MP3 tracks found on the Internet, where the channels all have the same bits, more refined musical expressions or ad-hoc elaborated audio tracks might present effects requiring stereo channeling (e.g. ping pong echo).

The implementation of the technique comes as a pair of C programs, one per algorithm. The matching algorithm is quite simple, being just one compilation unit, whereas the fingerprinting algorithm is implemented in two modules, one for the algorithm itself and one for the decoding library. C was the language of choice for different reasons: for example, it is standardized and performance-oriented. An early-stage implementation was completed in MathWorks MATLAB, but it did not fulfil basic time performance requirements.

Testing aims at evaluating performance on a large and heterogeneous set of samples, fed to the algorithms in both a clean and a noisy version. The noise effects taken into account are white noise addition, echo/delay, pitch shift and voice addition. Test results confirmed what was claimed by the authors in the research, with optimal results even in the presence of quite obtrusive noise disturbances.

    1 MP3

MP3 (Moving Picture Experts Group-1/2 Audio Layer 3) is the name of the well-known lossy audio compression standard, which has become over the years the most adopted format for persisting and exchanging music on the Internet. The standard carefully describes the structure and interpretation of


an MP3 bit stream and what the decoded result is expected to be, together with encoding and decoding block diagrams; this constitutes the main part of the non-normative corpus of the standard. No constraints are set on the encoding algorithm, which means that, for a given uncompressed bit stream (i.e., a WAV format audio track), as long as an encoder outputs a meaningful bit stream consistent with the standard, it can be labeled as compliant; as usual, reference implementations were provided with the standard. On the other hand, the decoding phase is carefully presented, and most decoders are said to be bit stream compliant in the sense that their output matches, within a certain tolerance, the one formally defined in the standard. A deep understanding of the encoding and decoding processes is not an easy matter, as average knowledge of many different fields, ranging from information theory to psychoacoustics, is required. For the sake of clarity we present here the minimal information needed to understand the two phases at a general level; the concepts presented also form the basis for approaching the fingerprinting algorithm, as it directly operates on the MP3 compressed bit stream. An exhaustive yet not overwhelmingly technical guide to the standard is [2]; in [3], among other things, Layer 3 is compared to its predecessors, Layers 1 and 2; [4] is the standard, published in 1993 and updated in [5] in 1995.

    1.1 Encoding

    Figure 1: Encoding block diagram.


    1.1.1 PCM

PCM stands for Pulse Code Modulation, a simple way to digitally represent an analog signal; a PCM signal is obtained with just the following three steps:

    sampling of the analog signal

    quantization of the samples

    binary representation of the quantized values

    Figure 2: The PCM process.

The fidelity of the acquired signal depends upon the sample rate and the bit depth: a minimum sampling rate is forced by the fundamental Nyquist-Shannon theorem, and a secondary result, known as the Widrow theorem, can be applied to ensure the quantization noise reduces to additive white Gaussian noise. If the quantization function is linear, the method is referred to as LPCM (linear PCM).

A pure PCM bit stream typically requires high bit rates, even for non-audiophile quality contents: a rather simple acquisition scheme comes at the price of a great deal of information to be kept. The intuition behind MP3 compression is that a lot of this information is redundant, and can be discarded without affecting the overall quality of the digitized audio track in a perceptually significant way. A common LPCM digital format is WAV (Waveform Audio File Format), by Microsoft and IBM; at a sample rate of 44.1 kHz and a bit depth of 16 bits it mirrors the standard CD audio format.
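As a toy illustration of the three steps (this code is mine, not part of the thesis implementation; the sample values and the 16-bit depth are arbitrary choices), the following C fragment uniformly quantizes already-sampled amplitudes into 16-bit LPCM words:

#include <stdio.h>
#include <math.h>

/* Uniformly quantize an analog amplitude in [-1, 1] to a 16-bit LPCM word. */
static short lpcm_quantize(double amplitude)
{
    if (amplitude > 1.0) amplitude = 1.0;      /* clip out-of-range input */
    if (amplitude < -1.0) amplitude = -1.0;
    return (short)lround(amplitude * 32767.0); /* linear: constant step size */
}

int main(void)
{
    /* Step 1 (sampling) is assumed done: these are samples of some signal. */
    double samples[] = { 0.0, 0.5, -0.25, 0.999 };
    for (int i = 0; i < 4; i++)                /* steps 2-3: quantize, binarize */
        printf("%+.3f -> %d\n", samples[i], lpcm_quantize(samples[i]));
    return 0;
}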

    1.1.2 Analysis polyphase filter bank

Our auditory perception is not uniform along the range of the frequencies we can hear, approximately 20 Hz - 20 kHz; for example, there is a sensitivity peak around 1 - 5 kHz. The MP3 compression is designed to take advantage of this fact, and the crucial phase in this sense is the application of a psychoacoustic model (see below). The analysis polyphase filter bank is the first step in this direction, because different frequency ranges are identified and saved separately; this filtering is named polyphase quadrature filter (PQF).

For each channel, the encoding process starts by partitioning the input into frames of 1152 samples and proceeds by filtering the spectrum of each of them into 32 equally-spaced frequency sub-bands, each $\frac{f/2}{32}$ wide, where $f/2$ is the Nyquist frequency of the input PCM, i.e. half the sampling frequency $f$. For example, for $f = 44.1$ kHz we have $f/2 = 22.05$ kHz, and each band will be about 689 Hz wide: $[1, 689], [690, 1378], \dots, [21362, 22050]$. In this phase, the frequency contribution of each single sample is balanced across the 32 sub-bands, yielding a factor 32 information increase, as for each sample 32 values are computed. For this reason, in each sub-band the number of values is correspondingly decimated by a factor of 32; for each frame, these sub-band values are then grouped into 3 sets of 12 each. This process is lossy, as the original signal cannot be recovered; it also introduces some artifacts, but they are inaudible. One of the reasons for this is the impossibility of constructing bandpass filters with a perfectly square response, which yields sub-bands that overlap a little, where the overlap depends upon the machine precision of the encoder. All these effects are collectively referred to as aliasing. Finally, it is worth noting that even if this phase is really useful to further processing oriented to advanced psychoacoustic models, the sub-bands are unrelated to any critical-bands model of our auditory system.

    Figure 3: The filter bank process.
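A minimal C sketch of the sub-band arithmetic just described (assuming a 44.1 kHz input; the printed bounds are approximate, as the thesis does not fix a rounding convention):

#include <stdio.h>

int main(void)
{
    double fs = 44100.0;               /* sampling frequency */
    double width = (fs / 2.0) / 32.0;  /* Nyquist band split into 32 equal parts */
    for (int b = 0; b < 32; b++)       /* print each sub-band range in Hz */
        printf("band %2d: [%8.1f, %8.1f]\n", b, b * width, (b + 1) * width);
    return 0;
}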

    1.1.3 FFT

The discrete Fourier transform has become ubiquitous in digital signal processing as a tool for switching from the time domain to the frequency domain (and back, with the inverse). The DFT is a key ingredient in many compression and editing techniques. For a brief but self-contained introduction to the subject see [6]. Any algorithm computing the DFT with computational time complexity less than $O(n^2)$, where $n$ is the length of the sequence of complex numbers to be transformed, is said to be an FFT (fast Fourier transform) algorithm. At the moment, algorithms faster than $O(n \lg n)$ are not known. Let $x_0, \dots, x_{N-1}$ be a sequence of $N$ complex numbers. The DFT is defined by the formula:

$$X_k = \sum_{n=0}^{N-1} x_n e^{-\frac{2\pi i}{N} kn} \qquad k = 0, \dots, N-1$$

and its inverse is:

$$x_n = \frac{1}{N} \sum_{k=0}^{N-1} X_k e^{\frac{2\pi i}{N} kn} \qquad n = 0, \dots, N-1$$

The result $X_0, \dots, X_{N-1}$ is a complex sequence too, each value of it encoding in its amplitude and phase those of a sinusoidal form of frequency per sample $k/N$, which are found as $|X_k|/N$ and $\arg(X_k)$ respectively; these sinusoidal forms decompose the function of $n$ represented by the input sequence. DFT formulae can come in many equivalent flavors, depending on the application field; for the two above, following the most common convention, the normalization factors can be any pair of numbers whose product is $1/N$, and the signs of the exponents can be interchanged. If it is the case that the normalization factor of the forward transform is $1/N$, the zero-frequency value, known as the DC value, is the mean of the input sequence. In many applications the result sequence is reordered so that the DC score is right in the middle, and for a 1D DFT it is enough to swap the left and right halves. For a given sequence of length $N$, an $n$-point DFT is commonly intended as the transformation of only the sub-sequence formed by the first $n$ elements, discarding the others if $n < N$, or of the original sequence zero-padded in the remaining empty positions otherwise. The greater $n$, the finer the frequency contents decomposition represented by the sinusoidal forms, as more frequency scores will be returned, actually as many as the elements of the possibly zero-padded input vector: a single spectrum can be examined at different degrees of detail.

In this stage, in parallel with the polyphase filter bank processing, both a 256-point and a 1024-point FFT are performed on the input frames. As a frame is 1152 samples long, some samples are to be ignored, and the choice is to center the FFT window, discarding the first and last 64 samples (1024-point case) or 448 samples (256-point case). Thanks to the fact that 256 and 1024 are powers of 2, particularly fast FFT algorithms can be employed, such as the DIT (decimation-in-time) version of the Cooley-Tukey FFT algorithm. The information conveyed by the 256-point FFT is useful to spot great spectrum differences between adjacent frames, and the 1024-point FFT bears the minimum spectrum resolution information needed to carry out effective compression. The results are fed to the Psychoacoustic model block, where the major compression efforts take place.


Figure 4: 1024/256-point Fourier spectra of a 16 bit sample. A plot like this is obtained by just interpolating linearly between the result scores.
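To make the DFT formula concrete, here is a deliberately naive O(N²) evaluation in C; this is a didactic sketch, not an FFT, and the test signal, a single-cycle cosine, is an arbitrary choice:

#include <stdio.h>
#include <math.h>
#include <complex.h>

/* Direct evaluation of X_k = sum_n x_n e^(-2*pi*i*k*n/N): O(N^2). */
static void dft(const double complex *x, double complex *X, int N)
{
    const double PI = acos(-1.0);
    for (int k = 0; k < N; k++) {
        X[k] = 0;
        for (int n = 0; n < N; n++)
            X[k] += x[n] * cexp(-2.0 * PI * I * k * n / N);
    }
}

int main(void)
{
    const double PI = acos(-1.0);
    double complex x[8], X[8];
    for (int n = 0; n < 8; n++)        /* one cosine cycle: energy at k = 1 and 7 */
        x[n] = cos(2.0 * PI * n / 8.0);
    dft(x, X, 8);
    for (int k = 0; k < 8; k++)
        printf("k=%d |X_k|=%.3f\n", k, cabs(X[k]));
    return 0;
}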

    1.1.4 Psychoacoustic model

The concept of a psychoacoustic model is really a fundamental one, and a good strategy here does make the difference in terms of compression achievements and overall fidelity of the output bit stream. An encoder adopting a valid psychoacoustic model is referred to as a perceptual encoder. The information gathered at this stage is passed both to the MDCT block and to the Quantization block (see below). The tasks here are:

choosing a particular type of window, to alleviate the artifacts deriving from processing a discontinuous stream of information with the MDCT, as each frame is transformed singly


computing the information necessary to quantize the MDCT scores, so as to save just an amount of frequency information proportional to the importance of the frequency content itself in the context of the spectrum, applying a theoretical model

The outputs of this block are, respectively:

    for each frame, the window type

    for each frequency, quantization thresholds

The window type is chosen by comparing the current FFT spectra pair to the previous one. Relevant differences trigger attacks: new sounds begin and produce audible differences (e.g. after 5 seconds of silence a strong guitar riff breaks in). If this is the case, short windowing is used; otherwise the counterpart is long windowing (the names come from their shapes). Long windows come in three forms, differing according to whether they are followed by, or follow, short windows (start and stop long windows, respectively) or not (standard long). Short windowing is the key to contrasting a common flaw of lossy encoders, pre-echo. This artifact denotes a spread of the attack and decay over time periods where they are not meant to be present originally, resulting in an artificial backward and forward echo effect. Pre-echo is due to the strict time domain discontinuities imposed by the use of frames and the subsequent frequency contents balance through the filter bank. Pre-echo is generally not a problem except for percussion instruments, and forward/decay pre-echo is much attenuated by the masking phenomena discussed below.

Figure 5: Window types. Short windows are actually made up of three overlapping windows, allowing for a more precise time resolution. Start and stop long windows guarantee amplitude continuity with the short ones.


    Figure 6: Finite state machine for choosing the appropriate window type.
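The figure above can be read as a four-state machine. The C sketch below is one plausible rendering of it; the state names, the boolean attack flag and the exact transitions are my assumptions for illustration, not the standard's:

#include <stdio.h>

typedef enum { LONG_WIN, START_WIN, SHORT_WIN, STOP_WIN } window_t;

/* One transition of the window-choice FSM: short windows while attacks
   persist, with start/stop long windows bridging long and short runs. */
static window_t next_window(window_t cur, int attack)
{
    switch (cur) {
    case LONG_WIN:  return attack ? START_WIN : LONG_WIN;
    case START_WIN: return SHORT_WIN;          /* committed to short */
    case SHORT_WIN: return attack ? SHORT_WIN : STOP_WIN;
    case STOP_WIN:  return attack ? START_WIN : LONG_WIN;
    }
    return LONG_WIN;
}

int main(void)
{
    int attacks[] = { 0, 1, 1, 0, 0, 0 };  /* e.g. a guitar riff after silence */
    window_t w = LONG_WIN;
    for (int i = 0; i < 6; i++) {
        w = next_window(w, attacks[i]);
        printf("frame %d -> window state %d\n", i, (int)w);
    }
    return 0;
}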

An advanced psychoacoustic model can exploit a great number of psycho-physical phenomena, but most of them fall into one of the following categories:

    range-related

    masking

As previously observed, there exists a sensitivity peak in the audible range. Further, as we are less sensitive to extremal frequency contents, very bass or treble sounds require higher volume levels. Finally, because the audible range actually shrinks, especially at the high frequencies, as one gets older, particular sound information perceived by children ends up being of no use to the elderly. Masking phenomena are subdivided into:

    simultaneous (frequency domain)

    temporal (time domain)

The human auditory system is organized in a number of so-called critical bands, a fact that configures our sound perception as driven by a bank of pass-band filters. It is worth pointing out that even if this is the same idea as the one behind the very first step of the encoding process, the application of an analysis polyphase filter bank, the sub-bands we are talking about here are unrelated. Suppose that a stimulus resonating at a particular dominant frequency is heard. The phenomenon of simultaneous masking determines a particular air pressure threshold, and sounds with components in the same critical band need to be reproduced at volumes higher than the threshold to be heard too.


    Figure 7: Simultaneous masking; direct proportionality can be observed.

In temporal masking, regardless of the frequencies involved, a particularly loud sound covers all of the other sounds below the threshold that start after it has ceased, still following a proportional pattern (post-masking); in the same fashion, weaker sounds starting shortly before the masker are covered as well (pre-masking).

    Figure 8: Temporal masking.

In both cases, the final output of the sub-block is a set of threshold values for the frequency sub-bands, of use in the MDCT to follow. Finally, the choice of the window and the masking model are strictly related. The approximation of the human critical bands in MP3 goes under the name of scale-factor bands. For short windows, sub-band division is less precise, as masking can be exploited to diminish pre-echo, and on transients there is no great need for high frequency resolution.

    1.1.5 MDCT with Windowing

The modified discrete cosine transform is a particular discrete cosine transform very popular in digital signal processing for transforming overlapping blocks of consecutive data.


A DCT is basically the result of applying the DFT to a real function whose periodic extension is set to be even at the left border, i.e. symmetric about the origin. By doing this, each frequency content is carried by a cosine only, not by the sum of both a sine and a cosine as in the DFT. This fact, whose proof is not trivial, is not entirely new to the reader familiar with Fourier series of real functions: when the function is even, the sine coefficients cancel out, and vice versa. Eight variants of the DCT are needed to specify all the possible left-even extensions. The DCT turns out to have better compression properties than the DFT, in the sense that most of the information is concentrated in the first coefficients; roughly speaking, working on the same input data, to express the information present in a DCT sequence a DFT sequence of twice the length is needed. The MDCT is based on the DCT-IV; for a sequence $x_0, \dots, x_{2N-1}$ of reals it returns a real sequence of half the length, $X_0, \dots, X_{N-1}$. Its formula is:

$$X_k = \sum_{n=0}^{2N-1} w_n x_n \cos\left[\frac{\pi}{N}\left(n + \frac{1}{2} + \frac{N}{2}\right)\left(k + \frac{1}{2}\right)\right] \qquad k = 0, \dots, N-1$$

where $w_n$ is the window score. For a single frame sub-band, a group of 12×3 samples enters the block in the diagram of Figure 1. For long blocks (frames requiring long windows), the 36 samples are fed to the MDCT and 18 frequency lines are obtained; for short ones, each 12-tuple is processed separately, yielding 6 lines each. In both cases the overlap is 50%, i.e. the first half of the MDCT-processed sequence comes from the previous frame and the other half from the current one. The output of this step is the granule, the minimum information unit: 576 frequency lines for long blocks, 192×3 lines for short blocks. The tradeoff between frequency and time resolution is clear: long blocks have more sub-bands, but short ones capture three individually elaborated time slices instead of just one. These sub-bands are called scale-factor bands in the MP3 jargon.

Figure 9: The center and right images are obtained by removing the second half of the output scores of the DFT and DCT respectively; the results make clear that the DCT is better at condensing spectrum information in the first coefficients.
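A direct, unoptimized C rendering of the MDCT formula above, for the long-block case (36 samples in, 18 frequency lines out); the sine window is an assumption of this sketch, since the actual window shapes are those of Figure 5:

#include <stdio.h>
#include <math.h>

/* Naive MDCT: 2N windowed inputs -> N outputs, straight from the formula. */
static void mdct(const double *x, double *X, int N)
{
    const double PI = acos(-1.0);
    for (int k = 0; k < N; k++) {
        X[k] = 0.0;
        for (int n = 0; n < 2 * N; n++) {
            double w = sin(PI / (2.0 * N) * (n + 0.5));   /* sine window (assumed) */
            X[k] += w * x[n] * cos(PI / N * (n + 0.5 + N / 2.0) * (k + 0.5));
        }
    }
}

int main(void)
{
    double x[36], X[18];                /* the long-block case: 36 in, 18 out */
    for (int n = 0; n < 36; n++)
        x[n] = sin(2.0 * acos(-1.0) * n / 36.0);
    mdct(x, X, 18);
    for (int k = 0; k < 18; k++)
        printf("line %2d: %+.4f\n", k, X[k]);
    return 0;
}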

This block also runs an alias reduction algorithm, to compensate for the effects deriving from a necessarily not-perfect polyphase filter bank. These corrections are quite complex from a theoretical point of view, but computationally they just reduce to products. In some implementations alias reduction takes place before reordering.

    1.1.6 Quantization

After the psychoacoustic model determines the windowing and the MDCT is performed, it is time for further compression, still driven by the psychoacoustic analysis, on the resulting transformed sequence. At the same time the encoder is constrained to output a bit stream to be reproduced at a given bit rate, and bit rate is a function of bit depth; this bit rate can be constant (CBR) or variable (VBR). As observed above, thanks to masking effects some frequency contents turn out to be of little or no importance, and this suggests strong quantization on them. Quantization always comes with noise though, so the introduced disturbance must be insignificant in terms of auditory perception; in other words, the scores are quantized according to the output of the psychoacoustic model. Quantization follows a power law: this helps avoid regular/periodic quantization noise artifacts and allows larger values to be coded less accurately. Some treatment prior to quantization helps attenuate this noise. The sub-bands can be pretreated and quantized either as a whole or independently, i.e. globally or non-uniformly. The correction and compression of each single sub-band is encoded in the respective scale-factor numbers, usually stored as differences with respect to the global quantization coefficient, the gain factor. This process is structured into two nested loops, the distortion control loop (outer) and the rate control loop (inner). In this iterative process, the outer loop takes care of adjusting the single sub-band scale-factors for the purpose of adapting the quantization noise to the requirements of the psychoacoustic model, whereas the inner loop works on the global gain with the aim of fitting the quantized values into the number of bits constrained by the bit rate.

    Figure 10: The two aspects of the quantization process.
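The power-law idea can be sketched in a few lines of C. The 3/4 exponent is the one commonly associated with MP3; the fixed step size below is an arbitrary stand-in for the gain/scale-factor machinery just described:

#include <stdio.h>
#include <math.h>

/* Power-law quantization sketch: large values are coded less accurately. */
static int quantize(double x, double step)
{
    double q = pow(fabs(x) / step, 0.75);      /* compress the dynamic range */
    return (x < 0 ? -1 : 1) * (int)lround(q);
}

int main(void)
{
    double step = 0.01;
    double xs[] = { 0.02, 0.04, 0.50, 0.52 };  /* equal gaps, unequal treatment */
    for (int i = 0; i < 4; i++)                /* 0.50 and 0.52 collapse together */
        printf("%.2f -> %d\n", xs[i], quantize(xs[i], step));
    return 0;
}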


    1.1.7 Huffman Coding

MP3 also makes use of classic information theory by coding the quantized samples with the Huffman algorithm, a well-known variable-length code where the less frequent a symbol, the longer the codeword it is assigned. In order to fit a particular bit rate, for a fixed sample rate and number of channels, a proper bit depth is required. Once it is fixed, the algorithm can go on using codewords of accordant length. A number of 15 tables is published in the standard, and the frequency lines are grouped and coded differently with these tables according to their importance. Each granule is subdivided into three variable-length groups: big values (the scores expected to be the greatest in absolute value), quad region (intermediate), zero region (zero-clipped/rounded). In addition to this, table access is further parametrized, yielding a total of 29 different ways to code.

Figure 11: Huffman binary tree for the source "In my mind I have these thoughts" on the binary coding alphabet: decoding a single codeword corresponds to going down a particular walk from the root to a leaf.
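The decoding walk described in the caption can be sketched as a generic binary-tree descent; the tiny three-symbol code below is made up for illustration and has nothing to do with the actual MP3 tables:

#include <stdio.h>

/* A Huffman tree node: leaves carry a symbol, inner nodes two children. */
struct node { int symbol; struct node *child[2]; };

/* Decode one codeword: walk from the root, one edge per input bit. */
static int decode_one(const struct node *root, const int **bits)
{
    while (root->child[0] != NULL)     /* stop when a leaf is reached */
        root = root->child[*(*bits)++];
    return root->symbol;
}

int main(void)
{
    /* Code: 'a' = 0, 'b' = 10, 'c' = 11 (less frequent -> longer codeword). */
    struct node a = { 'a', { 0, 0 } }, b = { 'b', { 0, 0 } }, c = { 'c', { 0, 0 } };
    struct node right = { 0, { &b, &c } }, root = { 0, { &a, &right } };
    int stream[] = { 1, 0, 0, 1, 1 };  /* encodes "bac" */
    const int *p = stream;
    for (int i = 0; i < 3; i++)
        printf("%c", decode_one(&root, &p));
    printf("\n");
    return 0;
}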

    1.1.8 Bit stream formatting and CRC word generation

In this final block the construction of the compressed bit stream takes place. As one expects from the encoding process, the bit stream is an ordered collection of frames, each frame in turn made up of two (mono) or four (stereo) 576-value granules. A frame has additional fields other than the effective data:


header: contains general information for the frame and the synchronization word telling the decoder that a new frame is starting

Cyclic Redundancy Check: this field carries the checksum for the sensitive part of the frame, defined in the standard to be the portion of the header and side information fields that, when corrupted, forces the decoder to discard the whole frame; it is interesting to observe that commonly the loss of some frames does not affect the overall quality in a noticeable way, thanks to the high time-domain resolution (in the MPEG family the length of a frame is on the order of milliseconds); use of this field is optional, and it is common practice to skip the CRC generation

side information: everything necessary to properly decode the frame; examples of stored information are the block type, the scale-factor bands length in bits, and the length and encoding of the Huffman regions

main data: the effective data of a single granule is formed by Huffman-coded bits and scale-factors; the Huffman-coded bits are the actual frequency line values to be decoded, and the scale-factors used during quantization have to be undone when the process is reversed

ancillary data: the ancillary data subfield is seldom used; a couple of famous encoders use it for padding purposes, for example. Its length is undefined, which is legitimate since the synchronization word allows the decoder to locate the starting point of the following frame
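As an illustration of the frame layout, the sketch below unpacks a few fields of the 32-bit header; the bit positions are the usual MPEG-1 Layer 3 ones, but the example header value is arbitrary and this fragment is not the thesis decoder:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t h = 0xFFFB9064;                      /* an example 32-bit frame header */
    unsigned sync       = (h >> 21) & 0x7FF;      /* 11-bit synchronization word    */
    unsigned protection = (h >> 16) & 0x1;        /* 0 = CRC word follows the header */
    unsigned bitrate    = (h >> 12) & 0xF;        /* index into the bitrate table    */
    unsigned samplerate = (h >> 10) & 0x3;        /* index into the sample-rate table */
    unsigned mode       = (h >>  6) & 0x3;        /* channel mode (stereo, mono...)  */
    printf("sync=%03X crc?=%u bitrate idx=%u fs idx=%u mode=%u\n",
           sync, !protection, bitrate, samplerate, mode);
    return 0;
}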


Figure 12: The basic units of a frame are the header and the audio data; see the references for details.

    1.2 Decoding

Since the investigated fingerprinting algorithm requires just partial rather than full decoding of the MP3 bit stream, the description will not go through the whole process. Partial decoding is commonly intended as halting at the IMDCT block, when ready to return to the time domain. The following descriptions are kept short, as the reader has already gained familiarity with the encoding process in the previous section.


    Figure 13: Decoding block diagram.

    1.2.1 Synchronization and Error Checking

    The bit stream is parsed to recognize a correct structure, and each frame isidentified by searching for the next synchronization word. If the CRC bit is on,checksum verification is performed.


    1.2.2 Huffman decoding and Huffman info decoding

For a correct inversion of the Huffman coding the decoder has to know where the first codeword starts, Huffman being a variable-length encoding, and which substitution tables are in use. This information is provided to the Huffman decoding block by the Huffman info decoding block. Additional elaboration is performed by the Huffman decoding block, for example zero padding to compensate for missing frequency lines, or run-length decoding, especially for the high frequencies, where many scores are zero-clipped.

    1.2.3 Scale-factor decoding and Requantization

Three pieces of information are needed in order to invert the quantization process: the scale-factors, which are decoded separately, the global gain and additional side information. The inverse quantization adopts two different techniques, for short and for long blocks.

    1.2.4 Reordering

Output from the previous block is sets of dequantized frequency lines, each set representing a short or a long block. In the case of a short block, the batch of frequencies is ordered by sub-band, window, frequency, whereas long ones are ordered by sub-band, frequency. This ordering difference aims at maximizing the compression factor of the Huffman algorithm because, in short windows, scores in the same frequency band are much more likely to have equal values, thus yielding one single codeword, than scores taken at progressive time intervals within the windows.

    1.2.5 Joint stereo decoding, Alias reduction and IMDCT

If the input stream is not a mono one, the different channel information has to be produced. Alias imperfections have to be re-introduced to have a correct audio reconstruction. Finally, the IMDCT remaps the lines, yielding 32 sub-bands each carrying 18 time domain samples.

    2 The fingerprinting technique

    2.1 Some audio fingerprint techniques

Beneficial to approaching the discussed technique is a brief synopsis of the theoretical tools employed in some of the best-known and most successful techniques.

In [7] a rather straightforward approach to identifying salient audio features is presented:

    1. the audio signal is segmented into overlapping frames

    2. Fourier transform is applied to each frame, but only the spectrum is kept,as our auditory system is poorly receptive to phase shifts

    3. this frequency content is subdivided into a number of bands modeling theauditory critical bands (see 1.1.4)


4. a bit stream is computed by looking at energy differences of the scores, both along the frequency and the time axes

In the technique of [8], of Shazam smartphone application fame, the data structure at the core of the process is called a constellation. A constellation is obtained by pruning most of the points of a spectrogram (a 3D plot visualizing frequency and amplitude of a signal in time), leaving only those showing a particularly high energy level with respect to their neighbors. This structure is then converted to a handy bit stream by applying hash functions.

The paper [9] introduces a very interesting approach taking advantage of classic computer vision techniques: the idea is to treat spectrogram plots as actual images and perform wavelet analysis on them. Wavelet theory can be seen as an evolution of Fourier theory which allows for a more robust function decomposition, as not only frequency information but also time information is encoded in the function basis.

    2.2 Synopsis of the technique

We now examine the fingerprinting technique discussed in the research paper [1]. Information theory plays a primary role in the forward algorithm, as the bit stream is computed by comparing entropy differences between consecutive time units, and in the backward algorithm too, since the matching process just involves a number of Hamming distance computations. From an efficiency perspective, the technique avoids a complete decoding of the MP3 bit stream because the information statistics are extracted from the IMDCT coefficients; this is a noteworthy advantage over full decompression, as uncompressed data, such as WAV files, are space-consuming. The outline of the forward process is as follows:

the source MP3 is partially decoded (see figure 13)

this partially decoded bit stream is partitioned into basic units called blocks, each one collecting 22 granules, with an overlap factor of .95

the frequency lines in each block are rearranged, as by granule grouping a block is formed by both short and long windows

a new sub-band division is applied

to each rearranged block a particular function called SBE (sub-band energy) is applied

a probability mass function is computed over the SBE scores of each single block

the entropy of each block is calculated

a bit stream is generated by comparing the successive entropy values of the blocks

    For the backward (matching) process:


the query fingerprint is slid over each database fingerprint and the Hamming distance at each window is computed

the minimum of these window distances, divided by the length of the query bit stream, is returned as the matching score

this set of scores is sorted in ascending order

if the right song is among the first ten elements of the sorted list whose scores are under a threshold value, the query is deemed successful; otherwise the query has failed

    2.3 Forward algorithm

The first step in the algorithm consists in scanning the input file and progressively assembling the basic processing units that we refer to as blocks; the reader is advised not to confuse this kind of block with the MP3 encoding blocks. A block is obtained by grouping together 22 granules, which equals 11 frames. In the paper, input MP3 files are implicitly assumed to be mono channel, and the most obvious generalization in order to meaningfully process stereo MP3s as well is to double the size of each block: a block is still built out of 11 frames but carries 44 granules. An overlap factor of 95% means that block $B_{i+1}$ is equal to block $B_i$ in all positions but the last, which holds a new granule. Differently stated, a new block is produced by shifting the previous one to the left by one position, with the freed slot assigned to the newly read granule. This heavy overlap assures stability in terms of time domain localization because it minimizes boundary differences between the database and query fingerprints: simplistically stated, each query fingerprint has contents that cannot be too different from a particular set of blocks. The number of granules per block is fixed, and the authors do not seem to view it as a point of optimization.

    Figure 14: Block overlapping.

Blocks are groups of granules resulting from both long and short windowing, so the frequency content of a block is not homogeneously represented. A normalization is obtained by applying the following rearrangement formula:

$$\widetilde{sn}(i,j) = \begin{cases} \dfrac{1}{3} \sum_{n=3j}^{3j+2} |sn(i,n)| & \text{(long)} \\[6pt] \dfrac{1}{3} \sum_{m=0}^{2} |sn_m(i,j)| & \text{(short)} \end{cases} \qquad i = 0, 1, \dots, 21 \quad j = 0, 1, \dots, 191$$

where $sn(i,j)$ is the input coefficient at the i-th granule, j-th frequency line of a time-domain block and $sn_m(i,j)$ denotes the m-th window of a short MP3 block. This operation is repeated on each of the N blocks. For the long case, every three consecutive coefficients are grouped; for the short case we take three coefficients at the same frequency, one per window. Not all the newly computed frequency lines are retained. The saved lines are then organized in a number of sub-bands based on the scale-factor bands of the short windows. This choice aims at focusing only on relevant auditory information relating to transients (e.g. percussive stimuli).

Figure 15: This table summarizes the normalization and reorganization of the spectrum (image taken from [1]).

The sub-band energy formula returns, as a real number, the importance of the argument sub-band in the context of the argument block by summing up its scores:

$$SBE(i,j) = \sum_{m \in B_i} \; \sum_{n = MDCT_{Bj}}^{MDCT_{Tj}} |\widetilde{sn}(m,n)|^2 \qquad i = 0, \dots, N \quad j = 0, 1, \dots, 8$$

where $SBE(i,j)$ is the SBE for the i-th block and j-th sub-band, $B_i$ is the set of granules belonging to the i-th block, and $MDCT_{Bj}$ and $MDCT_{Tj}$ are the bounds of the sub-band, listed in the table above. The PMF computed over the SBE data indicates how much a certain band contributes to the overall information of a single block. The formula is:

$$P(i,j) = \frac{SBE(i,j)}{\sum_{j'=0}^{8} SBE(i,j')} \qquad i = 0, \dots, N \quad j = 0, 1, \dots, 8$$

The classic entropy of the PMF denoted by $P(i,j)$ is now computed; for block i it is:

$$H(i) = -\sum_{j=0}^{8} P(i,j) \lg P(i,j) \qquad i = 0, \dots, N$$

In the paper, the robustness of entropy as an information indicator is discussed both from an experimental and a theoretical perspective. The reader is invited to read through this material.

    The final stage is the calculation of the bit stream:

$$S(i) = \begin{cases} 0 & H(i) < H(i+1) \\ 1 & H(i) \ge H(i+1) \end{cases} \qquad i = 0, 1, \dots, N-1$$

    2.4 Backward algorithm

Input to the backward algorithm is a single query bit stream to be matched against a database of known bit streams. Basically, the query is slid over the database bit stream and a Hamming distance is computed at each window; the minimum distance value, divided by the length of the query sample, is returned. This is formalized by the following formula:

$$BER(i) = \frac{1}{n} \min_{j} d_H\big((x_1, x_2, \dots, x_n),\, (x^i_j, x^i_{j+1}, \dots, x^i_{j+n-1})\big) \qquad i = 1, \dots, N_{track} \quad j = 1, 2, \dots, N-n+1$$

the Bit Error Rate of the excerpt $(x_1, \dots, x_n)$ with respect to the i-th audio track in the database, where $d_H$ denotes the Hamming distance and $N$ is the length of the database bit stream. After the BER is computed for each track in the database, the track yielding the minimum BER is the final result. It is suggested to return a more complete list of results though. The list of tracks is sorted in ascending order with respect to the matching score, and of this sorted list only the first ten elements are returned. If the right answer is not present in these ten songs, the query has failed. In all of the test groups, most of the matches are perfect matches: if the audio track is guessed right, then it is the best guess. Finally, the choice of returning not just one result but a richer list can be defended beyond the fact that it strengthens the technique's performance. Even if the tracks in the list almost always do not share relevant perceptive similarities, for example they are not of the same genre, they do have some traits in common, since their matching scores are of the same order of magnitude. Because of this, if the query succeeds the user is given not only the correct answer but also a number of other suggestions for tracks of an affine nature in terms of entropy.

    2.5 Pseudocode

The pseudocode programs presented depict the essential steps illustrated above. The input files are assumed to be stereo (two channels). Some additional logic is required in order to write an implementation of the technique, but it is left out here, since a working implementation is part of this thesis work.

    The only data structure in use is the static array, with indexes starting at 1.

    2.5.1 Blocks generation

In the picture of the whole forward algorithm, the time-domain blocks subdivision procedure can be positioned either as the first step after partial decoding or as an extension to it. If we opt for the former, blocks generation takes place after decoding has completed, whereas for the latter, which is the choice in the implementation, the blocks get built as the partial decoding proceeds: the operations of frame decoding and block generation are interleaved. currFrameValues[channels][granules][subbands][frequency] is the structure containing the data of a frame, with a total of 2·2·32·18 values; its type is in currFrameTypes[channels][granules]. For block values a new index ranging from 1 to 11 is added, so its form is currBlockValues[number][channels][granules][subbands][frequency], and there is a currBlockTypes too. The first time this procedure is called currentBlock is empty; as soon as it is filled, setReady raises a flag to be checked by the caller, telling that the current block is in a consistent state and can therefore be fed to the next step of the forward algorithm. These two pairs of multidimensional arrays are bundled into two vanity arguments currFrame and currBlock. num, ch, gr are the current block position, channel and granule number.

procedure buildBlock(currFrame, currBlock, num, ch, gr)
    if num < 12 then                    ▷ block not yet full: fill slot num directly
        if num = 11 then
            setReady                    ▷ block complete from now on
        end if
        currBlockTypes[num][ch][gr] ← currFrameTypes[ch][gr]
        for i ← 1 to 32 do
            for j ← 1 to 18 do
                currBlockValues[num][ch][gr][i][j] ← currFrameValues[ch][gr][i][j]
            end for
        end for
    else                                ▷ block full: shift left, append the new granule
        for k ← 1 to 10 do
            currBlockTypes[k][ch][gr] ← currBlockTypes[k+1][ch][gr]
            for i ← 1 to 32 do
                for j ← 1 to 18 do
                    currBlockValues[k][ch][gr][i][j] ← currBlockValues[k+1][ch][gr][i][j]
                end for
            end for
        end for
        currBlockTypes[11][ch][gr] ← currFrameTypes[ch][gr]
        for i ← 1 to 32 do
            for j ← 1 to 18 do
                currBlockValues[11][ch][gr][i][j] ← currFrameValues[ch][gr][i][j]
            end for
        end for
    end if
    return currBlock
end procedure

    2.5.2 Frequency rearrangement and sub-band division

Input to this procedure is a completed block, currBlock. The values of the current block are transformed into a new structure, rearBlockValues[number][channels][granules][frequency], where frequency lines are indexed with only one value; 66 frequency lines are formed. Sub-band division is implicitly performed: just the needed number of frequencies is computed, but no sub-band index is kept. The two different cases are clearly formalized (2 corresponds to the short block type). sfreqs is the number of frequency lines per sub-band in each short window.

procedure rearranger(currBlock)
    sfreqs ← 6
    for i ← 1 to 11 do                          ▷ frame in block
        for j ← 1 to 2 do                       ▷ channel
            for k ← 1 to 2 do                   ▷ granule
                c ← 0                           ▷ current sub-band
                for l ← 1 to 66 do              ▷ new frequency line
                    if currBlockTypes[i][j][k] = 2 then     ▷ short block
                        if l ≠ 0 and l mod sfreqs = 0 then
                            c ← c + 1           ▷ advance to the next sub-band
                        end if
                        base ← l mod sfreqs
                        val1 ← currBlockValues[i][j][k][c][base]
                        val2 ← currBlockValues[i][j][k][c][base + sfreqs]
                        val3 ← currBlockValues[i][j][k][c][base + 2·sfreqs]
                    else                                    ▷ long block
                        if l ≠ 0 and l mod sfreqs = 0 then
                            c ← c + 1
                        end if
                        val1 ← currBlockValues[i][j][k][c][3l mod 18]
                        val2 ← currBlockValues[i][j][k][c][(3l + 1) mod 18]
                        val3 ← currBlockValues[i][j][k][c][(3l + 2) mod 18]
                    end if
                    rearBlockValues[i][j][k][l] ← (val1 + val2 + val3)/3
                end for
            end for
        end for
    end for
    return rearBlockValues
end procedure

    2.5.3 SBE

The procedure returns a structure, SBEs[channels][subbands]. getNewBandsBounds returns, for a given sub-band number, its starting position (see 2.3). First, the energy for each sub-band in each granule of the block is computed (first group of cycles); then these values are collected channel by channel, sub-band by sub-band.


procedure SBE(rearBlock)
    for i ← 1 to 11 do                          ▷ frame
        for j ← 1 to 2 do                       ▷ channel
            for k ← 1 to 2 do                   ▷ granule
                for l ← 1 to 9 do               ▷ sub-band
                    if l ≠ 9 then
                        bandStart ← getNewBandsBounds(l)
                        bandStop ← getNewBandsBounds(l + 1)
                        for m ← bandStart to bandStop do
                            grSum[i][j][k][l] ← grSum[i][j][k][l] + rearBlockValues[i][j][k][m]
                        end for
                    else                        ▷ the last sub-band runs to line 66
                        start ← getNewBandsBounds(l)
                        for m ← start to 66 do
                            grSum[i][j][k][l] ← grSum[i][j][k][l] + rearBlockValues[i][j][k][m]
                        end for
                    end if
                end for
            end for
        end for
    end for
    for i ← 1 to 2 do                           ▷ channel
        for j ← 1 to 9 do                       ▷ sub-band
            for l ← 1 to 11 do                  ▷ frame
                for k ← 1 to 2 do               ▷ granule
                    SBEs[i][j] ← SBEs[i][j] + grSum[l][i][k][j]
                end for
            end for
        end for
    end for
    return SBEs
end procedure

    2.5.4 PMF

The final result is computed in place on the input structure, SBEs, in two steps. In the first loop the denominators are calculated by reading, for a given channel, all the sub-band energies; in the second loop the final results are written. In case a denominator is zero, the distribution is fixed to uniform.


procedure PMF(SBEs)
    for j ← 1 to 2 do
        for k ← 1 to 9 do
            sums[j] ← sums[j] + SBEs[j][k]
        end for
    end for
    for j ← 1 to 2 do
        for k ← 1 to 9 do
            if sums[j] ≠ 0 then
                SBEs[j][k] ← SBEs[j][k]/sums[j]
            else
                SBEs[j][k] ← 1/9                ▷ degenerate case: uniform distribution
            end if
        end for
    end for
    return SBEs
end procedure

    2.5.5 Entropy

PMF is the original SBEs data structure, rewritten in the previous step to contain the discrete distribution.

procedure entropy(PMF)
    for j ← 1 to 2 do
        for k ← 1 to 9 do
            PMF[j][k] ← PMF[j][k] · lg(PMF[j][k])
        end for
    end for
    for j ← 1 to 2 do
        for k ← 1 to 9 do
            H[j] ← H[j] + PMF[j][k]
        end for
        H[j] ← −H[j]                            ▷ entropy is the negated sum
    end for
end procedure

    2.5.6 Bit stream output

The bit stream is computed by looking at the sequence of entropy values of the blocks. To evaluate the bit for the i-th block, only the entropy values for the (i−1)-th block are needed, just two numbers, one per channel. The new values, passed in H[channels], are inserted in an auxiliary structure, Hbuf[2][channels], holding the current and previous values. The rule is applied in two different but equivalent ways, for even and odd indexes.


procedure bufbit(H, blockNum)
    for j ← 1 to 2 do
        Hbuf[(blockNum mod 2) + 1][j] ← H[j]
        if blockNum ≠ 0 then
            if blockNum mod 2 = 1 then          ▷ odd block: previous entropy in slot 1
                if Hbuf[1][j] < Hbuf[2][j] then
                    bits[j] ← 0
                else
                    bits[j] ← 1
                end if
            else                                ▷ even block: previous entropy in slot 2
                if Hbuf[2][j] < Hbuf[1][j] then
                    bits[j] ← 0
                else
                    bits[j] ← 1
                end if
            end if
        end if
    end for
    return bits
end procedure

    2.5.7 BER

wholeBits and sampleBits are the database and query bit streams. Sliding along the database stream is done in a base+offset fashion. If the distance just computed is smaller than the current one, the current distance is updated.

procedure BER(wholeBits, sampleBits)
    base ← 0
    while base ≤ |wholeBits| − |sampleBits| do
        berTemp ← 0
        for i ← 1 to |sampleBits| do
            if sampleBits[i] ≠ wholeBits[base + i] then
                berTemp ← berTemp + 1
            end if
        end for
        if base = 0 then
            berCurr ← berTemp                   ▷ first window initializes the minimum
        else
            if berTemp < berCurr then
                berCurr ← berTemp
            end if
        end if
        base ← base + 1
    end while
    return berCurr/|sampleBits|
end procedure


    2.6 Implementation

Initially MathWorks MATLAB was chosen as the implementation environment. Reasons for this choice are a really ergonomic IDE, a programming language easy to take up, an important asset of built-in data structures isolating the programmer from low-level issues, and finally the fact that the natural setting of this thesis work is digital signal processing, to which MATLAB offers a dedicated module, the Signal Processing toolbox. The development process in MATLAB gave birth to a complete and working implementation of the forward algorithm. Unfortunately, testing on the program was really unsatisfactory, due to unacceptable running times: processing a standard MP3 song on a common desktop took up to one hour to complete. Consequently, I was compelled to look for other solutions. In order to push performance to significantly higher levels I opted for the C programming language and, for maximum portability, I conformed to the standard, avoiding external libraries. Even if the developed code is for the greatest part machine and architecture independent, some tasks, like file management, are unavoidably dependent on the particular operating system of execution. The target machine of preference is a standard UNIX box; nevertheless, the program can be successfully compiled and used on Windows machines too, by installing proper emulation environments like Cygwin. The code is divided into two distinct parts, the forward and backward algorithms. The matching module consists of just one compilation unit. The fingerprinting module breaks down into the effective algorithm and the code needed to decode an MP3 stream. The decoder is a reference implementation by the Fraunhofer Institute. Once the decoder produces the IMDCT coefficients, the decoding process is halted (partial decoding) and the current frame is fed to the forward algorithm. Time performance for fingerprint generation is quite satisfactory: stereo MP3 files of even 15 minutes are processed in no more than five minutes. Things are less rosy for fingerprint matching: matching a single 30-second query against a database of 1000 full-length bit streams is time consuming, taking up to about one hour on a common desktop. Anyway, one must bear in mind that the fingerprint matching process, in the typical scenario of use, is assumed to be performed on powerful computing infrastructure, as it is carried out by the service provider. On the contrary, it is fingerprint production for audio excerpts which requires quickness in order to guarantee a valuable user experience and, extrapolating from what emerges from the tests, it is likely to take a dozen seconds on smartphones and the like for samples lasting between five and ten seconds. Testing of the algorithm is discussed thoroughly in the next section.

    3 Testing

    The technique is expected to maintain stability even in presence of noise: it isclear that if the excerpt is acquired, for example, in a crowded room, not onlythe technique has to deal with quantization noise but also with over-the-airdisturbance, e.g. voices of those speaking over in the room. On the other hand,it is very important that a technique has the right guess even if the originaltrack is somewhat distorted in a deliberate way: this is the case in many DJset performances where echoes effects are added or the tracks are played with


    noticeable pitch or tempo changes.

    In [1] resistance to many types of distortion is claimed, and the presented results show very good performance for most of them. In this work a subset of the noise degradations is selected and, for each of them, the stability of the technique is tested at various degrees of intensity; these degradations are additive white noise, echo, and pitch shift. For a more complete analysis, the effect of explicitly adding voices to the queries is also studied.

    3.1 Basic distortions

    3.1.1 White noise

    A white noise signal exhibits a flat power spectral density. The PSD of a signal gives, for each frequency, the contribution in terms of power (work over time units) per unit of bandwidth, that is, how much energy is conveyed by the signal at a given frequency. The color of this noise comes from the fact that a light stimulus, to be perceived as white, has to carry, in the context of additive synthesis, maximum energy at every frequency. A white noise signal, given any pair of frequency ranges [f_1, f_2] and [f'_1, f'_2] of equal width, contains the same amount of energy in both.

    Formally, a (strong) white noise signal is a vector whose mean is zero, whose variance is finite and whose values are independent and identically distributed; for example, sampling a continuous uniform distribution over [-1, 1] yields a white vector.

    Figure 16: Spectra of vectors obtained by generating pseudo-random numbers in the interval [-1, 1]; thanks to the law of large numbers, as the number of samples grows a better white-noise-like signal is observed.

    Clearly, as long as the theoretical requirements are met, any underlying distribution will fit. Along with the uniform distribution, the zero-mean Gaussian distribution N(0, σ²) is usually chosen, and if the noise is added to an input signal the disturbance is commonly termed AWGN (additive white Gaussian noise).
    In a white noise audio signal the samples are amplitudes, and the white noise vector is usually rescaled to span the allowed range of intensities.
    The intensity of the disturbance can be calibrated by examining the SNR value (see 3.2).
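    For illustration, a minimal C sketch of this construction, assuming the standard library rand() generator is statistically adequate for test purposes (the names are illustrative):

        #include <stdlib.h>
        #include <stddef.h>

        /* Fill noise[0..n-1] with i.i.d. uniform samples in [-1, 1],
         * rescaled to the peak amplitude vol (0 < vol <= 1). */
        void white_noise(double *noise, size_t n, double vol)
        {
            for (size_t i = 0; i < n; i++) {
                double u = (double)rand() / RAND_MAX; /* u in [0, 1] */
                noise[i] = vol * (2.0 * u - 1.0);     /* map to [-vol, vol] */
            }
        }

        /* Additive disturbance: the degraded signal is the sample-wise sum. */
        void add_noise(double *signal, const double *noise, size_t n)
        {
            for (size_t i = 0; i < n; i++)
                signal[i] += noise[i];
        }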


    3.1.2 Echo

    An echo effect, also called delay, is usually defined by four parameters:

    time interval between sound emission and its return (delay)

    intensity of the repetition (decay)

    gain factor for the input signal (gain in)

    gain factor for the output signal (gain out)

    The echo is commonly used in many electronic music genres; dub, for example, originated in Jamaica, finds its peculiarity in ethereal and slow rhythm patterns with echoes as their backbone. Many guitarists also make use of delay pedals in their live performances, seeking polyphonic impressions.

    When added to an audio signal, echo can be seen as self-generated noise: a sample played at time t is played again at time t + k, adding itself to the original sample at that position. The initial samples in the range [t, t + k] are untouched, and if the response of the fingerprinting technique is strong enough there, the audio track is recognized in spite of the echoes.
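    As a minimal C sketch of this self-delaying addition, following the four-parameter model listed above (a feed-forward formulation and the names are assumptions, not the thesis code):

        #include <stddef.h>

        /* Feed-forward echo:
         * y[i] = gain_out * (gain_in * x[i] + decay * x[i - k]),
         * where k is the delay in samples (delay time times the sampling
         * rate); the first k samples receive no delayed contribution. */
        void echo(const double *x, double *y, size_t n, size_t k,
                  double gain_in, double gain_out, double decay)
        {
            for (size_t i = 0; i < n; i++) {
                double s = gain_in * x[i];
                if (i >= k)
                    s += decay * x[i - k];
                y[i] = gain_out * s;
            }
        }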

    3.1.3 Pitch shift

    In color theory a widely used triple of color-perception correlates is hue, saturation and brightness, informally representing, for the stimulus, its pure color, how chromatically intense it is and how much light it appears to emit, respectively. In music theory the nature of a sound is discussed in terms of:

    duration: for a generic sound, how long it lasts with respect to a tempo; a modern tempo measure is BPM (beats per minute)

    loudness: informally the volume; a correlate of amplitude

    pitch: a correlate of frequency; common pitch categories are bass, mid and treble

    timbre: an aggregate attribute grouping a number of sub-attributes independent of the previous three; a notable component is ADSR (attack, decay, sustain, release), the time envelope of a sound

    In digital equipment, pitch is shifted by small positive or negative increments called cents. An octave is the distance between notes at different pitches: for example, for a 440 Hz note, the note one octave above is at 880 Hz and the note one octave below at 220 Hz; the ratio is constant. The space of an octave can be partitioned into 12 sub-intervals called semitones, of 100 cents each. In this arrangement, shifting the pitch up by 1000 cents almost doubles the frequency; shifting up by just one cent means adjusting to the frequency obtained by scaling the starting frequency by the 1200-th root of 2. A cent is the frequency counterpart of the amplitude decibel.
    Pitch shifting is a useful tool when a recorded voice needs to be made unidentifiable. Pitch controls can also be found on turntable mixers.
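    As a worked restatement of this relation (an added illustration, not taken verbatim from [1]): a shift of c cents maps a frequency f_0 to

        f = f_0 * 2^(c/1200)

    so a +1200-cent shift doubles the frequency exactly, while a +1000-cent shift applied to 440 Hz gives 440 * 2^(1000/1200) ≈ 784 Hz, just short of the 880 Hz octave.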


    3.1.4 Voice

    Voice, arguably the first instrument man has ever played, has frequency content roughly in the [60, 7000] Hz range. Overlaid voices are a very important noise-resistance test for two reasons:

    our auditory system is particularly sensitive to voice

    fingerprint techniques often work on samples recorded in public places,where people talking are the main source of disturbance to the signal

    White noise and voices are mixed into the input signal by actually adding the samples. Echo and pitch effects are, instead, modifications of the input signal itself.

    3.2 SNR

    The signal-to-noise ratio indicates the level of degradation of a signal, telling how far it stands out with respect to noise. Amplitude SNR in decibels is given by:

    SNR_dB = 20 log10( RMSA_content / RMSA_noise )

    where RMSA_x is the root mean square amplitude of the signal x = (x_1, x_2, ..., x_n):

    RMSA_x = sqrt( (1/n) * sum_{i=1..n} x_i^2 )

    The higher the SNR, the better the communication is expected to be, since the content transmitted over the channel dominates the background noise. If the SNR is negative, the noise is stronger than the signal and clear communication is compromised. SNR can be applied to both analog and digital signals, even if for bit streams the E_b/N_0 (energy per bit to noise power spectral density ratio) indicator is more appropriate:

    E_b/N_0 = SNR / LSE

    where LSE is the link spectral efficiency of the channel, measured in (bit/s)/Hz. Anyway, since it is a more amenable parameter, we will work with SNR rather than E_b/N_0.
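    For illustration, a direct C transcription of the two formulas above (a minimal sketch; the function names are assumptions):

        #include <math.h>
        #include <stddef.h>

        /* Root mean square amplitude of x[0..n-1]. */
        double rmsa(const double *x, size_t n)
        {
            double acc = 0.0;
            for (size_t i = 0; i < n; i++)
                acc += x[i] * x[i];
            return sqrt(acc / (double)n);
        }

        /* Amplitude SNR in decibels between content and noise signals. */
        double snr_db(const double *content, const double *noise, size_t n)
        {
            return 20.0 * log10(rmsa(content, n) / rmsa(noise, n));
        }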

    3.3 Testing infrastructure

    The test database is a collection of 1000 MP3 files. Many musical styles are covered, from rap to post-techno. Random intervals are extracted from each song; the samples are 5 s long and make up the set of queries to be fingerprinted and matched. Notice that picking sub-songs at random is a stricter test condition than, for example, taking central portions, because the initial and final parts of a song are likely to be less informative. This is especially the case for four-on-the-floor club music, where the tails of the track are often nothing but the drum beat, to allow easier beat-matching by the DJ.


    The second step is to generate fingerprints for the database songs and the queries. Finally, each query is matched against the database and the results are returned. The first type of test tries to match clean samples, the other four test noise-deteriorated samples; the noise is added directly to the already extracted samples. The testing process is highly automated through the use of scripts and external free programs: MP3info, MP3splt and SoX. Reports are text files showing, for each sample, the ten database songs yielding the lowest matching scores.

    3.4 Results

    As said, the testing perspective on the technique is broadened by taking into consideration different noise intensities as well as a novel distortion, voice addition. The response of the technique to clean samples is tested again in two initial test cases. The complete list follows:

    without noise addition/clean samples:

    preliminary test: about 100 songs picked at random

    differentiation test: about 125 songs by the same author

    with noise addition - each on the whole database of 1000 songs:

    voices addition

    white noise addition

    echo addition

    pitch shift

    Once a query has completed, a list containing the results is returned. If the right song is in this list, a match is counted; if it is also the first of the list, a perfect match is counted too. For the first two test cases, results are really positive:

    preliminary test: perfect match hit rate 98%, hit rate 100%

    differentiation test: perfect match hit rate 92%, hit rate 98%

    We now proceed to present the noise-distortion test cases.

    3.4.1 White noise

    In the paper, noise addition is tested with an SNR as low as 15 dB. The volume of a signal affects its RMSA, so the choice here is to generate white noise at different volume levels. The reference volume is 0 dBFS, the maximum possible digital volume level on the computer running the tests, which at maximum speaker power corresponds to a sound pressure level of about 100 dB. Subsequent volume values are attenuations of this reference value, obtained by scaling it. In the following, adjusting the white noise volume to vol determines a certain RMSA; SNR_avg denotes the average SNR between the signal and the database tracks.

    1. vol = 1, RMSA = .45, SNR_avg = -9.5 dB


    2. vol = .5, RMSA = .23, SNR_avg = -3.7 dB

    3. vol = .25, RMSA = .11, SNR_avg = 2.7 dB

    4. vol = .125, RMSA = .06, SNR_avg = 8.0 dB

    5. vol = .0625, RMSA = .03, SNR_avg = 14 dB

    6. vol = .03125, RMSA = .01, SNR_avg = 23 dB
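    As a quick consistency check (an added note, not a figure from the thesis): halving vol halves RMSA_noise, so by the SNR formula of section 3.2 each step gains

        20 * log10(2) ≈ 6.02 dB

    of average SNR, which agrees with the roughly 6 dB spacing of the values listed above.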

    Figure 17: Plot of volume versus hit rate.

    3.4.2 Echo

    The echo effect is tested with the following delay and decay values (gain is unchanged):

    del = 50 ms, dec = 25%

    del = 100 ms, dec = 50%

    del = 200 ms, dec = 60%

    del = 400 ms, dec = 70%

    del = 800 ms, dec = 80%

    del = 1600 ms, dec = 90%


    Figure 18: Plot of (delay, decay) versus hit rate.

    3.4.3 Pitch shift

    The tested shift values, in cents:

    -1000 cents

    -500 cents

    -100 cents

    +100 cents

    +500 cents

    +1000 cents

    Figure 19: Plot of pitch shift versus hit rate.

    3.4.4 Voice

    Just like in the white noise case, the added signal is progressively attenuated by scaling it.


    Figure 20: Plot of volume versus hit rate.

    4 Conclusions

    This thesis work consisted in the study, implementation and testing of the fingerprinting technique illustrated in [1]. The technique employs entropy as the tool to model the fluctuation of information content in the MP3 audio file and generate a compact bit stream. Performance was evaluated with respect to the most relevant noise degradations cited in the research paper. Results were satisfactory and confirmed the robustness and stability of the algorithms claimed by the authors, since hit-rate statistics turned out to be low only for degradations of high intensity.

    The main application focus of this kind of technique is audio track retrieval, especially in the context of mobile phone use. For very large databases, other interesting applications of the system worth citing in these Conclusions are duplicate track detection and track quality sorting. In the first case, as fingerprints can be seen as a way of hash-indexing the audio files, whenever two computed fingerprints are very similar a duplicate track is detected. In the second, given a reference audio track and a number of different versions derived from it, their relative quality is obtained by looking at the matching scores of these versions.

    As said, the concept of entropy from Shannon theory is at the very core of the whole forward algorithm, where the entropy of each 22-granule block is computed and bits are emitted accordingly. In the first place, the actual entropy of the discrete signal could be approximated by entropy estimation approaches; the easiest would be to sub-sample the signal, seeking a compromise between the sub-sampling factor and the estimation error. More refined estimates exist, for example ApEn (approximate entropy), a statistic for detecting how regular a time series is.

    Finally, an analytical comparison of this technique against the other known and successful techniques would be of great interest, also because [1] is relatively simple and fast.


    References

    [1] Wei Li, Yaduo Liu, and Xiangyang Xue. Robust audio identification for MP3 popular music. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '10, pages 627–634, New York, NY, USA, 2010. ACM.

    [2] Rassol Raissi. The theory behind MP3, 2002.

    [3] Z.N. Li and M.S. Drew. Fundamentals of Multimedia. Pearson Prentice Hall,2004.

    [4] ISO/IEC. ISO/IEC 11172-3:1993 - Information technology - Coding ofmoving pictures and associated audio for digital storage media at up toabout 1,5 Mbit/s - Part 3: Audio, 1993.

    [5] ISO/IEC. ISO/IEC 13818-3:1995 - Information technology - Generic codingof moving pictures and associated audio information - Part 3: Audio, 1995.

    [6] Rafael C. Gonzalez and Richard E. Woods. Digital Image Processing.Addison-Wesley Longman Publishing Co., Inc., 2001.

    [7] Jaap Haitsma and Ton Kalker. A highly robust audio fingerprinting system. In ISMIR, pages 107–115, 2002.

    [8] Avery L. Wang. An Industrial-Strength Audio Search Algorithm. In ISMIR 2003, 4th International Conference on Music Information Retrieval, pages 7–13, 2003.

    [9] Shumeet Baluja and Michele Covell. Content fingerprinting using wavelets.In Proc. CVMP, 2006.