
Audio Music Monitoring: Analyzing Current Techniques for Song Recognition and Identification

E.D. Nishan W. Senevirathna and Lakshman Jayaratne

Abstract—When people are attached to or interested in something, they usually try to interact with it frequently. Music has been attached to people since the day they were born. As music repositories grow, people face many challenges, such as finding a song quickly, categorizing and organizing collections, and listening to a song again whenever they want. Because of this, people tend to look for electronic solutions. To index music, most researchers use content-based information retrieval, since content-based classification needs no information other than the audio features embedded in the signal itself. It is also the most suitable way to search for music when the user does not know the metadata attached to it, such as the author of the song. The most valuable application of this kind of audio recognition is copyright infringement detection. Throughout this survey we present approaches proposed by various researchers to detect and recognize music using content-based mechanisms, and we conclude by analyzing the current status of the field.

Keywords—audio fingerprint; feature extraction; wavelets; broadcast monitoring; audio classification; audio identification.

I. INTRODUCTION

Music repositories around the world are growing exponentially, and new technologies let new artists enter the field easily. Once we hear a new song, we cannot easily find it again if we do not know its metadata, such as the author or singer. The most common method of accessing music is still through textual metadata, but this no longer functions properly against huge music collections. In the field of audio music recognition, the following are the key considerations:

  • Can we find an unknown song using a small part of it, or by humming the melody?

  • Can we organize and index songs without metadata such as the singer of the song?

  • Can we detect copyright infringement, for example after a song is broadcast on a radio channel?

  • Can we identify a cover song when multiple versions exist?

  • Can we obtain a statistical report of the songs broadcast on a radio channel without a manual monitoring process?

The above considerations motivate researchers to find proper solutions to these challenges. Many ideas have been proposed, and some have been implemented; Shazam is one example. However, this is still a challenging research area, since there is no optimal solution. The problem becomes even more complex when:

  • the audio signal is altered by noise;

  • the audio signal is polluted by added audio objects, such as advertisements in radio broadcasting;

  • multiple versions of the song exist;

  • only a small part of the song is available.

In any of the above situations the human auditory system can still recognize the music, but providing an automated electronic solution is a very challenging task: the similarity between the original music and the query may be very small, and the similar features may be impossible to model mathematically. This means researchers also need to consider perceptual features in order to provide a proper solution. Feature extraction can be considered the heart of any of these approaches, since accuracy and everything else depend on how the features are extracted.

The rest of this survey provides a broad overview and comparison of the proposed feature extraction methods, search algorithms, and overall solution architectures.


GSTF Journal on Computing (JoC) Vol.4 No.3, October 2015. DOI: 10.5176/2251-3043_4.3.328; DOI 10.7603/s40601-014-0015-7. Received 20 Jul 2015, accepted 13 Aug 2015. © The Author(s) 2015. This article is published with open access by the GSTF.


II. CLASSIFICATIONS (RECOGNITION) VS. IDENTIFICATIONS

What is the difference between audio recognition (classification) and identification? In audio classification, an audio object is classified into pre-defined sets such as song, advertisement, vocals, etc., but it is not identified further. Ultimately we know that the object is a song or an advertisement, but we do not know which song it is. Audio classification is less complex than identification. Most of the time the two are combined in order to get a better result. For example, in an audio song recognition system we can first extract only the songs from a collection of other audio objects using an audio classifier, and feed the output into the audio identification system. That kind of approach gives better results by narrowing down the search space. Many audio classification approaches have been proposed; some of them are discussed in the next subsection.

A. Audio classifications

1) Overview

There are a considerable number of real-world applications for audio classification. For example, it is very helpful to be able to search sound effects automatically in a very large audio database during film post-processing, covering sounds of explosions, windstorms, earthquakes, animals, and so on [1]. Audio content analysis and classification is also useful for audio-assisted video classification: for example, all videos of gunfight scenes should include the sound of shooting and/or explosions, whereas the image content may vary significantly from one scene to another.

When classifying audio content into different sets, different classes have to be considered. Most researchers have started by classifying speech and music, but the classes depend on the situation. For example, "music", "speech", and "others" can be considered for parsing news stories, whereas a recording can be classified into "speech", "laughter", "silence", and "non-speech" for the purpose of segmenting discussion recordings in meetings [1]. In any of these cases we have to extract some sort of audio features. This is the challenging part, and it is where past research diverges. We can, however, consider feature extraction for audio classification and feature extraction for audio identification separately, since most of the time these two cases use disjoint feature sets [7].

2) Feature extraction for audio classification

Most of the time the output of audio classification is the input of audio identification. This reduces the search space, speeds up the process, and helps retrieve better results. In most of the literature, audio classification is broken down into further steps. In [1] two steps are used. In the first stage, the audio signal is segmented and classified into basic types, including speech, music, several types of environmental sounds, and silence; the authors call this the coarse-level classification. In the second stage, further classification is conducted within each basic type. Speech is differentiated into the voices of men, women, and children, as well as speech with a music background and so on. Music is classified according to the instruments or type (for example, classics, blues, jazz, rock and roll, music with singing, and plain song). Environmental sounds are classified into finer classes such as applause, bell ring, footstep, windstorm, laughter, birds' cries, and so on. The authors call this the fine-level classification. The overall idea is to reduce the search space step by step in order to get better results. We can also use a feature extraction mechanism appropriate to each fine-level class based on its basic type: due to differences in the origin of the three basic types of audio, i.e. speech, music, and environmental sounds, different approaches can be taken in their fine classification. Most researchers have used low-level (physical, acoustic) features such as the spectral centroid or Mel-frequency coefficients, but end users may prefer to interact at a higher semantic level [2]; for example, they may want to find a dog-barking sound rather than "environmental sounds". However, low-level features are much easier to extract using signal processing than high-level (perceptual) features.
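The coarse-to-fine idea can be summarized in a few lines. The following minimal sketch is an illustrative structure, not the exact system of [1]; the classifier callables and class names are hypothetical placeholders:

```python
from typing import Callable, Dict, Optional

def two_stage_classify(clip,
                       coarse: Callable,
                       fine_by_type: Dict[str, Callable]):
    """Coarse-level label first, then the matching fine-level classifier."""
    basic = coarse(clip)                  # e.g. 'speech', 'music', 'env', 'silence'
    fine: Optional[Callable] = fine_by_type.get(basic)
    return (basic, fine(clip)) if fine else (basic, None)

# Usage (hypothetical classifiers):
# label = two_stage_classify(clip, coarse_classifier,
#                            {'speech': speaker_type, 'music': genre,
#                             'env': sound_class})
```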

Most researchers have used the Hidden Markov Model (HMM) and the Gaussian Mixture Model (GMM) as the pattern recognition tool. These are widely used, very powerful statistical tools in pattern recognition, and to use them we have to extract distinctive features. Audio features can be grouped into two or more sets; most researchers group them into two: physical (or mathematical) features and perceptual features. Physical features are extracted directly from the audio wave, such as the energy of the wave, frequency, peaks, average zero crossings, and so on. These features cannot be identified by the human auditory system. Perceptual features, by contrast, are the features humans can understand, such as loudness, pitch, timbre, and rhythm. Perceptual features cannot easily be modelled by mathematical functions, but they are very important audio features, since humans use them to differentiate audio.
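To make the GMM approach concrete, here is a minimal sketch that trains one Gaussian mixture per audio class on MFCC frames and labels a clip by the highest average log-likelihood. This is a generic illustration, not the setup of any specific paper surveyed here; the file lists, class names, and MFCC settings are assumptions.

```python
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def extract_mfcc(path, sr=16000, n_mfcc=13):
    """Load an audio file and return its MFCC frames as (n_frames, n_mfcc)."""
    y, _ = librosa.load(path, sr=sr, mono=True)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

def train(class_to_paths):
    """Fit one diagonal-covariance GMM per class on pooled MFCC frames."""
    models = {}
    for label, paths in class_to_paths.items():
        frames = np.vstack([extract_mfcc(p) for p in paths])
        models[label] = GaussianMixture(n_components=8,
                                        covariance_type='diag').fit(frames)
    return models

def classify(models, path):
    """Return the class whose model gives the highest mean log-likelihood."""
    frames = extract_mfcc(path)
    return max(models, key=lambda label: models[label].score(frames))

# Usage (hypothetical file lists):
# models = train({'speech': speech_files, 'music': music_files})
# print(classify(models, 'unknown_clip.wav'))
```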

Sometimes, however, audio features are organized into hierarchical groups with similar characteristics [12]. The authors of [12] divide all audio features into six main categories; refer to Figure 1.


  • 7/25/2019 Audio Music Monitoring- Analyzing Current Techniques for Song Recognition and Identification

    3/12

Figure 1. High-level audio feature classification [12].

However, no one can define an audio feature and its category exactly, since there is no broad consensus on the allocation of features to particular groups. The same feature may be classified into two different groups by two different researchers, depending on the viewpoints of the authors. The features defined in Figure 1 can be further classified into several groups according to the structure of each feature.

Considering the structure of temporal domain features, [12] classifies them into three sub-groups: amplitude-based, power-based, and zero-crossing-based features. Each of these features relates to one or more physical properties of the wave; refer to Figure 2.

Figure 2. The organization of features in the temporal domain [12].

Here, some researchers have defined the zero-crossing rate (ZCR) as a physical feature. Frequency domain features are very important, and most researchers consider only the frequency domain. The frequency domain feature classification from [12] is shown in Figure 3.

Some researchers have further subdivided the other four main feature categories as well, but those subdivisions are less important. Next we look at the main characteristics of the major features.

Figure 3. The organization of features in the frequency domain [12].

a) Temporal (raw) domain features

Most of the time we cannot extract features without transforming the native audio signal, but several features can be extracted from the native signal directly; these are known as temporal features. Since the native signal does not need to be transformed, this is a very low-cost feature extraction methodology. However, temporal features alone cannot uniquely identify audio music.

The zero-crossing rate is a main temporal domain feature. It is a very helpful, low-cost feature that is often used in audio classification. It is usually defined as the number of zero crossings in the time domain within one second, and it is a rough estimate of the dominant frequency and the spectral centroid [12]. Sometimes the ZCR is obtained by altering the audio signal a bit: frequency information and correspondingly scaled intensity sub-bands are extracted from the time-domain zero crossings. This gives a more stable measurement and is very helpful in noisy environments, since noise is always spread around the zero axis but does not create a considerable number of peaks, so a peak-related zero-crossing rate remains almost unchanged.
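A minimal sketch of the basic (unaltered) ZCR, assuming the samples arrive as a NumPy array with a known sampling rate:

```python
import numpy as np

def zero_crossing_rate(samples, sample_rate):
    """Zero crossings per second: count sign changes between adjacent samples."""
    signs = np.sign(samples)
    signs[signs == 0] = 1                 # treat exact zeros as positive
    crossings = np.count_nonzero(signs[1:] != signs[:-1])
    return crossings * sample_rate / len(samples)
```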

Amplitude-based features are another example of temporal domain features, obtained by computing the amplitude of the audio signal directly. They are again a good measurement, but they change even when the audio signal is altered slightly by noise-like unwanted effects.

Power measurement is also a raw domain feature, almost the same as the amplitude-based features: the power, or energy, of a signal is the square of the amplitude represented by the waveform. Volume is a well-known power measurement feature, widely used in silence detection and speech/music segmentation.
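A short sketch of frame-wise volume as root-mean-square power, assuming a NumPy sample array cut into fixed-length, non-overlapping frames:

```python
import numpy as np

def frame_volume(samples, frame_len=256):
    """RMS volume of each non-overlapping frame of `frame_len` samples."""
    n_frames = len(samples) // frame_len
    frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len)
    return np.sqrt(np.mean(frames ** 2, axis=1))
```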



b) Physical features

Most audio features are obtained from the frequency domain, since almost all features live in this domain. Before extracting frequency domain features we have to transform the base signal into another representation. Several methods can be used; the most popular are the Fourier transform and autocorrelation, and other popular methods are the cosine transform, the wavelet transform, and the constant-Q transform [12]. Frequency domain features can be categorized into two major classes, physical features and perceptual features. Physical features are defined using physical characteristics of the audio signal that carry no semantic meaning. Next we discuss the main physical features, and then the perceptual features.

Auto-regression-based features: in statistics and signal processing, an autoregressive (AR) model is a representation of a type of random process; as such, it describes certain time-varying processes in nature, economics, etc. [18]. It is a widely used standard technique for speech/music discrimination, and it can be used to extract basic parameters of a speech signal, such as formant frequencies and the vocal tract transfer function [18]. Sometimes this feature group is divided further into two sub-groups, linear predictive coding (LPC) and line spectral frequencies (LSF), but we do not discuss these sub-groups in detail here.
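To illustrate the LPC sub-group, here is a minimal Levinson-Durbin sketch that fits AR coefficients to one frame of samples. This is the generic textbook recursion under our own assumptions, not the formulation of any surveyed paper:

```python
import numpy as np

def lpc_coefficients(frame, order):
    """Fit AR coefficients a[0..order] (a[0] = 1) via Levinson-Durbin."""
    # Autocorrelation lags r[0..order] of the frame.
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:])
                  for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0], err = 1.0, r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                    # reflection coefficient
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k                # remaining prediction error
    return a
```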

Short-time Fourier transform (STFT) based features: another widely used family of audio features based on the audio spectrum. The STFT can be used to obtain characteristics of both the frequency component and the phase component. Several features are derived from it, such as the Shannon entropy, Rényi entropy, spectral centroid, spectral bandwidth, spectral flatness measure, spectral crest factor, and Mel-frequency cepstral coefficients [15].
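Two of these, the spectral centroid and spectral bandwidth of a single windowed frame, can be sketched with NumPy's real FFT; the definitions below are the common magnitude-weighted ones, assumed rather than taken from [15]:

```python
import numpy as np

def spectral_centroid_bandwidth(frame, sample_rate):
    """Magnitude-weighted mean frequency and its standard deviation."""
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    centroid = np.sum(freqs * spectrum) / np.sum(spectrum)
    spread = np.sum(((freqs - centroid) ** 2) * spectrum) / np.sum(spectrum)
    return centroid, np.sqrt(spread)
```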

Short-time energy function: the energy of an audio signal is measured by its amplitude; representing the amplitude variation over time gives the energy function of the signal. For speech signals it is a basis for distinguishing voiced speech components from unvoiced ones, as the energy function values for unvoiced components are significantly smaller than those of the voiced components [1].

Short-time average zero-crossing rate (ZCR): another measurement for separating voiced speech components from unvoiced ones; usually voiced components have a much smaller ZCR than unvoiced components [1].

Short-time fundamental frequency (FuF): this feature captures harmonic properties. Most musical instrument sounds are harmonic, and some sounds are a mixture of harmonic and non-harmonic components. This feature, too, can be used to classify audio objects [1].
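A crude fundamental-frequency estimate can be obtained from the lag of the autocorrelation peak; a minimal sketch of that generic method (the search range in Hz is an assumption):

```python
import numpy as np

def fundamental_frequency(frame, sample_rate, fmin=50.0, fmax=1000.0):
    """Estimate F0 as the autocorrelation-peak lag within [fmin, fmax]."""
    corr = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    lo = int(sample_rate / fmax)                    # shortest lag considered
    hi = min(int(sample_rate / fmin), len(corr) - 1)
    lag = lo + np.argmax(corr[lo:hi])
    return sample_rate / lag
```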

Spectral flatness measure (SFM): an estimate of the tone-like or noise-like quality of a band in the spectrum [1], widely used for audio classification.
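The SFM is commonly computed as the ratio of the geometric mean to the arithmetic mean of the power spectrum within a band; a small sketch under that assumed definition:

```python
import numpy as np

def spectral_flatness(power_band, eps=1e-12):
    """Geometric/arithmetic mean ratio: near 1 = noise-like, near 0 = tonal."""
    geometric = np.exp(np.mean(np.log(power_band + eps)))
    return geometric / (np.mean(power_band) + eps)
```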

There are other widely used physical features as well, such as the Mel-frequency cepstrum coefficients (MFCC). Papaodysseus et al. (2001) presented the band representative vectors, an ordered list of indexes of bands with prominent tones (i.e. peaks with significant amplitude). The energy of each band is used by Kimura et al. (2001). Normalized spectral sub-band centroids are proposed by Seo et al. (2005). Haitsma et al. use the energies of 33 Bark-scaled bands to obtain their hash string, which is the sign of the energy band differences (along both the time and the frequency axes), and so on.
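That last scheme can be sketched directly: given a matrix of per-frame band energies (in the original, 33 Bark-scaled bands per frame, giving 32 bits), each hash bit is the sign of a difference of differences along frequency and time. The band energies themselves are assumed to be computed elsewhere:

```python
import numpy as np

def hash_bits(band_energies):
    """band_energies: (n_frames, n_bands) -> bits of shape (n_frames-1, n_bands-1).

    Bit (n, m) is 1 when the frequency-axis energy difference (band m vs m+1)
    increases from frame n-1 to frame n, i.e. the sign of a double difference.
    """
    freq_diff = band_energies[:, :-1] - band_energies[:, 1:]  # along frequency
    time_diff = freq_diff[1:, :] - freq_diff[:-1, :]          # along time
    return (time_diff > 0).astype(np.uint8)
```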

Most of the time, silent audio frames are identified early and are not passed on for further processing. There are several approaches to identifying or defining a silent frame; some researchers have used the ZCR property. In [4] silent frames are defined roughly as follows.

Before feature extraction, the audio signal (8-bit ISDN μ-law encoding) is pre-emphasized with parameter 0.96 and then divided into frames. Given the sampling frequency of 8000 Hz, the frames are 256 samples (32 ms) each, with 25% (64 samples, or 8 ms) overlap between adjacent frames. A frame is Hamming-windowed by w_i = 0.54 - 0.46 · cos(2πi/256), and it is marked as a silent frame if its energy, the sum of the squared windowed samples Σ_i (w_i · s_i)², falls below a threshold.
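Putting the pipeline of [4] together, a minimal sketch of the pre-emphasis, framing, windowing, and energy-based silence check; the exact pre-emphasis form (s[n] - 0.96·s[n-1]) and the threshold value are assumptions rather than details confirmed by the source:

```python
import numpy as np

SR = 8000                  # sampling frequency used in [4]
FRAME = 256                # 32 ms frames at 8 kHz
HOP = FRAME - 64           # 25% (64-sample) overlap between adjacent frames

def frames_with_silence_flags(samples, threshold=1e-4):
    """Pre-emphasize, frame, Hamming-window, and flag low-energy frames."""
    emphasized = np.append(samples[0], samples[1:] - 0.96 * samples[:-1])
    window = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(FRAME) / FRAME)
    frames, silent = [], []
    for start in range(0, len(emphasized) - FRAME + 1, HOP):
        frame = emphasized[start:start + FRAME] * window
        frames.append(frame)
        silent.append(np.sum(frame ** 2) < threshold)  # assumed energy criterion
    return np.array(frames), np.array(silent)
```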