2012 Fifth International Conference on Emerging Trends in Engineering and Technology (ICETET), Himeji, Japan, November 5-7, 2012


A Hybrid Selection Method of Audio Descriptors for Singer Identification in North Indian Classical Music.

Saurabh Deshmukh
Head, IT Dept,

GHRCEM, Pune, India.

Email: - [email protected] Abstract:

Singer identification is one of the most important applications of music information retrieval. The process begins by identifying suitable audio descriptors and then using the resulting feature vectors as input to a classifier, such as a Gaussian Mixture Model or a Hidden Markov Model, to identify the singer. The process becomes unwieldy if all available audio descriptors are used to build the feature vector; if instead the descriptors are selected with respect to the application, the process becomes comparatively simple. In this paper we propose a hybrid method for selecting the correct audio descriptors for identifying the singer of a North Indian classical music performance. First, only the strong (primary) audio descriptors are applied to the system in a forward pass, and their impact on classification is recorded. Next, in a backward pass, only the top few audio descriptors with the largest impact on singer identification are retained and the rest are eliminated. Finally, reintroducing the less significant audio descriptors from the groups that had the maximum impact increases the rate of correctly identifying the singer. The method substantially reduces the large pool of audio descriptors to a few important ones, which are then fed as input to the final classifiers.

Keywords: Audio descriptors, North Indian Classical Music, GMM, HMM.

1. INTRODUCTION:

An Audio Information Retrieval (AIR) system involves the analysis, extraction, and comparison of various features of a segmented audio piece. Music Information Retrieval (MIR) is its subsystem dealing with singer identification, singing voice separation, musical note detection, pitch detection, melody extraction, timbre identification, melody transcription, and so on. Over the last three decades, MIR technologies and tools have strengthened their algorithms and found various innovative methodologies to apply to music- and information-related fields.

Dr. S.G. Bhirud
Professor, Computer Engineering Department,
VJTI, Mumbai, India
Email: [email protected]

Whitman et al. [1] presented the earliest work on singer identification, in which spectral features identified and computed from audio clips were fed to Artificial Neural Networks (ANN) and Support Vector Machines (SVM) for classification. Singer identification has various applications in multimedia and multimedia-database fields. In the telecommunication and security fields as well, many applications implement various voice recognition methods, for both speech and singing.

Berenzweig et al. [2] implemented a system that tries to improve classification performance by first identifying the audio segments that contain vocals. The system uses a neural network classifier trained and tested on Perceptual Linear Prediction Coefficients (PLPC). Singer identification is then done using Mel-Frequency Cepstral Coefficients (MFCC), extracted as singing voice features and given as input to another neural network classifier.

Though singing is continuous, speech-like audio, the techniques used for speech analysis and synthesis do not carry over unchanged to the singing voice. Unfortunately, there is no robust algorithm that works well on speech and singing voice together. Moreover, the credibility of any algorithm for the analysis and/or synthesis of speech or singing voice depends heavily on the type and attributes of the input, the feature extraction methodology, and the identification or classification technique used.

Audio descriptors are special attributes or characteristic features of the audio segment under consideration. There is a wide variety of these descriptors, and identifying them is essentially the first step towards analysis of an audio sample.

In this paper we focus on and analyze the various ways in which audio descriptors can be classified and used for the particular application of identifying the singing voice of a North Indian classical vocalist. North Indian classical music was chosen because of its complexity in terms of microtonal variations in the singing voice and variations in stylization.

Though any music lies in the frequency range of 20 Hz to 20 kHz, careful thought has to be given to the selection of audio descriptors as input features for North Indian classical music. Audio descriptors can be classified in various ways, and in all these classes not every feature is necessary for identifying a singer. Since selecting all such less relevant attributes would not only increase the computational cost but also reduce the accuracy of correctly identifying a singer, such an exhaustive selection is not advisable.

978-0-7695-4884-5/12 $26.00 © 2012 IEEE. DOI 10.1109/ICETET.2012.62

2. AUDIO DESCRIPTORS:

Audio feature extraction addresses the analysis and extraction of meaningful information from audio signals in order to obtain a compact and expressive description that is machine-processable [3].

It is observed that audio features, or audio descriptors, extracted and derived for a particular task and domain are often tried on other applications and domains. In this paper we restrict ourselves to classifying the most important audio descriptors for the singer identification domain only, especially for North Indian classical vocal music.

3. NORTH INDIAN CLASSICAL MUSIC:

Indian classical music is divided mainly into two parts: North Indian classical music (Hindustani music) and South Indian classical music (Carnatic music). Both are based on a concept called raga, and all their music is bounded within the restrictions on musical notes specified by the raga [4].

The key features that are present in North Indian classical music but not in Western music make it important to analyze carefully the audio descriptors describing a singer performing a raga. Among the many differences between identifying a North Indian classical singer and a Western classical singer, the most important is that the singer uses different voice textures for different types of raga. For example, a singer performing raga Yaman may not use the same voice texture and attributes as when singing raga Todi. We do not mean here the typical descriptors describing the singer; rather, the psychological involvement of the singer, according to the raga type, is hidden in the entire attribute structure, which must be carefully selected.

The reasons behind this change lie in many aspects, such as the type of raga, the kind of composition being sung (Bada Khyal, Chhota Khyal, Lakshangeet, Tarana, etc.), and the Gharana the singer belongs to. Hence a brute-force application of traditional audio descriptors to both kinds of singers (Western and Indian) will not work accurately. First classifying and then selecting the correct audio descriptors thus becomes a necessity for singer identification in North Indian classical music.

4. GENERAL SINGER IDENTIFICATION PROCESS:

Generally, any singer identification system works in the following way. There are usually three modules: an Input Module, a Query Module, and a Retrieval Module. In the Input Module, various audio features of each singer are extracted from audio files and stored in a feature database. When the Query Module asks whether the singer at hand is known or unknown, the Retrieval Module receives the query along with the feature set of the singer to be identified. The Retrieval Module then applies a similarity comparison method, or a set of such methods, and reports whether a match exists in the feature database, as shown in Fig. 1.

Fig 1: General Structure of Singer Identification Process
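The three-module flow above can be sketched in code. This is only an illustrative skeleton, not the authors' system: the feature extractor is a stand-in (mean frame energy and zero-crossing rate), and all names and the distance threshold are assumptions.

```python
import numpy as np

def extract_features(signal):
    """Stand-in feature extractor: mean energy and zero-crossing rate."""
    signal = np.asarray(signal, dtype=float)
    energy = float(np.mean(signal ** 2))
    zcr = float(np.mean(np.abs(np.diff(np.sign(signal))) > 0))
    return np.array([energy, zcr])

class FeatureDatabase:
    """Input Module: extract and store one feature vector per singer."""
    def __init__(self):
        self.store = {}
    def add(self, singer, signal):
        self.store[singer] = extract_features(signal)

def retrieve(db, query_signal, threshold=0.5):
    """Retrieval Module: nearest singer by Euclidean distance, or None."""
    q = extract_features(query_signal)
    best, best_d = None, np.inf
    for singer, feats in db.store.items():
        d = np.linalg.norm(q - feats)
        if d < best_d:
            best, best_d = singer, d
    return best if best_d < threshold else None

# Query Module: enrol two toy "singers" and query with a similar signal
db = FeatureDatabase()
t = np.linspace(0, 1, 8000)
db.add("singer_a", np.sin(2 * np.pi * 220 * t))
db.add("singer_b", np.sin(2 * np.pi * 440 * t))
match = retrieve(db, np.sin(2 * np.pi * 225 * t))
```

A real system would of course replace the toy sine waves with recorded audio and the two-number feature vector with the descriptor set discussed in this paper.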

5. RELATED WORK:

Various multimedia applications require the transcription and indexing of music data. For example, music is often stored by the name of the song or of the main singer. But sometimes it is necessary to search a database of thousands of songs for a particular singer, say Kishor Kumar. It is practically impossible to open and listen to every file to find the singer. With a good set of audio descriptors, however, a song by a given singer can be located without opening and playing the file.

Automatic identification of a music piece is still being perfected. As per [5], although much work has been done on singer identification, it has so far ignored the influence of background music on the characterization of the singing voice; no attempt has been made to remove the interference of background music from the vocal characteristics.


In [6], the basic features used to distinguish singers' voices were based on Warped Linear Prediction (WLP); singer identification was done using a Gaussian Mixture Model (GMM) classifier or Support Vector Machines (SVM). As per [7], a very common speaker recognition method, a Gaussian Mixture Model trained on Mel-Frequency Cepstral Coefficients (MFCC), was applied to identify the singer.

As per [Geoffroy Peeters, 2004], there are four points of view from which to distinguish audio features:

a) the dynamicity of the feature;
b) the time extent (global or instantaneous) of the descriptor;
c) the abstractness of the feature;
d) the extraction process of the feature.

For the CUIDADO project, the author used the taxonomy most suitable for the input provided and the output required. It classifies audio features into the following classes: global temporal features (log attack time, increase, decrease, centroid, and duration); instantaneous temporal features (signal autocorrelation, zero crossing rate, energy features, and spectral features); and global spectral shape description (MFCC, DMFCC, DDMFCC, harmonic features, and perceptual features). The feature extraction module requires a pre-computing stage to provide an adequate signal representation for the later extraction of descriptors.
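As a rough illustration of the instantaneous temporal and spectral features named above, the following NumPy sketch computes frame-wise zero-crossing rate, short-time energy, and spectral centroid. The frame length, hop size, and Hann window are arbitrary choices for the example, not parameters taken from CUIDADO.

```python
import numpy as np

def frame_signal(x, frame_len=1024, hop=512):
    """Slice a 1-D signal into overlapping frames of length frame_len."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def instantaneous_features(x, sr=22050, frame_len=1024, hop=512):
    frames = frame_signal(np.asarray(x, dtype=float), frame_len, hop)
    # Zero-crossing rate: fraction of sign changes per frame
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    # Short-time energy of each frame
    energy = np.mean(frames ** 2, axis=1)
    # Spectral centroid: magnitude-weighted mean frequency
    # (Hann window limits spectral leakage)
    mags = np.abs(np.fft.rfft(frames * np.hanning(frame_len), axis=1))
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sr)
    centroid = (mags @ freqs) / np.maximum(mags.sum(axis=1), 1e-12)
    return zcr, energy, centroid

sr = 22050
t = np.arange(sr) / sr                 # one second of audio
tone = np.sin(2 * np.pi * 1000 * t)    # pure 1 kHz sine
zcr, energy, centroid = instantaneous_features(tone, sr)
```

For the pure 1 kHz tone, the centroid sits near 1000 Hz and the zero-crossing rate near 2000 crossings per second divided by the sample rate, which is a quick sanity check for an extractor like this.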

In [8], by contrast, there are two classification principles. The first is computational, grouping features by how they are computed, such as Wavelet Transform (WT) features or Short-Time Fourier Transform (STFT) based features. The second corresponds to qualities of the audio such as timbre, rhythm, and pitch. This philosophy yields a categorization of audio descriptors based either on the way they are extracted and calculated or on the similar audio qualities they describe. Still, a few descriptors cannot be placed in any group. Thus a single definitive taxonomy of audio descriptors has so far proved elusive, and there is no broad accord on the allocation of features to particular groups.

For example, [9] assigns the Zero Crossing Rate (ZCR) to the group of temporal features, while [10] assigns it to the perceptual feature group. Classifying audio descriptors in a fully general way is therefore hardly possible; it is better done according to the requirements of the particular application, which in our case is singer identification in North Indian classical music.

6. THE PROPOSED METHOD:

Adding irrelevant attributes to a dataset of audio descriptors to be extracted is not practical and often confuses a machine learning system [11]. In the case of a North Indian classical vocalist, an automated classification, and thus an automated feature selection procedure, cannot be used, for various reasons. The most important is that automation usually eliminates the less significant descriptors of a group. For example, if MFCC is selected, its delta or double-delta coefficients may be rejected, which reduces the chances of catching the microtonal attributes of the singer. Special features of Indian classical music that are not present in Western music, such as Shruti, would then be missed, resulting in misclassification of the singer, or perhaps even a separate class for each song of the same singer. Another reason is that an automated system will never consider special attributes of the singer, such as whether the singer is trained in classical music and the type of song he or she is singing; there is no provision to give this logic as input and then calculate the descriptor set automatically.
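The point about delta and double-delta MFCCs can be made concrete. Given a matrix of per-frame cepstral coefficients, the deltas are frame-to-frame differences capturing how the coefficients change over time; this sketch uses a simple first difference (real systems often use a regression window instead), and the toy matrix is illustrative.

```python
import numpy as np

def deltas(coeffs):
    """First-order deltas of an (n_frames, n_coeffs) coefficient matrix,
    using a first difference padded with its last row to keep the shape."""
    d = np.diff(coeffs, axis=0)
    return np.vstack([d, d[-1:]])

# Toy "MFCC" matrix: 5 frames x 3 coefficients rising linearly by 3 per frame
mfcc = np.arange(15, dtype=float).reshape(5, 3)
delta = deltas(mfcc)       # constant slope: every entry is 3
ddelta = deltas(delta)     # slope of a constant: all zeros
```

Dropping `delta` and `ddelta` as "less significant" would discard exactly this trajectory information, which is where slow microtonal movement between frames shows up.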

The backward elimination method adopted by [12] makes sense: the complete set of audio descriptors is first applied to the system, and then, working from output to input, the less significant descriptors are eliminated one by one, each time measuring the effect of the reduction on the final singer identification accuracy. This method also has drawbacks. It would be very costly to compute and store all the possibilities first and then experiment on every descriptor one by one. It is also possible that a few less relevant descriptors from one group, if correlated with strongly relevant descriptors from another group, may produce better results together.
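The backward-elimination loop just described can be sketched as follows. The `accuracy` function is a stand-in for running the full identification system with a given descriptor subset; the descriptor names and toy scoring are assumptions for illustration only.

```python
def backward_eliminate(descriptors, accuracy, min_keep=2):
    """Greedily drop the descriptor whose removal hurts accuracy least,
    stopping when every removal would lower accuracy or only min_keep
    descriptors remain.  `accuracy` maps a frozenset of names to a score."""
    current = set(descriptors)
    score = accuracy(frozenset(current))
    while len(current) > min_keep:
        best_drop, best_score = None, -1.0
        for d in current:
            s = accuracy(frozenset(current - {d}))
            if s > best_score:
                best_drop, best_score = d, s
        if best_score < score:   # every possible removal hurts: stop
            break
        current.remove(best_drop)
        score = best_score
    return current, score

# Toy accuracy: only "mfcc" and "centroid" contribute; the rest add nothing
def toy_accuracy(subset):
    return 0.4 * ("mfcc" in subset) + 0.3 * ("centroid" in subset)

kept, score = backward_eliminate(
    ["mfcc", "centroid", "zcr", "rolloff", "flux"], toy_accuracy)
```

Even this tiny example shows the cost the text mentions: each elimination round re-evaluates the system once per remaining descriptor, which grows quickly with the size of the full descriptor set.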

Thus a hybrid approach is necessary to find the correct reduced set of audio descriptors. On the given data, we first release all the strong, primary descriptors from each group and measure the impact on singer identification. Then, keeping the first few (say, five or six) strongly dominating audio descriptors, we release all the other, less relevant descriptors from the same groups and identify the singer again. This is particularly relevant for North Indian classical music, because its singers differ by very minute variations when performing similar ragas built from almost the same musical notes. Figure 2 represents this approach.


Fig 2: Hybrid method to reduce the number of audio descriptors for singer identification. Selecting the strong descriptors first and then backward-eliminating the non-impacting ones reduces the computational complexity, giving a reduced set of audio descriptors.
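The hybrid procedure of Fig. 2 can be sketched as a forward pass over each group's strong (primary) descriptor, keeping the top few groups, and then re-releasing the weaker descriptors of those groups. The group names, descriptor names, and the `accuracy` stand-in below are illustrative assumptions, not the paper's actual feature set.

```python
def hybrid_select(groups, accuracy, top_k=2):
    """groups: {group_name: [descriptors, strongest first]}.
    Forward pass: score each group's primary descriptor alone.
    Backward pass: keep only the top_k most impactful groups, then
    release their remaining (less significant) descriptors as well."""
    impact = {g: accuracy(frozenset([descs[0]]))
              for g, descs in groups.items()}
    kept_groups = sorted(impact, key=impact.get, reverse=True)[:top_k]
    selected = [d for g in kept_groups for d in groups[g]]
    return selected, accuracy(frozenset(selected))

groups = {
    "cepstral": ["mfcc", "dmfcc", "ddmfcc"],
    "spectral": ["centroid", "rolloff"],
    "temporal": ["zcr"],
}

# Toy accuracy: each descriptor contributes a fixed amount
def toy_accuracy(subset):
    weights = {"mfcc": 0.35, "dmfcc": 0.1, "ddmfcc": 0.05,
               "centroid": 0.2, "rolloff": 0.05, "zcr": 0.1}
    return sum(weights[d] for d in subset)

selected, score = hybrid_select(groups, toy_accuracy, top_k=2)
```

Note how the weaker cepstral descriptors (`dmfcc`, `ddmfcc`) survive here because their group's primary descriptor was strong, which is exactly the behavior plain backward elimination can miss.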

7. EVALUATION METHOD:

To evaluate this hybrid method of selecting audio descriptors, a two-fold database would be preferable: not only should different North Indian classical singers be selected, but the same singer singing several different ragas should also be included. The interference of the tanpura drone can be avoided by primarily selecting singers performing without any accompaniment. Mono-channel audio at a sampling frequency of Fs = 22050 Hz with 16 bits per sample would be sufficient. In the selection of ragas, experiments could be carried out by (a) selecting different ragas from the same Thaat, and (b) selecting different ragas from different Thaats. The audio descriptors retrieved through this hybrid selection method can then be fed as input to Gaussian Mixture Model (GMM) or Hidden Markov Model (HMM) classifiers.
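One common way the selected descriptors could feed a GMM classifier is to train one mixture model per singer on that singer's frame-level feature vectors and attribute a query to the singer whose model gives it the highest average log-likelihood. This is a generic scikit-learn sketch under that assumption, not the authors' implementation; the 2-D Gaussian clouds stand in for real per-frame descriptor vectors.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_singer_models(features_by_singer, n_components=2, seed=0):
    """Fit one GMM per singer on an (n_frames, n_descriptors) array."""
    return {singer: GaussianMixture(n_components=n_components,
                                    random_state=seed).fit(X)
            for singer, X in features_by_singer.items()}

def identify(models, X_query):
    """Return the singer whose GMM assigns the query frames the
    highest average log-likelihood."""
    return max(models, key=lambda s: models[s].score(X_query))

# Toy 2-D "descriptor" clouds standing in for real audio features
rng = np.random.default_rng(0)
train = {
    "singer_a": rng.normal(loc=[0.0, 0.0], scale=0.3, size=(200, 2)),
    "singer_b": rng.normal(loc=[3.0, 3.0], scale=0.3, size=(200, 2)),
}
models = train_singer_models(train)
query = rng.normal(loc=[0.1, -0.1], scale=0.3, size=(50, 2))
who = identify(models, query)
```

An HMM-based variant would replace the per-singer GMM with a per-singer HMM scored over the frame sequence, keeping the same train-then-argmax structure.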

8. CONCLUSION:

Audio descriptor selection is a crucial step in identifying a singer from a given database of singers. Just as the quality of the input is important for correctly identifying a classical singer, so is using the correct set of audio descriptors. To classify these descriptors, special attention must be paid to the kind of input given and the kind of final classifier used. For identifying a singer in North Indian classical music, the traditional audio descriptor selection methods do not consider the special stylizations of this type of music. With the proposed hybrid selection method, the descriptor set shrinks drastically: first selecting and releasing the strong descriptors, then finding the most important descriptor classes, backward-eliminating the less important ones, and finally releasing the less relevant audio descriptors from the important groups again leads to successful singer identification in North Indian classical music.

9. REFERENCES:

[1] B. Whitman, G. Flake, and S. Lawrence, "Artist detection in music with Minnowmatch," in Proc. 2001 IEEE Workshop on Neural Networks for Signal Processing, Falmouth, MA, 2001, pp. 559-568.

[2] A. Berenzweig, D. Ellis, and S. Lawrence, "Using voice segments to improve artist classification of music," in AES 22nd International Conference, Espoo, Finland, 2002.

[3] D. Mitrovic, M. Zeppelzauer, and C. Breiteneder, "Features for content-based audio retrieval," Advances in Computers, vol. 78, pp. 71-150, 2010.

[4] Pandit Bhatkhande, Hindustani Sangeet Paddhati, Bombay, 1937.

[5] W.H. Tsai, "Automatic singer recognition of popular music recordings via estimation and modeling of solo vocal signals," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 1, January 2006.

[6] Y.E. Kim and B. Whitman, "Singer identification in popular music recordings using voice coding features," in Proc. 3rd International Conference on Music Information Retrieval, Paris, France, 2002, pp. 164-169.

[7] T. Zhang, "Automatic singer identification," in Proc. IEEE International Conference on Multimedia and Expo, Baltimore, MD, 2003.

[8] G. Tzanetakis, "Manipulation, Analysis and Retrieval Systems for Audio Signals," Ph.D. thesis.

[9] S. Esmaili, S. Krishnan, and K. Raahemifar, "Content-based audio classification and retrieval using joint time-frequency analysis," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 5, pp. 665-668, Montreal, Canada, May 2004.

[10] L. Lu, H.J. Zhang, and S.Z. Li, "Content-based audio classification and segmentation by using support vector machines," Multimedia Systems, vol. 8, no. 6, pp. 482-492, Apr. 2003.

[11] I.H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed., Morgan Kaufmann, San Francisco.

[12] M. Rocamora and P. Herrera, "Comparing audio descriptors for singing voice detection in music audio files."
